diff --git a/BOOTSTRAP.md b/BOOTSTRAP.md index 8d89927..7482b43 100644 --- a/BOOTSTRAP.md +++ b/BOOTSTRAP.md @@ -4,7 +4,7 @@ - [ ] `kubeconfig` file placed at `~/infra/yandex-prod/kubeconfig` - [ ] `kubectl` context pointing to the new `yc-prod` cluster -- [ ] Domain `prod.t01tt.tech` DNS managed (can be updated later in Phase 5) +- [ ] Domain `prod.t01tt.tech` DNS managed (can be updated later in Phase 5) - [ ] `git` and `helm` installed locally --- @@ -48,11 +48,11 @@ kubectl port-forward svc/gitea 3000:3000 -n gitea 2. Fill out the install form: - Database: **SQLite3** (default) - Site Title: **Gitea** - - Domain: **git.prod.t01tt.tech** + - Domain: **git.prod.t01tt.tech** - Application URL: **https://git.prod.t01tt.tech** - Create admin account (username/password/email — save these) 3. Click "Install Gitea" -4. Create a new repository: **`yandex-prod`** (must be **public**, owned by admin) +4. Create a new repository: **`master`** (must be **public**, owned by admin) 5. Close the port-forward (Ctrl+C) --- @@ -63,13 +63,13 @@ kubectl port-forward svc/gitea 3000:3000 -n gitea cd ~/infra/yandex-prod git init -git remote add origin http://localhost:3000//yandex-prod.git +git remote add origin http://localhost:3000/admin/master.git # Or, once Gitea ingress works later, use: -# git remote add origin https://git.prod.t01tt.tech//yandex-prod.git +# git remote add origin https://git.prod.t01tt.tech/admin/master.git git add -A git commit -m "initial bootstrap: infrastructure manifests" -git push -u origin main +git push -u origin master # Enter Gitea admin credentials when prompted ``` @@ -93,7 +93,7 @@ sleep 2 ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d) argocd login localhost:8080 --username admin --password "${ARGOCD_PASS}" --insecure -argocd repo add http://gitea.gitea.svc.cluster.local:3000//yandex-prod.git \ +argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \ --name 
yandex-prod \ --type git ``` @@ -242,7 +242,7 @@ argocd repo list If not connected, re-add via ArgoCD CLI: ```bash -argocd repo add http://gitea.gitea.svc.cluster.local:3000//yandex-prod.git \ +argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \ --name yandex-prod \ --type git ``` @@ -296,7 +296,7 @@ kubectl logs -n cnpg-system deploy/cnpg-controller-manager Common issue: pods can't schedule due to `podAntiAffinityType: required`. Ensure all 3 nodes exist and PVCs can bind. ### Gitea UI shows wrong URL after first login -Gitea caches the ROOT_URL from the `deployment.yaml` env vars. If you change the domain, update: +Gitea caches the ROOT_URL from the `deployment.yaml` env vars. If you change the domain, update: ```bash kubectl set env deploy/gitea -n gitea \ GITEA__server__DOMAIN=git.prod.t01tt.tech \ diff --git a/PLAN.md b/PLAN.md new file mode 100644 index 0000000..8f248e7 --- /dev/null +++ b/PLAN.md @@ -0,0 +1,527 @@ +# Yandex Cloud Production Matrix Cluster — Migration Plan + +## 0. 
Current State + +### Existing Clusters + +| Cluster | Type | Nodes | Purpose | +|---------|------|-------|---------| +| k3s homelab (`~/infra/k3s`) | self-managed k3s | 2 (oracle, sentinel) | GitOps-based, hosts test `mrt0rtikize.ru` ESS Community | +| Yandex Cloud prod (`kubectl` default) | managed k8s v1.32.1 | 3x 2CPU/6GB | Manual Helm, hosts 3 production Matrix instances | + +### Current Prod Load (Yandex Cloud) + +| Node | CPU | RAM | What runs there | +|------|-----|-----|-----------------| +| `ofem` | 6% | 32% | element-web x3, element-call x2, well-known x3, alloy, cert-manager-cainjector | +| `uzig` | 19% | **94%** | Grafana(682Mi) + Prometheus(499Mi) + VictoriaMetrics(327Mi) + Synapse-t0rt1k(582Mi) + 3x PostgreSQL + Alertmanager + 3x Redis + 3x LiveKit SFU + Traefik | +| `efur` | 10% | 51% | Synapse-roglog(134Mi) + Synapse-uretra(137Mi) + Loki(304Mi) + 2x mas-postgresql + cert-manager + coredns | + +Key issue: **uzig at 94% RAM** — monitoring stack competes with busiest Synapse on same node. 
+ +### Prod Matrix Instances + +| | `matrix-t0rt1k` | `matrix-roglog` | `matrix-uretra` | +|---|---|---|---| +| Domain | `t0rt1k.tech` | `roglog.space` | `uretra.space` | +| Age | 87d | 82d | 82d | +| Synapse RAM | 582Mi | 134Mi | 137Mi | +| Storage | 10+8+1 Gi (HDD) | 10+8+1 Gi (HDD) | 10+8+1 Gi (HDD) | +| Helm chart | `matrix-2.9.17` (NOT ESS) | same | same | +| MAS migration | Failed (`syn2mas` job) | OK | OK | +| Components per instance | Synapse, Element Web, Element Call, LiveKit SFU + Redis + JWT, MAS, 2x PostgreSQL, well-known | same | same | +| Ingress | Traefik, `158.160.164.95` | same LB | same LB | +| TLS | cert-manager + `letsencrypt-production` | same | same | + +### Test Instance (k3s homelab) + +| Property | Value | +|----------|-------| +| Domain | `mrt0rtikize.ru` | +| Chart | ESS Community (`oci://ghcr.io/element-hq/ess-helm/matrix-stack` v26.6.1) | +| Namespace | `matrix-mrt0rtikize` | +| Components | Synapse, MAS, Element Web, Element Admin, Matrix RTC, Hookshot, HAProxy | +| PostgreSQL | Built-in (chart-managed) | +| Storage | Longhorn | +| GitOps | ArgoCD, repo at `gitea.mrt0rtikize.ru` | + +--- + +## 1. 
New Cluster Architecture + +### 1.1 Platform + +- Yandex Cloud Managed Kubernetes (new cluster) +- Ability to add external nodes in future (supported experimentally, not needed now) +- Managed control plane, self-managed worker nodes + +### 1.2 GitOps Foundation + +| Component | How | Notes | +|-----------|-----|-------| +| Gitea | `kubectl apply` from `bootstrap/gitea/` | Self-hosted git server, deployed first (before ArgoCD) | +| ArgoCD | `helm install` via `bootstrap/argocd/install.sh` | Installed with `--insecure` (same as k3s), points to Gitea | +| Root App | `argocd/app-of-apps.yaml` | Scans `argocd/apps/*.yaml` recursively, deploys everything else | + +### 1.3 Infrastructure Components + +| Component | Type | Values | Notes | +|-----------|------|--------|-------| +| cert-manager | ArgoCD Helm app | `installCRDs: true`, `ClusterIssuer: letsencrypt-production` | TLS for all ingresses | +| CloudNativePG Operator | ArgoCD Helm app | `cluster.instances: 3`, `storageClass: yc-network-ssd`, `size: 50Gi`, `podAntiAffinityType: required` | HA PostgreSQL for all Matrix instances | +| Prometheus Stack | ArgoCD Helm app | Ported from `k3s/manifests/metrics/kube-prometheus-stack-values.yaml`, remoteWrite to VictoriaMetrics | Monitoring + Alertmanager | +| VictoriaMetrics | ArgoCD Helm app | Ported from `k3s/manifests/metrics/victoria-metrics-single-values.yaml` | Long-term metrics storage | +| Loki | ArgoCD Helm app | — | Log aggregation | +| Grafana Alloy | ArgoCD Helm app | — | Agent for metrics/logs forwarding | +| Traefik | Managed by Yandex (or DaemonSet) | Cluster's built-in ingress controller | LB external IP provisioned by Yandex Cloud | + +### 1.4 ESS Instances + +Each Matrix homeserver is a separate ArgoCD Application referencing the ESS chart: + +``` +argocd/apps/ +├── matrix-mrt0rtikize.yaml (first, test migration) +├── matrix-t0rt1k.yaml (production, after procedure proven) +├── matrix-roglog.yaml +└── matrix-uretra.yaml +``` + +Each uses the **shared 
CloudNativePG cluster** (not built-in PostgreSQL). + +### 1.5 Directory Structure + +``` +~/infra/yandex-prod/ +├── bootstrap/ +│ ├── gitea/ +│ │ ├── namespace.yaml +│ │ ├── deployment.yaml +│ │ ├── service.yaml +│ │ ├── ingress.yaml +│ │ └── pvc.yaml +│ └── argocd/ +│ ├── install.sh +│ └── values.yaml +├── argocd/ +│ ├── app-of-apps.yaml +│ └── apps/ +│ ├── cert-manager.yaml +│ ├── cnpg-operator.yaml +│ ├── cnpg-cluster.yaml +│ ├── monitoring.yaml +│ ├── loki.yaml +│ ├── matrix-mrt0rtikize.yaml +│ ├── matrix-t0rt1k.yaml +│ ├── matrix-roglog.yaml +│ └── matrix-uretra.yaml +└── manifests/ + ├── cnpg/ + │ ├── namespace.yaml + │ ├── databases.yaml # Database CRs per homeserver + │ └── secrets.yaml # PG credentials per homeserver (or generated) + └── matrix-mrt0rtikize/ + └── (supplemental manifests, if any) +``` + +### 1.6 CloudNativePG Architecture + +``` +CloudNativePG Cluster "shared-pg" (namespace: cnpg, 3 instances) +├── Instance 1 (node A) +├── Instance 2 (node B) ← anti-affinity ensures spread +└── Instance 3 (node C) + +Databases (one pair per homeserver): +├── synapse_mrt0rtikize (owner: synapse_mrt0rtikize) +├── mas_mrt0rtikize (owner: mas_mrt0rtikize) +├── synapse_t0rt1k (owner: synapse_t0rt1k) +├── mas_t0rt1k (owner: mas_t0rt1k) +├── synapse_roglog (owner: synapse_roglog) +├── mas_roglog (owner: mas_roglog) +├── synapse_uretra (owner: synapse_uretra) +└── mas_uretra (owner: mas_uretra) + +Service: shared-pg-rw.cnpg.svc.cluster.local:5432 (primary, read-write) + shared-pg-ro.cnpg.svc.cluster.local:5432 (replicas, read-only) +``` + +Each homeserver has a dedicated PostgreSQL role and database within the same cluster. +Databases and roles are created via CNPG `Database` CRs (see `manifests/cnpg/databases.yaml`). + +Credentials are stored in per-homeserver Kubernetes Secrets (`manifests/cnpg/secrets.yaml`), +referenced by ESS via `existingSecret` / `existingSecretKey`. 
+ +### 1.7 ESS Configuration (per instance) + +```yaml +# Shared across all instances: +serverName: +certManager: + clusterIssuer: letsencrypt-production +ingress: + className: traefik # or whatever Yandex provides + +# PostgreSQL — external, shared CNPG cluster: +postgres: + enabled: false + +synapse: + postgres: + host: shared-pg-rw.cnpg.svc.cluster.local + database: synapse_ + user: synapse_ + existingSecret: -pg-creds + existingSecretKey: synapse + media: + storage: + size: 10Gi # adjustable per instance load + storageClassName: yc-network-hdd # media is fine on HDD + ingress: + host: matrix. + +matrixAuthenticationService: + postgres: + host: shared-pg-rw.cnpg.svc.cluster.local + database: mas_ + user: mas_ + existingSecret: -pg-creds + existingSecretKey: mas + ingress: + host: account. + +elementWeb: + ingress: + host: chat. + +elementAdmin: + ingress: + host: admin. + +matrixRTC: + ingress: + host: mrtc. + +hookshot: + enabled: true + # ingress host if needed for webhooks +``` + +### 1.8 Boot Order + +``` +Step 1: kubectl apply bootstrap/gitea/ +Step 2: helm install argocd (bootstrap/argocd/install.sh) +Step 3: git push manifests/ + argocd/ to Gitea +Step 4: kubectl apply argocd/app-of-apps.yaml +Step 5: ArgoCD syncs cert-manager, CNPG operator, CNPG cluster, databases, monitoring, ESS +``` + +--- + +## 2. Migration Procedure: `mrt0rtikize.ru` (test instance) + +> Perform on the test instance first to validate the procedure before touching production. + +### 2.1 Backup (on k3s homelab) + +```bash +NS=matrix-mrt0rtikize + +# 1. Stop Synapse + MAS +kubectl scale sts -l "app.kubernetes.io/component=matrix-server" -n $NS --replicas=0 +kubectl scale deploy -l "app.kubernetes.io/component=matrix-authentication" -n $NS --replicas=0 + +# 2. Dump PostgreSQL (built-in PG, release name is "ess" but pods are named matrix-mrt0rtikize-*) +# The PG pod is named based on the ESS release. 
Find it: +PG_POD=$(kubectl get pods -n $NS -l "app.kubernetes.io/name=postgres" -o name | head -1) +kubectl exec -n $NS $PG_POD -- pg_dumpall -U postgres > dump-mrt0rtikize.sql + +# 3. Backup generated secrets (CRITICAL — contains signing key, MAS encryption key) +kubectl get secret matrix-mrt0rtikize-generated -n $NS -o yaml > secrets-mrt0rtikize.yaml + +# 4. Backup deployment markers +kubectl get configmap \ + -l "app.kubernetes.io/managed-by=matrix-tools-deployment-markers" \ + -n $NS -o yaml > markers-mrt0rtikize.yaml + +# 5. Backup media files +# Find PV path from the node: +kubectl get pv -n $NS -o yaml | grep -A5 "synapse-media" +# Copy from the reported path on the node to a safe location + +# 6. Save ESS values (from the ArgoCD Application or helm get values) +kubectl get application matrix-mrt0rtikize -n argocd -o yaml > app-mrt0rtikize.yaml +``` + +**Critical data that MUST be preserved:** + +| Data | Location | Why | +|------|----------|-----| +| `SYNAPSE_SIGNING_KEY` | `matrix-mrt0rtikize-generated` secret | Federation identity — all other servers know this key. Lose it = all rooms break. | +| `MAS_ENCRYPTION_SECRET` | same secret | User session encryption. Lose it = all users must re-login. | +| `MAS_RSA_PRIVATE_KEY` | same secret | OIDC signing. Lose it = re-auth needed. | +| `SYNAPSE_MACAROON` | same secret | Admin API access token. | +| PostgreSQL dump | `dump-mrt0rtikize.sql` | All user accounts, rooms, messages. | +| Media files | Synapse media PV | Uploaded images/files/avatars. | + +### 2.2 Restore (on new Yandex cluster) + +```bash +# 1. Create secrets in the matrix-mrt0rtikize namespace +NS=matrix-mrt0rtikize +kubectl create ns $NS + +# Apply the generated secrets (signing key etc — DO NOT let initSecrets regenerate it) +kubectl apply -f secrets-mrt0rtikize.yaml +kubectl apply -f markers-mrt0rtikize.yaml + +# 2. 
Restore PostgreSQL dumps +# CNPG service: shared-pg-rw.cnpg.svc.cluster.local +# Extract per-DB dumps from the pg_dumpall output first (it is plain SQL, so pg_restore cannot read it): +PG_POD=$(kubectl get pods -n cnpg -l "cnpg.io/cluster=shared-pg,cnpg.io/instanceRole=primary" -o name | head -1) + +# Restore Synapse DB (-i is required so kubectl forwards stdin to psql): +kubectl exec -i -n cnpg $PG_POD -- psql -U synapse_mrt0rtikize \ + -d synapse_mrt0rtikize < dump-mrt0rtikize.sql + +# Restore MAS DB: +kubectl exec -i -n cnpg $PG_POD -- psql -U mas_mrt0rtikize \ + -d mas_mrt0rtikize < dump-mrt0rtikize.sql + +# (Note: pg_dumpall produces a single plain-SQL file for all databases. Split it +# per-database first, or re-dump each database with pg_dump -Fc and use pg_restore.) + +# 3. Restore media files +# Copy from backup to the new PV (path depends on storage class) +# For Yandex Cloud CSI: mount the PV on a temp pod and copy files in + +# 4. Deploy ESS via ArgoCD +# The Application was already committed to git (argocd/apps/matrix-mrt0rtikize.yaml). +# ArgoCD syncs it. Since secrets + markers are pre-loaded, the chart initializes +# with the existing signing key and database credentials. + +# 5. Verify +# - Log in with an existing user +# - Check federation: https://federationtester.matrix.org/?server_name=mrt0rtikize.ru +# - Test Element Call (VoIP) +# - Monitor logs for errors +``` + +### 2.3 DNS Cutover + +Once validated: +``` +Old records: mrt0rtikize.ru → k3s cluster IP + *.mrt0rtikize.ru → k3s cluster IP + +New records: mrt0rtikize.ru → new cluster Traefik LB IP + matrix.mrt0rtikize.ru → new cluster LB + account.mrt0rtikize.ru → new cluster LB + chat.mrt0rtikize.ru → new cluster LB + admin.mrt0rtikize.ru → new cluster LB + mrtc.mrt0rtikize.ru → new cluster LB +``` + +Lower DNS TTLs 24h before cutover to minimize propagation delay. + +### 2.4 Rollback + +If migration fails: +1. Scale down Synapse + MAS on new cluster +2. Revert DNS to k3s cluster IP +3. Scale up Synapse + MAS on k3s homelab + +The old instance on k3s should still be functional (just stopped, not deleted). + +--- + +## 3. 
Production Migration (vague plan) + +> Repeat steps from Section 2 for each production instance, one at a time. + +### 3.1 Order + +| # | Instance | Synapse Load | Complexity | +|---|----------|--------------|------------| +| 1 | `mrt0rtikize.ru` | Minimal (test) | Low — prove procedure | +| 2 | `t0rt1k.tech` | **582Mi** (busiest) | High — schedule during low-traffic, may need extended downtime | +| 3 | `roglog.space` | 134Mi | Medium | +| 4 | `uretra.space` | 137Mi | Medium | + +### 3.2 Pre-migration Checklist (per instance) + +``` +[ ] Announce maintenance window to users +[ ] Lower DNS TTLs (24h before) +[ ] Full PostgreSQL dump + verify (pg_restore --list) +[ ] Backup media files + verify checksums +[ ] Backup generated secrets (verify signing key matches federation) +[ ] Save current Helm values (helm get values) +[ ] Document current ingress/DNS/Certificate setup +[ ] Prepare rollback procedure +``` + +### 3.3 Migration Steps (per instance) + +``` +1. Stop Synapse + MAS on old cluster +2. Create CNPG databases on new cluster +3. Restore PostgreSQL dump to CNPG +4. Restore media files to new PV +5. Apply secrets (signing key, MAS keys, macaroon) +6. Apply deployment markers +7. Deploy ESS via ArgoCD on new cluster +8. Wait for pods healthy, certs issued +9. Test: login, federation, Element Call +10. Cut over DNS +11. Monitor for 24h +12. If stable: remove old instance resources from old cluster +``` + +### 3.4 Special Considerations for Prod Instances + +**`t0rt1k.tech` (busiest instance, 582Mi Synapse):** +- Uses older `matrix-2.9.17` chart (NOT ESS). Migration means switching to ESS Community chart. +- Has a **failed `syn2mas` job** — MAS migration was incomplete. When deploying ESS which bundles MAS, the migration may need to be completed or re-done. +- The 582Mi memory usage suggests many concurrent users/rooms — dump may be large. Allocate enough storage and time for the SQL dump/restore. 
+- Consider running the new ESS in parallel (different hostnames) first, then switching DNS once proven. + +**`roglog.space` and `uretra.space`:** +- Lower load (134Mi/137Mi) — quicker backups, less downtime risk. +- Same chart switch (`matrix-2.9.17` → ESS). +- Can be done in shorter windows. + +**Chart migration (`matrix-2.9.17` → ESS):** +- The old chart uses separate Helm releases per component (`chat`, `element-call`, `livekit`). +- ESS bundles everything into one chart. The database schema may differ. +- Key difference: ESS uses MAS for auth (Matrix 2.0), old chart may use legacy Synapse auth. +- May need to run `syn2mas` migration or manual user migration. Investigate per-instance before cutover. + +--- + +## 4. PostgreSQL Backup (ongoing) + +CloudNativePG has built-in backup to S3-compatible storage. Configure once for automatic daily backups: + +```yaml +apiVersion: postgresql.cnpg.io/v1 +kind: ScheduledBackup +metadata: + name: shared-pg-daily + namespace: cnpg +spec: + schedule: "0 0 3 * * *" # 03:00 UTC daily (CNPG cron has six fields, seconds first) + backupOwnerReference: self + cluster: + name: shared-pg + immediate: false + target: prefer-standby +``` + +CNPG also supports continuous WAL archiving to S3 for point-in-time recovery. +Configure Yandex Object Storage as the S3 target. + +--- + +## 5. 
Architecture Diagram (text) + +``` +┌────────────────────────────────────────────────────────────┐ +│ Yandex Cloud Managed K8s │ +│ │ +│ ┌───────────────────┐ ┌──────────────────────────────┐ │ +│ │ Infrastructure │ │ Matrix Layer │ │ +│ │ │ │ │ │ +│ │ Gitea (git) │ │ ┌─────────────────────────┐ │ │ +│ │ ArgoCD (gitops) │ │ │ matrix-mrt0rtikize (ESS)│ │ │ +│ │ cert-manager │ │ │ - Synapse │ │ │ +│ │ Traefik (LB) │ │ │ - MAS │ │ │ +│ │ Prometheus/Grafana │ │ │ - Element Web/Admin │ │ │ +│ │ VictoriaMetrics │ │ │ - Matrix RTC (LiveKit) │ │ │ +│ │ Loki │ │ │ - Hookshot │ │ │ +│ │ Alloy │ │ │ - HAProxy │ │ │ +│ └───────────────────┘ │ └──────────┬──────────────┘ │ │ +│ │ │ │ │ +│ ┌───────────────────┐ │ ┌──────────▼──────────────┐ │ │ +│ │ CNPG Cluster │ │ │ matrix-t0rt1k (ESS) │ │ │ +│ │ (3 nodes, SSD) │◄──┤ │ (same structure) │ │ │ +│ │ │ │ └─────────────────────────┘ │ │ +│ │ synapse_mrt0rtikize│ │ ┌─────────────────────────┐ │ │ +│ │ mas_mrt0rtikize │ │ │ matrix-roglog (ESS) │ │ │ +│ │ synapse_t0rt1k │ │ │ (same structure) │ │ │ +│ │ mas_t0rt1k │ │ └─────────────────────────┘ │ │ +│ │ synapse_roglog │ │ ┌─────────────────────────┐ │ │ +│ │ mas_roglog │ │ │ matrix-uretra (ESS) │ │ │ +│ │ synapse_uretra │ │ │ (same structure) │ │ │ +│ │ mas_uretra │ │ └─────────────────────────┘ │ │ +│ └───────────────────┘ └──────────────────────────────┘ │ +│ │ +│ External LB: │ +└────────────────────────────────────────────────────────────┘ +``` + +--- + +## 6. Implementation Notes + +### 6.1 Secrets Management + +- ESS `initSecrets` generates 14 credentials. For migration, these MUST be restored from backup (not regenerated). +- `SYNAPSE_SIGNING_KEY` is the most critical — it identifies the server to the federation. Changing it breaks all existing rooms and federation relationships. +- The `matrix*-generated` secret and deployment markers ConfigMap must be applied **before** the first ArgoCD sync, so the ESS chart does not generate new (wrong) ones. 
+- For fresh ESS instances (new homeservers, not migrations), let `initSecrets` generate them normally. + +### 6.2 Image Registry + +- ESS pulls from `oci.element.io` (Synapse, Element Web, Element Admin, lk-jwt-service) and `ghcr.io` (matrix-tools, hookshot), and `docker.io` (livekit, postgres, redis). +- `oci.element.io` S3 backend (`oci-element-io-images-storage-prod.s3.eu-central-1.amazonaws.com`) was observed to fail intermittently from Russia with "connection reset by peer". Images eventually pulled on retry, but consider: + - Setting `image.pullPolicy: IfNotPresent` to reduce re-pulls + - Setting up a containerd registry mirror or local pull-through cache for `oci.element.io` + - Pre-pulling images to nodes during initial setup + +### 6.3 Resource Limits + +Set `resources.requests` and `resources.limits` on all ESS components to prevent the 94% node issue seen in prod: + +```yaml +synapse: + resources: + requests: + memory: 256Mi + cpu: 100m + limits: + memory: 1Gi + cpu: 1000m +``` + +Do similar for MAS, element-web, livekit-sfu, etc. ESS chart supports per-component resource configuration. + +### 6.4 Storage Classes + +| Workload | Storage Class | Reason | +|----------|--------------|--------| +| PostgreSQL (CNPG) | `yc-network-ssd` | Database — needs low latency / high IOPS | +| Synapse media | `yc-network-hdd` (default) | Media files — sequential access, SSD benefit is marginal | +| Prometheus TSDB | `yc-network-ssd` | Time-series DB — random writes benefit from SSD | +| Loki chunks | `yc-network-hdd` | Log storage — sequential writes, HDD is fine | + +--- + +## 7. Next Steps (for next session) + +When the new cluster is ready, open a new session and point to this file. The next session should: + +1. Read this plan +2. Explore the new cluster (nodes, storage classes, ingress config) +3. 
Implement Phase 0 (bootstrap GitOps foundation): + - Create `~/infra/yandex-prod/` directory structure + - Write `bootstrap/gitea/` manifests + - Write `bootstrap/argocd/install.sh` + `values.yaml` + - Write `argocd/app-of-apps.yaml` + - Write infrastructure apps (cert-manager, CNPG, monitoring) + - Write ESS apps + - Push to Gitea +4. Execute Phase 1 (backup `mrt0rtikize.ru` from k3s) +5. Execute Phase 2 (restore `mrt0rtikize.ru` to new cluster) +6. Validate and plan DNS cutover diff --git a/argocd/app-of-apps.yaml b/argocd/app-of-apps.yaml index 29d4cd5..56f92aa 100644 --- a/argocd/app-of-apps.yaml +++ b/argocd/app-of-apps.yaml @@ -8,8 +8,8 @@ metadata: spec: project: default source: - repoURL: http://gitea.gitea.svc.cluster.local:3000/gitea/yandex-prod.git - targetRevision: main + repoURL: http://gitea.gitea.svc.cluster.local:3000/admin/main.git + targetRevision: master path: argocd/apps directory: recurse: true diff --git a/argocd/apps/cert-manager-issuers.yaml b/argocd/apps/cert-manager-issuers.yaml index 3c16c58..f334265 100644 --- a/argocd/apps/cert-manager-issuers.yaml +++ b/argocd/apps/cert-manager-issuers.yaml @@ -10,8 +10,8 @@ metadata: spec: project: default source: - repoURL: http://gitea.gitea.svc.cluster.local:3000/gitea/yandex-prod.git - targetRevision: main + repoURL: http://gitea.gitea.svc.cluster.local:3000/admin/main.git + targetRevision: master path: manifests/cert-manager directory: recurse: true diff --git a/argocd/apps/cnpg-cluster.yaml b/argocd/apps/cnpg-cluster.yaml index 1656175..169e4c8 100644 --- a/argocd/apps/cnpg-cluster.yaml +++ b/argocd/apps/cnpg-cluster.yaml @@ -10,8 +10,8 @@ metadata: spec: project: default source: - repoURL: http://gitea.gitea.svc.cluster.local:3000/gitea/yandex-prod.git - targetRevision: main + repoURL: http://gitea.gitea.svc.cluster.local:3000/admin/main.git + targetRevision: master path: manifests/cnpg directory: recurse: true diff --git a/argocd/apps/monitoring.yaml b/argocd/apps/monitoring.yaml index 
3df93d2..23fb112 100644 --- a/argocd/apps/monitoring.yaml +++ b/argocd/apps/monitoring.yaml @@ -101,8 +101,8 @@ spec: kubeEtcd: enabled: false - - repoURL: http://gitea.gitea.svc.cluster.local:3000/gitea/yandex-prod.git - targetRevision: main + - repoURL: http://gitea.gitea.svc.cluster.local:3000/admin/main.git + targetRevision: master path: manifests/metrics/grafana directory: recurse: true diff --git a/bootstrap/argocd/install.sh b/bootstrap/argocd/install.sh index fed1032..c3bfedc 100755 --- a/bootstrap/argocd/install.sh +++ b/bootstrap/argocd/install.sh @@ -1,13 +1,12 @@ #!/bin/bash set -e -KUBECONFIG="/home/mrt0rtikize/infra/yandex-prod/kubeconfig" -KCTL="kubectl --kubeconfig ${KUBECONFIG}" +export KUBECONFIG="$(dirname "$(realpath "$0")")/../../kubeconfig" echo "=== Installing ArgoCD ===" helm repo add argo https://argoproj.github.io/argo-helm 2>/dev/null || true -helm repo update +helm repo update argo helm upgrade --install argocd argo/argo-cd \ --namespace argocd \ @@ -20,10 +19,10 @@ echo "" echo "=== ArgoCD installed ===" echo "" echo "To access ArgoCD UI:" -echo " kubectl --kubeconfig ${KUBECONFIG} port-forward svc/argocd-server -n argocd 8080:80" +echo " kubectl port-forward svc/argocd-server -n argocd 8080:80" echo "" echo "Admin password:" -kubectl --kubeconfig ${KUBECONFIG} -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d +kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d echo "" echo "" echo "Login with username: admin" diff --git a/kubeconfig b/kubeconfig index 7e7625e..51381a1 100644 --- a/kubeconfig +++ b/kubeconfig @@ -1,5 +1,46 @@ -# Placeholder for Yandex Cloud Managed Kubernetes kubeconfig. -# Copy your kubeconfig here before running bootstrap. 
-# -# Example: -# yc managed-kubernetes cluster get-credentials --id --external --kubeconfig kubeconfig +apiVersion: v1 +clusters: +- cluster: + certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJMk1EWXhNakV5TkRNd05Wb1hEVE0yTURZd09URXlORE13TlZvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTEEwCnByK0hIV3NqKzhEd216am03TzFhck83djVUM29MNmhIODdnMWIxeXhIRkx1RXBmYkZ3emdnMDkrRnNhL0dqdzgKTmdiNEwreTBnWE5aK3ArM3prZ3NDenRuK3IzODFObFBnODJPZzhacG9RNm5xTGNwQTBDSGlQMzhBQ0szd1pCNworSVNQK0loZ2V2Y1Fqc0F2c3R1djlmbjJGbElDMUptS2JuRkFCZTdKbFBzUTg5czh1QU1OZWJ5aTJuckM3NlJVCjZMRU9oMUVrWC9SUFBxa05yaXdYalVEWWttcjU1OG94WG9MMDVoWkg4eXFUYmEyVkJxeE8zZlo4cktPV2VOcGoKNk5GRU1ELzEzeU9QZkNGRHYxY2NhSTI0SWx0T2VSVSsvb0VxaHNhU2ZObi8yT3dqMEZLb2sxV2l6QUVCU2lBNwpMcW1CT1BMU2dweWlQRVVmTHlrQ0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZKeTNtbEF2MmpJN0cvRW92TGNvYWFJNE5QQmtNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFCaVVMZXVDU0JjbUpGYWdwTm1vZUs5TSthcHd4QjhBdElQNEZLbmkrbVgvamY5U2kvRApXeFJsVW5GdmYyVVR0ZHhPR2xxKzJyTDRKOGdPRldXVENCaWh0aFpGYWd4Nm91N05ldnFqRlAwL1RxL0kvSTd5CktNOVJZdXZQU28xeGVtdm1neTY2cENKNyswNkV4UzVMZm9nU1VUWmRUNkwwWG5UZWRJbVg1cFBYRWE4VjEwMlEKN0ZLbmFvb0pVakhmaGFlSmN0NndUYXM0b0s4aTZPcWVpOFAzYWhXYlIwQnJ2akhVUkdTc2FrQnJRNk1UcUlIQQp2SW9PNnEvcGYrUmlIUmNyMkQ1K1gyKy9EU3pvMXRKVVM1ZGZrQVNsaWpDL0tIdzd5SlA0dmZaOXVMNHl3VjRpCkZBT3RjMldKTGM4bFZvVkhxZHg3NnFKZFlTdGI4RnJVL2hrNQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg== + server: https://89.169.129.204 + name: yc-managed-k8s-cat7eltmvnj2fp9bb5f4 +- cluster: + certificate-authority-data: 
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJMU1UQXdOREUwTURBd04xb1hEVE0xTVRBd01qRTBNREF3TjFvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBSzhCCitmMlNQblpZb1pYZ1laTml4Qm03MVhTQy80dmgxWHBaY2lvOTgxYkNLL3Y1TlVLbTBqLzBXSWRXdXRMV1FLZWEKMXhEazNsaEVGKys0Zk1sRG1rbkg3emlteG4vSGx3MW1MT3VDMWo4TnpCc1kyZWdVV1RDNTlTM01TSHhDcVlqNwpZWGo5TkEvOHptdEZuRTQ2RFZoMU05aDVFOWVNN2U0QkRmaVczbUpoWlZ0cjRIVzhWU1Zmc2FOa2hlZUxhYk9HCkRuSDBKRFRCTTlJdUgyV0dTSmJaRnFrOU9ibnNKd1ltS2ltMkFCK3dCVkZrMEZLWUFuT1JPZFhaSC8wd2Z2MlkKMmxvUTdkY2RrNkt1ay9oNjJuT2xscEtqRlhpTkhXVjBPTVNqdFdlbHhqTXNmMm9peGNKOUdORzRLdzBIVWhKWAppUk5DRVpuenBZckMyZzJIMWY4Q0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZOQUQ5SWhxSWp2NVh6eDF1MEtJNDZzOGFkV0VNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFCMTc0enlDR3BZUlBDRzBjY3luUG5XbXI4MW51cW1aVVVXUjNpREFQVWVONUFoOXRuawp4dlZtbGFXem44OWpmMFJvcjRNZHZtWk4yc01LWTB6WGF5MkxhVUVleTNFVHVyamJqUDh4ZFMrR1EramJIc3BIClN4aUpGcXh6ZktXM3kvK3RkZngrUGFxQUszOUFpLzFlK2lKVWU3RmY5YlNZNWpZUHVjQTE3VWNyS3U4TkkvZUEKOVd1NWQxNEUwaGlVRWx3ODRZTDVlRHVrUkgyK3cvd3EzeDRHTzA3RnFZZzZCZ3RtQjJRdGk4VG1XSk1XU2l2VgpFWHBHOG8wdUFjNjU0anZhQi9HeWYxRlpuWDdCQ3JKd0Y3RjFhbTMvdGVSbkMxR3YwLzBVR2k1YlJPbERwdFQrClVEd1I0cldWSVVMSzVicnEwNjd5dVZ2bjV2WC9UaHFaQzlydgotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg== + server: https://158.160.139.227 + name: yc-managed-k8s-cats3abbp71o4bal5aus +contexts: +- context: + cluster: yc-managed-k8s-cats3abbp71o4bal5aus + user: yc-managed-k8s-cats3abbp71o4bal5aus + name: yc-playground +- context: + cluster: yc-managed-k8s-cat7eltmvnj2fp9bb5f4 + user: yc-managed-k8s-cat7eltmvnj2fp9bb5f4 + name: yc-prod +current-context: yc-prod +kind: Config +preferences: {} +users: +- name: yc-managed-k8s-cat7eltmvnj2fp9bb5f4 + user: + exec: + apiVersion: client.authentication.k8s.io/v1beta1 + args: + - k8s + - create-token + - --profile=default + 
command: /home/mrt0rtikize/.local/bin/yc + env: null + provideClusterInfo: false +- name: yc-managed-k8s-cats3abbp71o4bal5aus + user: + exec: + apiVersion: client.authentication.k8s.io/v1beta1 + args: + - k8s + - create-token + - --profile=default + command: /home/mrt0rtikize/.local/bin/yc + env: null + interactiveMode: IfAvailable + provideClusterInfo: false