# Yandex Cloud Production Cluster: Bootstrap Guide

## Prerequisites

- [ ] `kubeconfig` file placed at `~/infra/yandex-prod/kubeconfig`
- [ ] `kubectl` context pointing to the new `yc-prod` cluster
- [ ] Domain `prod.t01tt.tech` DNS managed (can be updated later in Phase 5)
- [ ] `git`, `helm`, and the `argocd` CLI installed locally

---

## Phase 0: Verify Cluster Access

```bash
export KUBECONFIG=~/infra/yandex-prod/kubeconfig
kubectl get nodes
# Expected: 3 nodes Ready, 2 CPU / 8 GB each
kubectl get sc
# Expected: yc-network-hdd (default), yc-network-ssd, yc-network-nvme, ...
```

---

## Phase 1: Bootstrap Gitea (internal access only)

Gitea hosts the Git repo that ArgoCD reads. Deploy it first, but without ingress; until Traefik exists, we access it via port-forward.

```bash
kubectl apply -f bootstrap/gitea/namespace.yaml
kubectl apply -f bootstrap/gitea/pvc.yaml
kubectl apply -f bootstrap/gitea/deployment.yaml
kubectl apply -f bootstrap/gitea/service.yaml
# NOTE: Do NOT apply ingress.yaml yet: no Traefik or cert-manager exists
```

Wait for Gitea to be ready, then port-forward and configure:

```bash
kubectl wait deploy/gitea -n gitea --for=condition=available --timeout=120s
# Port-forward in a separate terminal:
kubectl port-forward svc/gitea 3000:3000 -n gitea
```

1. Open **http://localhost:3000** in a browser
2. Fill out the install form:
   - Database: **SQLite3** (default)
   - Site Title: **Gitea**
   - Domain: **git.prod.t01tt.tech**
   - Application URL: **https://git.prod.t01tt.tech**
   - Create admin account (username/password/email; save these)
3. Click "Install Gitea"
4. Create a new repository: **`master`** (must be **public**, owned by admin)
5. Leave the port-forward running, because Phase 2 pushes through it. Close it (Ctrl+C) once the push succeeds.

---

## Phase 2: Push Repository to Gitea

```bash
cd ~/infra/yandex-prod
git init
# Keep the cluster credentials in this directory out of the public repo
echo "kubeconfig" >> .gitignore
# Requires the Gitea port-forward from Phase 1 to still be running
git remote add origin http://localhost:3000/admin/master.git
# Or, once Gitea ingress works later, use:
# git remote add origin https://git.prod.t01tt.tech/admin/master.git
git add -A
git commit -m "initial bootstrap: infrastructure manifests"
git push -u origin master
# Enter Gitea admin credentials when prompted
```

---

## Phase 3: Install ArgoCD (internal access only)

```bash
bash bootstrap/argocd/install.sh
# Saves the admin password; copy it
```

Add the Gitea repository to ArgoCD:

```bash
# Via port-forward:
kubectl port-forward svc/argocd-server 8080:80 -n argocd &
sleep 2
# Login and add repo:
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
argocd login localhost:8080 --username admin --password "${ARGOCD_PASS}" --insecure
argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
  --name yandex-prod \
  --type git
```

Deploy the root app:

```bash
kubectl apply -f argocd/app-of-apps.yaml
```

ArgoCD will now sync child apps according to their sync waves. You can watch progress:

```bash
argocd app list
```

---

## Phase 4: Let the Sync Waves Run

Sync order (automated by ArgoCD via `argocd.argoproj.io/sync-wave` annotations):

| Wave | App | What happens |
|------|-----|-------------|
| **-2** | `traefik` | DaemonSet deployed on all 3 nodes. NLB created → external IP provisioned |
| **-1** | `cert-manager` | cert-manager operator + CRDs installed |
| **0** | `cert-manager-issuers` | `letsencrypt-production` + `letsencrypt-staging` ClusterIssuers created |
| **0** | `monitoring` | VM k8s-stack (metrics) + Grafana ingress deployed |
| **0** | `loki` | Loki single-binary deployed |
| **0** | `cnpg-operator` | CloudNativePG operator installed |
| **1** | `cnpg-cluster` | `shared-pg` 3-node PostgreSQL cluster + 8 databases created |

**Verify Traefik IP:**

```bash
kubectl get svc traefik -n traefik -w
# Wait for EXTERNAL-IP to appear. Example output:
# traefik   LoadBalancer   10.x.x.x   <pending>     80:3xxxx/TCP,443:3xxxx/TCP   30s
# traefik   LoadBalancer   10.x.x.x   158.160.x.x   80:3xxxx/TCP,443:3xxxx/TCP   60s
```

Note the EXTERNAL-IP: this is your NLB IP, and you will need it in Phase 5.

**Verify state:**

```bash
kubectl get pods -A
# Expected running pods:
# traefik: traefik-xxxxx (3 pods, DaemonSet)
# cert-manager: cert-manager-xxxxx, cert-manager-cainjector-xxxxx, cert-manager-webhook-xxxxx
# metrics: vm-k8s-stack-* pods (vmsingle, alertmanager, grafana, node-exporter, kube-state-metrics, vmagent)
# metrics: loki-0
# cnpg-system: cnpg-operator-xxxxx
# cnpg: shared-pg-1, shared-pg-2, shared-pg-3 (may take a minute to start)
kubectl get clusterissuer
# Expected: letsencrypt-production (True), letsencrypt-staging (True)
kubectl get cluster -n cnpg
# Expected: shared-pg (3/3 instances ready)
```

---

## Phase 5: DNS + Expose Gitea & ArgoCD

Now that Traefik has an external IP and cert-manager is running, we can:

1. Point DNS at the NLB IP
2. Create the Gitea and ArgoCD ingress resources (with TLS)

### 5.1 Update DNS

Point the following records to the Traefik NLB IP (from Phase 4):

```
git.prod.t01tt.tech      → <NLB_IP>
argocd.prod.t01tt.tech   → <NLB_IP>
grafana.prod.t01tt.tech  → <NLB_IP>
```

Also create a wildcard for future hosts:

```
*.prod.t01tt.tech        → <NLB_IP>
```

### 5.2 Apply Ingresses

```bash
kubectl apply -f bootstrap/gitea/ingress.yaml
kubectl apply -f bootstrap/argocd/ingress.yaml
```

### 5.3 Wait for TLS Certificates

```bash
kubectl get certificate -A -w
# Wait for all to show Ready=True:
# gitea    gitea-tls     True
# argocd   argocd-tls    True
# metrics  grafana-tls   True
```

**Troubleshooting:** If certificates are stuck in `Pending`:

- Check DNS resolves: `dig git.prod.t01tt.tech` must return the NLB IP
- Check cert-manager logs: `kubectl logs -n cert-manager deploy/cert-manager`
- Check challenge: `kubectl get challenges -A`

---

## Phase 6: Verify Everything

### Gitea

```
https://git.prod.t01tt.tech
```

Login with the admin credentials from Phase 1. Verify the `admin/master` repo exists.

### ArgoCD

```
https://argocd.prod.t01tt.tech
```

Login with `admin` + password from Phase 3. All apps should show green (`Synced` + `Healthy`). The Ingress health may show `Healthy` immediately (by design; see the `values.yaml` customization).

### Grafana

```
https://grafana.prod.t01tt.tech
```

Login with `admin` / `change-me`. Check that VM k8s-stack dashboards are available.

### PostgreSQL

```bash
kubectl get databases -n cnpg
# Expected: 8 Database resources, one per homeserver
kubectl get pods -n cnpg
# Expected: shared-pg-1, shared-pg-2, shared-pg-3 (Running)
```

### ArgoCD Repo Connection

```bash
argocd repo list
# Expected: the Gitea repo with status "Successful"
```

If not connected, re-add the repo in the ArgoCD UI (Settings → Repositories → Connect repo) or via the ArgoCD CLI:

```bash
argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
  --name yandex-prod \
  --type git
```
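
The "all apps should show green" check from the ArgoCD step can also be scripted. The sketch below is a hypothetical helper, not part of this repository. It assumes `kubectl get applications.argoproj.io --no-headers` prints the columns NAME, SYNC STATUS, HEALTH STATUS, and that ArgoCD runs in the `argocd` namespace as in Phase 3. Adjust both if your setup differs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads Application rows from stdin and prints every app
# that is not both Synced and Healthy. Exits 1 if any such app was found.
# Assumed input columns: NAME, SYNC STATUS, HEALTH STATUS.
# Note: empty input counts as success, so check that apps exist first.
unhealthy_apps() {
  awk 'NF == 0 { next }
       $2 != "Synced" || $3 != "Healthy" { print $1 ": " $2 "/" $3; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}

# Usage:
#   kubectl get applications.argoproj.io -n argocd --no-headers | unhealthy_apps \
#     && echo "all apps green"
```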
---

## Phase 7: Post-Bootstrap Checklist

- [ ] All ArgoCD apps `Synced` and `Healthy`
- [ ] `https://git.prod.t01tt.tech`: Gitea accessible, SSL valid
- [ ] `https://argocd.prod.t01tt.tech`: ArgoCD accessible, SSL valid
- [ ] `https://grafana.prod.t01tt.tech`: Grafana accessible, SSL valid, datasources working
- [ ] `kubectl get pvc -A`: PVCs bound for all stateful components
- [ ] CNPG `shared-pg` cluster status: `kubectl get cluster -n cnpg` shows 3/3 ready
- [ ] Certificates all `Ready`: `kubectl get certificate -A | grep False` (should return nothing)

---

## Quick Reference: Service URLs

| Service | URL | Auth |
|---------|-----|------|
| Gitea | `https://git.prod.t01tt.tech` | Admin user from Phase 1 |
| ArgoCD | `https://argocd.prod.t01tt.tech` | `admin` / password from Phase 3 |
| Grafana | `https://grafana.prod.t01tt.tech` | `admin` / `change-me` |
| Traefik dashboard | `kubectl port-forward -n traefik daemonset/traefik 9000:9000` | Internal only |

---

## Troubleshooting

### Traefik NLB stuck in `<pending>`

Yandex Cloud NLB provisioning can take a few minutes. Check:

```bash
kubectl describe svc traefik -n traefik
```

If it's stuck for >5 minutes, verify the Yandex annotations are correct.

### Certificates stuck in `Pending`

1. Verify DNS: `dig git.prod.t01tt.tech` → must return the NLB IP
2. Check Traefik is listening: `curl -k https://<NLB_IP> -H "Host: git.prod.t01tt.tech"` → should return 404 (expected, just verifying Traefik responds)
3. Check orders: `kubectl get orders -A`

### CNPG cluster not becoming ready

```bash
kubectl describe cluster shared-pg -n cnpg
kubectl logs -n cnpg-system deploy/cnpg-controller-manager
```

Common issue: pods can't schedule due to `podAntiAffinityType: required`. Ensure all 3 nodes exist and PVCs can bind.

### Gitea UI shows wrong URL after first login

Gitea caches the ROOT_URL from the `deployment.yaml` env vars. If you change the domain, update:

```bash
kubectl set env deploy/gitea -n gitea \
  GITEA__server__DOMAIN=git.prod.t01tt.tech \
  GITEA__server__ROOT_URL=https://git.prod.t01tt.tech
kubectl rollout restart deploy/gitea -n gitea
```

### ArgoCD apps showing "Unknown" health

This is normal for Ingress resources: the custom health check in `bootstrap/argocd/values.yaml` marks all Ingresses as `Healthy` once synced. For other resources, check the app details in the ArgoCD UI for the specific error.

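
### Scripting the DNS check

The DNS check from the certificate troubleshooting steps can be scripted so that all three hosts are compared against the NLB IP in one pass. This is a minimal sketch: `resolves_to` and the `NLB_IP` variable are hypothetical names introduced here, and the usage assumes `dig` is installed locally.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if any line on stdin is exactly the expected IP.
# -F treats the IP as a literal string, -x requires a whole-line match,
# -q suppresses output so only the exit status is used.
resolves_to() {
  local expected="$1"
  grep -Fxq -- "$expected"
}

# Usage (NLB_IP is the EXTERNAL-IP noted in Phase 4):
#   for host in git argocd grafana; do
#     dig +short "${host}.prod.t01tt.tech" | resolves_to "$NLB_IP" \
#       || echo "${host}.prod.t01tt.tech does not resolve to ${NLB_IP}"
#   done
```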