Yandex Cloud Production Cluster — Bootstrap Guide
Prerequisites
- kubeconfig file placed at ~/infra/yandex-prod/kubeconfig
- kubectl context pointing to the new yc-prod cluster
- Domain prod.t01tt.tech DNS managed (can be updated later in Phase 5)
- git and helm installed locally
Phase 0: Verify Cluster Access
export KUBECONFIG=~/infra/yandex-prod/kubeconfig
kubectl get nodes
# Expected: 3 nodes Ready, 2 CPU / 8 GB each
kubectl get sc
# Expected: yc-network-hdd (default), yc-network-ssd, yc-network-nvme, ...
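The node check above can also be scripted so that later phases stop early when the cluster is not fully up. This is a minimal sketch, not part of the repository: the function names and the `EXPECTED_NODES` variable are illustrative assumptions, and the default of 3 comes from the expected output above.

```shell
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A cordoned node ("Ready,SchedulingDisabled") is deliberately not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Succeed only when the number of Ready nodes matches the expected count.
# EXPECTED_NODES is a hypothetical variable; it defaults to the 3 nodes
# this guide assumes.
check_nodes() {
  ready=$(kubectl get nodes --no-headers | count_ready_nodes)
  [ "$ready" -eq "${EXPECTED_NODES:-3}" ]
}
```

Calling `check_nodes || echo "cluster not ready"` before Phase 1 gives a quick go/no-go signal.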
Phase 1: Bootstrap Gitea (internal access only)
Gitea hosts the Git repo that ArgoCD reads. Deploy it first, but without ingress — we access it via port-forward.
kubectl apply -f bootstrap/gitea/namespace.yaml
kubectl apply -f bootstrap/gitea/pvc.yaml
kubectl apply -f bootstrap/gitea/deployment.yaml
kubectl apply -f bootstrap/gitea/service.yaml
# NOTE: Do NOT apply ingress.yaml yet — no Traefik or cert-manager exists
Wait for Gitea to be ready, then port-forward and configure:
kubectl wait deploy/gitea -n gitea --for=condition=available --timeout=120s
# Port-forward in a separate terminal:
kubectl port-forward svc/gitea 3000:3000 -n gitea
- Open http://localhost:3000 in a browser
- Fill out the install form:
- Database: SQLite3 (default)
- Site Title: Gitea
- Domain: git.prod.t01tt.tech
- Application URL: https://git.prod.t01tt.tech
- Create admin account (username/password/email — save these)
- Click "Install Gitea"
- Create a new repository: master (must be public, owned by admin)
- Close the port-forward (Ctrl+C)
Phase 2: Push Repository to Gitea
cd ~/infra/yandex-prod
git init
# Pushing to this localhost URL needs the Gitea port-forward from Phase 1 running again
git remote add origin http://localhost:3000/admin/master.git
# Or, once Gitea ingress works later, use:
# git remote add origin https://git.prod.t01tt.tech/admin/master.git
git add -A
git commit -m "initial bootstrap: infrastructure manifests"
git push -u origin master
# Enter Gitea admin credentials when prompted
Phase 3: Install ArgoCD (internal access only)
bash bootstrap/argocd/install.sh
# Saves the admin password — copy it
Add the Gitea repository to ArgoCD:
# Via port-forward:
kubectl port-forward svc/argocd-server 8080:80 -n argocd &
sleep 2
# Login and add repo:
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
argocd login localhost:8080 --username admin --password "${ARGOCD_PASS}" --insecure
argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
--name yandex-prod \
--type git
Deploy the root app:
kubectl apply -f argocd/app-of-apps.yaml
ArgoCD will now sync child apps according to their sync waves. You can watch progress:
argocd app list
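While the waves are running, it can help to list only the apps that are not yet Synced and Healthy. The sketch below reads the Application resources directly with kubectl. It assumes the apps live in the `argocd` namespace, and the function names are illustrative rather than anything shipped in the repository.

```shell
# Print every Application that is not both Synced and Healthy.
# Reads lines of "NAME SYNC HEALTH" on stdin.
unhealthy_apps() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1 ": " $2 "/" $3 }'
}

# Fetch name, sync status and health status of all Applications and filter them.
# Assumption: Applications are created in the argocd namespace.
list_unhealthy_apps() {
  kubectl get applications.argoproj.io -n argocd --no-headers \
    -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
    | unhealthy_apps
}
```

An empty result from `list_unhealthy_apps` means every app has settled.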
Phase 4: Let the Sync Waves Run
Sync order (automated by ArgoCD via argocd.argoproj.io/sync-wave annotations):
| Wave | App | What happens |
|---|---|---|
| -2 | traefik | DaemonSet deployed on all 3 nodes. NLB created → external IP provisioned |
| -1 | cert-manager | cert-manager operator + CRDs installed |
| 0 | cert-manager-issuers | letsencrypt-production + letsencrypt-staging ClusterIssuers created |
| 0 | monitoring | VM k8s-stack (metrics) + Grafana ingress deployed |
| 0 | loki | Loki single-binary deployed |
| 0 | cnpg-operator | CloudNativePG operator installed |
| 1 | cnpg-cluster | shared-pg 3-node PostgreSQL cluster + 8 databases created |
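To compare the waves in the table with what is actually deployed, the annotation can be read back from the cluster. This is a minimal sketch under two assumptions: the child apps are Application resources in the `argocd` namespace, and the sync-wave annotation sits on each Application's own metadata. The function names are illustrative.

```shell
# Sort "WAVE<TAB>APP" lines numerically by wave, lowest (earliest) first.
order_by_wave() {
  sort -k1,1n
}

# List every Application with its sync-wave annotation, in sync order.
# Assumption: Applications live in the argocd namespace.
list_sync_waves() {
  kubectl get applications.argoproj.io -n argocd \
    -o jsonpath='{range .items[*]}{.metadata.annotations.argocd\.argoproj\.io/sync-wave}{"\t"}{.metadata.name}{"\n"}{end}' \
    | order_by_wave
}
```

An app without the annotation prints an empty wave column, which corresponds to ArgoCD's default wave of 0.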
Verify Traefik IP:
kubectl get svc traefik -n traefik -w
# Wait for EXTERNAL-IP to appear. Example output:
# traefik LoadBalancer 10.x.x.x <pending> 80:3xxxx/TCP,443:3xxxx/TCP 30s
# traefik LoadBalancer 10.x.x.x 158.160.x.x 80:3xxxx/TCP,443:3xxxx/TCP 60s
Take the EXTERNAL-IP — this is your NLB IP. You'll need it in Phase 5.
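Instead of watching the service by hand, the IP can be polled and captured in a variable for Phase 5. The following is a sketch only: the function names, the retry count, and the delay are hypothetical defaults, and it assumes the NLB publishes an IP rather than a hostname in the service status.

```shell
# Print the Traefik LoadBalancer IP, or nothing while it is still pending.
traefik_lb_ip() {
  kubectl get svc traefik -n traefik \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || true
}

# Poll until an IP appears. Arguments (both hypothetical defaults):
#   $1 = maximum number of attempts (default 30)
#   $2 = seconds to sleep between attempts (default 10)
wait_for_traefik_ip() {
  attempts=${1:-30}
  delay=${2:-10}
  while [ "$attempts" -gt 0 ]; do
    ip=$(traefik_lb_ip)
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "Traefik still has no external IP" >&2
  return 1
}
```

Usage: `NLB_IP=$(wait_for_traefik_ip)` waits up to about five minutes with the defaults and leaves the address in `NLB_IP`.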
Verify state:
kubectl get pods -A
# Expected running pods:
# traefik: traefik-xxxxx (3 pods, DaemonSet)
# cert-manager: cert-manager-xxxxx, cert-manager-cainjector-xxxxx, cert-manager-webhook-xxxxx
# metrics: vm-k8s-stack-* pods (vmsingle, alertmanager, grafana, node-exporter, kube-state-metrics, vmagent)
# metrics: loki-0
# cnpg-system: cnpg-operator-xxxxx
# cnpg: shared-pg-1, shared-pg-2, shared-pg-3 (may take a minute to start)
kubectl get clusterissuer
# Expected: letsencrypt-production (True), letsencrypt-staging (True)
kubectl get cluster -n cnpg
# Expected: shared-pg (3/3 instances ready)
Phase 5: DNS + Expose Gitea & ArgoCD
Now that Traefik has an external IP and cert-manager is running, we can:
- Point DNS at the NLB IP
- Create the Gitea and ArgoCD ingress resources (with TLS)
5.1 Update DNS
Point the following records to the Traefik NLB IP (from Phase 4):
git.prod.t01tt.tech → <NLB-IP>
argocd.prod.t01tt.tech → <NLB-IP>
grafana.prod.t01tt.tech → <NLB-IP>
Also create a wildcard for future hosts:
*.prod.t01tt.tech → <NLB-IP>
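DNS changes can take a while to propagate, so it is worth confirming the records before applying the ingresses. The sketch below checks the three named hosts against the NLB IP with `dig`. The `HOSTS` variable and the `check_dns` function are illustrative names, and the wildcard record is not checked.

```shell
# Hostnames from section 5.1 that should resolve to the NLB IP.
HOSTS="git.prod.t01tt.tech argocd.prod.t01tt.tech grafana.prod.t01tt.tech"

# check_dns <nlb-ip>: print one line per host and return non-zero if any
# host does not resolve to the given IP yet.
check_dns() {
  nlb_ip=$1
  failed=0
  for host in $HOSTS; do
    # `dig +short` prints one answer per line; accept the host if any
    # line equals the NLB IP exactly.
    if dig +short "$host" A | grep -qxF "$nlb_ip"; then
      echo "ok      $host"
    else
      echo "pending $host"
      failed=1
    fi
  done
  return "$failed"
}
```

Running `check_dns "$NLB_IP"` until every line reads `ok` avoids requesting certificates for names that still point elsewhere.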
5.2 Apply Ingresses
kubectl apply -f bootstrap/gitea/ingress.yaml
kubectl apply -f bootstrap/argocd/ingress.yaml
5.3 Wait for TLS Certificates
kubectl get certificate -A -w
# Wait for all to show Ready=True:
# gitea gitea-tls True
# argocd argocd-tls True
# metrics grafana-tls True
Troubleshooting: If certificates are stuck in Pending:
- Check that DNS resolves: dig git.prod.t01tt.tech (must return the NLB IP)
- Check cert-manager logs: kubectl logs -n cert-manager deploy/cert-manager
- Check challenges: kubectl get challenges -A
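To narrow troubleshooting down to the certificates that are actually stuck, the `kubectl get certificate` output can be filtered on its READY column. This is a small sketch with illustrative function names. It assumes the default cert-manager column order (NAMESPACE, NAME, READY, SECRET, AGE).

```shell
# Print "namespace/name" for every Certificate whose READY column is not "True".
# Reads `kubectl get certificate -A --no-headers` output on stdin.
pending_certificates() {
  awk '$3 != "True" { print $1 "/" $2 }'
}

# List the certificates that still need attention across all namespaces.
show_pending_certificates() {
  kubectl get certificate -A --no-headers | pending_certificates
}
```

Each name printed by `show_pending_certificates` can then be inspected with `kubectl describe certificate <name> -n <namespace>`.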
Phase 6: Verify Everything
Gitea
https://git.prod.t01tt.tech
Login with the admin credentials from Phase 1. Verify the yandex-prod repo exists.
ArgoCD
https://argocd.prod.t01tt.tech
Login with admin + password from Phase 3. All apps should show green (Synced + Healthy).
The Ingress health may show Healthy immediately (by design — see values.yaml customization).
Grafana
https://grafana.prod.t01tt.tech
Login with admin / change-me. Check that VM k8s-stack dashboards are available.
PostgreSQL
kubectl get databases -n cnpg
# Expected: 8 Database resources, one per homeserver
kubectl get pods -n cnpg
# Expected: shared-pg-1, shared-pg-2, shared-pg-3 (Running)
ArgoCD Repo Connection
argocd repo list
# Expected: the Gitea repo with status "Successful"
If not connected, re-add via ArgoCD CLI:
argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
--name yandex-prod \
--type git
Or in the ArgoCD UI: Settings → Repositories → Connect repo.
Phase 7: Post-Bootstrap Checklist
- All ArgoCD apps Synced and Healthy
- https://git.prod.t01tt.tech: Gitea accessible, SSL valid
- https://argocd.prod.t01tt.tech: ArgoCD accessible, SSL valid
- https://grafana.prod.t01tt.tech: Grafana accessible, SSL valid, datasources working
- kubectl get pv: PVCs bound for all stateful components
- CNPG shared-pg cluster status: kubectl get cluster -n cnpg shows 3/3 ready
- Certificates all Ready: kubectl get certificate -A | grep False (should return nothing)
Quick Reference: Service URLs
| Service | URL | Auth |
|---|---|---|
| Gitea | https://git.prod.t01tt.tech | Admin user from Phase 1 |
| ArgoCD | https://argocd.prod.t01tt.tech | admin / password from Phase 3 |
| Grafana | https://grafana.prod.t01tt.tech | admin / change-me |
| Traefik dashboard | kubectl port-forward -n traefik daemonset/traefik 9000:9000 | Internal only |
Troubleshooting
Traefik NLB stuck in <pending>
Yandex Cloud NLB provisioning can take a few minutes. Check:
kubectl describe svc traefik -n traefik
If it's stuck for >5 minutes, verify the Yandex annotations are correct.
Certificates stuck in Pending
- Verify DNS: dig git.prod.t01tt.tech → must return the NLB IP
- Check Traefik is listening: curl -k https://<NLB-IP> -H "Host: git.prod.t01tt.tech" → should return 404 (expected, just verifying Traefik responds)
- Check orders: kubectl get orders -A
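The curl check above can be reduced to a status code, which makes it easier to tell "Traefik answered" from "nothing is listening". This is a sketch only: the function names are illustrative and the 10-second timeout is an arbitrary assumption.

```shell
# probe_traefik <nlb-ip> <hostname>: print the HTTP status code Traefik
# returns for the given Host header. curl reports "000" when it could not
# get any HTTP response (connection refused, timeout, TLS failure).
probe_traefik() {
  curl -k -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Host: $2" "https://$1" || true
}

# Succeed if Traefik returned any HTTP response at all, including 404.
traefik_responds() {
  code=$(probe_traefik "$1" "$2")
  [ -n "$code" ] && [ "$code" != "000" ]
}
```

A 404 from `probe_traefik` before the ingress exists is the expected result. A `000` points at the NLB or the Traefik pods rather than at cert-manager.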
CNPG cluster not becoming ready
kubectl describe cluster shared-pg -n cnpg
kubectl logs -n cnpg-system deploy/cnpg-controller-manager
Common issue: pods can't schedule due to podAntiAffinityType: required. Ensure all 3 nodes exist and PVCs can bind.
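Both causes named above (unschedulable pods and unbound PVCs) can be checked in one pass. This is a minimal sketch with illustrative function names. It assumes the default `kubectl get pvc -A` column order (NAMESPACE, NAME, STATUS, ...).

```shell
# Print "namespace/name (STATUS)" for every PVC that is not Bound.
# Reads `kubectl get pvc -A --no-headers` output on stdin.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2 " (" $3 ")" }'
}

# Show unbound PVCs cluster-wide, then any pods in the cnpg namespace
# that are still waiting to be scheduled or started.
cnpg_scheduling_report() {
  kubectl get pvc -A --no-headers | unbound_pvcs
  kubectl get pods -n cnpg --field-selector=status.phase=Pending
}
```

If `cnpg_scheduling_report` lists a Pending pod but no unbound PVC, the anti-affinity rule and the node count are the more likely cause.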
Gitea UI shows wrong URL after first login
Gitea caches the ROOT_URL from the deployment.yaml env vars. If you change the domain, update:
kubectl set env deploy/gitea -n gitea \
GITEA__server__DOMAIN=git.prod.t01tt.tech \
GITEA__server__ROOT_URL=https://git.prod.t01tt.tech
kubectl rollout restart deploy/gitea -n gitea
ArgoCD apps showing "Unknown" health
This is normal for Ingress resources — the custom health check in bootstrap/argocd/values.yaml marks all Ingresses as Healthy once synced. For other resources, check the app details in ArgoCD UI for the specific error.