Yandex Cloud Production Cluster — Bootstrap Guide

Prerequisites

  • kubeconfig file placed at ~/infra/yandex-prod/kubeconfig
  • kubectl context pointing to the new yc-prod cluster
  • DNS for the domain prod.t01tt.tech under your control (records can be updated later in Phase 5)
  • git and helm installed locally

Phase 0: Verify Cluster Access

export KUBECONFIG=~/infra/yandex-prod/kubeconfig

kubectl get nodes
# Expected: 3 nodes Ready, 2CPU/8GB each

kubectl get sc
# Expected: yc-network-hdd (default), yc-network-ssd, yc-network-nvme, ...
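
To turn the first expectation into a pass/fail check, one option is to count the nodes that report Ready. The helper below is a sketch: the name count_ready_nodes is hypothetical (it is not part of this repo), and it assumes the default column layout of kubectl get nodes, where STATUS is the second column.

```shell
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical usage against the live cluster:
#   kubectl get nodes --no-headers | count_ready_nodes
```

Cordoned nodes show up as Ready,SchedulingDisabled and are deliberately not counted, so a result below 3 is worth investigating before continuing.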

Phase 1: Bootstrap Gitea (internal access only)

Gitea hosts the Git repo that ArgoCD reads. Deploy it first, but without ingress — we access it via port-forward.

kubectl apply -f bootstrap/gitea/namespace.yaml
kubectl apply -f bootstrap/gitea/pvc.yaml
kubectl apply -f bootstrap/gitea/deployment.yaml
kubectl apply -f bootstrap/gitea/service.yaml
# NOTE: Do NOT apply ingress.yaml yet — no Traefik or cert-manager exists

Wait for Gitea to be ready, then port-forward and configure:

kubectl wait deploy/gitea -n gitea --for=condition=available --timeout=120s

# Port-forward in a separate terminal:
kubectl port-forward svc/gitea 3000:3000 -n gitea

  1. Open http://localhost:3000 in a browser
  2. Fill out the install form:
    • Database: SQLite3 (default)
    • Site Title: Gitea
    • Domain: git.prod.t01tt.tech
    • Application URL: https://git.prod.t01tt.tech
    • Create admin account (username/password/email — save these)
  3. Click "Install Gitea"
  4. Create a new repository: master (must be public, owned by admin)
  5. Close the port-forward (Ctrl+C)

Phase 2: Push Repository to Gitea

cd ~/infra/yandex-prod

git init
git remote add origin http://localhost:3000/admin/master.git
# Or, once Gitea ingress works later, use:
#   git remote add origin https://git.prod.t01tt.tech/admin/master.git

git add -A
git commit -m "initial bootstrap: infrastructure manifests"
git push -u origin master
# Enter Gitea admin credentials when prompted

Phase 3: Install ArgoCD (internal access only)

bash bootstrap/argocd/install.sh
# Saves the admin password — copy it

Add the Gitea repository to ArgoCD:

# Via port-forward:
kubectl port-forward svc/argocd-server 8080:80 -n argocd &
sleep 2

# Login and add repo:
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
argocd login localhost:8080 --username admin --password "${ARGOCD_PASS}" --insecure

argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
  --name yandex-prod \
  --type git

Deploy the root app:

kubectl apply -f argocd/app-of-apps.yaml

ArgoCD will now sync child apps according to their sync waves. You can watch progress:

argocd app list
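
If you prefer a scriptable view of what is still syncing, the sketch below filters a "NAME SYNC HEALTH" listing down to the apps that are not yet Synced and Healthy. The function name pending_apps is hypothetical. The usage comment assumes that ArgoCD runs in the argocd namespace (as in the commands above) and reads the .status.sync.status and .status.health.status fields of the Application resources.

```shell
# Print the names of Applications that are not both Synced and Healthy.
# Expects one "NAME SYNC HEALTH" triple per line on stdin.
pending_apps() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1 }'
}

# Hypothetical usage:
#   kubectl get applications.argoproj.io -n argocd --no-headers \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#     | pending_apps
```

Empty output means every app has converged.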

Phase 4: Let the Sync Waves Run

Sync order (automated by ArgoCD via argocd.argoproj.io/sync-wave annotations):

| Wave | App | What happens |
|------|-----|--------------|
| -2 | traefik | DaemonSet deployed on all 3 nodes. NLB created → external IP provisioned |
| -1 | cert-manager | cert-manager operator + CRDs installed |
| 0 | cert-manager-issuers | letsencrypt-production + letsencrypt-staging ClusterIssuers created |
| 0 | monitoring | VM k8s-stack (metrics) + Grafana ingress deployed |
| 0 | loki | Loki single-binary deployed |
| 0 | cnpg-operator | CloudNativePG operator installed |
| 1 | cnpg-cluster | shared-pg 3-node PostgreSQL cluster + 8 databases created |
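
To compare the live cluster against this order, you can list the apps sorted by wave. This is a sketch under two assumptions: the sync-wave annotation is set on the child Application objects themselves, and ArgoCD lives in the argocd namespace. The helper name order_by_wave is hypothetical.

```shell
# Sort "WAVE NAME" lines numerically by wave; negative waves come first.
order_by_wave() {
  sort -n -k1,1
}

# Hypothetical usage (dots inside the annotation key are escaped for kubectl's JSONPath):
#   kubectl get applications.argoproj.io -n argocd --no-headers \
#     -o 'custom-columns=WAVE:.metadata.annotations.argocd\.argoproj\.io/sync-wave,NAME:.metadata.name' \
#     | order_by_wave
```

Apps without the annotation print <none> in the WAVE column, which sort -n treats as 0. That matches ArgoCD's default wave.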

Verify Traefik IP:

kubectl get svc traefik -n traefik -w
# Wait for EXTERNAL-IP to appear. Example output:
#   traefik   LoadBalancer   10.x.x.x   <pending>      80:3xxxx/TCP,443:3xxxx/TCP   30s
#   traefik   LoadBalancer   10.x.x.x   158.160.x.x    80:3xxxx/TCP,443:3xxxx/TCP   60s

Note the EXTERNAL-IP value: this is your NLB IP. You'll need it in Phase 5.
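
Instead of watching the output by hand, the address can be polled and captured in a variable. The function below is a minimal sketch; the name wait_for_lb_ip and its defaults (60 attempts, 5 seconds apart) are assumptions chosen for illustration. It reads the standard .status.loadBalancer.ingress[0].ip field of the Service.

```shell
# Poll a Service until its load balancer reports an IP, then print that IP.
# Args: namespace, service name, max attempts (default 60), delay in seconds (default 5).
wait_for_lb_ip() {
  ns=$1; svc=$2; attempts=${3:-60}; delay=${4:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # An empty result means the address has not been assigned yet.
    ip=$(kubectl get svc "$svc" -n "$ns" \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null) || ip=""
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no external IP for $ns/$svc after $attempts attempts" >&2
  return 1
}

# Hypothetical usage:
#   NLB_IP=$(wait_for_lb_ip traefik traefik) && echo "NLB IP: $NLB_IP"
```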

Verify state:

kubectl get pods -A
# Expected running pods:
#   traefik:          traefik-xxxxx (3 pods, DaemonSet)
#   cert-manager:     cert-manager-xxxxx, cert-manager-cainjector-xxxxx, cert-manager-webhook-xxxxx
#   metrics:          vm-k8s-stack-* pods (vmsingle, alertmanager, grafana, node-exporter, kube-state-metrics, vmagent)
#   metrics:          loki-0
#   cnpg-system:      cnpg-operator-xxxxx
#   cnpg:             shared-pg-1, shared-pg-2, shared-pg-3 (may take a minute to start)

kubectl get clusterissuer
# Expected: letsencrypt-production (True), letsencrypt-staging (True)

kubectl get cluster -n cnpg
# Expected: shared-pg (3/3 instances ready)
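
The 3/3 check can also be scripted. The sketch below assumes that the CloudNativePG Cluster resource exposes status.readyInstances and status.instances, the fields behind the ready count shown above. The function name cnpg_all_ready is hypothetical.

```shell
# Succeed only when every instance of a CNPG cluster is ready.
# Args: cluster name, namespace.
cnpg_all_ready() {
  st=$(kubectl get cluster "$1" -n "$2" \
    -o jsonpath='{.status.readyInstances}/{.status.instances}') || return 1
  ready=${st%/*}    # text before the slash
  total=${st#*/}    # text after the slash
  # Missing fields yield empty strings, which count as "not ready".
  [ -n "$ready" ] && [ -n "$total" ] && [ "$ready" = "$total" ]
}

# Hypothetical usage:
#   cnpg_all_ready shared-pg cnpg && echo "shared-pg is fully ready"
```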

Phase 5: DNS + Expose Gitea & ArgoCD

Now that Traefik has an external IP and cert-manager is running, we can:

  1. Point DNS at the NLB IP
  2. Create the Gitea and ArgoCD ingress resources (with TLS)

5.1 Update DNS

Point the following records to the Traefik NLB IP (from Phase 4):

git.prod.t01tt.tech       → <NLB-IP>
argocd.prod.t01tt.tech    → <NLB-IP>
grafana.prod.t01tt.tech   → <NLB-IP>

Also create a wildcard for future hosts:

*.prod.t01tt.tech         → <NLB-IP>
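
Before applying the ingresses, it helps to confirm that each record already resolves to the NLB IP, because the ACME challenges fail otherwise. The helper below is a sketch that wraps dig +short; the name dns_points_to is hypothetical, and the IP in the usage comment stands in for the real NLB address.

```shell
# Succeed if one of HOST's A records equals EXPECTED_IP.
# -F treats the IP as a literal string, -x requires a whole-line match.
dns_points_to() {
  host=$1; expected=$2
  dig +short A "$host" | grep -Fxq "$expected"
}

# Hypothetical usage (replace the IP with the value from Phase 4):
#   for h in git argocd grafana; do
#     dns_points_to "$h.prod.t01tt.tech" 158.160.0.10 || echo "$h not ready"
#   done
```

Recently changed records may take a while to propagate, so a failure right after the update is not necessarily an error.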

5.2 Apply Ingresses

kubectl apply -f bootstrap/gitea/ingress.yaml
kubectl apply -f bootstrap/argocd/ingress.yaml

5.3 Wait for TLS Certificates

kubectl get certificate -A -w
# Wait for all to show Ready=True:
#   gitea        gitea-tls                    True
#   argocd       argocd-tls                   True
#   metrics      grafana-tls                  True
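
For a non-interactive check, the listing can be filtered down to certificates that are not ready yet. This sketch assumes cert-manager's default columns for kubectl get certificate -A (NAMESPACE, NAME, READY, SECRET, AGE); the function name unready_certs is hypothetical.

```shell
# Print "namespace/name" for every Certificate whose READY column is not "True".
# Reads `kubectl get certificate -A --no-headers` output on stdin.
unready_certs() {
  awk '$3 != "True" { print $1 "/" $2 }'
}

# Hypothetical usage:
#   kubectl get certificate -A --no-headers | unready_certs
```

Unlike grepping for False, this also reports certificates whose READY column is still empty or Unknown.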

Troubleshooting: If certificates are stuck in Pending:

  • Check DNS resolves: dig git.prod.t01tt.tech — must return the NLB IP
  • Check cert-manager logs: kubectl logs -n cert-manager deploy/cert-manager
  • Check challenge: kubectl get challenges -A

Phase 6: Verify Everything

Gitea

https://git.prod.t01tt.tech

Log in with the admin credentials from Phase 1. Verify that the repository pushed in Phase 2 exists.

ArgoCD

https://argocd.prod.t01tt.tech

Log in as admin with the password from Phase 3. All apps should show green (Synced + Healthy).

Ingress resources may show Healthy immediately. This is by design: see the health customization in bootstrap/argocd/values.yaml.

Grafana

https://grafana.prod.t01tt.tech

Log in with admin / change-me. Check that VM k8s-stack dashboards are available.

PostgreSQL

kubectl get databases -n cnpg
# Expected: 8 Database resources, one per homeserver

kubectl get pods -n cnpg
# Expected: shared-pg-1, shared-pg-2, shared-pg-3 (Running)

ArgoCD Repo Connection

argocd repo list
# Expected: the Gitea repo with status "Successful"

If not connected, re-add via ArgoCD CLI:

argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
  --name yandex-prod \
  --type git

Or in the ArgoCD UI: Settings → Repositories → Connect repo.


Phase 7: Post-Bootstrap Checklist

  • All ArgoCD apps Synced and Healthy
  • https://git.prod.t01tt.tech — Gitea accessible, SSL valid
  • https://argocd.prod.t01tt.tech — ArgoCD accessible, SSL valid
  • https://grafana.prod.t01tt.tech — Grafana accessible, SSL valid, datasources working
  • PVCs bound for all stateful components: kubectl get pvc -A shows every claim as Bound
  • CNPG shared-pg cluster status: kubectl get cluster -n cnpg shows 3/3 ready
  • Certificates all Ready: kubectl get certificate -A | grep False (should return nothing)

Quick Reference: Service URLs

| Service | URL | Auth |
|---------|-----|------|
| Gitea | https://git.prod.t01tt.tech | Admin user from Phase 1 |
| ArgoCD | https://argocd.prod.t01tt.tech | admin / password from Phase 3 |
| Grafana | https://grafana.prod.t01tt.tech | admin / change-me |
| Traefik dashboard | kubectl port-forward -n traefik daemonset/traefik 9000:9000 | Internal only |

Troubleshooting

Traefik NLB stuck in <pending>

Yandex Cloud NLB provisioning can take a few minutes. Check:

kubectl describe svc traefik -n traefik

If it's stuck for >5 minutes, verify the Yandex annotations are correct.

Certificates stuck in Pending

  1. Verify DNS: dig git.prod.t01tt.tech → must return the NLB IP
  2. Check Traefik is listening: curl -k https://<NLB-IP> -H "Host: git.prod.t01tt.tech" → should return 404 (expected, just verifying Traefik responds)
  3. Check orders: kubectl get orders -A
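
Step 2 can be wrapped in a small helper that prints only the status code. This is a sketch: the function name traefik_http_code is hypothetical. It swaps the Host header for curl's --resolve option, which pins the hostname to the NLB IP so that both the TLS SNI and the Host header carry the real name.

```shell
# Print the HTTP status code Traefik returns for HOST when contacted at IP.
# -k skips certificate verification because the certificate may not exist yet.
traefik_http_code() {
  host=$1; ip=$2
  curl -sk -o /dev/null -w '%{http_code}' --max-time 10 \
    --resolve "$host:443:$ip" "https://$host/"
}

# Hypothetical usage (replace the IP with the NLB address):
#   traefik_http_code git.prod.t01tt.tech 158.160.0.10
```

A 404 means Traefik answered but has no route for that host yet. A code of 000 means the connection itself failed.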

CNPG cluster not becoming ready

kubectl describe cluster shared-pg -n cnpg
kubectl logs -n cnpg-system deploy/cnpg-controller-manager

Common issue: pods can't schedule due to podAntiAffinityType: required. Ensure all 3 nodes exist and PVCs can bind.

Gitea UI shows wrong URL after first login

Gitea caches the ROOT_URL from the deployment.yaml env vars. If you change the domain, update:

kubectl set env deploy/gitea -n gitea \
  GITEA__server__DOMAIN=git.prod.t01tt.tech \
  GITEA__server__ROOT_URL=https://git.prod.t01tt.tech
kubectl rollout restart deploy/gitea -n gitea
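
The variable names above follow the GITEA__section__KEY pattern. Assuming the deployment runs the official Gitea container image, its entrypoint writes each such variable into the matching section of app.ini. The sketch below only illustrates that naming rule; the function gitea_env_to_ini is hypothetical and does not touch any configuration.

```shell
# Translate a GITEA__<section>__<KEY> variable name into "[section] KEY".
# Returns non-zero for names that do not follow the pattern.
gitea_env_to_ini() {
  name=${1#GITEA__}
  [ "$name" != "$1" ] || return 1         # missing GITEA__ prefix
  section=${name%%__*}                    # text before the first "__"
  key=${name#*__}                         # text after the first "__"
  [ "$section" != "$name" ] || return 1   # no section/key separator
  printf '[%s] %s\n' "$section" "$key"
}

# Example:
#   gitea_env_to_ini GITEA__server__ROOT_URL
```

So GITEA__server__ROOT_URL maps to the ROOT_URL key in the [server] section of app.ini.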

ArgoCD apps showing "Unknown" health

This is normal for Ingress resources — the custom health check in bootstrap/argocd/values.yaml marks all Ingresses as Healthy once synced. For other resources, check the app details in ArgoCD UI for the specific error.