main/BOOTSTRAP.md
2026-06-12 18:21:11 +03:00

# Yandex Cloud Production Cluster — Bootstrap Guide
## Prerequisites
- [ ] `kubeconfig` file placed at `~/infra/yandex-prod/kubeconfig`
- [ ] `kubectl` context pointing to the new `yc-prod` cluster
- [ ] Domain `prod.t01tt.tech` with DNS you can manage (records can be updated later, in Phase 5)
- [ ] `git` and `helm` installed locally
---
## Phase 0: Verify Cluster Access
```bash
export KUBECONFIG=~/infra/yandex-prod/kubeconfig
kubectl get nodes
# Expected: 3 nodes Ready, 2CPU/8GB each
kubectl get sc
# Expected: yc-network-hdd (default), yc-network-ssd, yc-network-nvme, ...
```
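The expectation in the comment above can be turned into a pass/fail check. This is a minimal sketch, assuming the default `kubectl get nodes` column layout (STATUS in the second column). The function names `count_ready_nodes` and `check_nodes` are illustrative and not part of this repository. Call it as `check_nodes 3`.

```bash
# Count nodes whose STATUS column is exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are deliberately not counted.
count_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Succeed only if at least the expected number of nodes (default 3) is Ready.
check_nodes() {
  expected="${1:-3}"
  ready="$(count_ready_nodes)"
  if [ "$ready" -ge "$expected" ]; then
    echo "OK: $ready/$expected nodes Ready"
  else
    echo "FAIL: only $ready/$expected nodes Ready" >&2
    return 1
  fi
}
```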
---
## Phase 1: Bootstrap Gitea (internal access only)
Gitea hosts the Git repo that ArgoCD reads. Deploy it first, but without ingress — we access it via port-forward.
```bash
kubectl apply -f bootstrap/gitea/namespace.yaml
kubectl apply -f bootstrap/gitea/pvc.yaml
kubectl apply -f bootstrap/gitea/deployment.yaml
kubectl apply -f bootstrap/gitea/service.yaml
# NOTE: Do NOT apply ingress.yaml yet — no Traefik or cert-manager exists
```
Wait for Gitea to be ready, then port-forward and configure:
```bash
kubectl wait deploy/gitea -n gitea --for=condition=available --timeout=120s
# Port-forward in a separate terminal:
kubectl port-forward svc/gitea 3000:3000 -n gitea
```
1. Open **http://localhost:3000** in a browser
2. Fill out the install form:
   - Database: **SQLite3** (default)
   - Site Title: **Gitea**
   - Domain: **git.prod.t01tt.tech**
   - Application URL: **https://git.prod.t01tt.tech**
   - Create admin account (username/password/email; save these)
3. Click "Install Gitea"
4. Create a new repository: **`master`** (must be **public**, owned by admin)
5. Close the port-forward (Ctrl+C)
---
## Phase 2: Push Repository to Gitea
```bash
cd ~/infra/yandex-prod
git init
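# This remote goes through localhost, so the Gitea port-forward from Phase 1 must be
# running. If you closed it, restart it: kubectl port-forward svc/gitea 3000:3000 -n gitea
git remote add origin http://localhost:3000/admin/master.git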
# Or, once Gitea ingress works later, use:
# git remote add origin https://git.prod.t01tt.tech/admin/master.git
git add -A
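git commit -m "initial bootstrap: infrastructure manifests"
# Make sure the local branch is named master, whatever init.defaultBranch is set to
git branch -M master
git push -u origin master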
# Enter Gitea admin credentials when prompted
```
---
## Phase 3: Install ArgoCD (internal access only)
```bash
bash bootstrap/argocd/install.sh
# Saves the admin password — copy it
```
Add the Gitea repository to ArgoCD:
```bash
# Via port-forward:
kubectl port-forward svc/argocd-server 8080:80 -n argocd &
sleep 2
# Login and add repo:
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
argocd login localhost:8080 --username admin --password "${ARGOCD_PASS}" --insecure
argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
--name yandex-prod \
--type git
```
Deploy the root app:
```bash
kubectl apply -f argocd/app-of-apps.yaml
```
ArgoCD will now sync child apps according to their sync waves. You can watch progress:
```bash
argocd app list
```
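For an unattended bootstrap, polling can replace watching the list by hand. The sketch below assumes the Application CRD's default printer columns (NAME, SYNC STATUS, HEALTH STATUS) as shown by `kubectl get applications.argoproj.io`. The names `unsettled_apps` and `wait_for_apps` are hypothetical. Call it as `wait_for_apps 60 10` for up to 60 attempts, 10 seconds apart. An empty application list also counts as settled, so run it only after the root app has been applied.

```bash
# List ArgoCD Applications that are not yet both Synced and Healthy.
unsettled_apps() {
  kubectl get applications.argoproj.io -n argocd --no-headers |
    awk '$2 != "Synced" || $3 != "Healthy" { print $1 " (" $2 "/" $3 ")" }'
}

# Poll until every app has settled or the attempts run out.
wait_for_apps() {
  attempts="${1:-60}"
  delay="${2:-10}"
  pending=""
  while [ "$attempts" -gt 0 ]; do
    pending="$(unsettled_apps)"
    if [ -z "$pending" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "Still not Synced/Healthy:" >&2
  echo "$pending" >&2
  return 1
}
```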
---
## Phase 4: Let the Sync Waves Run
Sync order (automated by ArgoCD via `argocd.argoproj.io/sync-wave` annotations):
| Wave | App | What happens |
|------|-----|-------------|
| **-2** | `traefik` | DaemonSet deployed on all 3 nodes. NLB created → external IP provisioned |
| **-1** | `cert-manager` | cert-manager operator + CRDs installed |
| **0** | `cert-manager-issuers` | `letsencrypt-production` + `letsencrypt-staging` ClusterIssuers created |
| **0** | `monitoring` | VM k8s-stack (metrics) + Grafana ingress deployed |
| **0** | `loki` | Loki single-binary deployed |
| **0** | `cnpg-operator` | CloudNativePG operator installed |
| **1** | `cnpg-cluster` | `shared-pg` 3-node PostgreSQL cluster + 8 databases created |
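
To confirm the order on a live cluster, the wave annotation can be read back from the Application objects. This sketch assumes the `argocd.argoproj.io/sync-wave` annotation is set on each child Application in the `argocd` namespace. The function name `list_sync_waves` is illustrative. Apps without the annotation print `<none>` and sort as wave 0.

```bash
# Print "<wave> <app>" for every ArgoCD Application, lowest wave first.
# The dots in the annotation key are escaped for kubectl's custom-columns syntax.
list_sync_waves() {
  kubectl get applications.argoproj.io -n argocd --no-headers \
    -o custom-columns='WAVE:.metadata.annotations.argocd\.argoproj\.io/sync-wave,NAME:.metadata.name' |
    sort -n
}
```
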
**Verify Traefik IP:**
```bash
kubectl get svc traefik -n traefik -w
# Wait for EXTERNAL-IP to appear. Example output:
# traefik LoadBalancer 10.x.x.x <pending> 80:3xxxx/TCP,443:3xxxx/TCP 30s
# traefik LoadBalancer 10.x.x.x 158.160.x.x 80:3xxxx/TCP,443:3xxxx/TCP 60s
```
Note the EXTERNAL-IP value: this is the NLB IP, and you will need it in Phase 5.
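
If you would rather capture the address in a variable than read it off the watch output, a polling loop along these lines may help. It is a sketch that assumes the load balancer reports an IP (not a hostname) in `.status.loadBalancer.ingress[0].ip`. The function names are hypothetical. Call it as `NLB_IP="$(wait_for_traefik_ip 30 10)"`.

```bash
# Print the external IP of the Traefik Service, or nothing while it is pending.
get_traefik_ip() {
  kubectl get svc traefik -n traefik \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Poll until an IP appears; arguments are attempts (default 30) and delay in seconds (default 10).
wait_for_traefik_ip() {
  attempts="${1:-30}"
  delay="${2:-10}"
  while [ "$attempts" -gt 0 ]; do
    ip="$(get_traefik_ip 2>/dev/null)"
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "timed out waiting for EXTERNAL-IP" >&2
  return 1
}
```
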
**Verify state:**
```bash
kubectl get pods -A
# Expected running pods:
# traefik: traefik-xxxxx (3 pods, DaemonSet)
# cert-manager: cert-manager-xxxxx, cert-manager-cainjector-xxxxx, cert-manager-webhook-xxxxx
# metrics: vm-k8s-stack-* pods (vmsingle, alertmanager, grafana, node-exporter, kube-state-metrics, vmagent)
# metrics: loki-0
# cnpg-system: cnpg-operator-xxxxx
# cnpg: shared-pg-1, shared-pg-2, shared-pg-3 (may take a minute to start)
kubectl get clusterissuer
# Expected: letsencrypt-production (True), letsencrypt-staging (True)
kubectl get cluster -n cnpg
# Expected: shared-pg (3/3 instances ready)
```
---
## Phase 5: DNS + Expose Gitea & ArgoCD
Now that Traefik has an external IP and cert-manager is running, we can:
1. Point DNS at the NLB IP
2. Create the Gitea and ArgoCD ingress resources (with TLS)
### 5.1 Update DNS
Point the following records to the Traefik NLB IP (from Phase 4):
```
git.prod.t01tt.tech → <NLB-IP>
argocd.prod.t01tt.tech → <NLB-IP>
grafana.prod.t01tt.tech → <NLB-IP>
```
Also create a wildcard for future hosts:
```
*.prod.t01tt.tech → <NLB-IP>
```
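
Before applying the ingresses, it can be worth confirming that every record already resolves to the NLB IP, since the certificate troubleshooting steps depend on it. A minimal sketch using `dig`. The function names `dns_matches` and `check_dns` are illustrative, and the IP is passed in by the caller. Example call: `check_dns "$NLB_IP" git.prod.t01tt.tech argocd.prod.t01tt.tech grafana.prod.t01tt.tech`. Adding an arbitrary name under `prod.t01tt.tech` to the list also exercises the wildcard record.

```bash
# Succeed only if one of the host's A records equals the expected IP.
dns_matches() {
  dig +short "$1" A | grep -qxF "$2"
}

# Check several hostnames against one IP; returns non-zero if any of them fails.
check_dns() {
  nlb_ip="$1"
  shift
  failed=0
  for name in "$@"; do
    if dns_matches "$name" "$nlb_ip"; then
      echo "OK    $name"
    else
      echo "FAIL  $name"
      failed=1
    fi
  done
  return "$failed"
}
```
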
### 5.2 Apply Ingresses
```bash
kubectl apply -f bootstrap/gitea/ingress.yaml
kubectl apply -f bootstrap/argocd/ingress.yaml
```
### 5.3 Wait for TLS Certificates
```bash
kubectl get certificate -A -w
# Wait for all to show Ready=True:
# gitea gitea-tls True
# argocd argocd-tls True
# metrics grafana-tls True
```
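
As a scriptable alternative to the watch, `kubectl wait` can block until every Certificate reports the `Ready` condition. This is a sketch, not part of the repository. The function name `wait_for_certs` and the 600s default are assumptions. `kubectl wait` fails immediately if no Certificate exists yet, so run it only after the ingresses have been applied.

```bash
# Block until all Certificates in all namespaces are Ready, or the timeout expires.
wait_for_certs() {
  timeout="${1:-600s}"
  if kubectl wait certificate --all --all-namespaces \
       --for=condition=Ready --timeout="$timeout"; then
    echo "all certificates Ready"
  else
    echo "some certificates are not Ready:" >&2
    kubectl get certificate --all-namespaces >&2
    return 1
  fi
}
```
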
**Troubleshooting:** If certificates are stuck in `Pending`:
- Check DNS resolves: `dig git.prod.t01tt.tech` — must return the NLB IP
- Check cert-manager logs: `kubectl logs -n cert-manager deploy/cert-manager`
- Check challenge: `kubectl get challenges -A`
---
## Phase 6: Verify Everything
### Gitea
```
https://git.prod.t01tt.tech
```
Log in with the admin credentials from Phase 1. Verify that the `master` repository created in Phase 1 exists (ArgoCD knows it under the name `yandex-prod`).
### ArgoCD
```
https://argocd.prod.t01tt.tech
```
Login with `admin` + password from Phase 3. All apps should show green (`Synced` + `Healthy`).
Ingress resources may show `Healthy` immediately. This is by design: see the custom health check in `bootstrap/argocd/values.yaml`.
### Grafana
```
https://grafana.prod.t01tt.tech
```
Login with `admin` / `change-me`. Check that VM k8s-stack dashboards are available.
### PostgreSQL
```bash
kubectl get databases -n cnpg
# Expected: 8 Database resources, one per homeserver
kubectl get pods -n cnpg
# Expected: shared-pg-1, shared-pg-2, shared-pg-3 (Running)
```
### ArgoCD Repo Connection
```bash
argocd repo list
# Expected: the Gitea repo with status "Successful"
```
If not connected, re-add via ArgoCD CLI:
```bash
argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
--name yandex-prod \
--type git
```
Or in the ArgoCD UI: Settings → Repositories → Connect repo.
---
## Phase 7: Post-Bootstrap Checklist
- [ ] All ArgoCD apps `Synced` and `Healthy`
- [ ] `https://git.prod.t01tt.tech` — Gitea accessible, SSL valid
- [ ] `https://argocd.prod.t01tt.tech` — ArgoCD accessible, SSL valid
- [ ] `https://grafana.prod.t01tt.tech` — Grafana accessible, SSL valid, datasources working
- [ ] `kubectl get pvc -A` shows every PVC as `Bound` for all stateful components
- [ ] CNPG `shared-pg` cluster status: `kubectl get cluster -n cnpg` shows 3/3 ready
- [ ] Certificates all `Ready`: `kubectl get certificate -A | grep False` (should return nothing)
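
The two storage and certificate items can be checked mechanically. The sketch below assumes the default `kubectl get ... -A --no-headers` column layouts (STATUS is the third column for PVCs, READY is the third column for Certificates). All function names are hypothetical. Run `run_checklist` once cluster access is confirmed, because a failing `kubectl` prints nothing and would read as a pass.

```bash
# Print "namespace/name" for every PVC that is not Bound.
unbound_pvcs() {
  kubectl get pvc -A --no-headers | awk '$3 != "Bound" { print $1 "/" $2 }'
}

# Print "namespace/name" for every Certificate whose READY column is not True.
unready_certs() {
  kubectl get certificate -A --no-headers | awk '$3 != "True" { print $1 "/" $2 }'
}

# Run a check function; it passes only when the function prints nothing.
expect_none() {
  label="$1"
  shift
  offenders="$("$@")"
  if [ -z "$offenders" ]; then
    echo "PASS  $label"
  else
    # $offenders is left unquoted on purpose so the lines join with spaces.
    echo "FAIL  $label:" $offenders
    return 1
  fi
}

run_checklist() {
  rc=0
  expect_none "PVCs bound" unbound_pvcs || rc=1
  expect_none "certificates ready" unready_certs || rc=1
  return "$rc"
}
```
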
---
## Quick Reference: Service URLs
| Service | URL | Auth |
|---------|-----|------|
| Gitea | `https://git.prod.t01tt.tech` | Admin user from Phase 1 |
| ArgoCD | `https://argocd.prod.t01tt.tech` | `admin` / password from Phase 3 |
| Grafana | `https://grafana.prod.t01tt.tech` | `admin` / `change-me` |
| Traefik dashboard | `kubectl port-forward -n traefik daemonset/traefik 9000:9000` | Internal only |
---
## Troubleshooting
### Traefik NLB stuck in `<pending>`
Yandex Cloud NLB provisioning can take a few minutes. Check:
```bash
kubectl describe svc traefik -n traefik
```
If it stays pending for more than 5 minutes, read the Events section of that output and verify that the Yandex Cloud annotations on the Service are correct.
### Certificates stuck in `Pending`
1. Verify DNS: `dig git.prod.t01tt.tech` → must return the NLB IP
2. Check Traefik is listening: `curl -k https://<NLB-IP> -H "Host: git.prod.t01tt.tech"` → should return 404 (expected, just verifying Traefik responds)
3. Check orders: `kubectl get orders -A`
### CNPG cluster not becoming ready
```bash
kubectl describe cluster shared-pg -n cnpg
kubectl logs -n cnpg-system deploy/cnpg-controller-manager
```
Common issue: pods can't schedule due to `podAntiAffinityType: required`. Ensure all 3 nodes exist and PVCs can bind.
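
To tell a scheduling problem from a slow start, compare desired and ready instance counts and list any Pending pods. This is a sketch, assuming the CloudNativePG `Cluster` resource exposes `.spec.instances` and `.status.readyInstances`. The function names are illustrative. `kubectl describe pod` on anything that `pending_pods` prints shows the scheduling events.

```bash
# Print "<ready>/<desired> instances ready"; succeed only when the two match.
cnpg_ready() {
  name="${1:-shared-pg}"
  ns="${2:-cnpg}"
  want="$(kubectl get cluster "$name" -n "$ns" -o jsonpath='{.spec.instances}')"
  have="$(kubectl get cluster "$name" -n "$ns" -o jsonpath='{.status.readyInstances}')"
  echo "${have:-0}/${want:-?} instances ready"
  [ -n "$want" ] && [ "${have:-0}" -eq "$want" ]
}

# List pods stuck in Pending, the usual symptom of anti-affinity or PVC binding problems.
pending_pods() {
  kubectl get pods -n "${1:-cnpg}" --field-selector=status.phase=Pending --no-headers
}
```
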
### Gitea UI shows wrong URL after first login
Gitea caches the ROOT_URL from the `deployment.yaml` env vars. If you change the domain, update both values:
```bash
kubectl set env deploy/gitea -n gitea \
GITEA__server__DOMAIN=git.prod.t01tt.tech \
GITEA__server__ROOT_URL=https://git.prod.t01tt.tech
kubectl rollout restart deploy/gitea -n gitea
```
### ArgoCD apps showing "Unknown" health
Ingress resources are a known special case: the custom health check in `bootstrap/argocd/values.yaml` marks every Ingress as `Healthy` once it is synced, so an Ingress should not stay in `Unknown`. For any other resource, open the app details in the ArgoCD UI to find the specific error.