# Yandex Cloud Production Cluster: Bootstrap Guide

## Prerequisites

- [ ] `kubeconfig` file placed at `~/infra/yandex-prod/kubeconfig`
- [ ] `kubectl` context pointing to the new `yc-prod` cluster
- [ ] Domain `prod.t01tt.tech` with DNS you manage (records can be updated later in Phase 5)
- [ ] `git` and `helm` installed locally

---

## Phase 0: Verify Cluster Access

```bash
export KUBECONFIG=~/infra/yandex-prod/kubeconfig

kubectl get nodes
# Expected: 3 nodes Ready, 2 CPU / 8 GB each

kubectl get sc
# Expected: yc-network-hdd (default), yc-network-ssd, yc-network-nvme, ...
```
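
The two expectations above can also be checked mechanically instead of by eye. The sketch below is only an illustration: the helper names `count_ready_nodes` and `default_storage_class` are hypothetical, and both simply parse the plain-text tables that `kubectl get nodes --no-headers` and `kubectl get sc --no-headers` print, assuming the usual column order (name first, status or the `(default)` marker second).

```bash
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Print the name of the StorageClass that carries the "(default)" marker.
# Reads `kubectl get sc --no-headers` output on stdin.
default_storage_class() {
  awk '$2 == "(default)" { print $1 }'
}

# Usage against the live cluster (not executed by this snippet):
#   [ "$(kubectl get nodes --no-headers | count_ready_nodes)" -eq 3 ] \
#     || echo "expected 3 Ready nodes" >&2
#   [ "$(kubectl get sc --no-headers | default_storage_class)" = "yc-network-hdd" ] \
#     || echo "unexpected default StorageClass" >&2
```
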

---

## Phase 1: Bootstrap Gitea (internal access only)

Gitea hosts the Git repo that ArgoCD reads. Deploy it first, but without ingress: we access it via port-forward.

```bash
kubectl apply -f bootstrap/gitea/namespace.yaml
kubectl apply -f bootstrap/gitea/pvc.yaml
kubectl apply -f bootstrap/gitea/deployment.yaml
kubectl apply -f bootstrap/gitea/service.yaml
# NOTE: Do NOT apply ingress.yaml yet (no Traefik or cert-manager exists)
```

Wait for Gitea to be ready, then port-forward and configure:

```bash
kubectl wait deploy/gitea -n gitea --for=condition=available --timeout=120s

# Port-forward in a separate terminal:
kubectl port-forward svc/gitea 3000:3000 -n gitea
```
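
A Deployment can report `available` a moment before the port-forward actually answers, so it may help to poll the forwarded port before opening the browser. The following is a minimal sketch of a generic retry loop; `retry` is a hypothetical helper name, and probing `http://localhost:3000/` with `curl` is an assumption about what counts as "up", not something the bootstrap manifests define.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have been used up.
# Plain POSIX sh: the loop variables are global, so avoid reusing their names.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Do not sleep after the final failed attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage while the port-forward runs in the other terminal (not executed here):
#   retry 30 2 curl -fsS -o /dev/null http://localhost:3000/ && echo "Gitea is answering"
```
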

1. Open **http://localhost:3000** in a browser
2. Fill out the install form:
   - Database: **SQLite3** (default)
   - Site Title: **Gitea**
   - Domain: **git.prod.t01tt.tech**
   - Application URL: **https://git.prod.t01tt.tech**
   - Create admin account (username/password/email; save these)
3. Click "Install Gitea"
4. Create a new repository: **`master`** (must be **public**, owned by admin)
5. Leave the port-forward running if you continue straight to Phase 2, which pushes through it; otherwise close it (Ctrl+C)

---
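
## Phase 2: Push Repository to Gitea

```bash
cd ~/infra/yandex-prod

# The localhost remote needs the Gitea port-forward from Phase 1;
# restart it in another terminal if it was closed.
git init
git remote add origin http://localhost:3000/admin/master.git
# Or, once Gitea ingress works later, use:
# git remote add origin https://git.prod.t01tt.tech/admin/master.git

# The kubeconfig lives in this directory; keep it out of the public repo.
echo "kubeconfig" >> .gitignore

git add -A
git commit -m "initial bootstrap: infrastructure manifests"
# Make sure the local branch is named master, whatever git's default is:
git branch -M master
git push -u origin master
# Enter Gitea admin credentials when prompted
```

---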

## Phase 3: Install ArgoCD (internal access only)

```bash
bash bootstrap/argocd/install.sh
# Saves the admin password; copy it
```

Add the Gitea repository to ArgoCD:

```bash
# Via port-forward:
kubectl port-forward svc/argocd-server 8080:80 -n argocd &
sleep 2

# Login and add repo:
ARGOCD_PASS=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
argocd login localhost:8080 --username admin --password "${ARGOCD_PASS}" --insecure

argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
  --name yandex-prod \
  --type git
```
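
The fixed `sleep 2` in the block above assumes the background port-forward is listening within two seconds, which may not hold on a slow connection. One alternative is to poll the local port until something accepts a connection. This is a sketch only: `port_open` and `wait_for_port` are hypothetical helper names, and the probe relies on bash's `/dev/tcp` pseudo-device, so `bash` has to be installed.

```bash
# port_open HOST PORT
# Succeeds if a TCP connection to HOST:PORT can be opened.
# Uses bash's /dev/tcp pseudo-device, hence the explicit `bash -c`.
port_open() {
  bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# wait_for_port HOST PORT [TRIES]
# Probes once per second, up to TRIES times (default 30).
wait_for_port() {
  host=$1
  port=$2
  tries=${3:-30}
  while [ "$tries" -gt 0 ]; do
    port_open "$host" "$port" && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Usage in place of `sleep 2` (not executed here):
#   kubectl port-forward svc/argocd-server 8080:80 -n argocd &
#   PF_PID=$!
#   wait_for_port 127.0.0.1 8080 30 || { echo "port-forward never came up" >&2; kill "$PF_PID"; }
```
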

Deploy the root app:

```bash
kubectl apply -f argocd/app-of-apps.yaml
```

ArgoCD will now sync child apps according to their sync waves. You can watch progress:

```bash
argocd app list
```
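
To see the apps in the order their waves run, the Application resources can be listed together with their `argocd.argoproj.io/sync-wave` annotation and sorted. The sketch below makes several assumptions that are not confirmed by this guide: that the Application objects live in the `argocd` namespace, that the annotation sits on the Application objects themselves, and that the escaped-dot `custom-columns` expression in the usage comment is accepted by your `kubectl` version. `sort_by_wave` is a hypothetical helper name.

```bash
# Reads lines of the form "WAVE NAME [more columns...]" on stdin and prints
# them ordered by wave. kubectl prints "<none>" for a missing annotation,
# which is treated as wave 0 (ArgoCD's default wave).
sort_by_wave() {
  awk '{ if ($1 == "<none>") $1 = 0; print }' | sort -s -n -k1,1
}

# Usage against the live cluster (not executed here):
#   kubectl get applications.argoproj.io -n argocd --no-headers \
#     -o custom-columns='WAVE:.metadata.annotations.argocd\.argoproj\.io/sync-wave,NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status' \
#     | sort_by_wave
```
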

---

## Phase 4: Let the Sync Waves Run

Sync order (automated by ArgoCD via `argocd.argoproj.io/sync-wave` annotations):

| Wave | App | What happens |
|------|-----|-------------|
| **-2** | `traefik` | DaemonSet deployed on all 3 nodes. NLB created → external IP provisioned |
| **-1** | `cert-manager` | cert-manager operator + CRDs installed |
| **0** | `cert-manager-issuers` | `letsencrypt-production` + `letsencrypt-staging` ClusterIssuers created |
| **0** | `monitoring` | VM k8s-stack (metrics) + Grafana ingress deployed |
| **0** | `loki` | Loki single-binary deployed |
| **0** | `cnpg-operator` | CloudNativePG operator installed |
| **1** | `cnpg-cluster` | `shared-pg` 3-node PostgreSQL cluster + 8 databases created |

**Verify Traefik IP:**

```bash
kubectl get svc traefik -n traefik -w
# Wait for EXTERNAL-IP to appear. Example output:
# traefik LoadBalancer 10.x.x.x <pending> 80:3xxxx/TCP,443:3xxxx/TCP 30s
# traefik LoadBalancer 10.x.x.x 158.160.x.x 80:3xxxx/TCP,443:3xxxx/TCP 60s
```
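
If you would rather capture the address in a variable than read it off the watch output, the Service status can be queried directly. This is a sketch under the assumption that the load balancer reports an IPv4 address in `.status.loadBalancer.ingress[0].ip`; `is_ipv4` is a hypothetical helper that only checks the dotted-quad shape (it does not range-check the octets).

```bash
# is_ipv4 STRING
# Succeeds if STRING looks like a dotted-quad IPv4 address.
# Shape check only: an octet such as 999 is not rejected.
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Usage against the live cluster (not executed here):
#   NLB_IP=$(kubectl get svc traefik -n traefik \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
#   if is_ipv4 "$NLB_IP"; then echo "NLB IP: $NLB_IP"; else echo "still pending" >&2; fi
```
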

Note the EXTERNAL-IP: this is your NLB IP. You'll need it in Phase 5.

**Verify state:**

```bash
kubectl get pods -A
# Expected running pods:
# traefik: traefik-xxxxx (3 pods, DaemonSet)
# cert-manager: cert-manager-xxxxx, cert-manager-cainjector-xxxxx, cert-manager-webhook-xxxxx
# metrics: vm-k8s-stack-* pods (vmsingle, alertmanager, grafana, node-exporter, kube-state-metrics, vmagent)
# metrics: loki-0
# cnpg-system: cnpg-operator-xxxxx
# cnpg: shared-pg-1, shared-pg-2, shared-pg-3 (may take a minute to start)

kubectl get clusterissuer
# Expected: letsencrypt-production (True), letsencrypt-staging (True)

kubectl get cluster -n cnpg
# Expected: shared-pg (3/3 instances ready)
```

---

## Phase 5: DNS + Expose Gitea & ArgoCD

Now that Traefik has an external IP and cert-manager is running, we can:
1. Point DNS at the NLB IP
2. Create the Gitea and ArgoCD ingress resources (with TLS)

### 5.1 Update DNS

Point the following records to the Traefik NLB IP (from Phase 4):

```
git.prod.t01tt.tech → <NLB-IP>
argocd.prod.t01tt.tech → <NLB-IP>
grafana.prod.t01tt.tech → <NLB-IP>
```

Also create a wildcard for future hosts:
```
*.prod.t01tt.tech → <NLB-IP>
```
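
Before applying the ingresses it is worth confirming that the records have propagated, since certificate issuance depends on them. The sketch below assumes plain A records (a CNAME answer would be reported as a mismatch) and that `dig` is available; `dns_matches` is a hypothetical helper and `NLB_IP` is assumed to hold the address noted in Phase 4.

```bash
# dns_matches EXPECTED_IP
# Reads resolver answers on stdin, one per line (for example the output of
# `dig +short HOST`). Succeeds only if there is at least one answer and
# every answer equals EXPECTED_IP.
dns_matches() {
  awk -v want="$1" '
    NF { seen++; if ($1 != want) bad++ }
    END { exit ((seen > 0 && bad == 0) ? 0 : 1) }
  '
}

# Usage once the records are created (not executed here):
#   for h in git argocd grafana; do
#     dig +short "$h.prod.t01tt.tech" | dns_matches "$NLB_IP" \
#       || echo "$h.prod.t01tt.tech does not resolve to $NLB_IP yet" >&2
#   done
```
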

### 5.2 Apply Ingresses

```bash
kubectl apply -f bootstrap/gitea/ingress.yaml
kubectl apply -f bootstrap/argocd/ingress.yaml
```

### 5.3 Wait for TLS Certificates

```bash
kubectl get certificate -A -w
# Wait for all to show Ready=True:
# gitea gitea-tls True
# argocd argocd-tls True
# metrics grafana-tls True
```
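
Watching the table works, but a filter that prints only the certificates that are not yet ready is easier to use in a script. The sketch below assumes the default column layout of `kubectl get certificate -A --no-headers` (namespace, name, ready, secret, age); `not_ready_certs` is a hypothetical helper name.

```bash
# Reads `kubectl get certificate -A --no-headers` output on stdin and prints
# "namespace/name" for every certificate whose READY column is not "True".
# A certificate with no status yet has an empty READY column, so its third
# field is the secret name; that row is reported as not ready too.
not_ready_certs() {
  awk '$3 != "True" { print $1 "/" $2 }'
}

# Usage against the live cluster (not executed here):
#   pending=$(kubectl get certificate -A --no-headers | not_ready_certs)
#   [ -z "$pending" ] && echo "all certificates Ready" || printf 'not ready:\n%s\n' "$pending"
```
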

**Troubleshooting:** If certificates are stuck in `Pending`:
- Check DNS resolves: `dig git.prod.t01tt.tech` must return the NLB IP
- Check cert-manager logs: `kubectl logs -n cert-manager deploy/cert-manager`
- Check challenge: `kubectl get challenges -A`

---

## Phase 6: Verify Everything

### Gitea
```
https://git.prod.t01tt.tech
```
Log in with the admin credentials from Phase 1. Verify the `master` repo (registered in ArgoCD as `yandex-prod`) exists.

### ArgoCD
```
https://argocd.prod.t01tt.tech
```
Log in with `admin` + password from Phase 3. All apps should show green (`Synced` + `Healthy`).

The Ingress health may show `Healthy` immediately (by design; see `values.yaml` customization).

### Grafana
```
https://grafana.prod.t01tt.tech
```
Log in with `admin` / `change-me`. Check that VM k8s-stack dashboards are available.

### PostgreSQL
```bash
kubectl get databases -n cnpg
# Expected: 8 Database resources, one per homeserver

kubectl get pods -n cnpg
# Expected: shared-pg-1, shared-pg-2, shared-pg-3 (Running)
```

### ArgoCD Repo Connection
```bash
argocd repo list
# Expected: the Gitea repo with status "Successful"
```

If not connected, re-add via ArgoCD CLI:
```bash
argocd repo add http://gitea.gitea.svc.cluster.local:3000/admin/master.git \
  --name yandex-prod \
  --type git
```

Or in the ArgoCD UI: Settings → Repositories → Connect repo.

---

## Phase 7: Post-Bootstrap Checklist

- [ ] All ArgoCD apps `Synced` and `Healthy`
- [ ] `https://git.prod.t01tt.tech`: Gitea accessible, SSL valid
- [ ] `https://argocd.prod.t01tt.tech`: ArgoCD accessible, SSL valid
- [ ] `https://grafana.prod.t01tt.tech`: Grafana accessible, SSL valid, datasources working
- [ ] `kubectl get pvc -A`: PVCs bound for all stateful components
- [ ] CNPG `shared-pg` cluster status: `kubectl get cluster -n cnpg` shows 3/3 ready
- [ ] Certificates all `Ready`: `kubectl get certificate -A | grep False` (should return nothing)
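
The first checklist item can be checked from the command line as well as in the UI. The sketch below assumes the Application resources live in the `argocd` namespace and expose their state under `.status.sync.status` and `.status.health.status`; `unhealthy_apps` is a hypothetical helper name.

```bash
# Reads "NAME SYNC HEALTH" lines on stdin and prints every app that is not
# both Synced and Healthy, together with its current state.
unhealthy_apps() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1 " (" $2 ", " $3 ")" }'
}

# Usage against the live cluster (not executed here); empty output means
# every app is Synced and Healthy:
#   kubectl get applications.argoproj.io -n argocd --no-headers \
#     -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status' \
#     | unhealthy_apps
```
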

---

## Quick Reference: Service URLs

| Service | URL | Auth |
|---------|-----|------|
| Gitea | `https://git.prod.t01tt.tech` | Admin user from Phase 1 |
| ArgoCD | `https://argocd.prod.t01tt.tech` | `admin` / password from Phase 3 |
| Grafana | `https://grafana.prod.t01tt.tech` | `admin` / `change-me` |
| Traefik dashboard | `kubectl port-forward -n traefik daemonset/traefik 9000:9000` | Internal only |

---

## Troubleshooting

### Traefik NLB stuck in `<pending>`
Yandex Cloud NLB provisioning can take a few minutes. Check:
```bash
kubectl describe svc traefik -n traefik
```
If it is still pending after more than 5 minutes, read the Events section of that output and verify that the Yandex Cloud annotations on the Service are correct.

### Certificates stuck in `Pending`
1. Verify DNS: `dig git.prod.t01tt.tech` → must return the NLB IP
2. Check Traefik is listening: `curl -k https://<NLB-IP> -H "Host: git.prod.t01tt.tech"` → should return 404 (expected, just verifying Traefik responds)
3. Check orders: `kubectl get orders -A`

### CNPG cluster not becoming ready
```bash
kubectl describe cluster shared-pg -n cnpg
kubectl logs -n cnpg-system deploy/cnpg-controller-manager
```
Common issue: pods can't schedule due to `podAntiAffinityType: required`. Ensure all 3 nodes exist and PVCs can bind.
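
Whether the volume claims have bound can be read off `kubectl get pvc`. The sketch below assumes the `-A` column layout (namespace, name, status); without `-A` the namespace column is absent and the field numbers shift. `unbound_pvcs` is a hypothetical helper name.

```bash
# Reads `kubectl get pvc -A --no-headers` output on stdin and prints
# "namespace/name: STATUS" for every claim that is not Bound.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2 ": " $3 }'
}

# Usage against the live cluster (not executed here):
#   kubectl get pvc -A --no-headers | unbound_pvcs
#   # Pods the scheduler could not place show up as Pending:
#   kubectl get pods -n cnpg --field-selector=status.phase=Pending
```
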

### Gitea UI shows wrong URL after first login
Gitea caches the ROOT_URL from the `deployment.yaml` env vars. If you change the domain, update:
```bash
kubectl set env deploy/gitea -n gitea \
  GITEA__server__DOMAIN=git.prod.t01tt.tech \
  GITEA__server__ROOT_URL=https://git.prod.t01tt.tech
kubectl rollout restart deploy/gitea -n gitea
```
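
The two variables have to agree: the host part of `ROOT_URL` should be the same name as `DOMAIN`. A small sketch for checking that follows; `host_from_url` is a hypothetical helper, it does not handle URLs with a `user@` prefix, and the JSONPath filters in the usage comment assume Gitea is the first container in the pod and that the variables are set directly in its `env` list.

```bash
# host_from_url URL
# Prints the host part of URL: strips the scheme, then everything from the
# first "/" or ":" onwards (path and port).
host_from_url() {
  printf '%s\n' "$1" | sed -E 's#^[a-zA-Z][a-zA-Z0-9+.-]*://##; s#[/:].*$##'
}

# Usage against the live cluster (not executed here):
#   root_url=$(kubectl get deploy gitea -n gitea \
#     -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="GITEA__server__ROOT_URL")].value}')
#   domain=$(kubectl get deploy gitea -n gitea \
#     -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="GITEA__server__DOMAIN")].value}')
#   [ "$(host_from_url "$root_url")" = "$domain" ] || echo "DOMAIN and ROOT_URL disagree" >&2
```
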

### ArgoCD apps showing "Unknown" health
For Ingress resources this is normal until the sync completes: the custom health check in `bootstrap/argocd/values.yaml` marks all Ingresses as `Healthy` once synced. For other resources, check the app details in the ArgoCD UI for the specific error.