---
phase: 08-observability-stack
plan: 03
type: execute
wave: 2
depends_on: ["08-01", "08-02"]
files_modified: []
autonomous: false
must_haves:
  truths:
    - "Prometheus scrapes TaskPlanner /metrics endpoint"
    - "Grafana can query TaskPlanner logs via Loki"
    - "KubePodCrashLooping alert rule exists"
  artifacts: []
  key_links:
    - from: "Prometheus"
      to: "TaskPlanner /metrics"
      via: "ServiceMonitor"
      pattern: "servicemonitor.*taskplaner"
    - from: "Grafana Explore"
      to: "Loki datasource"
      via: "LogQL query"
      pattern: "namespace.*default.*taskplaner"
---

Verify the end-to-end observability stack: metrics scraping, log queries, and alerting.

Purpose: Confirm all Phase 8 requirements are satisfied (OBS-01 through OBS-08)
Output: Verified observability stack with documented proof of functionality

@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
@.planning/phases/08-observability-stack/08-01-SUMMARY.md
@.planning/phases/08-observability-stack/08-02-SUMMARY.md

Task 1: Deploy TaskPlanner with ServiceMonitor and verify Prometheus scraping (no files - deployment and verification)

1. Commit and push the metrics endpoint and ServiceMonitor changes from 08-01:
```bash
git add .
git commit -m "feat(metrics): add /metrics endpoint and ServiceMonitor

- Add prom-client for Prometheus metrics
- Expose /metrics endpoint with default Node.js metrics
- Add ServiceMonitor template to Helm chart

OBS-08, OBS-01"
git push
```
2. Wait for ArgoCD to sync (or trigger a manual sync):
```bash
# Check ArgoCD sync status
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'

# If not synced, wait up to 3 minutes or trigger:
argocd app sync taskplaner --server argocd.tricnet.be --insecure 2>/dev/null || \
  kubectl patch application taskplaner -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{}}}'
```
3. Wait for the deployment to complete:
```bash
kubectl rollout status deployment taskplaner --timeout=120s
```
4. Verify the ServiceMonitor was created:
```bash
kubectl get servicemonitor taskplaner
```
Expected: ServiceMonitor exists
5. Verify Prometheus is scraping TaskPlanner:
```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 3

# Query for TaskPlanner targets
curl -s "http://localhost:9090/api/v1/targets" | grep -A5 "taskplaner"

# Kill port-forward
kill %1 2>/dev/null
```
Expected: TaskPlanner target shows state: "up"
6. Query a TaskPlanner metric:
```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 3
curl -s "http://localhost:9090/api/v1/query?query=process_cpu_seconds_total{namespace=\"default\",pod=~\"taskplaner.*\"}" | jq '.data.result[0].value'
kill %1 2>/dev/null
```
Expected: Returns a numeric value

NOTE: If the ArgoCD sync appears stalled, the earlier push may already have triggered an automatic sync.

1. `kubectl get servicemonitor taskplaner` returns a resource
2. The Prometheus targets API shows the taskplaner target with state "up"
3. A Prometheus query returns a process_cpu_seconds_total value for TaskPlanner

Prometheus successfully scraping the TaskPlanner /metrics endpoint via ServiceMonitor

Task 2: Verify critical alert rules exist (no files - verification only)

1. List PrometheusRules to find pod crash alerting:
```bash
kubectl get prometheusrules -n monitoring -o name | head -20
```
2. Search for the KubePodCrashLooping alert:
```bash
kubectl get prometheusrules -n monitoring -o yaml | grep -A10 "KubePodCrashLooping"
```
Expected: Alert rule definition found
3. If not found by name, search for crash-related alerts:
```bash
kubectl get prometheusrules -n monitoring -o yaml | grep -i "crash\|restart\|CrashLoopBackOff" | head -10
```
4. Verify Alertmanager is running:
```bash
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
```
Expected: alertmanager pod(s) Running
5. Check current alerts (the list should be empty if the cluster is healthy):
```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head -10
kill %1 2>/dev/null
```

NOTE: kube-prometheus-stack ships with default Kubernetes alerting rules. KubePodCrashLooping is a standard rule that fires when a pod is restarting repeatedly (crash looping) over a sustained window.

1. `kubectl get prometheusrules` finds KubePodCrashLooping or an equivalent crash alert
2. The Alertmanager pod is Running
3. The Alertmanager API responds (even if the alert list is empty)

KubePodCrashLooping alert rule confirmed present, Alertmanager operational

Full observability stack:
- TaskPlanner /metrics endpoint (OBS-08)
- Prometheus scraping via ServiceMonitor (OBS-01)
- Alloy collecting logs (OBS-04)
- Loki storing logs (OBS-03)
- Critical alerts configured (OBS-06)
- Grafana dashboards (OBS-02)

1. Open Grafana: https://grafana.kube2.tricnet.de
   - Login: admin / GrafanaAdmin2026
2. Verify dashboards (OBS-02):
   - Go to Dashboards
   - Open "Kubernetes / Compute Resources / Namespace (Pods)" or similar
   - Select namespace: default
   - Confirm TaskPlanner pod metrics are visible
3. Verify log queries (OBS-05):
   - Go to Explore
   - Select the Loki datasource
   - Enter query: {namespace="default", pod=~"taskplaner.*"}
   - Click Run Query
   - Confirm TaskPlanner logs appear
4. Verify TaskPlanner metrics in Grafana:
   - Go to Explore
   - Select the Prometheus datasource
   - Enter query: process_cpu_seconds_total{namespace="default", pod=~"taskplaner.*"}
   - Confirm the metric graph appears
5. Verify Grafana accessible with TLS (OBS-07):
   - Confirm https:// in the URL bar (no certificate warnings)

Type "verified" if all checks pass, or describe what failed

- [ ] ServiceMonitor created and Prometheus scraping TaskPlanner
- [ ] TaskPlanner metrics visible in Prometheus queries
- [ ] KubePodCrashLooping alert rule exists
- [ ] Alertmanager running and responsive
- [ ] Human verified: Grafana dashboards show cluster metrics
- [ ] Human verified: Grafana can query TaskPlanner logs from Loki
- [ ] Human verified: TaskPlanner metrics visible in Grafana

1. Prometheus scrapes TaskPlanner /metrics (OBS-01, OBS-08 complete)
2. Grafana dashboards display cluster metrics (OBS-02 verified)
3. TaskPlanner logs queryable in Grafana via Loki (OBS-05 verified)
4. KubePodCrashLooping alert rule confirmed (OBS-06 verified)
5. Grafana accessible via TLS (OBS-07 verified)

After completion, create `.planning/phases/08-observability-stack/08-03-SUMMARY.md`
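The Task 1 target check can be scripted rather than eyeballed through grep, which makes it usable as proof in the summary. A minimal sketch, assuming the scrape job Prometheus derives from the ServiceMonitor carries a job label matching "taskplaner" (the heredoc stands in for the live `curl -s http://localhost:9090/api/v1/targets` response while the port-forward is running):

```shell
# Parse the Prometheus targets API response and fail unless the
# taskplaner target reports health "up".
# Live usage: replace the heredoc with
#   TARGETS_JSON=$(curl -s http://localhost:9090/api/v1/targets)
TARGETS_JSON=$(cat <<'EOF'
{"status":"success","data":{"activeTargets":[
  {"labels":{"job":"taskplaner","namespace":"default"},"health":"up"}
]}}
EOF
)

# Select the target whose job label matches "taskplaner" and read its health.
health=$(printf '%s' "$TARGETS_JSON" |
  jq -r '.data.activeTargets[] | select(.labels.job | test("taskplaner")) | .health')

if [ "$health" = "up" ]; then
  echo "taskplaner target is up"
else
  echo "taskplaner target unhealthy: ${health:-not found}" >&2
  exit 1
fi
```

The exit code makes the check composable with the rest of the plan (e.g. gate the SUMMARY step on it). The exact job label is an assumption; inspect the real targets payload if the select matches nothing.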
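The Loki log check (OBS-05) can also be verified from the CLI before the human Grafana pass, by querying Loki's HTTP API directly. A sketch under assumptions: the Loki service name (`svc/loki`) and port 3100 depend on how 08-02 installed the chart, and the heredoc stands in for the live response of
`curl -G -s "http://localhost:3100/loki/api/v1/query_range" --data-urlencode 'query={namespace="default", pod=~"taskplaner.*"}'`:

```shell
# Check that Loki returns at least one log stream for taskplaner pods.
# Live usage (service name/port are assumptions, adjust to the 08-02 install):
#   kubectl -n monitoring port-forward svc/loki 3100:3100 &
#   RESPONSE=$(curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
#     --data-urlencode 'query={namespace="default", pod=~"taskplaner.*"}')
RESPONSE=$(cat <<'EOF'
{"status":"success","data":{"resultType":"streams","result":[
  {"stream":{"namespace":"default","pod":"taskplaner-abc"},
   "values":[["1700000000000000000","listening on :3000"]]}
]}}
EOF
)

# A non-empty result array means Alloy -> Loki ingestion is working.
streams=$(printf '%s' "$RESPONSE" | jq '.data.result | length')
[ "$streams" -gt 0 ] && echo "Loki returned $streams stream(s) for taskplaner"
```

This does not replace the Grafana Explore check (which also proves the datasource wiring), but it isolates failures: if this succeeds and Grafana shows nothing, the problem is the Grafana Loki datasource, not ingestion.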