Thomas Richter 8c3dc137ca docs(08): create phase plan
Phase 08: Observability Stack
- 3 plans in 2 waves
- Wave 1: 08-01 (metrics), 08-02 (Alloy) - parallel
- Wave 2: 08-03 (verification) - depends on both
- Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 21:24:24 +01:00


---
phase: 08-observability-stack
plan: 03
type: execute
wave: 2
depends_on: [08-01, 08-02]
files_modified: []
autonomous: false
must_haves:
  truths:
    - Prometheus scrapes TaskPlanner /metrics endpoint
    - Grafana can query TaskPlanner logs via Loki
    - KubePodCrashLooping alert rule exists
  key_links:
    - from: Prometheus
      to: TaskPlanner /metrics
      via: ServiceMonitor
      pattern: servicemonitor.*taskplaner
    - from: Grafana Explore
      to: Loki datasource
      via: LogQL query
      pattern: namespace.*default.*taskplaner
---
Verify end-to-end observability stack: metrics scraping, log queries, and alerting

Purpose: Confirm all Phase 8 requirements are satisfied (OBS-01 through OBS-08)
Output: Verified observability stack with documented proof of functionality

<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>

@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
@.planning/phases/08-observability-stack/08-01-SUMMARY.md
@.planning/phases/08-observability-stack/08-02-SUMMARY.md

Task 1: Deploy TaskPlanner with ServiceMonitor and verify Prometheus scraping (no files - deployment and verification)

1. Commit and push the metrics endpoint and ServiceMonitor changes from 08-01:
   ```bash
   git add .
   git commit -m "feat(metrics): add /metrics endpoint and ServiceMonitor
   - Add prom-client for Prometheus metrics
   - Expose /metrics endpoint with default Node.js metrics
   - Add ServiceMonitor template to Helm chart

   OBS-08, OBS-01"
   git push
   ```

2. Wait for ArgoCD to sync (or trigger manual sync):
   ```bash
   # Check ArgoCD sync status
   kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
   # If not synced, wait up to 3 minutes or trigger:
   argocd app sync taskplaner --server argocd.tricnet.be --insecure 2>/dev/null || \
     kubectl patch application taskplaner -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{}}}'
   ```
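
   The status check and the manual sync above can be combined into a small retry loop so the plan does not race ArgoCD. A sketch; `wait_for` is a hypothetical helper, and the commented usage reuses the jsonpath from the check above:

   ```bash
   # Poll a command until its output equals the wanted value or the timeout
   # (seconds) expires. Returns 0 on match, 1 on timeout.
   wait_for() {
     local want="$1" timeout="$2"; shift 2
     local deadline=$((SECONDS + timeout))
     while (( SECONDS < deadline )); do
       [[ "$("$@" 2>/dev/null)" == "$want" ]] && return 0
       sleep 2
     done
     return 1
   }

   # Live usage (same app name and jsonpath as above):
   # wait_for Synced 180 kubectl get application taskplaner -n argocd \
   #   -o jsonpath='{.status.sync.status}'
   ```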

3. Wait for deployment to complete:
   ```bash
   kubectl rollout status deployment taskplaner --timeout=120s
   ```

4. Verify ServiceMonitor created:
   ```bash
   kubectl get servicemonitor taskplaner
   ```
   Expected: ServiceMonitor exists
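
   If the ServiceMonitor exists but no target appears in step 5, the usual cause is that `.spec.selector.matchLabels` does not match the Service's labels. The jq below checks the two label sets against each other, demonstrated on sample objects (the label values are assumptions; live, feed it the output of `kubectl get servicemonitor,service taskplaner -o json`):

   ```bash
   # Sample objects mirroring the relevant parts of a ServiceMonitor and a
   # Service (assumed labels; replace with live cluster output).
   sm='{"spec":{"selector":{"matchLabels":{"app":"taskplaner"}}}}'
   svc='{"metadata":{"labels":{"app":"taskplaner","release":"taskplaner"}}}'

   # Every matchLabels entry must be present, with the same value, on the Service.
   jq -rn --argjson sm "$sm" --argjson svc "$svc" '
     $sm.spec.selector.matchLabels as $want
     | $svc.metadata.labels as $have
     | if ($want | to_entries | all($have[.key] == .value))
       then "labels match"
       else "mismatch: want \($want), have \($have)" end'
   ```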

5. Verify Prometheus is scraping TaskPlanner:
   ```bash
   # Port-forward to Prometheus
   kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
   sleep 3

   # Query for TaskPlanner targets
   curl -s "http://localhost:9090/api/v1/targets" | grep -A5 "taskplaner"

   # Kill port-forward
   kill %1 2>/dev/null
   ```
   Expected: TaskPlanner target shows state: "up"
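
   The grep above dumps raw JSON; a jq filter gives a cleaner pass/fail read. Demonstrated on a trimmed sample of the `/api/v1/targets` response shape (live, pipe the curl output from above into the same filter):

   ```bash
   # Trimmed sample of Prometheus' targets API response (label names are
   # assumptions; real output carries many more fields per target).
   sample='{"data":{"activeTargets":[
     {"labels":{"job":"taskplaner","pod":"taskplaner-7d9f"},"health":"up"},
     {"labels":{"job":"node-exporter","pod":"node-1"},"health":"up"}]}}'

   # Keep only taskplaner targets and print "job health".
   echo "$sample" | jq -r '.data.activeTargets[]
     | select(.labels.job | test("taskplaner"))
     | "\(.labels.job) \(.health)"'
   ```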

6. Query a TaskPlanner metric:
   ```bash
   kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
   sleep 3
   curl -s "http://localhost:9090/api/v1/query?query=process_cpu_seconds_total{namespace=\"default\",pod=~\"taskplaner.*\"}" | jq '.data.result[0].value'
   kill %1 2>/dev/null
   ```
   Expected: Returns a numeric value
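
   The same query URL shape works for any of prom-client's default metrics (e.g. `process_resident_memory_bytes` or the `nodejs_*` gauges). Pulling just the scalar out of the response, shown on a sample that mirrors the instant-query API shape:

   ```bash
   # Sample instant-query response; the value is a [unix_ts, "value"] pair.
   resp='{"status":"success","data":{"resultType":"vector","result":[
     {"metric":{"pod":"taskplaner-7d9f"},"value":[1738600000,"12.34"]}]}}'

   # The metric value is the second element of the pair, encoded as a string.
   echo "$resp" | jq -r '.data.result[0].value[1]'
   ```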

NOTE: If ArgoCD sync takes too long, the push in step 1 may already have triggered an automatic sync.
Verification:
1. `kubectl get servicemonitor taskplaner` returns a resource
2. Prometheus targets API shows TaskPlanner with state "up"
3. Prometheus query returns a `process_cpu_seconds_total` value for TaskPlanner

Done: Prometheus successfully scraping TaskPlanner /metrics endpoint via ServiceMonitor

Task 2: Verify critical alert rules exist (no files - verification only)

1. List PrometheusRules to find pod crash alerting:
   ```bash
   kubectl get prometheusrules -n monitoring -o name | head -20
   ```
2. Search for KubePodCrashLooping alert:
   ```bash
   kubectl get prometheusrules -n monitoring -o yaml | grep -A10 "KubePodCrashLooping"
   ```
   Expected: Alert rule definition found
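
   A targeted jq keeps this readable when the grep matches too much surrounding YAML. Demonstrated on a trimmed sample of the PrometheusRule list shape (live, pipe `kubectl get prometheusrules -n monitoring -o json` into the same filter; the expr text below is a placeholder, not the real rule):

   ```bash
   # Trimmed sample of a PrometheusRule list: one alerting rule, one recording
   # rule. Recording rules have no .alert key, hence the defensive .alert?.
   rules='{"items":[{"spec":{"groups":[{"rules":[
     {"alert":"KubePodCrashLooping","expr":"max_over_time(placeholder_metric[5m]) >= 1"},
     {"record":"cluster:some:recording_rule"}]}]}}]}'

   echo "$rules" | jq -r '.items[].spec.groups[].rules[]
     | select(.alert? == "KubePodCrashLooping") | .expr'
   ```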

3. If not found by name, search for crash-related alerts:
   ```bash
   kubectl get prometheusrules -n monitoring -o yaml | grep -i "crash\|restart\|CrashLoopBackOff" | head -10
   ```

4. Verify Alertmanager is running:
   ```bash
   kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
   ```
   Expected: alertmanager pod(s) Running

5. Check current alerts (should be empty if cluster healthy):
   ```bash
   kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
   sleep 2
   curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head -10
   kill %1 2>/dev/null
   ```
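
   On a healthy kube-prometheus-stack cluster the list is usually not quite empty: the stack ships an always-firing `Watchdog` alert to prove the pipeline works end to end. Counting alerts by name, shown on a sample of the v2 response shape (live, pipe the curl output from above into the same filter):

   ```bash
   # Sample /api/v2/alerts response: a bare JSON array of alert objects.
   alerts='[{"labels":{"alertname":"Watchdog","severity":"none"}}]'

   # One line per alert name with its count.
   echo "$alerts" | jq -r 'group_by(.labels.alertname)
     | map("\(.[0].labels.alertname): \(length)") | .[]'
   ```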

NOTE: kube-prometheus-stack includes default Kubernetes alerting rules. KubePodCrashLooping is a standard rule that fires when a pod restarts more than once in 10 minutes.
Verification:
1. `kubectl get prometheusrules` finds KubePodCrashLooping or an equivalent crash alert
2. Alertmanager pod is Running
3. Alertmanager API responds (even if the alert list is empty)

Done: KubePodCrashLooping alert rule confirmed present, Alertmanager operational

Full observability stack:
- TaskPlanner /metrics endpoint (OBS-08)
- Prometheus scraping via ServiceMonitor (OBS-01)
- Alloy collecting logs (OBS-04)
- Loki storing logs (OBS-03)
- Critical alerts configured (OBS-06)
- Grafana dashboards (OBS-02)

1. Open Grafana: https://grafana.kube2.tricnet.de
   - Login: admin / GrafanaAdmin2026
2. Verify dashboards (OBS-02):
   - Go to Dashboards
   - Open "Kubernetes / Compute Resources / Namespace (Pods)" or similar
   - Select namespace: default
   - Confirm TaskPlanner pod metrics visible

3. Verify log queries (OBS-05):
   - Go to Explore
   - Select Loki datasource
   - Enter query: {namespace="default", pod=~"taskplaner.*"}
   - Click Run Query
   - Confirm TaskPlanner logs appear

4. Verify TaskPlanner metrics in Grafana:
   - Go to Explore
   - Select Prometheus datasource
   - Enter query: process_cpu_seconds_total{namespace="default", pod=~"taskplaner.*"}
   - Confirm metric graph appears

5. Verify Grafana accessible with TLS (OBS-07):
   - Confirm https:// in URL bar (no certificate warnings)
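
Step 3's LogQL query can also be cross-checked without Grafana by hitting Loki's HTTP API directly. Extracting the raw log lines from a `query_range` response, shown on a sample of that shape (the Loki service name in the commented live commands is an assumption; check with `kubectl get svc -n monitoring`):

```bash
# Live (assumed service name):
#   kubectl port-forward -n monitoring svc/loki 3100:3100 &
#   curl -sG http://localhost:3100/loki/api/v1/query_range \
#     --data-urlencode 'query={namespace="default", pod=~"taskplaner.*"}'

# Sample query_range response; values are [nanosecond-ts, line] pairs.
resp='{"data":{"result":[{"stream":{"pod":"taskplaner-7d9f"},
  "values":[["1738600000000000000","GET /api/tasks 200"]]}]}}'
echo "$resp" | jq -r '.data.result[].values[][1]'
```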
Type "verified" if all checks pass, or describe what failed.

- [ ] ServiceMonitor created and Prometheus scraping TaskPlanner
- [ ] TaskPlanner metrics visible in Prometheus queries
- [ ] KubePodCrashLooping alert rule exists
- [ ] Alertmanager running and responsive
- [ ] Human verified: Grafana dashboards show cluster metrics
- [ ] Human verified: Grafana can query TaskPlanner logs from Loki
- [ ] Human verified: TaskPlanner metrics visible in Grafana

<success_criteria>

  1. Prometheus scrapes TaskPlanner /metrics (OBS-01, OBS-08 complete)
  2. Grafana dashboards display cluster metrics (OBS-02 verified)
  3. TaskPlanner logs queryable in Grafana via Loki (OBS-05 verified)
  4. KubePodCrashLooping alert rule confirmed (OBS-06 verified)
  5. Grafana accessible via TLS (OBS-07 verified)

</success_criteria>
After completion, create `.planning/phases/08-observability-stack/08-03-SUMMARY.md`
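
A starting point for that file can be scaffolded in one step; the section names below are assumptions, so follow `templates/summary.md` wherever it differs:

```bash
mkdir -p .planning/phases/08-observability-stack
cat > .planning/phases/08-observability-stack/08-03-SUMMARY.md <<'EOF'
# 08-03 Summary: Observability Stack Verification

## Result
(verified / failed - fill in after the checkpoint)

## Evidence
- Prometheus target state for taskplaner:
- KubePodCrashLooping rule found:
- Grafana/Loki human verification:
EOF
```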