Thomas Richter 8c3dc137ca docs(08): create phase plan
Phase 08: Observability Stack
- 3 plans in 2 waves
- Wave 1: 08-01 (metrics), 08-02 (Alloy) - parallel
- Wave 2: 08-03 (verification) - depends on both
- Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 21:24:24 +01:00

---
phase: 08-observability-stack
plan: 03
type: execute
wave: 2
depends_on: ["08-01", "08-02"]
files_modified: []
autonomous: false
must_haves:
  truths:
    - "Prometheus scrapes TaskPlanner /metrics endpoint"
    - "Grafana can query TaskPlanner logs via Loki"
    - "KubePodCrashLooping alert rule exists"
  artifacts: []
  key_links:
    - from: "Prometheus"
      to: "TaskPlanner /metrics"
      via: "ServiceMonitor"
      pattern: "servicemonitor.*taskplaner"
    - from: "Grafana Explore"
      to: "Loki datasource"
      via: "LogQL query"
      pattern: "namespace.*default.*taskplaner"
---
<objective>
Verify the end-to-end observability stack: metrics scraping, log queries, and alerting
Purpose: Confirm all Phase 8 requirements are satisfied (OBS-01 through OBS-08)
Output: Verified observability stack with documented proof of functionality
</objective>
<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
@.planning/phases/08-observability-stack/08-01-SUMMARY.md
@.planning/phases/08-observability-stack/08-02-SUMMARY.md
</context>
<tasks>
<task type="auto">
<name>Task 1: Deploy TaskPlanner with ServiceMonitor and verify Prometheus scraping</name>
<files>
(no files - deployment and verification)
</files>
<action>
1. Commit and push the metrics endpoint and ServiceMonitor changes from 08-01:
```bash
git add .
git commit -m "feat(metrics): add /metrics endpoint and ServiceMonitor
- Add prom-client for Prometheus metrics
- Expose /metrics endpoint with default Node.js metrics
- Add ServiceMonitor template to Helm chart
OBS-08, OBS-01"
git push
```
2. Wait for ArgoCD to sync (or trigger manual sync):
```bash
# Check ArgoCD sync status
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
# If not synced, wait up to 3 minutes or trigger:
argocd app sync taskplaner --server argocd.tricnet.be --insecure 2>/dev/null || \
kubectl patch application taskplaner -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{}}}'
```
3. Wait for deployment to complete:
```bash
kubectl rollout status deployment taskplaner --timeout=120s
```
4. Verify ServiceMonitor created:
```bash
kubectl get servicemonitor taskplaner
```
Expected: ServiceMonitor exists
5. Verify Prometheus is scraping TaskPlanner:
```bash
# Port-forward to Prometheus in the background
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
PF_PID=$!
sleep 3
# Query for TaskPlanner targets
curl -s "http://localhost:9090/api/v1/targets" | grep -A5 "taskplaner"
# Kill port-forward by PID (kill %1 relies on job control, which is not
# available in non-interactive shells)
kill $PF_PID 2>/dev/null
```
Expected: TaskPlanner target shows state: "up"
6. Query a TaskPlanner metric:
```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
PF_PID=$!
sleep 3
# Use -G with --data-urlencode: a raw {...,...} in the URL triggers curl's
# brace globbing and splits the request into multiple broken URLs
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=process_cpu_seconds_total{namespace="default",pod=~"taskplaner.*"}' \
  | jq '.data.result[0].value'
kill $PF_PID 2>/dev/null
```
Expected: Returns a numeric value
NOTE: If ArgoCD sync takes too long, the push from earlier may already have triggered sync automatically.
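If the target never shows up, a structured cross-check is to parse the targets API with jq instead of grepping raw JSON. A sketch, assuming jq is installed and the port-forward from step 5 is still active:

```shell
# Print the health of every active scrape target whose pod label matches
# "taskplaner"; "up" means Prometheus is scraping the endpoint successfully.
curl -s "http://localhost:9090/api/v1/targets" \
  | jq -r '.data.activeTargets[]
           | select((.labels.pod // "") | test("taskplaner"))
           | .health'
```

An empty result here (with the ServiceMonitor present) usually points at a label-selector mismatch between the ServiceMonitor and the Prometheus operator's serviceMonitorSelector.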
</action>
<verify>
1. kubectl get servicemonitor taskplaner returns a resource
2. Prometheus targets API shows the taskplaner target with state "up"
3. Prometheus query returns process_cpu_seconds_total value for TaskPlanner
</verify>
<done>
Prometheus successfully scraping TaskPlanner /metrics endpoint via ServiceMonitor
</done>
</task>
<task type="auto">
<name>Task 2: Verify critical alert rules exist</name>
<files>
(no files - verification only)
</files>
<action>
1. List PrometheusRules to find pod crash alerting:
```bash
kubectl get prometheusrules -n monitoring -o name | head -20
```
2. Search for KubePodCrashLooping alert:
```bash
kubectl get prometheusrules -n monitoring -o yaml | grep -A10 "KubePodCrashLooping"
```
Expected: Alert rule definition found
3. If not found by name, search for crash-related alerts:
```bash
kubectl get prometheusrules -n monitoring -o yaml | grep -i "crash\|restart\|CrashLoopBackOff" | head -10
```
4. Verify Alertmanager is running:
```bash
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
```
Expected: alertmanager pod(s) Running
5. Check current alerts (should be empty if cluster healthy):
```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
PF_PID=$!
sleep 2
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head -10
kill $PF_PID 2>/dev/null
```
NOTE: kube-prometheus-stack includes default Kubernetes alerting rules. KubePodCrashLooping is a standard rule that fires when a container is repeatedly restarting (CrashLoopBackOff); the exact expression and thresholds vary by chart version.
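The YAML grep in steps 2-3 can also be done structurally. A sketch using jq (assumed installed) that prints the exact PromQL expression behind the alert, whichever PrometheusRule it lives in:

```shell
# Walk every rule group in every PrometheusRule in the monitoring namespace
# and print the expression of any rule named KubePodCrashLooping.
kubectl get prometheusrules -n monitoring -o json \
  | jq -r '.items[].spec.groups[].rules[]
           | select(.alert? == "KubePodCrashLooping")
           | .expr'
```

Non-empty output confirms the rule exists and shows what it actually evaluates.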
</action>
<verify>
1. kubectl get prometheusrules finds KubePodCrashLooping or equivalent crash alert
2. Alertmanager pod is Running
3. Alertmanager API responds (even if alert list is empty)
</verify>
<done>
KubePodCrashLooping alert rule confirmed present, Alertmanager operational
</done>
</task>
<task type="checkpoint:human-verify" gate="blocking">
<what-built>
Full observability stack:
- TaskPlanner /metrics endpoint (OBS-08)
- Prometheus scraping via ServiceMonitor (OBS-01)
- Alloy collecting logs (OBS-04)
- Loki storing logs (OBS-03)
- Critical alerts configured (OBS-06)
- Grafana dashboards (OBS-02)
</what-built>
<how-to-verify>
1. Open Grafana: https://grafana.kube2.tricnet.de
- Login: admin / GrafanaAdmin2026
2. Verify dashboards (OBS-02):
- Go to Dashboards
- Open "Kubernetes / Compute Resources / Namespace (Pods)" or similar
- Select namespace: default
- Confirm TaskPlanner pod metrics visible
3. Verify log queries (OBS-05):
- Go to Explore
- Select Loki datasource
- Enter query: {namespace="default", pod=~"taskplaner.*"}
- Click Run Query
- Confirm TaskPlanner logs appear
4. Verify TaskPlanner metrics in Grafana:
- Go to Explore
- Select Prometheus datasource
- Enter query: process_cpu_seconds_total{namespace="default", pod=~"taskplaner.*"}
- Confirm metric graph appears
5. Verify Grafana accessible with TLS (OBS-07):
- Confirm https:// in URL bar (no certificate warnings)
</how-to-verify>
<resume-signal>Type "verified" if all checks pass, or describe what failed</resume-signal>
</task>
</tasks>
<verification>
- [ ] ServiceMonitor created and Prometheus scraping TaskPlanner
- [ ] TaskPlanner metrics visible in Prometheus queries
- [ ] KubePodCrashLooping alert rule exists
- [ ] Alertmanager running and responsive
- [ ] Human verified: Grafana dashboards show cluster metrics
- [ ] Human verified: Grafana can query TaskPlanner logs from Loki
- [ ] Human verified: TaskPlanner metrics visible in Grafana
</verification>
<success_criteria>
1. Prometheus scrapes TaskPlanner /metrics (OBS-01, OBS-08 complete)
2. Grafana dashboards display cluster metrics (OBS-02 verified)
3. TaskPlanner logs queryable in Grafana via Loki (OBS-05 verified)
4. KubePodCrashLooping alert rule confirmed (OBS-06 verified)
5. Grafana accessible via TLS (OBS-07 verified)
</success_criteria>
<output>
After completion, create `.planning/phases/08-observability-stack/08-03-SUMMARY.md`
</output>