---
phase: 08-observability-stack
plan: 03
type: execute
wave: 2
depends_on: ["08-01", "08-02"]
files_modified: []
autonomous: false

must_haves:
  truths:
    - "Prometheus scrapes TaskPlanner /metrics endpoint"
    - "Grafana can query TaskPlanner logs via Loki"
    - "KubePodCrashLooping alert rule exists"
  artifacts: []
  key_links:
    - from: "Prometheus"
      to: "TaskPlanner /metrics"
      via: "ServiceMonitor"
      pattern: "servicemonitor.*taskplaner"
    - from: "Grafana Explore"
      to: "Loki datasource"
      via: "LogQL query"
      pattern: "namespace.*default.*taskplaner"
---
<objective>
Verify end-to-end observability stack: metrics scraping, log queries, and alerting

Purpose: Confirm all Phase 8 requirements are satisfied (OBS-01 through OBS-08)
Output: Verified observability stack with documented proof of functionality
</objective>

<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
@.planning/phases/08-observability-stack/08-01-SUMMARY.md
@.planning/phases/08-observability-stack/08-02-SUMMARY.md
</context>

<tasks>
<task type="auto">
<name>Task 1: Deploy TaskPlanner with ServiceMonitor and verify Prometheus scraping</name>
<files>
(no files - deployment and verification)
</files>
<action>
1. Commit and push the metrics endpoint and ServiceMonitor changes from 08-01:
```bash
git add .
git commit -m "feat(metrics): add /metrics endpoint and ServiceMonitor

- Add prom-client for Prometheus metrics
- Expose /metrics endpoint with default Node.js metrics
- Add ServiceMonitor template to Helm chart

OBS-08, OBS-01"
git push
```

2. Wait for ArgoCD to sync (or trigger manual sync):
```bash
# Check ArgoCD sync status
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
# If not synced, wait up to 3 minutes or trigger:
argocd app sync taskplaner --server argocd.tricnet.be --insecure 2>/dev/null || \
  kubectl patch application taskplaner -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{}}}'
```
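
Instead of re-running the status check by hand, the wait can be wrapped in a small polling helper (a hypothetical sketch, not part of the repo; 36 attempts x 5 s matches the 3-minute budget above):

```bash
# wait_for: retry a command until it succeeds or attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <cmd...>
wait_for() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # run the supplied command; stop as soon as it succeeds
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Poll the ArgoCD sync status for up to 3 minutes (assumes cluster access):
# wait_for 36 5 sh -c \
#   '[ "$(kubectl get application taskplaner -n argocd -o jsonpath="{.status.sync.status}")" = "Synced" ]'
```

The helper returns the command's own exit status, so it composes with `&&`/`||` in later steps.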

3. Wait for deployment to complete:
```bash
kubectl rollout status deployment taskplaner --timeout=120s
```

4. Verify ServiceMonitor created:
```bash
kubectl get servicemonitor taskplaner
```
Expected: ServiceMonitor exists

5. Verify Prometheus is scraping TaskPlanner:
```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 3

# Query for TaskPlanner targets (jq handles the single-line JSON better than grep -A5)
curl -s "http://localhost:9090/api/v1/targets" | \
  jq '.data.activeTargets[] | select(.labels.job | test("taskplaner")) | .health'

# Kill port-forward
kill %1 2>/dev/null
```
Expected: the taskplaner target reports health "up"

6. Query a TaskPlanner metric:
```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 3
# -G with --data-urlencode: the braces and comma in the selector would otherwise
# trigger curl's URL globbing and break the request
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=process_cpu_seconds_total{namespace="default",pod=~"taskplaner.*"}' \
  | jq '.data.result[0].value'
kill %1 2>/dev/null
```
Expected: Returns a numeric value

NOTE: If ArgoCD sync takes too long, the push from earlier may already have triggered sync automatically.
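
Steps 5 and 6 repeat the same port-forward / query / kill dance; a helper like this can factor it out (a hypothetical sketch, not part of the repo; `PF_WAIT` is an illustrative knob for how long to wait for the tunnel):

```bash
# with_port_forward: run a command behind a temporary kubectl port-forward.
# Usage: with_port_forward <namespace> <service> <local:remote> <cmd...>
with_port_forward() {
  ns=$1; svc=$2; ports=$3; shift 3
  kubectl -n "$ns" port-forward "svc/$svc" "$ports" >/dev/null 2>&1 &
  pf=$!
  sleep "${PF_WAIT:-3}"   # give the tunnel time to establish
  "$@"
  rc=$?
  kill "$pf" 2>/dev/null
  wait "$pf" 2>/dev/null
  return "$rc"
}

# Example (assumes cluster access and the service name from step 5):
# with_port_forward monitoring kube-prometheus-stack-prometheus 9090:9090 \
#   curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=up'
```

It propagates the wrapped command's exit status, so failed queries still fail the step.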
</action>
<verify>
1. kubectl get servicemonitor taskplaner returns a resource
2. The Prometheus targets API shows the taskplaner target as healthy ("up")
3. A Prometheus query returns a process_cpu_seconds_total value for TaskPlanner
</verify>
<done>
Prometheus successfully scraping TaskPlanner /metrics endpoint via ServiceMonitor
</done>
</task>

<task type="auto">
<name>Task 2: Verify critical alert rules exist</name>
<files>
(no files - verification only)
</files>
<action>
1. List PrometheusRules to find pod crash alerting:
```bash
kubectl get prometheusrules -n monitoring -o name | head -20
```

2. Search for KubePodCrashLooping alert:
```bash
kubectl get prometheusrules -n monitoring -o yaml | grep -A10 "KubePodCrashLooping"
```
Expected: Alert rule definition found

3. If not found by name, search for crash-related alerts:
```bash
kubectl get prometheusrules -n monitoring -o yaml | grep -iE "crash|restart" | head -10
```

4. Verify Alertmanager is running:
```bash
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
```
Expected: alertmanager pod(s) Running

5. Check current alerts (apart from the always-firing Watchdog, the list should be empty on a healthy cluster):
```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head -10
kill %1 2>/dev/null
```

NOTE: kube-prometheus-stack includes default Kubernetes alerting rules. KubePodCrashLooping is a standard rule that fires when a container is repeatedly restarting (stuck in CrashLoopBackOff); the exact expression and thresholds vary by chart version.
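
For reference, the upstream rule looks roughly like this (an illustrative sketch only — the resource name, expression, and thresholds vary by chart version, so trust the output of step 2 over this snippet):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-prometheus-stack-kubernetes-apps   # name varies by release
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-apps
      rules:
        - alert: KubePodCrashLooping
          # Fires when a container has been sitting in CrashLoopBackOff
          expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1
          for: 15m
          labels:
            severity: warning
```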
</action>
<verify>
1. kubectl get prometheusrules finds KubePodCrashLooping or equivalent crash alert
2. Alertmanager pod is Running
3. Alertmanager API responds (even if alert list is empty)
</verify>
<done>
KubePodCrashLooping alert rule confirmed present, Alertmanager operational
</done>
</task>

<task type="checkpoint:human-verify" gate="blocking">
<what-built>
Full observability stack:
- TaskPlanner /metrics endpoint (OBS-08)
- Prometheus scraping via ServiceMonitor (OBS-01)
- Alloy collecting logs (OBS-04)
- Loki storing logs (OBS-03)
- Critical alerts configured (OBS-06)
- Grafana dashboards (OBS-02)
</what-built>
<how-to-verify>
1. Open Grafana: https://grafana.kube2.tricnet.de
   - Login: admin / GrafanaAdmin2026

2. Verify dashboards (OBS-02):
   - Go to Dashboards
   - Open "Kubernetes / Compute Resources / Namespace (Pods)" or similar
   - Select namespace: default
   - Confirm TaskPlanner pod metrics visible

3. Verify log queries (OBS-05):
   - Go to Explore
   - Select Loki datasource
   - Enter query: {namespace="default", pod=~"taskplaner.*"}
   - Click Run Query
   - Confirm TaskPlanner logs appear

4. Verify TaskPlanner metrics in Grafana:
   - Go to Explore
   - Select Prometheus datasource
   - Enter query: process_cpu_seconds_total{namespace="default", pod=~"taskplaner.*"}
   - Confirm metric graph appears

5. Verify Grafana accessible with TLS (OBS-07):
   - Confirm https:// in URL bar (no certificate warnings)
</how-to-verify>
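
If an automatable cross-check of step 3 is useful, the same LogQL selector can be run against Loki's HTTP API through a port-forward. This is a hypothetical sketch: it assumes Loki's query endpoint is exposed as svc/loki-gateway on port 80 in the monitoring namespace — adjust the service name and port to whatever 08-02 actually deployed:

```bash
# loki_query: run a LogQL selector against Loki's query_range API behind
# a temporary port-forward (service name and port are assumptions).
loki_query() {
  selector=$1
  kubectl -n monitoring port-forward svc/loki-gateway 3100:80 >/dev/null 2>&1 &
  pf=$!
  sleep "${PF_WAIT:-3}"
  # --data-urlencode keeps the braces/quotes in the selector URL-safe
  curl -s -G "http://localhost:3100/loki/api/v1/query_range" \
    --data-urlencode "query=${selector}"
  rc=$?
  kill "$pf" 2>/dev/null
  return "$rc"
}

# Count matching log streams (assumes cluster access):
# loki_query '{namespace="default", pod=~"taskplaner.*"}' | jq '.data.result | length'
```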
<resume-signal>Type "verified" if all checks pass, or describe what failed</resume-signal>
</task>

</tasks>

<verification>
- [ ] ServiceMonitor created and Prometheus scraping TaskPlanner
- [ ] TaskPlanner metrics visible in Prometheus queries
- [ ] KubePodCrashLooping alert rule exists
- [ ] Alertmanager running and responsive
- [ ] Human verified: Grafana dashboards show cluster metrics
- [ ] Human verified: Grafana can query TaskPlanner logs from Loki
- [ ] Human verified: TaskPlanner metrics visible in Grafana
</verification>

<success_criteria>
1. Prometheus scrapes TaskPlanner /metrics (OBS-01, OBS-08 complete)
2. Grafana dashboards display cluster metrics (OBS-02 verified)
3. TaskPlanner logs queryable in Grafana via Loki (OBS-05 verified)
4. KubePodCrashLooping alert rule confirmed (OBS-06 verified)
5. Grafana accessible via TLS (OBS-07 verified)
</success_criteria>

<output>
After completion, create `.planning/phases/08-observability-stack/08-03-SUMMARY.md`
</output>