diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index 7c4f977..489b51f 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -89,12 +89,12 @@ Plans: 3. Logs from all pods are queryable in Grafana Explore via Loki 4. Alert fires when a pod crashes or restarts repeatedly (KubePodCrashLooping) 5. TaskPlanner /metrics endpoint returns Prometheus-format metrics -**Plans**: TBD +**Plans**: 3 plans Plans: -- [ ] 08-01: kube-prometheus-stack installation (Prometheus + Grafana) -- [ ] 08-02: Loki + Alloy installation for log aggregation -- [ ] 08-03: Critical alerts and TaskPlanner metrics endpoint +- [ ] 08-01-PLAN.md — TaskPlanner /metrics endpoint and ServiceMonitor +- [ ] 08-02-PLAN.md — Promtail to Alloy migration for log collection +- [ ] 08-03-PLAN.md — End-to-end observability verification ### Phase 9: CI Pipeline Hardening **Goal**: Tests run before build - type errors and test failures block deployment @@ -126,13 +126,14 @@ Phases execute in numeric order: 7 -> 8 -> 9 | 5. Search | v1.0 | 3/3 | Complete | 2026-01-31 | | 6. Deployment | v1.0 | 2/2 | Complete | 2026-02-01 | | 7. GitOps Foundation | v2.0 | 2/2 | Complete ✓ | 2026-02-03 | -| 8. Observability Stack | v2.0 | 0/3 | Not started | - | +| 8. Observability Stack | v2.0 | 0/3 | Planned | - | | 9. 
CI Pipeline Hardening | v2.0 | 0/2 | Not started | - | --- *Roadmap created: 2026-01-29* *v2.0 phases added: 2026-02-03* *Phase 7 planned: 2026-02-03* +*Phase 8 planned: 2026-02-03* *Depth: standard* *v1.0 Coverage: 31/31 requirements mapped* *v2.0 Coverage: 17/17 requirements mapped* diff --git a/.planning/phases/08-observability-stack/08-01-PLAN.md b/.planning/phases/08-observability-stack/08-01-PLAN.md new file mode 100644 index 0000000..ac5ee81 --- /dev/null +++ b/.planning/phases/08-observability-stack/08-01-PLAN.md @@ -0,0 +1,174 @@ +--- +phase: 08-observability-stack +plan: 01 +type: execute +wave: 1 +depends_on: [] +files_modified: + - package.json + - src/routes/metrics/+server.ts + - src/lib/server/metrics.ts + - helm/taskplaner/templates/servicemonitor.yaml + - helm/taskplaner/values.yaml +autonomous: true + +must_haves: + truths: + - "TaskPlanner /metrics endpoint returns Prometheus-format text" + - "ServiceMonitor exists in Helm chart templates" + - "Prometheus can discover TaskPlanner via ServiceMonitor" + artifacts: + - path: "src/routes/metrics/+server.ts" + provides: "Prometheus metrics HTTP endpoint" + exports: ["GET"] + - path: "src/lib/server/metrics.ts" + provides: "prom-client registry and metrics definitions" + contains: "collectDefaultMetrics" + - path: "helm/taskplaner/templates/servicemonitor.yaml" + provides: "ServiceMonitor for Prometheus Operator" + contains: "kind: ServiceMonitor" + key_links: + - from: "src/routes/metrics/+server.ts" + to: "src/lib/server/metrics.ts" + via: "import register" + pattern: "import.*register.*from.*metrics" + - from: "helm/taskplaner/templates/servicemonitor.yaml" + to: "tp-app service" + via: "selector matchLabels" + pattern: "selector.*matchLabels" +--- + + +Add Prometheus metrics endpoint to TaskPlanner and ServiceMonitor for scraping + +Purpose: Enable Prometheus to collect application metrics from TaskPlanner (OBS-08, OBS-01) +Output: /metrics endpoint returning prom-client default metrics, 
ServiceMonitor in Helm chart + + + +@/home/tho/.claude/get-shit-done/workflows/execute-plan.md +@/home/tho/.claude/get-shit-done/templates/summary.md + + + +@.planning/PROJECT.md +@.planning/ROADMAP.md +@.planning/STATE.md +@.planning/phases/08-observability-stack/CONTEXT.md +@package.json +@src/routes/health/+server.ts +@helm/taskplaner/values.yaml +@helm/taskplaner/templates/service.yaml + + + + + + Task 1: Add prom-client and create /metrics endpoint + + package.json + src/lib/server/metrics.ts + src/routes/metrics/+server.ts + + + 1. Install prom-client: + ```bash + npm install prom-client + ``` + + 2. Create src/lib/server/metrics.ts: + - Import prom-client's Registry, collectDefaultMetrics + - Create a new Registry instance + - Call collectDefaultMetrics({ register: registry }) to collect Node.js process metrics + - Export the registry + - Keep it minimal - just default metrics (memory, CPU, event loop lag) + + 3. Create src/routes/metrics/+server.ts: + - Import the registry from $lib/server/metrics + - Create GET handler that returns registry.metrics() with Content-Type: text/plain; version=0.0.4 + - Handle errors gracefully (return 500 on failure) + - Pattern follows existing /health endpoint structure + + NOTE: prom-client is the standard Node.js Prometheus client. Use default metrics only - no custom metrics needed for this phase. + + + 1. npm run build completes without errors + 2. npm run dev, then curl http://localhost:5173/metrics returns text starting with "# HELP" or "# TYPE" + 3. Response Content-Type header includes "text/plain" + + + /metrics endpoint returns Prometheus-format metrics including process_cpu_seconds_total, nodejs_heap_size_total_bytes + + + + + Task 2: Add ServiceMonitor to Helm chart + + helm/taskplaner/templates/servicemonitor.yaml + helm/taskplaner/values.yaml + + + 1. 
Create helm/taskplaner/templates/servicemonitor.yaml: + ```yaml + {{- if .Values.metrics.enabled }} + apiVersion: monitoring.coreos.com/v1 + kind: ServiceMonitor + metadata: + name: {{ include "taskplaner.fullname" . }} + labels: + {{- include "taskplaner.labels" . | nindent 4 }} + spec: + selector: + matchLabels: + {{- include "taskplaner.selectorLabels" . | nindent 6 }} + endpoints: + - port: http + path: /metrics + interval: {{ .Values.metrics.interval | default "30s" }} + namespaceSelector: + matchNames: + - {{ .Release.Namespace }} + {{- end }} + ``` + + 2. Update helm/taskplaner/values.yaml - add metrics section: + ```yaml + # Prometheus metrics + metrics: + enabled: true + interval: 30s + ``` + + 3. Ensure the service template exposes port named "http" (check existing service.yaml - it likely already does via targetPort: http) + + NOTE: The ServiceMonitor uses monitoring.coreos.com/v1 API which kube-prometheus-stack provides. The namespaceSelector ensures Prometheus finds TaskPlanner in the default namespace. + + + 1. helm template ./helm/taskplaner includes ServiceMonitor resource + 2. helm template output shows selector matching app.kubernetes.io/name: taskplaner + 3. No helm lint errors + + + ServiceMonitor template renders correctly with selector matching TaskPlanner service, ready for Prometheus to discover + + + + + + +- [ ] npm run build succeeds +- [ ] curl localhost:5173/metrics returns Prometheus-format text +- [ ] helm template ./helm/taskplaner shows ServiceMonitor resource +- [ ] ServiceMonitor selector matches service labels + + + +1. /metrics endpoint returns Prometheus-format metrics (process metrics, heap size, event loop) +2. ServiceMonitor added to Helm chart templates +3. ServiceMonitor enabled by default in values.yaml +4. 
Build and type check pass + + + +After completion, create `.planning/phases/08-observability-stack/08-01-SUMMARY.md` + diff --git a/.planning/phases/08-observability-stack/08-02-PLAN.md b/.planning/phases/08-observability-stack/08-02-PLAN.md new file mode 100644 index 0000000..7f531c7 --- /dev/null +++ b/.planning/phases/08-observability-stack/08-02-PLAN.md @@ -0,0 +1,229 @@ +--- +phase: 08-observability-stack +plan: 02 +type: execute +wave: 1 +depends_on: [] +files_modified: + - helm/alloy/values.yaml (new) + - helm/alloy/Chart.yaml (new) +autonomous: true + +must_haves: + truths: + - "Alloy DaemonSet runs on all nodes" + - "Alloy forwards logs to Loki" + - "Promtail DaemonSet is removed" + artifacts: + - path: "helm/alloy/Chart.yaml" + provides: "Alloy Helm chart wrapper" + contains: "name: alloy" + - path: "helm/alloy/values.yaml" + provides: "Alloy configuration for Loki forwarding" + contains: "loki.write" + key_links: + - from: "Alloy pods" + to: "loki-stack:3100" + via: "loki.write endpoint" + pattern: "endpoint.*loki" +--- + + +Migrate from Promtail to Grafana Alloy for log collection + +Purpose: Replace EOL Promtail (March 2026) with Grafana Alloy DaemonSet (OBS-04) +Output: Alloy DaemonSet forwarding logs to Loki, Promtail removed + + + +@/home/tho/.claude/get-shit-done/workflows/execute-plan.md +@/home/tho/.claude/get-shit-done/templates/summary.md + + + +@.planning/PROJECT.md +@.planning/ROADMAP.md +@.planning/STATE.md +@.planning/phases/08-observability-stack/CONTEXT.md + + + + + + Task 1: Deploy Grafana Alloy via Helm + + helm/alloy/Chart.yaml + helm/alloy/values.yaml + + + 1. Create helm/alloy directory and Chart.yaml as umbrella chart: + ```yaml + apiVersion: v2 + name: alloy + description: Grafana Alloy log collector + version: 0.1.0 + dependencies: + - name: alloy + version: "0.12.*" + repository: https://grafana.github.io/helm-charts + ``` + + 2. 
Create helm/alloy/values.yaml with minimal config for Loki forwarding: + ```yaml + alloy: + alloy: + configMap: + content: | + // Discover pods and collect logs + discovery.kubernetes "pods" { + role = "pod" + } + + // Relabel to extract pod metadata + discovery.relabel "pods" { + targets = discovery.kubernetes.pods.targets + + rule { + source_labels = ["__meta_kubernetes_namespace"] + target_label = "namespace" + } + rule { + source_labels = ["__meta_kubernetes_pod_name"] + target_label = "pod" + } + rule { + source_labels = ["__meta_kubernetes_pod_container_name"] + target_label = "container" + } + } + + // Collect logs from discovered pods + loki.source.kubernetes "pods" { + targets = discovery.relabel.pods.output + forward_to = [loki.write.default.receiver] + } + + // Forward to Loki + loki.write "default" { + endpoint { + url = "http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/push" + } + } + + controller: + type: daemonset + + serviceAccount: + create: true + ``` + + 3. Add Grafana Helm repo and build dependencies: + ```bash + helm repo add grafana https://grafana.github.io/helm-charts + helm repo update + cd helm/alloy && helm dependency build + ``` + + 4. Deploy Alloy to monitoring namespace: + ```bash + helm upgrade --install alloy ./helm/alloy -n monitoring --create-namespace + ``` + + 5. Verify Alloy pods are running: + ```bash + kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy + ``` + Expected: 5 pods (one per node) in Running state + + NOTE: + - Alloy uses River configuration language (not YAML) + - Labels (namespace, pod, container) match existing Promtail labels for query compatibility + - Loki endpoint is cluster-internal: loki-stack.monitoring.svc.cluster.local:3100 + + + 1. kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy shows 5 Running pods + 2. kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=20 shows no errors + 3. 
Alloy logs show "loki.write" component started successfully + + + Alloy DaemonSet deployed with 5 pods collecting logs and forwarding to Loki + + + + + Task 2: Verify log flow and remove Promtail + + (no files - kubectl operations) + + + 1. Generate a test log by restarting TaskPlanner pod: + ```bash + kubectl rollout restart deployment taskplaner + ``` + + 2. Wait for pod to be ready: + ```bash + kubectl rollout status deployment taskplaner --timeout=60s + ``` + + 3. Verify logs appear in Loki via LogCLI or curl: + ```bash + # Query recent TaskPlanner logs via Loki API + kubectl run --rm -it logtest --image=curlimages/curl --restart=Never -- \ + curl -s "http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/query_range" \ + --data-urlencode 'query={namespace="default",pod=~"taskplaner.*"}' \ + --data-urlencode 'limit=5' + ``` + Expected: JSON response with "result" containing log entries + + 4. Once logs confirmed flowing via Alloy, remove Promtail: + ```bash + # Find and delete Promtail release + helm list -n monitoring | grep promtail + # If promtail found: + helm uninstall loki-stack-promtail -n monitoring 2>/dev/null || \ + helm uninstall promtail -n monitoring 2>/dev/null || \ + kubectl delete daemonset -n monitoring -l app=promtail + ``` + + 5. Verify Promtail is gone: + ```bash + kubectl get pods -n monitoring | grep -i promtail + ``` + Expected: No promtail pods + + 6. Verify logs still flowing after Promtail removal (repeat step 3) + + NOTE: Promtail may be installed as part of loki-stack or separately. Check both. + + + 1. Loki API returns TaskPlanner log entries + 2. kubectl get pods -n monitoring shows NO promtail pods + 3. kubectl get pods -n monitoring shows Alloy pods still running + 4. 
Second Loki query after Promtail removal still returns logs + + + Logs confirmed flowing from Alloy to Loki, Promtail DaemonSet removed from cluster + + + + + + +- [ ] Alloy DaemonSet has 5 Running pods (one per node) +- [ ] Alloy pods show no errors in logs +- [ ] Loki API returns TaskPlanner log entries +- [ ] Promtail pods no longer exist +- [ ] Log flow continues after Promtail removal + + + +1. Alloy DaemonSet running on all 5 nodes +2. Logs from TaskPlanner appear in Loki within 60 seconds of generation +3. Promtail DaemonSet completely removed +4. No log collection gap (Alloy verified before Promtail removal) + + + +After completion, create `.planning/phases/08-observability-stack/08-02-SUMMARY.md` + diff --git a/.planning/phases/08-observability-stack/08-03-PLAN.md b/.planning/phases/08-observability-stack/08-03-PLAN.md new file mode 100644 index 0000000..01a6a6f --- /dev/null +++ b/.planning/phases/08-observability-stack/08-03-PLAN.md @@ -0,0 +1,233 @@ +--- +phase: 08-observability-stack +plan: 03 +type: execute +wave: 2 +depends_on: ["08-01", "08-02"] +files_modified: [] +autonomous: false + +must_haves: + truths: + - "Prometheus scrapes TaskPlanner /metrics endpoint" + - "Grafana can query TaskPlanner logs via Loki" + - "KubePodCrashLooping alert rule exists" + artifacts: [] + key_links: + - from: "Prometheus" + to: "TaskPlanner /metrics" + via: "ServiceMonitor" + pattern: "servicemonitor.*taskplaner" + - from: "Grafana Explore" + to: "Loki datasource" + via: "LogQL query" + pattern: "namespace.*default.*taskplaner" +--- + + +Verify end-to-end observability stack: metrics scraping, log queries, and alerting + +Purpose: Confirm all Phase 8 requirements are satisfied (OBS-01 through OBS-08) +Output: Verified observability stack with documented proof of functionality + + + +@/home/tho/.claude/get-shit-done/workflows/execute-plan.md +@/home/tho/.claude/get-shit-done/templates/summary.md + + + +@.planning/PROJECT.md +@.planning/ROADMAP.md +@.planning/STATE.md 
+@.planning/phases/08-observability-stack/CONTEXT.md +@.planning/phases/08-observability-stack/08-01-SUMMARY.md +@.planning/phases/08-observability-stack/08-02-SUMMARY.md + + + + + + Task 1: Deploy TaskPlanner with ServiceMonitor and verify Prometheus scraping + + (no files - deployment and verification) + + + 1. Commit and push the metrics endpoint and ServiceMonitor changes from 08-01: + ```bash + git add . + git commit -m "feat(metrics): add /metrics endpoint and ServiceMonitor + + - Add prom-client for Prometheus metrics + - Expose /metrics endpoint with default Node.js metrics + - Add ServiceMonitor template to Helm chart + + OBS-08, OBS-01" + git push + ``` + + 2. Wait for ArgoCD to sync (or trigger manual sync): + ```bash + # Check ArgoCD sync status + kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}' + # If not synced, wait up to 3 minutes or trigger: + argocd app sync taskplaner --server argocd.tricnet.be --insecure 2>/dev/null || \ + kubectl patch application taskplaner -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{}}}' + ``` + + 3. Wait for deployment to complete: + ```bash + kubectl rollout status deployment taskplaner --timeout=120s + ``` + + 4. Verify ServiceMonitor created: + ```bash + kubectl get servicemonitor taskplaner + ``` + Expected: ServiceMonitor exists + + 5. Verify Prometheus is scraping TaskPlanner: + ```bash + # Port-forward to Prometheus + kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 & + sleep 3 + + # Query for TaskPlanner targets + curl -s "http://localhost:9090/api/v1/targets" | grep -A5 "taskplaner" + + # Kill port-forward + kill %1 2>/dev/null + ``` + Expected: TaskPlanner target shows state: "up" + + 6. 
Query a TaskPlanner metric: + ```bash + kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 & + sleep 3 + curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=process_cpu_seconds_total{namespace="default",pod=~"taskplaner.*"}' | jq '.data.result[0].value' + kill %1 2>/dev/null + ``` + Expected: Returns a numeric value + + NOTE: If ArgoCD sync takes too long, the push from earlier may already have triggered sync automatically. + + + 1. kubectl get servicemonitor taskplaner returns a resource + 2. Prometheus targets API shows TaskPlanner with state "up" + 3. Prometheus query returns process_cpu_seconds_total value for TaskPlanner + + + Prometheus successfully scraping TaskPlanner /metrics endpoint via ServiceMonitor + + + + + Task 2: Verify critical alert rules exist + + (no files - verification only) + + + 1. List PrometheusRules to find pod crash alerting: + ```bash + kubectl get prometheusrules -n monitoring -o name | head -20 + ``` + + 2. Search for KubePodCrashLooping alert: + ```bash + kubectl get prometheusrules -n monitoring -o yaml | grep -A10 "KubePodCrashLooping" + ``` + Expected: Alert rule definition found + + 3. If not found by name, search for crash-related alerts: + ```bash + kubectl get prometheusrules -n monitoring -o yaml | grep -i "crash\|restart\|CrashLoopBackOff" | head -10 + ``` + + 4. Verify Alertmanager is running: + ```bash + kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager + ``` + Expected: alertmanager pod(s) Running + + 5. Check current alerts (should be empty if cluster healthy): + ```bash + kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 & + sleep 2 + curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head -10 + kill %1 2>/dev/null + ``` + + NOTE: kube-prometheus-stack includes default Kubernetes alerting rules. KubePodCrashLooping is a standard rule that fires when a pod restarts more than once in 10 minutes. + + + 1. 
kubectl get prometheusrules finds KubePodCrashLooping or equivalent crash alert + 2. Alertmanager pod is Running + 3. Alertmanager API responds (even if alert list is empty) + + + KubePodCrashLooping alert rule confirmed present, Alertmanager operational + + + + + + Full observability stack: + - TaskPlanner /metrics endpoint (OBS-08) + - Prometheus scraping via ServiceMonitor (OBS-01) + - Alloy collecting logs (OBS-04) + - Loki storing logs (OBS-03) + - Critical alerts configured (OBS-06) + - Grafana dashboards (OBS-02) + + + 1. Open Grafana: https://grafana.kube2.tricnet.de + - Login: admin / GrafanaAdmin2026 + + 2. Verify dashboards (OBS-02): + - Go to Dashboards + - Open "Kubernetes / Compute Resources / Namespace (Pods)" or similar + - Select namespace: default + - Confirm TaskPlanner pod metrics visible + + 3. Verify log queries (OBS-05): + - Go to Explore + - Select Loki datasource + - Enter query: {namespace="default", pod=~"taskplaner.*"} + - Click Run Query + - Confirm TaskPlanner logs appear + + 4. Verify TaskPlanner metrics in Grafana: + - Go to Explore + - Select Prometheus datasource + - Enter query: process_cpu_seconds_total{namespace="default", pod=~"taskplaner.*"} + - Confirm metric graph appears + + 5. Verify Grafana accessible with TLS (OBS-07): + - Confirm https:// in URL bar (no certificate warnings) + + Type "verified" if all checks pass, or describe what failed + + + + + +- [ ] ServiceMonitor created and Prometheus scraping TaskPlanner +- [ ] TaskPlanner metrics visible in Prometheus queries +- [ ] KubePodCrashLooping alert rule exists +- [ ] Alertmanager running and responsive +- [ ] Human verified: Grafana dashboards show cluster metrics +- [ ] Human verified: Grafana can query TaskPlanner logs from Loki +- [ ] Human verified: TaskPlanner metrics visible in Grafana + + + +1. Prometheus scrapes TaskPlanner /metrics (OBS-01, OBS-08 complete) +2. Grafana dashboards display cluster metrics (OBS-02 verified) +3. 
TaskPlanner logs queryable in Grafana via Loki (OBS-05 verified) +4. KubePodCrashLooping alert rule confirmed (OBS-06 verified) +5. Grafana accessible via TLS (OBS-07 verified) + + + +After completion, create `.planning/phases/08-observability-stack/08-03-SUMMARY.md` + diff --git a/.planning/phases/08-observability-stack/CONTEXT.md b/.planning/phases/08-observability-stack/CONTEXT.md new file mode 100644 index 0000000..d824195 --- /dev/null +++ b/.planning/phases/08-observability-stack/CONTEXT.md @@ -0,0 +1,114 @@ +# Phase 8: Observability Stack - Context + +**Goal:** Full visibility into cluster and application health via metrics, logs, and dashboards +**Status:** Mostly pre-existing infrastructure, focusing on gaps + +## Discovery Summary + +The observability stack is largely already installed (15 days running). Phase 8 focuses on: +1. Gaps in existing setup +2. Migration from Promtail to Alloy (Promtail EOL March 2026) +3. TaskPlanner-specific observability + +### What's Already Working + +| Component | Status | Details | +|-----------|--------|---------| +| Prometheus | ✅ Running | kube-prometheus-stack, scraping cluster metrics | +| Grafana | ✅ Running | Accessible at grafana.kube2.tricnet.de (HTTP 200) | +| Loki | ✅ Running | loki-stack-0 pod, configured as Grafana datasource | +| AlertManager | ✅ Running | 35 PrometheusRules configured | +| Node Exporters | ✅ Running | 5 pods across nodes | +| Kube-state-metrics | ✅ Running | Cluster state metrics | +| Promtail | ⚠️ Running | 5 DaemonSet pods - needs migration to Alloy | + +### What's Missing + +| Gap | Requirement | Details | +|-----|-------------|---------| +| TaskPlanner /metrics | OBS-08 | App doesn't expose Prometheus metrics endpoint | +| TaskPlanner ServiceMonitor | OBS-01 | No scraping config for app metrics | +| Alloy migration | OBS-04 | Promtail running but EOL March 2026 | +| Verify Loki queries | OBS-05 | Datasource configured, need to verify logs work | +| Critical alerts verification | OBS-06 
| Rules exist, need to verify KubePodCrashLooping | +| Grafana TLS ingress | OBS-07 | Works via external proxy, not k8s ingress | + +## Infrastructure Context + +### Cluster Details +- k3s cluster with 5 nodes (1 master + 4 workers based on node-exporter count) +- Namespace: `monitoring` for all observability components +- Namespace: `default` for TaskPlanner + +### Grafana Access +- URL: https://grafana.kube2.tricnet.de +- Admin password: `GrafanaAdmin2026` (from secret) +- Service type: ClusterIP (exposed via external proxy, not k8s ingress) +- Datasources configured: Prometheus, Alertmanager, Loki (2x entries) + +### Loki Configuration +- Service: `loki-stack:3100` (ClusterIP) +- Storage: Not checked (likely local filesystem) +- Retention: Not checked + +### Promtail (to be replaced) +- 5 DaemonSet pods running +- Forwards to loki-stack:3100 +- EOL: March 2026 - migrate to Grafana Alloy + +## Decisions + +### From Research (v2.0) +- Use Grafana Alloy instead of Promtail (EOL March 2026) +- Loki monolithic mode with 7-day retention appropriate for single-node +- kube-prometheus-stack is the standard for k8s observability + +### Phase-specific +- **Grafana ingress**: Leave as-is (external proxy works, OBS-07 satisfied) +- **Alloy migration**: Replace Promtail DaemonSet with Alloy DaemonSet +- **TaskPlanner metrics**: Add prom-client to SvelteKit app (standard Node.js client) +- **Alloy labels**: Match existing Promtail labels (namespace, pod, container) for query compatibility + +## Requirements Mapping + +| Requirement | Current State | Phase 8 Action | +|-------------|---------------|----------------| +| OBS-01 | Partial (cluster only) | Add TaskPlanner ServiceMonitor | +| OBS-02 | ✅ Done | Verify dashboards work | +| OBS-03 | ✅ Done | Loki running | +| OBS-04 | ⚠️ Promtail | Migrate to Alloy DaemonSet | +| OBS-05 | Configured | Verify log queries work | +| OBS-06 | 35 rules exist | Verify critical alerts fire | +| OBS-07 | ✅ Done | Grafana accessible via TLS | 
+| OBS-08 | ❌ Missing | Add /metrics endpoint to TaskPlanner | + +## Plan Outline + +1. **08-01**: TaskPlanner metrics endpoint + ServiceMonitor + - Add prom-client to app + - Expose /metrics endpoint + - Create ServiceMonitor for Prometheus scraping + +2. **08-02**: Promtail → Alloy migration + - Deploy Grafana Alloy DaemonSet + - Configure log forwarding to Loki + - Remove Promtail DaemonSet + - Verify logs still flow + +3. **08-03**: Verification + - Verify Grafana can query Loki logs + - Verify TaskPlanner metrics appear in Prometheus + - Verify KubePodCrashLooping alert exists + - End-to-end log flow test + +## Risks + +| Risk | Mitigation | +|------|------------| +| Log gap during Promtail→Alloy switch | Deploy Alloy first, verify working, then remove Promtail | +| prom-client adds overhead | Use minimal default metrics only (process, heap, event loop lag) | +| Alloy config complexity | Start with minimal config matching Promtail behavior | + +--- +*Context gathered: 2026-02-03* +*Decision: Focus on gaps + Alloy migration*
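
Task 2 of 08-03 only greps for the KubePodCrashLooping rule. For orientation, the rule shipped by recent kube-prometheus-stack versions looks approximately like the sketch below; the exact expression, `for` duration, and severity vary by chart version, so verify against the cluster's own PrometheusRule rather than this sketch.

```yaml
# Approximate shape of the upstream rule - not authoritative
- alert: KubePodCrashLooping
  expr: |
    max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1
  for: 15m
  labels:
    severity: warning
  annotations:
    description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "CrashLoopBackOff").'
```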
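
Plan 08-01 describes the /metrics endpoint in prose only. As a rough sketch of the Prometheus text exposition format the endpoint must return (`# HELP` / `# TYPE` lines, served as `text/plain; version=0.0.4`), here is a minimal example using only Node builtins. The metric names (`app_heap_used_bytes`, `app_cpu_user_seconds_total`) are illustrative assumptions; prom-client's `collectDefaultMetrics` emits standard names such as `process_cpu_seconds_total` and `nodejs_heap_size_total_bytes` instead.

```typescript
// Sketch of the Prometheus text format the /metrics endpoint returns.
// Metric names here are illustrative; the real endpoint delegates to
// prom-client's registry.metrics() rather than hand-rolling this.
function renderMetrics(): string {
  const mem = process.memoryUsage(); // bytes
  const cpu = process.cpuUsage();    // microseconds
  return [
    "# HELP app_heap_used_bytes Heap memory currently in use.",
    "# TYPE app_heap_used_bytes gauge",
    `app_heap_used_bytes ${mem.heapUsed}`,
    "# HELP app_cpu_user_seconds_total Total user CPU time in seconds.",
    "# TYPE app_cpu_user_seconds_total counter",
    `app_cpu_user_seconds_total ${cpu.user / 1e6}`,
  ].join("\n") + "\n";
}

console.log(renderMetrics());
```

The GET handler in `src/routes/metrics/+server.ts` would return this string with the `text/plain; version=0.0.4` content type, which is what Prometheus expects from a scrape target.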