docs(08): create phase plan
Phase 08: Observability Stack - 3 plans in 2 waves
- Wave 1: 08-01 (metrics), 08-02 (Alloy) - parallel
- Wave 2: 08-03 (verification) - depends on both

Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -89,12 +89,12 @@ Plans:
 3. Logs from all pods are queryable in Grafana Explore via Loki
 4. Alert fires when a pod crashes or restarts repeatedly (KubePodCrashLooping)
 5. TaskPlanner /metrics endpoint returns Prometheus-format metrics
-**Plans**: TBD
+**Plans**: 3 plans

 Plans:
-- [ ] 08-01: kube-prometheus-stack installation (Prometheus + Grafana)
-- [ ] 08-02: Loki + Alloy installation for log aggregation
-- [ ] 08-03: Critical alerts and TaskPlanner metrics endpoint
+- [ ] 08-01-PLAN.md — TaskPlanner /metrics endpoint and ServiceMonitor
+- [ ] 08-02-PLAN.md — Promtail to Alloy migration for log collection
+- [ ] 08-03-PLAN.md — End-to-end observability verification

 ### Phase 9: CI Pipeline Hardening
 **Goal**: Tests run before build - type errors and test failures block deployment
@@ -126,13 +126,14 @@ Phases execute in numeric order: 7 -> 8 -> 9
 | 5. Search | v1.0 | 3/3 | Complete | 2026-01-31 |
 | 6. Deployment | v1.0 | 2/2 | Complete | 2026-02-01 |
 | 7. GitOps Foundation | v2.0 | 2/2 | Complete ✓ | 2026-02-03 |
-| 8. Observability Stack | v2.0 | 0/3 | Not started | - |
+| 8. Observability Stack | v2.0 | 0/3 | Planned | - |
 | 9. CI Pipeline Hardening | v2.0 | 0/2 | Not started | - |

 ---
 *Roadmap created: 2026-01-29*
 *v2.0 phases added: 2026-02-03*
 *Phase 7 planned: 2026-02-03*
+*Phase 8 planned: 2026-02-03*
 *Depth: standard*
 *v1.0 Coverage: 31/31 requirements mapped*
 *v2.0 Coverage: 17/17 requirements mapped*

.planning/phases/08-observability-stack/08-01-PLAN.md (new file, 174 lines)
@@ -0,0 +1,174 @@
---
phase: 08-observability-stack
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - package.json
  - src/routes/metrics/+server.ts
  - src/lib/server/metrics.ts
  - helm/taskplaner/templates/servicemonitor.yaml
  - helm/taskplaner/values.yaml
autonomous: true

must_haves:
  truths:
    - "TaskPlanner /metrics endpoint returns Prometheus-format text"
    - "ServiceMonitor exists in Helm chart templates"
    - "Prometheus can discover TaskPlanner via ServiceMonitor"
  artifacts:
    - path: "src/routes/metrics/+server.ts"
      provides: "Prometheus metrics HTTP endpoint"
      exports: ["GET"]
    - path: "src/lib/server/metrics.ts"
      provides: "prom-client registry and metrics definitions"
      contains: "collectDefaultMetrics"
    - path: "helm/taskplaner/templates/servicemonitor.yaml"
      provides: "ServiceMonitor for Prometheus Operator"
      contains: "kind: ServiceMonitor"
  key_links:
    - from: "src/routes/metrics/+server.ts"
      to: "src/lib/server/metrics.ts"
      via: "import register"
      pattern: "import.*register.*from.*metrics"
    - from: "helm/taskplaner/templates/servicemonitor.yaml"
      to: "tp-app service"
      via: "selector matchLabels"
      pattern: "selector.*matchLabels"
---

<objective>
Add Prometheus metrics endpoint to TaskPlanner and ServiceMonitor for scraping

Purpose: Enable Prometheus to collect application metrics from TaskPlanner (OBS-08, OBS-01)
Output: /metrics endpoint returning prom-client default metrics, ServiceMonitor in Helm chart
</objective>

<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
@package.json
@src/routes/health/+server.ts
@helm/taskplaner/values.yaml
@helm/taskplaner/templates/service.yaml
</context>

<tasks>

<task type="auto">
<name>Task 1: Add prom-client and create /metrics endpoint</name>
<files>
package.json
src/lib/server/metrics.ts
src/routes/metrics/+server.ts
</files>
<action>
1. Install prom-client:

```bash
npm install prom-client
```

2. Create src/lib/server/metrics.ts:
- Import prom-client's Registry and collectDefaultMetrics
- Create a new Registry instance
- Call collectDefaultMetrics({ register: registry }) to collect Node.js process metrics
- Export the registry
- Keep it minimal - just default metrics (memory, CPU, event loop lag)
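A minimal sketch of what this module could look like, assuming prom-client's `Registry` / `collectDefaultMetrics` API; the executor owns the final shape:

```typescript
// src/lib/server/metrics.ts (illustrative sketch, not the final implementation)
import { Registry, collectDefaultMetrics } from 'prom-client';

// Single shared registry for the app
export const registry = new Registry();

// Default Node.js process metrics: memory, CPU, event loop lag, GC
collectDefaultMetrics({ register: registry });
```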

3. Create src/routes/metrics/+server.ts:
- Import the registry from $lib/server/metrics
- Create GET handler that returns registry.metrics() with Content-Type: text/plain; version=0.0.4
- Handle errors gracefully (return 500 on failure)
- Pattern follows existing /health endpoint structure
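A rough sketch of the handler, assuming a standard SvelteKit GET endpoint and that `registry.metrics()` is async (it resolves to the Prometheus text exposition format in current prom-client versions):

```typescript
// src/routes/metrics/+server.ts (illustrative sketch)
import type { RequestHandler } from '@sveltejs/kit';
import { registry } from '$lib/server/metrics';

export const GET: RequestHandler = async () => {
  try {
    const body = await registry.metrics();
    return new Response(body, {
      // registry.contentType is "text/plain; version=0.0.4; charset=utf-8"
      headers: { 'Content-Type': registry.contentType }
    });
  } catch (err) {
    console.error('metrics collection failed', err);
    return new Response('metrics collection failed', { status: 500 });
  }
};
```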

NOTE: prom-client is the standard Node.js Prometheus client. Use default metrics only - no custom metrics needed for this phase.
</action>
<verify>
1. npm run build completes without errors
2. npm run dev, then curl http://localhost:5173/metrics returns text starting with "# HELP" or "# TYPE"
3. Response Content-Type header includes "text/plain"
</verify>
<done>
/metrics endpoint returns Prometheus-format metrics including process_cpu_seconds_total, nodejs_heap_size_total_bytes
</done>
</task>

<task type="auto">
|
||||||
|
<name>Task 2: Add ServiceMonitor to Helm chart</name>
|
||||||
|
<files>
|
||||||
|
helm/taskplaner/templates/servicemonitor.yaml
|
||||||
|
helm/taskplaner/values.yaml
|
||||||
|
</files>
|
||||||
|
<action>
|
||||||
|
1. Create helm/taskplaner/templates/servicemonitor.yaml:
|
||||||
|
```yaml
|
||||||
|
{{- if .Values.metrics.enabled }}
|
||||||
|
apiVersion: monitoring.coreos.com/v1
|
||||||
|
kind: ServiceMonitor
|
||||||
|
metadata:
|
||||||
|
name: {{ include "taskplaner.fullname" . }}
|
||||||
|
labels:
|
||||||
|
{{- include "taskplaner.labels" . | nindent 4 }}
|
||||||
|
spec:
|
||||||
|
selector:
|
||||||
|
matchLabels:
|
||||||
|
{{- include "taskplaner.selectorLabels" . | nindent 6 }}
|
||||||
|
endpoints:
|
||||||
|
- port: http
|
||||||
|
path: /metrics
|
||||||
|
interval: {{ .Values.metrics.interval | default "30s" }}
|
||||||
|
namespaceSelector:
|
||||||
|
matchNames:
|
||||||
|
- {{ .Release.Namespace }}
|
||||||
|
{{- end }}
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Update helm/taskplaner/values.yaml - add metrics section:
|
||||||
|
```yaml
|
||||||
|
# Prometheus metrics
|
||||||
|
metrics:
|
||||||
|
enabled: true
|
||||||
|
interval: 30s
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Ensure the service template exposes port named "http" (check existing service.yaml - it likely already does via targetPort: http)
|
||||||
|
|
||||||
|
NOTE: The ServiceMonitor uses monitoring.coreos.com/v1 API which kube-prometheus-stack provides. The namespaceSelector ensures Prometheus finds TaskPlanner in the default namespace.
|
||||||
|
</action>
<verify>
1. helm template ./helm/taskplaner includes ServiceMonitor resource
2. helm template output shows selector matching app.kubernetes.io/name: taskplaner
3. No helm lint errors
</verify>
<done>
ServiceMonitor template renders correctly with selector matching TaskPlanner service, ready for Prometheus to discover
</done>
</task>

</tasks>

<verification>
- [ ] npm run build succeeds
- [ ] curl localhost:5173/metrics returns Prometheus-format text
- [ ] helm template ./helm/taskplaner shows ServiceMonitor resource
- [ ] ServiceMonitor selector matches service labels
</verification>

<success_criteria>
1. /metrics endpoint returns Prometheus-format metrics (process metrics, heap size, event loop)
2. ServiceMonitor added to Helm chart templates
3. ServiceMonitor enabled by default in values.yaml
4. Build and type check pass
</success_criteria>

<output>
After completion, create `.planning/phases/08-observability-stack/08-01-SUMMARY.md`
</output>

.planning/phases/08-observability-stack/08-02-PLAN.md (new file, 229 lines)
@@ -0,0 +1,229 @@
---
phase: 08-observability-stack
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
  - helm/alloy/values.yaml (new)
  - helm/alloy/Chart.yaml (new)
autonomous: true

must_haves:
  truths:
    - "Alloy DaemonSet runs on all nodes"
    - "Alloy forwards logs to Loki"
    - "Promtail DaemonSet is removed"
  artifacts:
    - path: "helm/alloy/Chart.yaml"
      provides: "Alloy Helm chart wrapper"
      contains: "name: alloy"
    - path: "helm/alloy/values.yaml"
      provides: "Alloy configuration for Loki forwarding"
      contains: "loki.write"
  key_links:
    - from: "Alloy pods"
      to: "loki-stack:3100"
      via: "loki.write endpoint"
      pattern: "endpoint.*loki"
---

<objective>
Migrate from Promtail to Grafana Alloy for log collection

Purpose: Replace EOL Promtail (March 2026) with Grafana Alloy DaemonSet (OBS-04)
Output: Alloy DaemonSet forwarding logs to Loki, Promtail removed
</objective>

<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Deploy Grafana Alloy via Helm</name>
<files>
helm/alloy/Chart.yaml
helm/alloy/values.yaml
</files>
<action>
1. Create helm/alloy directory and Chart.yaml as umbrella chart:

```yaml
apiVersion: v2
name: alloy
description: Grafana Alloy log collector
version: 0.1.0
dependencies:
  - name: alloy
    version: "0.12.*"
    repository: https://grafana.github.io/helm-charts
```

2. Create helm/alloy/values.yaml with minimal config for Loki forwarding:

```yaml
alloy:
  alloy:
    configMap:
      content: |
        // Discover pods and collect logs
        discovery.kubernetes "pods" {
          role = "pod"
        }

        // Relabel to extract pod metadata
        discovery.relabel "pods" {
          targets = discovery.kubernetes.pods.targets

          rule {
            source_labels = ["__meta_kubernetes_namespace"]
            target_label = "namespace"
          }
          rule {
            source_labels = ["__meta_kubernetes_pod_name"]
            target_label = "pod"
          }
          rule {
            source_labels = ["__meta_kubernetes_pod_container_name"]
            target_label = "container"
          }
        }

        // Collect logs from discovered pods
        loki.source.kubernetes "pods" {
          targets = discovery.relabel.pods.output
          forward_to = [loki.write.default.receiver]
        }

        // Forward to Loki
        loki.write "default" {
          endpoint {
            url = "http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/push"
          }
        }

  controller:
    type: daemonset

  serviceAccount:
    create: true
```

3. Add Grafana Helm repo and build dependencies:

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
cd helm/alloy && helm dependency build
```

4. Deploy Alloy to monitoring namespace:

```bash
helm upgrade --install alloy ./helm/alloy -n monitoring --create-namespace
```

5. Verify Alloy pods are running:

```bash
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
```

Expected: 5 pods (one per node) in Running state

NOTE:
- Alloy uses River configuration language (not YAML)
- Labels (namespace, pod, container) match existing Promtail labels for query compatibility
- Loki endpoint is cluster-internal: loki-stack.monitoring.svc.cluster.local:3100
</action>
<verify>
1. kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy shows 5 Running pods
2. kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=20 shows no errors
3. Alloy logs show "loki.write" component started successfully
</verify>
<done>
Alloy DaemonSet deployed with 5 pods collecting logs and forwarding to Loki
</done>
</task>

<task type="auto">
|
||||||
|
<name>Task 2: Verify log flow and remove Promtail</name>
|
||||||
|
<files>
|
||||||
|
(no files - kubectl operations)
|
||||||
|
</files>
|
||||||
|
<action>
|
||||||
|
1. Generate a test log by restarting TaskPlanner pod:
|
||||||
|
```bash
|
||||||
|
kubectl rollout restart deployment taskplaner
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Wait for pod to be ready:
|
||||||
|
```bash
|
||||||
|
kubectl rollout status deployment taskplaner --timeout=60s
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Verify logs appear in Loki via LogCLI or curl:
|
||||||
|
```bash
|
||||||
|
# Query recent TaskPlanner logs via Loki API
|
||||||
|
kubectl run --rm -it logtest --image=curlimages/curl --restart=Never -- \
|
||||||
|
curl -s "http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/query_range" \
|
||||||
|
--data-urlencode 'query={namespace="default",pod=~"taskplaner.*"}' \
|
||||||
|
--data-urlencode 'limit=5'
|
||||||
|
```
|
||||||
|
Expected: JSON response with "result" containing log entries
|
||||||
|
|
||||||
|
4. Once logs confirmed flowing via Alloy, remove Promtail:
|
||||||
|
```bash
|
||||||
|
# Find and delete Promtail release
|
||||||
|
helm list -n monitoring | grep promtail
|
||||||
|
# If promtail found:
|
||||||
|
helm uninstall loki-stack-promtail -n monitoring 2>/dev/null || \
|
||||||
|
helm uninstall promtail -n monitoring 2>/dev/null || \
|
||||||
|
kubectl delete daemonset -n monitoring -l app=promtail
|
||||||
|
```
|
||||||
|
|
||||||
|
5. Verify Promtail is gone:
|
||||||
|
```bash
|
||||||
|
kubectl get pods -n monitoring | grep -i promtail
|
||||||
|
```
|
||||||
|
Expected: No promtail pods
|
||||||
|
|
||||||
|
6. Verify logs still flowing after Promtail removal (repeat step 3)
|
||||||
|
|
||||||
|
NOTE: Promtail may be installed as part of loki-stack or separately. Check both.
|
||||||
|
</action>
|
||||||
|
<verify>
|
||||||
|
1. Loki API returns TaskPlanner log entries
|
||||||
|
2. kubectl get pods -n monitoring shows NO promtail pods
|
||||||
|
3. kubectl get pods -n monitoring shows Alloy pods still running
|
||||||
|
4. Second Loki query after Promtail removal still returns logs
|
||||||
|
</verify>
|
||||||
|
<done>
|
||||||
|
Logs confirmed flowing from Alloy to Loki, Promtail DaemonSet removed from cluster
|
||||||
|
</done>
|
||||||
|
</task>
|
||||||
|
|
||||||
|
</tasks>

<verification>
- [ ] Alloy DaemonSet has 5 Running pods (one per node)
- [ ] Alloy pods show no errors in logs
- [ ] Loki API returns TaskPlanner log entries
- [ ] Promtail pods no longer exist
- [ ] Log flow continues after Promtail removal
</verification>

<success_criteria>
1. Alloy DaemonSet running on all 5 nodes
2. Logs from TaskPlanner appear in Loki within 60 seconds of generation
3. Promtail DaemonSet completely removed
4. No log collection gap (Alloy verified before Promtail removal)
</success_criteria>

<output>
After completion, create `.planning/phases/08-observability-stack/08-02-SUMMARY.md`
</output>

.planning/phases/08-observability-stack/08-03-PLAN.md (new file, 233 lines)
@@ -0,0 +1,233 @@
---
phase: 08-observability-stack
plan: 03
type: execute
wave: 2
depends_on: ["08-01", "08-02"]
files_modified: []
autonomous: false

must_haves:
  truths:
    - "Prometheus scrapes TaskPlanner /metrics endpoint"
    - "Grafana can query TaskPlanner logs via Loki"
    - "KubePodCrashLooping alert rule exists"
  artifacts: []
  key_links:
    - from: "Prometheus"
      to: "TaskPlanner /metrics"
      via: "ServiceMonitor"
      pattern: "servicemonitor.*taskplaner"
    - from: "Grafana Explore"
      to: "Loki datasource"
      via: "LogQL query"
      pattern: "namespace.*default.*taskplaner"
---

<objective>
Verify end-to-end observability stack: metrics scraping, log queries, and alerting

Purpose: Confirm all Phase 8 requirements are satisfied (OBS-01 through OBS-08)
Output: Verified observability stack with documented proof of functionality
</objective>

<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
@.planning/phases/08-observability-stack/08-01-SUMMARY.md
@.planning/phases/08-observability-stack/08-02-SUMMARY.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Deploy TaskPlanner with ServiceMonitor and verify Prometheus scraping</name>
<files>
(no files - deployment and verification)
</files>
<action>
1. Commit and push the metrics endpoint and ServiceMonitor changes from 08-01:

```bash
git add .
git commit -m "feat(metrics): add /metrics endpoint and ServiceMonitor

- Add prom-client for Prometheus metrics
- Expose /metrics endpoint with default Node.js metrics
- Add ServiceMonitor template to Helm chart

OBS-08, OBS-01"
git push
```

2. Wait for ArgoCD to sync (or trigger manual sync):

```bash
# Check ArgoCD sync status
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
# If not synced, wait up to 3 minutes or trigger:
argocd app sync taskplaner --server argocd.tricnet.be --insecure 2>/dev/null || \
  kubectl patch application taskplaner -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{}}}'
```

3. Wait for deployment to complete:

```bash
kubectl rollout status deployment taskplaner --timeout=120s
```

4. Verify ServiceMonitor created:

```bash
kubectl get servicemonitor taskplaner
```

Expected: ServiceMonitor exists

5. Verify Prometheus is scraping TaskPlanner:

```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 3

# Query for TaskPlanner targets
curl -s "http://localhost:9090/api/v1/targets" | grep -A5 "taskplaner"

# Kill port-forward
kill %1 2>/dev/null
```

Expected: TaskPlanner target shows state: "up"

6. Query a TaskPlanner metric:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 3
curl -sG "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=process_cpu_seconds_total{namespace="default",pod=~"taskplaner.*"}' | jq '.data.result[0].value'
kill %1 2>/dev/null
```

Expected: Returns a numeric value

NOTE: If ArgoCD sync takes too long, the push from earlier may already have triggered sync automatically.
</action>
<verify>
1. kubectl get servicemonitor taskplaner returns a resource
2. Prometheus targets API shows TaskPlanner with state "up"
3. Prometheus query returns process_cpu_seconds_total value for TaskPlanner
</verify>
<done>
Prometheus successfully scraping TaskPlanner /metrics endpoint via ServiceMonitor
</done>
</task>

<task type="auto">
|
||||||
|
<name>Task 2: Verify critical alert rules exist</name>
|
||||||
|
<files>
|
||||||
|
(no files - verification only)
|
||||||
|
</files>
|
||||||
|
<action>
|
||||||
|
1. List PrometheusRules to find pod crash alerting:
|
||||||
|
```bash
|
||||||
|
kubectl get prometheusrules -n monitoring -o name | head -20
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Search for KubePodCrashLooping alert:
|
||||||
|
```bash
|
||||||
|
kubectl get prometheusrules -n monitoring -o yaml | grep -A10 "KubePodCrashLooping"
|
||||||
|
```
|
||||||
|
Expected: Alert rule definition found
|
||||||
|
|
||||||
|
3. If not found by name, search for crash-related alerts:
|
||||||
|
```bash
|
||||||
|
kubectl get prometheusrules -n monitoring -o yaml | grep -i "crash\|restart\|CrashLoopBackOff" | head -10
|
||||||
|
```
|
||||||
|
|
||||||
|
4. Verify Alertmanager is running:
|
||||||
|
```bash
|
||||||
|
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
|
||||||
|
```
|
||||||
|
Expected: alertmanager pod(s) Running
|
||||||
|
|
||||||
|
5. Check current alerts (should be empty if cluster healthy):
|
||||||
|
```bash
|
||||||
|
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
|
||||||
|
sleep 2
|
||||||
|
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head -10
|
||||||
|
kill %1 2>/dev/null
|
||||||
|
```
|
||||||
|
|
||||||
|
NOTE: kube-prometheus-stack includes default Kubernetes alerting rules. KubePodCrashLooping is a standard rule that fires when a pod restarts more than once in 10 minutes.
|
||||||
|
</action>
|
||||||
|
<verify>
|
||||||
|
1. kubectl get prometheusrules finds KubePodCrashLooping or equivalent crash alert
|
||||||
|
2. Alertmanager pod is Running
|
||||||
|
3. Alertmanager API responds (even if alert list is empty)
|
||||||
|
</verify>
|
||||||
|
<done>
|
||||||
|
KubePodCrashLooping alert rule confirmed present, Alertmanager operational
|
||||||
|
</done>
|
||||||
|
</task>
|
||||||
|
|
||||||
|
<task type="checkpoint:human-verify" gate="blocking">
|
||||||
|
<what-built>
|
||||||
|
Full observability stack:
|
||||||
|
- TaskPlanner /metrics endpoint (OBS-08)
|
||||||
|
- Prometheus scraping via ServiceMonitor (OBS-01)
|
||||||
|
- Alloy collecting logs (OBS-04)
|
||||||
|
- Loki storing logs (OBS-03)
|
||||||
|
- Critical alerts configured (OBS-06)
|
||||||
|
- Grafana dashboards (OBS-02)
|
||||||
|
</what-built>
|
||||||
|
<how-to-verify>
|
||||||
|
1. Open Grafana: https://grafana.kube2.tricnet.de
|
||||||
|
- Login: admin / GrafanaAdmin2026
|
||||||
|
|
||||||
|
2. Verify dashboards (OBS-02):
|
||||||
|
- Go to Dashboards
|
||||||
|
- Open "Kubernetes / Compute Resources / Namespace (Pods)" or similar
|
||||||
|
- Select namespace: default
|
||||||
|
- Confirm TaskPlanner pod metrics visible
|
||||||
|
|
||||||
|
3. Verify log queries (OBS-05):
|
||||||
|
- Go to Explore
|
||||||
|
- Select Loki datasource
|
||||||
|
- Enter query: {namespace="default", pod=~"taskplaner.*"}
|
||||||
|
- Click Run Query
|
||||||
|
- Confirm TaskPlanner logs appear
|
||||||
|
|
||||||
|
4. Verify TaskPlanner metrics in Grafana:
|
||||||
|
- Go to Explore
|
||||||
|
- Select Prometheus datasource
|
||||||
|
- Enter query: process_cpu_seconds_total{namespace="default", pod=~"taskplaner.*"}
|
||||||
|
- Confirm metric graph appears
|
||||||
|
|
||||||
|
5. Verify Grafana accessible with TLS (OBS-07):
|
||||||
|
- Confirm https:// in URL bar (no certificate warnings)
|
||||||
|
</how-to-verify>
|
||||||
|
<resume-signal>Type "verified" if all checks pass, or describe what failed</resume-signal>
|
||||||
|
</task>
|
||||||
|
|
||||||
|
</tasks>
|
||||||
|
|
||||||
|
<verification>
- [ ] ServiceMonitor created and Prometheus scraping TaskPlanner
- [ ] TaskPlanner metrics visible in Prometheus queries
- [ ] KubePodCrashLooping alert rule exists
- [ ] Alertmanager running and responsive
- [ ] Human verified: Grafana dashboards show cluster metrics
- [ ] Human verified: Grafana can query TaskPlanner logs from Loki
- [ ] Human verified: TaskPlanner metrics visible in Grafana
</verification>

<success_criteria>
1. Prometheus scrapes TaskPlanner /metrics (OBS-01, OBS-08 complete)
2. Grafana dashboards display cluster metrics (OBS-02 verified)
3. TaskPlanner logs queryable in Grafana via Loki (OBS-05 verified)
4. KubePodCrashLooping alert rule confirmed (OBS-06 verified)
5. Grafana accessible via TLS (OBS-07 verified)
</success_criteria>

<output>
After completion, create `.planning/phases/08-observability-stack/08-03-SUMMARY.md`
</output>

.planning/phases/08-observability-stack/CONTEXT.md (new file, 114 lines)
@@ -0,0 +1,114 @@
# Phase 8: Observability Stack - Context

**Goal:** Full visibility into cluster and application health via metrics, logs, and dashboards
**Status:** Mostly pre-existing infrastructure, focusing on gaps

## Discovery Summary

The observability stack is largely already installed (15 days running). Phase 8 focuses on:
1. Gaps in existing setup
2. Migration from Promtail to Alloy (Promtail EOL March 2026)
3. TaskPlanner-specific observability

### What's Already Working

| Component | Status | Details |
|-----------|--------|---------|
| Prometheus | ✅ Running | kube-prometheus-stack, scraping cluster metrics |
| Grafana | ✅ Running | Accessible at grafana.kube2.tricnet.de (HTTP 200) |
| Loki | ✅ Running | loki-stack-0 pod, configured as Grafana datasource |
| AlertManager | ✅ Running | 35 PrometheusRules configured |
| Node Exporters | ✅ Running | 5 pods across nodes |
| Kube-state-metrics | ✅ Running | Cluster state metrics |
| Promtail | ⚠️ Running | 5 DaemonSet pods - needs migration to Alloy |

### What's Missing

| Gap | Requirement | Details |
|-----|-------------|---------|
| TaskPlanner /metrics | OBS-08 | App doesn't expose Prometheus metrics endpoint |
| TaskPlanner ServiceMonitor | OBS-01 | No scraping config for app metrics |
| Alloy migration | OBS-04 | Promtail running but EOL March 2026 |
| Verify Loki queries | OBS-05 | Datasource configured, need to verify logs work |
| Critical alerts verification | OBS-06 | Rules exist, need to verify KubePodCrashLooping |
| Grafana TLS ingress | OBS-07 | Works via external proxy, not k8s ingress |

## Infrastructure Context

### Cluster Details
- k3s cluster with 5 nodes (1 master + 4 workers based on node-exporter count)
- Namespace: `monitoring` for all observability components
- Namespace: `default` for TaskPlanner

### Grafana Access
- URL: https://grafana.kube2.tricnet.de
- Admin password: `GrafanaAdmin2026` (from secret)
- Service type: ClusterIP (exposed via external proxy, not k8s ingress)
- Datasources configured: Prometheus, Alertmanager, Loki (2x entries)

### Loki Configuration
- Service: `loki-stack:3100` (ClusterIP)
- Storage: Not checked (likely local filesystem)
- Retention: Not checked

### Promtail (to be replaced)
- 5 DaemonSet pods running
- Forwards to loki-stack:3100
- EOL: March 2026 - migrate to Grafana Alloy

## Decisions

### From Research (v2.0)
- Use Grafana Alloy instead of Promtail (EOL March 2026)
- Loki monolithic mode with 7-day retention appropriate for single-node
- kube-prometheus-stack is the standard for k8s observability

### Phase-specific
- **Grafana ingress**: Leave as-is (external proxy works, OBS-07 satisfied)
- **Alloy migration**: Replace Promtail DaemonSet with Alloy DaemonSet
- **TaskPlanner metrics**: Add prom-client to SvelteKit app (standard Node.js client)
- **Alloy labels**: Match existing Promtail labels (namespace, pod, container) for query compatibility

## Requirements Mapping

| Requirement | Current State | Phase 8 Action |
|-------------|---------------|----------------|
| OBS-01 | Partial (cluster only) | Add TaskPlanner ServiceMonitor |
| OBS-02 | ✅ Done | Verify dashboards work |
| OBS-03 | ✅ Done | Loki running |
| OBS-04 | ⚠️ Promtail | Migrate to Alloy DaemonSet |
| OBS-05 | Configured | Verify log queries work |
| OBS-06 | 35 rules exist | Verify critical alerts fire |
| OBS-07 | ✅ Done | Grafana accessible via TLS |
| OBS-08 | ❌ Missing | Add /metrics endpoint to TaskPlanner |

## Plan Outline

1. **08-01**: TaskPlanner metrics endpoint + ServiceMonitor
   - Add prom-client to app
   - Expose /metrics endpoint
   - Create ServiceMonitor for Prometheus scraping

2. **08-02**: Promtail → Alloy migration
   - Deploy Grafana Alloy DaemonSet
   - Configure log forwarding to Loki
   - Remove Promtail DaemonSet
   - Verify logs still flow

3. **08-03**: Verification
   - Verify Grafana can query Loki logs
   - Verify TaskPlanner metrics appear in Prometheus
   - Verify KubePodCrashLooping alert exists
   - End-to-end log flow test

## Risks

| Risk | Mitigation |
|------|------------|
| Log gap during Promtail→Alloy switch | Deploy Alloy first, verify working, then remove Promtail |
| prom-client adds overhead | Use minimal default metrics (process, http request duration) |
| Alloy config complexity | Start with minimal config matching Promtail behavior |

---

*Context gathered: 2026-02-03*
*Decision: Focus on gaps + Alloy migration*