diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index 7c4f977..489b51f 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -89,12 +89,12 @@ Plans:
3. Logs from all pods are queryable in Grafana Explore via Loki
4. Alert fires when a pod crashes or restarts repeatedly (KubePodCrashLooping)
5. TaskPlanner /metrics endpoint returns Prometheus-format metrics
-**Plans**: TBD
+**Plans**: 3 plans
Plans:
-- [ ] 08-01: kube-prometheus-stack installation (Prometheus + Grafana)
-- [ ] 08-02: Loki + Alloy installation for log aggregation
-- [ ] 08-03: Critical alerts and TaskPlanner metrics endpoint
+- [ ] 08-01-PLAN.md — TaskPlanner /metrics endpoint and ServiceMonitor
+- [ ] 08-02-PLAN.md — Promtail to Alloy migration for log collection
+- [ ] 08-03-PLAN.md — End-to-end observability verification
### Phase 9: CI Pipeline Hardening
**Goal**: Tests run before build - type errors and test failures block deployment
@@ -126,13 +126,14 @@ Phases execute in numeric order: 7 -> 8 -> 9
| 5. Search | v1.0 | 3/3 | Complete | 2026-01-31 |
| 6. Deployment | v1.0 | 2/2 | Complete | 2026-02-01 |
| 7. GitOps Foundation | v2.0 | 2/2 | Complete ✓ | 2026-02-03 |
-| 8. Observability Stack | v2.0 | 0/3 | Not started | - |
+| 8. Observability Stack | v2.0 | 0/3 | Planned | - |
| 9. CI Pipeline Hardening | v2.0 | 0/2 | Not started | - |
---
*Roadmap created: 2026-01-29*
*v2.0 phases added: 2026-02-03*
*Phase 7 planned: 2026-02-03*
+*Phase 8 planned: 2026-02-03*
*Depth: standard*
*v1.0 Coverage: 31/31 requirements mapped*
*v2.0 Coverage: 17/17 requirements mapped*
diff --git a/.planning/phases/08-observability-stack/08-01-PLAN.md b/.planning/phases/08-observability-stack/08-01-PLAN.md
new file mode 100644
index 0000000..ac5ee81
--- /dev/null
+++ b/.planning/phases/08-observability-stack/08-01-PLAN.md
@@ -0,0 +1,174 @@
+---
+phase: 08-observability-stack
+plan: 01
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+ - package.json
+ - src/routes/metrics/+server.ts
+ - src/lib/server/metrics.ts
+ - helm/taskplaner/templates/servicemonitor.yaml
+ - helm/taskplaner/values.yaml
+autonomous: true
+
+must_haves:
+ truths:
+ - "TaskPlanner /metrics endpoint returns Prometheus-format text"
+ - "ServiceMonitor exists in Helm chart templates"
+ - "Prometheus can discover TaskPlanner via ServiceMonitor"
+ artifacts:
+ - path: "src/routes/metrics/+server.ts"
+ provides: "Prometheus metrics HTTP endpoint"
+ exports: ["GET"]
+ - path: "src/lib/server/metrics.ts"
+ provides: "prom-client registry and metrics definitions"
+ contains: "collectDefaultMetrics"
+ - path: "helm/taskplaner/templates/servicemonitor.yaml"
+ provides: "ServiceMonitor for Prometheus Operator"
+ contains: "kind: ServiceMonitor"
+ key_links:
+ - from: "src/routes/metrics/+server.ts"
+ to: "src/lib/server/metrics.ts"
+ via: "import register"
+ pattern: "import.*register.*from.*metrics"
+ - from: "helm/taskplaner/templates/servicemonitor.yaml"
+ to: "tp-app service"
+ via: "selector matchLabels"
+ pattern: "selector.*matchLabels"
+---
+
+
+Add a Prometheus metrics endpoint to TaskPlanner, plus a ServiceMonitor so Prometheus can scrape it
+
+Purpose: Enable Prometheus to collect application metrics from TaskPlanner (OBS-08, OBS-01)
+Output: /metrics endpoint returning prom-client default metrics, ServiceMonitor in Helm chart
+
+
+
+@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
+@/home/tho/.claude/get-shit-done/templates/summary.md
+
+
+
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@.planning/phases/08-observability-stack/CONTEXT.md
+@package.json
+@src/routes/health/+server.ts
+@helm/taskplaner/values.yaml
+@helm/taskplaner/templates/service.yaml
+
+
+
+
+
+ Task 1: Add prom-client and create /metrics endpoint
+
+ package.json
+ src/lib/server/metrics.ts
+ src/routes/metrics/+server.ts
+
+
+ 1. Install prom-client:
+ ```bash
+ npm install prom-client
+ ```
+
+ 2. Create src/lib/server/metrics.ts:
+ - Import prom-client's Registry, collectDefaultMetrics
+ - Create a new Registry instance
+ - Call collectDefaultMetrics({ register: registry }) to collect Node.js process metrics
+ - Export the registry
+ - Keep it minimal - just default metrics (memory, CPU, event loop lag)
+
+ 3. Create src/routes/metrics/+server.ts:
+ - Import the registry from $lib/server/metrics
+ - Create GET handler that returns registry.metrics() with Content-Type: text/plain; version=0.0.4
+ - Handle errors gracefully (return 500 on failure)
+ - Pattern follows existing /health endpoint structure
+
+ NOTE: prom-client is the standard Node.js Prometheus client. Use default metrics only - no custom metrics needed for this phase.
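+
+ A minimal sketch of the two files, assuming prom-client's standard `Registry`/`collectDefaultMetrics` API and SvelteKit's `RequestHandler` shape (adjust names to the actual codebase):
+ ```typescript
+ // src/lib/server/metrics.ts — sketch: one registry, default metrics only
+ import { Registry, collectDefaultMetrics } from 'prom-client';
+
+ export const register = new Registry();
+ // Registers Node.js process metrics: memory, CPU, event loop lag, GC
+ collectDefaultMetrics({ register });
+ ```
+ ```typescript
+ // src/routes/metrics/+server.ts — sketch: serve the registry as exposition text
+ import type { RequestHandler } from '@sveltejs/kit';
+ import { register } from '$lib/server/metrics';
+
+ export const GET: RequestHandler = async () => {
+   try {
+     return new Response(await register.metrics(), {
+       // register.contentType is "text/plain; version=0.0.4; charset=utf-8"
+       headers: { 'Content-Type': register.contentType }
+     });
+   } catch {
+     return new Response('metrics collection failed', { status: 500 });
+   }
+ };
+ ```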
+
+
+ 1. npm run build completes without errors
+ 2. npm run dev, then curl http://localhost:5173/metrics returns text starting with "# HELP" or "# TYPE"
+ 3. Response Content-Type header includes "text/plain"
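+
+ For reference, the response body should look like this exposition-format text (metric names are prom-client defaults; values illustrative):
+ ```text
+ # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
+ # TYPE process_cpu_seconds_total counter
+ process_cpu_seconds_total 0.12
+ # HELP nodejs_eventloop_lag_seconds Lag of event loop in seconds.
+ # TYPE nodejs_eventloop_lag_seconds gauge
+ nodejs_eventloop_lag_seconds 0.001
+ ```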
+
+
+ /metrics endpoint returns Prometheus-format metrics including process_cpu_seconds_total, nodejs_heap_size_total_bytes
+
+
+
+
+ Task 2: Add ServiceMonitor to Helm chart
+
+ helm/taskplaner/templates/servicemonitor.yaml
+ helm/taskplaner/values.yaml
+
+
+ 1. Create helm/taskplaner/templates/servicemonitor.yaml:
+ ```yaml
+ {{- if .Values.metrics.enabled }}
+ apiVersion: monitoring.coreos.com/v1
+ kind: ServiceMonitor
+ metadata:
+ name: {{ include "taskplaner.fullname" . }}
+ labels:
+ {{- include "taskplaner.labels" . | nindent 4 }}
+ spec:
+ selector:
+ matchLabels:
+ {{- include "taskplaner.selectorLabels" . | nindent 6 }}
+ endpoints:
+ - port: http
+ path: /metrics
+ interval: {{ .Values.metrics.interval | default "30s" }}
+ namespaceSelector:
+ matchNames:
+ - {{ .Release.Namespace }}
+ {{- end }}
+ ```
+
+ 2. Update helm/taskplaner/values.yaml - add metrics section:
+ ```yaml
+ # Prometheus metrics
+ metrics:
+ enabled: true
+ interval: 30s
+ ```
+
+ 3. Ensure the service template exposes a port named "http" — the ServiceMonitor's `port: http` refers to the Service port *name*, so check that service.yaml sets `name: http` on the port (having only `targetPort: http` is not enough)
+
+ NOTE: The ServiceMonitor uses monitoring.coreos.com/v1 API which kube-prometheus-stack provides. The namespaceSelector ensures Prometheus finds TaskPlanner in the default namespace.
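+
+ If the port turns out to be unnamed, the Service ports section should end up shaped like this (hypothetical excerpt — verify against the chart's actual service.yaml):
+ ```yaml
+ ports:
+   - name: http          # the ServiceMonitor's "port: http" matches this name
+     port: 80
+     targetPort: http
+     protocol: TCP
+ ```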
+
+
+ 1. helm template ./helm/taskplaner includes ServiceMonitor resource
+ 2. helm template output shows selector matching app.kubernetes.io/name: taskplaner
+ 3. No helm lint errors
+
+
+ ServiceMonitor template renders correctly with selector matching TaskPlanner service, ready for Prometheus to discover
+
+
+
+
+
+
+- [ ] npm run build succeeds
+- [ ] curl localhost:5173/metrics returns Prometheus-format text
+- [ ] helm template ./helm/taskplaner shows ServiceMonitor resource
+- [ ] ServiceMonitor selector matches service labels
+
+
+
+1. /metrics endpoint returns Prometheus-format metrics (process metrics, heap size, event loop)
+2. ServiceMonitor added to Helm chart templates
+3. ServiceMonitor enabled by default in values.yaml
+4. Build and type check pass
+
+
+
diff --git a/.planning/phases/08-observability-stack/08-02-PLAN.md b/.planning/phases/08-observability-stack/08-02-PLAN.md
new file mode 100644
index 0000000..7f531c7
--- /dev/null
+++ b/.planning/phases/08-observability-stack/08-02-PLAN.md
@@ -0,0 +1,229 @@
+---
+phase: 08-observability-stack
+plan: 02
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+ - helm/alloy/values.yaml (new)
+ - helm/alloy/Chart.yaml (new)
+autonomous: true
+
+must_haves:
+ truths:
+ - "Alloy DaemonSet runs on all nodes"
+ - "Alloy forwards logs to Loki"
+ - "Promtail DaemonSet is removed"
+ artifacts:
+ - path: "helm/alloy/Chart.yaml"
+ provides: "Alloy Helm chart wrapper"
+ contains: "name: alloy"
+ - path: "helm/alloy/values.yaml"
+ provides: "Alloy configuration for Loki forwarding"
+ contains: "loki.write"
+ key_links:
+ - from: "Alloy pods"
+ to: "loki-stack:3100"
+ via: "loki.write endpoint"
+ pattern: "endpoint.*loki"
+---
+
+
+Migrate from Promtail to Grafana Alloy for log collection
+
+Purpose: Replace Promtail, which reaches end of life in March 2026, with a Grafana Alloy DaemonSet (OBS-04)
+Output: Alloy DaemonSet forwarding logs to Loki, Promtail removed
+
+
+
+@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
+@/home/tho/.claude/get-shit-done/templates/summary.md
+
+
+
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@.planning/phases/08-observability-stack/CONTEXT.md
+
+
+
+
+
+ Task 1: Deploy Grafana Alloy via Helm
+
+ helm/alloy/Chart.yaml
+ helm/alloy/values.yaml
+
+
+ 1. Create helm/alloy directory and Chart.yaml as umbrella chart:
+ ```yaml
+ apiVersion: v2
+ name: alloy
+ description: Grafana Alloy log collector
+ version: 0.1.0
+ dependencies:
+ - name: alloy
+ version: "0.12.*"
+ repository: https://grafana.github.io/helm-charts
+ ```
+
+ 2. Create helm/alloy/values.yaml with minimal config for Loki forwarding:
+ ```yaml
+ alloy:
+ alloy:
+ configMap:
+ content: |
+ // Discover pods and collect logs
+ discovery.kubernetes "pods" {
+ role = "pod"
+ }
+
+ // Relabel to extract pod metadata
+ discovery.relabel "pods" {
+ targets = discovery.kubernetes.pods.targets
+
+ rule {
+ source_labels = ["__meta_kubernetes_namespace"]
+ target_label = "namespace"
+ }
+ rule {
+ source_labels = ["__meta_kubernetes_pod_name"]
+ target_label = "pod"
+ }
+ rule {
+ source_labels = ["__meta_kubernetes_pod_container_name"]
+ target_label = "container"
+ }
+ }
+
+ // Collect logs from discovered pods
+ loki.source.kubernetes "pods" {
+ targets = discovery.relabel.pods.output
+ forward_to = [loki.write.default.receiver]
+ }
+
+ // Forward to Loki
+ loki.write "default" {
+ endpoint {
+ url = "http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/push"
+ }
+ }
+
+ controller:
+ type: daemonset
+
+ serviceAccount:
+ create: true
+ ```
+
+ 3. Add Grafana Helm repo and build dependencies:
+ ```bash
+ helm repo add grafana https://grafana.github.io/helm-charts
+ helm repo update
+ cd helm/alloy && helm dependency build
+ ```
+
+ 4. Deploy Alloy to monitoring namespace:
+ ```bash
+ helm upgrade --install alloy ./helm/alloy -n monitoring --create-namespace
+ ```
+
+ 5. Verify Alloy pods are running:
+ ```bash
+ kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
+ ```
+ Expected: 5 pods (one per node) in Running state
+
+ NOTE:
+ - Alloy uses its own configuration syntax (formerly called River), not YAML
+ - Labels (namespace, pod, container) match existing Promtail labels for query compatibility
+ - Loki endpoint is cluster-internal: loki-stack.monitoring.svc.cluster.local:3100
+
+
+ 1. kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy shows 5 Running pods
+ 2. kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=20 shows no errors
+ 3. Alloy logs show "loki.write" component started successfully
+
+
+ Alloy DaemonSet deployed with 5 pods collecting logs and forwarding to Loki
+
+
+
+
+ Task 2: Verify log flow and remove Promtail
+
+ (no files - kubectl operations)
+
+
+ 1. Generate a test log by restarting TaskPlanner pod:
+ ```bash
+ kubectl rollout restart deployment taskplaner
+ ```
+
+ 2. Wait for pod to be ready:
+ ```bash
+ kubectl rollout status deployment taskplaner --timeout=60s
+ ```
+
+ 3. Verify logs appear in Loki via LogCLI or curl:
+ ```bash
+ # Query recent TaskPlanner logs via Loki API
+ kubectl run --rm -it logtest --image=curlimages/curl --restart=Never -- \
+ curl -s "http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/query_range" \
+ --data-urlencode 'query={namespace="default",pod=~"taskplaner.*"}' \
+ --data-urlencode 'limit=5'
+ ```
+ Expected: JSON response with "result" containing log entries
+
+ 4. Once logs confirmed flowing via Alloy, remove Promtail:
+ ```bash
+ # Find and delete Promtail release
+ helm list -n monitoring | grep promtail
+ # If promtail found:
+ helm uninstall loki-stack-promtail -n monitoring 2>/dev/null || \
+ helm uninstall promtail -n monitoring 2>/dev/null || \
+ kubectl delete daemonset -n monitoring -l app=promtail
+ ```
+
+ 5. Verify Promtail is gone:
+ ```bash
+ kubectl get pods -n monitoring | grep -i promtail
+ ```
+ Expected: No promtail pods
+
+ 6. Verify logs still flowing after Promtail removal (repeat step 3)
+
+ NOTE: Promtail may be installed as part of loki-stack or separately. Check both.
+
+
+ 1. Loki API returns TaskPlanner log entries
+ 2. kubectl get pods -n monitoring shows NO promtail pods
+ 3. kubectl get pods -n monitoring shows Alloy pods still running
+ 4. Second Loki query after Promtail removal still returns logs
+
+
+ Logs confirmed flowing from Alloy to Loki, Promtail DaemonSet removed from cluster
+
+
+
+
+
+
+- [ ] Alloy DaemonSet has 5 Running pods (one per node)
+- [ ] Alloy pods show no errors in logs
+- [ ] Loki API returns TaskPlanner log entries
+- [ ] Promtail pods no longer exist
+- [ ] Log flow continues after Promtail removal
+
+
+
+1. Alloy DaemonSet running on all 5 nodes
+2. Logs from TaskPlanner appear in Loki within 60 seconds of generation
+3. Promtail DaemonSet completely removed
+4. No log collection gap (Alloy verified before Promtail removal)
+
+
+
diff --git a/.planning/phases/08-observability-stack/08-03-PLAN.md b/.planning/phases/08-observability-stack/08-03-PLAN.md
new file mode 100644
index 0000000..01a6a6f
--- /dev/null
+++ b/.planning/phases/08-observability-stack/08-03-PLAN.md
@@ -0,0 +1,233 @@
+---
+phase: 08-observability-stack
+plan: 03
+type: execute
+wave: 2
+depends_on: ["08-01", "08-02"]
+files_modified: []
+autonomous: false
+
+must_haves:
+ truths:
+ - "Prometheus scrapes TaskPlanner /metrics endpoint"
+ - "Grafana can query TaskPlanner logs via Loki"
+ - "KubePodCrashLooping alert rule exists"
+ artifacts: []
+ key_links:
+ - from: "Prometheus"
+ to: "TaskPlanner /metrics"
+ via: "ServiceMonitor"
+ pattern: "servicemonitor.*taskplaner"
+ - from: "Grafana Explore"
+ to: "Loki datasource"
+ via: "LogQL query"
+ pattern: "namespace.*default.*taskplaner"
+---
+
+
+Verify end-to-end observability stack: metrics scraping, log queries, and alerting
+
+Purpose: Confirm all Phase 8 requirements are satisfied (OBS-01 through OBS-08)
+Output: Verified observability stack with documented proof of functionality
+
+
+
+@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
+@/home/tho/.claude/get-shit-done/templates/summary.md
+
+
+
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@.planning/phases/08-observability-stack/CONTEXT.md
+@.planning/phases/08-observability-stack/08-01-SUMMARY.md
+@.planning/phases/08-observability-stack/08-02-SUMMARY.md
+
+
+
+
+
+ Task 1: Deploy TaskPlanner with ServiceMonitor and verify Prometheus scraping
+
+ (no files - deployment and verification)
+
+
+ 1. Commit and push the metrics endpoint and ServiceMonitor changes from 08-01:
+ ```bash
+ git add .
+ git commit -m "feat(metrics): add /metrics endpoint and ServiceMonitor
+
+ - Add prom-client for Prometheus metrics
+ - Expose /metrics endpoint with default Node.js metrics
+ - Add ServiceMonitor template to Helm chart
+
+ OBS-08, OBS-01"
+ git push
+ ```
+
+ 2. Wait for ArgoCD to sync (or trigger manual sync):
+ ```bash
+ # Check ArgoCD sync status
+ kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
+ # If not synced, wait up to 3 minutes or trigger:
+ argocd app sync taskplaner --server argocd.tricnet.be --insecure 2>/dev/null || \
+ kubectl patch application taskplaner -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{}}}'
+ ```
+
+ 3. Wait for deployment to complete:
+ ```bash
+ kubectl rollout status deployment taskplaner --timeout=120s
+ ```
+
+ 4. Verify ServiceMonitor created:
+ ```bash
+ kubectl get servicemonitor taskplaner
+ ```
+ Expected: ServiceMonitor exists
+
+ 5. Verify Prometheus is scraping TaskPlanner:
+ ```bash
+ # Port-forward to Prometheus
+ kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
+ sleep 3
+
+ # Query for TaskPlanner targets
+ curl -s "http://localhost:9090/api/v1/targets" | grep -A5 "taskplaner"
+
+ # Kill port-forward
+ kill %1 2>/dev/null
+ ```
+ Expected: TaskPlanner target shows state: "up"
+
+ 6. Query a TaskPlanner metric:
+ ```bash
+ kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
+ sleep 3
+ curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=process_cpu_seconds_total{namespace="default",pod=~"taskplaner.*"}' | jq '.data.result[0].value'
+ kill %1 2>/dev/null
+ ```
+ Expected: Returns a numeric value
+
+ NOTE: If auto-sync is enabled, the git push in step 1 may trigger the sync on its own; the manual trigger in step 2 is only a fallback.
+
+
+ 1. kubectl get servicemonitor taskplaner returns a resource
+ 2. Prometheus targets API shows TaskPlanner with state "up"
+ 3. Prometheus query returns process_cpu_seconds_total value for TaskPlanner
+
+
+ Prometheus successfully scraping TaskPlanner /metrics endpoint via ServiceMonitor
+
+
+
+
+ Task 2: Verify critical alert rules exist
+
+ (no files - verification only)
+
+
+ 1. List PrometheusRules to find pod crash alerting:
+ ```bash
+ kubectl get prometheusrules -n monitoring -o name | head -20
+ ```
+
+ 2. Search for KubePodCrashLooping alert:
+ ```bash
+ kubectl get prometheusrules -n monitoring -o yaml | grep -A10 "KubePodCrashLooping"
+ ```
+ Expected: Alert rule definition found
+
+ 3. If not found by name, search for crash-related alerts:
+ ```bash
+ kubectl get prometheusrules -n monitoring -o yaml | grep -i "crash\|restart\|CrashLoopBackOff" | head -10
+ ```
+
+ 4. Verify Alertmanager is running:
+ ```bash
+ kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
+ ```
+ Expected: alertmanager pod(s) Running
+
+ 5. Check current alerts (should be empty if cluster healthy):
+ ```bash
+ kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
+ sleep 2
+ curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head -10
+ kill %1 2>/dev/null
+ ```
+
+ NOTE: kube-prometheus-stack ships the default kubernetes-mixin alerting rules. KubePodCrashLooping is one of them; it fires when a container is repeatedly restarting (stuck in CrashLoopBackOff).
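+
+ For orientation, the upstream rule is shaped roughly like this (simplified sketch — the exact expression and thresholds depend on the chart version, so verify against step 2's output):
+ ```yaml
+ - alert: KubePodCrashLooping
+   expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
+   for: 15m
+   labels:
+     severity: warning
+ ```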
+
+
+ 1. kubectl get prometheusrules finds KubePodCrashLooping or equivalent crash alert
+ 2. Alertmanager pod is Running
+ 3. Alertmanager API responds (even if alert list is empty)
+
+
+ KubePodCrashLooping alert rule confirmed present, Alertmanager operational
+
+
+
+
+
+ Full observability stack:
+ - TaskPlanner /metrics endpoint (OBS-08)
+ - Prometheus scraping via ServiceMonitor (OBS-01)
+ - Alloy collecting logs (OBS-04)
+ - Loki storing logs (OBS-03)
+ - Critical alerts configured (OBS-06)
+ - Grafana dashboards (OBS-02)
+
+
+ 1. Open Grafana: https://grafana.kube2.tricnet.de
+ - Login: admin / GrafanaAdmin2026
+
+ 2. Verify dashboards (OBS-02):
+ - Go to Dashboards
+ - Open "Kubernetes / Compute Resources / Namespace (Pods)" or similar
+ - Select namespace: default
+ - Confirm TaskPlanner pod metrics visible
+
+ 3. Verify log queries (OBS-05):
+ - Go to Explore
+ - Select Loki datasource
+ - Enter query: {namespace="default", pod=~"taskplaner.*"}
+ - Click Run Query
+ - Confirm TaskPlanner logs appear
+
+ 4. Verify TaskPlanner metrics in Grafana:
+ - Go to Explore
+ - Select Prometheus datasource
+ - Enter query: process_cpu_seconds_total{namespace="default", pod=~"taskplaner.*"}
+ - Confirm metric graph appears
+
+ 5. Verify Grafana accessible with TLS (OBS-07):
+ - Confirm https:// in URL bar (no certificate warnings)
+
+ Type "verified" if all checks pass, or describe what failed
+
+
+
+
+
+- [ ] ServiceMonitor created and Prometheus scraping TaskPlanner
+- [ ] TaskPlanner metrics visible in Prometheus queries
+- [ ] KubePodCrashLooping alert rule exists
+- [ ] Alertmanager running and responsive
+- [ ] Human verified: Grafana dashboards show cluster metrics
+- [ ] Human verified: Grafana can query TaskPlanner logs from Loki
+- [ ] Human verified: TaskPlanner metrics visible in Grafana
+
+
+
+1. Prometheus scrapes TaskPlanner /metrics (OBS-01, OBS-08 complete)
+2. Grafana dashboards display cluster metrics (OBS-02 verified)
+3. TaskPlanner logs queryable in Grafana via Loki (OBS-05 verified)
+4. KubePodCrashLooping alert rule confirmed (OBS-06 verified)
+5. Grafana accessible via TLS (OBS-07 verified)
+
+
+
diff --git a/.planning/phases/08-observability-stack/CONTEXT.md b/.planning/phases/08-observability-stack/CONTEXT.md
new file mode 100644
index 0000000..d824195
--- /dev/null
+++ b/.planning/phases/08-observability-stack/CONTEXT.md
@@ -0,0 +1,114 @@
+# Phase 8: Observability Stack - Context
+
+**Goal:** Full visibility into cluster and application health via metrics, logs, and dashboards
+**Status:** Mostly pre-existing infrastructure, focusing on gaps
+
+## Discovery Summary
+
+The observability stack is largely already installed (15 days running). Phase 8 focuses on:
+1. Gaps in existing setup
+2. Migration from Promtail to Alloy (Promtail EOL March 2026)
+3. TaskPlanner-specific observability
+
+### What's Already Working
+
+| Component | Status | Details |
+|-----------|--------|---------|
+| Prometheus | ✅ Running | kube-prometheus-stack, scraping cluster metrics |
+| Grafana | ✅ Running | Accessible at grafana.kube2.tricnet.de (HTTP 200) |
+| Loki | ✅ Running | loki-stack-0 pod, configured as Grafana datasource |
+| AlertManager | ✅ Running | 35 PrometheusRules configured |
+| Node Exporters | ✅ Running | 5 pods across nodes |
+| Kube-state-metrics | ✅ Running | Cluster state metrics |
+| Promtail | ⚠️ Running | 5 DaemonSet pods - needs migration to Alloy |
+
+### What's Missing
+
+| Gap | Requirement | Details |
+|-----|-------------|---------|
+| TaskPlanner /metrics | OBS-08 | App doesn't expose Prometheus metrics endpoint |
+| TaskPlanner ServiceMonitor | OBS-01 | No scraping config for app metrics |
+| Alloy migration | OBS-04 | Promtail running but EOL March 2026 |
+| Verify Loki queries | OBS-05 | Datasource configured, need to verify logs work |
+| Critical alerts verification | OBS-06 | Rules exist, need to verify KubePodCrashLooping |
+| Grafana TLS ingress | OBS-07 | Works via external proxy, not k8s ingress |
+
+## Infrastructure Context
+
+### Cluster Details
+- k3s cluster with 5 nodes (1 master + 4 workers based on node-exporter count)
+- Namespace: `monitoring` for all observability components
+- Namespace: `default` for TaskPlanner
+
+### Grafana Access
+- URL: https://grafana.kube2.tricnet.de
+- Admin password: `GrafanaAdmin2026` (from secret)
+- Service type: ClusterIP (exposed via external proxy, not k8s ingress)
+- Datasources configured: Prometheus, Alertmanager, Loki (the Loki datasource appears twice)
+
+### Loki Configuration
+- Service: `loki-stack:3100` (ClusterIP)
+- Storage: Not checked (likely local filesystem)
+- Retention: Not checked
+
+### Promtail (to be replaced)
+- 5 DaemonSet pods running
+- Forwards to loki-stack:3100
+- EOL: March 2026 - migrate to Grafana Alloy
+
+## Decisions
+
+### From Research (v2.0)
+- Use Grafana Alloy instead of Promtail (EOL March 2026)
+- Loki in monolithic (single-binary) mode with 7-day retention is sufficient at this scale
+- kube-prometheus-stack is the standard for k8s observability
+
+### Phase-specific
+- **Grafana ingress**: Leave as-is (external proxy works, OBS-07 satisfied)
+- **Alloy migration**: Replace Promtail DaemonSet with Alloy DaemonSet
+- **TaskPlanner metrics**: Add prom-client to SvelteKit app (standard Node.js client)
+- **Alloy labels**: Match existing Promtail labels (namespace, pod, container) for query compatibility
+
+## Requirements Mapping
+
+| Requirement | Current State | Phase 8 Action |
+|-------------|---------------|----------------|
+| OBS-01 | Partial (cluster only) | Add TaskPlanner ServiceMonitor |
+| OBS-02 | ✅ Done | Verify dashboards work |
+| OBS-03 | ✅ Done | Loki running |
+| OBS-04 | ⚠️ Promtail | Migrate to Alloy DaemonSet |
+| OBS-05 | Configured | Verify log queries work |
+| OBS-06 | 35 rules exist | Verify critical alerts fire |
+| OBS-07 | ✅ Done | Grafana accessible via TLS |
+| OBS-08 | ❌ Missing | Add /metrics endpoint to TaskPlanner |
+
+## Plan Outline
+
+1. **08-01**: TaskPlanner metrics endpoint + ServiceMonitor
+ - Add prom-client to app
+ - Expose /metrics endpoint
+ - Create ServiceMonitor for Prometheus scraping
+
+2. **08-02**: Promtail → Alloy migration
+ - Deploy Grafana Alloy DaemonSet
+ - Configure log forwarding to Loki
+ - Remove Promtail DaemonSet
+ - Verify logs still flow
+
+3. **08-03**: Verification
+ - Verify Grafana can query Loki logs
+ - Verify TaskPlanner metrics appear in Prometheus
+ - Verify KubePodCrashLooping alert exists
+ - End-to-end log flow test
+
+## Risks
+
+| Risk | Mitigation |
+|------|------------|
+| Log gap during Promtail→Alloy switch | Deploy Alloy first, verify working, then remove Promtail |
+| prom-client adds overhead | Use minimal default metrics only (process, memory, event loop) |
+| Alloy config complexity | Start with minimal config matching Promtail behavior |
+
+---
+*Context gathered: 2026-02-03*
+*Decision: Focus on gaps + Alloy migration*