# Phase 8: Observability Stack - Context

**Goal:** Full visibility into cluster and application health via metrics, logs, and dashboards

**Status:** Mostly pre-existing infrastructure; focusing on gaps

## Discovery Summary

The observability stack is largely already installed (running for 15 days). Phase 8 focuses on:

1. Gaps in the existing setup
2. Migration from Promtail to Alloy (Promtail reaches EOL in March 2026)
3. TaskPlanner-specific observability

### What's Already Working

| Component | Status | Details |
|-----------|--------|---------|
| Prometheus | ✅ Running | kube-prometheus-stack, scraping cluster metrics |
| Grafana | ✅ Running | Accessible at grafana.kube2.tricnet.de (HTTP 200) |
| Loki | ✅ Running | loki-stack-0 pod, configured as a Grafana datasource |
| AlertManager | ✅ Running | 35 PrometheusRules configured |
| Node Exporters | ✅ Running | 5 pods across nodes |
| Kube-state-metrics | ✅ Running | Cluster state metrics |
| Promtail | ⚠️ Running | 5 DaemonSet pods; needs migration to Alloy |

### What's Missing

| Gap | Requirement | Details |
|-----|-------------|---------|
| TaskPlanner /metrics | OBS-08 | App doesn't expose a Prometheus metrics endpoint |
| TaskPlanner ServiceMonitor | OBS-01 | No scraping config for app metrics |
| Alloy migration | OBS-04 | Promtail running but EOL March 2026 |
| Verify Loki queries | OBS-05 | Datasource configured; need to verify log queries work |
| Critical alerts verification | OBS-06 | Rules exist; need to verify KubePodCrashLooping |
| Grafana TLS ingress | OBS-07 | Works via external proxy, not a k8s ingress |

## Infrastructure Context

### Cluster Details

- k3s cluster with 5 nodes (1 master + 4 workers, inferred from the node-exporter count)
- Namespace `monitoring` for all observability components
- Namespace `default` for TaskPlanner

### Grafana Access

- URL: https://grafana.kube2.tricnet.de
- Admin password: `GrafanaAdmin2026` (from secret)
- Service type: ClusterIP (exposed via external proxy, not a k8s ingress)
- Datasources configured:
  - Prometheus
  - Alertmanager
  - Loki (2 duplicate entries)

### Loki Configuration

- Service: `loki-stack:3100` (ClusterIP)
- Storage: not checked (likely local filesystem)
- Retention: not checked

### Promtail (to be replaced)

- 5 DaemonSet pods running
- Forwards to loki-stack:3100
- EOL March 2026; migrate to Grafana Alloy

## Decisions

### From Research (v2.0)

- Use Grafana Alloy instead of Promtail (Promtail EOL March 2026)
- Loki monolithic mode with 7-day retention is appropriate for a single-node deployment
- kube-prometheus-stack is the standard for k8s observability

### Phase-specific

- **Grafana ingress**: Leave as-is (the external proxy works; OBS-07 is satisfied)
- **Alloy migration**: Replace the Promtail DaemonSet with an Alloy DaemonSet
- **TaskPlanner metrics**: Add prom-client to the SvelteKit app (the standard Node.js client)
- **Alloy labels**: Match the existing Promtail labels (namespace, pod, container) for query compatibility

## Requirements Mapping

| Requirement | Current State | Phase 8 Action |
|-------------|---------------|----------------|
| OBS-01 | Partial (cluster only) | Add TaskPlanner ServiceMonitor |
| OBS-02 | ✅ Done | Verify dashboards work |
| OBS-03 | ✅ Done | Loki running |
| OBS-04 | ⚠️ Promtail | Migrate to Alloy DaemonSet |
| OBS-05 | Configured | Verify log queries work |
| OBS-06 | 35 rules exist | Verify critical alerts fire |
| OBS-07 | ✅ Done | Grafana accessible via TLS |
| OBS-08 | ❌ Missing | Add /metrics endpoint to TaskPlanner |

## Plan Outline

1. **08-01**: TaskPlanner metrics endpoint + ServiceMonitor
   - Add prom-client to the app
   - Expose a /metrics endpoint
   - Create a ServiceMonitor for Prometheus scraping
2. **08-02**: Promtail → Alloy migration
   - Deploy the Grafana Alloy DaemonSet
   - Configure log forwarding to Loki
   - Remove the Promtail DaemonSet
   - Verify logs still flow
3. **08-03**: Verification
   - Verify Grafana can query Loki logs
   - Verify TaskPlanner metrics appear in Prometheus
   - Verify the KubePodCrashLooping alert exists
   - End-to-end log flow test

## Risks

| Risk | Mitigation |
|------|------------|
| Log gap during the Promtail → Alloy switch | Deploy Alloy first, verify it works, then remove Promtail |
| prom-client adds overhead | Collect only minimal default metrics (process stats, HTTP request duration) |
| Alloy config complexity | Start with a minimal config matching Promtail behavior |

---

*Context gathered: 2026-02-03*
*Decision: Focus on gaps + Alloy migration*
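The ServiceMonitor from step 08-01 could look like the sketch below. The selector label `app: taskplanner`, the port name `http`, and the `release` label are assumptions; the `release` label must match whatever `serviceMonitorSelector` the installed kube-prometheus-stack Prometheus uses:

```yaml
# Hypothetical ServiceMonitor for TaskPlanner (labels/ports are assumptions)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # must match the Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - default            # TaskPlanner runs in the default namespace
  selector:
    matchLabels:
      app: taskplanner     # must match the TaskPlanner Service's labels
  endpoints:
    - port: http           # named port on the Service exposing /metrics
      path: /metrics
      interval: 30s
```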
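For step 08-02, a minimal Alloy configuration along these lines would match the "start minimal" mitigation; `loki.source.kubernetes` attaches namespace/pod/container labels similar to Promtail's defaults, but label parity should be verified against real queries. The Loki URL is taken from the existing setup; everything else is a sketch, not a tested config:

```alloy
// Minimal sketch: tail all pod logs via the Kubernetes API and push to Loki.
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-stack:3100/loki/api/v1/push"
  }
}
```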
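For plan step 08-01, the /metrics endpoint must return the Prometheus text exposition format, which prom-client's `register.metrics()` generates in the real app. A minimal hand-rolled sketch of that format, purely for illustration (the `renderMetrics` helper and the metric names are assumptions, not part of the plan):

```typescript
// Sketch of the Prometheus text exposition format a /metrics endpoint emits.
// In TaskPlanner this would come from prom-client; this helper only
// illustrates the output shape Prometheus expects to scrape.
interface Metric {
  name: string;
  help: string;
  type: 'counter' | 'gauge' | 'histogram';
  value: number;
  labels?: Record<string, string>;
}

function renderMetrics(metrics: Metric[]): string {
  return (
    metrics
      .map((m) => {
        // Optional label set, rendered as {key="value",...}
        const labels = m.labels
          ? '{' +
            Object.entries(m.labels)
              .map(([k, v]) => `${k}="${v}"`)
              .join(',') +
            '}'
          : '';
        // Each metric: HELP line, TYPE line, then one sample line.
        return `# HELP ${m.name} ${m.help}\n# TYPE ${m.name} ${m.type}\n${m.name}${labels} ${m.value}`;
      })
      .join('\n') + '\n'
  );
}

// Example: what a scrape of GET /metrics could return
console.log(
  renderMetrics([
    {
      name: 'http_requests_total',
      help: 'Total HTTP requests',
      type: 'counter',
      value: 42,
      labels: { method: 'GET', status: '200' },
    },
  ])
);
```

In the actual app, the same shape comes from `collectDefaultMetrics()` plus any custom histograms, exposed via a SvelteKit server route.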