Phase 08: Observability Stack - 3 plans in 2 waves
- Wave 1: 08-01 (metrics), 08-02 (Alloy) - parallel
- Wave 2: 08-03 (verification) - depends on both
Ready for execution
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 8: Observability Stack - Context
Goal: Full visibility into cluster and application health via metrics, logs, and dashboards
Status: Mostly pre-existing infrastructure, focusing on gaps
Discovery Summary
The observability stack is largely in place already (components have been running for 15 days). Phase 8 focuses on:
- Gaps in existing setup
- Migration from Promtail to Alloy (Promtail EOL March 2026)
- TaskPlanner-specific observability
What's Already Working
| Component | Status | Details |
|---|---|---|
| Prometheus | ✅ Running | kube-prometheus-stack, scraping cluster metrics |
| Grafana | ✅ Running | Accessible at grafana.kube2.tricnet.de (HTTP 200) |
| Loki | ✅ Running | loki-stack-0 pod, configured as Grafana datasource |
| AlertManager | ✅ Running | 35 PrometheusRules configured |
| Node Exporters | ✅ Running | 5 pods across nodes |
| Kube-state-metrics | ✅ Running | Cluster state metrics |
| Promtail | ⚠️ Running | 5 DaemonSet pods - needs migration to Alloy |
What's Missing
| Gap | Requirement | Details |
|---|---|---|
| TaskPlanner /metrics | OBS-08 | App doesn't expose Prometheus metrics endpoint |
| TaskPlanner ServiceMonitor | OBS-01 | No scraping config for app metrics |
| Alloy migration | OBS-04 | Promtail running but EOL March 2026 |
| Verify Loki queries | OBS-05 | Datasource configured, need to verify logs work |
| Critical alerts verification | OBS-06 | Rules exist, need to verify KubePodCrashLooping |
| Grafana TLS ingress | OBS-07 | Works via external proxy, not k8s ingress |
Infrastructure Context
Cluster Details
- k3s cluster with 5 nodes (1 master + 4 workers based on node-exporter count)
- Namespace: `monitoring` for all observability components
- Namespace: `default` for TaskPlanner
Grafana Access
- URL: https://grafana.kube2.tricnet.de
- Admin password: `GrafanaAdmin2026` (from secret)
- Service type: ClusterIP (exposed via external proxy, not k8s ingress)
- Datasources configured: Prometheus, Alertmanager, Loki (2x entries)
Loki Configuration
- Service: `loki-stack:3100` (ClusterIP)
- Storage: Not checked (likely local filesystem)
- Retention: Not checked
Promtail (to be replaced)
- 5 DaemonSet pods running
- Forwards to loki-stack:3100
- EOL: March 2026 - migrate to Grafana Alloy
Decisions
From Research (v2.0)
- Use Grafana Alloy instead of Promtail (EOL March 2026)
- Loki monolithic mode with 7-day retention is appropriate for a single-node Loki deployment
- kube-prometheus-stack is the standard for k8s observability
Phase-specific
- Grafana ingress: Leave as-is (external proxy works, OBS-07 satisfied)
- Alloy migration: Replace Promtail DaemonSet with Alloy DaemonSet
- TaskPlanner metrics: Add prom-client to SvelteKit app (standard Node.js client)
- Alloy labels: Match existing Promtail labels (namespace, pod, container) for query compatibility
Requirements Mapping
| Requirement | Current State | Phase 8 Action |
|---|---|---|
| OBS-01 | Partial (cluster only) | Add TaskPlanner ServiceMonitor |
| OBS-02 | ✅ Done | Verify dashboards work |
| OBS-03 | ✅ Done | Loki running |
| OBS-04 | ⚠️ Promtail | Migrate to Alloy DaemonSet |
| OBS-05 | Configured | Verify log queries work |
| OBS-06 | 35 rules exist | Verify critical alerts fire |
| OBS-07 | ✅ Done | Grafana accessible via TLS |
| OBS-08 | ❌ Missing | Add /metrics endpoint to TaskPlanner |
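For OBS-01, the scraping config is a ServiceMonitor picked up by kube-prometheus-stack. A sketch follows; the object names, the Service label, and the port name are assumptions to be checked against the actual TaskPlanner Service:

```yaml
# Sketch only -- verify selector labels and port name before applying
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match Prometheus's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - default                      # TaskPlanner's namespace
  selector:
    matchLabels:
      app: taskplanner               # assumed Service label
  endpoints:
    - port: http                     # assumed named port on the Service
      path: /metrics
      interval: 30s
```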
Plan Outline
- 08-01: TaskPlanner metrics endpoint + ServiceMonitor
  - Add prom-client to app
  - Expose /metrics endpoint
  - Create ServiceMonitor for Prometheus scraping
- 08-02: Promtail → Alloy migration
  - Deploy Grafana Alloy DaemonSet
  - Configure log forwarding to Loki
  - Remove Promtail DaemonSet
  - Verify logs still flow
- 08-03: Verification
  - Verify Grafana can query Loki logs
  - Verify TaskPlanner metrics appear in Prometheus
  - Verify KubePodCrashLooping alert exists
  - End-to-end log flow test
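The 08-02 step could start from a minimal Alloy configuration like the sketch below. The Loki push URL assumes the existing loki-stack service in the monitoring namespace, and the relabel rules mirror the Promtail-compatible labels (namespace, pod, container) named in the decisions above:

```alloy
// Sketch only: minimal Alloy config intended to mirror Promtail's behavior.
// Verify the in-cluster Loki DNS name before applying.

discovery.kubernetes "pods" {
  role = "pod"
}

// Carry over the Promtail-compatible labels for query compatibility
discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
}

// Tail pod logs via the Kubernetes API (no host-path mounts needed)
loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/push"
  }
}
```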
Risks
| Risk | Mitigation |
|---|---|
| Log gap during Promtail→Alloy switch | Deploy Alloy first, verify working, then remove Promtail |
| prom-client adds overhead | Use minimal default metrics (process, http request duration) |
| Alloy config complexity | Start with minimal config matching Promtail behavior |
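The 08-03 checks can be run by hand before any automation; a hedged command sketch, in which the Service names, ports, and rule name are assumptions based on the context above:

```
# 1. TaskPlanner exposes /metrics (assumes a Service named taskplanner on port 80)
kubectl -n default port-forward svc/taskplanner 8080:80 &
curl -s http://localhost:8080/metrics | head

# 2. KubePodCrashLooping exists among the 35 PrometheusRules
kubectl -n monitoring get prometheusrules -o yaml | grep KubePodCrashLooping

# 3. Loki answers a basic query (assumes the loki-stack service on port 3100)
kubectl -n monitoring port-forward svc/loki-stack 3100:3100 &
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="default"}' | head
```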
Context gathered: 2026-02-03
Decision: Focus on gaps + Alloy migration