# Phase 8: Observability Stack - Context
**Goal:** Full visibility into cluster and application health via metrics, logs, and dashboards
**Status:** Mostly pre-existing infrastructure, focusing on gaps
## Discovery Summary
The observability stack is largely already installed (15 days running). Phase 8 focuses on:
1. Gaps in existing setup
2. Migration from Promtail to Alloy (Promtail EOL March 2026)
3. TaskPlanner-specific observability
### What's Already Working
| Component | Status | Details |
|-----------|--------|---------|
| Prometheus | ✅ Running | kube-prometheus-stack, scraping cluster metrics |
| Grafana | ✅ Running | Accessible at grafana.kube2.tricnet.de (HTTP 200) |
| Loki | ✅ Running | loki-stack-0 pod, configured as Grafana datasource |
| AlertManager | ✅ Running | 35 PrometheusRules configured |
| Node Exporters | ✅ Running | 5 pods across nodes |
| Kube-state-metrics | ✅ Running | Cluster state metrics |
| Promtail | ⚠️ Running | 5 DaemonSet pods - needs migration to Alloy |
### What's Missing
| Gap | Requirement | Details |
|-----|-------------|---------|
| TaskPlanner /metrics | OBS-08 | App doesn't expose a Prometheus metrics endpoint |
| TaskPlanner ServiceMonitor | OBS-01 | No scraping config for app metrics |
| Alloy migration | OBS-04 | Promtail running but EOL March 2026 |
| Verify Loki queries | OBS-05 | Datasource configured, need to verify logs work |
| Critical alerts verification | OBS-06 | Rules exist, need to verify KubePodCrashLooping |
| Grafana TLS ingress | OBS-07 | Works via external proxy, not k8s ingress |
## Infrastructure Context
### Cluster Details
- k3s cluster with 5 nodes (1 master + 4 workers based on node-exporter count)
- Namespace: `monitoring` for all observability components
- Namespace: `default` for TaskPlanner
### Grafana Access
- URL: https://grafana.kube2.tricnet.de
- Admin password: `GrafanaAdmin2026` (from secret)
- Service type: ClusterIP (exposed via external proxy, not k8s ingress)
- Datasources configured: Prometheus, Alertmanager, Loki (listed twice)
### Loki Configuration
- Service: `loki-stack:3100` (ClusterIP)
- Storage: Not checked (likely local filesystem)
- Retention: Not checked
### Promtail (to be replaced)
- 5 DaemonSet pods running
- Forwards to loki-stack:3100
- EOL: March 2026 - migrate to Grafana Alloy
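
The replacement can be sketched as a minimal Alloy configuration. The component names (`discovery.kubernetes`, `loki.source.kubernetes`, `loki.write`) are real Alloy components and the Loki endpoint comes from this document, but the relabel rules are an assumption, chosen to reproduce the existing Promtail labels (`namespace`, `pod`, `container`) and to be refined in 08-02:

```alloy
// Sketch: tail all pod logs and push them to the existing Loki service.
discovery.kubernetes "pods" {
  role = "pod"
}

// Map Kubernetes metadata to the same labels Promtail emits today,
// so existing Grafana/LogQL queries keep working.
discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
}

loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-stack.monitoring:3100/loki/api/v1/push"
  }
}
```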
## Decisions
### From Research (v2.0)
- Use Grafana Alloy instead of Promtail (EOL March 2026)
- Loki monolithic (single-binary) mode with 7-day retention is appropriate for a single-instance deployment at this scale
- kube-prometheus-stack is the standard for k8s observability
### Phase-specific
- **Grafana ingress**: Leave as-is (external proxy works, OBS-07 satisfied)
- **Alloy migration**: Replace Promtail DaemonSet with Alloy DaemonSet
- **TaskPlanner metrics**: Add prom-client to SvelteKit app (standard Node.js client)
- **Alloy labels**: Match existing Promtail labels (namespace, pod, container) for query compatibility
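
As a sketch of the TaskPlanner metrics decision: a SvelteKit server route using prom-client's default registry could look like the following. The file path and handler shape follow SvelteKit conventions; nothing about the route location has been decided in this phase:

```typescript
// src/routes/metrics/+server.ts — hypothetical location for the endpoint
import client from 'prom-client';

// Default process/Node.js metrics only, per the "minimal overhead" decision.
client.collectDefaultMetrics();

export async function GET() {
  // register.metrics() renders the Prometheus text exposition format.
  return new Response(await client.register.metrics(), {
    headers: { 'content-type': client.register.contentType },
  });
}
```

An HTTP request-duration histogram (mentioned under Risks) would be added separately via `new client.Histogram(...)` in a SvelteKit server hook.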
## Requirements Mapping
| Requirement | Current State | Phase 8 Action |
|-------------|---------------|----------------|
| OBS-01 | Partial (cluster only) | Add TaskPlanner ServiceMonitor |
| OBS-02 | ✅ Done | Verify dashboards work |
| OBS-03 | ✅ Done | Loki running |
| OBS-04 | ⚠️ Promtail | Migrate to Alloy DaemonSet |
| OBS-05 | Configured | Verify log queries work |
| OBS-06 | 35 rules exist | Verify critical alerts fire |
| OBS-07 | ✅ Done | Grafana accessible via TLS |
| OBS-08 | ❌ Missing | Add /metrics endpoint to TaskPlanner |
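
The OBS-01 gap can be closed with a ServiceMonitor along these lines. This is a sketch, not the final manifest: the `app: taskplanner` selector, the `http` port name, and the `release` label (which must match the kube-prometheus-stack Prometheus `serviceMonitorSelector`) are assumptions to verify against the actual Service and Helm release:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: default
  labels:
    release: kube-prometheus-stack  # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: taskplanner  # assumed Service label
  endpoints:
    - port: http        # assumed named port on the Service
      path: /metrics
      interval: 30s
```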
## Plan Outline
1. **08-01**: TaskPlanner metrics endpoint + ServiceMonitor
- Add prom-client to app
- Expose /metrics endpoint
- Create ServiceMonitor for Prometheus scraping
2. **08-02**: Promtail → Alloy migration
- Deploy Grafana Alloy DaemonSet
- Configure log forwarding to Loki
- Remove Promtail DaemonSet
- Verify logs still flow
3. **08-03**: Verification
- Verify Grafana can query Loki logs
- Verify TaskPlanner metrics appear in Prometheus
- Verify KubePodCrashLooping alert exists
- End-to-end log flow test
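
The 08-03 checks above can be run as cluster-dependent spot checks (these only work where `kubectl` can reach the cluster; the Prometheus host is a placeholder):

```shell
# ServiceMonitor picked up and rule present?
kubectl -n default get servicemonitors
kubectl -n monitoring get prometheusrules -o yaml | grep KubePodCrashLooping

# TaskPlanner target scraped by Prometheus? (<prometheus-host> is a placeholder)
curl -s 'http://<prometheus-host>:9090/api/v1/targets' | grep taskplanner

# Logs still flowing through Alloy into Loki?
curl -s -G 'http://loki-stack.monitoring:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="default"}' | head
```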
## Risks
| Risk | Mitigation |
|------|------------|
| Log gap during Promtail→Alloy switch | Deploy Alloy first, verify working, then remove Promtail |
| prom-client adds overhead | Use minimal default metrics (process, http request duration) |
| Alloy config complexity | Start with minimal config matching Promtail behavior |
---
*Context gathered: 2026-02-03*
*Decision: Focus on gaps + Alloy migration*