# Phase 8: Observability Stack - Context

**Goal:** Full visibility into cluster and application health via metrics, logs, and dashboards

**Status:** Mostly pre-existing infrastructure, focusing on gaps

## Discovery Summary

The observability stack is largely already installed (running for 15 days). Phase 8 focuses on:

1. Gaps in the existing setup
2. Migration from Promtail to Alloy (Promtail is EOL March 2026)
3. TaskPlanner-specific observability

### What's Already Working

| Component | Status | Details |
|-----------|--------|---------|
| Prometheus | ✅ Running | kube-prometheus-stack, scraping cluster metrics |
| Grafana | ✅ Running | Accessible at grafana.kube2.tricnet.de (HTTP 200) |
| Loki | ✅ Running | loki-stack-0 pod, configured as Grafana datasource |
| AlertManager | ✅ Running | 35 PrometheusRules configured |
| Node Exporters | ✅ Running | 5 pods across nodes |
| Kube-state-metrics | ✅ Running | Cluster state metrics |
| Promtail | ⚠️ Running | 5 DaemonSet pods - needs migration to Alloy |

### What's Missing

| Gap | Requirement | Details |
|-----|-------------|---------|
| TaskPlanner /metrics | OBS-08 | App doesn't expose a Prometheus metrics endpoint |
| TaskPlanner ServiceMonitor | OBS-01 | No scrape config for app metrics |
| Alloy migration | OBS-04 | Promtail running but EOL March 2026 |
| Verify Loki queries | OBS-05 | Datasource configured; need to verify log queries work |
| Critical alerts verification | OBS-06 | Rules exist; need to verify KubePodCrashLooping fires |
| Grafana TLS ingress | OBS-07 | Works via external proxy, not a k8s ingress |

## Infrastructure Context

### Cluster Details

- k3s cluster with 5 nodes (1 master + 4 workers, based on node-exporter count)
- Namespace: `monitoring` for all observability components
- Namespace: `default` for TaskPlanner

### Grafana Access

- URL: https://grafana.kube2.tricnet.de
- Admin password: `GrafanaAdmin2026` (from secret)
- Service type: ClusterIP (exposed via external proxy, not a k8s ingress)
- Datasources configured: Prometheus, Alertmanager, Loki (2x entries)

### Loki Configuration

- Service: `loki-stack:3100` (ClusterIP)
- Storage: not checked (likely local filesystem)
- Retention: not checked

### Promtail (to be replaced)

- 5 DaemonSet pods running
- Forwards to loki-stack:3100
- EOL March 2026 - migrate to Grafana Alloy

## Decisions

### From Research (v2.0)

- Use Grafana Alloy instead of Promtail (EOL March 2026)
- Loki in monolithic mode with 7-day retention is appropriate for a single-node deployment
- kube-prometheus-stack is the standard for k8s observability

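Since the running Loki's retention was not checked, here is a minimal sketch of what the 7-day retention decision could look like as Helm value overrides for the existing loki-stack release. The `loki.config` nesting and the compactor keys are assumptions to verify against `helm get values` and the installed Loki version, not a confirmed layout:

```yaml
# values-loki.yaml - hypothetical overrides for the existing loki-stack release.
# Assumes the loki-stack chart passes Loki's own config through `loki.config`;
# verify against the deployed values before applying.
loki:
  config:
    compactor:
      retention_enabled: true     # compactor-based retention must be switched on
    limits_config:
      retention_period: 168h      # 7 days, per the research decision above
```
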
### Phase-specific

- **Grafana ingress**: Leave as-is (external proxy works, OBS-07 satisfied)
- **Alloy migration**: Replace the Promtail DaemonSet with an Alloy DaemonSet
- **TaskPlanner metrics**: Add prom-client to the SvelteKit app (standard Node.js client)
- **Alloy labels**: Match existing Promtail labels (namespace, pod, container) for query compatibility

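To make the TaskPlanner metrics decision concrete, a minimal sketch of the scrape wiring follows, assuming the app's Service is named `taskplanner` in the `default` namespace, listens on port 3000, and serves prom-client output at `/metrics`. All names, labels, and the port are placeholders to align with the actual manifests:

```yaml
# Hypothetical Service + ServiceMonitor for TaskPlanner (names, labels, port are assumptions).
apiVersion: v1
kind: Service
metadata:
  name: taskplanner
  namespace: default
  labels:
    app: taskplanner
spec:
  selector:
    app: taskplanner
  ports:
    - name: http            # the ServiceMonitor references this port by name
      port: 3000
      targetPort: 3000
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: must match the Prometheus serviceMonitorSelector (Helm release label)
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: taskplanner
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```

The endpoint itself would just return the prom-client default registry output (`register.metrics()`), which keeps the overhead concern noted in the Risks section small.
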
## Requirements Mapping

| Requirement | Current State | Phase 8 Action |
|-------------|---------------|----------------|
| OBS-01 | Partial (cluster only) | Add TaskPlanner ServiceMonitor |
| OBS-02 | ✅ Done | Verify dashboards work |
| OBS-03 | ✅ Done | Loki running |
| OBS-04 | ⚠️ Promtail | Migrate to Alloy DaemonSet |
| OBS-05 | Configured | Verify log queries work |
| OBS-06 | 35 rules exist | Verify critical alerts fire |
| OBS-07 | ✅ Done | Grafana accessible via TLS |
| OBS-08 | ❌ Missing | Add /metrics endpoint to TaskPlanner |

## Plan Outline

1. **08-01**: TaskPlanner metrics endpoint + ServiceMonitor
   - Add prom-client to the app
   - Expose a /metrics endpoint
   - Create a ServiceMonitor for Prometheus scraping

2. **08-02**: Promtail → Alloy migration (see the config sketch after this outline)
   - Deploy a Grafana Alloy DaemonSet
   - Configure log forwarding to Loki
   - Remove the Promtail DaemonSet
   - Verify logs still flow

3. **08-03**: Verification
   - Verify Grafana can query Loki logs
   - Verify TaskPlanner metrics appear in Prometheus
   - Verify the KubePodCrashLooping alert exists
   - End-to-end log flow test

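As referenced from 08-02 above, a minimal sketch of the Alloy side, assuming the grafana/alloy Helm chart with the pipeline supplied through `alloy.configMap.content`. The component names (`discovery.kubernetes`, `discovery.relabel`, `loki.source.kubernetes`, `loki.write`) are Alloy's, but the chart value layout, the RBAC needed to read pod logs, and the exact label mapping are assumptions to verify against the current Promtail config:

```yaml
# values-alloy.yaml - hypothetical values for a grafana/alloy DaemonSet install.
controller:
  type: daemonset                  # one collector per node, mirroring the Promtail DaemonSet
alloy:
  configMap:
    create: true
    content: |
      // Discover pods and tail their logs through the Kubernetes API.
      discovery.kubernetes "pods" {
        role = "pod"
      }

      // Keep the labels existing Loki queries rely on: namespace, pod, container.
      discovery.relabel "pod_logs" {
        targets = discovery.kubernetes.pods.targets

        rule {
          source_labels = ["__meta_kubernetes_namespace"]
          target_label  = "namespace"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_name"]
          target_label  = "pod"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_container_name"]
          target_label  = "container"
        }
      }

      loki.source.kubernetes "pod_logs" {
        targets    = discovery.relabel.pod_logs.output
        forward_to = [loki.write.default.receiver]
      }

      loki.write "default" {
        endpoint {
          url = "http://loki-stack.monitoring.svc:3100/loki/api/v1/push"
        }
      }
```

Running this alongside Promtail and only removing the Promtail DaemonSet once the same labels appear in Loki lines up with the mitigation in the Risks table below.
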
## Risks

| Risk | Mitigation |
|------|------------|
| Log gap during Promtail→Alloy switch | Deploy Alloy first, verify logs arrive, then remove Promtail |
| prom-client adds overhead | Use minimal default metrics (process metrics, HTTP request duration) |
| Alloy config complexity | Start with a minimal config matching Promtail behavior |

---

*Context gathered: 2026-02-03*
*Decision: Focus on gaps + Alloy migration*