# Phase 8: Observability Stack - Context

**Goal:** Full visibility into cluster and application health via metrics, logs, and dashboards

**Status:** Mostly pre-existing infrastructure, focusing on gaps

## Discovery Summary

The observability stack is largely already installed (running for 15 days). Phase 8 focuses on:

1. Gaps in the existing setup
2. Migration from Promtail to Alloy (Promtail is EOL March 2026)
3. TaskPlanner-specific observability

### What's Already Working

| Component | Status | Details |
|-----------|--------|---------|
| Prometheus | ✅ Running | kube-prometheus-stack, scraping cluster metrics |
| Grafana | ✅ Running | Accessible at grafana.kube2.tricnet.de (HTTP 200) |
| Loki | ✅ Running | loki-stack-0 pod, configured as Grafana datasource |
| AlertManager | ✅ Running | 35 PrometheusRules configured |
| Node Exporters | ✅ Running | 5 pods across nodes |
| Kube-state-metrics | ✅ Running | Cluster state metrics |
| Promtail | ⚠️ Running | 5 DaemonSet pods - needs migration to Alloy |

### What's Missing

| Gap | Requirement | Details |
|-----|-------------|---------|
| TaskPlanner /metrics | OBS-08 | App doesn't expose a Prometheus metrics endpoint |
| TaskPlanner ServiceMonitor | OBS-01 | No scrape config for app metrics |
| Alloy migration | OBS-04 | Promtail running but EOL March 2026 |
| Verify Loki queries | OBS-05 | Datasource configured; need to verify log queries work |
| Critical alerts verification | OBS-06 | Rules exist; need to verify KubePodCrashLooping fires |
| Grafana TLS ingress | OBS-07 | Works via external proxy, not a k8s ingress |

## Infrastructure Context

### Cluster Details

- k3s cluster with 5 nodes (1 master + 4 workers, based on node-exporter count)
- Namespace: `monitoring` for all observability components
- Namespace: `default` for TaskPlanner

### Grafana Access

- URL: https://grafana.kube2.tricnet.de
- Admin password: `GrafanaAdmin2026` (from secret)
- Service type: ClusterIP (exposed via external proxy, not a k8s ingress)
- Datasources configured: Prometheus, Alertmanager, Loki (2x entries)

### Loki Configuration

- Service: `loki-stack:3100` (ClusterIP)
- Storage: not checked (likely local filesystem)
- Retention: not checked

### Promtail (to be replaced)

- 5 DaemonSet pods running
- Forwards to loki-stack:3100
- EOL March 2026 - migrate to Grafana Alloy

## Decisions

### From Research (v2.0)

- Use Grafana Alloy instead of Promtail (EOL March 2026)
- Loki in monolithic mode with 7-day retention is appropriate for a single-node deployment
- kube-prometheus-stack is the standard for k8s observability

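Since the running Loki's retention was not checked, here is a minimal sketch of what the 7-day retention decision could look like as Helm value overrides for the existing loki-stack release. The `loki.config` nesting and the compactor keys are assumptions to verify against `helm get values` and the installed Loki version, not a confirmed layout:

```yaml
# values-loki.yaml - hypothetical overrides for the existing loki-stack release.
# Assumes the loki-stack chart passes Loki's own config through `loki.config`;
# verify against the deployed values before applying.
loki:
  config:
    compactor:
      retention_enabled: true     # compactor-based retention must be switched on
    limits_config:
      retention_period: 168h      # 7 days, per the research decision above
```
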
### Phase-specific

- **Grafana ingress**: Leave as-is (external proxy works, OBS-07 satisfied)
- **Alloy migration**: Replace the Promtail DaemonSet with an Alloy DaemonSet
- **TaskPlanner metrics**: Add prom-client to the SvelteKit app (standard Node.js client)
- **Alloy labels**: Match existing Promtail labels (namespace, pod, container) for query compatibility

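To make the TaskPlanner metrics decision concrete, a minimal sketch of the scrape wiring follows, assuming the app's Service is named `taskplanner` in the `default` namespace, listens on port 3000, and serves prom-client output at `/metrics`. All names, labels, and the port are placeholders to align with the actual manifests:

```yaml
# Hypothetical Service + ServiceMonitor for TaskPlanner (names, labels, port are assumptions).
apiVersion: v1
kind: Service
metadata:
  name: taskplanner
  namespace: default
  labels:
    app: taskplanner
spec:
  selector:
    app: taskplanner
  ports:
    - name: http            # the ServiceMonitor references this port by name
      port: 3000
      targetPort: 3000
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: must match the Prometheus serviceMonitorSelector (Helm release label)
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: taskplanner
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```

The endpoint itself would just return the prom-client default registry output (`register.metrics()`), which keeps the overhead concern noted in the Risks section small.
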
## Requirements Mapping

| Requirement | Current State | Phase 8 Action |
|-------------|---------------|----------------|
| OBS-01 | Partial (cluster only) | Add TaskPlanner ServiceMonitor |
| OBS-02 | ✅ Done | Verify dashboards work |
| OBS-03 | ✅ Done | Loki running |
| OBS-04 | ⚠️ Promtail | Migrate to Alloy DaemonSet |
| OBS-05 | Configured | Verify log queries work |
| OBS-06 | 35 rules exist | Verify critical alerts fire |
| OBS-07 | ✅ Done | Grafana accessible via TLS |
| OBS-08 | ❌ Missing | Add /metrics endpoint to TaskPlanner |

## Plan Outline

1. **08-01**: TaskPlanner metrics endpoint + ServiceMonitor
   - Add prom-client to the app
   - Expose a /metrics endpoint
   - Create a ServiceMonitor for Prometheus scraping

2. **08-02**: Promtail → Alloy migration (see the config sketch after this outline)
   - Deploy a Grafana Alloy DaemonSet
   - Configure log forwarding to Loki
   - Remove the Promtail DaemonSet
   - Verify logs still flow

3. **08-03**: Verification
   - Verify Grafana can query Loki logs
   - Verify TaskPlanner metrics appear in Prometheus
   - Verify the KubePodCrashLooping alert exists
   - End-to-end log flow test

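As referenced from 08-02 above, a minimal sketch of the Alloy side, assuming the grafana/alloy Helm chart with the pipeline supplied through `alloy.configMap.content`. The component names (`discovery.kubernetes`, `discovery.relabel`, `loki.source.kubernetes`, `loki.write`) are Alloy's, but the chart value layout, the RBAC needed to read pod logs, and the exact label mapping are assumptions to verify against the current Promtail config:

```yaml
# values-alloy.yaml - hypothetical values for a grafana/alloy DaemonSet install.
controller:
  type: daemonset                  # one collector per node, mirroring the Promtail DaemonSet
alloy:
  configMap:
    create: true
    content: |
      // Discover pods and tail their logs through the Kubernetes API.
      discovery.kubernetes "pods" {
        role = "pod"
      }

      // Keep the labels existing Loki queries rely on: namespace, pod, container.
      discovery.relabel "pod_logs" {
        targets = discovery.kubernetes.pods.targets

        rule {
          source_labels = ["__meta_kubernetes_namespace"]
          target_label  = "namespace"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_name"]
          target_label  = "pod"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_container_name"]
          target_label  = "container"
        }
      }

      loki.source.kubernetes "pod_logs" {
        targets    = discovery.relabel.pod_logs.output
        forward_to = [loki.write.default.receiver]
      }

      loki.write "default" {
        endpoint {
          url = "http://loki-stack.monitoring.svc:3100/loki/api/v1/push"
        }
      }
```

Running this alongside Promtail and only removing the Promtail DaemonSet once the same labels appear in Loki lines up with the mitigation in the Risks table below.
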
## Risks

| Risk | Mitigation |
|------|------------|
| Log gap during Promtail→Alloy switch | Deploy Alloy first, verify logs arrive, then remove Promtail |
| prom-client adds overhead | Use minimal default metrics (process metrics, HTTP request duration) |
| Alloy config complexity | Start with a minimal config matching Promtail behavior |

---

*Context gathered: 2026-02-03*
*Decision: Focus on gaps + Alloy migration*