Thomas Richter 8c3dc137ca docs(08): create phase plan
Phase 08: Observability Stack
- 3 plans in 2 waves
- Wave 1: 08-01 (metrics), 08-02 (Alloy) - parallel
- Wave 2: 08-03 (verification) - depends on both
- Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 21:24:24 +01:00


Phase 8: Observability Stack - Context

Goal: Full visibility into cluster and application health via metrics, logs, and dashboards
Status: Mostly pre-existing infrastructure; focusing on gaps

Discovery Summary

The observability stack is largely in place already (running for 15 days). Phase 8 focuses on:

  1. Gaps in existing setup
  2. Migration from Promtail to Alloy (Promtail EOL March 2026)
  3. TaskPlanner-specific observability

What's Already Working

Component           Status       Details
Prometheus          Running      kube-prometheus-stack, scraping cluster metrics
Grafana             Running      Accessible at grafana.kube2.tricnet.de (HTTP 200)
Loki                Running      loki-stack-0 pod, configured as Grafana datasource
AlertManager        Running      35 PrometheusRules configured
Node Exporters      Running      5 pods across nodes
Kube-state-metrics  Running      Cluster state metrics
Promtail            ⚠️ Running    5 DaemonSet pods - needs migration to Alloy

What's Missing

Gap                           Requirement  Details
TaskPlanner /metrics          OBS-08       App doesn't expose a Prometheus metrics endpoint
TaskPlanner ServiceMonitor    OBS-01       No scrape config for app metrics
Alloy migration               OBS-04       Promtail running, but EOL is March 2026
Verify Loki queries           OBS-05       Datasource configured; need to verify log queries work
Critical alerts verification  OBS-06       Rules exist; need to verify KubePodCrashLooping fires
Grafana TLS ingress           OBS-07       Works via external proxy, not a k8s ingress

Infrastructure Context

Cluster Details

  • k3s cluster with 5 nodes (1 master + 4 workers, inferred from the node-exporter count)
  • Namespace: monitoring for all observability components
  • Namespace: default for TaskPlanner

Grafana Access

  • URL: https://grafana.kube2.tricnet.de
  • Admin password: GrafanaAdmin2026 (from secret)
  • Service type: ClusterIP (exposed via external proxy, not k8s ingress)
  • Datasources configured: Prometheus, Alertmanager, Loki (listed twice)

Loki Configuration

  • Service: loki-stack:3100 (ClusterIP)
  • Storage: Not checked (likely local filesystem)
  • Retention: Not checked
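A quick pre-check for OBS-05 is to query Loki's HTTP API directly from inside the cluster. A sketch using a throwaway curl pod; only the loki-stack:3100 service name comes from the notes above, and the label selector is illustrative:

```shell
# Run a one-off curl pod in the monitoring namespace and hit Loki's
# query_range endpoint. {namespace="default"} is an example selector.
kubectl run -n monitoring lokicheck --rm -it \
  --image=curlimages/curl --restart=Never -- \
  curl -sG 'http://loki-stack:3100/loki/api/v1/query_range' \
    --data-urlencode 'query={namespace="default"}' \
    --data-urlencode 'limit=5'
```

A non-empty `result` array in the JSON response confirms logs are reaching Loki independently of Grafana.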

Promtail (to be replaced)

  • 5 DaemonSet pods running
  • Forwards to loki-stack:3100
  • EOL: March 2026 - migrate to Grafana Alloy

Decisions

From Research (v2.0)

  • Use Grafana Alloy instead of Promtail (EOL March 2026)
  • Loki monolithic mode with 7-day retention appropriate for single-node
  • kube-prometheus-stack is the standard for k8s observability

Phase-specific

  • Grafana ingress: Leave as-is (external proxy works, OBS-07 satisfied)
  • Alloy migration: Replace Promtail DaemonSet with Alloy DaemonSet
  • TaskPlanner metrics: Add prom-client to SvelteKit app (standard Node.js client)
  • Alloy labels: Match existing Promtail labels (namespace, pod, container) for query compatibility
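The label-compatibility decision above can be sketched as a minimal Alloy configuration. The component names (discovery.kubernetes, discovery.relabel, loki.source.kubernetes, loki.write) are real Alloy components, but the relabel rules are an assumption and should be checked against the current Promtail config before cutover:

```alloy
// Minimal sketch: discover pods, keep the same three labels Promtail
// emits (namespace, pod, container), push to the existing Loki service.

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
}

loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-stack:3100/loki/api/v1/push"
  }
}
```

Running this alongside Promtail during the transition means both agents ship logs (duplicated briefly), which avoids a gap while Alloy is being verified.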

Requirements Mapping

Requirement  Current State           Phase 8 Action
OBS-01       Partial (cluster only)  Add TaskPlanner ServiceMonitor
OBS-02       Done                    Verify dashboards work
OBS-03       Done                    Loki running
OBS-04       ⚠️ Promtail              Migrate to Alloy DaemonSet
OBS-05       Configured              Verify log queries work
OBS-06       35 rules exist          Verify critical alerts fire
OBS-07       Done                    Grafana accessible via TLS
OBS-08       Missing                 Add /metrics endpoint to TaskPlanner

Plan Outline

  1. 08-01: TaskPlanner metrics endpoint + ServiceMonitor

    • Add prom-client to app
    • Expose /metrics endpoint
    • Create ServiceMonitor for Prometheus scraping
  2. 08-02: Promtail → Alloy migration

    • Deploy Grafana Alloy DaemonSet
    • Configure log forwarding to Loki
    • Remove Promtail DaemonSet
    • Verify logs still flow
  3. 08-03: Verification

    • Verify Grafana can query Loki logs
    • Verify TaskPlanner metrics appear in Prometheus
    • Verify KubePodCrashLooping alert exists
    • End-to-end log flow test
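The ServiceMonitor from step 08-01 might look like the following sketch. The app label, port name, and release label are assumptions to be checked against the actual TaskPlanner Service and the kube-prometheus-stack Helm release name:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: default
  labels:
    # kube-prometheus-stack's Prometheus only selects ServiceMonitors
    # matching its release label by default - verify the release name.
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: taskplanner        # assumed Service label
  endpoints:
    - port: http              # assumed named port on the Service
      path: /metrics
      interval: 30s
```

If the target never appears under Status → Targets in Prometheus, the selector labels are the first thing to recheck.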

Risks

Risk                                  Mitigation
Log gap during Promtail→Alloy switch  Deploy Alloy first, verify it works, then remove Promtail
prom-client adds overhead             Use minimal default metrics (process, HTTP request duration)
Alloy config complexity               Start with a minimal config matching Promtail behavior

Context gathered: 2026-02-03
Decision: Focus on gaps + Alloy migration