Thomas Richter 8c3dc137ca docs(08): create phase plan
Phase 08: Observability Stack
- 3 plans in 2 waves
- Wave 1: 08-01 (metrics), 08-02 (Alloy) - parallel
- Wave 2: 08-03 (verification) - depends on both
- Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 21:24:24 +01:00


Phase 8: Observability Stack - Context

Goal: Full visibility into cluster and application health via metrics, logs, and dashboards
Status: Mostly pre-existing infrastructure; focusing on gaps

Discovery Summary

The observability stack is largely in place already (running for 15 days). Phase 8 focuses on:

  1. Gaps in existing setup
  2. Migration from Promtail to Alloy (Promtail EOL March 2026)
  3. TaskPlanner-specific observability

What's Already Working

Component           Status       Details
Prometheus          Running      kube-prometheus-stack, scraping cluster metrics
Grafana             Running      Accessible at grafana.kube2.tricnet.de (HTTP 200)
Loki                Running      loki-stack-0 pod, configured as Grafana datasource
AlertManager        Running      35 PrometheusRules configured
Node Exporters      Running      5 pods across nodes
Kube-state-metrics  Running      Cluster state metrics
Promtail            ⚠️ Running    5 DaemonSet pods - needs migration to Alloy

What's Missing

Gap                           Requirement  Details
TaskPlanner /metrics          OBS-08       App doesn't expose a Prometheus metrics endpoint
TaskPlanner ServiceMonitor    OBS-01       No scrape config for app metrics
Alloy migration               OBS-04       Promtail running, but EOL is March 2026
Verify Loki queries           OBS-05       Datasource configured; need to verify log queries work
Critical alerts verification  OBS-06       Rules exist; need to verify KubePodCrashLooping fires
Grafana TLS ingress           OBS-07       Works via external proxy, not a k8s ingress

Infrastructure Context

Cluster Details

  • k3s cluster with 5 nodes (1 master + 4 workers, inferred from the node-exporter count)
  • Namespace: monitoring for all observability components
  • Namespace: default for TaskPlanner

Grafana Access

  • URL: https://grafana.kube2.tricnet.de
  • Admin password: GrafanaAdmin2026 (from secret)
  • Service type: ClusterIP (exposed via external proxy, not k8s ingress)
  • Datasources configured: Prometheus, Alertmanager, Loki (listed twice)

Loki Configuration

  • Service: loki-stack:3100 (ClusterIP)
  • Storage: Not checked (likely local filesystem)
  • Retention: Not checked
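A quick pre-check for OBS-05 is to query Loki's HTTP API directly from inside the cluster. A sketch using a throwaway curl pod; only the loki-stack:3100 service name comes from the notes above, and the label selector is illustrative:

```shell
# Run a one-off curl pod in the monitoring namespace and hit Loki's
# query_range endpoint. {namespace="default"} is an example selector.
kubectl run -n monitoring lokicheck --rm -it \
  --image=curlimages/curl --restart=Never -- \
  curl -sG 'http://loki-stack:3100/loki/api/v1/query_range' \
    --data-urlencode 'query={namespace="default"}' \
    --data-urlencode 'limit=5'
```

A non-empty `result` array in the JSON response confirms logs are reaching Loki independently of Grafana.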

Promtail (to be replaced)

  • 5 DaemonSet pods running
  • Forwards to loki-stack:3100
  • EOL: March 2026 - migrate to Grafana Alloy

Decisions

From Research (v2.0)

  • Use Grafana Alloy instead of Promtail (EOL March 2026)
  • Loki monolithic mode with 7-day retention appropriate for single-node
  • kube-prometheus-stack is the standard for k8s observability

Phase-specific

  • Grafana ingress: Leave as-is (external proxy works, OBS-07 satisfied)
  • Alloy migration: Replace Promtail DaemonSet with Alloy DaemonSet
  • TaskPlanner metrics: Add prom-client to SvelteKit app (standard Node.js client)
  • Alloy labels: Match existing Promtail labels (namespace, pod, container) for query compatibility
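The label-compatibility decision above can be sketched as a minimal Alloy configuration. The component names (discovery.kubernetes, discovery.relabel, loki.source.kubernetes, loki.write) are real Alloy components, but the relabel rules are an assumption and should be checked against the current Promtail config before cutover:

```alloy
// Minimal sketch: discover pods, keep the same three labels Promtail
// emits (namespace, pod, container), push to the existing Loki service.

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
}

loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-stack:3100/loki/api/v1/push"
  }
}
```

Running this alongside Promtail during the transition means both agents ship logs (duplicated briefly), which avoids a gap while Alloy is being verified.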

Requirements Mapping

Requirement  Current State           Phase 8 Action
OBS-01       Partial (cluster only)  Add TaskPlanner ServiceMonitor
OBS-02       Done                    Verify dashboards work
OBS-03       Done                    Loki running
OBS-04       ⚠️ Promtail              Migrate to Alloy DaemonSet
OBS-05       Configured              Verify log queries work
OBS-06       35 rules exist          Verify critical alerts fire
OBS-07       Done                    Grafana accessible via TLS
OBS-08       Missing                 Add /metrics endpoint to TaskPlanner

Plan Outline

  1. 08-01: TaskPlanner metrics endpoint + ServiceMonitor

    • Add prom-client to app
    • Expose /metrics endpoint
    • Create ServiceMonitor for Prometheus scraping
  2. 08-02: Promtail → Alloy migration

    • Deploy Grafana Alloy DaemonSet
    • Configure log forwarding to Loki
    • Remove Promtail DaemonSet
    • Verify logs still flow
  3. 08-03: Verification

    • Verify Grafana can query Loki logs
    • Verify TaskPlanner metrics appear in Prometheus
    • Verify KubePodCrashLooping alert exists
    • End-to-end log flow test
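The ServiceMonitor from step 08-01 might look like the following sketch. The app label, port name, and release label are assumptions to be checked against the actual TaskPlanner Service and the kube-prometheus-stack Helm release name:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: default
  labels:
    # kube-prometheus-stack's Prometheus only selects ServiceMonitors
    # matching its release label by default - verify the release name.
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: taskplanner        # assumed Service label
  endpoints:
    - port: http              # assumed named port on the Service
      path: /metrics
      interval: 30s
```

If the target never appears under Status → Targets in Prometheus, the selector labels are the first thing to recheck.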

Risks

Risk                                  Mitigation
Log gap during Promtail→Alloy switch  Deploy Alloy first, verify it works, then remove Promtail
prom-client adds overhead             Use minimal default metrics (process, HTTP request duration)
Alloy config complexity               Start with a minimal config matching Promtail behavior

Context gathered: 2026-02-03
Decision: Focus on gaps + Alloy migration