diff --git a/.planning/STATE.md b/.planning/STATE.md index 5bed15b..dc75d3d 100644 --- a/.planning/STATE.md +++ b/.planning/STATE.md @@ -10,11 +10,11 @@ See: .planning/PROJECT.md (updated 2026-02-01) ## Current Position Phase: 8 of 9 (Observability Stack) - IN PROGRESS -Plan: 1 of 3 in current phase - COMPLETE +Plan: 2 of 3 in current phase - COMPLETE Status: In progress -Last activity: 2026-02-03 — Completed 08-01-PLAN.md (TaskPlanner /metrics and ServiceMonitor) +Last activity: 2026-02-03 — Completed 08-02-PLAN.md (Promtail to Alloy Migration) -Progress: [█████████████████████░░░░░░░░░] 84% (21/25 plans complete) +Progress: [██████████████████████░░░░░░░░] 88% (22/25 plans complete) ## Performance Metrics @@ -26,8 +26,8 @@ Progress: [█████████████████████░░ - Requirements satisfied: 31/31 **v2.0 Progress:** -- Plans completed: 3/7 -- Total execution time: 30 min +- Plans completed: 4/7 +- Total execution time: 38 min **By Phase (v1.0):** @@ -45,7 +45,7 @@ Progress: [█████████████████████░░ | Phase | Plans | Total | Avg/Plan | |-------|-------|-------|----------| | 07-gitops-foundation | 2/2 | 26 min | 13 min | -| 08-observability-stack | 1/3 | 4 min | 4 min | +| 08-observability-stack | 2/3 | 12 min | 6 min | ## Accumulated Context @@ -72,6 +72,11 @@ For v2.0, key decisions from research: - Use prom-client default metrics only (no custom metrics for initial setup) - ServiceMonitor enabled by default in values.yaml +**From Phase 8-02:** +- Alloy uses River config language (not YAML) +- Match Promtail labels for Loki query compatibility +- Control-plane node tolerations required for full DaemonSet coverage + ### Pending Todos - Deploy Gitea Actions runner for automatic CI builds @@ -83,10 +88,10 @@ For v2.0, key decisions from research: ## Session Continuity -Last session: 2026-02-03 21:08 UTC -Stopped at: Completed 08-01-PLAN.md +Last session: 2026-02-03 21:12 UTC +Stopped at: Completed 08-02-PLAN.md Resume file: None --- *State initialized: 2026-01-29* -*Last updated: 2026-02-03 — Completed 08-01-PLAN.md (TaskPlanner /metrics and ServiceMonitor)* +*Last updated: 2026-02-03 — Completed 08-02-PLAN.md (Promtail to Alloy Migration)* diff --git a/.planning/phases/08-observability-stack/08-02-SUMMARY.md b/.planning/phases/08-observability-stack/08-02-SUMMARY.md new file mode 100644 index 0000000..df3d114 --- /dev/null +++ b/.planning/phases/08-observability-stack/08-02-SUMMARY.md @@ -0,0 +1,114 @@ +--- +phase: 08-observability-stack +plan: 02 +subsystem: infra +tags: [alloy, grafana, loki, logging, daemonset, helm] + +# Dependency graph +requires: + - phase: 08-01 + provides: Prometheus ServiceMonitor pattern for TaskPlanner +provides: + - Grafana Alloy DaemonSet replacing Promtail + - Log forwarding to Loki via loki.write endpoint + - Helm chart wrapper for alloy configuration +affects: [08-03-verification, future-logging] + +# Tech tracking +tech-stack: + added: [grafana-alloy, river-config] + patterns: [daemonset-tolerations, helm-umbrella-chart] + +key-files: + created: + - helm/alloy/Chart.yaml + - helm/alloy/values.yaml + modified: [] + +key-decisions: + - "Match Promtail labels (namespace, pod, container) for query compatibility" + - "Add control-plane tolerations to run on all 5 nodes" + - "Disable Promtail in loki-stack rather than manual delete" + +patterns-established: + - "River config: Alloy uses River language not YAML for log pipelines" + - "DaemonSet tolerations: control-plane nodes need explicit tolerations" + +# Metrics +duration: 8min +completed: 2026-02-03 +--- + +# Phase 8 Plan 02: Promtail to Alloy Migration Summary + +**Grafana Alloy DaemonSet deployed on all 5 nodes, forwarding logs to Loki with Promtail removed** + +## Performance + +- **Duration:** 8 min +- **Started:** 2026-02-03T21:04:24Z +- **Completed:** 2026-02-03T21:12:07Z +- **Tasks:** 2 +- **Files created:** 2 + +## Accomplishments +- Deployed Grafana Alloy as DaemonSet via Helm umbrella chart +- Configured River config for Kubernetes pod log discovery with matching labels +- Verified log flow to Loki before and after Promtail removal +- Cleanly removed Promtail by disabling in loki-stack values + +## Task Commits + +Each task was committed atomically: + +1. **Task 1: Deploy Grafana Alloy via Helm** - `c295228` (feat) +2. **Task 2: Verify log flow and remove Promtail** - no code changes (kubectl operations only) + +**Plan metadata:** Pending + +## Files Created/Modified +- `helm/alloy/Chart.yaml` - Umbrella chart for grafana/alloy dependency +- `helm/alloy/values.yaml` - Alloy River config for Loki forwarding with DaemonSet tolerations + +## Decisions Made +- **Match Promtail labels:** Kept same label extraction (namespace, pod, container) for query compatibility with existing dashboards +- **Control-plane tolerations:** Added tolerations for master/control-plane nodes to ensure Alloy runs on all 5 nodes (not just 2 workers) +- **Promtail removal via Helm:** Upgraded loki-stack with `promtail.enabled=false` rather than manual deletion for clean state management + +## Deviations from Plan + +### Auto-fixed Issues + +**1. [Rule 3 - Blocking] Installed Helm locally** +- **Found during:** Task 1 (helm dependency build) +- **Issue:** helm command not found on local system +- **Fix:** Downloaded and installed Helm 3.20.0 to ~/.local/bin/ +- **Files modified:** None (binary installation) +- **Verification:** `helm version` returns correct version +- **Committed in:** N/A (environment setup) + +**2. [Rule 1 - Bug] Added control-plane tolerations** +- **Found during:** Task 1 (DaemonSet verification) +- **Issue:** Alloy only scheduled on 2 nodes (workers), not all 5 +- **Fix:** Added tolerations for node-role.kubernetes.io/master and control-plane +- **Files modified:** helm/alloy/values.yaml +- **Verification:** DaemonSet shows DESIRED=5, READY=5 +- **Committed in:** c295228 (Task 1 commit) + +--- + +**Total deviations:** 2 auto-fixed (1 blocking, 1 bug) +**Impact on plan:** Both fixes necessary for correct operation. No scope creep. + +## Issues Encountered +- Initial "entry too far behind" errors in Alloy logs - expected Loki behavior rejecting old log entries during catch-up, settles automatically +- TaskPlanner logs show "too many open files" warning - unrelated to Alloy migration, pre-existing application issue + +## Next Phase Readiness +- Alloy collecting logs from all pods cluster-wide +- Loki receiving logs via Alloy loki.write endpoint +- Ready for 08-03 verification of end-to-end observability + +--- +*Phase: 08-observability-stack* +*Completed: 2026-02-03*