--- phase: 08-observability-stack plan: 02 subsystem: infra tags: [alloy, grafana, loki, logging, daemonset, helm] # Dependency graph requires: - phase: 08-01 provides: Prometheus ServiceMonitor pattern for TaskPlanner provides: - Grafana Alloy DaemonSet replacing Promtail - Log forwarding to Loki via loki.write endpoint - Helm chart wrapper for alloy configuration affects: [08-03-verification, future-logging] # Tech tracking tech-stack: added: [grafana-alloy, river-config] patterns: [daemonset-tolerations, helm-umbrella-chart] key-files: created: - helm/alloy/Chart.yaml - helm/alloy/values.yaml modified: [] key-decisions: - "Match Promtail labels (namespace, pod, container) for query compatibility" - "Add control-plane tolerations to run on all 5 nodes" - "Disable Promtail in loki-stack rather than manual delete" patterns-established: - "River config: Alloy uses River language not YAML for log pipelines" - "DaemonSet tolerations: control-plane nodes need explicit tolerations" # Metrics duration: 8min completed: 2026-02-03 --- # Phase 8 Plan 02: Promtail to Alloy Migration Summary **Grafana Alloy DaemonSet deployed on all 5 nodes, forwarding logs to Loki with Promtail removed** ## Performance - **Duration:** 8 min - **Started:** 2026-02-03T21:04:24Z - **Completed:** 2026-02-03T21:12:07Z - **Tasks:** 2 - **Files created:** 2 ## Accomplishments - Deployed Grafana Alloy as DaemonSet via Helm umbrella chart - Configured River config for Kubernetes pod log discovery with matching labels - Verified log flow to Loki before and after Promtail removal - Cleanly removed Promtail by disabling in loki-stack values ## Task Commits Each task was committed atomically: 1. **Task 1: Deploy Grafana Alloy via Helm** - `c295228` (feat) 2. **Task 2: Verify log flow and remove Promtail** - no code changes (kubectl operations only) **Plan metadata:** Pending ## Files Created/Modified - `helm/alloy/Chart.yaml` - Umbrella chart for grafana/alloy dependency - `helm/alloy/values.yaml` - Alloy River config for Loki forwarding with DaemonSet tolerations ## Decisions Made - **Match Promtail labels:** Kept same label extraction (namespace, pod, container) for query compatibility with existing dashboards - **Control-plane tolerations:** Added tolerations for master/control-plane nodes to ensure Alloy runs on all 5 nodes (not just 2 workers) - **Promtail removal via Helm:** Upgraded loki-stack with `promtail.enabled=false` rather than manual deletion for clean state management ## Deviations from Plan ### Auto-fixed Issues **1. [Rule 3 - Blocking] Installed Helm locally** - **Found during:** Task 1 (helm dependency build) - **Issue:** helm command not found on local system - **Fix:** Downloaded and installed Helm 3.20.0 to ~/.local/bin/ - **Files modified:** None (binary installation) - **Verification:** `helm version` returns correct version - **Committed in:** N/A (environment setup) **2. [Rule 1 - Bug] Added control-plane tolerations** - **Found during:** Task 1 (DaemonSet verification) - **Issue:** Alloy only scheduled on 2 nodes (workers), not all 5 - **Fix:** Added tolerations for node-role.kubernetes.io/master and control-plane - **Files modified:** helm/alloy/values.yaml - **Verification:** DaemonSet shows DESIRED=5, READY=5 - **Committed in:** c295228 (Task 1 commit) --- **Total deviations:** 2 auto-fixed (1 blocking, 1 bug) **Impact on plan:** Both fixes necessary for correct operation. No scope creep. ## Issues Encountered - Initial "entry too far behind" errors in Alloy logs - expected Loki behavior rejecting old log entries during catch-up, settles automatically - TaskPlanner logs show "too many open files" warning - unrelated to Alloy migration, pre-existing application issue ## Next Phase Readiness - Alloy collecting logs from all pods cluster-wide - Loki receiving logs via Alloy loki.write endpoint - Ready for 08-03 verification of end-to-end observability --- *Phase: 08-observability-stack* *Completed: 2026-02-03*