docs(08): create phase plan
Phase 08: Observability Stack - 3 plans in 2 waves

- Wave 1: 08-01 (metrics), 08-02 (Alloy) - parallel
- Wave 2: 08-03 (verification) - depends on both

Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
.planning/phases/08-observability-stack/08-01-PLAN.md (new file)
@@ -0,0 +1,174 @@
---
phase: 08-observability-stack
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - package.json
  - src/routes/metrics/+server.ts
  - src/lib/server/metrics.ts
  - helm/taskplaner/templates/servicemonitor.yaml
  - helm/taskplaner/values.yaml
autonomous: true

must_haves:
  truths:
    - "TaskPlanner /metrics endpoint returns Prometheus-format text"
    - "ServiceMonitor exists in Helm chart templates"
    - "Prometheus can discover TaskPlanner via ServiceMonitor"
  artifacts:
    - path: "src/routes/metrics/+server.ts"
      provides: "Prometheus metrics HTTP endpoint"
      exports: ["GET"]
    - path: "src/lib/server/metrics.ts"
      provides: "prom-client registry and metrics definitions"
      contains: "collectDefaultMetrics"
    - path: "helm/taskplaner/templates/servicemonitor.yaml"
      provides: "ServiceMonitor for Prometheus Operator"
      contains: "kind: ServiceMonitor"
  key_links:
    - from: "src/routes/metrics/+server.ts"
      to: "src/lib/server/metrics.ts"
      via: "import register"
      pattern: "import.*register.*from.*metrics"
    - from: "helm/taskplaner/templates/servicemonitor.yaml"
      to: "tp-app service"
      via: "selector matchLabels"
      pattern: "selector.*matchLabels"
---

<objective>
Add Prometheus metrics endpoint to TaskPlanner and ServiceMonitor for scraping

Purpose: Enable Prometheus to collect application metrics from TaskPlanner (OBS-08, OBS-01)
Output: /metrics endpoint returning prom-client default metrics, ServiceMonitor in Helm chart
</objective>

<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
@package.json
@src/routes/health/+server.ts
@helm/taskplaner/values.yaml
@helm/taskplaner/templates/service.yaml
</context>

<tasks>

<task type="auto">
<name>Task 1: Add prom-client and create /metrics endpoint</name>
<files>
package.json
src/lib/server/metrics.ts
src/routes/metrics/+server.ts
</files>
<action>
1. Install prom-client:

```bash
npm install prom-client
```

2. Create src/lib/server/metrics.ts:
- Import prom-client's Registry and collectDefaultMetrics
- Create a new Registry instance
- Call collectDefaultMetrics({ register: registry }) to collect Node.js process metrics
- Export the registry
- Keep it minimal - just default metrics (memory, CPU, event loop lag)

3. Create src/routes/metrics/+server.ts:
- Import the registry from $lib/server/metrics
- Create a GET handler that returns registry.metrics() with Content-Type: text/plain; version=0.0.4
- Handle errors gracefully (return 500 on failure)
- Pattern follows the existing /health endpoint structure

NOTE: prom-client is the standard Node.js Prometheus client. Use default metrics only - no custom metrics needed for this phase.
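
The two files above can be sketched as follows - a minimal sketch, assuming prom-client v14+ (where registry.metrics() is async) and SvelteKit's RequestHandler type; names are illustrative, not the final implementation:

```typescript
// src/lib/server/metrics.ts - sketch, assumes `prom-client` is installed
import { Registry, collectDefaultMetrics } from 'prom-client';

// Dedicated registry so app metrics stay isolated from any global state
export const registry = new Registry();

// Default Node.js process metrics: memory, CPU, event loop lag, GC
collectDefaultMetrics({ register: registry });
```

```typescript
// src/routes/metrics/+server.ts - sketch of the GET handler
import type { RequestHandler } from '@sveltejs/kit';
import { registry } from '$lib/server/metrics';

export const GET: RequestHandler = async () => {
  try {
    // registry.metrics() renders the text exposition format
    const body = await registry.metrics();
    return new Response(body, {
      headers: { 'Content-Type': 'text/plain; version=0.0.4' }
    });
  } catch {
    return new Response('metrics collection failed', { status: 500 });
  }
};
```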
</action>
<verify>
1. npm run build completes without errors
2. npm run dev, then curl http://localhost:5173/metrics returns text starting with "# HELP" or "# TYPE"
3. Response Content-Type header includes "text/plain"
</verify>
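
To make the verify step concrete, here is a small self-contained helper (hypothetical, not one of the plan's files) showing what "Prometheus-format text" means structurally:

```typescript
// Rough structural check of the Prometheus text exposition format:
// comment lines (# HELP / # TYPE) plus sample lines like
// `metric_name{label="x"} 1.23`.
function looksLikePrometheusText(body: string): boolean {
  const lines = body.split("\n").filter((l) => l.trim() !== "");
  if (lines.length === 0) return false;
  const sampleLine = /^[a-zA-Z_:][a-zA-Z0-9_:]*(\{.*\})?\s+[-+0-9.eE]+(\s+\d+)?$/;
  return lines.every((l) => l.startsWith("# ") || sampleLine.test(l));
}

const sample = [
  "# HELP process_cpu_seconds_total Total user and system CPU time.",
  "# TYPE process_cpu_seconds_total counter",
  "process_cpu_seconds_total 1.23",
].join("\n");

console.log(looksLikePrometheusText(sample)); // prints "true"
```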
<done>
/metrics endpoint returns Prometheus-format metrics including process_cpu_seconds_total, nodejs_heap_size_total_bytes
</done>
</task>

<task type="auto">
<name>Task 2: Add ServiceMonitor to Helm chart</name>
<files>
helm/taskplaner/templates/servicemonitor.yaml
helm/taskplaner/values.yaml
</files>
<action>
1. Create helm/taskplaner/templates/servicemonitor.yaml:

```yaml
{{- if .Values.metrics.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "taskplaner.fullname" . }}
  labels:
    {{- include "taskplaner.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "taskplaner.selectorLabels" . | nindent 6 }}
  endpoints:
    - port: http
      path: /metrics
      interval: {{ .Values.metrics.interval | default "30s" }}
  namespaceSelector:
    matchNames:
      - {{ .Release.Namespace }}
{{- end }}
```

2. Update helm/taskplaner/values.yaml - add a metrics section:

```yaml
# Prometheus metrics
metrics:
  enabled: true
  interval: 30s
```

3. Ensure the service template exposes a port named "http" (check the existing service.yaml - it likely already does via targetPort: http)

NOTE: The ServiceMonitor uses the monitoring.coreos.com/v1 API which kube-prometheus-stack provides. The namespaceSelector ensures Prometheus finds TaskPlanner in the default namespace.
</action>
<verify>
1. helm template ./helm/taskplaner includes a ServiceMonitor resource
2. helm template output shows a selector matching app.kubernetes.io/name: taskplaner
3. No helm lint errors
</verify>
<done>
ServiceMonitor template renders correctly with a selector matching the TaskPlanner service, ready for Prometheus to discover
</done>
</task>

</tasks>

<verification>
- [ ] npm run build succeeds
- [ ] curl localhost:5173/metrics returns Prometheus-format text
- [ ] helm template ./helm/taskplaner shows ServiceMonitor resource
- [ ] ServiceMonitor selector matches service labels
</verification>

<success_criteria>
1. /metrics endpoint returns Prometheus-format metrics (process metrics, heap size, event loop)
2. ServiceMonitor added to Helm chart templates
3. ServiceMonitor enabled by default in values.yaml
4. Build and type check pass
</success_criteria>

<output>
After completion, create `.planning/phases/08-observability-stack/08-01-SUMMARY.md`
</output>
.planning/phases/08-observability-stack/08-02-PLAN.md (new file)
@@ -0,0 +1,229 @@
---
phase: 08-observability-stack
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
  - helm/alloy/values.yaml (new)
  - helm/alloy/Chart.yaml (new)
autonomous: true

must_haves:
  truths:
    - "Alloy DaemonSet runs on all nodes"
    - "Alloy forwards logs to Loki"
    - "Promtail DaemonSet is removed"
  artifacts:
    - path: "helm/alloy/Chart.yaml"
      provides: "Alloy Helm chart wrapper"
      contains: "name: alloy"
    - path: "helm/alloy/values.yaml"
      provides: "Alloy configuration for Loki forwarding"
      contains: "loki.write"
  key_links:
    - from: "Alloy pods"
      to: "loki-stack:3100"
      via: "loki.write endpoint"
      pattern: "endpoint.*loki"
---

<objective>
Migrate from Promtail to Grafana Alloy for log collection

Purpose: Replace Promtail (EOL March 2026) with a Grafana Alloy DaemonSet (OBS-04)
Output: Alloy DaemonSet forwarding logs to Loki, Promtail removed
</objective>

<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Deploy Grafana Alloy via Helm</name>
<files>
helm/alloy/Chart.yaml
helm/alloy/values.yaml
</files>
<action>
1. Create the helm/alloy directory and Chart.yaml as an umbrella chart:

```yaml
apiVersion: v2
name: alloy
description: Grafana Alloy log collector
version: 0.1.0
dependencies:
  - name: alloy
    version: "0.12.*"
    repository: https://grafana.github.io/helm-charts
```

2. Create helm/alloy/values.yaml with a minimal config for Loki forwarding:

```yaml
alloy:
  alloy:
    configMap:
      content: |
        // Discover pods and collect logs
        discovery.kubernetes "pods" {
          role = "pod"
        }

        // Relabel to extract pod metadata
        discovery.relabel "pods" {
          targets = discovery.kubernetes.pods.targets

          rule {
            source_labels = ["__meta_kubernetes_namespace"]
            target_label  = "namespace"
          }
          rule {
            source_labels = ["__meta_kubernetes_pod_name"]
            target_label  = "pod"
          }
          rule {
            source_labels = ["__meta_kubernetes_pod_container_name"]
            target_label  = "container"
          }
        }

        // Collect logs from discovered pods
        loki.source.kubernetes "pods" {
          targets    = discovery.relabel.pods.output
          forward_to = [loki.write.default.receiver]
        }

        // Forward to Loki
        loki.write "default" {
          endpoint {
            url = "http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/push"
          }
        }

  controller:
    type: daemonset

  serviceAccount:
    create: true
```

3. Add the Grafana Helm repo and build dependencies:

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
cd helm/alloy && helm dependency build
```

4. Deploy Alloy to the monitoring namespace:

```bash
helm upgrade --install alloy ./helm/alloy -n monitoring --create-namespace
```

5. Verify Alloy pods are running:

```bash
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
```

Expected: 5 pods (one per node) in Running state

NOTE:
- Alloy uses its own configuration syntax (formerly called River), not YAML
- Labels (namespace, pod, container) match the existing Promtail labels for query compatibility
- The Loki endpoint is cluster-internal: loki-stack.monitoring.svc.cluster.local:3100
</action>
<verify>
1. kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy shows 5 Running pods
2. kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=20 shows no errors
3. Alloy logs show the "loki.write" component started successfully
</verify>
<done>
Alloy DaemonSet deployed with 5 pods collecting logs and forwarding to Loki
</done>
</task>

<task type="auto">
<name>Task 2: Verify log flow and remove Promtail</name>
<files>
(no files - kubectl operations)
</files>
<action>
1. Generate a test log by restarting the TaskPlanner pod:

```bash
kubectl rollout restart deployment taskplaner
```

2. Wait for the pod to be ready:

```bash
kubectl rollout status deployment taskplaner --timeout=60s
```

3. Verify logs appear in Loki via LogCLI or curl:

```bash
# Query recent TaskPlanner logs via the Loki API
kubectl run --rm -it logtest --image=curlimages/curl --restart=Never -- \
  curl -s -G "http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="default",pod=~"taskplaner.*"}' \
  --data-urlencode 'limit=5'
```

Expected: JSON response with "result" containing log entries

4. Once logs are confirmed flowing via Alloy, remove Promtail:

```bash
# Find and delete the Promtail release
helm list -n monitoring | grep promtail
# If promtail is found:
helm uninstall loki-stack-promtail -n monitoring 2>/dev/null || \
helm uninstall promtail -n monitoring 2>/dev/null || \
kubectl delete daemonset -n monitoring -l app=promtail
```

5. Verify Promtail is gone:

```bash
kubectl get pods -n monitoring | grep -i promtail
```

Expected: No promtail pods

6. Verify logs are still flowing after Promtail removal (repeat step 3)

NOTE: Promtail may be installed as part of loki-stack or separately. Check both.
</action>
<verify>
1. Loki API returns TaskPlanner log entries
2. kubectl get pods -n monitoring shows NO promtail pods
3. kubectl get pods -n monitoring shows Alloy pods still running
4. A second Loki query after Promtail removal still returns logs
</verify>
<done>
Logs confirmed flowing from Alloy to Loki, Promtail DaemonSet removed from the cluster
</done>
</task>

</tasks>

<verification>
- [ ] Alloy DaemonSet has 5 Running pods (one per node)
- [ ] Alloy pods show no errors in logs
- [ ] Loki API returns TaskPlanner log entries
- [ ] Promtail pods no longer exist
- [ ] Log flow continues after Promtail removal
</verification>

<success_criteria>
1. Alloy DaemonSet running on all 5 nodes
2. Logs from TaskPlanner appear in Loki within 60 seconds of generation
3. Promtail DaemonSet completely removed
4. No log collection gap (Alloy verified before Promtail removal)
</success_criteria>

<output>
After completion, create `.planning/phases/08-observability-stack/08-02-SUMMARY.md`
</output>
.planning/phases/08-observability-stack/08-03-PLAN.md (new file)
@@ -0,0 +1,233 @@
---
phase: 08-observability-stack
plan: 03
type: execute
wave: 2
depends_on: ["08-01", "08-02"]
files_modified: []
autonomous: false

must_haves:
  truths:
    - "Prometheus scrapes TaskPlanner /metrics endpoint"
    - "Grafana can query TaskPlanner logs via Loki"
    - "KubePodCrashLooping alert rule exists"
  artifacts: []
  key_links:
    - from: "Prometheus"
      to: "TaskPlanner /metrics"
      via: "ServiceMonitor"
      pattern: "servicemonitor.*taskplaner"
    - from: "Grafana Explore"
      to: "Loki datasource"
      via: "LogQL query"
      pattern: "namespace.*default.*taskplaner"
---

<objective>
Verify the end-to-end observability stack: metrics scraping, log queries, and alerting

Purpose: Confirm all Phase 8 requirements are satisfied (OBS-01 through OBS-08)
Output: Verified observability stack with documented proof of functionality
</objective>

<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/08-observability-stack/CONTEXT.md
@.planning/phases/08-observability-stack/08-01-SUMMARY.md
@.planning/phases/08-observability-stack/08-02-SUMMARY.md
</context>

<tasks>

<task type="auto">
<name>Task 1: Deploy TaskPlanner with ServiceMonitor and verify Prometheus scraping</name>
<files>
(no files - deployment and verification)
</files>
<action>
1. Commit and push the metrics endpoint and ServiceMonitor changes from 08-01:

```bash
git add .
git commit -m "feat(metrics): add /metrics endpoint and ServiceMonitor

- Add prom-client for Prometheus metrics
- Expose /metrics endpoint with default Node.js metrics
- Add ServiceMonitor template to Helm chart

OBS-08, OBS-01"
git push
```

2. Wait for ArgoCD to sync (or trigger a manual sync):

```bash
# Check ArgoCD sync status
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
# If not synced, wait up to 3 minutes or trigger:
argocd app sync taskplaner --server argocd.tricnet.be --insecure 2>/dev/null || \
kubectl patch application taskplaner -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{}}}'
```

3. Wait for the deployment to complete:

```bash
kubectl rollout status deployment taskplaner --timeout=120s
```

4. Verify the ServiceMonitor was created:

```bash
kubectl get servicemonitor taskplaner
```

Expected: ServiceMonitor exists

5. Verify Prometheus is scraping TaskPlanner:

```bash
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 3

# Query for TaskPlanner targets
curl -s "http://localhost:9090/api/v1/targets" | grep -A5 "taskplaner"

# Kill the port-forward
kill %1 2>/dev/null
```

Expected: TaskPlanner target shows state: "up"

6. Query a TaskPlanner metric (use -G with --data-urlencode so curl does not treat the braces in the PromQL selector as URL globbing):

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 3
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=process_cpu_seconds_total{namespace="default",pod=~"taskplaner.*"}' \
  | jq '.data.result[0].value'
kill %1 2>/dev/null
```

Expected: Returns a numeric value

NOTE: If the ArgoCD sync takes too long, the earlier push may already have triggered a sync automatically.
</action>
<verify>
1. kubectl get servicemonitor taskplaner returns a resource
2. Prometheus targets API shows TaskPlanner with state "up"
3. Prometheus query returns a process_cpu_seconds_total value for TaskPlanner
</verify>
<done>
Prometheus successfully scraping the TaskPlanner /metrics endpoint via ServiceMonitor
</done>
</task>

<task type="auto">
<name>Task 2: Verify critical alert rules exist</name>
<files>
(no files - verification only)
</files>
<action>
1. List PrometheusRules to find pod crash alerting:

```bash
kubectl get prometheusrules -n monitoring -o name | head -20
```

2. Search for the KubePodCrashLooping alert:

```bash
kubectl get prometheusrules -n monitoring -o yaml | grep -A10 "KubePodCrashLooping"
```

Expected: Alert rule definition found

3. If not found by name, search for crash-related alerts:

```bash
kubectl get prometheusrules -n monitoring -o yaml | grep -i "crash\|restart\|CrashLoopBackOff" | head -10
```

4. Verify Alertmanager is running:

```bash
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
```

Expected: alertmanager pod(s) Running

5. Check current alerts (the list should be empty if the cluster is healthy):

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head -10
kill %1 2>/dev/null
```

NOTE: kube-prometheus-stack includes default Kubernetes alerting rules. KubePodCrashLooping is a standard rule that fires when a pod is repeatedly restarting (crash looping).
</action>
<verify>
1. kubectl get prometheusrules finds KubePodCrashLooping or an equivalent crash alert
2. Alertmanager pod is Running
3. Alertmanager API responds (even if the alert list is empty)
</verify>
<done>
KubePodCrashLooping alert rule confirmed present, Alertmanager operational
</done>
</task>

<task type="checkpoint:human-verify" gate="blocking">
<what-built>
Full observability stack:
- TaskPlanner /metrics endpoint (OBS-08)
- Prometheus scraping via ServiceMonitor (OBS-01)
- Alloy collecting logs (OBS-04)
- Loki storing logs (OBS-03)
- Critical alerts configured (OBS-06)
- Grafana dashboards (OBS-02)
</what-built>
<how-to-verify>
1. Open Grafana: https://grafana.kube2.tricnet.de
   - Login: admin / GrafanaAdmin2026

2. Verify dashboards (OBS-02):
   - Go to Dashboards
   - Open "Kubernetes / Compute Resources / Namespace (Pods)" or similar
   - Select namespace: default
   - Confirm TaskPlanner pod metrics are visible

3. Verify log queries (OBS-05):
   - Go to Explore
   - Select the Loki datasource
   - Enter the query: {namespace="default", pod=~"taskplaner.*"}
   - Click Run Query
   - Confirm TaskPlanner logs appear

4. Verify TaskPlanner metrics in Grafana:
   - Go to Explore
   - Select the Prometheus datasource
   - Enter the query: process_cpu_seconds_total{namespace="default", pod=~"taskplaner.*"}
   - Confirm a metric graph appears

5. Verify Grafana is accessible with TLS (OBS-07):
   - Confirm https:// in the URL bar (no certificate warnings)
</how-to-verify>
<resume-signal>Type "verified" if all checks pass, or describe what failed</resume-signal>
</task>

</tasks>

<verification>
- [ ] ServiceMonitor created and Prometheus scraping TaskPlanner
- [ ] TaskPlanner metrics visible in Prometheus queries
- [ ] KubePodCrashLooping alert rule exists
- [ ] Alertmanager running and responsive
- [ ] Human verified: Grafana dashboards show cluster metrics
- [ ] Human verified: Grafana can query TaskPlanner logs from Loki
- [ ] Human verified: TaskPlanner metrics visible in Grafana
</verification>

<success_criteria>
1. Prometheus scrapes TaskPlanner /metrics (OBS-01, OBS-08 complete)
2. Grafana dashboards display cluster metrics (OBS-02 verified)
3. TaskPlanner logs queryable in Grafana via Loki (OBS-05 verified)
4. KubePodCrashLooping alert rule confirmed (OBS-06 verified)
5. Grafana accessible via TLS (OBS-07 verified)
</success_criteria>

<output>
After completion, create `.planning/phases/08-observability-stack/08-03-SUMMARY.md`
</output>
.planning/phases/08-observability-stack/CONTEXT.md (new file)
@@ -0,0 +1,114 @@
# Phase 8: Observability Stack - Context

**Goal:** Full visibility into cluster and application health via metrics, logs, and dashboards
**Status:** Mostly pre-existing infrastructure, focusing on gaps

## Discovery Summary

The observability stack is largely already installed (running for 15 days). Phase 8 focuses on:
1. Gaps in the existing setup
2. Migration from Promtail to Alloy (Promtail EOL March 2026)
3. TaskPlanner-specific observability

### What's Already Working

| Component | Status | Details |
|-----------|--------|---------|
| Prometheus | ✅ Running | kube-prometheus-stack, scraping cluster metrics |
| Grafana | ✅ Running | Accessible at grafana.kube2.tricnet.de (HTTP 200) |
| Loki | ✅ Running | loki-stack-0 pod, configured as Grafana datasource |
| AlertManager | ✅ Running | 35 PrometheusRules configured |
| Node Exporters | ✅ Running | 5 pods across nodes |
| Kube-state-metrics | ✅ Running | Cluster state metrics |
| Promtail | ⚠️ Running | 5 DaemonSet pods - needs migration to Alloy |

### What's Missing

| Gap | Requirement | Details |
|-----|-------------|---------|
| TaskPlanner /metrics | OBS-08 | App doesn't expose a Prometheus metrics endpoint |
| TaskPlanner ServiceMonitor | OBS-01 | No scraping config for app metrics |
| Alloy migration | OBS-04 | Promtail running but EOL March 2026 |
| Verify Loki queries | OBS-05 | Datasource configured, need to verify logs work |
| Critical alerts verification | OBS-06 | Rules exist, need to verify KubePodCrashLooping |
| Grafana TLS ingress | OBS-07 | Works via external proxy, not k8s ingress |

## Infrastructure Context

### Cluster Details
- k3s cluster with 5 nodes (1 master + 4 workers, based on node-exporter count)
- Namespace: `monitoring` for all observability components
- Namespace: `default` for TaskPlanner

### Grafana Access
- URL: https://grafana.kube2.tricnet.de
- Admin password: `GrafanaAdmin2026` (from secret)
- Service type: ClusterIP (exposed via external proxy, not k8s ingress)
- Datasources configured: Prometheus, Alertmanager, Loki (2x entries)

### Loki Configuration
- Service: `loki-stack:3100` (ClusterIP)
- Storage: Not checked (likely local filesystem)
- Retention: Not checked

### Promtail (to be replaced)
- 5 DaemonSet pods running
- Forwards to loki-stack:3100
- EOL: March 2026 - migrate to Grafana Alloy

## Decisions

### From Research (v2.0)
- Use Grafana Alloy instead of Promtail (EOL March 2026)
- Loki monolithic mode with 7-day retention is appropriate for a single-instance deployment
- kube-prometheus-stack is the standard for k8s observability

### Phase-specific
- **Grafana ingress**: Leave as-is (external proxy works, OBS-07 satisfied)
- **Alloy migration**: Replace the Promtail DaemonSet with an Alloy DaemonSet
- **TaskPlanner metrics**: Add prom-client to the SvelteKit app (the standard Node.js client)
- **Alloy labels**: Match existing Promtail labels (namespace, pod, container) for query compatibility

## Requirements Mapping

| Requirement | Current State | Phase 8 Action |
|-------------|---------------|----------------|
| OBS-01 | Partial (cluster only) | Add TaskPlanner ServiceMonitor |
| OBS-02 | ✅ Done | Verify dashboards work |
| OBS-03 | ✅ Done | Loki running |
| OBS-04 | ⚠️ Promtail | Migrate to Alloy DaemonSet |
| OBS-05 | Configured | Verify log queries work |
| OBS-06 | 35 rules exist | Verify critical alerts fire |
| OBS-07 | ✅ Done | Grafana accessible via TLS |
| OBS-08 | ❌ Missing | Add /metrics endpoint to TaskPlanner |

## Plan Outline

1. **08-01**: TaskPlanner metrics endpoint + ServiceMonitor
   - Add prom-client to the app
   - Expose the /metrics endpoint
   - Create a ServiceMonitor for Prometheus scraping

2. **08-02**: Promtail → Alloy migration
   - Deploy the Grafana Alloy DaemonSet
   - Configure log forwarding to Loki
   - Remove the Promtail DaemonSet
   - Verify logs still flow

3. **08-03**: Verification
   - Verify Grafana can query Loki logs
   - Verify TaskPlanner metrics appear in Prometheus
   - Verify the KubePodCrashLooping alert exists
   - End-to-end log flow test

## Risks

| Risk | Mitigation |
|------|------------|
| Log gap during the Promtail→Alloy switch | Deploy Alloy first, verify it works, then remove Promtail |
| prom-client adds overhead | Use minimal default metrics only (process, memory, event loop) |
| Alloy config complexity | Start with a minimal config matching Promtail behavior |

---
*Context gathered: 2026-02-03*
*Decision: Focus on gaps + Alloy migration*