taskplaner/.planning/research/PITFALLS-CICD-OBSERVABILITY.md
Thomas Richter 5dbabe6a2d docs: complete v2.0 CI/CD and observability research
Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfalls: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 03:29:23 +01:00


# Domain Pitfalls: CI/CD and Observability on k3s
**Domain:** Adding ArgoCD, Prometheus, Grafana, and Loki to existing k3s cluster
**Context:** TaskPlanner on self-hosted k3s with Gitea, Traefik, Longhorn
**Researched:** 2026-02-03
**Confidence:** HIGH (verified with official documentation and community issues)
---
## Critical Pitfalls
Mistakes that cause system instability, data loss, or require significant rework.
### 1. Gitea Webhook JSON Parsing Failure with ArgoCD
**What goes wrong:** ArgoCD receives webhooks from Gitea but fails to parse them with error: `json: cannot unmarshal string into Go struct field .repository.created_at of type int64`. This happens because ArgoCD treats Gitea events as GitHub events instead of Gogs events.
**Why it happens:** Gitea is a fork of Gogs, but when the webhook is configured with the `Gitea` type, ArgoCD parses the payload with its GitHub handler, which expects `repository.created_at` as an int64 timestamp rather than the string Gitea sends. The Gogs handler accepts the string format.
**Consequences:**
- Webhooks silently fail (ArgoCD logs error but continues)
- Must wait for 3-minute polling interval for changes to sync
- False confidence that instant sync is working
**Warning signs:**
- ArgoCD server logs show webhook parsing errors
- Application sync doesn't happen immediately after push
- Webhook delivery shows success in Gitea but no ArgoCD response
**Prevention:**
- Configure webhook with `Gogs` type in Gitea, NOT `Gitea` type
- Test webhook delivery and check ArgoCD server logs: `kubectl logs -n argocd deploy/argocd-server | grep -i webhook`
- Accept 3-minute polling as fallback (webhooks are optional enhancement)
**Phase to address:** ArgoCD installation phase - verify webhook integration immediately
**Sources:**
- [ArgoCD Issue #16453 - Forgejo/Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD Issue #20444 - Gitea support lacking](https://github.com/argoproj/argo-cd/issues/20444)
---
### 2. Loki Disk Full with No Size-Based Retention
**What goes wrong:** Loki fills the entire disk because retention is only time-based, not size-based. When disk fills, Loki crashes with "no space left on device" and becomes completely non-functional - Grafana cannot even fetch labels.
**Why it happens:**
- Retention is disabled by default (`compactor.retention-enabled: false`)
- Loki only supports time-based retention (e.g., 7 days), not size-based
- High-volume logging can fill disk before retention period expires
**Consequences:**
- Complete logging system failure
- May affect other pods sharing the same Longhorn volume
- Recovery requires manual cleanup or volume expansion
**Warning signs:**
- Steadily increasing PVC usage visible in `kubectl get pvc`
- Loki compactor logs show no deletion activity
- Grafana queries become slow before complete failure
**Prevention:**
```yaml
# Loki values.yaml
loki:
  compactor:
    retention_enabled: true
    compaction_interval: 10m
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    working_directory: /loki/compactor
  limits_config:
    retention_period: 168h  # 7 days - adjust based on disk size
```
- Set conservative retention period (start with 7 days)
- Run compactor as StatefulSet with persistent storage for marker files
- Set up Prometheus alert for PVC usage > 80%
- Index period MUST be 24h for retention to work
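The PVC-usage alert can be expressed as a PrometheusRule. The rule name and PVC name pattern below are illustrative; the metrics assume kubelet volume stats are scraped, which kube-prometheus-stack does by default:
```yaml
# Sketch of a PVC-usage alert (names are illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-pvc-usage
  namespace: monitoring
spec:
  groups:
    - name: loki-storage
      rules:
        - alert: LokiPVCAlmostFull
          expr: |
            kubelet_volume_stats_used_bytes{namespace="monitoring", persistentvolumeclaim=~"storage-loki-.*"}
              / kubelet_volume_stats_capacity_bytes > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Loki PVC is over 80% full"
```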
**Phase to address:** Loki installation phase - configure retention from day one
**Sources:**
- [Grafana Loki Retention Documentation](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Loki Issue #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)
---
### 3. Prometheus Volume Growth Exceeds Longhorn PVC
**What goes wrong:** Prometheus metrics storage grows beyond PVC capacity. Longhorn volume expansion via CSI can result in a faulted volume that prevents Prometheus from starting.
**Why it happens:**
- Default Prometheus retention is 15 days with no size limit
- kube-prometheus-stack defaults don't match k3s resource constraints
- Longhorn CSI volume expansion has known issues requiring specific procedure
**Consequences:**
- Prometheus pod stuck in pending/crash loop
- Loss of historical metrics
- Longhorn volume in faulted state requiring manual recovery
**Warning signs:**
- Prometheus pod restarts with OOMKilled or disk errors
- `kubectl describe pvc` shows capacity approaching limit
- Longhorn UI shows volume health degraded
**Prevention:**
```yaml
# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: "8GB"  # Set explicit size limit
    resources:
      requests:
        memory: 400Mi
      limits:
        memory: 600Mi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi
```
- Always set both `retention` AND `retentionSize`
- Size PVC with 20% headroom above retentionSize
- Monitor with `prometheus_tsdb_storage_blocks_bytes` metric
- For expansion: stop pod, detach volume, resize, then restart
**Phase to address:** Prometheus installation phase
**Sources:**
- [Longhorn Issue #2222 - Volume expansion faults](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack Issue #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)
---
### 4. ArgoCD + Traefik TLS Termination Redirect Loop
**What goes wrong:** ArgoCD UI becomes inaccessible with redirect loops or connection refused errors when accessed through Traefik. Browser shows ERR_TOO_MANY_REDIRECTS.
**Why it happens:** Traefik terminates TLS and forwards HTTP to ArgoCD. ArgoCD server, configured for TLS by default, responds with 307 redirects to HTTPS, creating infinite loop.
**Consequences:**
- Cannot access ArgoCD UI via ingress
- CLI may work with port-forward but not through ingress
- gRPC connections for CLI through ingress fail
**Warning signs:**
- Browser redirect loop when accessing ArgoCD URL
- `curl -v` shows 307 redirect responses
- Works with `kubectl port-forward` but not via ingress
**Prevention:**
```yaml
# Option 1: ConfigMap (recommended)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.insecure: "true"
---
# Option 2: Traefik IngressRoute for dual HTTP/gRPC
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    - kind: Rule
      match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c
  tls:
    certResolver: letsencrypt-prod
```
- Set `server.insecure: "true"` in argocd-cmd-params-cm ConfigMap
- Use IngressRoute (not Ingress) for proper gRPC support
- Configure separate routes for HTTP and gRPC with correct priority
**Phase to address:** ArgoCD installation phase - test immediately after ingress setup
**Sources:**
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [Traefik Community - ArgoCD behind Traefik](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)
---
## Moderate Pitfalls
Mistakes that cause delays, debugging sessions, or technical debt.
### 5. ServiceMonitor Not Discovering Targets
**What goes wrong:** Prometheus ServiceMonitors are created but no targets appear in Prometheus. The scrape config shows 0/0 targets up.
**Why it happens:**
- Label selector mismatch between Prometheus CR and ServiceMonitor
- RBAC: Prometheus ServiceAccount lacks permission in target namespace
- Port specified as number instead of name
- ServiceMonitor in different namespace than Prometheus expects
**Prevention:**
```yaml
# Ensure Prometheus CR has permissive selectors
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}  # Select all ServiceMonitors
    serviceMonitorNamespaceSelector: {}  # From all namespaces

# ServiceMonitor must use port NAME not number
spec:
  endpoints:
    - port: metrics  # NOT 9090
```
- Use port name, never port number in ServiceMonitor
- Check RBAC: `kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus -n default`
- Verify label matching: `kubectl get servicemonitor -A --show-labels`
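A minimal ServiceMonitor for the verification step might look like the sketch below; all names are illustrative, and the selected Service must expose a port literally named `metrics`:
```yaml
# Minimal test ServiceMonitor (hypothetical names)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: test-servicemonitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # only needed if your Prometheus selector is not {}
spec:
  selector:
    matchLabels:
      app: test-app
  endpoints:
    - port: metrics  # port NAME from the Service, not a number
      interval: 30s
```
After applying it, the target should appear on the Prometheus `/targets` page within one scrape interval.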
**Phase to address:** Prometheus installation phase, verify with test ServiceMonitor
**Sources:**
- [Prometheus Operator Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [ServiceMonitor not discovered Issue #3383](https://github.com/prometheus-operator/prometheus-operator/issues/3383)
---
### 6. k3s Control Plane Metrics Not Scraped
**What goes wrong:** Prometheus dashboards show no metrics for kube-scheduler, kube-controller-manager, or etcd. These panels appear blank or show "No data."
**Why it happens:** k3s runs control plane components as a single binary, not as pods. Standard kube-prometheus-stack expects to scrape pods that don't exist.
**Prevention:**
```yaml
# kube-prometheus-stack values for k3s
kubeControllerManager:
  enabled: true
  endpoints:
    - 192.168.1.100  # k3s server IP
  service:
    enabled: true
    port: 10257
    targetPort: 10257
kubeScheduler:
  enabled: true
  endpoints:
    - 192.168.1.100
  service:
    enabled: true
    port: 10259
    targetPort: 10259
kubeEtcd:
  enabled: false  # k3s uses embedded sqlite/etcd
```
- Explicitly configure control plane endpoints with k3s server IPs
- Disable etcd monitoring if using embedded database
- OR disable these components entirely for simpler setup
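Note that the scheduler and controller-manager typically bind their metrics ports to localhost, so scraping can still fail even with the endpoints configured. One way to expose them (verify against your k3s version before relying on it) is via server arguments:
```yaml
# /etc/rancher/k3s/config.yaml on the k3s server node (restart k3s afterwards)
kube-controller-manager-arg:
  - bind-address=0.0.0.0
kube-scheduler-arg:
  - bind-address=0.0.0.0
```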
**Phase to address:** Prometheus installation phase
**Sources:**
- [Prometheus for Rancher K3s Control Plane Monitoring](https://www.spectrocloud.com/blog/enabling-rancher-k3s-cluster-control-plane-monitoring-with-prometheus)
---
### 7. Promtail Not Sending Logs to Loki
**What goes wrong:** Promtail pods are running but no logs appear in Grafana/Loki. Queries return empty results.
**Why it happens:**
- Promtail started before Loki was ready
- Log path configuration doesn't match k3s container runtime paths
- Label selectors don't match actual pod labels
- Network policy blocking Promtail -> Loki communication
**Warning signs:**
- Promtail logs show "dropping target, no labels" or connection errors
- `kubectl logs -n monitoring promtail-xxx` shows retries
- Loki data source health check passes but queries return nothing
**Prevention:**
```yaml
# Verify k3s containerd log paths
promtail:
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              target_label: node
```
- Delete Promtail positions file to force re-read: `kubectl exec -n monitoring promtail-xxx -- rm /tmp/positions.yaml`
- Ensure Loki is healthy before Promtail starts (use init container or sync wave)
- Verify log paths match containerd: `/var/log/pods/*/*/*.log`
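The init-container approach can be sketched as a plain pod-spec fragment; how it is wired into the Promtail chart's values depends on the chart version, and the service name and port below are the Loki defaults:
```yaml
# Sketch: block Promtail startup until Loki answers its readiness probe
initContainers:
  - name: wait-for-loki
    image: busybox:1.36
    command:
      - sh
      - -c
      - until wget -qO- http://loki:3100/ready; do echo waiting for loki; sleep 5; done
```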
**Phase to address:** Loki installation phase
**Sources:**
- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)
---
### 8. ArgoCD Self-Management Bootstrap Chicken-Egg
**What goes wrong:** Attempting to have ArgoCD manage itself creates confusion about what's managing what. Initial mistakes in the ArgoCD Application manifest can lock you out.
**Why it happens:** GitOps can't install ArgoCD if ArgoCD isn't present. After bootstrap, changing ArgoCD's self-managing Application incorrectly can break the cluster.
**Prevention:**
```yaml
# Phase 1: Install ArgoCD manually (kubectl apply or helm)
# Phase 2: Create self-management Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: argocd
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false  # CRITICAL: Don't auto-prune ArgoCD components
      selfHeal: true
```
- Always bootstrap ArgoCD manually first (Helm or kubectl)
- Set `prune: false` for ArgoCD's self-management Application
- Use App of Apps pattern for managed applications
- Keep a local backup of ArgoCD Application manifest
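The App of Apps pattern boils down to one root Application pointing at a directory of Application manifests; the `apps` path below is illustrative:
```yaml
# App of Apps sketch: root Application (directory path is hypothetical)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: apps  # directory containing one Application manifest per app
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true
```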
**Phase to address:** ArgoCD installation phase - plan bootstrap strategy upfront
**Sources:**
- [Bootstrapping ArgoCD - Windsock.io](https://windsock.io/bootstrapping-argocd/)
- [Demystifying GitOps - Bootstrapping ArgoCD](https://medium.com/@aaltundemir/demystifying-gitops-bootstrapping-argo-cd-4a861284f273)
---
### 9. Sync Waves Misuse Creating False Dependencies
**What goes wrong:** Over-engineering sync waves creates unnecessary sequential deployments, increasing deployment time and complexity; under-engineering, on the other hand, leads to race conditions.
**Why it happens:**
- Developers add waves "just in case"
- Misunderstanding that waves are within single Application only
- Not knowing default wave is 0 and waves can be negative
**Prevention:**
```yaml
# Use waves sparingly - only for true dependencies
# Database must exist before app
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # First
---
# App deployment
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"  # Default, after database
# Don't create unnecessary chains like:
# ConfigMap (wave -3) -> Secret (wave -2) -> Service (wave -1) -> Deployment (wave 0)
# These have no real dependency and should all be wave 0
```
- Use waves only for actual dependencies (database before app, CRD before CR)
- Keep wave structure as flat as possible
- Sync waves do NOT work across different ArgoCD Applications
- For cross-Application dependencies, use ApplicationSets with Progressive Syncs
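A Progressive Syncs ordering might be sketched as below; note the feature is gated behind a flag in some ArgoCD versions, and every name, label, and path here is illustrative:
```yaml
# Sketch: RollingSync steps ordering database before app (hypothetical names)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ordered-apps
  namespace: argocd
spec:
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: stage
              operator: In
              values: [database]
        - matchExpressions:
            - key: stage
              operator: In
              values: [app]
  generators:
    - list:
        elements:
          - name: postgres
            stage: database
          - name: taskplanner
            stage: app
  template:
    metadata:
      name: '{{name}}'
      labels:
        stage: '{{stage}}'  # RollingSync matches on these labels
    spec:
      project: default
      source:
        repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
        path: '{{name}}'
        targetRevision: HEAD
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{name}}'
```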
**Phase to address:** Application configuration phase
**Sources:**
- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
---
## Minor Pitfalls
Annoyances that are easily fixed but waste time if not known.
### 10. Grafana Default Password Not Changed
**What goes wrong:** Using default `admin/prom-operator` credentials in production exposes the monitoring stack.
**Prevention:**
```yaml
# kube-prometheus-stack values
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"  # From secret
  # Or use existing secret
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
```
**Phase to address:** Grafana installation phase
---
### 11. Missing open-iscsi for Longhorn
**What goes wrong:** Longhorn volumes fail to attach with cryptic errors.
**Why it happens:** Longhorn requires `open-iscsi` on all nodes, which isn't installed by default on many Linux distributions.
**Prevention:**
```bash
# On each node before Longhorn installation
sudo apt-get install -y open-iscsi
sudo systemctl enable iscsid
sudo systemctl start iscsid
```
**Phase to address:** Pre-installation prerequisites check
**Sources:**
- [Longhorn Prerequisites](https://longhorn.io/docs/latest/deploy/install/#installation-requirements)
---
### 12. ClusterIP Services Not Accessible
**What goes wrong:** After installing monitoring stack, Grafana/Prometheus aren't accessible externally.
**Why it happens:** k3s defaults to ClusterIP for services. Single-node setups need explicit ingress or LoadBalancer configuration.
**Prevention:**
```yaml
# kube-prometheus-stack values
grafana:
  ingress:
    enabled: true
    ingressClassName: traefik
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - grafana.kube2.tricnet.de
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.kube2.tricnet.de
```
**Phase to address:** Installation phase - configure ingress alongside deployment
---
### 13. Traefik v3 Breaking Changes for ArgoCD IngressRoute
**What goes wrong:** ArgoCD IngressRoute with gRPC support stops working after Traefik upgrade to v3.
**Why it happens:** Traefik v3 changed header matcher syntax from `Headers()` to `Header()`.
**Prevention:**
```yaml
# Traefik v2 (OLD - broken in v3)
match: Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`)
# Traefik v3 (NEW)
match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
```
- Check Traefik version before applying IngressRoutes
- Test gRPC route after any Traefik upgrade
**Phase to address:** ArgoCD installation phase
**Sources:**
- [ArgoCD Issue #15534 - Traefik v3 docs](https://github.com/argoproj/argo-cd/issues/15534)
---
### 14. k3s Resource Exhaustion with Full Monitoring Stack
**What goes wrong:** Single-node k3s cluster becomes unresponsive after deploying full kube-prometheus-stack.
**Why it happens:**
- kube-prometheus-stack deploys many components (prometheus, alertmanager, grafana, node-exporter, kube-state-metrics)
- Default resource requests/limits are sized for larger clusters
- k3s server process itself needs ~500MB RAM
**Warning signs:**
- Pods stuck in Pending
- OOMKilled events
- Node NotReady status
**Prevention:**
```yaml
# Minimal kube-prometheus-stack for single-node
alertmanager:
  enabled: false  # Disable if not using alerts
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
grafana:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
```
- Disable unnecessary components (alertmanager if no alerts configured)
- Set explicit resource limits lower than defaults
- Monitor cluster resources: `kubectl top nodes`
- Consider: 4GB RAM minimum for k3s + monitoring + workloads
**Phase to address:** Prometheus installation phase - right-size from start
**Sources:**
- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)
---
## Phase-Specific Warning Summary
| Phase | Likely Pitfall | Mitigation |
|-------|---------------|------------|
| Prerequisites | #11 Missing open-iscsi | Pre-flight check script |
| ArgoCD Installation | #4 TLS redirect loop, #8 Bootstrap | Test ingress immediately, plan bootstrap |
| ArgoCD + Gitea Integration | #1 Webhook parsing | Use Gogs webhook type, accept polling fallback |
| Prometheus Installation | #3 Volume growth, #5 ServiceMonitor, #6 Control plane, #14 Resources | Configure retention+size, verify RBAC, right-size |
| Loki Installation | #2 Disk full, #7 Promtail | Enable retention day one, verify log paths |
| Grafana Installation | #10 Default password, #12 ClusterIP | Set password, configure ingress |
| Application Configuration | #9 Sync waves | Use sparingly, only for real dependencies |
---
## Pre-Installation Checklist
Before starting installation, verify:
- [ ] open-iscsi installed on all nodes
- [ ] Longhorn healthy with available storage (check `kubectl get nodes` and Longhorn UI)
- [ ] Traefik version known (v2 vs v3 affects IngressRoute syntax)
- [ ] DNS entries configured for monitoring subdomains
- [ ] Gitea webhook type decision (use Gogs type, or accept polling fallback)
- [ ] Disk space planning: Loki retention + Prometheus retention + headroom
- [ ] Memory planning: k3s (~500MB) + monitoring (~1GB) + workloads
- [ ] Namespace strategy decided (monitoring namespace vs default)
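The disk-space line item can be sketched as simple arithmetic; all numbers below are illustrative, so substitute your own retentionSize and measured log ingest rate:
```shell
# Back-of-envelope disk budget: Prometheus retentionSize + Loki estimate + 20% headroom
prom_gb=8    # Prometheus retentionSize from the values above
loki_gb=10   # e.g. ~1.4 GiB/day of logs x 7 days retention (measure your own rate)
total=$(( (prom_gb + loki_gb) * 120 / 100 ))
echo "Provision at least ${total}Gi of Longhorn space"  # → Provision at least 21Gi of Longhorn space
```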
---
## Existing Infrastructure Compatibility Notes
Based on the existing TaskPlanner setup:
**Traefik:** Already in use with cert-manager (letsencrypt-prod). New services should follow same pattern:
```yaml
annotations:
  cert-manager.io/cluster-issuer: letsencrypt-prod
```
**Longhorn:** Already the storage class. New PVCs should use explicit `storageClassName: longhorn` and consider replica count for single-node (set to 1).
**Gitea:** Repository already configured at `git.kube2.tricnet.de`. ArgoCD Application already exists in `argocd/application.yaml` - don't duplicate.
**Existing ArgoCD Application:** TaskPlanner is already configured with ArgoCD. The monitoring stack should be a separate Application, not added to the existing one.
---
## Sources Summary
### Official Documentation
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
- [Grafana Loki Retention](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)
- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)
### Community Issues (Verified Problems)
- [ArgoCD #16453 - Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD #20444 - Gitea support](https://github.com/argoproj/argo-cd/issues/20444)
- [Loki #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)
- [Longhorn #2222 - Volume expansion](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)
- [Prometheus Operator #3383 - ServiceMonitor discovery](https://github.com/prometheus-operator/prometheus-operator/issues/3383)
### Tutorials and Guides
- [K3S Rocks - ArgoCD](https://k3s.rocks/argocd/)
- [K3S Rocks - Logging](https://k3s.rocks/logging/)
- [Bootstrapping ArgoCD](https://windsock.io/bootstrapping-argocd/)
- [Prometheus ServiceMonitor Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [Traefik Community - ArgoCD](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)
---
*Pitfalls research for: CI/CD and Observability on k3s*
*Context: Adding to existing TaskPlanner deployment*
*Researched: 2026-02-03*