# Domain Pitfalls: CI/CD and Observability on k3s
**Domain:** Adding ArgoCD, Prometheus, Grafana, and Loki to existing k3s cluster

**Context:** TaskPlanner on self-hosted k3s with Gitea, Traefik, Longhorn

**Researched:** 2026-02-03

**Confidence:** HIGH (verified with official documentation and community issues)
## Critical Pitfalls
Mistakes that cause system instability, data loss, or require significant rework.
### 1. Gitea Webhook JSON Parsing Failure with ArgoCD

**What goes wrong:** ArgoCD receives webhooks from Gitea but fails to parse them with the error `json: cannot unmarshal string into Go struct field .repository.created_at of type int64`. This happens because ArgoCD treats Gitea events as GitHub events instead of Gogs events.

**Why it happens:** Gitea is a fork of Gogs, but ArgoCD's webhook handler expects different field types. The `repository.created_at` field is a string in Gitea/Gogs but ArgoCD expects int64 for the GitHub format.

**Consequences:**

- Webhooks silently fail (ArgoCD logs the error but continues)
- Must wait for the 3-minute polling interval for changes to sync
- False confidence that instant sync is working

**Warning signs:**

- ArgoCD server logs show webhook parsing errors
- Application sync doesn't happen immediately after push
- Webhook delivery shows success in Gitea but no ArgoCD response

**Prevention:**

- Configure the webhook with the `Gogs` type in Gitea, NOT the `Gitea` type
- Test webhook delivery and check ArgoCD server logs: `kubectl logs -n argocd deploy/argocd-server | grep -i webhook`
- Accept 3-minute polling as a fallback (webhooks are an optional enhancement)

**Phase to address:** ArgoCD installation phase - verify webhook integration immediately
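If the webhook should also be authenticated, ArgoCD reads per-provider shared secrets from the `argocd-secret` Secret. A sketch, assuming the `webhook.gogs.secret` key documented by ArgoCD (the value here is a placeholder and must match the secret entered in the Gitea webhook form):

```yaml
# Sketch: shared secret for the Gogs-type webhook (placeholder value)
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  webhook.gogs.secret: replace-with-a-random-string  # must match the Gitea webhook secret field
```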
Sources:
### 2. Loki Disk Full with No Size-Based Retention

**What goes wrong:** Loki fills the entire disk because retention is only time-based, not size-based. When the disk fills, Loki crashes with "no space left on device" and becomes completely non-functional - Grafana cannot even fetch labels.

**Why it happens:**

- Retention is disabled by default (`compactor.retention-enabled: false`)
- Loki only supports time-based retention (e.g., 7 days), not size-based
- High-volume logging can fill the disk before the retention period expires

**Consequences:**

- Complete logging system failure
- May affect other pods sharing the same Longhorn volume
- Recovery requires manual cleanup or volume expansion

**Warning signs:**

- Steadily increasing PVC usage visible in `kubectl get pvc`
- Loki compactor logs show no deletion activity
- Grafana queries become slow before complete failure
**Prevention:**

```yaml
# Loki values.yaml
loki:
  compactor:
    retention_enabled: true
    compaction_interval: 10m
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    working_directory: /loki/compactor
  limits_config:
    retention_period: 168h  # 7 days - adjust based on disk size
```
- Set conservative retention period (start with 7 days)
- Run compactor as StatefulSet with persistent storage for marker files
- Set up Prometheus alert for PVC usage > 80%
- Index period MUST be 24h for retention to work
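The 24h index period requirement lives in `schema_config`. A sketch of a compatible schema block, assuming a single-node filesystem object store (the `from` date, store, and schema version are illustrative assumptions):

```yaml
# Sketch: schema_config with the 24h index period retention requires
loki:
  schema_config:
    configs:
      - from: "2024-01-01"       # any date before your first ingested log
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h            # retention only works with a 24h index period
```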
**Phase to address:** Loki installation phase - configure retention from day one
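The PVC usage alert recommended above can be sketched as a PrometheusRule. The kubelet volume-stats metrics are standard; the PVC name pattern and namespace are assumptions to adjust to your deployment:

```yaml
# Sketch: alert when the Loki PVC passes 80% usage (names assumed)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-pvc-usage
  namespace: monitoring
spec:
  groups:
    - name: loki-storage
      rules:
        - alert: LokiPVCAlmostFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"storage-loki.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"storage-loki.*"} > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Loki PVC over 80% full - retention may not be keeping up"
```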
Sources:
### 3. Prometheus Volume Growth Exceeds Longhorn PVC

**What goes wrong:** Prometheus metrics storage grows beyond PVC capacity. Longhorn volume expansion via CSI can result in a faulted volume that prevents Prometheus from starting.

**Why it happens:**

- Default Prometheus retention is 15 days with no size limit
- kube-prometheus-stack defaults don't match k3s resource constraints
- Longhorn CSI volume expansion has known issues requiring a specific procedure

**Consequences:**

- Prometheus pod stuck in pending/crash loop
- Loss of historical metrics
- Longhorn volume in a faulted state requiring manual recovery

**Warning signs:**

- Prometheus pod restarts with OOMKilled or disk errors
- `kubectl describe pvc` shows capacity approaching the limit
- Longhorn UI shows volume health degraded
**Prevention:**

```yaml
# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: "8GB"  # Set explicit size limit
    resources:
      requests:
        memory: 400Mi
      limits:
        memory: 600Mi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi
```
- Always set both `retention` AND `retentionSize`
- Size the PVC with 20% headroom above `retentionSize`
- Monitor with the `prometheus_tsdb_storage_blocks_bytes` metric
- For expansion: stop the pod, detach the volume, resize, then restart
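The `prometheus_tsdb_storage_blocks_bytes` check can itself be an alert. A sketch, assuming the 8GB `retentionSize` from the values above (the threshold is set just below the cap):

```yaml
# Sketch: PrometheusRule rule fragment - fire before retentionSize is reached
- alert: PrometheusTSDBNearRetentionSize
  expr: prometheus_tsdb_storage_blocks_bytes > 7 * 1024 * 1024 * 1024  # ~7GiB of the 8GB cap
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus TSDB approaching retentionSize"
```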
**Phase to address:** Prometheus installation phase
Sources:
### 4. ArgoCD + Traefik TLS Termination Redirect Loop

**What goes wrong:** The ArgoCD UI becomes inaccessible with redirect loops or connection refused errors when accessed through Traefik. The browser shows ERR_TOO_MANY_REDIRECTS.

**Why it happens:** Traefik terminates TLS and forwards plain HTTP to ArgoCD. The ArgoCD server, configured for TLS by default, responds with 307 redirects to HTTPS, creating an infinite loop.

**Consequences:**

- Cannot access the ArgoCD UI via ingress
- CLI may work with port-forward but not through ingress
- gRPC connections for the CLI through ingress fail

**Warning signs:**

- Browser redirect loop when accessing the ArgoCD URL
- `curl -v` shows 307 redirect responses
- Works with `kubectl port-forward` but not via ingress
**Prevention:**

```yaml
# Option 1: ConfigMap (recommended)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.insecure: "true"
```

```yaml
# Option 2: Traefik IngressRoute for dual HTTP/gRPC
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    - kind: Rule
      match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c
  tls:
    certResolver: letsencrypt-prod
```
- Set `server.insecure: "true"` in the argocd-cmd-params-cm ConfigMap
- Use IngressRoute (not Ingress) for proper gRPC support
- Configure separate routes for HTTP and gRPC with correct priority
**Phase to address:** ArgoCD installation phase - test immediately after ingress setup
Sources:
## Moderate Pitfalls
Mistakes that cause delays, debugging sessions, or technical debt.
### 5. ServiceMonitor Not Discovering Targets

**What goes wrong:** Prometheus ServiceMonitors are created but no targets appear in Prometheus. The scrape config shows 0/0 targets up.

**Why it happens:**

- Label selector mismatch between the Prometheus CR and the ServiceMonitor
- RBAC: the Prometheus ServiceAccount lacks permission in the target namespace
- Port specified as a number instead of a name
- ServiceMonitor in a different namespace than Prometheus expects
**Prevention:**

```yaml
# Ensure the Prometheus CR has permissive selectors
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}           # Select all ServiceMonitors
    serviceMonitorNamespaceSelector: {}  # From all namespaces
```

```yaml
# ServiceMonitor must use the port NAME, not the number
spec:
  endpoints:
    - port: metrics  # NOT 9090
```
- Use the port name, never the port number, in a ServiceMonitor
- Check RBAC: `kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus -n default`
- Verify label matching: `kubectl get servicemonitor -A --show-labels`
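Putting the rules together, a minimal ServiceMonitor might look like the following sketch (the names, labels, and namespaces are illustrative assumptions, not part of the researched setup):

```yaml
# Sketch: minimal ServiceMonitor (names and labels assumed)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: taskplanner        # must match the target Service's labels
  namespaceSelector:
    matchNames:
      - default               # namespace where the Service lives
  endpoints:
    - port: metrics           # the NAME of the Service port, not 9090
      interval: 30s
```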
**Phase to address:** Prometheus installation phase - verify with a test ServiceMonitor
Sources:
### 6. k3s Control Plane Metrics Not Scraped

**What goes wrong:** Prometheus dashboards show no metrics for kube-scheduler, kube-controller-manager, or etcd. These panels appear blank or show "No data."

**Why it happens:** k3s runs control plane components in a single binary, not as pods. Standard kube-prometheus-stack expects to scrape pods that don't exist.
**Prevention:**

```yaml
# kube-prometheus-stack values for k3s
kubeControllerManager:
  enabled: true
  endpoints:
    - 192.168.1.100  # k3s server IP
  service:
    enabled: true
    port: 10257
    targetPort: 10257
kubeScheduler:
  enabled: true
  endpoints:
    - 192.168.1.100
  service:
    enabled: true
    port: 10259
    targetPort: 10259
kubeEtcd:
  enabled: false  # k3s uses embedded sqlite/etcd
```
- Explicitly configure control plane endpoints with k3s server IPs
- Disable etcd monitoring if using embedded database
- OR disable these components entirely for simpler setup
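Even with endpoints configured, k3s binds these metrics ports to localhost by default, so Prometheus cannot reach them from pods. A commonly used companion change, sketched here as a k3s server config file (restart k3s after applying; verify flag names against your k3s version):

```yaml
# Sketch: /etc/rancher/k3s/config.yaml on the k3s server node
# Expose scheduler/controller-manager metrics beyond 127.0.0.1
kube-controller-manager-arg:
  - bind-address=0.0.0.0
kube-scheduler-arg:
  - bind-address=0.0.0.0
```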
**Phase to address:** Prometheus installation phase
Sources:
### 7. Promtail Not Sending Logs to Loki

**What goes wrong:** Promtail pods are running but no logs appear in Grafana/Loki. Queries return empty results.

**Why it happens:**

- Promtail started before Loki was ready
- Log path configuration doesn't match k3s container runtime paths
- Label selectors don't match actual pod labels
- A network policy blocks Promtail -> Loki communication

**Warning signs:**

- Promtail logs show "dropping target, no labels" or connection errors
- `kubectl logs -n monitoring promtail-xxx` shows retries
- The Loki data source health check passes but queries return nothing
Prevention:
# Verify k3s containerd log paths
promtail:
config:
snippets:
scrapeConfigs: |
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
pipeline_stages:
- cri: {}
relabel_configs:
- source_labels: [__meta_kubernetes_pod_node_name]
target_label: node
- Delete the Promtail positions file to force a re-read: `kubectl exec -n monitoring promtail-xxx -- rm /tmp/positions.yaml`
- Ensure Loki is healthy before Promtail starts (use an init container or sync wave)
- Verify log paths match containerd: `/var/log/pods/*/*/*.log`
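It is also worth confirming that Promtail's push target points at the correct Loki service. A sketch of the client block in the Promtail chart values (service name and namespace are assumptions for a `monitoring`-namespace monolithic Loki):

```yaml
# Sketch: Promtail push target (service name/namespace assumed)
promtail:
  config:
    clients:
      - url: http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push
```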
**Phase to address:** Loki installation phase
Sources:
### 8. ArgoCD Self-Management Bootstrap Chicken-Egg

**What goes wrong:** Attempting to have ArgoCD manage itself creates confusion about what's managing what. Initial mistakes in the ArgoCD Application manifest can lock you out.

**Why it happens:** GitOps can't install ArgoCD if ArgoCD isn't present. After bootstrap, changing ArgoCD's self-managing Application incorrectly can break the cluster.
Prevention:
# Phase 1: Install ArgoCD manually (kubectl apply or helm)
# Phase 2: Create self-management Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: argocd
namespace: argocd
spec:
project: default
source:
repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
path: argocd
targetRevision: HEAD
destination:
server: https://kubernetes.default.svc
namespace: argocd
syncPolicy:
automated:
prune: false # CRITICAL: Don't auto-prune ArgoCD components
selfHeal: true
- Always bootstrap ArgoCD manually first (Helm or kubectl)
- Set `prune: false` for ArgoCD's self-management Application
- Use the App of Apps pattern for managed applications
- Keep a local backup of the ArgoCD Application manifest
**Phase to address:** ArgoCD installation phase - plan the bootstrap strategy upfront
Sources:
### 9. Sync Waves Misuse Creating False Dependencies

**What goes wrong:** Over-engineering sync waves creates unnecessary sequential deployments, increasing deployment time and complexity; under-engineering leads to race conditions.

**Why it happens:**

- Developers add waves "just in case"
- Misunderstanding that waves apply within a single Application only
- Not knowing the default wave is 0 and that waves can be negative
**Prevention:**

```yaml
# Use waves sparingly - only for true dependencies

# Database must exist before the app
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # First

# App deployment
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"   # Default, after the database

# Don't create unnecessary chains like:
#   ConfigMap (wave -3) -> Secret (wave -2) -> Service (wave -1) -> Deployment (wave 0)
# These have no real dependency and should all be wave 0
```
- Use waves only for actual dependencies (database before app, CRD before CR)
- Keep wave structure as flat as possible
- Sync waves do NOT work across different ArgoCD Applications
- For cross-Application dependencies, use ApplicationSets with Progressive Syncs
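For the cross-Application case, ApplicationSet Progressive Syncs (an alpha feature) order rollouts by labels on the generated Applications. A rough sketch of the strategy section only (the `stage` label key and values are assumptions; the generators and template are omitted):

```yaml
# Sketch: ApplicationSet Progressive Syncs strategy fragment (alpha feature)
spec:
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:        # step 1: infrastructure Applications first
            - key: stage
              operator: In
              values: [infra]
        - matchExpressions:        # step 2: workloads after infra is healthy
            - key: stage
              operator: In
              values: [apps]
```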
**Phase to address:** Application configuration phase
Sources:
## Minor Pitfalls
Annoyances that are easily fixed but waste time if not known.
### 10. Grafana Default Password Not Changed

**What goes wrong:** Using the default admin/prom-operator credentials in production exposes the monitoring stack.

**Prevention:**

```yaml
# kube-prometheus-stack values
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"  # From secret
  # Or use an existing secret
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
```
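The referenced Secret must exist before Grafana starts. A sketch with placeholder values (create it out-of-band, e.g. via `kubectl` or a sealed-secrets workflow, rather than committing real credentials to Git):

```yaml
# Sketch: the Secret referenced by admin.existingSecret (placeholder values)
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-credentials
  namespace: monitoring
stringData:
  admin-user: admin
  admin-password: change-me  # replace; never commit real credentials
```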
**Phase to address:** Grafana installation phase
### 11. Missing open-iscsi for Longhorn

**What goes wrong:** Longhorn volumes fail to attach with cryptic errors.

**Why it happens:** Longhorn requires open-iscsi on all nodes, which isn't installed by default on many Linux distributions.

**Prevention:**

```bash
# On each node, before Longhorn installation
sudo apt-get install -y open-iscsi
sudo systemctl enable iscsid
sudo systemctl start iscsid
```
**Phase to address:** Pre-installation prerequisites check
Sources:
### 12. ClusterIP Services Not Accessible

**What goes wrong:** After installing the monitoring stack, Grafana/Prometheus aren't accessible externally.

**Why it happens:** k3s defaults to ClusterIP for services. Single-node setups need explicit ingress or LoadBalancer configuration.

**Prevention:**

```yaml
# kube-prometheus-stack values
grafana:
  ingress:
    enabled: true
    ingressClassName: traefik
    hosts:
      - grafana.kube2.tricnet.de
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.kube2.tricnet.de
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
```
**Phase to address:** Installation phase - configure ingress alongside deployment
### 13. Traefik v3 Breaking Changes for ArgoCD IngressRoute

**What goes wrong:** An ArgoCD IngressRoute with gRPC support stops working after a Traefik upgrade to v3.

**Why it happens:** Traefik v3 changed the header matcher syntax from `Headers()` to `Header()`.

**Prevention:**

```yaml
# Traefik v2 (OLD - broken in v3)
match: Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`)

# Traefik v3 (NEW)
match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
```
- Check Traefik version before applying IngressRoutes
- Test gRPC route after any Traefik upgrade
**Phase to address:** ArgoCD installation phase
Sources:
### 14. k3s Resource Exhaustion with Full Monitoring Stack

**What goes wrong:** A single-node k3s cluster becomes unresponsive after deploying the full kube-prometheus-stack.

**Why it happens:**

- kube-prometheus-stack deploys many components (prometheus, alertmanager, grafana, node-exporter, kube-state-metrics)
- Default resource requests/limits are sized for larger clusters
- The k3s server process itself needs ~500MB RAM

**Warning signs:**

- Pods stuck in Pending
- OOMKilled events
- Node NotReady status
**Prevention:**

```yaml
# Minimal kube-prometheus-stack for single-node
alertmanager:
  enabled: false  # Disable if not using alerts
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
grafana:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
```
- Disable unnecessary components (alertmanager if no alerts are configured)
- Set explicit resource limits lower than the defaults
- Monitor cluster resources: `kubectl top nodes`
- Consider 4GB RAM the minimum for k3s + monitoring + workloads
**Phase to address:** Prometheus installation phase - right-size from the start
Sources:
## Phase-Specific Warning Summary
| Phase | Likely Pitfall | Mitigation |
|---|---|---|
| Prerequisites | #11 Missing open-iscsi | Pre-flight check script |
| ArgoCD Installation | #4 TLS redirect loop, #8 Bootstrap | Test ingress immediately, plan bootstrap |
| ArgoCD + Gitea Integration | #1 Webhook parsing | Use Gogs webhook type, accept polling fallback |
| Prometheus Installation | #3 Volume growth, #5 ServiceMonitor, #6 Control plane, #14 Resources | Configure retention+size, verify RBAC, right-size |
| Loki Installation | #2 Disk full, #7 Promtail | Enable retention day one, verify log paths |
| Grafana Installation | #10 Default password, #12 ClusterIP | Set password, configure ingress |
| Application Configuration | #9 Sync waves | Use sparingly, only for real dependencies |
## Pre-Installation Checklist
Before starting installation, verify:
- open-iscsi installed on all nodes
- Longhorn healthy with available storage (check `kubectl get nodes` and the Longhorn UI)
- Traefik version known (v2 vs v3 affects IngressRoute syntax)
- DNS entries configured for monitoring subdomains
- Gitea webhook type decided (use the Gogs type, or accept the polling fallback)
- Disk space planned: Loki retention + Prometheus retention + headroom
- Memory planned: k3s (~500MB) + monitoring (~1GB) + workloads
- Namespace strategy decided (monitoring namespace vs default)
## Existing Infrastructure Compatibility Notes
Based on the existing TaskPlanner setup:
**Traefik:** Already in use with cert-manager (letsencrypt-prod). New services should follow the same pattern:

```yaml
annotations:
  cert-manager.io/cluster-issuer: letsencrypt-prod
```
**Longhorn:** Already the storage class. New PVCs should use an explicit `storageClassName: longhorn` and consider the replica count for single-node (set to 1).
**Gitea:** Repository already configured at git.kube2.tricnet.de. An ArgoCD Application already exists in `argocd/application.yaml` - don't duplicate it.
**Existing ArgoCD Application:** TaskPlanner is already configured with ArgoCD. The monitoring stack should be a separate Application, not added to the existing one.
## Sources Summary
### Official Documentation
- ArgoCD Ingress Configuration
- ArgoCD Sync Phases and Waves
- Grafana Loki Retention
- Grafana Loki Troubleshooting
- K3s Resource Profiling
### Community Issues (Verified Problems)
- ArgoCD #16453 - Gitea webhook parsing
- ArgoCD #20444 - Gitea support
- Loki #5242 - Retention not working
- Longhorn #2222 - Volume expansion
- kube-prometheus-stack #3401 - Resource limits
- Prometheus Operator #3383 - ServiceMonitor discovery
### Tutorials and Guides
- K3S Rocks - ArgoCD
- K3S Rocks - Logging
- Bootstrapping ArgoCD
- Prometheus ServiceMonitor Troubleshooting
- Traefik Community - ArgoCD