# Domain Pitfalls: CI/CD and Observability on k3s

**Domain:** Adding ArgoCD, Prometheus, Grafana, and Loki to an existing k3s cluster
**Context:** TaskPlanner on self-hosted k3s with Gitea, Traefik, Longhorn
**Researched:** 2026-02-03
**Confidence:** HIGH (verified with official documentation and community issues)

---

## Critical Pitfalls

Mistakes that cause system instability, data loss, or require significant rework.

### 1. Gitea Webhook JSON Parsing Failure with ArgoCD

**What goes wrong:** ArgoCD receives webhooks from Gitea but fails to parse them with the error `json: cannot unmarshal string into Go struct field .repository.created_at of type int64`. This happens because ArgoCD treats Gitea events as GitHub events instead of Gogs events.

**Why it happens:** Gitea is a fork of Gogs, but ArgoCD's webhook handler expects different field types. The `repository.created_at` field is a string in Gitea/Gogs, while ArgoCD expects an int64 in the GitHub format.

**Consequences:**

- Webhooks silently fail (ArgoCD logs the error but continues)
- Must wait for the 3-minute polling interval for changes to sync
- False confidence that instant sync is working

**Warning signs:**

- ArgoCD server logs show webhook parsing errors
- Application sync doesn't happen immediately after push
- Webhook delivery shows success in Gitea but no ArgoCD response

**Prevention:**

- Configure the webhook with the `Gogs` type in Gitea, NOT the `Gitea` type
- Test webhook delivery and check ArgoCD server logs: `kubectl logs -n argocd deploy/argocd-server | grep -i webhook`
- Accept 3-minute polling as a fallback (webhooks are an optional enhancement)

**Phase to address:** ArgoCD installation phase - verify webhook integration immediately

**Sources:**

- [ArgoCD Issue #16453 - Forgejo/Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD Issue #20444 - Gitea support lacking](https://github.com/argoproj/argo-cd/issues/20444)

---

### 2. Loki Disk Full with No Size-Based Retention

**What goes wrong:** Loki fills the entire disk because retention is only time-based, not size-based. When the disk fills, Loki crashes with "no space left on device" and becomes completely non-functional - Grafana cannot even fetch labels.

**Why it happens:**

- Retention is disabled by default (`compactor.retention-enabled: false`)
- Loki only supports time-based retention (e.g., 7 days), not size-based
- High-volume logging can fill the disk before the retention period expires

**Consequences:**

- Complete logging system failure
- May affect other pods sharing the same Longhorn volume
- Recovery requires manual cleanup or volume expansion

**Warning signs:**

- Steadily increasing PVC usage visible in `kubectl get pvc`
- Loki compactor logs show no deletion activity
- Grafana queries become slow before complete failure

**Prevention:**

```yaml
# Loki values.yaml
loki:
  compactor:
    retention_enabled: true
    compaction_interval: 10m
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    working_directory: /loki/compactor
  limits_config:
    retention_period: 168h  # 7 days - adjust based on disk size
```

- Set a conservative retention period (start with 7 days)
- Run the compactor as a StatefulSet with persistent storage for marker files
- Set up a Prometheus alert for PVC usage > 80%
- The index period MUST be 24h for retention to work

**Phase to address:** Loki installation phase - configure retention from day one

**Sources:**

- [Grafana Loki Retention Documentation](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Loki Issue #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)

---

### 3. Prometheus Volume Growth Exceeds Longhorn PVC

**What goes wrong:** Prometheus metrics storage grows beyond PVC capacity. Longhorn volume expansion via CSI can result in a faulted volume that prevents Prometheus from starting.
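Both this pitfall and the Loki one above call for alerting before a PVC fills. A minimal PrometheusRule sketch using the standard kubelet volume-stats metrics - the rule name, namespace, and 80% threshold are assumptions to adapt:

```yaml
# Hypothetical PrometheusRule - adjust name/namespace/labels to your stack
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-usage-alerts   # assumed name
  namespace: monitoring    # assumed namespace
spec:
  groups:
    - name: storage
      rules:
        - alert: PVCAlmostFull
          expr: |
            kubelet_volume_stats_used_bytes
              / kubelet_volume_stats_capacity_bytes > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is over 80% full"
```

This covers Loki and Prometheus PVCs alike, since the kubelet exports these stats for every mounted volume claim.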
**Why it happens:**

- Default Prometheus retention is 15 days with no size limit
- kube-prometheus-stack defaults don't match k3s resource constraints
- Longhorn CSI volume expansion has known issues requiring a specific procedure

**Consequences:**

- Prometheus pod stuck in pending/crash loop
- Loss of historical metrics
- Longhorn volume in a faulted state requiring manual recovery

**Warning signs:**

- Prometheus pod restarts with OOMKilled or disk errors
- `kubectl describe pvc` shows capacity approaching the limit
- Longhorn UI shows volume health degraded

**Prevention:**

```yaml
# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: "8GB"  # Set explicit size limit
    resources:
      requests:
        memory: 400Mi
      limits:
        memory: 600Mi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi
```

- Always set both `retention` AND `retentionSize`
- Size the PVC with 20% headroom above `retentionSize`
- Monitor with the `prometheus_tsdb_storage_blocks_bytes` metric
- For expansion: stop the pod, detach the volume, resize, then restart

**Phase to address:** Prometheus installation phase

**Sources:**

- [Longhorn Issue #2222 - Volume expansion faults](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack Issue #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)

---

### 4. ArgoCD + Traefik TLS Termination Redirect Loop

**What goes wrong:** The ArgoCD UI becomes inaccessible with redirect loops or connection refused errors when accessed through Traefik. The browser shows ERR_TOO_MANY_REDIRECTS.

**Why it happens:** Traefik terminates TLS and forwards HTTP to ArgoCD. The ArgoCD server, configured for TLS by default, responds with 307 redirects to HTTPS, creating an infinite loop.
**Consequences:**

- Cannot access the ArgoCD UI via ingress
- CLI may work with port-forward but not through ingress
- gRPC connections for the CLI through ingress fail

**Warning signs:**

- Browser redirect loop when accessing the ArgoCD URL
- `curl -v` shows 307 redirect responses
- Works with `kubectl port-forward` but not via ingress

**Prevention:**

```yaml
# Option 1: ConfigMap (recommended)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.insecure: "true"
---
# Option 2: Traefik IngressRoute for dual HTTP/gRPC
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    - kind: Rule
      match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c
  tls:
    certResolver: letsencrypt-prod
```

- Set `server.insecure: "true"` in the argocd-cmd-params-cm ConfigMap
- Use IngressRoute (not Ingress) for proper gRPC support
- Configure separate routes for HTTP and gRPC with the correct priority

**Phase to address:** ArgoCD installation phase - test immediately after ingress setup

**Sources:**

- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [Traefik Community - ArgoCD behind Traefik](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)

---

## Moderate Pitfalls

Mistakes that cause delays, debugging sessions, or technical debt.

### 5. ServiceMonitor Not Discovering Targets

**What goes wrong:** Prometheus ServiceMonitors are created but no targets appear in Prometheus. The scrape config shows 0/0 targets up.
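For reference, a minimal working pair looks like the following - the Service's labels and port name must line up with what the ServiceMonitor selects (all names here are illustrative, not from the actual TaskPlanner manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: taskplanner        # illustrative name
  namespace: default
  labels:
    app: taskplanner       # <- selected by the ServiceMonitor below
spec:
  selector:
    app: taskplanner
  ports:
    - name: metrics        # <- referenced by NAME, not number
      port: 9090
      targetPort: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: default
spec:
  selector:
    matchLabels:
      app: taskplanner     # must match the Service's labels
  endpoints:
    - port: metrics        # the port NAME from the Service
```

If any one of these three links (label match, port name, namespace visibility) is broken, the target silently never appears.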
**Why it happens:**

- Label selector mismatch between the Prometheus CR and the ServiceMonitor
- RBAC: the Prometheus ServiceAccount lacks permission in the target namespace
- Port specified as a number instead of a name
- ServiceMonitor in a different namespace than Prometheus expects

**Prevention:**

```yaml
# Ensure the Prometheus CR has permissive selectors
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}           # Select all ServiceMonitors
    serviceMonitorNamespaceSelector: {}  # From all namespaces

# ServiceMonitor must use the port NAME, not the number
spec:
  endpoints:
    - port: metrics  # NOT 9090
```

- Use the port name, never the port number, in a ServiceMonitor
- Check RBAC: `kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus -n default`
- Verify label matching: `kubectl get servicemonitor -A --show-labels`

**Phase to address:** Prometheus installation phase - verify with a test ServiceMonitor

**Sources:**

- [Prometheus Operator Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [ServiceMonitor not discovered Issue #3383](https://github.com/prometheus-operator/prometheus-operator/issues/3383)

---

### 6. k3s Control Plane Metrics Not Scraped

**What goes wrong:** Prometheus dashboards show no metrics for kube-scheduler, kube-controller-manager, or etcd. These panels appear blank or show "No data."

**Why it happens:** k3s runs control plane components as a single binary, not as pods. The standard kube-prometheus-stack expects to scrape pods that don't exist.
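If the control-plane dashboards aren't worth wiring up on a single-node cluster, the simpler option is to switch these scrape targets off entirely. A hedged values fragment - whether you also want `kubeProxy` disabled depends on your setup and is an assumption here:

```yaml
# kube-prometheus-stack values - skip control-plane scraping on k3s
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeEtcd:
  enabled: false   # k3s typically uses the embedded datastore anyway
kubeProxy:
  enabled: false   # assumption: embedded kube-proxy metrics not exposed
```

This trades blank dashboard panels for a quieter target list; the full endpoint-based wiring below is the alternative if you want those metrics.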
**Prevention:**

```yaml
# kube-prometheus-stack values for k3s
kubeControllerManager:
  enabled: true
  endpoints:
    - 192.168.1.100  # k3s server IP
  service:
    enabled: true
    port: 10257
    targetPort: 10257
kubeScheduler:
  enabled: true
  endpoints:
    - 192.168.1.100
  service:
    enabled: true
    port: 10259
    targetPort: 10259
kubeEtcd:
  enabled: false  # k3s uses embedded sqlite/etcd
```

- Explicitly configure control plane endpoints with the k3s server IPs
- Disable etcd monitoring if using the embedded database
- OR disable these components entirely for a simpler setup

**Phase to address:** Prometheus installation phase

**Sources:**

- [Prometheus for Rancher K3s Control Plane Monitoring](https://www.spectrocloud.com/blog/enabling-rancher-k3s-cluster-control-plane-monitoring-with-prometheus)

---

### 7. Promtail Not Sending Logs to Loki

**What goes wrong:** Promtail pods are running but no logs appear in Grafana/Loki. Queries return empty results.

**Why it happens:**

- Promtail started before Loki was ready
- Log path configuration doesn't match k3s container runtime paths
- Label selectors don't match actual pod labels
- Network policy blocking Promtail -> Loki communication

**Warning signs:**

- Promtail logs show "dropping target, no labels" or connection errors
- `kubectl logs -n monitoring promtail-xxx` shows retries
- The Loki data source health check passes but queries return nothing

**Prevention:**

```yaml
# Verify k3s containerd log paths
promtail:
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              target_label: node
```

- Delete the Promtail positions file to force a re-read: `kubectl exec -n monitoring promtail-xxx -- rm /tmp/positions.yaml`
- Ensure Loki is healthy before Promtail starts (use an init container or sync wave)
- Verify log paths match containerd: `/var/log/pods/*/*/*.log`

**Phase to address:** Loki installation phase

**Sources:**

- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)

---

### 8. ArgoCD Self-Management Bootstrap Chicken-Egg

**What goes wrong:** Attempting to have ArgoCD manage itself creates confusion about what's managing what. Initial mistakes in the ArgoCD Application manifest can lock you out.

**Why it happens:** GitOps can't install ArgoCD if ArgoCD isn't present. After bootstrap, changing ArgoCD's self-managing Application incorrectly can break the cluster.

**Prevention:**

```yaml
# Phase 1: Install ArgoCD manually (kubectl apply or helm)
# Phase 2: Create the self-management Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: argocd
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false  # CRITICAL: Don't auto-prune ArgoCD components
      selfHeal: true
```

- Always bootstrap ArgoCD manually first (Helm or kubectl)
- Set `prune: false` for ArgoCD's self-management Application
- Use the App of Apps pattern for managed applications
- Keep a local backup of the ArgoCD Application manifest

**Phase to address:** ArgoCD installation phase - plan the bootstrap strategy upfront

**Sources:**

- [Bootstrapping ArgoCD - Windsock.io](https://windsock.io/bootstrapping-argocd/)
- [Demystifying GitOps - Bootstrapping ArgoCD](https://medium.com/@aaltundemir/demystifying-gitops-bootstrapping-argo-cd-4a861284f273)

---

### 9. Sync Waves Misuse Creating False Dependencies

**What goes wrong:** Over-engineering sync waves creates unnecessary sequential deployments, increasing deployment time and complexity. Or under-engineering leads to race conditions.
**Why it happens:**

- Developers add waves "just in case"
- Misunderstanding that waves apply within a single Application only
- Not knowing that the default wave is 0 and waves can be negative

**Prevention:**

```yaml
# Use waves sparingly - only for true dependencies

# Database must exist before the app
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # First

# App deployment
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"  # Default, after the database

# Don't create unnecessary chains like:
#   ConfigMap (wave -3) -> Secret (wave -2) -> Service (wave -1) -> Deployment (wave 0)
# These have no real dependency and should all be wave 0
```

- Use waves only for actual dependencies (database before app, CRD before CR)
- Keep the wave structure as flat as possible
- Sync waves do NOT work across different ArgoCD Applications
- For cross-Application dependencies, use ApplicationSets with Progressive Syncs

**Phase to address:** Application configuration phase

**Sources:**

- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)

---

## Minor Pitfalls

Annoyances that are easily fixed but waste time if not known.

### 10. Grafana Default Password Not Changed

**What goes wrong:** Using the default `admin/prom-operator` credentials in production exposes the monitoring stack.

**Prevention:**

```yaml
# kube-prometheus-stack values
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"  # From secret
  # Or use an existing secret
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
```

**Phase to address:** Grafana installation phase

---

### 11. Missing open-iscsi for Longhorn

**What goes wrong:** Longhorn volumes fail to attach with cryptic errors.

**Why it happens:** Longhorn requires `open-iscsi` on all nodes, which isn't installed by default on many Linux distributions.
**Prevention:**

```bash
# On each node before Longhorn installation
sudo apt-get install -y open-iscsi
sudo systemctl enable iscsid
sudo systemctl start iscsid
```

**Phase to address:** Pre-installation prerequisites check

**Sources:**

- [Longhorn Prerequisites](https://longhorn.io/docs/latest/deploy/install/#installation-requirements)

---

### 12. ClusterIP Services Not Accessible

**What goes wrong:** After installing the monitoring stack, Grafana/Prometheus aren't accessible externally.

**Why it happens:** k3s defaults to ClusterIP for services. Single-node setups need an explicit ingress or LoadBalancer configuration.

**Prevention:**

```yaml
# kube-prometheus-stack values
grafana:
  ingress:
    enabled: true
    ingressClassName: traefik
    hosts:
      - grafana.kube2.tricnet.de
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.kube2.tricnet.de
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
```

**Phase to address:** Installation phase - configure ingress alongside deployment

---

### 13. Traefik v3 Breaking Changes for ArgoCD IngressRoute

**What goes wrong:** An ArgoCD IngressRoute with gRPC support stops working after a Traefik upgrade to v3.

**Why it happens:** Traefik v3 changed the header matcher syntax from `Headers()` to `Header()`.

**Prevention:**

```yaml
# Traefik v2 (OLD - broken in v3)
match: Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`)

# Traefik v3 (NEW)
match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
```

- Check the Traefik version before applying IngressRoutes
- Test the gRPC route after any Traefik upgrade

**Phase to address:** ArgoCD installation phase

**Sources:**

- [ArgoCD Issue #15534 - Traefik v3 docs](https://github.com/argoproj/argo-cd/issues/15534)

---

### 14. k3s Resource Exhaustion with Full Monitoring Stack

**What goes wrong:** A single-node k3s cluster becomes unresponsive after deploying the full kube-prometheus-stack.
**Why it happens:**

- kube-prometheus-stack deploys many components (prometheus, alertmanager, grafana, node-exporter, kube-state-metrics)
- Default resource requests/limits are sized for larger clusters
- The k3s server process itself needs ~500MB RAM

**Warning signs:**

- Pods stuck in Pending
- OOMKilled events
- Node NotReady status

**Prevention:**

```yaml
# Minimal kube-prometheus-stack for single-node
alertmanager:
  enabled: false  # Disable if not using alerts
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
grafana:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
```

- Disable unnecessary components (alertmanager if no alerts are configured)
- Set explicit resource limits lower than the defaults
- Monitor cluster resources: `kubectl top nodes`
- Consider: 4GB RAM minimum for k3s + monitoring + workloads

**Phase to address:** Prometheus installation phase - right-size from the start

**Sources:**

- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)

---

## Phase-Specific Warning Summary

| Phase | Likely Pitfall | Mitigation |
|-------|---------------|------------|
| Prerequisites | #11 Missing open-iscsi | Pre-flight check script |
| ArgoCD Installation | #4 TLS redirect loop, #8 Bootstrap | Test ingress immediately, plan bootstrap |
| ArgoCD + Gitea Integration | #1 Webhook parsing | Use Gogs webhook type, accept polling fallback |
| Prometheus Installation | #3 Volume growth, #5 ServiceMonitor, #6 Control plane, #14 Resources | Configure retention+size, verify RBAC, right-size |
| Loki Installation | #2 Disk full, #7 Promtail | Enable retention day one, verify log paths |
| Grafana Installation | #10 Default password, #12 ClusterIP | Set password, configure ingress |
| Application Configuration | #9 Sync waves | Use sparingly, only for real dependencies |

---

## Pre-Installation Checklist

Before starting installation, verify:

- [ ] open-iscsi installed on all nodes
- [ ] Longhorn healthy with available storage (check `kubectl get nodes` and the Longhorn UI)
- [ ] Traefik version known (v2 vs v3 affects IngressRoute syntax)
- [ ] DNS entries configured for monitoring subdomains
- [ ] Gitea webhook type decision (use the Gogs type, or accept the polling fallback)
- [ ] Disk space planning: Loki retention + Prometheus retention + headroom
- [ ] Memory planning: k3s (~500MB) + monitoring (~1GB) + workloads
- [ ] Namespace strategy decided (monitoring namespace vs default)

---

## Existing Infrastructure Compatibility Notes

Based on the existing TaskPlanner setup:

**Traefik:** Already in use with cert-manager (letsencrypt-prod). New services should follow the same pattern:

```yaml
annotations:
  cert-manager.io/cluster-issuer: letsencrypt-prod
```

**Longhorn:** Already the storage class. New PVCs should use an explicit `storageClassName: longhorn` and consider the replica count for single-node (set to 1).

**Gitea:** Repository already configured at `git.kube2.tricnet.de`. An ArgoCD Application already exists in `argocd/application.yaml` - don't duplicate it.

**Existing ArgoCD Application:** TaskPlanner is already configured with ArgoCD. The monitoring stack should be a separate Application, not added to the existing one.
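The separate-Application rule can be sketched as follows. The repoURL reuses the existing Gitea host from this setup, while the Application name, path, and target namespace are assumptions to adapt:

```yaml
# Hypothetical Application for the monitoring stack - kept separate from TaskPlanner
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring              # assumed name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: monitoring            # assumed path in the repo
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring       # assumed namespace
  syncPolicy:
    automated:
      prune: true               # safe here, unlike ArgoCD's self-management app
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Keeping monitoring in its own Application means a bad sync there cannot prune or degrade the TaskPlanner deployment.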
---

## Sources Summary

### Official Documentation

- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
- [Grafana Loki Retention](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)
- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)

### Community Issues (Verified Problems)

- [ArgoCD #16453 - Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD #20444 - Gitea support](https://github.com/argoproj/argo-cd/issues/20444)
- [Loki #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)
- [Longhorn #2222 - Volume expansion](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)
- [Prometheus Operator #3383 - ServiceMonitor discovery](https://github.com/prometheus-operator/prometheus-operator/issues/3383)

### Tutorials and Guides

- [K3S Rocks - ArgoCD](https://k3s.rocks/argocd/)
- [K3S Rocks - Logging](https://k3s.rocks/logging/)
- [Bootstrapping ArgoCD](https://windsock.io/bootstrapping-argocd/)
- [Prometheus ServiceMonitor Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [Traefik Community - ArgoCD](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)

---

*Pitfalls research for: CI/CD and Observability on k3s*
*Context: Adding to existing TaskPlanner deployment*
*Researched: 2026-02-03*