taskplaner/.planning/research/PITFALLS-CICD-OBSERVABILITY.md
Thomas Richter 5dbabe6a2d docs: complete v2.0 CI/CD and observability research
Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfalls: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 03:29:23 +01:00


Domain Pitfalls: CI/CD and Observability on k3s

Domain: Adding ArgoCD, Prometheus, Grafana, and Loki to existing k3s cluster
Context: TaskPlanner on self-hosted k3s with Gitea, Traefik, Longhorn
Researched: 2026-02-03
Confidence: HIGH (verified with official documentation and community issues)


Critical Pitfalls

Mistakes that cause system instability, data loss, or require significant rework.

1. Gitea Webhook JSON Parsing Failure with ArgoCD

What goes wrong: ArgoCD receives webhooks from Gitea but fails to parse them with error: json: cannot unmarshal string into Go struct field .repository.created_at of type int64. This happens because ArgoCD treats Gitea events as GitHub events instead of Gogs events.

Why it happens: Gitea is a fork of Gogs, so its payload field types follow the Gogs format. When ArgoCD parses the event with its GitHub handler, the repository.created_at field (a string in Gitea/Gogs) fails to unmarshal into the int64 the GitHub schema expects.

Consequences:

  • Webhooks fail quietly (ArgoCD logs the error but continues)
  • Must wait for 3-minute polling interval for changes to sync
  • False confidence that instant sync is working

Warning signs:

  • ArgoCD server logs show webhook parsing errors
  • Application sync doesn't happen immediately after push
  • Webhook delivery shows success in Gitea but no ArgoCD response

Prevention:

  • Configure webhook with Gogs type in Gitea, NOT Gitea type
  • Test webhook delivery and check ArgoCD server logs: kubectl logs -n argocd deploy/argocd-server | grep -i webhook
  • Accept 3-minute polling as fallback (webhooks are optional enhancement)
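The mismatch is easy to see locally: Gitea serializes repository.created_at as an RFC 3339 string, while the GitHub schema ArgoCD applies declares it int64. A minimal sketch (payload trimmed, values illustrative):

```shell
# Trimmed-down Gitea push payload; real deliveries carry many more fields.
payload='{"repository":{"full_name":"tho/infrastructure","created_at":"2026-02-03T03:29:23+01:00"}}'

# A quoted created_at is exactly what the "cannot unmarshal string into
# ... int64" error in the ArgoCD server logs is complaining about.
if printf '%s' "$payload" | grep -q '"created_at":"'; then
  echo "created_at is a string: GitHub-format parsing fails, deliver as Gogs type"
fi
```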

Phase to address: ArgoCD installation phase - verify webhook integration immediately

Sources:


2. Loki Disk Full with No Size-Based Retention

What goes wrong: Loki fills the entire disk because retention is only time-based, not size-based. When disk fills, Loki crashes with "no space left on device" and becomes completely non-functional - Grafana cannot even fetch labels.

Why it happens:

  • Retention is disabled by default (compactor.retention-enabled: false)
  • Loki only supports time-based retention (e.g., 7 days), not size-based
  • High-volume logging can fill disk before retention period expires

Consequences:

  • Complete logging system failure
  • May affect other pods sharing the same Longhorn volume
  • Recovery requires manual cleanup or volume expansion

Warning signs:

  • Steadily increasing PVC usage visible in kubectl get pvc
  • Loki compactor logs show no deletion activity
  • Grafana queries become slow before complete failure

Prevention:

# Loki values.yaml
loki:
  compactor:
    retention_enabled: true
    compaction_interval: 10m
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    working_directory: /loki/compactor
  limits_config:
    retention_period: 168h  # 7 days - adjust based on disk size

  • Set a conservative retention period (start with 7 days)
  • Run compactor as StatefulSet with persistent storage for marker files
  • Set up Prometheus alert for PVC usage > 80%
  • Index period MUST be 24h for retention to work
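Whether a given retention_period can ever fit the PVC is simple arithmetic; a sketch with illustrative numbers (measure the real ingest rate first, e.g. via Loki's loki_distributor_bytes_received_total metric):

```shell
# Back-of-envelope check: will 7 days of logs fit the Loki PVC?
ingest_mb_per_day=300          # observed log volume, MiB/day (illustrative)
retention_days=7
pvc_size_mb=$((10 * 1024))     # 10Gi Longhorn PVC

needed_mb=$((ingest_mb_per_day * retention_days))
# Keep ~30% headroom for index, WAL, and compactor working space.
budget_mb=$((pvc_size_mb * 70 / 100))

if [ "$needed_mb" -gt "$budget_mb" ]; then
  echo "retention too long for PVC: need ${needed_mb}MiB, budget ${budget_mb}MiB"
else
  echo "fits: ${needed_mb}MiB of ${budget_mb}MiB budget"
fi
```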

Phase to address: Loki installation phase - configure retention from day one

Sources:


3. Prometheus Volume Growth Exceeds Longhorn PVC

What goes wrong: Prometheus metrics storage grows beyond PVC capacity. Longhorn volume expansion via CSI can result in a faulted volume that prevents Prometheus from starting.

Why it happens:

  • Default Prometheus retention is 15 days with no size limit
  • kube-prometheus-stack defaults don't match k3s resource constraints
  • Longhorn CSI volume expansion has known issues requiring specific procedure

Consequences:

  • Prometheus pod stuck in pending/crash loop
  • Loss of historical metrics
  • Longhorn volume in faulted state requiring manual recovery

Warning signs:

  • Prometheus pod restarts with OOMKilled or disk errors
  • kubectl describe pvc shows capacity approaching limit
  • Longhorn UI shows volume health degraded

Prevention:

# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: "8GB"  # Set explicit size limit
    resources:
      requests:
        memory: 400Mi
      limits:
        memory: 600Mi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi

  • Always set both retention AND retentionSize
  • Size PVC with 20% headroom above retentionSize
  • Monitor with prometheus_tsdb_storage_blocks_bytes metric
  • For expansion: stop pod, detach volume, resize, then restart
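The 20% headroom rule as a quick check (numbers match the values snippet above; adjust to your sizes):

```shell
# The PVC must exceed retentionSize so the TSDB can write new blocks while
# old ones are still being deleted; 20% headroom is the rule of thumb here.
retention_size_gb=8
pvc_gb=10

min_pvc_gb=$((retention_size_gb * 120 / 100))
if [ "$pvc_gb" -ge "$min_pvc_gb" ]; then
  echo "OK: ${pvc_gb}Gi PVC covers ${retention_size_gb}GB retentionSize plus headroom"
else
  echo "UNDERSIZED: need at least ${min_pvc_gb}Gi"
fi
```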

Phase to address: Prometheus installation phase

Sources:


4. ArgoCD + Traefik TLS Termination Redirect Loop

What goes wrong: ArgoCD UI becomes inaccessible with redirect loops or connection refused errors when accessed through Traefik. Browser shows ERR_TOO_MANY_REDIRECTS.

Why it happens: Traefik terminates TLS and forwards plain HTTP to ArgoCD. The ArgoCD server, configured for TLS by default, responds with a 307 redirect to HTTPS, creating an infinite loop.

Consequences:

  • Cannot access ArgoCD UI via ingress
  • CLI may work with port-forward but not through ingress
  • gRPC connections for CLI through ingress fail

Warning signs:

  • Browser redirect loop when accessing ArgoCD URL
  • curl -v shows 307 redirect responses
  • Works with kubectl port-forward but not via ingress

Prevention:

# Option 1: ConfigMap (recommended)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.insecure: "true"

# Option 2: Traefik IngressRoute for dual HTTP/gRPC
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    - kind: Rule
      match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c
  tls:
    certResolver: letsencrypt-prod

  • Set server.insecure: "true" in argocd-cmd-params-cm ConfigMap
  • Use IngressRoute (not Ingress) for proper gRPC support
  • Configure separate routes for HTTP and gRPC with correct priority
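If ArgoCD is installed through the argo-cd Helm chart, the same setting can live in Git alongside everything else; a sketch assuming the configs.params layout of recent chart versions (verify against the chart version you pin):

```yaml
# argo-cd chart values.yaml fragment; renders into argocd-cmd-params-cm
configs:
  params:
    server.insecure: true
```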

Phase to address: ArgoCD installation phase - test immediately after ingress setup

Sources:


Moderate Pitfalls

Mistakes that cause delays, debugging sessions, or technical debt.

5. ServiceMonitor Not Discovering Targets

What goes wrong: Prometheus ServiceMonitors are created but no targets appear in Prometheus. The scrape config shows 0/0 targets up.

Why it happens:

  • Label selector mismatch between Prometheus CR and ServiceMonitor
  • RBAC: Prometheus ServiceAccount lacks permission in target namespace
  • Port specified as number instead of name
  • ServiceMonitor in different namespace than Prometheus expects

Prevention:

# Ensure Prometheus CR has permissive selectors
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}  # Select all ServiceMonitors
    serviceMonitorNamespaceSelector: {}  # From all namespaces

# ServiceMonitor must use port NAME not number
spec:
  endpoints:
    - port: metrics  # NOT 9090

  • Use port name, never port number in ServiceMonitor
  • Check RBAC: kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus -n default
  • Verify label matching: kubectl get servicemonitor -A --show-labels
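A matching pair for a hypothetical taskplanner app, showing the two couplings that must line up (the label selector and the port name):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: taskplanner
  labels:
    app: taskplanner          # ServiceMonitor selects on this label
spec:
  selector:
    app: taskplanner
  ports:
    - name: metrics           # the NAME the ServiceMonitor references
      port: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
spec:
  selector:
    matchLabels:
      app: taskplanner        # must match the Service's labels
  endpoints:
    - port: metrics           # port name, never the number 9090
```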

Phase to address: Prometheus installation phase, verify with test ServiceMonitor

Sources:


6. k3s Control Plane Metrics Not Scraped

What goes wrong: Prometheus dashboards show no metrics for kube-scheduler, kube-controller-manager, or etcd. These panels appear blank or show "No data."

Why it happens: k3s runs control plane components as a single binary, not as pods. Standard kube-prometheus-stack expects to scrape pods that don't exist.

Prevention:

# kube-prometheus-stack values for k3s
kubeControllerManager:
  enabled: true
  endpoints:
    - 192.168.1.100  # k3s server IP
  service:
    enabled: true
    port: 10257
    targetPort: 10257
kubeScheduler:
  enabled: true
  endpoints:
    - 192.168.1.100
  service:
    enabled: true
    port: 10259
    targetPort: 10259
kubeEtcd:
  enabled: false  # k3s uses embedded sqlite/etcd

  • Explicitly configure control plane endpoints with k3s server IPs
  • Disable etcd monitoring if using embedded database
  • OR disable these components entirely for simpler setup
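The "disable entirely" option sketched as values (kube-proxy is included because k3s embeds it the same way; key names can shift between chart versions, so check the pinned chart's values.yaml):

```yaml
# Accept blank control-plane panels in exchange for zero failing targets
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false
```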

Phase to address: Prometheus installation phase

Sources:


7. Promtail Not Sending Logs to Loki

What goes wrong: Promtail pods are running but no logs appear in Grafana/Loki. Queries return empty results.

Why it happens:

  • Promtail started before Loki was ready
  • Log path configuration doesn't match k3s container runtime paths
  • Label selectors don't match actual pod labels
  • Network policy blocking Promtail -> Loki communication

Warning signs:

  • Promtail logs show "dropping target, no labels" or connection errors
  • kubectl logs -n monitoring promtail-xxx shows retries
  • Loki data source health check passes but queries return nothing

Prevention:

# Verify k3s containerd log paths
promtail:
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              target_label: node

  • Delete Promtail positions file to force re-read: kubectl exec -n monitoring promtail-xxx -- rm /tmp/positions.yaml
  • Ensure Loki is healthy before Promtail starts (use init container or sync wave)
  • Verify log paths match containerd: /var/log/pods/*/*/*.log
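The cri: {} stage exists because containerd log lines are not raw JSON: each line carries a timestamp, stream, a flags field, and the message. A shell sketch of the same split, useful for eyeballing a line pulled from /var/log/pods:

```shell
# Illustrative CRI-format line as written by k3s/containerd
line='2026-02-03T03:29:23.123456789+01:00 stdout F starting taskplanner'

# Same four-way split the cri pipeline stage performs
read -r ts stream flags content <<EOF
$line
EOF
echo "stream=$stream flags=$flags content=$content"
```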

Phase to address: Loki installation phase

Sources:


8. ArgoCD Self-Management Bootstrap Chicken-Egg

What goes wrong: Attempting to have ArgoCD manage itself creates confusion about what's managing what. Initial mistakes in the ArgoCD Application manifest can lock you out.

Why it happens: GitOps can't install ArgoCD if ArgoCD isn't present. After bootstrap, changing ArgoCD's self-managing Application incorrectly can break the cluster.

Prevention:

# Phase 1: Install ArgoCD manually (kubectl apply or helm)
# Phase 2: Create self-management Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: argocd
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false  # CRITICAL: Don't auto-prune ArgoCD components
      selfHeal: true

  • Always bootstrap ArgoCD manually first (Helm or kubectl)
  • Set prune: false for ArgoCD's self-management Application
  • Use App of Apps pattern for managed applications
  • Keep a local backup of ArgoCD Application manifest
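With the bootstrap done, the App of Apps pattern keeps every further addition in Git. A sketch of a hypothetical root Application (the `apps` path is an assumption; it would hold one Application manifest per managed app, including the monitoring stack):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: apps          # hypothetical: one Application manifest per app
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false      # stay conservative at the root as well
      selfHeal: true
```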

Phase to address: ArgoCD installation phase - plan bootstrap strategy upfront

Sources:


9. Sync Waves Misuse Creating False Dependencies

What goes wrong: Over-engineering sync waves creates unnecessary sequential deployments, increasing deployment time and complexity. Or under-engineering leads to race conditions.

Why it happens:

  • Developers add waves "just in case"
  • Misunderstanding that waves are within single Application only
  • Not knowing default wave is 0 and waves can be negative

Prevention:

# Use waves sparingly - only for true dependencies
# Database must exist before app
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # First

# App deployment
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"   # Default, after database

# Don't create unnecessary chains like:
# ConfigMap (wave -3) -> Secret (wave -2) -> Service (wave -1) -> Deployment (wave 0)
# These have no real dependency and should all be wave 0

  • Use waves only for actual dependencies (database before app, CRD before CR)
  • Keep wave structure as flat as possible
  • Sync waves do NOT work across different ArgoCD Applications
  • For cross-Application dependencies, use ApplicationSets with Progressive Syncs

Phase to address: Application configuration phase

Sources:


Minor Pitfalls

Annoyances that are easily fixed but waste time if not known.

10. Grafana Default Password Not Changed

What goes wrong: Using default admin/prom-operator credentials in production exposes the monitoring stack.

Prevention:

# kube-prometheus-stack values
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"  # NOTE: Helm does not expand env vars; inject via --set or a values overlay
  # Better: reference an existing secret
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password

Phase to address: Grafana installation phase


11. Missing open-iscsi for Longhorn

What goes wrong: Longhorn volumes fail to attach with cryptic errors.

Why it happens: Longhorn requires open-iscsi on all nodes, which isn't installed by default on many Linux distributions.

Prevention:

# On each node before Longhorn installation
sudo apt-get install -y open-iscsi
sudo systemctl enable iscsid
sudo systemctl start iscsid

Phase to address: Pre-installation prerequisites check

Sources:


12. ClusterIP Services Not Accessible

What goes wrong: After installing monitoring stack, Grafana/Prometheus aren't accessible externally.

Why it happens: k3s defaults to ClusterIP for services. Single-node setups need explicit ingress or LoadBalancer configuration.

Prevention:

# kube-prometheus-stack values
grafana:
  ingress:
    enabled: true
    ingressClassName: traefik
    hosts:
      - grafana.kube2.tricnet.de
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.kube2.tricnet.de
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod

Phase to address: Installation phase - configure ingress alongside deployment


13. Traefik v3 Breaking Changes for ArgoCD IngressRoute

What goes wrong: ArgoCD IngressRoute with gRPC support stops working after Traefik upgrade to v3.

Why it happens: Traefik v3 changed header matcher syntax from Headers() to Header().

Prevention:

# Traefik v2 (OLD - broken in v3)
match: Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`)

# Traefik v3 (NEW)
match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)

  • Check Traefik version before applying IngressRoutes
  • Test gRPC route after any Traefik upgrade

Phase to address: ArgoCD installation phase

Sources:


14. k3s Resource Exhaustion with Full Monitoring Stack

What goes wrong: Single-node k3s cluster becomes unresponsive after deploying full kube-prometheus-stack.

Why it happens:

  • kube-prometheus-stack deploys many components (prometheus, alertmanager, grafana, node-exporter, kube-state-metrics)
  • Default resource requests/limits are sized for larger clusters
  • k3s server process itself needs ~500MB RAM

Warning signs:

  • Pods stuck in Pending
  • OOMKilled events
  • Node NotReady status

Prevention:

# Minimal kube-prometheus-stack for single-node
alertmanager:
  enabled: false  # Disable if not using alerts
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
grafana:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi

  • Disable unnecessary components (alertmanager if no alerts configured)
  • Set explicit resource limits lower than defaults
  • Monitor cluster resources: kubectl top nodes
  • Consider: 4GB RAM minimum for k3s + monitoring + workloads
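A rough budget for the 4GB guideline (all numbers illustrative; replace with measured values from kubectl top pods -A):

```shell
# Memory budget for a 4GB single-node k3s box (illustrative request sizes)
k3s_mb=500; prometheus_mb=512; grafana_mb=256; loki_mb=512
promtail_mb=128; kube_state_metrics_mb=64; node_exporter_mb=32
total_mb=4096

stack_mb=$((k3s_mb + prometheus_mb + grafana_mb + loki_mb + promtail_mb + kube_state_metrics_mb + node_exporter_mb))
workload_mb=$((total_mb - stack_mb))
echo "monitoring+k3s: ${stack_mb}MiB, left for workloads: ${workload_mb}MiB"
```

With these numbers roughly half the box goes to k3s plus monitoring, which is why right-sizing the stack before deploying matters.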

Phase to address: Prometheus installation phase - right-size from start

Sources:


Phase-Specific Warning Summary

| Phase | Likely Pitfall | Mitigation |
| --- | --- | --- |
| Prerequisites | #11 Missing open-iscsi | Pre-flight check script |
| ArgoCD Installation | #4 TLS redirect loop, #8 Bootstrap | Test ingress immediately, plan bootstrap |
| ArgoCD + Gitea Integration | #1 Webhook parsing | Use Gogs webhook type, accept polling fallback |
| Prometheus Installation | #3 Volume growth, #5 ServiceMonitor, #6 Control plane, #14 Resources | Configure retention+size, verify RBAC, right-size |
| Loki Installation | #2 Disk full, #7 Promtail | Enable retention day one, verify log paths |
| Grafana Installation | #10 Default password, #12 ClusterIP | Set password, configure ingress |
| Application Configuration | #9 Sync waves | Use sparingly, only for real dependencies |

Pre-Installation Checklist

Before starting installation, verify:

  • open-iscsi installed on all nodes
  • Longhorn healthy with available storage (check kubectl get nodes and Longhorn UI)
  • Traefik version known (v2 vs v3 affects IngressRoute syntax)
  • DNS entries configured for monitoring subdomains
  • Gitea webhook type decision (use Gogs type, or accept polling fallback)
  • Disk space planning: Loki retention + Prometheus retention + headroom
  • Memory planning: k3s (~500MB) + monitoring (~1GB) + workloads
  • Namespace strategy decided (monitoring namespace vs default)
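The checklist translates into a small pre-flight script; a sketch where the DNS names come from this document and every check reports PASS/FAIL instead of aborting:

```shell
# Run before installation; prints one PASS/FAIL line per check.
check() {
  label=$1; shift
  if "$@" >/dev/null 2>&1; then echo "PASS: $label"; else echo "FAIL: $label"; fi
}
check "open-iscsi installed (iscsiadm)" command -v iscsiadm
check "kubectl available"               command -v kubectl
check "helm available"                  command -v helm
check "DNS: grafana.kube2.tricnet.de"   getent hosts grafana.kube2.tricnet.de
check "DNS: argocd.kube2.tricnet.de"    getent hosts argocd.kube2.tricnet.de
```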

Existing Infrastructure Compatibility Notes

Based on the existing TaskPlanner setup:

Traefik: Already in use with cert-manager (letsencrypt-prod). New services should follow same pattern:

annotations:
  cert-manager.io/cluster-issuer: letsencrypt-prod

Longhorn: Already the storage class. New PVCs should use explicit storageClassName: longhorn and consider replica count for single-node (set to 1).
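For single-node use, a dedicated single-replica class avoids putting replica settings on every PVC individually; a sketch (the class name is invented; numberOfReplicas and staleReplicaTimeout are standard Longhorn parameters):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single       # hypothetical name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"       # no second node to replicate to anyway
  staleReplicaTimeout: "30"
```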

Gitea: Repository already configured at git.kube2.tricnet.de. ArgoCD Application already exists in argocd/application.yaml - don't duplicate.

Existing ArgoCD Application: TaskPlanner is already configured with ArgoCD. The monitoring stack should be a separate Application, not added to the existing one.


Sources Summary

Official Documentation

Community Issues (Verified Problems)

Tutorials and Guides


Pitfalls research for: CI/CD and Observability on k3s
Context: Adding to existing TaskPlanner deployment
Researched: 2026-02-03