taskplaner/.planning/research/PITFALLS-CICD-OBSERVABILITY.md
Thomas Richter 5dbabe6a2d docs: complete v2.0 CI/CD and observability research
Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfalls: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 03:29:23 +01:00


Domain Pitfalls: CI/CD and Observability on k3s

Domain: Adding ArgoCD, Prometheus, Grafana, and Loki to existing k3s cluster
Context: TaskPlanner on self-hosted k3s with Gitea, Traefik, Longhorn
Researched: 2026-02-03
Confidence: HIGH (verified with official documentation and community issues)


Critical Pitfalls

Mistakes that cause system instability, data loss, or require significant rework.

1. Gitea Webhook JSON Parsing Failure with ArgoCD

What goes wrong: ArgoCD receives webhooks from Gitea but fails to parse them with error: json: cannot unmarshal string into Go struct field .repository.created_at of type int64. This happens because ArgoCD treats Gitea events as GitHub events instead of Gogs events.

Why it happens: Gitea is a fork of Gogs, so its payload field types follow the Gogs format. When ArgoCD parses the event with its GitHub handler, the repository.created_at field (a string in Gitea/Gogs) fails to unmarshal into the int64 the GitHub schema expects.

Consequences:

  • Webhooks fail quietly (ArgoCD logs the error but continues)
  • Must wait for 3-minute polling interval for changes to sync
  • False confidence that instant sync is working

Warning signs:

  • ArgoCD server logs show webhook parsing errors
  • Application sync doesn't happen immediately after push
  • Webhook delivery shows success in Gitea but no ArgoCD response

Prevention:

  • Configure webhook with Gogs type in Gitea, NOT Gitea type
  • Test webhook delivery and check ArgoCD server logs: kubectl logs -n argocd deploy/argocd-server | grep -i webhook
  • Accept 3-minute polling as fallback (webhooks are optional enhancement)
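The mismatch is easy to see locally: Gitea serializes repository.created_at as an RFC 3339 string, while the GitHub schema ArgoCD applies declares it int64. A minimal sketch (payload trimmed, values illustrative):

```shell
# Trimmed-down Gitea push payload; real deliveries carry many more fields.
payload='{"repository":{"full_name":"tho/infrastructure","created_at":"2026-02-03T03:29:23+01:00"}}'

# A quoted created_at is exactly what the "cannot unmarshal string into
# ... int64" error in the ArgoCD server logs is complaining about.
if printf '%s' "$payload" | grep -q '"created_at":"'; then
  echo "created_at is a string: GitHub-format parsing fails, deliver as Gogs type"
fi
```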

Phase to address: ArgoCD installation phase - verify webhook integration immediately

Sources:


2. Loki Disk Full with No Size-Based Retention

What goes wrong: Loki fills the entire disk because retention is only time-based, not size-based. When disk fills, Loki crashes with "no space left on device" and becomes completely non-functional - Grafana cannot even fetch labels.

Why it happens:

  • Retention is disabled by default (compactor.retention-enabled: false)
  • Loki only supports time-based retention (e.g., 7 days), not size-based
  • High-volume logging can fill disk before retention period expires

Consequences:

  • Complete logging system failure
  • May affect other pods sharing the same Longhorn volume
  • Recovery requires manual cleanup or volume expansion

Warning signs:

  • Steadily increasing PVC usage visible in kubectl get pvc
  • Loki compactor logs show no deletion activity
  • Grafana queries become slow before complete failure

Prevention:

# Loki values.yaml
loki:
  compactor:
    retention_enabled: true
    compaction_interval: 10m
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    working_directory: /loki/compactor
  limits_config:
    retention_period: 168h  # 7 days - adjust based on disk size

  • Set a conservative retention period (start with 7 days)
  • Run compactor as StatefulSet with persistent storage for marker files
  • Set up Prometheus alert for PVC usage > 80%
  • Index period MUST be 24h for retention to work
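Whether a given retention_period can ever fit the PVC is simple arithmetic; a sketch with illustrative numbers (measure the real ingest rate first, e.g. via Loki's loki_distributor_bytes_received_total metric):

```shell
# Back-of-envelope check: will 7 days of logs fit the Loki PVC?
ingest_mb_per_day=300          # observed log volume, MiB/day (illustrative)
retention_days=7
pvc_size_mb=$((10 * 1024))     # 10Gi Longhorn PVC

needed_mb=$((ingest_mb_per_day * retention_days))
# Keep ~30% headroom for index, WAL, and compactor working space.
budget_mb=$((pvc_size_mb * 70 / 100))

if [ "$needed_mb" -gt "$budget_mb" ]; then
  echo "retention too long for PVC: need ${needed_mb}MiB, budget ${budget_mb}MiB"
else
  echo "fits: ${needed_mb}MiB of ${budget_mb}MiB budget"
fi
```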

Phase to address: Loki installation phase - configure retention from day one

Sources:


3. Prometheus Volume Growth Exceeds Longhorn PVC

What goes wrong: Prometheus metrics storage grows beyond PVC capacity. Longhorn volume expansion via CSI can result in a faulted volume that prevents Prometheus from starting.

Why it happens:

  • Default Prometheus retention is 15 days with no size limit
  • kube-prometheus-stack defaults don't match k3s resource constraints
  • Longhorn CSI volume expansion has known issues requiring specific procedure

Consequences:

  • Prometheus pod stuck in pending/crash loop
  • Loss of historical metrics
  • Longhorn volume in faulted state requiring manual recovery

Warning signs:

  • Prometheus pod restarts with OOMKilled or disk errors
  • kubectl describe pvc shows capacity approaching limit
  • Longhorn UI shows volume health degraded

Prevention:

# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: "8GB"  # Set explicit size limit
    resources:
      requests:
        memory: 400Mi
      limits:
        memory: 600Mi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi

  • Always set both retention AND retentionSize
  • Size PVC with 20% headroom above retentionSize
  • Monitor with prometheus_tsdb_storage_blocks_bytes metric
  • For expansion: stop pod, detach volume, resize, then restart
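The 20% headroom rule as a quick check (numbers match the values snippet above; adjust to your sizes):

```shell
# The PVC must exceed retentionSize so the TSDB can write new blocks while
# old ones are still being deleted; 20% headroom is the rule of thumb here.
retention_size_gb=8
pvc_gb=10

min_pvc_gb=$((retention_size_gb * 120 / 100))
if [ "$pvc_gb" -ge "$min_pvc_gb" ]; then
  echo "OK: ${pvc_gb}Gi PVC covers ${retention_size_gb}GB retentionSize plus headroom"
else
  echo "UNDERSIZED: need at least ${min_pvc_gb}Gi"
fi
```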

Phase to address: Prometheus installation phase

Sources:


4. ArgoCD + Traefik TLS Termination Redirect Loop

What goes wrong: ArgoCD UI becomes inaccessible with redirect loops or connection refused errors when accessed through Traefik. Browser shows ERR_TOO_MANY_REDIRECTS.

Why it happens: Traefik terminates TLS and forwards plain HTTP to ArgoCD. The ArgoCD server, configured for TLS by default, responds with a 307 redirect to HTTPS, creating an infinite loop.

Consequences:

  • Cannot access ArgoCD UI via ingress
  • CLI may work with port-forward but not through ingress
  • gRPC connections for CLI through ingress fail

Warning signs:

  • Browser redirect loop when accessing ArgoCD URL
  • curl -v shows 307 redirect responses
  • Works with kubectl port-forward but not via ingress

Prevention:

# Option 1: ConfigMap (recommended)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.insecure: "true"

# Option 2: Traefik IngressRoute for dual HTTP/gRPC
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    - kind: Rule
      match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c
  tls:
    certResolver: letsencrypt-prod

  • Set server.insecure: "true" in argocd-cmd-params-cm ConfigMap
  • Use IngressRoute (not Ingress) for proper gRPC support
  • Configure separate routes for HTTP and gRPC with correct priority
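If ArgoCD is installed through the argo-cd Helm chart, the same setting can live in Git alongside everything else; a sketch assuming the configs.params layout of recent chart versions (verify against the chart version you pin):

```yaml
# argo-cd chart values.yaml fragment; renders into argocd-cmd-params-cm
configs:
  params:
    server.insecure: true
```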

Phase to address: ArgoCD installation phase - test immediately after ingress setup

Sources:


Moderate Pitfalls

Mistakes that cause delays, debugging sessions, or technical debt.

5. ServiceMonitor Not Discovering Targets

What goes wrong: Prometheus ServiceMonitors are created but no targets appear in Prometheus. The scrape config shows 0/0 targets up.

Why it happens:

  • Label selector mismatch between Prometheus CR and ServiceMonitor
  • RBAC: Prometheus ServiceAccount lacks permission in target namespace
  • Port specified as number instead of name
  • ServiceMonitor in different namespace than Prometheus expects

Prevention:

# Ensure Prometheus CR has permissive selectors
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}  # Select all ServiceMonitors
    serviceMonitorNamespaceSelector: {}  # From all namespaces

# ServiceMonitor must use port NAME not number
spec:
  endpoints:
    - port: metrics  # NOT 9090

  • Use port name, never port number in ServiceMonitor
  • Check RBAC: kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus -n default
  • Verify label matching: kubectl get servicemonitor -A --show-labels
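A matching pair for a hypothetical taskplanner app, showing the two couplings that must line up (the label selector and the port name):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: taskplanner
  labels:
    app: taskplanner          # ServiceMonitor selects on this label
spec:
  selector:
    app: taskplanner
  ports:
    - name: metrics           # the NAME the ServiceMonitor references
      port: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
spec:
  selector:
    matchLabels:
      app: taskplanner        # must match the Service's labels
  endpoints:
    - port: metrics           # port name, never the number 9090
```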

Phase to address: Prometheus installation phase, verify with test ServiceMonitor

Sources:


6. k3s Control Plane Metrics Not Scraped

What goes wrong: Prometheus dashboards show no metrics for kube-scheduler, kube-controller-manager, or etcd. These panels appear blank or show "No data."

Why it happens: k3s runs control plane components as a single binary, not as pods. Standard kube-prometheus-stack expects to scrape pods that don't exist.

Prevention:

# kube-prometheus-stack values for k3s
kubeControllerManager:
  enabled: true
  endpoints:
    - 192.168.1.100  # k3s server IP
  service:
    enabled: true
    port: 10257
    targetPort: 10257
kubeScheduler:
  enabled: true
  endpoints:
    - 192.168.1.100
  service:
    enabled: true
    port: 10259
    targetPort: 10259
kubeEtcd:
  enabled: false  # k3s uses embedded sqlite/etcd

  • Explicitly configure control plane endpoints with k3s server IPs
  • Disable etcd monitoring if using embedded database
  • OR disable these components entirely for simpler setup
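The "disable entirely" option sketched as values (kube-proxy is included because k3s embeds it the same way; key names can shift between chart versions, so check the pinned chart's values.yaml):

```yaml
# Accept blank control-plane panels in exchange for zero failing targets
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false
```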

Phase to address: Prometheus installation phase

Sources:


7. Promtail Not Sending Logs to Loki

What goes wrong: Promtail pods are running but no logs appear in Grafana/Loki. Queries return empty results.

Why it happens:

  • Promtail started before Loki was ready
  • Log path configuration doesn't match k3s container runtime paths
  • Label selectors don't match actual pod labels
  • Network policy blocking Promtail -> Loki communication

Warning signs:

  • Promtail logs show "dropping target, no labels" or connection errors
  • kubectl logs -n monitoring promtail-xxx shows retries
  • Loki data source health check passes but queries return nothing

Prevention:

# Verify k3s containerd log paths
promtail:
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              target_label: node

  • Delete Promtail positions file to force re-read: kubectl exec -n monitoring promtail-xxx -- rm /tmp/positions.yaml
  • Ensure Loki is healthy before Promtail starts (use init container or sync wave)
  • Verify log paths match containerd: /var/log/pods/*/*/*.log
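The cri: {} stage exists because containerd log lines are not raw JSON: each line carries a timestamp, stream, a flags field, and the message. A shell sketch of the same split, useful for eyeballing a line pulled from /var/log/pods:

```shell
# Illustrative CRI-format line as written by k3s/containerd
line='2026-02-03T03:29:23.123456789+01:00 stdout F starting taskplanner'

# Same four-way split the cri pipeline stage performs
read -r ts stream flags content <<EOF
$line
EOF
echo "stream=$stream flags=$flags content=$content"
```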

Phase to address: Loki installation phase

Sources:


8. ArgoCD Self-Management Bootstrap Chicken-Egg

What goes wrong: Attempting to have ArgoCD manage itself creates confusion about what's managing what. Initial mistakes in the ArgoCD Application manifest can lock you out.

Why it happens: GitOps can't install ArgoCD if ArgoCD isn't present. After bootstrap, changing ArgoCD's self-managing Application incorrectly can break the cluster.

Prevention:

# Phase 1: Install ArgoCD manually (kubectl apply or helm)
# Phase 2: Create self-management Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: argocd
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false  # CRITICAL: Don't auto-prune ArgoCD components
      selfHeal: true

  • Always bootstrap ArgoCD manually first (Helm or kubectl)
  • Set prune: false for ArgoCD's self-management Application
  • Use App of Apps pattern for managed applications
  • Keep a local backup of ArgoCD Application manifest
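With the bootstrap done, the App of Apps pattern keeps every further addition in Git. A sketch of a hypothetical root Application (the `apps` path is an assumption; it would hold one Application manifest per managed app, including the monitoring stack):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: apps          # hypothetical: one Application manifest per app
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false      # stay conservative at the root as well
      selfHeal: true
```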

Phase to address: ArgoCD installation phase - plan bootstrap strategy upfront

Sources:


9. Sync Waves Misuse Creating False Dependencies

What goes wrong: Over-engineering sync waves creates unnecessary sequential deployments, increasing deployment time and complexity. Or under-engineering leads to race conditions.

Why it happens:

  • Developers add waves "just in case"
  • Misunderstanding that waves are within single Application only
  • Not knowing default wave is 0 and waves can be negative

Prevention:

# Use waves sparingly - only for true dependencies
# Database must exist before app
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # First

# App deployment
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"   # Default, after database

# Don't create unnecessary chains like:
# ConfigMap (wave -3) -> Secret (wave -2) -> Service (wave -1) -> Deployment (wave 0)
# These have no real dependency and should all be wave 0

  • Use waves only for actual dependencies (database before app, CRD before CR)
  • Keep wave structure as flat as possible
  • Sync waves do NOT work across different ArgoCD Applications
  • For cross-Application dependencies, use ApplicationSets with Progressive Syncs

Phase to address: Application configuration phase

Sources:


Minor Pitfalls

Annoyances that are easily fixed but waste time if not known.

10. Grafana Default Password Not Changed

What goes wrong: Using default admin/prom-operator credentials in production exposes the monitoring stack.

Prevention:

# kube-prometheus-stack values
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"  # NOTE: Helm does not expand env vars; inject via --set or a values overlay
  # Better: reference an existing secret
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password

Phase to address: Grafana installation phase


11. Missing open-iscsi for Longhorn

What goes wrong: Longhorn volumes fail to attach with cryptic errors.

Why it happens: Longhorn requires open-iscsi on all nodes, which isn't installed by default on many Linux distributions.

Prevention:

# On each node before Longhorn installation
sudo apt-get install -y open-iscsi
sudo systemctl enable iscsid
sudo systemctl start iscsid

Phase to address: Pre-installation prerequisites check

Sources:


12. ClusterIP Services Not Accessible

What goes wrong: After installing monitoring stack, Grafana/Prometheus aren't accessible externally.

Why it happens: k3s defaults to ClusterIP for services. Single-node setups need explicit ingress or LoadBalancer configuration.

Prevention:

# kube-prometheus-stack values
grafana:
  ingress:
    enabled: true
    ingressClassName: traefik
    hosts:
      - grafana.kube2.tricnet.de
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.kube2.tricnet.de
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod

Phase to address: Installation phase - configure ingress alongside deployment


13. Traefik v3 Breaking Changes for ArgoCD IngressRoute

What goes wrong: ArgoCD IngressRoute with gRPC support stops working after Traefik upgrade to v3.

Why it happens: Traefik v3 changed header matcher syntax from Headers() to Header().

Prevention:

# Traefik v2 (OLD - broken in v3)
match: Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`)

# Traefik v3 (NEW)
match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)

  • Check Traefik version before applying IngressRoutes
  • Test gRPC route after any Traefik upgrade

Phase to address: ArgoCD installation phase

Sources:


14. k3s Resource Exhaustion with Full Monitoring Stack

What goes wrong: Single-node k3s cluster becomes unresponsive after deploying full kube-prometheus-stack.

Why it happens:

  • kube-prometheus-stack deploys many components (prometheus, alertmanager, grafana, node-exporter, kube-state-metrics)
  • Default resource requests/limits are sized for larger clusters
  • k3s server process itself needs ~500MB RAM

Warning signs:

  • Pods stuck in Pending
  • OOMKilled events
  • Node NotReady status

Prevention:

# Minimal kube-prometheus-stack for single-node
alertmanager:
  enabled: false  # Disable if not using alerts
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
grafana:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi

  • Disable unnecessary components (alertmanager if no alerts configured)
  • Set explicit resource limits lower than defaults
  • Monitor cluster resources: kubectl top nodes
  • Consider: 4GB RAM minimum for k3s + monitoring + workloads
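A rough budget for the 4GB guideline (all numbers illustrative; replace with measured values from kubectl top pods -A):

```shell
# Memory budget for a 4GB single-node k3s box (illustrative request sizes)
k3s_mb=500; prometheus_mb=512; grafana_mb=256; loki_mb=512
promtail_mb=128; kube_state_metrics_mb=64; node_exporter_mb=32
total_mb=4096

stack_mb=$((k3s_mb + prometheus_mb + grafana_mb + loki_mb + promtail_mb + kube_state_metrics_mb + node_exporter_mb))
workload_mb=$((total_mb - stack_mb))
echo "monitoring+k3s: ${stack_mb}MiB, left for workloads: ${workload_mb}MiB"
```

With these numbers roughly half the box goes to k3s plus monitoring, which is why right-sizing the stack before deploying matters.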

Phase to address: Prometheus installation phase - right-size from start

Sources:


Phase-Specific Warning Summary

| Phase | Likely Pitfall | Mitigation |
| --- | --- | --- |
| Prerequisites | #11 Missing open-iscsi | Pre-flight check script |
| ArgoCD Installation | #4 TLS redirect loop, #8 Bootstrap | Test ingress immediately, plan bootstrap |
| ArgoCD + Gitea Integration | #1 Webhook parsing | Use Gogs webhook type, accept polling fallback |
| Prometheus Installation | #3 Volume growth, #5 ServiceMonitor, #6 Control plane, #14 Resources | Configure retention+size, verify RBAC, right-size |
| Loki Installation | #2 Disk full, #7 Promtail | Enable retention day one, verify log paths |
| Grafana Installation | #10 Default password, #12 ClusterIP | Set password, configure ingress |
| Application Configuration | #9 Sync waves | Use sparingly, only for real dependencies |

Pre-Installation Checklist

Before starting installation, verify:

  • open-iscsi installed on all nodes
  • Longhorn healthy with available storage (check kubectl get nodes and Longhorn UI)
  • Traefik version known (v2 vs v3 affects IngressRoute syntax)
  • DNS entries configured for monitoring subdomains
  • Gitea webhook type decision (use Gogs type, or accept polling fallback)
  • Disk space planning: Loki retention + Prometheus retention + headroom
  • Memory planning: k3s (~500MB) + monitoring (~1GB) + workloads
  • Namespace strategy decided (monitoring namespace vs default)
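The checklist translates into a small pre-flight script; a sketch where the DNS names come from this document and every check reports PASS/FAIL instead of aborting:

```shell
# Run before installation; prints one PASS/FAIL line per check.
check() {
  label=$1; shift
  if "$@" >/dev/null 2>&1; then echo "PASS: $label"; else echo "FAIL: $label"; fi
}
check "open-iscsi installed (iscsiadm)" command -v iscsiadm
check "kubectl available"               command -v kubectl
check "helm available"                  command -v helm
check "DNS: grafana.kube2.tricnet.de"   getent hosts grafana.kube2.tricnet.de
check "DNS: argocd.kube2.tricnet.de"    getent hosts argocd.kube2.tricnet.de
```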

Existing Infrastructure Compatibility Notes

Based on the existing TaskPlanner setup:

Traefik: Already in use with cert-manager (letsencrypt-prod). New services should follow same pattern:

annotations:
  cert-manager.io/cluster-issuer: letsencrypt-prod

Longhorn: Already the storage class. New PVCs should use explicit storageClassName: longhorn and consider replica count for single-node (set to 1).
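For single-node use, a dedicated single-replica class avoids putting replica settings on every PVC individually; a sketch (the class name is invented; numberOfReplicas and staleReplicaTimeout are standard Longhorn parameters):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single       # hypothetical name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"       # no second node to replicate to anyway
  staleReplicaTimeout: "30"
```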

Gitea: Repository already configured at git.kube2.tricnet.de. ArgoCD Application already exists in argocd/application.yaml - don't duplicate.

Existing ArgoCD Application: TaskPlanner is already configured with ArgoCD. The monitoring stack should be a separate Application, not added to the existing one.


Sources Summary

Official Documentation

Community Issues (Verified Problems)

Tutorials and Guides


Pitfalls research for: CI/CD and Observability on k3s
Context: Adding to existing TaskPlanner deployment
Researched: 2026-02-03