taskplaner/.planning/research/PITFALLS-CICD-OBSERVABILITY.md
Thomas Richter 5dbabe6a2d docs: complete v2.0 CI/CD and observability research
Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfalls: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 03:29:23 +01:00


# Domain Pitfalls: CI/CD and Observability on k3s
**Domain:** Adding ArgoCD, Prometheus, Grafana, and Loki to existing k3s cluster
**Context:** TaskPlanner on self-hosted k3s with Gitea, Traefik, Longhorn
**Researched:** 2026-02-03
**Confidence:** HIGH (verified with official documentation and community issues)
---
## Critical Pitfalls
Mistakes that cause system instability, data loss, or require significant rework.
### 1. Gitea Webhook JSON Parsing Failure with ArgoCD
**What goes wrong:** ArgoCD receives webhooks from Gitea but fails to parse them with error: `json: cannot unmarshal string into Go struct field .repository.created_at of type int64`. This happens because ArgoCD treats Gitea events as GitHub events instead of Gogs events.
**Why it happens:** Gitea is a fork of Gogs, but when the webhook is configured with the `Gitea` type, ArgoCD parses the payload with its GitHub handler, which expects `repository.created_at` as an int64 timestamp rather than the string Gitea sends. The Gogs handler accepts the string format.
**Consequences:**
- Webhooks silently fail (ArgoCD logs error but continues)
- Must wait for 3-minute polling interval for changes to sync
- False confidence that instant sync is working
**Warning signs:**
- ArgoCD server logs show webhook parsing errors
- Application sync doesn't happen immediately after push
- Webhook delivery shows success in Gitea but no ArgoCD response
**Prevention:**
- Configure webhook with `Gogs` type in Gitea, NOT `Gitea` type
- Test webhook delivery and check ArgoCD server logs: `kubectl logs -n argocd deploy/argocd-server | grep -i webhook`
- Accept 3-minute polling as fallback (webhooks are optional enhancement)
**Phase to address:** ArgoCD installation phase - verify webhook integration immediately
**Sources:**
- [ArgoCD Issue #16453 - Forgejo/Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD Issue #20444 - Gitea support lacking](https://github.com/argoproj/argo-cd/issues/20444)
---
### 2. Loki Disk Full with No Size-Based Retention
**What goes wrong:** Loki fills the entire disk because retention is only time-based, not size-based. When disk fills, Loki crashes with "no space left on device" and becomes completely non-functional - Grafana cannot even fetch labels.
**Why it happens:**
- Retention is disabled by default (`compactor.retention-enabled: false`)
- Loki only supports time-based retention (e.g., 7 days), not size-based
- High-volume logging can fill disk before retention period expires
**Consequences:**
- Complete logging system failure
- May affect other pods sharing the same Longhorn volume
- Recovery requires manual cleanup or volume expansion
**Warning signs:**
- Steadily increasing PVC usage visible in `kubectl get pvc`
- Loki compactor logs show no deletion activity
- Grafana queries become slow before complete failure
**Prevention:**
```yaml
# Loki values.yaml
loki:
  compactor:
    retention_enabled: true
    compaction_interval: 10m
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    working_directory: /loki/compactor
  limits_config:
    retention_period: 168h  # 7 days - adjust based on disk size
```
- Set conservative retention period (start with 7 days)
- Run compactor as StatefulSet with persistent storage for marker files
- Set up Prometheus alert for PVC usage > 80%
- Index period MUST be 24h for retention to work
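The PVC-usage alert can be expressed as a PrometheusRule. The rule name and PVC name pattern below are illustrative; the metrics assume kubelet volume stats are scraped, which kube-prometheus-stack does by default:
```yaml
# Sketch of a PVC-usage alert (names are illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-pvc-usage
  namespace: monitoring
spec:
  groups:
    - name: loki-storage
      rules:
        - alert: LokiPVCAlmostFull
          expr: |
            kubelet_volume_stats_used_bytes{namespace="monitoring", persistentvolumeclaim=~"storage-loki-.*"}
              / kubelet_volume_stats_capacity_bytes > 0.8
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Loki PVC is over 80% full"
```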
**Phase to address:** Loki installation phase - configure retention from day one
**Sources:**
- [Grafana Loki Retention Documentation](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Loki Issue #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)
---
### 3. Prometheus Volume Growth Exceeds Longhorn PVC
**What goes wrong:** Prometheus metrics storage grows beyond PVC capacity. Longhorn volume expansion via CSI can result in a faulted volume that prevents Prometheus from starting.
**Why it happens:**
- Default Prometheus retention is 15 days with no size limit
- kube-prometheus-stack defaults don't match k3s resource constraints
- Longhorn CSI volume expansion has known issues requiring specific procedure
**Consequences:**
- Prometheus pod stuck in pending/crash loop
- Loss of historical metrics
- Longhorn volume in faulted state requiring manual recovery
**Warning signs:**
- Prometheus pod restarts with OOMKilled or disk errors
- `kubectl describe pvc` shows capacity approaching limit
- Longhorn UI shows volume health degraded
**Prevention:**
```yaml
# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: "8GB"  # Set explicit size limit
    resources:
      requests:
        memory: 400Mi
      limits:
        memory: 600Mi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi
```
- Always set both `retention` AND `retentionSize`
- Size PVC with 20% headroom above retentionSize
- Monitor with `prometheus_tsdb_storage_blocks_bytes` metric
- For expansion: stop pod, detach volume, resize, then restart
**Phase to address:** Prometheus installation phase
**Sources:**
- [Longhorn Issue #2222 - Volume expansion faults](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack Issue #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)
---
### 4. ArgoCD + Traefik TLS Termination Redirect Loop
**What goes wrong:** ArgoCD UI becomes inaccessible with redirect loops or connection refused errors when accessed through Traefik. Browser shows ERR_TOO_MANY_REDIRECTS.
**Why it happens:** Traefik terminates TLS and forwards HTTP to ArgoCD. ArgoCD server, configured for TLS by default, responds with 307 redirects to HTTPS, creating infinite loop.
**Consequences:**
- Cannot access ArgoCD UI via ingress
- CLI may work with port-forward but not through ingress
- gRPC connections for CLI through ingress fail
**Warning signs:**
- Browser redirect loop when accessing ArgoCD URL
- `curl -v` shows 307 redirect responses
- Works with `kubectl port-forward` but not via ingress
**Prevention:**
```yaml
# Option 1: ConfigMap (recommended)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.insecure: "true"
---
# Option 2: Traefik IngressRoute for dual HTTP/gRPC
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    - kind: Rule
      match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c
  tls:
    certResolver: letsencrypt-prod
```
- Set `server.insecure: "true"` in argocd-cmd-params-cm ConfigMap
- Use IngressRoute (not Ingress) for proper gRPC support
- Configure separate routes for HTTP and gRPC with correct priority
**Phase to address:** ArgoCD installation phase - test immediately after ingress setup
**Sources:**
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [Traefik Community - ArgoCD behind Traefik](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)
---
## Moderate Pitfalls
Mistakes that cause delays, debugging sessions, or technical debt.
### 5. ServiceMonitor Not Discovering Targets
**What goes wrong:** Prometheus ServiceMonitors are created but no targets appear in Prometheus. The scrape config shows 0/0 targets up.
**Why it happens:**
- Label selector mismatch between Prometheus CR and ServiceMonitor
- RBAC: Prometheus ServiceAccount lacks permission in target namespace
- Port specified as number instead of name
- ServiceMonitor in different namespace than Prometheus expects
**Prevention:**
```yaml
# Ensure Prometheus CR has permissive selectors
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}  # Select all ServiceMonitors
    serviceMonitorNamespaceSelector: {}  # From all namespaces

# ServiceMonitor must use port NAME not number
spec:
  endpoints:
    - port: metrics  # NOT 9090
```
- Use port name, never port number in ServiceMonitor
- Check RBAC: `kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus -n default`
- Verify label matching: `kubectl get servicemonitor -A --show-labels`
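A minimal ServiceMonitor for the verification step might look like the sketch below; all names are illustrative, and the selected Service must expose a port literally named `metrics`:
```yaml
# Minimal test ServiceMonitor (hypothetical names)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: test-servicemonitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # only needed if your Prometheus selector is not {}
spec:
  selector:
    matchLabels:
      app: test-app
  endpoints:
    - port: metrics  # port NAME from the Service, not a number
      interval: 30s
```
After applying it, the target should appear on the Prometheus `/targets` page within one scrape interval.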
**Phase to address:** Prometheus installation phase, verify with test ServiceMonitor
**Sources:**
- [Prometheus Operator Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [ServiceMonitor not discovered Issue #3383](https://github.com/prometheus-operator/prometheus-operator/issues/3383)
---
### 6. k3s Control Plane Metrics Not Scraped
**What goes wrong:** Prometheus dashboards show no metrics for kube-scheduler, kube-controller-manager, or etcd. These panels appear blank or show "No data."
**Why it happens:** k3s runs control plane components as a single binary, not as pods. Standard kube-prometheus-stack expects to scrape pods that don't exist.
**Prevention:**
```yaml
# kube-prometheus-stack values for k3s
kubeControllerManager:
  enabled: true
  endpoints:
    - 192.168.1.100  # k3s server IP
  service:
    enabled: true
    port: 10257
    targetPort: 10257
kubeScheduler:
  enabled: true
  endpoints:
    - 192.168.1.100
  service:
    enabled: true
    port: 10259
    targetPort: 10259
kubeEtcd:
  enabled: false  # k3s uses embedded sqlite/etcd
```
- Explicitly configure control plane endpoints with k3s server IPs
- Disable etcd monitoring if using embedded database
- OR disable these components entirely for simpler setup
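Note that the scheduler and controller-manager typically bind their metrics ports to localhost, so scraping can still fail even with the endpoints configured. One way to expose them (verify against your k3s version before relying on it) is via server arguments:
```yaml
# /etc/rancher/k3s/config.yaml on the k3s server node (restart k3s afterwards)
kube-controller-manager-arg:
  - bind-address=0.0.0.0
kube-scheduler-arg:
  - bind-address=0.0.0.0
```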
**Phase to address:** Prometheus installation phase
**Sources:**
- [Prometheus for Rancher K3s Control Plane Monitoring](https://www.spectrocloud.com/blog/enabling-rancher-k3s-cluster-control-plane-monitoring-with-prometheus)
---
### 7. Promtail Not Sending Logs to Loki
**What goes wrong:** Promtail pods are running but no logs appear in Grafana/Loki. Queries return empty results.
**Why it happens:**
- Promtail started before Loki was ready
- Log path configuration doesn't match k3s container runtime paths
- Label selectors don't match actual pod labels
- Network policy blocking Promtail -> Loki communication
**Warning signs:**
- Promtail logs show "dropping target, no labels" or connection errors
- `kubectl logs -n monitoring promtail-xxx` shows retries
- Loki data source health check passes but queries return nothing
**Prevention:**
```yaml
# Verify k3s containerd log paths
promtail:
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              target_label: node
```
- Delete Promtail positions file to force re-read: `kubectl exec -n monitoring promtail-xxx -- rm /tmp/positions.yaml`
- Ensure Loki is healthy before Promtail starts (use init container or sync wave)
- Verify log paths match containerd: `/var/log/pods/*/*/*.log`
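The init-container approach can be sketched as a plain pod-spec fragment; how it is wired into the Promtail chart's values depends on the chart version, and the service name and port below are the Loki defaults:
```yaml
# Sketch: block Promtail startup until Loki answers its readiness probe
initContainers:
  - name: wait-for-loki
    image: busybox:1.36
    command:
      - sh
      - -c
      - until wget -qO- http://loki:3100/ready; do echo waiting for loki; sleep 5; done
```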
**Phase to address:** Loki installation phase
**Sources:**
- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)
---
### 8. ArgoCD Self-Management Bootstrap Chicken-Egg
**What goes wrong:** Attempting to have ArgoCD manage itself creates confusion about what's managing what. Initial mistakes in the ArgoCD Application manifest can lock you out.
**Why it happens:** GitOps can't install ArgoCD if ArgoCD isn't present. After bootstrap, changing ArgoCD's self-managing Application incorrectly can break the cluster.
**Prevention:**
```yaml
# Phase 1: Install ArgoCD manually (kubectl apply or helm)
# Phase 2: Create self-management Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: argocd
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false  # CRITICAL: Don't auto-prune ArgoCD components
      selfHeal: true
```
- Always bootstrap ArgoCD manually first (Helm or kubectl)
- Set `prune: false` for ArgoCD's self-management Application
- Use App of Apps pattern for managed applications
- Keep a local backup of ArgoCD Application manifest
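The App of Apps pattern boils down to one root Application pointing at a directory of Application manifests; the `apps` path below is illustrative:
```yaml
# App of Apps sketch: root Application (directory path is hypothetical)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: apps  # directory containing one Application manifest per app
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true
```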
**Phase to address:** ArgoCD installation phase - plan bootstrap strategy upfront
**Sources:**
- [Bootstrapping ArgoCD - Windsock.io](https://windsock.io/bootstrapping-argocd/)
- [Demystifying GitOps - Bootstrapping ArgoCD](https://medium.com/@aaltundemir/demystifying-gitops-bootstrapping-argo-cd-4a861284f273)
---
### 9. Sync Waves Misuse Creating False Dependencies
**What goes wrong:** Over-engineering sync waves creates unnecessary sequential deployments, increasing deployment time and complexity; under-engineering, on the other hand, leads to race conditions.
**Why it happens:**
- Developers add waves "just in case"
- Misunderstanding that waves are within single Application only
- Not knowing default wave is 0 and waves can be negative
**Prevention:**
```yaml
# Use waves sparingly - only for true dependencies
# Database must exist before app
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # First
---
# App deployment
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"  # Default, after database
# Don't create unnecessary chains like:
# ConfigMap (wave -3) -> Secret (wave -2) -> Service (wave -1) -> Deployment (wave 0)
# These have no real dependency and should all be wave 0
```
- Use waves only for actual dependencies (database before app, CRD before CR)
- Keep wave structure as flat as possible
- Sync waves do NOT work across different ArgoCD Applications
- For cross-Application dependencies, use ApplicationSets with Progressive Syncs
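A Progressive Syncs ordering might be sketched as below; note the feature is gated behind a flag in some ArgoCD versions, and every name, label, and path here is illustrative:
```yaml
# Sketch: RollingSync steps ordering database before app (hypothetical names)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ordered-apps
  namespace: argocd
spec:
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: stage
              operator: In
              values: [database]
        - matchExpressions:
            - key: stage
              operator: In
              values: [app]
  generators:
    - list:
        elements:
          - name: postgres
            stage: database
          - name: taskplanner
            stage: app
  template:
    metadata:
      name: '{{name}}'
      labels:
        stage: '{{stage}}'  # RollingSync matches on these labels
    spec:
      project: default
      source:
        repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
        path: '{{name}}'
        targetRevision: HEAD
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{name}}'
```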
**Phase to address:** Application configuration phase
**Sources:**
- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
---
## Minor Pitfalls
Annoyances that are easily fixed but waste time if not known.
### 10. Grafana Default Password Not Changed
**What goes wrong:** Using default `admin/prom-operator` credentials in production exposes the monitoring stack.
**Prevention:**
```yaml
# kube-prometheus-stack values
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}"  # From secret
  # Or use existing secret
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
```
**Phase to address:** Grafana installation phase
---
### 11. Missing open-iscsi for Longhorn
**What goes wrong:** Longhorn volumes fail to attach with cryptic errors.
**Why it happens:** Longhorn requires `open-iscsi` on all nodes, which isn't installed by default on many Linux distributions.
**Prevention:**
```bash
# On each node before Longhorn installation
sudo apt-get install -y open-iscsi
sudo systemctl enable iscsid
sudo systemctl start iscsid
```
**Phase to address:** Pre-installation prerequisites check
**Sources:**
- [Longhorn Prerequisites](https://longhorn.io/docs/latest/deploy/install/#installation-requirements)
---
### 12. ClusterIP Services Not Accessible
**What goes wrong:** After installing monitoring stack, Grafana/Prometheus aren't accessible externally.
**Why it happens:** k3s defaults to ClusterIP for services. Single-node setups need explicit ingress or LoadBalancer configuration.
**Prevention:**
```yaml
# kube-prometheus-stack values
grafana:
  ingress:
    enabled: true
    ingressClassName: traefik
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - grafana.kube2.tricnet.de
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.kube2.tricnet.de
```
**Phase to address:** Installation phase - configure ingress alongside deployment
---
### 13. Traefik v3 Breaking Changes for ArgoCD IngressRoute
**What goes wrong:** ArgoCD IngressRoute with gRPC support stops working after Traefik upgrade to v3.
**Why it happens:** Traefik v3 changed header matcher syntax from `Headers()` to `Header()`.
**Prevention:**
```yaml
# Traefik v2 (OLD - broken in v3)
match: Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`)
# Traefik v3 (NEW)
match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
```
- Check Traefik version before applying IngressRoutes
- Test gRPC route after any Traefik upgrade
**Phase to address:** ArgoCD installation phase
**Sources:**
- [ArgoCD Issue #15534 - Traefik v3 docs](https://github.com/argoproj/argo-cd/issues/15534)
---
### 14. k3s Resource Exhaustion with Full Monitoring Stack
**What goes wrong:** Single-node k3s cluster becomes unresponsive after deploying full kube-prometheus-stack.
**Why it happens:**
- kube-prometheus-stack deploys many components (prometheus, alertmanager, grafana, node-exporter, kube-state-metrics)
- Default resource requests/limits are sized for larger clusters
- k3s server process itself needs ~500MB RAM
**Warning signs:**
- Pods stuck in Pending
- OOMKilled events
- Node NotReady status
**Prevention:**
```yaml
# Minimal kube-prometheus-stack for single-node
alertmanager:
  enabled: false  # Disable if not using alerts
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
grafana:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
```
- Disable unnecessary components (alertmanager if no alerts configured)
- Set explicit resource limits lower than defaults
- Monitor cluster resources: `kubectl top nodes`
- Consider: 4GB RAM minimum for k3s + monitoring + workloads
**Phase to address:** Prometheus installation phase - right-size from start
**Sources:**
- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)
---
## Phase-Specific Warning Summary
| Phase | Likely Pitfall | Mitigation |
|-------|---------------|------------|
| Prerequisites | #11 Missing open-iscsi | Pre-flight check script |
| ArgoCD Installation | #4 TLS redirect loop, #8 Bootstrap | Test ingress immediately, plan bootstrap |
| ArgoCD + Gitea Integration | #1 Webhook parsing | Use Gogs webhook type, accept polling fallback |
| Prometheus Installation | #3 Volume growth, #5 ServiceMonitor, #6 Control plane, #14 Resources | Configure retention+size, verify RBAC, right-size |
| Loki Installation | #2 Disk full, #7 Promtail | Enable retention day one, verify log paths |
| Grafana Installation | #10 Default password, #12 ClusterIP | Set password, configure ingress |
| Application Configuration | #9 Sync waves | Use sparingly, only for real dependencies |
---
## Pre-Installation Checklist
Before starting installation, verify:
- [ ] open-iscsi installed on all nodes
- [ ] Longhorn healthy with available storage (check `kubectl get nodes` and Longhorn UI)
- [ ] Traefik version known (v2 vs v3 affects IngressRoute syntax)
- [ ] DNS entries configured for monitoring subdomains
- [ ] Gitea webhook type decision (use Gogs type, or accept polling fallback)
- [ ] Disk space planning: Loki retention + Prometheus retention + headroom
- [ ] Memory planning: k3s (~500MB) + monitoring (~1GB) + workloads
- [ ] Namespace strategy decided (monitoring namespace vs default)
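The disk-space line item can be sketched as simple arithmetic; all numbers below are illustrative, so substitute your own retentionSize and measured log ingest rate:
```shell
# Back-of-envelope disk budget: Prometheus retentionSize + Loki estimate + 20% headroom
prom_gb=8    # Prometheus retentionSize from the values above
loki_gb=10   # e.g. ~1.4 GiB/day of logs x 7 days retention (measure your own rate)
total=$(( (prom_gb + loki_gb) * 120 / 100 ))
echo "Provision at least ${total}Gi of Longhorn space"  # → Provision at least 21Gi of Longhorn space
```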
---
## Existing Infrastructure Compatibility Notes
Based on the existing TaskPlanner setup:
**Traefik:** Already in use with cert-manager (letsencrypt-prod). New services should follow same pattern:
```yaml
annotations:
  cert-manager.io/cluster-issuer: letsencrypt-prod
```
**Longhorn:** Already the storage class. New PVCs should use explicit `storageClassName: longhorn` and consider replica count for single-node (set to 1).
**Gitea:** Repository already configured at `git.kube2.tricnet.de`. ArgoCD Application already exists in `argocd/application.yaml` - don't duplicate.
**Existing ArgoCD Application:** TaskPlanner is already configured with ArgoCD. The monitoring stack should be a separate Application, not added to the existing one.
---
## Sources Summary
### Official Documentation
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
- [Grafana Loki Retention](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)
- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)
### Community Issues (Verified Problems)
- [ArgoCD #16453 - Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD #20444 - Gitea support](https://github.com/argoproj/argo-cd/issues/20444)
- [Loki #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)
- [Longhorn #2222 - Volume expansion](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)
- [Prometheus Operator #3383 - ServiceMonitor discovery](https://github.com/prometheus-operator/prometheus-operator/issues/3383)
### Tutorials and Guides
- [K3S Rocks - ArgoCD](https://k3s.rocks/argocd/)
- [K3S Rocks - Logging](https://k3s.rocks/logging/)
- [Bootstrapping ArgoCD](https://windsock.io/bootstrapping-argocd/)
- [Prometheus ServiceMonitor Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [Traefik Community - ArgoCD](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)
---
*Pitfalls research for: CI/CD and Observability on k3s*
*Context: Adding to existing TaskPlanner deployment*
*Researched: 2026-02-03*