# Domain Pitfalls: CI/CD and Observability on k3s
**Domain:** Adding ArgoCD, Prometheus, Grafana, and Loki to an existing k3s cluster
**Context:** TaskPlanner on self-hosted k3s with Gitea, Traefik, Longhorn
**Researched:** 2026-02-03
**Confidence:** HIGH (verified against official documentation and community issues)
---

## Critical Pitfalls

Mistakes that cause system instability, data loss, or require significant rework.
### 1. Gitea Webhook JSON Parsing Failure with ArgoCD

**What goes wrong:** ArgoCD receives webhooks from Gitea but fails to parse them, logging: `json: cannot unmarshal string into Go struct field .repository.created_at of type int64`. This happens because ArgoCD treats Gitea events as GitHub events instead of Gogs events.

**Why it happens:** Gitea is a fork of Gogs, but ArgoCD's webhook handler expects different field types. The `repository.created_at` field is a string in Gitea/Gogs payloads, while ArgoCD expects an int64 in the GitHub format.

**Consequences:**

- Webhooks silently fail (ArgoCD logs the error but continues)
- Must wait for the 3-minute polling interval for changes to sync
- False confidence that instant sync is working

**Warning signs:**

- ArgoCD server logs show webhook parsing errors
- Application sync doesn't happen immediately after a push
- Webhook delivery shows success in Gitea but no ArgoCD response

**Prevention:**

- Configure the webhook with the `Gogs` type in Gitea, NOT the `Gitea` type
- Test webhook delivery and check the ArgoCD server logs: `kubectl logs -n argocd deploy/argocd-server | grep -i webhook`
- Accept 3-minute polling as a fallback (webhooks are an optional enhancement)
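
Because Gitea must deliver the Gogs payload format, a webhook secret (if one is used) also belongs under the Gogs key in `argocd-secret`, not a Gitea-specific one. A sketch - the secret value is a placeholder and must match what is entered in the Gitea webhook form:

```yaml
# Merge this key into the existing argocd-secret; do not replace the Secret wholesale.
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  webhook.gogs.secret: replace-with-a-shared-secret
```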

**Phase to address:** ArgoCD installation phase - verify webhook integration immediately

**Sources:**

- [ArgoCD Issue #16453 - Forgejo/Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD Issue #20444 - Gitea support lacking](https://github.com/argoproj/argo-cd/issues/20444)
---

### 2. Loki Disk Full with No Size-Based Retention

**What goes wrong:** Loki fills the entire disk because retention is only time-based, not size-based. When the disk fills, Loki crashes with "no space left on device" and becomes completely non-functional - Grafana cannot even fetch labels.

**Why it happens:**

- Retention is disabled by default (`compactor.retention-enabled: false`)
- Loki only supports time-based retention (e.g., 7 days), not size-based
- High-volume logging can fill the disk before the retention period expires

**Consequences:**

- Complete logging system failure
- May affect other pods sharing the same Longhorn volume
- Recovery requires manual cleanup or volume expansion

**Warning signs:**

- Steadily increasing PVC usage visible in `kubectl get pvc`
- Loki compactor logs show no deletion activity
- Grafana queries become slow before complete failure

**Prevention:**

```yaml
# Loki values.yaml
loki:
  compactor:
    retention_enabled: true
    compaction_interval: 10m
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    working_directory: /loki/compactor
  limits_config:
    retention_period: 168h # 7 days - adjust based on disk size
```

- Set a conservative retention period (start with 7 days)
- Run the compactor as a StatefulSet with persistent storage for marker files
- Set up a Prometheus alert for PVC usage > 80%
- The index period MUST be 24h for retention to work
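
The PVC-usage alert suggested above can be shipped as a PrometheusRule that kube-prometheus-stack picks up. A sketch - the namespace, PVC name pattern, and threshold are assumptions to adapt:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-pvc-usage
  namespace: monitoring
spec:
  groups:
    - name: loki-storage
      rules:
        - alert: LokiPVCAlmostFull
          # kubelet exposes per-PVC usage; fires once the Loki volume passes 80%
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*loki.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*loki.*"} > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Loki PVC {{ $labels.persistentvolumeclaim }} is over 80% full"
```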

**Phase to address:** Loki installation phase - configure retention from day one

**Sources:**

- [Grafana Loki Retention Documentation](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Loki Issue #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)
---

### 3. Prometheus Volume Growth Exceeds Longhorn PVC

**What goes wrong:** Prometheus metrics storage grows beyond PVC capacity. Longhorn volume expansion via CSI can result in a faulted volume that prevents Prometheus from starting.

**Why it happens:**

- Default Prometheus retention is 15 days with no size limit
- kube-prometheus-stack defaults don't match k3s resource constraints
- Longhorn CSI volume expansion has known issues requiring a specific procedure

**Consequences:**

- Prometheus pod stuck in pending/crash loop
- Loss of historical metrics
- Longhorn volume in a faulted state requiring manual recovery

**Warning signs:**

- Prometheus pod restarts with OOMKilled or disk errors
- `kubectl describe pvc` shows capacity approaching the limit
- Longhorn UI shows volume health degraded

**Prevention:**

```yaml
# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: "8GB" # Set explicit size limit
    resources:
      requests:
        memory: 400Mi
      limits:
        memory: 600Mi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi
```

- Always set both `retention` AND `retentionSize`
- Size the PVC with 20% headroom above `retentionSize`
- Monitor with the `prometheus_tsdb_storage_blocks_bytes` metric
- For expansion: stop the pod, detach the volume, resize, then restart

**Phase to address:** Prometheus installation phase

**Sources:**

- [Longhorn Issue #2222 - Volume expansion faults](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack Issue #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)
---

### 4. ArgoCD + Traefik TLS Termination Redirect Loop

**What goes wrong:** The ArgoCD UI becomes inaccessible with redirect loops or connection-refused errors when accessed through Traefik. The browser shows ERR_TOO_MANY_REDIRECTS.

**Why it happens:** Traefik terminates TLS and forwards plain HTTP to ArgoCD. The ArgoCD server, configured for TLS by default, responds with 307 redirects to HTTPS, creating an infinite loop.

**Consequences:**

- Cannot access the ArgoCD UI via ingress
- CLI may work with port-forward but not through the ingress
- gRPC connections for the CLI through the ingress fail

**Warning signs:**

- Browser redirect loop when accessing the ArgoCD URL
- `curl -v` shows 307 redirect responses
- Works with `kubectl port-forward` but not via ingress

**Prevention:**

```yaml
# Part 1: ConfigMap (disables ArgoCD's own TLS)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.insecure: "true"
---
# Part 2: Traefik IngressRoute for dual HTTP/gRPC
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    - kind: Rule
      match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c
  tls:
    certResolver: letsencrypt-prod
```

- Set `server.insecure: "true"` in the argocd-cmd-params-cm ConfigMap
- Use IngressRoute (not Ingress) for proper gRPC support
- Configure separate routes for HTTP and gRPC with the correct priority
- The two parts work together: insecure mode stops the redirect loop, and the gRPC route keeps the CLI working through the ingress

**Phase to address:** ArgoCD installation phase - test immediately after ingress setup

**Sources:**

- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [Traefik Community - ArgoCD behind Traefik](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)
---

## Moderate Pitfalls

Mistakes that cause delays, debugging sessions, or technical debt.
### 5. ServiceMonitor Not Discovering Targets

**What goes wrong:** Prometheus ServiceMonitors are created but no targets appear in Prometheus. The scrape config shows 0/0 targets up.

**Why it happens:**

- Label selector mismatch between the Prometheus CR and the ServiceMonitor
- RBAC: the Prometheus ServiceAccount lacks permission in the target namespace
- Port specified as a number instead of a name
- ServiceMonitor in a different namespace than Prometheus expects

**Prevention:**

```yaml
# Ensure the Prometheus CR has permissive selectors
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {} # Select all ServiceMonitors
    serviceMonitorNamespaceSelector: {} # From all namespaces
```

```yaml
# ServiceMonitor endpoints must use the port NAME, not the number
spec:
  endpoints:
    - port: metrics # NOT 9090
```

- Use the port name, never the port number, in a ServiceMonitor
- Check RBAC: `kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus -n default`
- Verify label matching: `kubectl get servicemonitor -A --show-labels`
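
A minimal end-to-end check is to deploy one ServiceMonitor against a Service known to expose metrics and confirm it shows up under Status -> Targets. The names below are assumptions (a Service labeled `app: taskplanner` in `default`, exposing a port named `metrics`):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplanner
  namespace: default
  labels:
    release: kube-prometheus-stack # only needed if the Prometheus CR still filters on this label
spec:
  selector:
    matchLabels:
      app: taskplanner # must match the Service's labels, not the Pod's
  endpoints:
    - port: metrics # the port NAME from the Service
      interval: 30s
```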

**Phase to address:** Prometheus installation phase, verify with a test ServiceMonitor

**Sources:**

- [Prometheus Operator Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [ServiceMonitor not discovered Issue #3383](https://github.com/prometheus-operator/prometheus-operator/issues/3383)
---

### 6. k3s Control Plane Metrics Not Scraped

**What goes wrong:** Prometheus dashboards show no metrics for kube-scheduler, kube-controller-manager, or etcd. These panels appear blank or show "No data."

**Why it happens:** k3s runs the control plane components inside a single binary, not as pods. The standard kube-prometheus-stack expects to scrape pods that don't exist.

**Prevention:**

```yaml
# kube-prometheus-stack values for k3s
kubeControllerManager:
  enabled: true
  endpoints:
    - 192.168.1.100 # k3s server IP
  service:
    enabled: true
    port: 10257
    targetPort: 10257
kubeScheduler:
  enabled: true
  endpoints:
    - 192.168.1.100
  service:
    enabled: true
    port: 10259
    targetPort: 10259
kubeEtcd:
  enabled: false # k3s uses embedded sqlite/etcd
```

- Explicitly configure control plane endpoints with the k3s server IPs
- Note that these metrics ports listen on localhost by default; scraping only works if they are exposed (e.g. `--kube-controller-manager-arg=bind-address=0.0.0.0` and the scheduler equivalent on the k3s server)
- Disable etcd monitoring if using the embedded database
- OR disable these components entirely for a simpler setup

**Phase to address:** Prometheus installation phase

**Sources:**

- [Prometheus for Rancher K3s Control Plane Monitoring](https://www.spectrocloud.com/blog/enabling-rancher-k3s-cluster-control-plane-monitoring-with-prometheus)
---

### 7. Promtail Not Sending Logs to Loki

**What goes wrong:** Promtail pods are running but no logs appear in Grafana/Loki. Queries return empty results.

**Why it happens:**

- Promtail started before Loki was ready
- Log path configuration doesn't match the k3s container runtime paths
- Label selectors don't match actual pod labels
- A network policy is blocking Promtail -> Loki communication

**Warning signs:**

- Promtail logs show "dropping target, no labels" or connection errors
- `kubectl logs -n monitoring promtail-xxx` shows retries
- The Loki data source health check passes but queries return nothing

**Prevention:**

```yaml
# Verify k3s containerd log paths
promtail:
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              target_label: node
```

- Delete the Promtail positions file to force a re-read: `kubectl exec -n monitoring promtail-xxx -- rm /tmp/positions.yaml`
- Ensure Loki is healthy before Promtail starts (use an init container or sync wave)
- Verify log paths match containerd: `/var/log/pods/*/*/*.log`

**Phase to address:** Loki installation phase

**Sources:**

- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)
---

### 8. ArgoCD Self-Management Bootstrap Chicken-and-Egg

**What goes wrong:** Attempting to have ArgoCD manage itself creates confusion about what's managing what. Initial mistakes in the ArgoCD Application manifest can lock you out.

**Why it happens:** GitOps can't install ArgoCD if ArgoCD isn't present. After bootstrap, changing ArgoCD's self-managing Application incorrectly can break the cluster.

**Prevention:**

```yaml
# Phase 1: Install ArgoCD manually (kubectl apply or helm)
# Phase 2: Create the self-management Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: argocd
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false # CRITICAL: Don't auto-prune ArgoCD components
      selfHeal: true
```

- Always bootstrap ArgoCD manually first (Helm or kubectl)
- Set `prune: false` for ArgoCD's self-management Application
- Use the App of Apps pattern for managed applications
- Keep a local backup of the ArgoCD Application manifest
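
The App of Apps pattern mentioned above boils down to one root Application that points at a directory of Application manifests, after which everything else is managed from Git. A sketch reusing the repo URL above; the `apps` path is an assumption:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: apps # directory containing one Application manifest per workload
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true
      # prune is commonly enabled at the root, but keep prune: false on the
      # ArgoCD self-management Application itself, as noted above
```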

**Phase to address:** ArgoCD installation phase - plan the bootstrap strategy upfront

**Sources:**

- [Bootstrapping ArgoCD - Windsock.io](https://windsock.io/bootstrapping-argocd/)
- [Demystifying GitOps - Bootstrapping ArgoCD](https://medium.com/@aaltundemir/demystifying-gitops-bootstrapping-argo-cd-4a861284f273)
---

### 9. Sync Waves Misuse Creating False Dependencies

**What goes wrong:** Over-engineering sync waves creates unnecessarily sequential deployments, increasing deployment time and complexity. Under-engineering leads to race conditions.

**Why it happens:**

- Developers add waves "just in case"
- Misunderstanding that waves order resources within a single Application only
- Not knowing that the default wave is 0 and waves can be negative

**Prevention:**

```yaml
# Use waves sparingly - only for true dependencies.
# Database must exist before the app:
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1" # First

# App deployment:
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0" # Default, after the database

# Don't create unnecessary chains like:
#   ConfigMap (wave -3) -> Secret (wave -2) -> Service (wave -1) -> Deployment (wave 0)
# These have no real dependency and should all be wave 0.
```

- Use waves only for actual dependencies (database before app, CRD before CR)
- Keep the wave structure as flat as possible
- Sync waves do NOT work across different ArgoCD Applications
- For cross-Application dependencies, use ApplicationSets with Progressive Syncs

**Phase to address:** Application configuration phase

**Sources:**

- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
---

## Minor Pitfalls

Annoyances that are easily fixed but waste time if not known.
### 10. Grafana Default Password Not Changed

**What goes wrong:** Using the default `admin/prom-operator` credentials in production exposes the monitoring stack.

**Prevention:**

```yaml
# kube-prometheus-stack values
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}" # Injected at deploy time, never committed to Git
  # Or reference an existing secret:
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
```
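
The `existingSecret` route requires the Secret to exist before the chart installs. A sketch using the key names from the values above; the namespace and password are placeholders to replace:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-credentials
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin
  admin-password: change-me-to-a-generated-password # e.g. from `openssl rand -base64 24`
```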

**Phase to address:** Grafana installation phase
---

### 11. Missing open-iscsi for Longhorn

**What goes wrong:** Longhorn volumes fail to attach, with cryptic errors.

**Why it happens:** Longhorn requires `open-iscsi` on all nodes, and it isn't installed by default on many Linux distributions.

**Prevention:**

```bash
# On each node before Longhorn installation
sudo apt-get install -y open-iscsi
sudo systemctl enable iscsid
sudo systemctl start iscsid
```
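
The prerequisite check can be scripted so it is run rather than remembered. A minimal sketch that only reports; the tool list is an assumption (`iscsiadm` is the binary the open-iscsi package provides):

```shell
#!/bin/sh
# Pre-flight: report whether each required binary is on PATH, one line per tool.
report=""
for bin in iscsiadm kubectl helm; do
    if command -v "$bin" >/dev/null 2>&1; then
        report="${report}OK      ${bin}\n"
    else
        report="${report}MISSING ${bin}\n"
    fi
done
printf "%b" "$report"
```

On the nodes themselves, checking `systemctl is-active iscsid` is the natural follow-up once the package is installed.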

**Phase to address:** Pre-installation prerequisites check

**Sources:**

- [Longhorn Prerequisites](https://longhorn.io/docs/latest/deploy/install/#installation-requirements)
---

### 12. ClusterIP Services Not Accessible

**What goes wrong:** After installing the monitoring stack, Grafana/Prometheus aren't accessible externally.

**Why it happens:** Services default to ClusterIP. Single-node setups need explicit ingress or LoadBalancer configuration.

**Prevention:**

```yaml
# kube-prometheus-stack values
grafana:
  ingress:
    enabled: true
    ingressClassName: traefik
    hosts:
      - grafana.kube2.tricnet.de
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.kube2.tricnet.de
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
```

**Phase to address:** Installation phase - configure ingress alongside deployment
---

### 13. Traefik v3 Breaking Changes for ArgoCD IngressRoute

**What goes wrong:** An ArgoCD IngressRoute with gRPC support stops working after a Traefik upgrade to v3.

**Why it happens:** Traefik v3 changed the header matcher syntax from `Headers()` to `Header()`.

**Prevention:**

```yaml
# Traefik v2 (OLD - broken in v3)
match: Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`)

# Traefik v3 (NEW)
match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
```

- Check the Traefik version before applying IngressRoutes
- Test the gRPC route after any Traefik upgrade
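
Since the change is a mechanical rename, existing manifests can be migrated with a substitution instead of hand-editing. A sketch, shown on a string for illustration:

```shell
#!/bin/sh
# Rewrite the Traefik v2 Headers(...) matcher to the v3 Header(...) form.
v2_match='Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`)'
v3_match=$(printf '%s' "$v2_match" | sed 's/Headers(/Header(/g')
printf '%s\n' "$v3_match"
# Across a manifest tree, the same substitution would be e.g.:
#   find manifests -name '*.yaml' -exec sed -i 's/Headers(/Header(/g' {} +
```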

**Phase to address:** ArgoCD installation phase

**Sources:**

- [ArgoCD Issue #15534 - Traefik v3 docs](https://github.com/argoproj/argo-cd/issues/15534)
---

### 14. k3s Resource Exhaustion with Full Monitoring Stack

**What goes wrong:** A single-node k3s cluster becomes unresponsive after deploying the full kube-prometheus-stack.

**Why it happens:**

- kube-prometheus-stack deploys many components (prometheus, alertmanager, grafana, node-exporter, kube-state-metrics)
- Default resource requests/limits are sized for larger clusters
- The k3s server process itself needs ~500MB RAM

**Warning signs:**

- Pods stuck in Pending
- OOMKilled events
- Node NotReady status

**Prevention:**

```yaml
# Minimal kube-prometheus-stack for single-node
alertmanager:
  enabled: false # Disable if not using alerts
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
grafana:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
```

- Disable unnecessary components (alertmanager if no alerts are configured)
- Set explicit resource limits lower than the defaults
- Monitor cluster resources: `kubectl top nodes`
- Plan for 4GB RAM minimum for k3s + monitoring + workloads

**Phase to address:** Prometheus installation phase - right-size from the start

**Sources:**

- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)
---

## Phase-Specific Warning Summary

| Phase | Likely Pitfall | Mitigation |
|-------|----------------|------------|
| Prerequisites | #11 Missing open-iscsi | Pre-flight check script |
| ArgoCD Installation | #4 TLS redirect loop, #8 Bootstrap | Test ingress immediately, plan bootstrap |
| ArgoCD + Gitea Integration | #1 Webhook parsing | Use Gogs webhook type, accept polling fallback |
| Prometheus Installation | #3 Volume growth, #5 ServiceMonitor, #6 Control plane, #14 Resources | Configure retention+size, verify RBAC, right-size |
| Loki Installation | #2 Disk full, #7 Promtail | Enable retention day one, verify log paths |
| Grafana Installation | #10 Default password, #12 ClusterIP | Set password, configure ingress |
| Application Configuration | #9 Sync waves | Use sparingly, only for real dependencies |
---

## Pre-Installation Checklist

Before starting installation, verify:

- [ ] open-iscsi installed on all nodes
- [ ] Longhorn healthy with available storage (check `kubectl get nodes` and the Longhorn UI)
- [ ] Traefik version known (v2 vs v3 affects IngressRoute syntax)
- [ ] DNS entries configured for monitoring subdomains
- [ ] Gitea webhook type decision (use Gogs type, or accept polling fallback)
- [ ] Disk space planning: Loki retention + Prometheus retention + headroom
- [ ] Memory planning: k3s (~500MB) + monitoring (~1GB) + workloads
- [ ] Namespace strategy decided (monitoring namespace vs default)
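
The disk-space line item can be made concrete with simple arithmetic: daily ingest times retention days, plus headroom. A sketch with an assumed ingest rate - measure the real one by watching PVC usage for a few days:

```shell
#!/bin/sh
# Back-of-envelope PVC sizing: ingest/day * retention days * headroom.
loki_mb_per_day=500   # assumed log volume; measure before trusting this
loki_retention_days=7 # matches a 168h Loki retention_period
headroom_pct=30       # spare capacity so retention can lag without filling the disk

loki_pvc_mb=$(( loki_mb_per_day * loki_retention_days * (100 + headroom_pct) / 100 ))
printf 'Loki PVC: at least %s MB (~%s GiB)\n' \
    "$loki_pvc_mb" "$(( (loki_pvc_mb + 1023) / 1024 ))"
```

The same arithmetic applies to the Prometheus volume, with `retentionSize` standing in for the ingest-times-retention product.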
---

## Existing Infrastructure Compatibility Notes

Based on the existing TaskPlanner setup:

**Traefik:** Already in use with cert-manager (letsencrypt-prod). New services should follow the same pattern:

```yaml
annotations:
  cert-manager.io/cluster-issuer: letsencrypt-prod
```

**Longhorn:** Already the storage class. New PVCs should use an explicit `storageClassName: longhorn` and consider the replica count for single-node (set it to 1).
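
For the single-node replica setting, a dedicated StorageClass leaves the default `longhorn` class untouched. A sketch - the class name is an assumption; `numberOfReplicas` and `staleReplicaTimeout` are standard Longhorn parameters:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-single-replica
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"       # no point replicating on a single node
  staleReplicaTimeout: "2880" # minutes before a failed replica is cleaned up
```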

**Gitea:** Repository already configured at `git.kube2.tricnet.de`. An ArgoCD Application already exists in `argocd/application.yaml` - don't duplicate it.

**Existing ArgoCD Application:** TaskPlanner is already configured with ArgoCD. The monitoring stack should be a separate Application, not added to the existing one.
---

## Sources Summary

### Official Documentation

- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
- [Grafana Loki Retention](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)
- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)

### Community Issues (Verified Problems)

- [ArgoCD #16453 - Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD #20444 - Gitea support](https://github.com/argoproj/argo-cd/issues/20444)
- [Loki #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)
- [Longhorn #2222 - Volume expansion](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)
- [Prometheus Operator #3383 - ServiceMonitor discovery](https://github.com/prometheus-operator/prometheus-operator/issues/3383)

### Tutorials and Guides

- [K3S Rocks - ArgoCD](https://k3s.rocks/argocd/)
- [K3S Rocks - Logging](https://k3s.rocks/logging/)
- [Bootstrapping ArgoCD](https://windsock.io/bootstrapping-argocd/)
- [Prometheus ServiceMonitor Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [Traefik Community - ArgoCD](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)
---

*Pitfalls research for: CI/CD and Observability on k3s*
*Context: Adding to existing TaskPlanner deployment*
*Researched: 2026-02-03*