diff --git a/.planning/research/ARCHITECTURE.md b/.planning/research/ARCHITECTURE.md index cfb850e..124d9d7 100644 --- a/.planning/research/ARCHITECTURE.md +++ b/.planning/research/ARCHITECTURE.md @@ -257,7 +257,7 @@ func (s *LocalStorage) Store(ctx context.Context, file io.Reader) (string, error | | | v | [FTS5 trigger auto-updates index] - | | + | v v v [UI shows new note] <--JSON response-- [Return created note] ``` @@ -513,3 +513,621 @@ Based on component dependencies, suggested implementation order: --- *Architecture research for: Personal task/notes web application* *Researched: 2026-01-29* + +--- + +# v2.0 Architecture: CI/CD and Observability Integration + +**Domain:** GitOps CI/CD and Observability Stack +**Researched:** 2026-02-03 +**Confidence:** HIGH (verified with official documentation) + +## Executive Summary + +This section details how ArgoCD, Prometheus, Grafana, and Loki integrate with the existing k3s/Gitea/Traefik architecture. The integration follows established patterns for self-hosted Kubernetes observability stacks, with specific considerations for k3s's lightweight nature and Traefik as the ingress controller. + +Key insight: The existing CI/CD foundation (Gitea Actions + ArgoCD Application) is already in place. This milestone adds observability and operational automation rather than building from scratch. + +## Current Architecture Overview + +``` + Internet + | + [Traefik] + (Ingress) + | + +-------------------------+-------------------------+ + | | | + task.kube2 git.kube2 (future) + .tricnet.de .tricnet.de argocd/grafana + | | + [TaskPlaner] [Gitea] + (default ns) + Actions + | Runner + | | + [Longhorn PVC] | + (data store) | + v + [Container Registry] + git.kube2.tricnet.de +``` + +### Existing Components + +| Component | Namespace | Purpose | Status | +|-----------|-----------|---------|--------| +| k3s | - | Kubernetes distribution | Running | +| Traefik | kube-system | Ingress controller | Running | +| Longhorn | longhorn-system | Persistent storage | Running | +| cert-manager | cert-manager | TLS certificates | Running | +| Gitea | gitea (assumed) | Git hosting + CI | Running | +| TaskPlaner | default | Application | Running | +| ArgoCD Application | argocd | GitOps deployment | Defined (may need install) | + +### Existing CI/CD Pipeline + +From `.gitea/workflows/build.yaml`: +1. Push to master triggers Gitea Actions +2. Build Docker image with BuildX +3. Push to Gitea Container Registry +4. Update Helm values.yaml with new image tag +5. Commit with `[skip ci]` +6. ArgoCD detects change and syncs + +**Current gap:** ArgoCD may not be installed yet (Application manifest exists but needs ArgoCD server). 
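For reference, steps 4–5 of the pipeline above usually reduce to a single small scripted workflow step. The sketch below is illustrative only: the `.image.tag` key, the `helm/taskplaner/values.yaml` path, and the use of `yq` are assumptions about the repo layout, not a transcription of the actual `build.yaml`.

```yaml
# Hypothetical tag-bump step (adapt the key/path to the real values.yaml layout)
- name: Update Helm values with new image tag
  run: |
    IMAGE_TAG="${GITHUB_SHA::8}"   # short commit SHA used as the image tag
    yq -i ".image.tag = \"${IMAGE_TAG}\"" helm/taskplaner/values.yaml
    git config user.name "gitea-actions"
    git config user.email "actions@kube2.tricnet.de"
    git commit -am "chore: bump image tag to ${IMAGE_TAG} [skip ci]"
    git push
```

ArgoCD then picks up the resulting commit on its next poll, or immediately via the webhook described in the integration section below.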
+ +## Integration Architecture + +### Target State + +``` + Internet + | + [Traefik] + (Ingress) + | + +----------+----------+----------+----------+----------+ + | | | | | | + task.* git.* argocd.* grafana.* (internal) + | | | | | +[TaskPlaner] [Gitea] [ArgoCD] [Grafana] [Prometheus] + | | | | [Loki] + | | | | [Alloy] + | +---webhook---> | | + | | | | + +------ metrics ------+----------+--------->+ + +------ logs ---------+---------[Alloy]---->+ (to Loki) +``` + +### Namespace Strategy + +| Namespace | Components | Rationale | +|-----------|------------|-----------| +| `argocd` | ArgoCD server, repo-server, application-controller | Standard convention; ClusterRoleBinding expects this | +| `monitoring` | Prometheus, Grafana, Alertmanager | Consolidate observability; kube-prometheus-stack default | +| `loki` | Loki, Alloy (DaemonSet) | Separate from metrics for resource isolation | +| `default` | TaskPlaner | Existing app deployment | +| `gitea` | Gitea + Actions Runner | Assumed existing | + +**Alternative considered:** All observability in single namespace +**Decision:** Separate `monitoring` and `loki` because: +- Different scaling characteristics (Alloy is DaemonSet, Prometheus is StatefulSet) +- Easier resource quota management +- Standard community practice + +## Component Integration Details + +### 1. ArgoCD Integration + +**Installation Method:** Helm chart from `argo/argo-cd` + +**Integration Points:** + +| Integration | How | Configuration | +|-------------|-----|---------------| +| Gitea Repository | HTTPS clone | Repository credential in argocd-secret | +| Gitea Webhook | POST to `/api/webhook` | Reduces sync delay from 3min to seconds | +| Traefik Ingress | IngressRoute or Ingress | `server.insecure=true` to avoid redirect loops | +| TLS | cert-manager annotation | Let's Encrypt via existing cluster-issuer | + +**Critical Configuration:** + +```yaml +# Helm values for ArgoCD with Traefik +configs: + params: + server.insecure: true # Required: Traefik handles TLS + +server: + ingress: + enabled: true + ingressClassName: traefik + annotations: + cert-manager.io/cluster-issuer: letsencrypt-prod + hosts: + - argocd.kube2.tricnet.de + tls: + - secretName: argocd-tls + hosts: + - argocd.kube2.tricnet.de +``` + +**Webhook Setup for Gitea:** + +1. In ArgoCD secret, set `webhook.gogs.secret` (Gitea uses Gogs-compatible webhooks) +2. In Gitea repository settings, add webhook: + - URL: `https://argocd.kube2.tricnet.de/api/webhook` + - Content type: `application/json` + - Secret: Same as configured in ArgoCD + +**Known Limitation:** Webhooks work for Applications but not ApplicationSets with Gitea. + +### 2. Prometheus/Grafana Integration (kube-prometheus-stack) + +**Installation Method:** Helm chart `prometheus-community/kube-prometheus-stack` + +**Integration Points:** + +| Integration | How | Configuration | +|-------------|-----|---------------| +| k3s metrics | Exposed kube-* endpoints | k3s config modification required | +| Traefik metrics | ServiceMonitor | Traefik exposes `:9100/metrics` | +| TaskPlaner metrics | ServiceMonitor (future) | App must expose `/metrics` endpoint | +| Grafana UI | Traefik Ingress | Standard Kubernetes Ingress | + +**Critical k3s Configuration:** + +k3s binds controller-manager, scheduler, and proxy to localhost by default. For Prometheus scraping, expose on 0.0.0.0. 
+ +Create/modify `/etc/rancher/k3s/config.yaml`: + +```yaml +kube-controller-manager-arg: + - "bind-address=0.0.0.0" +kube-proxy-arg: + - "metrics-bind-address=0.0.0.0" +kube-scheduler-arg: + - "bind-address=0.0.0.0" +``` + +Then restart k3s: `sudo systemctl restart k3s` + +**k3s-specific Helm values:** + +```yaml +# Disable etcd monitoring (k3s uses sqlite, not etcd) +defaultRules: + rules: + etcd: false + +kubeEtcd: + enabled: false + +# Fix endpoint discovery for k3s +kubeControllerManager: + enabled: true + endpoints: + - + service: + enabled: true + port: 10257 + targetPort: 10257 + +kubeScheduler: + enabled: true + endpoints: + - + service: + enabled: true + port: 10259 + targetPort: 10259 + +kubeProxy: + enabled: true + endpoints: + - + service: + enabled: true + port: 10249 + targetPort: 10249 + +# Grafana ingress +grafana: + ingress: + enabled: true + ingressClassName: traefik + annotations: + cert-manager.io/cluster-issuer: letsencrypt-prod + hosts: + - grafana.kube2.tricnet.de + tls: + - secretName: grafana-tls + hosts: + - grafana.kube2.tricnet.de +``` + +**ServiceMonitor for TaskPlaner (future):** + +Once TaskPlaner exposes `/metrics`: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: taskplaner + namespace: monitoring + labels: + release: prometheus # Must match kube-prometheus-stack release +spec: + namespaceSelector: + matchNames: + - default + selector: + matchLabels: + app.kubernetes.io/name: taskplaner + endpoints: + - port: http + path: /metrics + interval: 30s +``` + +### 3. Loki + Alloy Integration (Log Aggregation) + +**Important:** Promtail is deprecated (LTS until Feb 2026, EOL March 2026). Use **Grafana Alloy** instead. + +**Installation Method:** +- Loki: Helm chart `grafana/loki` (monolithic mode for single node) +- Alloy: Helm chart `grafana/alloy` + +**Integration Points:** + +| Integration | How | Configuration | +|-------------|-----|---------------| +| Pod logs | Alloy DaemonSet | Mounts `/var/log/pods` | +| Loki storage | Longhorn PVC or MinIO | Single-binary uses filesystem | +| Grafana datasource | Auto-configured | kube-prometheus-stack integration | +| k3s node logs | Alloy journal reader | journalctl access | + +**Deployment Mode Decision:** + +| Mode | When to Use | Our Choice | +|------|-------------|------------| +| Monolithic (single-binary) | Small deployments, <100GB/day | **Yes - single node k3s** | +| Simple Scalable | Medium deployments | No | +| Microservices | Large scale, HA required | No | + +**Loki Helm values (monolithic):** + +```yaml +deploymentMode: SingleBinary + +singleBinary: + replicas: 1 + persistence: + enabled: true + storageClass: longhorn + size: 10Gi + +# Disable components not needed in monolithic +read: + replicas: 0 +write: + replicas: 0 +backend: + replicas: 0 + +# Use filesystem storage (not S3/MinIO for simplicity) +loki: + storage: + type: filesystem + schemaConfig: + configs: + - from: "2024-01-01" + store: tsdb + object_store: filesystem + schema: v13 + index: + prefix: index_ + period: 24h +``` + +**Alloy DaemonSet Configuration:** + +```yaml +# alloy-values.yaml +alloy: + configMap: + create: true + content: | + // Kubernetes logs collection + loki.source.kubernetes "pods" { + targets = discovery.kubernetes.pods.targets + forward_to = [loki.write.default.receiver] + } + + // Send to Loki + loki.write "default" { + endpoint { + url = "http://loki.loki.svc.cluster.local:3100/loki/api/v1/push" + } + } + + // Kubernetes discovery + discovery.kubernetes "pods" { + role = "pod" + 
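        // Note (single-node assumption): loki.source.kubernetes tails pod logs via the
        // Kubernetes API, so a DaemonSet on a multi-node cluster would normally filter
        // these targets to the local node here to avoid collecting every pod's logs
        // once per node; on this single-node k3s cluster the bare "pod" role is enough.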
} +``` + +### 4. Traefik Metrics Integration + +Traefik already exposes Prometheus metrics. Enable scraping: + +**Option A: ServiceMonitor (if using kube-prometheus-stack)** + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: traefik + namespace: monitoring + labels: + release: prometheus +spec: + namespaceSelector: + matchNames: + - kube-system + selector: + matchLabels: + app.kubernetes.io/name: traefik + endpoints: + - port: metrics + path: /metrics + interval: 30s +``` + +**Option B: Verify Traefik metrics are enabled** + +Check Traefik deployment args include: +``` +--entrypoints.metrics.address=:8888 +--metrics.prometheus=true +--metrics.prometheus.entryPoint=metrics +``` + +## Data Flow Diagrams + +### Metrics Flow + +``` ++------------------+ +------------------+ +------------------+ +| TaskPlaner | | Traefik | | k3s core | +| /metrics | | :9100/metrics | | :10249,10257... | ++--------+---------+ +--------+---------+ +--------+---------+ + | | | + +------------------------+------------------------+ + | + v + +-------------------+ + | Prometheus | + | (ServiceMonitors) | + +--------+----------+ + | + v + +-------------------+ + | Grafana | + | (Dashboards) | + +-------------------+ +``` + +### Log Flow + +``` ++------------------+ +------------------+ +------------------+ +| TaskPlaner | | Traefik | | Other Pods | +| stdout/stderr | | access logs | | stdout/stderr | ++--------+---------+ +--------+---------+ +--------+---------+ + | | | + +------------------------+------------------------+ + | + /var/log/pods + | + v + +-------------------+ + | Alloy DaemonSet | + | (log collection) | + +--------+----------+ + | + v + +-------------------+ + | Loki | + | (log storage) | + +--------+----------+ + | + v + +-------------------+ + | Grafana | + | (log queries) | + +-------------------+ +``` + +### GitOps Flow + +``` ++------------+ +------------+ +---------------+ +------------+ +| Developer | --> | Gitea | --> | Gitea Actions | --> | Container | +| git push | | Repository | | (build.yaml) | | Registry | ++------------+ +-----+------+ +-------+-------+ +------------+ + | | + | (update values.yaml) + | | + v v + +------------+ +------------+ + | Webhook | ----> | ArgoCD | + | (notify) | | Server | + +------------+ +-----+------+ + | + (sync app) + | + v + +------------+ + | Kubernetes | + | (deploy) | + +------------+ +``` + +## Build Order (Dependencies) + +Based on component dependencies, recommended installation order: + +### Phase 1: ArgoCD (no dependencies on observability) + +``` +1. Install ArgoCD via Helm + - Creates namespace: argocd + - Verify existing Application manifest works + - Configure Gitea webhook + +Dependencies: None (Traefik already running) +Validates: GitOps pipeline end-to-end +``` + +### Phase 2: kube-prometheus-stack (foundational observability) + +``` +2. Configure k3s metrics exposure + - Modify /etc/rancher/k3s/config.yaml + - Restart k3s + +3. Install kube-prometheus-stack via Helm + - Creates namespace: monitoring + - Includes: Prometheus, Grafana, Alertmanager + - Includes: Default dashboards and alerts + +Dependencies: k3s metrics exposed +Validates: Basic cluster monitoring working +``` + +### Phase 3: Loki + Alloy (log aggregation) + +``` +4. Install Loki via Helm (monolithic mode) + - Creates namespace: loki + - Configure storage with Longhorn + +5. Install Alloy via Helm + - DaemonSet in loki namespace + - Configure Kubernetes log discovery + - Point to Loki endpoint + +6. 
Add Loki datasource to Grafana + - URL: http://loki.loki.svc.cluster.local:3100 + +Dependencies: Grafana from step 3, storage +Validates: Logs visible in Grafana Explore +``` + +### Phase 4: Application Integration + +``` +7. Add TaskPlaner metrics endpoint (if not exists) + - Expose /metrics in app + - Create ServiceMonitor + +8. Create application dashboards in Grafana + - TaskPlaner-specific metrics + - Request latency, error rates + +Dependencies: All previous phases +Validates: Full observability of application +``` + +## Resource Requirements + +| Component | CPU Request | Memory Request | Storage | +|-----------|-------------|----------------|---------| +| ArgoCD (all) | 500m | 512Mi | - | +| Prometheus | 200m | 512Mi | 10Gi (Longhorn) | +| Grafana | 100m | 256Mi | 1Gi (Longhorn) | +| Alertmanager | 50m | 64Mi | 1Gi (Longhorn) | +| Loki | 200m | 256Mi | 10Gi (Longhorn) | +| Alloy (per node) | 100m | 128Mi | - | + +**Total additional:** ~1.2 CPU cores, ~1.7Gi RAM, ~22Gi storage + +## Security Considerations + +### Network Policies + +Consider network policies to restrict: +- Prometheus scraping only from monitoring namespace +- Loki ingestion only from Alloy +- Grafana access only via Traefik + +### Secrets Management + +| Secret | Location | Purpose | +|--------|----------|---------| +| `argocd-initial-admin-secret` | argocd ns | Initial admin password | +| `argocd-secret` | argocd ns | Webhook secrets, repo credentials | +| `grafana-admin` | monitoring ns | Grafana admin password | + +### Ingress Authentication + +For production, consider: +- ArgoCD: Built-in OIDC/OAuth integration +- Grafana: Built-in auth (local, LDAP, OAuth) +- Prometheus: Traefik BasicAuth middleware (already pattern in use) + +## Anti-Patterns to Avoid + +### 1. Skipping k3s Metrics Configuration + +**What happens:** Prometheus installs but most dashboards show "No data" +**Prevention:** Configure k3s to expose metrics BEFORE installing kube-prometheus-stack + +### 2. Using Promtail Instead of Alloy + +**What happens:** Technical debt - Promtail EOL is March 2026 +**Prevention:** Use Alloy from the start; migration documentation exists + +### 3. Running Loki in Microservices Mode for Small Clusters + +**What happens:** Unnecessary complexity, resource overhead +**Prevention:** Monolithic mode for clusters under 100GB/day log volume + +### 4. Forgetting server.insecure for ArgoCD with Traefik + +**What happens:** Redirect loop (ERR_TOO_MANY_REDIRECTS) +**Prevention:** Always set `configs.params.server.insecure=true` when Traefik handles TLS + +### 5. 
ServiceMonitor Label Mismatch + +**What happens:** Prometheus doesn't discover custom ServiceMonitors +**Prevention:** Ensure `release: ` label matches kube-prometheus-stack release + +## Sources + +**ArgoCD:** +- [ArgoCD Webhook Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/webhook/) +- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/) +- [ArgoCD Installation](https://argo-cd.readthedocs.io/en/stable/operator-manual/installation/) +- [Mastering GitOps: ArgoCD and Gitea on Kubernetes](https://blog.stackademic.com/mastering-gitops-a-comprehensive-guide-to-self-hosting-argocd-and-gitea-on-kubernetes-9cdf36856c38) + +**Prometheus/Grafana:** +- [kube-prometheus-stack Helm Chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) +- [Prometheus on K3s](https://fabianlee.org/2022/07/02/prometheus-installing-kube-prometheus-stack-on-k3s-cluster/) +- [K3s Monitoring Guide](https://github.com/cablespaghetti/k3s-monitoring) +- [ServiceMonitor Explained](https://dkbalachandar.wordpress.com/2025/07/21/kubernetes-servicemonitor-explained-how-to-monitor-services-with-prometheus/) + +**Loki/Alloy:** +- [Loki Monolithic Installation](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/) +- [Loki Deployment Modes](https://grafana.com/docs/loki/latest/get-started/deployment-modes/) +- [Migrate from Promtail to Alloy](https://grafana.com/docs/alloy/latest/set-up/migrate/from-promtail/) +- [Grafana Loki 3.4 Release](https://grafana.com/blog/2025/02/13/grafana-loki-3.4-standardized-storage-config-sizing-guidance-and-promtail-merging-into-alloy/) +- [Alloy Replacing Promtail](https://docs-bigbang.dso.mil/latest/docs/adrs/0004-alloy-replacing-promtail/) + +**Traefik Integration:** +- [Traefik Metrics with Prometheus](https://traefik.io/blog/capture-traefik-metrics-for-apps-on-kubernetes-with-prometheus) + +--- +*Last updated: 2026-02-03* diff --git a/.planning/research/FEATURES.md b/.planning/research/FEATURES.md index b4ed8e3..7bdf5f2 100644 --- a/.planning/research/FEATURES.md +++ b/.planning/research/FEATURES.md @@ -210,5 +210,241 @@ Features to defer until product-market fit is established: - Evernote features page (verified via WebFetch) --- -*Feature research for: Personal Task/Notes Web App* -*Researched: 2026-01-29* + +# CI/CD and Observability Features + +**Domain:** CI/CD pipelines and Kubernetes observability for personal project +**Researched:** 2026-02-03 +**Context:** Single-user, self-hosted TaskPlanner app with existing basic Gitea Actions pipeline + +## Current State + +Based on the existing `.gitea/workflows/build.yaml`: +- Build and push Docker images to Gitea Container Registry +- Docker layer caching enabled +- Automatic Helm values update with new image tag +- No tests in pipeline +- No GitOps automation (ArgoCD defined but requires manual sync) +- No observability stack + +--- + +## Table Stakes + +Features required for production-grade operations. Missing any of these means the system is incomplete for reliable self-hosting. 
+ +### CI/CD Pipeline + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| **Automated tests in pipeline** | Catch bugs before deployment; without tests, pipeline is just a build script | Low | Start with unit tests (70% of test pyramid), add integration tests later | +| **Build caching** | Already have this | - | Using Docker layer cache to registry | +| **Lint/static analysis** | Catch errors early (fail fast principle) | Low | ESLint, TypeScript checking | +| **Pipeline as code** | Already have this | - | Workflow defined in `.gitea/workflows/` | +| **Automated deployment trigger** | Manual `helm upgrade` defeats CI/CD purpose | Low | ArgoCD auto-sync on Git changes | +| **Container image tagging** | Already have this | - | SHA-based tags with `latest` | + +### GitOps + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| **Git as single source of truth** | Core GitOps principle; cluster state should match Git | Low | ArgoCD watches Git repo, syncs to cluster | +| **Auto-sync** | Manual sync defeats GitOps purpose | Low | ArgoCD `syncPolicy.automated.enabled: true` | +| **Self-healing** | Prevents drift; if someone kubectl edits, ArgoCD reverts | Low | ArgoCD `selfHeal: true` | +| **Health checks** | Know if deployment succeeded | Low | ArgoCD built-in health status | + +### Observability + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| **Basic metrics collection** | Know if app is running, resource usage | Medium | Prometheus + kube-state-metrics | +| **Metrics visualization** | Metrics without dashboards are useless | Low | Grafana with pre-built Kubernetes dashboards | +| **Container logs aggregation** | Debug issues without `kubectl logs` | Medium | Loki (lightweight, label-based) | +| **Basic alerting** | Know when something breaks | Low | AlertManager with 3-5 critical alerts | + +--- + +## Differentiators + +Features that add significant value but are not strictly required for a single-user personal app. Implement if you want learning/practice or improved reliability. 
+ +### CI/CD Pipeline + +| Feature | Value Proposition | Complexity | Notes | +|---------|-------------------|------------|-------| +| **Smoke tests on deploy** | Verify deployment actually works | Medium | Hit health endpoint after deploy | +| **Build notifications** | Know when builds fail without watching | Low | Slack/Discord/email webhook | +| **DORA metrics tracking** | Track deployment frequency, lead time | Medium | Measure CI/CD effectiveness | +| **Parallel test execution** | Faster feedback on larger test suites | Medium | Only valuable with substantial test suite | +| **Dependency vulnerability scanning** | Catch security issues early | Low | `npm audit`, Trivy for container images | + +### GitOps + +| Feature | Value Proposition | Complexity | Notes | +|---------|-------------------|------------|-------| +| **Automated pruning** | Remove resources deleted from Git | Low | ArgoCD `prune: true` | +| **Sync windows** | Control when syncs happen | Low | Useful if you want maintenance windows | +| **Application health dashboard** | Visual cluster state | Low | ArgoCD UI already provides this | +| **Git commit status** | See deployment status in Gitea | Medium | ArgoCD notifications to Git | + +### Observability + +| Feature | Value Proposition | Complexity | Notes | +|---------|-------------------|------------|-------| +| **Application-level metrics** | Track business metrics (tasks created, etc.) | Medium | Custom Prometheus metrics in app | +| **Request tracing** | Debug latency issues | High | OpenTelemetry, Tempo/Jaeger | +| **SLO/SLI dashboards** | Define and track reliability targets | Medium | Error budgets, latency percentiles | +| **Log-based alerting** | Alert on error patterns | Medium | Loki alerting rules | +| **Uptime monitoring** | External availability check | Low | Uptime Kuma or similar | + +--- + +## Anti-Features + +Features that are overkill for a single-user personal app. Actively avoid these to prevent over-engineering. 
+ +| Anti-Feature | Why Avoid | What to Do Instead | +|--------------|-----------|-------------------| +| **Multi-environment promotion (dev/staging/prod)** | Single user, single environment | Deploy directly to prod; use feature flags if needed | +| **Blue-green/canary deployments** | Complex rollout for single user is overkill | Simple rolling update; ArgoCD rollback if needed | +| **Full E2E test suite in CI** | Expensive, slow, diminishing returns for personal app | Unit + smoke tests; manual E2E when needed | +| **High availability ArgoCD** | HA is for multi-team, multi-tenant | Single replica ArgoCD is fine | +| **Distributed tracing** | Overkill unless debugging microservices latency | Only add if you have multiple services with latency issues | +| **ELK stack for logging** | Resource-heavy; Elasticsearch needs significant memory | Use Loki instead (label-based, lightweight) | +| **Full APM solution** | DataDog/NewRelic-style solutions are enterprise-focused | Prometheus + Grafana + Loki covers personal needs | +| **Secrets management (Vault)** | Complex for single user with few secrets | Kubernetes secrets or sealed-secrets | +| **Policy enforcement (OPA/Gatekeeper)** | You are the only user; no policy conflicts | Skip entirely | +| **Multi-cluster management** | Single cluster, single app | Skip entirely | +| **Cost optimization/FinOps** | Personal project; cost is fixed/minimal | Skip entirely | +| **AI-assisted observability** | Marketing hype; manual review is fine at this scale | Skip entirely | + +--- + +## Feature Dependencies + +``` +Automated Tests + | + v +Lint/Static Analysis --> Build --> Push Image --> Update Git + | + v + ArgoCD Auto-Sync + | + v + Health Check Pass + | + v + Deployment Complete + | + v + Metrics/Logs Available in Grafana +``` + +Key ordering constraints: +1. Tests before build (fail fast) +2. ArgoCD watches Git, so Git update triggers deploy +3. Observability stack must be deployed before app for metrics collection + +--- + +## MVP Recommendation for CI/CD and Observability + +For production-grade operations on a personal project, prioritize in this order: + +### Phase 1: GitOps Foundation +1. Enable ArgoCD auto-sync with self-healing +2. Add basic health checks + +*Rationale:* Eliminates manual `helm upgrade`, establishes GitOps workflow + +### Phase 2: Basic Observability +1. Prometheus + Grafana (kube-prometheus-stack helm chart) +2. Loki for log aggregation +3. 3-5 critical alerts (pod crashes, high memory, app down) + +*Rationale:* Can't operate what you can't see; minimum viable observability + +### Phase 3: CI Pipeline Hardening +1. Add unit tests to pipeline +2. Add linting/type checking +3. 
Smoke test after deploy (optional) + +*Rationale:* Tests catch bugs before they reach production + +### Defer to Later (if ever) +- Application-level custom metrics +- SLO dashboards +- Advanced alerting +- Request tracing +- Extensive E2E tests + +--- + +## Complexity Budget + +For a single-user personal project, the total complexity budget should be LOW-MEDIUM: + +| Category | Recommended Complexity | Over-Budget Indicator | +|----------|----------------------|----------------------| +| CI Pipeline | LOW | More than 10 min build time; complex test matrix | +| GitOps | LOW | Multi-environment promotion; complex sync policies | +| Metrics | MEDIUM | Custom exporters; high-cardinality metrics | +| Logging | LOW | Full-text search; complex log parsing | +| Alerting | LOW | More than 10 alerts; complex routing | +| Tracing | SKIP | Any tracing for single-service app | + +--- + +## Essential Alerts for Personal Project + +Based on best practices, these 5 alerts are sufficient for a single-user app: + +| Alert | Condition | Why Critical | +|-------|-----------|--------------| +| **Pod CrashLooping** | restarts > 3 in 15 min | App is failing repeatedly | +| **Pod OOMKilled** | OOM event detected | Memory limits too low or leak | +| **High Memory Usage** | memory > 85% for 5 min | Approaching resource limits | +| **App Unavailable** | probe failures > 3 | Users cannot access app | +| **Disk Running Low** | disk > 80% used | Persistent storage filling up | + +**Key principle:** Alerts should be symptom-based and actionable. If an alert fires and you don't need to do anything, remove it. + +--- + +## Sources + +### CI/CD Best Practices +- [TeamCity CI/CD Guide](https://www.jetbrains.com/teamcity/ci-cd-guide/ci-cd-best-practices/) +- [Spacelift CI/CD Best Practices](https://spacelift.io/blog/ci-cd-best-practices) +- [GitLab CI/CD Best Practices](https://about.gitlab.com/blog/how-to-keep-up-with-ci-cd-best-practices/) +- [AWS CI/CD Best Practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-cicd-litmus/cicd-best-practices.html) + +### Observability +- [Kubernetes Observability Trends 2026](https://www.usdsi.org/data-science-insights/kubernetes-observability-and-monitoring-trends-in-2026) +- [Spectro Cloud: Choosing the Right Monitoring Stack](https://www.spectrocloud.com/blog/choosing-the-right-kubernetes-monitoring-stack) +- [ClickHouse: Mastering Kubernetes Observability](https://clickhouse.com/resources/engineering/mastering-kubernetes-observability-guide) +- [Kubernetes Official Observability Docs](https://kubernetes.io/docs/concepts/cluster-administration/observability/) + +### ArgoCD/GitOps +- [ArgoCD Auto Sync Documentation](https://argo-cd.readthedocs.io/en/stable/user-guide/auto_sync/) +- [ArgoCD Best Practices](https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/) +- [mkdev: ArgoCD Self-Heal and Sync Windows](https://mkdev.me/posts/argo-cd-self-heal-sync-windows-and-diffing) + +### Alerting +- [Sysdig: Alerting on Kubernetes](https://www.sysdig.com/blog/alerting-kubernetes) +- [Groundcover: Kubernetes Alerting](https://www.groundcover.com/kubernetes-monitoring/kubernetes-alerting) +- [Sematext: 10 Must-Have Kubernetes Alerts](https://sematext.com/blog/top-10-must-have-alerts-for-kubernetes/) + +### Logging +- [Plural: Loki vs ELK for Kubernetes](https://www.plural.sh/blog/loki-vs-elk-kubernetes/) +- [Loki vs ELK Comparison](https://alexandre-vazquez.com/loki-vs-elk/) + +### Testing Pyramid +- [CircleCI: Testing 
Pyramid](https://circleci.com/blog/testing-pyramid/) +- [Semaphore: Testing Pyramid](https://semaphore.io/blog/testing-pyramid) +- [AWS: Testing Stages in CI/CD](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/testing-stages-in-continuous-integration-and-continuous-delivery.html) + +### Homelab/Personal Projects +- [Prometheus and Grafana Homelab Setup](https://unixorn.github.io/post/homelab/homelab-setup-prometheus-and-grafana/) +- [Better Stack: Install Prometheus/Grafana with Helm](https://betterstack.com/community/questions/install-prometheus-and-grafana-on-kubernetes-with-helm/) diff --git a/.planning/research/PITFALLS-CICD-OBSERVABILITY.md b/.planning/research/PITFALLS-CICD-OBSERVABILITY.md new file mode 100644 index 0000000..aa692aa --- /dev/null +++ b/.planning/research/PITFALLS-CICD-OBSERVABILITY.md @@ -0,0 +1,633 @@ +# Domain Pitfalls: CI/CD and Observability on k3s + +**Domain:** Adding ArgoCD, Prometheus, Grafana, and Loki to existing k3s cluster +**Context:** TaskPlanner on self-hosted k3s with Gitea, Traefik, Longhorn +**Researched:** 2026-02-03 +**Confidence:** HIGH (verified with official documentation and community issues) + +--- + +## Critical Pitfalls + +Mistakes that cause system instability, data loss, or require significant rework. + +### 1. Gitea Webhook JSON Parsing Failure with ArgoCD + +**What goes wrong:** ArgoCD receives webhooks from Gitea but fails to parse them with error: `json: cannot unmarshal string into Go struct field .repository.created_at of type int64`. This happens because ArgoCD treats Gitea events as GitHub events instead of Gogs events. + +**Why it happens:** Gitea is a fork of Gogs, but ArgoCD's webhook handler expects different field types. The `repository.created_at` field is a string in Gitea/Gogs but ArgoCD expects int64 for GitHub format. + +**Consequences:** +- Webhooks silently fail (ArgoCD logs error but continues) +- Must wait for 3-minute polling interval for changes to sync +- False confidence that instant sync is working + +**Warning signs:** +- ArgoCD server logs show webhook parsing errors +- Application sync doesn't happen immediately after push +- Webhook delivery shows success in Gitea but no ArgoCD response + +**Prevention:** +- Configure webhook with `Gogs` type in Gitea, NOT `Gitea` type +- Test webhook delivery and check ArgoCD server logs: `kubectl logs -n argocd deploy/argocd-server | grep -i webhook` +- Accept 3-minute polling as fallback (webhooks are optional enhancement) + +**Phase to address:** ArgoCD installation phase - verify webhook integration immediately + +**Sources:** +- [ArgoCD Issue #16453 - Forgejo/Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453) +- [ArgoCD Issue #20444 - Gitea support lacking](https://github.com/argoproj/argo-cd/issues/20444) + +--- + +### 2. Loki Disk Full with No Size-Based Retention + +**What goes wrong:** Loki fills the entire disk because retention is only time-based, not size-based. When disk fills, Loki crashes with "no space left on device" and becomes completely non-functional - Grafana cannot even fetch labels. 
+ +**Why it happens:** +- Retention is disabled by default (`compactor.retention-enabled: false`) +- Loki only supports time-based retention (e.g., 7 days), not size-based +- High-volume logging can fill disk before retention period expires + +**Consequences:** +- Complete logging system failure +- May affect other pods sharing the same Longhorn volume +- Recovery requires manual cleanup or volume expansion + +**Warning signs:** +- Steadily increasing PVC usage visible in `kubectl get pvc` +- Loki compactor logs show no deletion activity +- Grafana queries become slow before complete failure + +**Prevention:** +```yaml +# Loki values.yaml +loki: + compactor: + retention_enabled: true + compaction_interval: 10m + retention_delete_delay: 2h + retention_delete_worker_count: 150 + working_directory: /loki/compactor + limits_config: + retention_period: 168h # 7 days - adjust based on disk size +``` + +- Set conservative retention period (start with 7 days) +- Run compactor as StatefulSet with persistent storage for marker files +- Set up Prometheus alert for PVC usage > 80% +- Index period MUST be 24h for retention to work + +**Phase to address:** Loki installation phase - configure retention from day one + +**Sources:** +- [Grafana Loki Retention Documentation](https://grafana.com/docs/loki/latest/operations/storage/retention/) +- [Loki Issue #5242 - Retention not working](https://github.com/grafana/loki/issues/5242) + +--- + +### 3. Prometheus Volume Growth Exceeds Longhorn PVC + +**What goes wrong:** Prometheus metrics storage grows beyond PVC capacity. Longhorn volume expansion via CSI can result in a faulted volume that prevents Prometheus from starting. + +**Why it happens:** +- Default Prometheus retention is 15 days with no size limit +- kube-prometheus-stack defaults don't match k3s resource constraints +- Longhorn CSI volume expansion has known issues requiring specific procedure + +**Consequences:** +- Prometheus pod stuck in pending/crash loop +- Loss of historical metrics +- Longhorn volume in faulted state requiring manual recovery + +**Warning signs:** +- Prometheus pod restarts with OOMKilled or disk errors +- `kubectl describe pvc` shows capacity approaching limit +- Longhorn UI shows volume health degraded + +**Prevention:** +```yaml +# kube-prometheus-stack values +prometheus: + prometheusSpec: + retention: 7d + retentionSize: "8GB" # Set explicit size limit + resources: + requests: + memory: 400Mi + limits: + memory: 600Mi + storageSpec: + volumeClaimTemplate: + spec: + storageClassName: longhorn + resources: + requests: + storage: 10Gi +``` + +- Always set both `retention` AND `retentionSize` +- Size PVC with 20% headroom above retentionSize +- Monitor with `prometheus_tsdb_storage_blocks_bytes` metric +- For expansion: stop pod, detach volume, resize, then restart + +**Phase to address:** Prometheus installation phase + +**Sources:** +- [Longhorn Issue #2222 - Volume expansion faults](https://github.com/longhorn/longhorn/issues/2222) +- [kube-prometheus-stack Issue #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401) + +--- + +### 4. ArgoCD + Traefik TLS Termination Redirect Loop + +**What goes wrong:** ArgoCD UI becomes inaccessible with redirect loops or connection refused errors when accessed through Traefik. Browser shows ERR_TOO_MANY_REDIRECTS. + +**Why it happens:** Traefik terminates TLS and forwards HTTP to ArgoCD. ArgoCD server, configured for TLS by default, responds with 307 redirects to HTTPS, creating infinite loop. 
+ +**Consequences:** +- Cannot access ArgoCD UI via ingress +- CLI may work with port-forward but not through ingress +- gRPC connections for CLI through ingress fail + +**Warning signs:** +- Browser redirect loop when accessing ArgoCD URL +- `curl -v` shows 307 redirect responses +- Works with `kubectl port-forward` but not via ingress + +**Prevention:** +```yaml +# Option 1: ConfigMap (recommended) +apiVersion: v1 +kind: ConfigMap +metadata: + name: argocd-cmd-params-cm + namespace: argocd +data: + server.insecure: "true" + +# Option 2: Traefik IngressRoute for dual HTTP/gRPC +apiVersion: traefik.io/v1alpha1 +kind: IngressRoute +metadata: + name: argocd-server + namespace: argocd +spec: + entryPoints: + - websecure + routes: + - kind: Rule + match: Host(`argocd.example.com`) + priority: 10 + services: + - name: argocd-server + port: 80 + - kind: Rule + match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`) + priority: 11 + services: + - name: argocd-server + port: 80 + scheme: h2c + tls: + certResolver: letsencrypt-prod +``` + +- Set `server.insecure: "true"` in argocd-cmd-params-cm ConfigMap +- Use IngressRoute (not Ingress) for proper gRPC support +- Configure separate routes for HTTP and gRPC with correct priority + +**Phase to address:** ArgoCD installation phase - test immediately after ingress setup + +**Sources:** +- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/) +- [Traefik Community - ArgoCD behind Traefik](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901) + +--- + +## Moderate Pitfalls + +Mistakes that cause delays, debugging sessions, or technical debt. + +### 5. ServiceMonitor Not Discovering Targets + +**What goes wrong:** Prometheus ServiceMonitors are created but no targets appear in Prometheus. The scrape config shows 0/0 targets up. + +**Why it happens:** +- Label selector mismatch between Prometheus CR and ServiceMonitor +- RBAC: Prometheus ServiceAccount lacks permission in target namespace +- Port specified as number instead of name +- ServiceMonitor in different namespace than Prometheus expects + +**Prevention:** +```yaml +# Ensure Prometheus CR has permissive selectors +prometheus: + prometheusSpec: + serviceMonitorSelectorNilUsesHelmValues: false + serviceMonitorSelector: {} # Select all ServiceMonitors + serviceMonitorNamespaceSelector: {} # From all namespaces + +# ServiceMonitor must use port NAME not number +spec: + endpoints: + - port: metrics # NOT 9090 +``` + +- Use port name, never port number in ServiceMonitor +- Check RBAC: `kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus -n default` +- Verify label matching: `kubectl get servicemonitor -A --show-labels` + +**Phase to address:** Prometheus installation phase, verify with test ServiceMonitor + +**Sources:** +- [Prometheus Operator Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html) +- [ServiceMonitor not discovered Issue #3383](https://github.com/prometheus-operator/prometheus-operator/issues/3383) + +--- + +### 6. k3s Control Plane Metrics Not Scraped + +**What goes wrong:** Prometheus dashboards show no metrics for kube-scheduler, kube-controller-manager, or etcd. These panels appear blank or show "No data." + +**Why it happens:** k3s runs control plane components as a single binary, not as pods. 
Standard kube-prometheus-stack expects to scrape pods that don't exist. + +**Prevention:** +```yaml +# kube-prometheus-stack values for k3s +kubeControllerManager: + enabled: true + endpoints: + - 192.168.1.100 # k3s server IP + service: + enabled: true + port: 10257 + targetPort: 10257 +kubeScheduler: + enabled: true + endpoints: + - 192.168.1.100 + service: + enabled: true + port: 10259 + targetPort: 10259 +kubeEtcd: + enabled: false # k3s uses embedded sqlite/etcd +``` + +- Explicitly configure control plane endpoints with k3s server IPs +- Disable etcd monitoring if using embedded database +- OR disable these components entirely for simpler setup + +**Phase to address:** Prometheus installation phase + +**Sources:** +- [Prometheus for Rancher K3s Control Plane Monitoring](https://www.spectrocloud.com/blog/enabling-rancher-k3s-cluster-control-plane-monitoring-with-prometheus) + +--- + +### 7. Promtail Not Sending Logs to Loki + +**What goes wrong:** Promtail pods are running but no logs appear in Grafana/Loki. Queries return empty results. + +**Why it happens:** +- Promtail started before Loki was ready +- Log path configuration doesn't match k3s container runtime paths +- Label selectors don't match actual pod labels +- Network policy blocking Promtail -> Loki communication + +**Warning signs:** +- Promtail logs show "dropping target, no labels" or connection errors +- `kubectl logs -n monitoring promtail-xxx` shows retries +- Loki data source health check passes but queries return nothing + +**Prevention:** +```yaml +# Verify k3s containerd log paths +promtail: + config: + snippets: + scrapeConfigs: | + - job_name: kubernetes-pods + kubernetes_sd_configs: + - role: pod + pipeline_stages: + - cri: {} + relabel_configs: + - source_labels: [__meta_kubernetes_pod_node_name] + target_label: node +``` + +- Delete Promtail positions file to force re-read: `kubectl exec -n monitoring promtail-xxx -- rm /tmp/positions.yaml` +- Ensure Loki is healthy before Promtail starts (use init container or sync wave) +- Verify log paths match containerd: `/var/log/pods/*/*/*.log` + +**Phase to address:** Loki installation phase + +**Sources:** +- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/) + +--- + +### 8. ArgoCD Self-Management Bootstrap Chicken-Egg + +**What goes wrong:** Attempting to have ArgoCD manage itself creates confusion about what's managing what. Initial mistakes in the ArgoCD Application manifest can lock you out. + +**Why it happens:** GitOps can't install ArgoCD if ArgoCD isn't present. After bootstrap, changing ArgoCD's self-managing Application incorrectly can break the cluster. 
+ +**Prevention:** +```yaml +# Phase 1: Install ArgoCD manually (kubectl apply or helm) +# Phase 2: Create self-management Application +apiVersion: argoproj.io/v1alpha1 +kind: Application +metadata: + name: argocd + namespace: argocd +spec: + project: default + source: + repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git + path: argocd + targetRevision: HEAD + destination: + server: https://kubernetes.default.svc + namespace: argocd + syncPolicy: + automated: + prune: false # CRITICAL: Don't auto-prune ArgoCD components + selfHeal: true +``` + +- Always bootstrap ArgoCD manually first (Helm or kubectl) +- Set `prune: false` for ArgoCD's self-management Application +- Use App of Apps pattern for managed applications +- Keep a local backup of ArgoCD Application manifest + +**Phase to address:** ArgoCD installation phase - plan bootstrap strategy upfront + +**Sources:** +- [Bootstrapping ArgoCD - Windsock.io](https://windsock.io/bootstrapping-argocd/) +- [Demystifying GitOps - Bootstrapping ArgoCD](https://medium.com/@aaltundemir/demystifying-gitops-bootstrapping-argo-cd-4a861284f273) + +--- + +### 9. Sync Waves Misuse Creating False Dependencies + +**What goes wrong:** Over-engineering sync waves creates unnecessary sequential deployments, increasing deployment time and complexity. Or under-engineering leads to race conditions. + +**Why it happens:** +- Developers add waves "just in case" +- Misunderstanding that waves are within single Application only +- Not knowing default wave is 0 and waves can be negative + +**Prevention:** +```yaml +# Use waves sparingly - only for true dependencies +# Database must exist before app +metadata: + annotations: + argocd.argoproj.io/sync-wave: "-1" # First + +# App deployment +metadata: + annotations: + argocd.argoproj.io/sync-wave: "0" # Default, after database + +# Don't create unnecessary chains like: +# ConfigMap (wave -3) -> Secret (wave -2) -> Service (wave -1) -> Deployment (wave 0) +# These have no real dependency and should all be wave 0 +``` + +- Use waves only for actual dependencies (database before app, CRD before CR) +- Keep wave structure as flat as possible +- Sync waves do NOT work across different ArgoCD Applications +- For cross-Application dependencies, use ApplicationSets with Progressive Syncs + +**Phase to address:** Application configuration phase + +**Sources:** +- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/) + +--- + +## Minor Pitfalls + +Annoyances that are easily fixed but waste time if not known. + +### 10. Grafana Default Password Not Changed + +**What goes wrong:** Using default `admin/prom-operator` credentials in production exposes the monitoring stack. + +**Prevention:** +```yaml +# kube-prometheus-stack values +grafana: + adminPassword: "${GRAFANA_ADMIN_PASSWORD}" # From secret + # Or use existing secret + admin: + existingSecret: grafana-admin-credentials + userKey: admin-user + passwordKey: admin-password +``` + +**Phase to address:** Grafana installation phase + +--- + +### 11. Missing open-iscsi for Longhorn + +**What goes wrong:** Longhorn volumes fail to attach with cryptic errors. + +**Why it happens:** Longhorn requires `open-iscsi` on all nodes, which isn't installed by default on many Linux distributions. 
+ +**Prevention:** +```bash +# On each node before Longhorn installation +sudo apt-get install -y open-iscsi +sudo systemctl enable iscsid +sudo systemctl start iscsid +``` + +**Phase to address:** Pre-installation prerequisites check + +**Sources:** +- [Longhorn Prerequisites](https://longhorn.io/docs/latest/deploy/install/#installation-requirements) + +--- + +### 12. ClusterIP Services Not Accessible + +**What goes wrong:** After installing monitoring stack, Grafana/Prometheus aren't accessible externally. + +**Why it happens:** k3s defaults to ClusterIP for services. Single-node setups need explicit ingress or LoadBalancer configuration. + +**Prevention:** +```yaml +# kube-prometheus-stack values +grafana: + ingress: + enabled: true + ingressClassName: traefik + hosts: + - grafana.kube2.tricnet.de + tls: + - secretName: grafana-tls + hosts: + - grafana.kube2.tricnet.de + annotations: + cert-manager.io/cluster-issuer: letsencrypt-prod +``` + +**Phase to address:** Installation phase - configure ingress alongside deployment + +--- + +### 13. Traefik v3 Breaking Changes for ArgoCD IngressRoute + +**What goes wrong:** ArgoCD IngressRoute with gRPC support stops working after Traefik upgrade to v3. + +**Why it happens:** Traefik v3 changed header matcher syntax from `Headers()` to `Header()`. + +**Prevention:** +```yaml +# Traefik v2 (OLD - broken in v3) +match: Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`) + +# Traefik v3 (NEW) +match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`) +``` + +- Check Traefik version before applying IngressRoutes +- Test gRPC route after any Traefik upgrade + +**Phase to address:** ArgoCD installation phase + +**Sources:** +- [ArgoCD Issue #15534 - Traefik v3 docs](https://github.com/argoproj/argo-cd/issues/15534) + +--- + +### 14. k3s Resource Exhaustion with Full Monitoring Stack + +**What goes wrong:** Single-node k3s cluster becomes unresponsive after deploying full kube-prometheus-stack. 
+ +**Why it happens:** +- kube-prometheus-stack deploys many components (prometheus, alertmanager, grafana, node-exporter, kube-state-metrics) +- Default resource requests/limits are sized for larger clusters +- k3s server process itself needs ~500MB RAM + +**Warning signs:** +- Pods stuck in Pending +- OOMKilled events +- Node NotReady status + +**Prevention:** +```yaml +# Minimal kube-prometheus-stack for single-node +alertmanager: + enabled: false # Disable if not using alerts +prometheus: + prometheusSpec: + resources: + requests: + memory: 256Mi + cpu: 100m + limits: + memory: 512Mi +grafana: + resources: + requests: + memory: 128Mi + cpu: 50m + limits: + memory: 256Mi +``` + +- Disable unnecessary components (alertmanager if no alerts configured) +- Set explicit resource limits lower than defaults +- Monitor cluster resources: `kubectl top nodes` +- Consider: 4GB RAM minimum for k3s + monitoring + workloads + +**Phase to address:** Prometheus installation phase - right-size from start + +**Sources:** +- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling) + +--- + +## Phase-Specific Warning Summary + +| Phase | Likely Pitfall | Mitigation | +|-------|---------------|------------| +| Prerequisites | #11 Missing open-iscsi | Pre-flight check script | +| ArgoCD Installation | #4 TLS redirect loop, #8 Bootstrap | Test ingress immediately, plan bootstrap | +| ArgoCD + Gitea Integration | #1 Webhook parsing | Use Gogs webhook type, accept polling fallback | +| Prometheus Installation | #3 Volume growth, #5 ServiceMonitor, #6 Control plane, #14 Resources | Configure retention+size, verify RBAC, right-size | +| Loki Installation | #2 Disk full, #7 Promtail | Enable retention day one, verify log paths | +| Grafana Installation | #10 Default password, #12 ClusterIP | Set password, configure ingress | +| Application Configuration | #9 Sync waves | Use sparingly, only for real dependencies | + +--- + +## Pre-Installation Checklist + +Before starting installation, verify: + +- [ ] open-iscsi installed on all nodes +- [ ] Longhorn healthy with available storage (check `kubectl get nodes` and Longhorn UI) +- [ ] Traefik version known (v2 vs v3 affects IngressRoute syntax) +- [ ] DNS entries configured for monitoring subdomains +- [ ] Gitea webhook type decision (use Gogs type, or accept polling fallback) +- [ ] Disk space planning: Loki retention + Prometheus retention + headroom +- [ ] Memory planning: k3s (~500MB) + monitoring (~1GB) + workloads +- [ ] Namespace strategy decided (monitoring namespace vs default) + +--- + +## Existing Infrastructure Compatibility Notes + +Based on the existing TaskPlanner setup: + +**Traefik:** Already in use with cert-manager (letsencrypt-prod). New services should follow same pattern: +```yaml +annotations: + cert-manager.io/cluster-issuer: letsencrypt-prod +``` + +**Longhorn:** Already the storage class. New PVCs should use explicit `storageClassName: longhorn` and consider replica count for single-node (set to 1). + +**Gitea:** Repository already configured at `git.kube2.tricnet.de`. ArgoCD Application already exists in `argocd/application.yaml` - don't duplicate. + +**Existing ArgoCD Application:** TaskPlanner is already configured with ArgoCD. The monitoring stack should be a separate Application, not added to the existing one. 
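To make that split concrete, a separate Application for the monitoring stack could look like the sketch below. The `repoURL` and `path` are assumptions (mirroring the bootstrap example in pitfall #8, not a confirmed repo layout), and `ServerSideApply=true` is suggested because the kube-prometheus-stack CRDs are too large for client-side apply's last-applied-configuration annotation.

```yaml
# Hypothetical monitoring Application - adjust repoURL/path to the real infrastructure repo
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git  # assumed, as in pitfall #8
    path: monitoring                                              # assumed path for the stack's values/manifests
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true  # Prometheus CRDs exceed the client-side apply annotation size limit
```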
+ +--- + +## Sources Summary + +### Official Documentation +- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/) +- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/) +- [Grafana Loki Retention](https://grafana.com/docs/loki/latest/operations/storage/retention/) +- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/) +- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling) + +### Community Issues (Verified Problems) +- [ArgoCD #16453 - Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453) +- [ArgoCD #20444 - Gitea support](https://github.com/argoproj/argo-cd/issues/20444) +- [Loki #5242 - Retention not working](https://github.com/grafana/loki/issues/5242) +- [Longhorn #2222 - Volume expansion](https://github.com/longhorn/longhorn/issues/2222) +- [kube-prometheus-stack #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401) +- [Prometheus Operator #3383 - ServiceMonitor discovery](https://github.com/prometheus-operator/prometheus-operator/issues/3383) + +### Tutorials and Guides +- [K3S Rocks - ArgoCD](https://k3s.rocks/argocd/) +- [K3S Rocks - Logging](https://k3s.rocks/logging/) +- [Bootstrapping ArgoCD](https://windsock.io/bootstrapping-argocd/) +- [Prometheus ServiceMonitor Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html) +- [Traefik Community - ArgoCD](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901) + +--- +*Pitfalls research for: CI/CD and Observability on k3s* +*Context: Adding to existing TaskPlanner deployment* +*Researched: 2026-02-03* diff --git a/.planning/research/STACK-v2-cicd-observability.md b/.planning/research/STACK-v2-cicd-observability.md new file mode 100644 index 0000000..6922f7c --- /dev/null +++ b/.planning/research/STACK-v2-cicd-observability.md @@ -0,0 +1,583 @@ +# Technology Stack: CI/CD Testing, ArgoCD GitOps, and Observability + +**Project:** TaskPlanner v2.0 Production Operations +**Researched:** 2026-02-03 +**Scope:** Stack additions for existing k3s-deployed SvelteKit app + +## Executive Summary + +This research covers three areas: (1) adding tests to the existing Gitea Actions pipeline, (2) ArgoCD for GitOps deployment automation, and (3) Prometheus/Grafana/Loki observability. The existing setup already has ArgoCD configured; research focuses on validating that configuration and adding the observability stack. + +**Key finding:** Promtail is EOL on 2026-03-02. Use Grafana Alloy instead for log collection. + +--- + +## 1. CI/CD Testing Stack + +### Recommended Stack + +| Component | Version | Purpose | Rationale | +|-----------|---------|---------|-----------| +| Playwright | ^1.58.1 (existing) | E2E testing | Already configured, comprehensive browser automation | +| Vitest | ^3.0.0 | Unit/component tests | Official Svelte recommendation for Vite-based projects | +| @testing-library/svelte | ^5.0.0 | Component testing utilities | Streamlined component assertions | +| mcr.microsoft.com/playwright | v1.58.1 | CI browser execution | Pre-installed browsers, eliminates install step | + +### Why This Stack + +**Playwright (keep existing):** Already configured with `playwright.config.ts` and `tests/docker-deployment.spec.ts`. 
The existing tests cover critical paths: health endpoint, CSRF-protected form submissions, and data persistence. Extend rather than replace. + +**Vitest (add):** Svelte officially recommends Vitest for unit and component testing when using Vite (which SvelteKit uses). Vitest shares Vite's config, eliminating configuration overhead. Jest muscle memory transfers directly. + +**NOT recommended:** +- Jest: Requires separate configuration, slower than Vitest, no Vite integration +- Cypress: Overlaps with Playwright; adding both creates maintenance burden +- @vitest/browser with Playwright: Adds complexity; save for later if jsdom proves insufficient + +### Gitea Actions Workflow Updates + +The existing workflow at `.gitea/workflows/build.yaml` needs a test stage. Gitea Actions uses GitHub Actions syntax. + +**Recommended workflow structure:** + +```yaml +name: Build and Push + +on: + push: + branches: [master, main] + pull_request: + branches: [master, main] + +env: + REGISTRY: git.kube2.tricnet.de + IMAGE_NAME: tho/taskplaner + +jobs: + test: + runs-on: ubuntu-latest + container: + image: mcr.microsoft.com/playwright:v1.58.1-noble + steps: + - uses: actions/checkout@v4 + + - name: Install dependencies + run: npm ci + + - name: Run type check + run: npm run check + + - name: Run unit tests + run: npm run test:unit + + - name: Run E2E tests + run: npm run test:e2e + env: + CI: true + + build: + needs: test + runs-on: ubuntu-latest + if: github.event_name != 'pull_request' + steps: + # ... existing build steps ... +``` + +**Key decisions:** +- Use Playwright Docker image to avoid browser installation (saves 2-3 minutes) +- Run tests before build to fail fast +- Only build/push on push to master, not PRs +- Type checking (`svelte-check`) catches errors before runtime + +### Package.json Scripts to Add + +```json +{ + "scripts": { + "test": "npm run test:unit && npm run test:e2e", + "test:unit": "vitest run", + "test:unit:watch": "vitest", + "test:e2e": "playwright test", + "test:e2e:docker": "BASE_URL=http://localhost:3000 playwright test tests/docker-deployment.spec.ts" + } +} +``` + +### Installation + +```bash +# Add Vitest and testing utilities +npm install -D vitest @testing-library/svelte jsdom +``` + +### Vitest Configuration + +Create `vitest.config.ts`: + +```typescript +import { defineConfig } from 'vitest/config'; +import { sveltekit } from '@sveltejs/kit/vite'; + +export default defineConfig({ + plugins: [sveltekit()], + test: { + include: ['src/**/*.{test,spec}.{js,ts}'], + environment: 'jsdom', + globals: true, + setupFiles: ['./src/test-setup.ts'] + } +}); +``` + +### Confidence: HIGH + +Sources: +- [Svelte Testing Documentation](https://svelte.dev/docs/svelte/testing) - Official recommendation for Vitest +- [Playwright CI Setup](https://playwright.dev/docs/ci-intro) - Docker image and CI best practices +- Existing `playwright.config.ts` in project + +--- + +## 2. ArgoCD GitOps Stack + +### Current State + +ArgoCD is already configured in `argocd/application.yaml`. The configuration is correct and follows best practices: + +```yaml +syncPolicy: + automated: + prune: true # Removes resources deleted from Git + selfHeal: true # Reverts manual changes +``` + +### Recommended Stack + +| Component | Version | Purpose | Rationale | +|-----------|---------|---------|-----------| +| ArgoCD Helm Chart | 9.4.0 | GitOps controller | Latest stable, deploys ArgoCD v3.3.0 | + +### What's Already Done (No Changes Needed) + +1. 
**Application manifest:** `argocd/application.yaml` correctly points to `helm/taskplaner` +2. **Auto-sync enabled:** `automated.prune` and `selfHeal` are configured +3. **Git-based image tags:** Pipeline updates `values.yaml` with new image tag +4. **Namespace creation:** `CreateNamespace=true` is set + +### What May Need Verification + +1. **ArgoCD installation:** Verify ArgoCD is actually deployed on the k3s cluster +2. **Repository credentials:** If the Gitea repo is private, ArgoCD needs credentials +3. **Registry secret:** The `gitea-registry-secret` placeholder needs real credentials + +### Installation (if ArgoCD not yet installed) + +```bash +# Add ArgoCD Helm repository +helm repo add argo https://argoproj.github.io/argo-helm +helm repo update + +# Install ArgoCD (minimal for single-node k3s) +helm install argocd argo/argo-cd \ + --namespace argocd \ + --create-namespace \ + --set server.service.type=ClusterIP \ + --set configs.params.server\.insecure=true # If behind Traefik TLS termination +``` + +### Apply Application + +```bash +kubectl apply -f argocd/application.yaml +``` + +### NOT Recommended + +- **ArgoCD Image Updater:** Overkill for single-app deployment; the current approach of updating values.yaml in Git is simpler and provides better audit trail +- **ApplicationSets:** Unnecessary for single environment +- **App of Apps pattern:** Unnecessary complexity for one application + +### Confidence: HIGH + +Sources: +- [ArgoCD Helm Chart on Artifact Hub](https://artifacthub.io/packages/helm/argo/argo-cd) - Version 9.4.0 confirmed +- [ArgoCD Helm GitHub Releases](https://github.com/argoproj/argo-helm/releases) - Release notes +- Existing `argocd/application.yaml` in project + +--- + +## 3. Observability Stack + +### Recommended Stack + +| Component | Chart | Version | Purpose | +|-----------|-------|---------|---------| +| kube-prometheus-stack | prometheus-community/kube-prometheus-stack | 81.4.2 | Prometheus + Grafana + Alertmanager | +| Loki | grafana/loki | 6.51.0 | Log aggregation (monolithic mode) | +| Grafana Alloy | grafana/alloy | 1.5.3 | Log collection agent | + +### Why This Stack + +**kube-prometheus-stack (not standalone Prometheus):** Single chart deploys Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics. Pre-configured with Kubernetes dashboards. This is the standard approach. + +**Loki (not ELK/Elasticsearch):** "Like Prometheus, but for logs." Integrates natively with Grafana. Much lower resource footprint than Elasticsearch. Uses same label-based querying as Prometheus. + +**Grafana Alloy (not Promtail):** CRITICAL - Promtail reaches End-of-Life on 2026-03-02 (next month). Grafana Alloy is the official replacement. It's based on OpenTelemetry Collector and supports logs, metrics, and traces in one agent. + +### NOT Recommended + +- **Promtail:** EOL 2026-03-02. 
Do not install; use Alloy +- **loki-stack Helm chart:** Deprecated, no longer maintained +- **Elasticsearch/ELK:** Resource-heavy, complex, overkill for single-user app +- **Loki microservices mode:** Requires 3+ nodes, object storage; overkill for personal app +- **Separate Prometheus + Grafana charts:** kube-prometheus-stack bundles them correctly + +### Architecture + +``` + +------------------+ + | Grafana | + | (Dashboards/UI) | + +--------+---------+ + | + +--------------------+--------------------+ + | | + +--------v---------+ +----------v---------+ + | Prometheus | | Loki | + | (Metrics) | | (Logs) | + +--------+---------+ +----------+---------+ + | | + +--------------+---------------+ | + | | | | + +-----v-----+ +-----v-----+ +------v------+ +--------v---------+ + | node- | | kube- | | TaskPlanner | | Grafana Alloy | + | exporter | | state- | | /metrics | | (Log Shipper) | + | | | metrics | | | | | + +-----------+ +-----------+ +-------------+ +------------------+ +``` + +### Installation + +```bash +# Add Helm repositories +helm repo add prometheus-community https://prometheus-community.github.io/helm-charts +helm repo add grafana https://grafana.github.io/helm-charts +helm repo update + +# Create monitoring namespace +kubectl create namespace monitoring + +# Install kube-prometheus-stack +helm install prometheus prometheus-community/kube-prometheus-stack \ + --namespace monitoring \ + --values prometheus-values.yaml + +# Install Loki (monolithic mode for single-node) +helm install loki grafana/loki \ + --namespace monitoring \ + --values loki-values.yaml + +# Install Alloy for log collection +helm install alloy grafana/alloy \ + --namespace monitoring \ + --values alloy-values.yaml +``` + +### Recommended Values Files + +#### prometheus-values.yaml (minimal for k3s single-node) + +```yaml +# Reduce resource usage for single-node k3s +prometheus: + prometheusSpec: + retention: 15d + resources: + requests: + cpu: 200m + memory: 512Mi + limits: + cpu: 1000m + memory: 2Gi + storageSpec: + volumeClaimTemplate: + spec: + storageClassName: longhorn # Use existing Longhorn + accessModes: ["ReadWriteOnce"] + resources: + requests: + storage: 20Gi + +alertmanager: + alertmanagerSpec: + resources: + requests: + cpu: 50m + memory: 64Mi + limits: + cpu: 200m + memory: 256Mi + storage: + volumeClaimTemplate: + spec: + storageClassName: longhorn + accessModes: ["ReadWriteOnce"] + resources: + requests: + storage: 5Gi + +grafana: + persistence: + enabled: true + storageClassName: longhorn + size: 5Gi + # Grafana will be exposed via Traefik + ingress: + enabled: true + ingressClassName: traefik + annotations: + cert-manager.io/cluster-issuer: letsencrypt-prod + hosts: + - grafana.kube2.tricnet.de + tls: + - secretName: grafana-tls + hosts: + - grafana.kube2.tricnet.de + +# Disable components not needed for single-node +kubeControllerManager: + enabled: false # k3s bundles this differently +kubeScheduler: + enabled: false # k3s bundles this differently +kubeProxy: + enabled: false # k3s uses different proxy +``` + +#### loki-values.yaml (monolithic mode) + +```yaml +deploymentMode: SingleBinary + +loki: + auth_enabled: false + commonConfig: + replication_factor: 1 + storage: + type: filesystem + schemaConfig: + configs: + - from: "2024-01-01" + store: tsdb + object_store: filesystem + schema: v13 + index: + prefix: loki_index_ + period: 24h + +singleBinary: + replicas: 1 + resources: + requests: + cpu: 100m + memory: 256Mi + limits: + cpu: 500m + memory: 1Gi + persistence: + enabled: true + 
storageClass: longhorn + size: 10Gi + +# Disable components not needed for monolithic +backend: + replicas: 0 +read: + replicas: 0 +write: + replicas: 0 + +# Gateway not needed for internal access +gateway: + enabled: false +``` + +#### alloy-values.yaml + +```yaml +alloy: + configMap: + content: |- + // Discover and collect logs from all pods + discovery.kubernetes "pods" { + role = "pod" + } + + discovery.relabel "pods" { + targets = discovery.kubernetes.pods.targets + + rule { + source_labels = ["__meta_kubernetes_namespace"] + target_label = "namespace" + } + rule { + source_labels = ["__meta_kubernetes_pod_name"] + target_label = "pod" + } + rule { + source_labels = ["__meta_kubernetes_pod_container_name"] + target_label = "container" + } + } + + loki.source.kubernetes "pods" { + targets = discovery.relabel.pods.output + forward_to = [loki.write.local.receiver] + } + + loki.write "local" { + endpoint { + url = "http://loki.monitoring.svc:3100/loki/api/v1/push" + } + } + +controller: + type: daemonset + +resources: + requests: + cpu: 50m + memory: 64Mi + limits: + cpu: 200m + memory: 256Mi +``` + +### TaskPlanner Metrics Endpoint + +The app needs a `/metrics` endpoint for Prometheus to scrape. SvelteKit options: + +1. **prom-client library** (recommended): Standard Prometheus client for Node.js +2. **Custom endpoint**: Simple counter/gauge implementation + +Add to `package.json`: +```bash +npm install prom-client +``` + +Add ServiceMonitor for Prometheus to scrape TaskPlanner: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: taskplaner + namespace: monitoring + labels: + release: prometheus # Must match Prometheus selector +spec: + selector: + matchLabels: + app.kubernetes.io/name: taskplaner + namespaceSelector: + matchNames: + - default + endpoints: + - port: http + path: /metrics + interval: 30s +``` + +### Resource Summary + +Total additional resource requirements for observability: + +| Component | CPU Request | Memory Request | Storage | +|-----------|-------------|----------------|---------| +| Prometheus | 200m | 512Mi | 20Gi | +| Alertmanager | 50m | 64Mi | 5Gi | +| Grafana | 100m | 128Mi | 5Gi | +| Loki | 100m | 256Mi | 10Gi | +| Alloy (per node) | 50m | 64Mi | - | +| **Total** | ~500m | ~1Gi | 40Gi | + +This fits comfortably on a single k3s node with 4+ cores and 8GB+ RAM. 
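+
+To make the `/metrics` endpoint and the ServiceMonitor above concrete, here is a minimal prom-client sketch for a SvelteKit server route. The route path, metric name, and counter are illustrative assumptions, not existing project code:
+
+```typescript
+// src/routes/metrics/+server.ts -- hypothetical location for the scrape endpoint
+import type { RequestHandler } from '@sveltejs/kit';
+import { Registry, collectDefaultMetrics, Counter } from 'prom-client';
+
+// Dedicated registry; also collects default Node.js metrics (heap, event loop lag, GC)
+const register = new Registry();
+collectDefaultMetrics({ register });
+
+// Illustrative counter; real app counters would live in a shared $lib/server module
+// so that route handlers and hooks can increment them
+const metricsScrapes = new Counter({
+  name: 'taskplaner_metrics_scrapes_total',
+  help: 'Number of times the /metrics endpoint has been scraped',
+  registers: [register]
+});
+
+export const GET: RequestHandler = async () => {
+  metricsScrapes.inc();
+  return new Response(await register.metrics(), {
+    headers: { 'Content-Type': register.contentType }
+  });
+};
+```
+
+The endpoint returns the Prometheus text exposition format, which Prometheus scrapes via the ServiceMonitor above (port `http`, every 30s).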
+ +### Confidence: HIGH + +Sources: +- [kube-prometheus-stack on Artifact Hub](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) - Version 81.4.2 +- [Grafana Loki Helm Installation](https://grafana.com/docs/loki/latest/setup/install/helm/) - Monolithic mode guidance +- [Grafana Alloy Kubernetes Deployment](https://grafana.com/docs/alloy/latest/set-up/install/kubernetes/) - Alloy setup +- [Promtail Deprecation Notice](https://grafana.com/docs/loki/latest/send-data/promtail/installation/) - EOL 2026-03-02 +- [Migrate from Promtail to Alloy](https://grafana.com/docs/alloy/latest/set-up/migrate/from-promtail/) - Migration guide + +--- + +## Summary: What to Install + +### Immediate Actions + +| Category | Add | Version | Notes | +|----------|-----|---------|-------| +| Testing | vitest | ^3.0.0 | Unit tests | +| Testing | @testing-library/svelte | ^5.0.0 | Component testing | +| Metrics | prom-client | ^15.0.0 | Prometheus metrics from app | + +### Helm Charts to Deploy + +| Chart | Repository | Version | Namespace | +|-------|------------|---------|-----------| +| kube-prometheus-stack | prometheus-community | 81.4.2 | monitoring | +| loki | grafana | 6.51.0 | monitoring | +| alloy | grafana | 1.5.3 | monitoring | + +### Already Configured (Verify, Don't Re-install) + +| Component | Status | Action | +|-----------|--------|--------| +| ArgoCD Application | Configured in `argocd/application.yaml` | Verify ArgoCD is running | +| Playwright | Configured in `playwright.config.ts` | Keep, extend tests | + +### Do NOT Install + +| Component | Reason | +|-----------|--------| +| Promtail | EOL 2026-03-02, use Alloy instead | +| loki-stack chart | Deprecated, unmaintained | +| Elasticsearch/ELK | Overkill, resource-heavy | +| Jest | Vitest is better for Vite projects | +| ArgoCD Image Updater | Current Git-based approach is simpler | + +--- + +## Helm Repository Commands + +```bash +# Add all needed repositories +helm repo add prometheus-community https://prometheus-community.github.io/helm-charts +helm repo add grafana https://grafana.github.io/helm-charts +helm repo add argo https://argoproj.github.io/argo-helm +helm repo update + +# Verify +helm search repo prometheus-community/kube-prometheus-stack +helm search repo grafana/loki +helm search repo grafana/alloy +helm search repo argo/argo-cd +``` + +--- + +## Sources + +### Official Documentation +- [Svelte Testing](https://svelte.dev/docs/svelte/testing) +- [Playwright CI Setup](https://playwright.dev/docs/ci-intro) +- [ArgoCD Helm Chart](https://artifacthub.io/packages/helm/argo/argo-cd) +- [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) +- [Grafana Loki Helm](https://grafana.com/docs/loki/latest/setup/install/helm/) +- [Grafana Alloy](https://grafana.com/docs/alloy/latest/set-up/install/kubernetes/) + +### Critical Updates +- [Promtail EOL Notice](https://grafana.com/docs/loki/latest/send-data/promtail/installation/) - EOL 2026-03-02 +- [Promtail to Alloy Migration](https://grafana.com/docs/alloy/latest/set-up/migrate/from-promtail/) diff --git a/.planning/research/SUMMARY-v2-cicd-observability.md b/.planning/research/SUMMARY-v2-cicd-observability.md new file mode 100644 index 0000000..ef690ec --- /dev/null +++ b/.planning/research/SUMMARY-v2-cicd-observability.md @@ -0,0 +1,328 @@ +# Project Research Summary: v2.0 CI/CD and Observability + +**Project:** TaskPlanner v2.0 Production Operations +**Domain:** CI/CD Testing, GitOps Deployment, and Kubernetes 
Observability +**Researched:** 2026-02-03 +**Confidence:** HIGH + +## Executive Summary + +This research covers production-readiness improvements for a self-hosted SvelteKit task management application running on k3s. The milestone adds three capabilities: (1) automated testing in the existing Gitea Actions pipeline, (2) ArgoCD-based GitOps deployment automation, and (3) a complete observability stack (Prometheus, Grafana, Loki). The infrastructure foundation already exists—k3s cluster, Gitea with Actions, Traefik ingress, Longhorn storage, and a defined ArgoCD Application manifest. + +**Recommended approach:** Implement in three phases prioritizing operational foundation first. Phase 1 enables GitOps automation (ArgoCD), Phase 2 establishes observability (kube-prometheus-stack + Loki/Alloy), and Phase 3 hardens the CI pipeline with comprehensive testing. This ordering delivers immediate value (hands-off deployments) before adding observability, then solidifies quality gates last. The stack is standard for self-hosted k3s: ArgoCD for GitOps, kube-prometheus-stack for metrics/dashboards, Loki in monolithic mode for logs, and Grafana Alloy for log collection (Promtail is EOL March 2026). + +**Key risks:** (1) ArgoCD + Traefik TLS termination requires `server.insecure: true` or redirect loops occur, (2) Loki disk exhaustion without retention configuration (filesystem storage has no size limits), (3) k3s control plane metrics need explicit endpoint configuration, and (4) Gitea webhooks fail JSON parsing with ArgoCD (use polling or accept webhook limitations). All risks have documented mitigations from production k3s deployments. + +## Key Findings + +### Recommended Stack + +**GitOps:** ArgoCD is already configured in `argocd/application.yaml` with correct auto-sync and self-heal policies. The Application manifest exists but ArgoCD server installation is needed. Gitea webhooks to ArgoCD have known JSON parsing issues (Gitea uses Gogs format but ArgoCD expects GitHub); fallback to 3-minute polling is acceptable for single-user workload. ArgoCD Image Updater is unnecessary—the existing pattern of updating `values.yaml` in Git provides better audit trails. + +**Observability:** The standard k3s stack is kube-prometheus-stack (single Helm chart bundling Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics), Loki in monolithic SingleBinary mode for logs, and Grafana Alloy for log collection. CRITICAL: Promtail reaches End-of-Life on 2026-03-02 (next month)—use Alloy instead. Loki's monolithic mode uses filesystem storage, appropriate for single-node deployments under 100GB/day log volume. k3s requires explicit configuration to expose control plane metrics (scheduler, controller-manager bind to localhost by default). + +**Testing:** Playwright is already configured with E2E tests in `tests/docker-deployment.spec.ts`. Add Vitest for unit/component testing (official Svelte recommendation for Vite-based projects). Use the Playwright Docker image (`mcr.microsoft.com/playwright:v1.58.1-noble`) in Gitea Actions to avoid 2-3 minute browser installation overhead. Run tests before build to fail fast. 
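+
+To illustrate the Vitest addition, a minimal unit test sketch follows; the `isOverdue` helper is a hypothetical stand-in for application logic, not existing code:
+
+```typescript
+// src/lib/tasks.test.ts -- illustrative only; a real test would import the helper from $lib
+import { describe, expect, it } from 'vitest';
+
+// Stand-in helper representing the kind of pure logic unit tests should target
+function isOverdue(due: Date, now: Date): boolean {
+  return due.getTime() < now.getTime();
+}
+
+describe('isOverdue', () => {
+  it('flags tasks whose due date has passed', () => {
+    expect(isOverdue(new Date('2026-01-01'), new Date('2026-02-03'))).toBe(true);
+  });
+
+  it('does not flag tasks due in the future', () => {
+    expect(isOverdue(new Date('2026-03-01'), new Date('2026-02-03'))).toBe(false);
+  });
+});
+```
+
+Tests like this run in jsdom via the shared Vite config and form the unit-test base of the pyramid described under Phase 3.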
+ +**Core technologies:** +- **ArgoCD 3.3.0** (via Helm chart 9.4.0): GitOps deployment automation — already configured, needs installation +- **kube-prometheus-stack 81.4.2**: Bundled Prometheus + Grafana + Alertmanager — standard k3s observability stack +- **Loki 6.51.0** (monolithic mode): Log aggregation — lightweight, label-based like Prometheus +- **Grafana Alloy 1.5.3**: Log collection agent — Promtail replacement (EOL March 2026) +- **Vitest 3.0**: Unit/component tests — official Svelte recommendation, shares Vite config +- **Playwright 1.58.1**: E2E testing — already in use, comprehensive browser automation + +### Expected Features + +**Must have (table stakes):** +- **Automated tests in CI pipeline** — without tests, pipeline is just a build script; fail fast before deployment +- **GitOps auto-sync** — manual `helm upgrade` defeats CI/CD purpose; Git is single source of truth +- **Self-healing deployments** — ArgoCD reverts manual changes to maintain Git state +- **Basic metrics collection** — Prometheus scraping cluster and app metrics for visibility +- **Metrics visualization** — Grafana dashboards; metrics without visualization are useless +- **Log aggregation** — Loki centralized logging; no more `kubectl logs` per pod +- **Basic alerting** — 3-5 critical alerts (pod crashes, OOM, app down, disk full) + +**Should have (differentiators):** +- **Application-level metrics** — custom Prometheus metrics in TaskPlanner (`/metrics` endpoint) +- **Gitea webhook integration** — reduces sync delay from 3min to seconds (accept limitations) +- **Smoke tests on deploy** — verify deployment health after ArgoCD sync +- **k3s control plane monitoring** — scheduler, controller-manager metrics in dashboards +- **Traefik metrics integration** — ingress traffic patterns and latency + +**Defer (v2+):** +- **Distributed tracing** — overkill unless debugging microservices latency +- **SLO/SLI dashboards** — error budgets and reliability tracking (nice-to-have for learning) +- **Log-based alerting** — Loki alerting rules beyond basic metrics alerts +- **DORA metrics** — deployment frequency, lead time tracking +- **Vulnerability scanning** — Trivy for container images, npm audit + +**Anti-features (actively avoid):** +- **Multi-environment promotion** — single user, single environment; deploy directly to prod +- **Blue-green/canary deployments** — complex rollout for single-user app +- **ArgoCD high availability** — HA for multi-team, not personal projects +- **ELK stack** — resource-heavy; Loki is lightweight alternative +- **Secrets management (Vault)** — overkill; Kubernetes secrets sufficient +- **Policy enforcement (OPA)** — single user has no policy conflicts + +### Architecture Approach + +The existing architecture has Gitea Actions building Docker images and pushing to Gitea Container Registry, then updating `helm/taskplaner/values.yaml` with the new image tag via Git commit. ArgoCD watches this repository and syncs changes to the k3s cluster. The observability stack integrates via ServiceMonitors (for Prometheus scraping), Alloy DaemonSet (for log collection), and Traefik ingress (for Grafana/ArgoCD UIs). + +**Integration points:** +1. **Gitea → ArgoCD**: HTTPS repository clone (credentials in `argocd-secret`), optional webhook (Gogs type), automatic sync on Git changes +2. **Prometheus → Targets**: ServiceMonitors for TaskPlanner, Traefik, k3s control plane; scrapes `/metrics` endpoints every 30s +3. 
**Alloy → Loki**: DaemonSet reads `/var/log/pods`, forwards to Loki HTTP endpoint in `loki` namespace +4. **Grafana → Data Sources**: Auto-configured Prometheus and Loki datasources via kube-prometheus-stack integration +5. **Traefik → Ingress**: All UIs (Grafana, ArgoCD) exposed via Traefik with cert-manager TLS + +**Namespace strategy:** +- `argocd`: ArgoCD server, repo-server, application-controller (standard convention) +- `monitoring`: Prometheus, Grafana, Alertmanager (kube-prometheus-stack default) +- `loki`: Loki SingleBinary, Alloy DaemonSet (separate for resource isolation) +- `default`: TaskPlanner application (existing) + +**Major components:** +1. **ArgoCD Server** — GitOps controller; watches Git, syncs to cluster, exposes UI/API +2. **Prometheus** — metrics storage and querying; scrapes targets via ServiceMonitors +3. **Grafana** — visualization layer; queries Prometheus and Loki, displays dashboards +4. **Loki** — log aggregation; receives from Alloy, stores on filesystem, queries via LogQL +5. **Alloy DaemonSet** — log collection; reads pod logs, ships to Loki with Kubernetes labels +6. **kube-state-metrics** — Kubernetes object metrics (pod status, deployments, etc.) +7. **node-exporter** — node-level metrics (CPU, memory, disk, network) + +**Data flows:** +- **Metrics**: TaskPlanner/Traefik/k3s expose `/metrics` → Prometheus scrapes → Grafana queries → dashboards display +- **Logs**: Pod stdout/stderr → `/var/log/pods` → Alloy reads → Loki stores → Grafana Explore queries +- **GitOps**: Developer pushes Git → Gitea Actions builds → updates values.yaml → ArgoCD syncs → Kubernetes deploys +- **Observability**: Metrics + Logs converge in Grafana for unified troubleshooting + +### Critical Pitfalls + +1. **ArgoCD + Traefik TLS Redirect Loop** — ArgoCD expects HTTPS but Traefik terminates TLS, causing infinite 307 redirects. Set `server.insecure: true` in `argocd-cmd-params-cm` ConfigMap. Use IngressRoute (not Ingress) for proper gRPC support with correct Header matcher syntax. + +2. **Loki Disk Exhaustion Without Retention** — Loki fills disk because retention is disabled by default and only supports time-based retention (no size limits). Configure `compactor.retention_enabled: true` with `retention_period: 168h` (7 days). Set up Prometheus alert for PVC > 80% usage. Index period MUST be 24h for retention to work. + +3. **Prometheus Volume Growth Exceeds PVC** — Default 15-day retention without size limits causes disk full. Set BOTH `retention: 7d` AND `retentionSize: 8GB`. Size PVC with 20% headroom. Longhorn volume expansion has known issues requiring pod stop, detach, resize, restart procedure. + +4. **k3s Control Plane Metrics Not Scraped** — k3s runs scheduler/controller-manager as single binary binding to localhost, not as pods. Modify `/etc/rancher/k3s/config.yaml` to set `bind-address=0.0.0.0` for each component, then restart k3s. Configure explicit endpoints with k3s server IP in kube-prometheus-stack values. + +5. **Gitea Webhook JSON Parsing Failure** — ArgoCD treats Gitea webhooks as GitHub events but field types differ (e.g., `repository.created_at` is string in Gitea, int64 in GitHub). Webhooks silently fail with parsing errors in ArgoCD logs. Use Gogs webhook type or accept 3-minute polling interval as fallback. + +6. **ServiceMonitor Not Discovering Targets** — Label selector mismatch between Prometheus CR and ServiceMonitor, or RBAC issues. Use port NAME (not number) in ServiceMonitor endpoints. Set `serviceMonitorSelector: {}` for permissive selection. 
Verify RBAC with `kubectl auth can-i list endpoints`. + +7. **k3s Resource Exhaustion** — Full kube-prometheus-stack deploys many components sized for larger clusters. Single-node k3s with 8GB RAM needs explicit resource limits. Disable alertmanager if not using alerts. Set Prometheus to `256Mi` request, Grafana to `128Mi`. Monitor with `kubectl top nodes`. + +## Implications for Roadmap + +Based on research, suggested phase structure prioritizes operational foundation before observability, then CI hardening: + +### Phase 1: GitOps Foundation (ArgoCD) +**Rationale:** Eliminates manual `helm upgrade` commands and establishes Git as single source of truth. ArgoCD is the lowest-hanging fruit—Application manifest already exists, just needs server installation. Immediate value: hands-off deployments. + +**Delivers:** +- ArgoCD installed via Helm in `argocd` namespace +- Existing `argocd/application.yaml` applied and syncing +- Auto-sync with self-heal enabled (already configured) +- Traefik ingress for ArgoCD UI with TLS +- Health checks showing deployment status + +**Addresses:** +- Automated deployment trigger (table stakes from FEATURES.md) +- Git as single source of truth (GitOps principle) +- Self-healing (prevents manual drift) + +**Avoids:** +- Pitfall #1: ArgoCD TLS redirect loop (configure `server.insecure: true`) +- Pitfall #5: Gitea webhook parsing (use Gogs type or polling) + +**Configuration needed:** +- ArgoCD Helm values with `server.insecure: true` +- Gitea repository credentials in `argocd-secret` +- IngressRoute for ArgoCD UI (Traefik v3 syntax) +- Optional webhook in Gitea (test but accept polling fallback) + +### Phase 2: Observability Stack (Prometheus/Grafana/Loki) +**Rationale:** Can't operate what you can't see. Establishes visibility before adding CI complexity. Observability enables debugging issues from Phase 1 and provides baseline before Phase 3 changes. + +**Delivers:** +- kube-prometheus-stack (Prometheus + Grafana + Alertmanager) +- k3s control plane metrics exposed and scraped +- Pre-built Kubernetes dashboards in Grafana +- Loki in monolithic mode with retention configured +- Alloy DaemonSet collecting pod logs +- 3-5 critical alerts (pod crashes, OOM, disk full, app down) +- Traefik metrics integration +- Ingress for Grafana UI with TLS + +**Addresses:** +- Basic metrics collection (table stakes) +- Metrics visualization (table stakes) +- Log aggregation (table stakes) +- Basic alerting (table stakes) +- k3s control plane monitoring (differentiator) + +**Avoids:** +- Pitfall #2: Loki disk full (configure retention from day one) +- Pitfall #3: Prometheus volume growth (set retention + size limits) +- Pitfall #4: k3s metrics not scraped (configure endpoints) +- Pitfall #6: ServiceMonitor discovery (verify RBAC, use port names) +- Pitfall #7: Resource exhaustion (right-size for single-node) + +**Configuration needed:** +- Modify `/etc/rancher/k3s/config.yaml` to expose control plane metrics +- kube-prometheus-stack values with k3s-specific endpoints and resource limits +- Loki values with retention enabled and monolithic mode +- Alloy values with Kubernetes log discovery pointing to Loki +- ServiceMonitors for Traefik (and future TaskPlanner metrics) + +**Sub-phases:** +1. Configure k3s metrics exposure (restart k3s) +2. Install kube-prometheus-stack (Prometheus + Grafana) +3. Install Loki + Alloy (log aggregation) +4. Verify dashboards and create critical alerts + +### Phase 3: CI Pipeline Hardening (Tests) +**Rationale:** Tests catch bugs before deployment. 
Comes last because Phases 1-2 provide operational foundation to observe test failures and deployment issues. Playwright already configured; just needs integration into pipeline plus Vitest addition. + +**Delivers:** +- Vitest installed for unit/component tests +- Test suite structure established +- Gitea Actions workflow updated with test stage +- Tests run before build (fail fast) +- Playwright Docker image for browser tests (no install overhead) +- Type checking (`svelte-check`) in pipeline +- NPM scripts for local testing + +**Addresses:** +- Automated tests in pipeline (table stakes) +- Lint/static analysis (table stakes) +- Pipeline fail-fast principle + +**Avoids:** +- Over-engineering with extensive E2E suite (start simple) +- Test complexity that slows iterations + +**Configuration needed:** +- Install Vitest + @testing-library/svelte +- Create `vitest.config.ts` +- Update `.gitea/workflows/build.yaml` with test job +- Add NPM scripts for test commands +- Configure test container image + +**Test pyramid for personal app:** +- Unit tests: 70% (Vitest, fast, isolated) +- Integration tests: 20% (API endpoints, database) +- E2E tests: 10% (Playwright, critical paths only) + +### Phase Ordering Rationale + +**Why GitOps first:** +- ArgoCD configuration already exists (lowest effort) +- Immediate value: eliminates manual deployment +- Foundation for observing subsequent changes +- No dependencies on other phases + +**Why Observability second:** +- Provides visibility into GitOps operations from Phase 1 +- Required before adding CI complexity (Phase 3) +- k3s metrics configuration requires cluster restart (minimize disruptions) +- Baseline metrics needed to measure impact of changes + +**Why CI Testing last:** +- Tests benefit from observability (can see failures in Grafana) +- GitOps ensures test failures block bad deployments +- Building on working foundation reduces moving parts +- Can iterate on test coverage after core infrastructure solid + +**Dependencies respected:** +- Tests before build → CI pipeline structure +- ArgoCD watches Git → Git update triggers deploy +- Observability before app changes → baseline established +- Prometheus before alerts → scraping functional before alerting + +### Research Flags + +**Phases needing deeper research during planning:** +- **Phase 2.1 (k3s metrics)**: Verify exact k3s version and config file location; k3s installation methods vary +- **Phase 2.3 (Loki retention)**: Confirm disk capacity planning based on actual log volume + +**Phases with standard patterns (skip research-phase):** +- **Phase 1 (ArgoCD)**: Well-documented Helm installation, existing Application manifest, standard Traefik pattern +- **Phase 2.2 (kube-prometheus-stack)**: Standard chart with k3s-specific values, extensive community examples +- **Phase 3 (Testing)**: Playwright already configured, Vitest is official Svelte recommendation + +**Research confidence:** +- GitOps: HIGH (official ArgoCD docs + existing config) +- Observability: HIGH (official Helm charts + k3s community guides) +- Testing: HIGH (official Svelte docs + existing Playwright setup) +- Pitfalls: HIGH (verified with GitHub issues and production reports) + +## Confidence Assessment + +| Area | Confidence | Notes | +|------|------------|-------| +| Stack | HIGH | All components verified with official Helm charts and version numbers. Promtail EOL confirmed from Grafana docs. | +| Features | HIGH | Table stakes derived from CI/CD best practices and Kubernetes observability standards. 
Anti-features validated against homelab community patterns. | +| Architecture | HIGH | Integration patterns verified with official documentation (ArgoCD, Prometheus Operator, Loki). Namespace strategy follows community conventions. | +| Pitfalls | HIGH | All critical pitfalls sourced from verified GitHub issues with reproduction steps and fixes. k3s-specific issues confirmed from k3s.rocks tutorials. | + +**Overall confidence:** HIGH + +### Gaps to Address + +**Gitea webhook reliability:** Research confirms JSON parsing issues with ArgoCD but workarounds exist (use Gogs type). Need to test in actual environment and decide whether to invest in debugging webhook vs. accepting 3-minute polling. For single-user workload, polling is acceptable. + +**k3s version compatibility:** Research assumes recent k3s (v1.27+). Need to verify actual cluster version and k3s installation method (server vs. embedded) affects config file location and metrics exposure. Standard install at `/etc/rancher/k3s/config.yaml` may differ for k3d or other variants. + +**Longhorn replica count:** Single-node k3s requires Longhorn replica count set to 1 (default is 3). Verify existing Longhorn configuration handles this correctly for new PVCs created by observability stack. + +**Resource capacity:** Research estimates ~1.2 CPU cores and ~1.7GB RAM for observability stack. Verify actual k3s node has headroom beyond existing TaskPlanner, Gitea, Traefik, Longhorn workloads. Minimum 4GB RAM recommended for k3s + monitoring + apps. + +**TLS certificate limits:** Adding Grafana and ArgoCD ingresses increases Let's Encrypt certificate count. Verify current usage doesn't approach rate limits (50 certs per domain per week). + +## Sources + +### Primary (HIGH confidence) + +**Official Documentation:** +- [Svelte Testing Documentation](https://svelte.dev/docs/svelte/testing) - Vitest recommendation +- [Playwright CI Setup](https://playwright.dev/docs/ci-intro) - Docker image and best practices +- [ArgoCD Helm Chart](https://artifacthub.io/packages/helm/argo/argo-cd) - Version 9.4.0 +- [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) - Version 81.4.2 +- [Grafana Loki Helm](https://grafana.com/docs/loki/latest/setup/install/helm/) - Monolithic mode +- [Grafana Alloy](https://grafana.com/docs/alloy/latest/set-up/install/kubernetes/) - Installation and config +- [Promtail EOL Notice](https://grafana.com/docs/loki/latest/send-data/promtail/installation/) - EOL 2026-03-02 +- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/) - TLS termination +- [Grafana Loki Retention](https://grafana.com/docs/loki/latest/operations/storage/retention/) - Compactor config + +**Verified Issues:** +- [ArgoCD #16453](https://github.com/argoproj/argo-cd/issues/16453) - Gitea webhook parsing failure +- [Loki #5242](https://github.com/grafana/loki/issues/5242) - Retention not working +- [Longhorn #2222](https://github.com/longhorn/longhorn/issues/2222) - Volume expansion issues +- [kube-prometheus-stack #3401](https://github.com/prometheus-community/helm-charts/issues/3401) - Resource limits +- [Prometheus Operator #3383](https://github.com/prometheus-operator/prometheus-operator/issues/3383) - ServiceMonitor discovery + +### Secondary (MEDIUM confidence) + +**Community Tutorials:** +- [K3S Rocks - ArgoCD](https://k3s.rocks/argocd/) - k3s-specific ArgoCD setup +- [K3S Rocks - Logging](https://k3s.rocks/logging/) - Loki on k3s patterns +- [Prometheus on 
K3s](https://fabianlee.org/2022/07/02/prometheus-installing-kube-prometheus-stack-on-k3s-cluster/) - k3s control plane configuration +- [K3s Monitoring Guide](https://github.com/cablespaghetti/k3s-monitoring) - Complete k3s observability stack +- [Bootstrapping ArgoCD](https://windsock.io/bootstrapping-argocd/) - Initial setup patterns +- [ServiceMonitor Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html) - Common issues + +**Best Practices:** +- [CI/CD Best Practices](https://www.jetbrains.com/teamcity/ci-cd-guide/ci-cd-best-practices/) - Testing pyramid, fail fast +- [Kubernetes Observability](https://www.usdsi.org/data-science-insights/kubernetes-observability-and-monitoring-trends-in-2026) - Stack selection +- [ArgoCD Best Practices](https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/) - Sync waves, self-management + +### Tertiary (LOW confidence) + +- None - all research verified with official sources or production issue reports + +--- + +*Research completed: 2026-02-03* +*Ready for roadmap: Yes* +*Files synthesized: STACK-v2-cicd-observability.md, FEATURES.md, ARCHITECTURE.md, PITFALLS-CICD-OBSERVABILITY.md*