# Project Research Summary: v2.0 CI/CD and Observability

**Project:** TaskPlanner v2.0 Production Operations
**Domain:** CI/CD Testing, GitOps Deployment, and Kubernetes Observability
**Researched:** 2026-02-03
**Confidence:** HIGH

## Executive Summary

This research covers production-readiness improvements for a self-hosted SvelteKit task management application running on k3s. The milestone adds three capabilities: (1) automated testing in the existing Gitea Actions pipeline, (2) ArgoCD-based GitOps deployment automation, and (3) a complete observability stack (Prometheus, Grafana, Loki). The infrastructure foundation already exists—k3s cluster, Gitea with Actions, Traefik ingress, Longhorn storage, and a defined ArgoCD Application manifest.

**Recommended approach:** Implement in three phases, prioritizing the operational foundation first. Phase 1 enables GitOps automation (ArgoCD), Phase 2 establishes observability (kube-prometheus-stack + Loki/Alloy), and Phase 3 hardens the CI pipeline with comprehensive testing. This ordering delivers immediate value (hands-off deployments) before adding observability, then solidifies quality gates last. The stack is standard for self-hosted k3s: ArgoCD for GitOps, kube-prometheus-stack for metrics/dashboards, Loki in monolithic mode for logs, and Grafana Alloy for log collection (Promtail is EOL March 2026).

**Key risks:** (1) ArgoCD behind Traefik TLS termination requires `server.insecure: true`, or redirect loops occur; (2) Loki exhausts the disk without retention configuration (filesystem storage has no size limits); (3) k3s control plane metrics need explicit endpoint configuration; and (4) Gitea webhooks fail JSON parsing with ArgoCD (use polling or accept webhook limitations). All risks have documented mitigations from production k3s deployments.

## Key Findings

### Recommended Stack

**GitOps:** ArgoCD is already configured in `argocd/application.yaml` with correct auto-sync and self-heal policies.
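For orientation, an Application manifest with auto-sync and self-heal policies typically follows the pattern sketched below. This is an illustrative sketch, not the project's actual `argocd/application.yaml`: the repository URL, application name, and chart path are placeholder assumptions.

```yaml
# Sketch of an ArgoCD Application with auto-sync and self-heal.
# repoURL and names are placeholders, not the project's real values.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: taskplanner
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitea.example.com/user/taskplanner.git  # assumed
    targetRevision: main
    path: helm/taskplaner          # chart path referenced elsewhere in this doc
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true      # delete cluster resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```

The `automated` block is what makes the deployment hands-off: `selfHeal` is the setting that reverts manual `kubectl` edits, and `prune` keeps the cluster from accumulating orphaned resources.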
The Application manifest exists, but the ArgoCD server itself still needs to be installed. Gitea webhooks to ArgoCD have known JSON parsing issues (Gitea uses the Gogs payload format while ArgoCD expects GitHub's); falling back to the default 3-minute polling interval is acceptable for a single-user workload. ArgoCD Image Updater is unnecessary—the existing pattern of updating `values.yaml` in Git provides a better audit trail.

**Observability:** The standard k3s stack is kube-prometheus-stack (a single Helm chart bundling Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics), Loki in monolithic SingleBinary mode for logs, and Grafana Alloy for log collection. CRITICAL: Promtail reaches end-of-life on 2026-03-02 (next month)—use Alloy instead. Loki's monolithic mode uses filesystem storage, appropriate for single-node deployments under 100GB/day of log volume. k3s requires explicit configuration to expose control plane metrics (the scheduler and controller-manager bind to localhost by default).

**Testing:** Playwright is already configured with E2E tests in `tests/docker-deployment.spec.ts`. Add Vitest for unit/component testing (the official Svelte recommendation for Vite-based projects). Use the Playwright Docker image (`mcr.microsoft.com/playwright:v1.58.1-noble`) in Gitea Actions to avoid 2-3 minutes of browser installation overhead. Run tests before the build to fail fast.
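Since Gitea Actions uses GitHub Actions-compatible workflow syntax, the test-before-build ordering described above can be sketched roughly as follows. The script names (`test:unit`, `test:e2e`) and the build step are illustrative assumptions, not the repository's actual workflow:

```yaml
# Sketch of a .gitea/workflows/build.yaml test stage (assumed script names).
# The Playwright image ships browsers preinstalled, avoiding install overhead.
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: mcr.microsoft.com/playwright:v1.58.1-noble
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:unit   # Vitest (assumed script name)
      - run: npm run test:e2e    # Playwright (assumed script name)

  build:
    needs: test                  # fail fast: build runs only if tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # build and push the Docker image here (existing pipeline steps)
```

The `needs: test` dependency is what enforces the fail-fast principle: a red test job prevents the image build and the downstream GitOps deploy from ever starting.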
**Core technologies:**

- **ArgoCD 3.3.0** (via Helm chart 9.4.0): GitOps deployment automation — already configured, needs installation
- **kube-prometheus-stack 81.4.2**: Bundled Prometheus + Grafana + Alertmanager — standard k3s observability stack
- **Loki 6.51.0** (monolithic mode): Log aggregation — lightweight, label-based like Prometheus
- **Grafana Alloy 1.5.3**: Log collection agent — replacement for Promtail (EOL March 2026)
- **Vitest 3.0**: Unit/component tests — official Svelte recommendation, shares Vite config
- **Playwright 1.58.1**: E2E testing — already in use, comprehensive browser automation

### Expected Features

**Must have (table stakes):**

- **Automated tests in CI pipeline** — without tests, the pipeline is just a build script; fail fast before deployment
- **GitOps auto-sync** — manual `helm upgrade` defeats the purpose of CI/CD; Git is the single source of truth
- **Self-healing deployments** — ArgoCD reverts manual changes to maintain the Git state
- **Basic metrics collection** — Prometheus scraping cluster and app metrics for visibility
- **Metrics visualization** — Grafana dashboards; metrics without visualization are useless
- **Log aggregation** — Loki centralized logging; no more `kubectl logs` per pod
- **Basic alerting** — 3-5 critical alerts (pod crashes, OOM, app down, disk full)

**Should have (differentiators):**

- **Application-level metrics** — custom Prometheus metrics in TaskPlanner (`/metrics` endpoint)
- **Gitea webhook integration** — reduces sync delay from 3 minutes to seconds (accept limitations)
- **Smoke tests on deploy** — verify deployment health after ArgoCD sync
- **k3s control plane monitoring** — scheduler and controller-manager metrics in dashboards
- **Traefik metrics integration** — ingress traffic patterns and latency

**Defer (v2+):**

- **Distributed tracing** — overkill unless debugging microservices latency
- **SLO/SLI dashboards** — error budgets and reliability tracking (nice-to-have for learning)
- **Log-based alerting** — Loki alerting rules beyond basic metrics alerts
- **DORA metrics** — deployment frequency, lead time tracking
- **Vulnerability scanning** — Trivy for container images, npm audit

**Anti-features (actively avoid):**

- **Multi-environment promotion** — single user, single environment; deploy directly to prod
- **Blue-green/canary deployments** — complex rollout machinery for a single-user app
- **ArgoCD high availability** — HA is for multi-team setups, not personal projects
- **ELK stack** — resource-heavy; Loki is the lightweight alternative
- **Secrets management (Vault)** — overkill; Kubernetes Secrets are sufficient
- **Policy enforcement (OPA)** — a single user has no policy conflicts

### Architecture Approach

The existing architecture has Gitea Actions building Docker images and pushing them to the Gitea Container Registry, then updating `helm/taskplaner/values.yaml` with the new image tag via a Git commit. ArgoCD watches this repository and syncs changes to the k3s cluster. The observability stack integrates via ServiceMonitors (for Prometheus scraping), an Alloy DaemonSet (for log collection), and Traefik ingress (for the Grafana/ArgoCD UIs).

**Integration points:**

1. **Gitea → ArgoCD**: HTTPS repository clone (credentials in `argocd-secret`), optional webhook (Gogs type), automatic sync on Git changes
2. **Prometheus → Targets**: ServiceMonitors for TaskPlanner, Traefik, and the k3s control plane; scrapes `/metrics` endpoints every 30s
3. **Alloy → Loki**: DaemonSet reads `/var/log/pods`, forwards to the Loki HTTP endpoint in the `loki` namespace
4. **Grafana → Data Sources**: Auto-configured Prometheus and Loki datasources via kube-prometheus-stack integration
5. **Traefik → Ingress**: All UIs (Grafana, ArgoCD) exposed via Traefik with cert-manager TLS

**Namespace strategy:**

- `argocd`: ArgoCD server, repo-server, application-controller (standard convention)
- `monitoring`: Prometheus, Grafana, Alertmanager (kube-prometheus-stack default)
- `loki`: Loki SingleBinary, Alloy DaemonSet (separate for resource isolation)
- `default`: TaskPlanner application (existing)

**Major components:**

1. **ArgoCD Server** — GitOps controller; watches Git, syncs to the cluster, exposes UI/API
2. **Prometheus** — metrics storage and querying; scrapes targets via ServiceMonitors
3. **Grafana** — visualization layer; queries Prometheus and Loki, displays dashboards
4. **Loki** — log aggregation; receives from Alloy, stores on the filesystem, queried via LogQL
5. **Alloy DaemonSet** — log collection; reads pod logs, ships them to Loki with Kubernetes labels
6. **kube-state-metrics** — Kubernetes object metrics (pod status, deployments, etc.)
7. **node-exporter** — node-level metrics (CPU, memory, disk, network)

**Data flows:**

- **Metrics**: TaskPlanner/Traefik/k3s expose `/metrics` → Prometheus scrapes → Grafana queries → dashboards display
- **Logs**: Pod stdout/stderr → `/var/log/pods` → Alloy reads → Loki stores → Grafana Explore queries
- **GitOps**: Developer pushes to Git → Gitea Actions builds → updates values.yaml → ArgoCD syncs → Kubernetes deploys
- **Observability**: Metrics and logs converge in Grafana for unified troubleshooting

### Critical Pitfalls

1. **ArgoCD + Traefik TLS Redirect Loop** — ArgoCD expects HTTPS but Traefik terminates TLS, causing infinite 307 redirects. Set `server.insecure: true` in the `argocd-cmd-params-cm` ConfigMap. Use an IngressRoute (not Ingress) for proper gRPC support, with the correct Header matcher syntax.
2. **Loki Disk Exhaustion Without Retention** — Loki fills the disk because retention is disabled by default and only supports time-based retention (no size limits). Configure `compactor.retention_enabled: true` with `retention_period: 168h` (7 days). Set up a Prometheus alert for PVC usage above 80%. The index period MUST be 24h for retention to work.
3. **Prometheus Volume Growth Exceeds PVC** — The default 15-day retention without size limits causes disk-full conditions. Set BOTH `retention: 7d` AND `retentionSize: 8GB`. Size the PVC with 20% headroom. Longhorn volume expansion has known issues requiring a stop-pod, detach, resize, restart procedure.
4. **k3s Control Plane Metrics Not Scraped** — k3s runs the scheduler/controller-manager inside a single binary binding to localhost, not as pods. Modify `/etc/rancher/k3s/config.yaml` to set `bind-address=0.0.0.0` for each component, then restart k3s. Configure explicit endpoints with the k3s server IP in the kube-prometheus-stack values.
5. **Gitea Webhook JSON Parsing Failure** — ArgoCD treats Gitea webhooks as GitHub events, but field types differ (e.g., `repository.created_at` is a string in Gitea, an int64 in GitHub). Webhooks silently fail with parsing errors in the ArgoCD logs. Use the Gogs webhook type, or accept the 3-minute polling interval as a fallback.
6. **ServiceMonitor Not Discovering Targets** — Caused by a label selector mismatch between the Prometheus CR and the ServiceMonitor, or by RBAC issues. Use the port NAME (not number) in ServiceMonitor endpoints. Set `serviceMonitorSelector: {}` for permissive selection. Verify RBAC with `kubectl auth can-i list endpoints`.
7. **k3s Resource Exhaustion** — The full kube-prometheus-stack deploys many components sized for larger clusters. Single-node k3s with 8GB RAM needs explicit resource limits. Disable Alertmanager if not using alerts. Set Prometheus to a `256Mi` request and Grafana to `128Mi`. Monitor with `kubectl top nodes`.

## Implications for Roadmap

Based on the research, the suggested phase structure prioritizes the operational foundation before observability, then CI hardening:

### Phase 1: GitOps Foundation (ArgoCD)

**Rationale:** Eliminates manual `helm upgrade` commands and establishes Git as the single source of truth. ArgoCD is the lowest-hanging fruit—the Application manifest already exists, so only the server installation is needed. Immediate value: hands-off deployments.

**Delivers:**

- ArgoCD installed via Helm in the `argocd` namespace
- Existing `argocd/application.yaml` applied and syncing
- Auto-sync with self-heal enabled (already configured)
- Traefik ingress for the ArgoCD UI with TLS
- Health checks showing deployment status

**Addresses:**

- Automated deployment trigger (table stakes from FEATURES.md)
- Git as single source of truth (GitOps principle)
- Self-healing (prevents manual drift)

**Avoids:**

- Pitfall #1: ArgoCD TLS redirect loop (configure `server.insecure: true`)
- Pitfall #5: Gitea webhook parsing (use Gogs type or polling)

**Configuration needed:**

- ArgoCD Helm values with `server.insecure: true`
- Gitea repository credentials in `argocd-secret`
- IngressRoute for the ArgoCD UI (Traefik v3 syntax)
- Optional webhook in Gitea (test it, but accept the polling fallback)

### Phase 2: Observability Stack (Prometheus/Grafana/Loki)

**Rationale:** You can't operate what you can't see. This phase establishes visibility before adding CI complexity. Observability enables debugging issues from Phase 1 and provides a baseline before the Phase 3 changes.
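The Phase 1 configuration items above (insecure server mode plus a Traefik IngressRoute) can be sketched as follows. This is a hedged sketch based on the ArgoCD ingress documentation pattern, not a tested configuration: the hostname, certificate resolver, and exact Helm value paths are assumptions to verify against the chart version in use.

```yaml
# argocd-values.yaml (sketch): let Traefik terminate TLS to avoid the
# 307 redirect loop. configs.params feeds the argocd-cmd-params-cm ConfigMap.
configs:
  params:
    server.insecure: true
server:
  ingress:
    enabled: false   # using an explicit IngressRoute below instead
---
# Traefik IngressRoute for the ArgoCD UI and gRPC API (Traefik v3 syntax).
# Hostname and certResolver are placeholders.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    - kind: Rule
      match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c   # argocd CLI uses gRPC; h2c is needed with upstream TLS termination
  tls:
    certResolver: letsencrypt   # assumed resolver name
```

The second route with the `Header` matcher is what keeps the `argocd` CLI working: gRPC traffic must reach the server over h2c once Traefik owns the TLS session.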
**Delivers:**

- kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
- k3s control plane metrics exposed and scraped
- Pre-built Kubernetes dashboards in Grafana
- Loki in monolithic mode with retention configured
- Alloy DaemonSet collecting pod logs
- 3-5 critical alerts (pod crashes, OOM, disk full, app down)
- Traefik metrics integration
- Ingress for the Grafana UI with TLS

**Addresses:**

- Basic metrics collection (table stakes)
- Metrics visualization (table stakes)
- Log aggregation (table stakes)
- Basic alerting (table stakes)
- k3s control plane monitoring (differentiator)

**Avoids:**

- Pitfall #2: Loki disk full (configure retention from day one)
- Pitfall #3: Prometheus volume growth (set retention + size limits)
- Pitfall #4: k3s metrics not scraped (configure endpoints)
- Pitfall #6: ServiceMonitor discovery (verify RBAC, use port names)
- Pitfall #7: Resource exhaustion (right-size for single-node)

**Configuration needed:**

- Modify `/etc/rancher/k3s/config.yaml` to expose control plane metrics
- kube-prometheus-stack values with k3s-specific endpoints and resource limits
- Loki values with retention enabled and monolithic mode
- Alloy values with Kubernetes log discovery pointing to Loki
- ServiceMonitors for Traefik (and future TaskPlanner metrics)

**Sub-phases:**

1. Configure k3s metrics exposure (restart k3s)
2. Install kube-prometheus-stack (Prometheus + Grafana)
3. Install Loki + Alloy (log aggregation)
4. Verify dashboards and create critical alerts

### Phase 3: CI Pipeline Hardening (Tests)

**Rationale:** Tests catch bugs before deployment. This phase comes last because Phases 1-2 provide the operational foundation to observe test failures and deployment issues. Playwright is already configured; it just needs integration into the pipeline, plus the Vitest addition.
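Before moving on, the Loki retention settings called out in the Phase 2 configuration list above can be sketched roughly as follows. This is an untested sketch against the Loki Helm chart's SingleBinary layout; the storage size, schema date, and retention period are illustrative values to adapt, not verified defaults.

```yaml
# loki-values.yaml (sketch): monolithic mode with time-based retention,
# guarding against the disk-exhaustion pitfall. Values are illustrative.
deploymentMode: SingleBinary
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2026-01-01"          # any date before the first deploy (assumed)
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h               # MUST be 24h for retention to work
  compactor:
    retention_enabled: true         # retention is off by default
    delete_request_store: filesystem
  limits_config:
    retention_period: 168h          # 7 days
singleBinary:
  replicas: 1
  persistence:
    size: 10Gi                      # size with headroom; alert at 80% PVC usage
```

Note that `retention_period` lives under `limits_config` while the enable flag lives under `compactor`; both are needed, and the 24h index period is the precondition the pitfalls section flags.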
**Delivers:**

- Vitest installed for unit/component tests
- Test suite structure established
- Gitea Actions workflow updated with a test stage
- Tests run before the build (fail fast)
- Playwright Docker image for browser tests (no install overhead)
- Type checking (`svelte-check`) in the pipeline
- NPM scripts for local testing

**Addresses:**

- Automated tests in pipeline (table stakes)
- Lint/static analysis (table stakes)
- Pipeline fail-fast principle

**Avoids:**

- Over-engineering with an extensive E2E suite (start simple)
- Test complexity that slows iterations

**Configuration needed:**

- Install Vitest + @testing-library/svelte
- Create `vitest.config.ts`
- Update `.gitea/workflows/build.yaml` with a test job
- Add NPM scripts for test commands
- Configure the test container image

**Test pyramid for a personal app:**

- Unit tests: 70% (Vitest, fast, isolated)
- Integration tests: 20% (API endpoints, database)
- E2E tests: 10% (Playwright, critical paths only)

### Phase Ordering Rationale

**Why GitOps first:**

- ArgoCD configuration already exists (lowest effort)
- Immediate value: eliminates manual deployment
- Foundation for observing subsequent changes
- No dependencies on other phases

**Why Observability second:**

- Provides visibility into the GitOps operations from Phase 1
- Required before adding CI complexity (Phase 3)
- k3s metrics configuration requires a cluster restart (minimize disruptions)
- Baseline metrics needed to measure the impact of changes

**Why CI Testing last:**

- Tests benefit from observability (failures are visible in Grafana)
- GitOps ensures test failures block bad deployments
- Building on a working foundation reduces moving parts
- Test coverage can be iterated on once the core infrastructure is solid

**Dependencies respected:**

- Tests before build → CI pipeline structure
- ArgoCD watches Git → Git update triggers deploy
- Observability before app changes → baseline established
- Prometheus before alerts → scraping functional before alerting

### Research Flags

**Phases needing deeper research during planning:**

- **Phase 2.1 (k3s metrics)**: Verify the exact k3s version and config file location; k3s installation methods vary
- **Phase 2.3 (Loki retention)**: Confirm disk capacity planning based on actual log volume

**Phases with standard patterns (skip research-phase):**

- **Phase 1 (ArgoCD)**: Well-documented Helm installation, existing Application manifest, standard Traefik pattern
- **Phase 2.2 (kube-prometheus-stack)**: Standard chart with k3s-specific values, extensive community examples
- **Phase 3 (Testing)**: Playwright already configured; Vitest is the official Svelte recommendation

**Research confidence:**

- GitOps: HIGH (official ArgoCD docs + existing config)
- Observability: HIGH (official Helm charts + k3s community guides)
- Testing: HIGH (official Svelte docs + existing Playwright setup)
- Pitfalls: HIGH (verified with GitHub issues and production reports)

## Confidence Assessment

| Area | Confidence | Notes |
|------|------------|-------|
| Stack | HIGH | All components verified with official Helm charts and version numbers. Promtail EOL confirmed from Grafana docs. |
| Features | HIGH | Table stakes derived from CI/CD best practices and Kubernetes observability standards. Anti-features validated against homelab community patterns. |
| Architecture | HIGH | Integration patterns verified with official documentation (ArgoCD, Prometheus Operator, Loki). Namespace strategy follows community conventions. |
| Pitfalls | HIGH | All critical pitfalls sourced from verified GitHub issues with reproduction steps and fixes. k3s-specific issues confirmed from k3s.rocks tutorials. |

**Overall confidence:** HIGH

### Gaps to Address

**Gitea webhook reliability:** Research confirms JSON parsing issues with ArgoCD, but workarounds exist (use the Gogs type). Test in the actual environment and decide whether to invest in debugging the webhook or to accept 3-minute polling. For a single-user workload, polling is acceptable.
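If the webhook route is tested, ArgoCD listens for Git webhooks at the `/api/webhook` path on its API server. The Gitea-side settings would look roughly like the following; this is shown as YAML only for readability (the values are entered in the repository's webhook settings UI), and the hostname and secret are placeholders.

```yaml
# Gitea webhook settings (sketch) — configured in the repo's
# Settings → Webhooks UI; hostname and secret are placeholders.
type: gogs                                   # Gogs payload sidesteps the Gitea/GitHub field-type mismatch
target_url: https://argocd.example.com/api/webhook
content_type: application/json
secret: <webhook-secret>                     # should match webhook.gogs.secret in argocd-secret
events:
  - push
```

If the webhook still fails silently, the ArgoCD server logs are where the parsing errors surface; the 3-minute polling fallback remains the safety net either way.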
**k3s version compatibility:** Research assumes a recent k3s (v1.27+). Verify the actual cluster version; the installation method (server vs. embedded) affects the config file location and metrics exposure. The standard install path `/etc/rancher/k3s/config.yaml` may differ for k3d or other variants.

**Longhorn replica count:** Single-node k3s requires the Longhorn replica count set to 1 (the default is 3). Verify that the existing Longhorn configuration handles this correctly for the new PVCs created by the observability stack.

**Resource capacity:** Research estimates ~1.2 CPU cores and ~1.7GB RAM for the observability stack. Verify the actual k3s node has headroom beyond the existing TaskPlanner, Gitea, Traefik, and Longhorn workloads. A minimum of 4GB RAM is recommended for k3s + monitoring + apps.

**TLS certificate limits:** Adding Grafana and ArgoCD ingresses increases the Let's Encrypt certificate count. Verify that current usage doesn't approach the rate limits (50 certificates per domain per week).

## Sources

### Primary (HIGH confidence)

**Official Documentation:**

- [Svelte Testing Documentation](https://svelte.dev/docs/svelte/testing) - Vitest recommendation
- [Playwright CI Setup](https://playwright.dev/docs/ci-intro) - Docker image and best practices
- [ArgoCD Helm Chart](https://artifacthub.io/packages/helm/argo/argo-cd) - Version 9.4.0
- [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) - Version 81.4.2
- [Grafana Loki Helm](https://grafana.com/docs/loki/latest/setup/install/helm/) - Monolithic mode
- [Grafana Alloy](https://grafana.com/docs/alloy/latest/set-up/install/kubernetes/) - Installation and config
- [Promtail EOL Notice](https://grafana.com/docs/loki/latest/send-data/promtail/installation/) - EOL 2026-03-02
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/) - TLS termination
- [Grafana Loki Retention](https://grafana.com/docs/loki/latest/operations/storage/retention/) - Compactor config

**Verified Issues:**

- [ArgoCD #16453](https://github.com/argoproj/argo-cd/issues/16453) - Gitea webhook parsing failure
- [Loki #5242](https://github.com/grafana/loki/issues/5242) - Retention not working
- [Longhorn #2222](https://github.com/longhorn/longhorn/issues/2222) - Volume expansion issues
- [kube-prometheus-stack #3401](https://github.com/prometheus-community/helm-charts/issues/3401) - Resource limits
- [Prometheus Operator #3383](https://github.com/prometheus-operator/prometheus-operator/issues/3383) - ServiceMonitor discovery

### Secondary (MEDIUM confidence)

**Community Tutorials:**

- [K3S Rocks - ArgoCD](https://k3s.rocks/argocd/) - k3s-specific ArgoCD setup
- [K3S Rocks - Logging](https://k3s.rocks/logging/) - Loki on k3s patterns
- [Prometheus on K3s](https://fabianlee.org/2022/07/02/prometheus-installing-kube-prometheus-stack-on-k3s-cluster/) - k3s control plane configuration
- [K3s Monitoring Guide](https://github.com/cablespaghetti/k3s-monitoring) - Complete k3s observability stack
- [Bootstrapping ArgoCD](https://windsock.io/bootstrapping-argocd/) - Initial setup patterns
- [ServiceMonitor Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html) - Common issues

**Best Practices:**

- [CI/CD Best Practices](https://www.jetbrains.com/teamcity/ci-cd-guide/ci-cd-best-practices/) - Testing pyramid, fail fast
- [Kubernetes Observability](https://www.usdsi.org/data-science-insights/kubernetes-observability-and-monitoring-trends-in-2026) - Stack selection
- [ArgoCD Best Practices](https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/) - Sync waves, self-management

### Tertiary (LOW confidence)

- None — all research was verified with official sources or production issue reports

---

*Research completed: 2026-02-03*
*Ready for roadmap: Yes*
*Files synthesized: STACK-v2-cicd-observability.md, FEATURES.md, ARCHITECTURE.md, PITFALLS-CICD-OBSERVABILITY.md*