docs: complete v2.0 CI/CD and observability research

Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfall: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 03:29:23 +01:00
parent 6cdd5aa8c7
commit 5dbabe6a2d
5 changed files with 2401 additions and 3 deletions
--- a/.planning/research/FEATURES.md
+++ b/.planning/research/FEATURES.md
@@ -210,5 +210,241 @@ Features to defer until product-market fit is established:
 - Evernote features page (verified via WebFetch)

 ---
-*Feature research for: Personal Task/Notes Web App*
-*Researched: 2026-01-29*
+
+# CI/CD and Observability Features
+
+**Domain:** CI/CD pipelines and Kubernetes observability for a personal project
+**Researched:** 2026-02-03
+**Context:** Single-user, self-hosted TaskPlanner app with existing basic Gitea Actions pipeline
+
+## Current State
+
+Based on the existing `.gitea/workflows/build.yaml`:
+- Build and push Docker images to Gitea Container Registry
+- Docker layer caching enabled
+- Automatic Helm values update with new image tag
+- No tests in pipeline
+- No GitOps automation (ArgoCD defined but requires manual sync)
+- No observability stack
+
+---
+
+## Table Stakes
+
+Features required for production-grade operations. Missing any of these means the system is incomplete for reliable self-hosting.
+
+### CI/CD Pipeline
+
+| Feature | Why Expected | Complexity | Notes |
+|---------|--------------|------------|-------|
+| **Automated tests in pipeline** | Catch bugs before deployment; without tests, the pipeline is just a build script | Low | Start with unit tests (roughly 70% of the suite in the classic test pyramid), add integration tests later |
+| **Build caching** | Already have this | - | Using Docker layer cache to registry |
+| **Lint/static analysis** | Catch errors early (fail fast principle) | Low | ESLint, TypeScript checking |
+| **Pipeline as code** | Already have this | - | Workflow defined in `.gitea/workflows/` |
+| **Automated deployment trigger** | Manual `helm upgrade` defeats CI/CD purpose | Low | ArgoCD auto-sync on Git changes |
+| **Container image tagging** | Already have this | - | SHA-based tags with `latest` |
+
+### GitOps
+
+| Feature | Why Expected | Complexity | Notes |
+|---------|--------------|------------|-------|
+| **Git as single source of truth** | Core GitOps principle; cluster state should match Git | Low | ArgoCD watches Git repo, syncs to cluster |
+| **Auto-sync** | Manual sync defeats GitOps purpose | Low | ArgoCD `syncPolicy.automated.enabled: true` |
+| **Self-healing** | Prevents drift; if someone edits a resource with `kubectl`, ArgoCD reverts the change | Low | ArgoCD `selfHeal: true` |
+| **Health checks** | Know if deployment succeeded | Low | ArgoCD built-in health status |
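+
+A minimal sketch of an ArgoCD `Application` with the auto-sync and self-heal settings from the table. The application name, `repoURL`, `path`, and namespaces are hypothetical placeholders, not values taken from this repository. Auto-sync is switched on here by the presence of the `automated` block; the `enabled` field is left out because its availability may depend on the ArgoCD version in use.
+
+```yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: taskplanner              # hypothetical application name
+  namespace: argocd              # assumes ArgoCD is installed in "argocd"
+spec:
+  project: default
+  source:
+    repoURL: https://gitea.example.com/user/taskplanner.git  # placeholder URL
+    targetRevision: HEAD
+    path: helm/taskplanner       # placeholder path to the Helm chart
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: default           # placeholder target namespace
+  syncPolicy:
+    automated:                   # presence of this block enables auto-sync
+      selfHeal: true             # revert manual changes made in the cluster
+```
+
+With this in place, the image tag update that the pipeline commits to Git would be picked up on ArgoCD's next repository poll without a manual sync.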
+
+### Observability
+
+| Feature | Why Expected | Complexity | Notes |
+|---------|--------------|------------|-------|
+| **Basic metrics collection** | Know if app is running, resource usage | Medium | Prometheus + kube-state-metrics |
+| **Metrics visualization** | Metrics without dashboards are hard to act on | Low | Grafana with pre-built Kubernetes dashboards |
+| **Container logs aggregation** | Debug issues without `kubectl logs` | Medium | Loki (lightweight, label-based) |
+| **Basic alerting** | Know when something breaks | Low | Alertmanager with 3-5 critical alerts |
+
+---
+
+## Differentiators
+
+Features that add significant value but are not strictly required for a single-user personal app. Implement them for learning or practice, or to improve reliability.
+
+### CI/CD Pipeline
+
+| Feature | Value Proposition | Complexity | Notes |
+|---------|-------------------|------------|-------|
+| **Smoke tests on deploy** | Verify deployment actually works | Medium | Hit health endpoint after deploy |
+| **Build notifications** | Know when builds fail without watching | Low | Slack/Discord/email webhook |
+| **DORA metrics tracking** | Track deployment frequency, lead time | Medium | Measure CI/CD effectiveness |
+| **Parallel test execution** | Faster feedback on larger test suites | Medium | Only valuable with substantial test suite |
+| **Dependency vulnerability scanning** | Catch security issues early | Low | `npm audit`, Trivy for container images |
+
+### GitOps
+
+| Feature | Value Proposition | Complexity | Notes |
+|---------|-------------------|------------|-------|
+| **Automated pruning** | Remove resources deleted from Git | Low | ArgoCD `prune: true` |
+| **Sync windows** | Control when syncs happen | Low | Useful if you want maintenance windows |
+| **Application health dashboard** | Visual cluster state | Low | ArgoCD UI already provides this |
+| **Git commit status** | See deployment status in Gitea | Medium | ArgoCD notifications to Git |
+
+### Observability
+
+| Feature | Value Proposition | Complexity | Notes |
+|---------|-------------------|------------|-------|
+| **Application-level metrics** | Track business metrics (tasks created, etc.) | Medium | Custom Prometheus metrics in app |
+| **Request tracing** | Debug latency issues | High | OpenTelemetry, Tempo/Jaeger |
+| **SLO/SLI dashboards** | Define and track reliability targets | Medium | Error budgets, latency percentiles |
+| **Log-based alerting** | Alert on error patterns | Medium | Loki alerting rules |
+| **Uptime monitoring** | External availability check | Low | Uptime Kuma or similar |
+
+---
+
+## Anti-Features
+
+Features that are overkill for a single-user personal app. Actively avoid these to prevent over-engineering.
+
+| Anti-Feature | Why Avoid | What to Do Instead |
+|--------------|-----------|-------------------|
+| **Multi-environment promotion (dev/staging/prod)** | Single user, single environment | Deploy directly to prod; use feature flags if needed |
+| **Blue-green/canary deployments** | Complex rollout for single user is overkill | Simple rolling update; ArgoCD rollback if needed |
+| **Full E2E test suite in CI** | Expensive, slow, diminishing returns for personal app | Unit + smoke tests; manual E2E when needed |
+| **High availability ArgoCD** | HA is for multi-team, multi-tenant | Single replica ArgoCD is fine |
+| **Distributed tracing** | Overkill unless debugging microservices latency | Only add if you have multiple services with latency issues |
+| **ELK stack for logging** | Resource-heavy; Elasticsearch needs significant memory | Use Loki instead (label-based, lightweight) |
+| **Full APM solution** | DataDog/NewRelic-style solutions are enterprise-focused | Prometheus + Grafana + Loki covers personal needs |
+| **Secrets management (Vault)** | Complex for single user with few secrets | Kubernetes secrets or sealed-secrets |
+| **Policy enforcement (OPA/Gatekeeper)** | You are the only user; no policy conflicts | Skip entirely |
+| **Multi-cluster management** | Single cluster, single app | Skip entirely |
+| **Cost optimization/FinOps** | Personal project; cost is fixed/minimal | Skip entirely |
+| **AI-assisted observability** | Marketing hype; manual review is fine at this scale | Skip entirely |
+
+---
+
+## Feature Dependencies
+
+```
+Automated Tests
+    |
+    v
+Lint/Static Analysis --> Build --> Push Image --> Update Git
+                                                      |
+                                                      v
+                                              ArgoCD Auto-Sync
+                                                      |
+                                                      v
+                                              Health Check Pass
+                                                      |
+                                                      v
+                                              Deployment Complete
+                                                      |
+                                                      v
+                                         Metrics/Logs Available in Grafana
+```
+
+Key ordering constraints:
+1. Tests before build (fail fast)
+2. ArgoCD watches Git, so Git update triggers deploy
+3. Observability stack must be running before the app's metrics and logs can be collected; nothing is captured retroactively
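+
+The first two constraints could be expressed in the workflow roughly as follows. This is a sketch only: it assumes Gitea Actions' GitHub-compatible syntax, a Node.js project with `lint` and `test` scripts in `package.json`, a `main` branch, and a runner labelled `ubuntu-latest`. None of these are confirmed by the existing `build.yaml`, and the job names are illustrative.
+
+```yaml
+name: build
+on:
+  push:
+    branches: [main]             # assumed default branch
+jobs:
+  test:
+    runs-on: ubuntu-latest       # assumed runner label
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 20       # assumed Node.js version
+      - run: npm ci
+      - run: npm run lint        # assumes a "lint" script that runs ESLint
+      - run: npx tsc --noEmit    # type check without emitting files
+      - run: npm test            # assumes a "test" script for unit tests
+  build:
+    needs: test                  # build starts only if the test job succeeds
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      # existing build, push, and Helm values update steps go here unchanged
+```
+
+Keeping tests in a separate job means a failing lint or test run stops the pipeline before any image is built or any tag is written back to Git, so ArgoCD never sees a change to deploy.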
+
+---
+
+## MVP Recommendation for CI/CD and Observability
+
+For production-grade operations on a personal project, prioritize in this order:
+
+### Phase 1: GitOps Foundation
+1. Enable ArgoCD auto-sync with self-healing
+2. Add basic health checks
+
+*Rationale:* Eliminates manual `helm upgrade`, establishes GitOps workflow
+
+### Phase 2: Basic Observability
+1. Prometheus + Grafana (kube-prometheus-stack helm chart)
+2. Loki for log aggregation
+3. 3-5 critical alerts (pod crashes, high memory, app down)
+
+*Rationale:* Can't operate what you can't see; minimum viable observability
+
+### Phase 3: CI Pipeline Hardening
+1. Add unit tests to pipeline
+2. Add linting/type checking
+3. Smoke test after deploy (optional)
+
+*Rationale:* Tests catch bugs before they reach production
+
+### Defer to Later (if ever)
+- Application-level custom metrics
+- SLO dashboards
+- Advanced alerting
+- Request tracing
+- Extensive E2E tests
+
+---
+
+## Complexity Budget
+
+For a single-user personal project, the total complexity budget should be LOW-MEDIUM:
+
+| Category | Recommended Complexity | Over-Budget Indicator |
+|----------|----------------------|----------------------|
+| CI Pipeline | LOW | More than 10 min build time; complex test matrix |
+| GitOps | LOW | Multi-environment promotion; complex sync policies |
+| Metrics | MEDIUM | Custom exporters; high-cardinality metrics |
+| Logging | LOW | Full-text search; complex log parsing |
+| Alerting | LOW | More than 10 alerts; complex routing |
+| Tracing | SKIP | Any tracing for single-service app |
+
+---
+
+## Essential Alerts for Personal Project
+
+Based on the alerting sources listed below, these 5 alerts are sufficient for a single-user app:
+
+| Alert | Condition | Why Critical |
+|-------|-----------|--------------|
+| **Pod CrashLooping** | restarts > 3 in 15 min | App is failing repeatedly |
+| **Pod OOMKilled** | OOM event detected | Memory limits too low or leak |
+| **High Memory Usage** | memory > 85% of limit for 5 min | Approaching resource limits |
+| **App Unavailable** | probe failures > 3 | Users cannot access app |
+| **Disk Running Low** | disk > 80% used | Persistent storage filling up |
+
+**Key principle:** Alerts should be symptom-based and actionable. If an alert fires and you don't need to do anything, remove it.
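+
+As a rough illustration, three of the conditions in the table might be written as a `PrometheusRule`. The sketch assumes kube-prometheus-stack is installed (Prometheus Operator CRDs, kube-state-metrics, and kubelet/cAdvisor metrics). The resource name, namespace, `release` label value, and `severity` labels are assumptions; the thresholds mirror the table.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: taskplanner-alerts            # hypothetical name
+  namespace: monitoring               # assumed namespace of the stack
+  labels:
+    release: kube-prometheus-stack    # assumed Helm release name; must match the chart's rule selector
+spec:
+  groups:
+    - name: taskplanner.rules
+      rules:
+        - alert: PodCrashLooping
+          # more than 3 container restarts within 15 minutes
+          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
+          labels:
+            severity: critical
+          annotations:
+            summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15 minutes"
+        - alert: HighMemoryUsage
+          # working set above 85% of the container memory limit for 5 minutes
+          expr: |
+            sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
+              /
+            sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
+              > 0.85
+          for: 5m
+          labels:
+            severity: warning
+        - alert: DiskRunningLow
+          # persistent volume more than 80% full
+          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80
+          labels:
+            severity: warning
+```
+
+The memory rule only produces results for containers that have a memory limit set, which fits the "approaching resource limits" intent of that alert.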
+
+---
+
+## Sources
+
+### CI/CD Best Practices
+- [TeamCity CI/CD Guide](https://www.jetbrains.com/teamcity/ci-cd-guide/ci-cd-best-practices/)
+- [Spacelift CI/CD Best Practices](https://spacelift.io/blog/ci-cd-best-practices)
+- [GitLab CI/CD Best Practices](https://about.gitlab.com/blog/how-to-keep-up-with-ci-cd-best-practices/)
+- [AWS CI/CD Best Practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-cicd-litmus/cicd-best-practices.html)
+
+### Observability
+- [Kubernetes Observability Trends 2026](https://www.usdsi.org/data-science-insights/kubernetes-observability-and-monitoring-trends-in-2026)
+- [Spectro Cloud: Choosing the Right Monitoring Stack](https://www.spectrocloud.com/blog/choosing-the-right-kubernetes-monitoring-stack)
+- [ClickHouse: Mastering Kubernetes Observability](https://clickhouse.com/resources/engineering/mastering-kubernetes-observability-guide)
+- [Kubernetes Official Observability Docs](https://kubernetes.io/docs/concepts/cluster-administration/observability/)
+
+### ArgoCD/GitOps
+- [ArgoCD Auto Sync Documentation](https://argo-cd.readthedocs.io/en/stable/user-guide/auto_sync/)
+- [ArgoCD Best Practices](https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/)
+- [mkdev: ArgoCD Self-Heal and Sync Windows](https://mkdev.me/posts/argo-cd-self-heal-sync-windows-and-diffing)
+
+### Alerting
+- [Sysdig: Alerting on Kubernetes](https://www.sysdig.com/blog/alerting-kubernetes)
+- [Groundcover: Kubernetes Alerting](https://www.groundcover.com/kubernetes-monitoring/kubernetes-alerting)
+- [Sematext: 10 Must-Have Kubernetes Alerts](https://sematext.com/blog/top-10-must-have-alerts-for-kubernetes/)
+
+### Logging
+- [Plural: Loki vs ELK for Kubernetes](https://www.plural.sh/blog/loki-vs-elk-kubernetes/)
+- [Loki vs ELK Comparison](https://alexandre-vazquez.com/loki-vs-elk/)
+
+### Testing Pyramid
+- [CircleCI: Testing Pyramid](https://circleci.com/blog/testing-pyramid/)
+- [Semaphore: Testing Pyramid](https://semaphore.io/blog/testing-pyramid)
+- [AWS: Testing Stages in CI/CD](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/testing-stages-in-continuous-integration-and-continuous-delivery.html)
+
+### Homelab/Personal Projects
+- [Prometheus and Grafana Homelab Setup](https://unixorn.github.io/post/homelab/homelab-setup-prometheus-and-grafana/)
+- [Better Stack: Install Prometheus/Grafana with Helm](https://betterstack.com/community/questions/install-prometheus-and-grafana-on-kubernetes-with-helm/)