docs: complete v2.0 CI/CD and observability research

Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfalls: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Thomas Richter
2026-02-03 03:29:23 +01:00
parent 6cdd5aa8c7
commit 5dbabe6a2d
5 changed files with 2401 additions and 3 deletions

@@ -210,5 +210,241 @@ Features to defer until product-market fit is established:
- Evernote features page (verified via WebFetch)
---
*Feature research for: Personal Task/Notes Web App*
*Researched: 2026-01-29*
# CI/CD and Observability Features
**Domain:** CI/CD pipelines and Kubernetes observability for personal project
**Researched:** 2026-02-03
**Context:** Single-user, self-hosted TaskPlanner app with existing basic Gitea Actions pipeline
## Current State
Based on the existing `.gitea/workflows/build.yaml`:
- Build and push Docker images to Gitea Container Registry
- Docker layer caching enabled
- Automatic Helm values update with new image tag
- No tests in pipeline
- No GitOps automation (ArgoCD defined but requires manual sync)
- No observability stack
---
## Table Stakes
Features required for production-grade operations. Missing any of these means the system is incomplete for reliable self-hosting.
### CI/CD Pipeline
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| **Automated tests in pipeline** | Catch bugs before deployment; without tests, pipeline is just a build script | Low | Start with unit tests (70% of test pyramid), add integration tests later |
| **Build caching** | Already have this | - | Using Docker layer cache to registry |
| **Lint/static analysis** | Catch errors early (fail fast principle) | Low | ESLint, TypeScript checking |
| **Pipeline as code** | Already have this | - | Workflow defined in `.gitea/workflows/` |
| **Automated deployment trigger** | Manual `helm upgrade` defeats CI/CD purpose | Low | ArgoCD auto-sync on Git changes |
| **Container image tagging** | Already have this | - | SHA-based tags plus a `latest` tag |
### GitOps
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| **Git as single source of truth** | Core GitOps principle; cluster state should match Git | Low | ArgoCD watches Git repo, syncs to cluster |
| **Auto-sync** | Manual sync defeats GitOps purpose | Low | ArgoCD `syncPolicy.automated.enabled: true` |
| **Self-healing** | Prevents drift; if someone edits a resource with `kubectl`, ArgoCD reverts the change | Low | ArgoCD `selfHeal: true` |
| **Health checks** | Know if deployment succeeded | Low | ArgoCD built-in health status |
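
A minimal sketch of how the auto-sync and self-healing rows above might look in an ArgoCD `Application` manifest. The application name, repository URL, chart path, and namespaces are placeholders, not values taken from this project. This sketch relies on the long-standing behavior where the presence of the `automated` block turns auto-sync on; whether an explicit `enabled` field is also accepted depends on the installed ArgoCD release and should be checked against its documentation.

```yaml
# Sketch only: name, repoURL, path, and namespaces are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: taskplanner
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitea.example.com/user/taskplanner.git
    targetRevision: HEAD
    path: helm/taskplanner
  destination:
    server: https://kubernetes.default.svc
    namespace: taskplanner
  syncPolicy:
    automated:          # presence of this block enables auto-sync
      selfHeal: true    # revert manual changes made directly in the cluster
```

With this policy, a commit that changes the Helm values in Git should be picked up and applied without a manual sync, and ArgoCD's built-in health status reports whether the rollout succeeded.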
### Observability
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| **Basic metrics collection** | Know if app is running, resource usage | Medium | Prometheus + kube-state-metrics |
| **Metrics visualization** | Metrics without dashboards are useless | Low | Grafana with pre-built Kubernetes dashboards |
| **Container logs aggregation** | Debug issues without `kubectl logs` | Medium | Loki (lightweight, label-based) |
| **Basic alerting** | Know when something breaks | Low | AlertManager with 3-5 critical alerts |
---
## Differentiators
Features that add significant value but are not strictly required for a single-user personal app. Implement them for learning or practice, or to improve reliability.
### CI/CD Pipeline
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| **Smoke tests on deploy** | Verify deployment actually works | Medium | Hit health endpoint after deploy |
| **Build notifications** | Know when builds fail without watching | Low | Slack/Discord/email webhook |
| **DORA metrics tracking** | Track deployment frequency, lead time | Medium | Measure CI/CD effectiveness |
| **Parallel test execution** | Faster feedback on larger test suites | Medium | Only valuable with substantial test suite |
| **Dependency vulnerability scanning** | Catch security issues early | Low | `npm audit`, Trivy for container images |
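
As a rough illustration of the dependency vulnerability scanning row, the job below runs `npm audit` and a Trivy image scan. The job name, runner label, and image reference are hypothetical, and the sketch assumes the `trivy` CLI is already installed on the runner.

```yaml
# Sketch only: job name, runner label, and image reference are hypothetical.
# Assumes the trivy CLI is already available on the runner.
jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Audit npm dependencies
        # Fails the step when a high or critical advisory is found
        run: npm audit --audit-level=high
      - name: Scan container image
        # A non-zero exit code fails the job on HIGH or CRITICAL findings
        run: trivy image --severity HIGH,CRITICAL --exit-code 1 gitea.example.com/user/taskplanner:latest
```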
### GitOps
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| **Automated pruning** | Remove resources deleted from Git | Low | ArgoCD `prune: true` |
| **Sync windows** | Control when syncs happen | Low | Useful if you want maintenance windows |
| **Application health dashboard** | Visual cluster state | Low | ArgoCD UI already provides this |
| **Git commit status** | See deployment status in Gitea | Medium | ArgoCD notifications to Git |
### Observability
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| **Application-level metrics** | Track business metrics (tasks created, etc.) | Medium | Custom Prometheus metrics in app |
| **Request tracing** | Debug latency issues | High | OpenTelemetry, Tempo/Jaeger |
| **SLO/SLI dashboards** | Define and track reliability targets | Medium | Error budgets, latency percentiles |
| **Log-based alerting** | Alert on error patterns | Medium | Loki alerting rules |
| **Uptime monitoring** | External availability check | Low | Uptime Kuma or similar |
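
For the log-based alerting row, a Loki ruler rule could look roughly like the sketch below. It assumes the Loki ruler is enabled and that the app's logs carry a `namespace="taskplanner"` label; the rule name, selector, and threshold are illustrative, not taken from this project.

```yaml
# Sketch only: label selector, threshold, and rule names are hypothetical.
groups:
  - name: taskplanner-log-alerts
    rules:
      - alert: HighErrorLogRate
        # Per-second rate of log lines containing "error" over the last 5 minutes
        expr: sum(rate({namespace="taskplanner"} |= "error" [5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Elevated rate of error log lines in taskplanner
```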
---
## Anti-Features
Features that are overkill for a single-user personal app. Actively avoid these to prevent over-engineering.
| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| **Multi-environment promotion (dev/staging/prod)** | Single user, single environment | Deploy directly to prod; use feature flags if needed |
| **Blue-green/canary deployments** | Complex rollout for single user is overkill | Simple rolling update; ArgoCD rollback if needed |
| **Full E2E test suite in CI** | Expensive, slow, diminishing returns for personal app | Unit + smoke tests; manual E2E when needed |
| **High availability ArgoCD** | HA is for multi-team, multi-tenant setups | Single replica ArgoCD is fine |
| **Distributed tracing** | Overkill unless debugging microservices latency | Only add if you have multiple services with latency issues |
| **ELK stack for logging** | Resource-heavy; Elasticsearch needs significant memory | Use Loki instead (label-based, lightweight) |
| **Full APM solution** | DataDog/NewRelic-style solutions are enterprise-focused | Prometheus + Grafana + Loki covers personal needs |
| **Secrets management (Vault)** | Complex for single user with few secrets | Kubernetes secrets or sealed-secrets |
| **Policy enforcement (OPA/Gatekeeper)** | You are the only user; no policy conflicts | Skip entirely |
| **Multi-cluster management** | Single cluster, single app | Skip entirely |
| **Cost optimization/FinOps** | Personal project; cost is fixed/minimal | Skip entirely |
| **AI-assisted observability** | Marketing hype; manual review is fine at this scale | Skip entirely |
---
## Feature Dependencies
```
Automated Tests
       |
       v
Lint/Static Analysis --> Build --> Push Image --> Update Git
                                                      |
                                                      v
                                              ArgoCD Auto-Sync
                                                      |
                                                      v
                                              Health Check Pass
                                                      |
                                                      v
                                             Deployment Complete
                                                      |
                                                      v
                                      Metrics/Logs Available in Grafana
```
Key ordering constraints:
1. Tests before build (fail fast)
2. ArgoCD watches Git, so Git update triggers deploy
3. Observability stack must be deployed before app for metrics collection
---
## MVP Recommendation for CI/CD and Observability
For production-grade operations on a personal project, prioritize in this order:
### Phase 1: GitOps Foundation
1. Enable ArgoCD auto-sync with self-healing
2. Add basic health checks
*Rationale:* Eliminates manual `helm upgrade`, establishes GitOps workflow
### Phase 2: Basic Observability
1. Prometheus + Grafana (kube-prometheus-stack helm chart)
2. Loki for log aggregation
3. 3-5 critical alerts (pod crashes, high memory, app down)
*Rationale:* Can't operate what you can't see; minimum viable observability
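
A possible starting point for the kube-prometheus-stack Helm values in this phase is sketched below. The retention period, storage size, and memory figures are illustrative guesses for a small single-node cluster, not measured requirements.

```yaml
# Sketch only: retention, storage size, and resource figures are illustrative.
prometheus:
  prometheusSpec:
    retention: 7d                 # keep a short history to limit disk use
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 1Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
grafana:
  enabled: true                   # ships with pre-built Kubernetes dashboards
alertmanager:
  enabled: true
```

Loki is installed from a separate chart and is not covered by these values.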
### Phase 3: CI Pipeline Hardening
1. Add unit tests to pipeline
2. Add linting/type checking
3. Smoke test after deploy (optional)
*Rationale:* Tests catch bugs before they reach production
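
The hardening steps above might be wired into the existing workflow along these lines. This sketch assumes a Node.js/TypeScript project whose `package.json` defines `lint` and `test` scripts; the job names, branch, and Node version are hypothetical, and the existing build steps are left as a placeholder comment.

```yaml
# Sketch only: assumes package.json defines "lint" and "test" scripts;
# job names, branch, and Node version are hypothetical.
name: build
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint        # ESLint
      - run: npx tsc --noEmit    # type check without emitting files
      - run: npm test            # unit tests
  build:
    needs: test                  # runs only if lint, type check, and tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # existing Docker build, push, and Helm values update steps go here
```

Using `needs: test` keeps the fail-fast ordering: a failing test or lint error stops the pipeline before an image is built or Git is updated.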
### Defer to Later (if ever)
- Application-level custom metrics
- SLO dashboards
- Advanced alerting
- Request tracing
- Extensive E2E tests
---
## Complexity Budget
For a single-user personal project, the total complexity budget should be LOW-MEDIUM:
| Category | Recommended Complexity | Over-Budget Indicator |
|----------|----------------------|----------------------|
| CI Pipeline | LOW | More than 10 min build time; complex test matrix |
| GitOps | LOW | Multi-environment promotion; complex sync policies |
| Metrics | MEDIUM | Custom exporters; high-cardinality metrics |
| Logging | LOW | Full-text search; complex log parsing |
| Alerting | LOW | More than 10 alerts; complex routing |
| Tracing | SKIP | Any tracing for single-service app |
---
## Essential Alerts for Personal Project
Drawing on the alerting guides listed under Sources, these 5 alerts are sufficient for a single-user app:
| Alert | Condition | Why Critical |
|-------|-----------|--------------|
| **Pod CrashLooping** | restarts > 3 in 15 min | App is failing repeatedly |
| **Pod OOMKilled** | OOM event detected | Memory limits too low or leak |
| **High Memory Usage** | memory > 85% for 5 min | Approaching resource limits |
| **App Unavailable** | probe failures > 3 | Users cannot access app |
| **Disk Running Low** | disk > 80% used | Persistent storage filling up |
**Key principle:** Alerts should be symptom-based and actionable. If an alert fires and you don't need to do anything, remove it.
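
Four of the five alerts above can be approximated with a `PrometheusRule` like the sketch below, assuming kube-state-metrics and kubelet/cAdvisor metrics are scraped, as they are by default in kube-prometheus-stack. The metadata, the `release` label, and the severity labels are hypothetical. The "App Unavailable" alert is omitted because it depends on a probe source, such as a blackbox exporter, that this document does not specify.

```yaml
# Sketch only: metadata, labels, and severities are hypothetical placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: taskplanner-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the rule selector of the installed release
spec:
  groups:
    - name: taskplanner.rules
      rules:
        - alert: PodCrashLooping
          # More than 3 container restarts within 15 minutes
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          labels:
            severity: critical
        - alert: PodOOMKilled
          # Last termination reason of the container was OOMKilled
          expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          labels:
            severity: critical
        - alert: HighMemoryUsage
          # Working set above 85% of the memory limit for 5 minutes
          expr: |
            sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
              /
            sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
              > 0.85
          for: 5m
          labels:
            severity: warning
        - alert: DiskRunningLow
          # Persistent volume more than 80% full
          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80
          labels:
            severity: warning
```

Note that the OOMKilled expression stays true until the container terminates again for a different reason, so in practice it may need to be combined with a recent-restart condition to stay actionable.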
---
## Sources
### CI/CD Best Practices
- [TeamCity CI/CD Guide](https://www.jetbrains.com/teamcity/ci-cd-guide/ci-cd-best-practices/)
- [Spacelift CI/CD Best Practices](https://spacelift.io/blog/ci-cd-best-practices)
- [GitLab CI/CD Best Practices](https://about.gitlab.com/blog/how-to-keep-up-with-ci-cd-best-practices/)
- [AWS CI/CD Best Practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-cicd-litmus/cicd-best-practices.html)
### Observability
- [Kubernetes Observability Trends 2026](https://www.usdsi.org/data-science-insights/kubernetes-observability-and-monitoring-trends-in-2026)
- [Spectro Cloud: Choosing the Right Monitoring Stack](https://www.spectrocloud.com/blog/choosing-the-right-kubernetes-monitoring-stack)
- [ClickHouse: Mastering Kubernetes Observability](https://clickhouse.com/resources/engineering/mastering-kubernetes-observability-guide)
- [Kubernetes Official Observability Docs](https://kubernetes.io/docs/concepts/cluster-administration/observability/)
### ArgoCD/GitOps
- [ArgoCD Auto Sync Documentation](https://argo-cd.readthedocs.io/en/stable/user-guide/auto_sync/)
- [ArgoCD Best Practices](https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/)
- [mkdev: ArgoCD Self-Heal and Sync Windows](https://mkdev.me/posts/argo-cd-self-heal-sync-windows-and-diffing)
### Alerting
- [Sysdig: Alerting on Kubernetes](https://www.sysdig.com/blog/alerting-kubernetes)
- [Groundcover: Kubernetes Alerting](https://www.groundcover.com/kubernetes-monitoring/kubernetes-alerting)
- [Sematext: 10 Must-Have Kubernetes Alerts](https://sematext.com/blog/top-10-must-have-alerts-for-kubernetes/)
### Logging
- [Plural: Loki vs ELK for Kubernetes](https://www.plural.sh/blog/loki-vs-elk-kubernetes/)
- [Loki vs ELK Comparison](https://alexandre-vazquez.com/loki-vs-elk/)
### Testing Pyramid
- [CircleCI: Testing Pyramid](https://circleci.com/blog/testing-pyramid/)
- [Semaphore: Testing Pyramid](https://semaphore.io/blog/testing-pyramid)
- [AWS: Testing Stages in CI/CD](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/testing-stages-in-continuous-integration-and-continuous-delivery.html)
### Homelab/Personal Projects
- [Prometheus and Grafana Homelab Setup](https://unixorn.github.io/post/homelab/homelab-setup-prometheus-and-grafana/)
- [Better Stack: Install Prometheus/Grafana with Helm](https://betterstack.com/community/questions/install-prometheus-and-grafana-on-kubernetes-with-helm/)