docs: complete v2.0 CI/CD and observability research
Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfall: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -210,5 +210,241 @@ Features to defer until product-market fit is established:
- Evernote features page (verified via WebFetch)

---
*Feature research for: Personal Task/Notes Web App*
*Researched: 2026-01-29*

# CI/CD and Observability Features

**Domain:** CI/CD pipelines and Kubernetes observability for personal project
**Researched:** 2026-02-03
**Context:** Single-user, self-hosted TaskPlanner app with existing basic Gitea Actions pipeline

## Current State

Based on the existing `.gitea/workflows/build.yaml`:
- Build and push Docker images to Gitea Container Registry
- Docker layer caching enabled
- Automatic Helm values update with new image tag
- No tests in pipeline
- No GitOps automation (ArgoCD defined but requires manual sync)
- No observability stack

---

## Table Stakes

Features required for production-grade operations. Missing any of these means the system is incomplete for reliable self-hosting.

### CI/CD Pipeline

| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| **Automated tests in pipeline** | Catch bugs before deployment; without tests, pipeline is just a build script | Low | Start with unit tests (70% of test pyramid), add integration tests later |
| **Build caching** | Already have this | - | Using Docker layer cache to registry |
| **Lint/static analysis** | Catch errors early (fail fast principle) | Low | ESLint, TypeScript checking |
| **Pipeline as code** | Already have this | - | Workflow defined in `.gitea/workflows/` |
| **Automated deployment trigger** | Manual `helm upgrade` defeats CI/CD purpose | Low | ArgoCD auto-sync on Git changes |
| **Container image tagging** | Already have this | - | SHA-based tags with `latest` |

### GitOps

| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| **Git as single source of truth** | Core GitOps principle; cluster state should match Git | Low | ArgoCD watches Git repo, syncs to cluster |
| **Auto-sync** | Manual sync defeats GitOps purpose | Low | ArgoCD `syncPolicy.automated.enabled: true` |
| **Self-healing** | Prevents drift; if someone edits resources with `kubectl`, ArgoCD reverts the change | Low | ArgoCD `selfHeal: true` |
| **Health checks** | Know if deployment succeeded | Low | ArgoCD built-in health status |
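
The auto-sync and self-healing rows above both live in the `syncPolicy` block of an ArgoCD `Application`. The following is a minimal sketch, not the project's actual manifest: the application name, repository URL, chart path, and target namespace are all placeholders.

```yaml
# Sketch only: name, repoURL, path, and namespaces are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: taskplanner
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitea.example.com/me/taskplanner.git  # placeholder
    targetRevision: main
    path: helm/taskplanner                                  # placeholder chart path
  destination:
    server: https://kubernetes.default.svc
    namespace: taskplanner                                  # placeholder
  syncPolicy:
    automated:
      selfHeal: true   # revert changes made directly in the cluster
      # enabled: true  # explicit toggle cited in the table; verify your ArgoCD version supports it
```

Health status needs no extra configuration here; ArgoCD reports it for the resources the application manages.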

### Observability

| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| **Basic metrics collection** | Know if app is running, resource usage | Medium | Prometheus + kube-state-metrics |
| **Metrics visualization** | Metrics without dashboards are useless | Low | Grafana with pre-built Kubernetes dashboards |
| **Container logs aggregation** | Debug issues without `kubectl logs` | Medium | Loki (lightweight, label-based) |
| **Basic alerting** | Know when something breaks | Low | Alertmanager with 3-5 critical alerts |

---

## Differentiators

Features that add significant value but are not strictly required for a single-user personal app. Implement them for learning, practice, or improved reliability.

### CI/CD Pipeline

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| **Smoke tests on deploy** | Verify deployment actually works | Medium | Hit health endpoint after deploy |
| **Build notifications** | Know when builds fail without watching | Low | Slack/Discord/email webhook |
| **DORA metrics tracking** | Track deployment frequency, lead time | Medium | Measure CI/CD effectiveness |
| **Parallel test execution** | Faster feedback on larger test suites | Medium | Only valuable with substantial test suite |
| **Dependency vulnerability scanning** | Catch security issues early | Low | `npm audit`, Trivy for container images |
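
Two of these rows (vulnerability scanning and smoke tests) reduce to a few extra workflow steps. The fragment below is a sketch under assumptions: the image reference and health URL are placeholders, the `trivy` CLI is assumed to be installed on the runner, and `${{ github.sha }}` assumes the runner exposes the GitHub-compatible context.

```yaml
# Sketch of additional workflow steps; image name and URL are placeholders.
steps:
  - name: Audit npm dependencies
    run: npm audit --audit-level=high

  - name: Scan container image with Trivy
    # Assumes the trivy CLI is already available on the runner.
    run: trivy image --severity HIGH,CRITICAL --exit-code 1 gitea.example.com/me/taskplanner:${{ github.sha }}

  - name: Smoke test after deploy
    # ArgoCD syncs asynchronously, so retry for a while before failing the job.
    run: curl --fail --silent --show-error --retry 10 --retry-delay 15 --retry-all-errors https://tasks.example.com/health
```

The retry window on the smoke test is a guess; it should be at least as long as a typical ArgoCD sync plus pod startup.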

### GitOps

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| **Automated pruning** | Remove resources deleted from Git | Low | ArgoCD `prune: true` |
| **Sync windows** | Control when syncs happen | Low | Useful if you want maintenance windows |
| **Application health dashboard** | Visual cluster state | Low | ArgoCD UI already provides this |
| **Git commit status** | See deployment status in Gitea | Medium | ArgoCD notifications to Git |

### Observability

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| **Application-level metrics** | Track business metrics (tasks created, etc.) | Medium | Custom Prometheus metrics in app |
| **Request tracing** | Debug latency issues | High | OpenTelemetry, Tempo/Jaeger |
| **SLO/SLI dashboards** | Define and track reliability targets | Medium | Error budgets, latency percentiles |
| **Log-based alerting** | Alert on error patterns | Medium | Loki alerting rules |
| **Uptime monitoring** | External availability check | Low | Uptime Kuma or similar |

---

## Anti-Features

Features that are overkill for a single-user personal app. Actively avoid these to prevent over-engineering.

| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| **Multi-environment promotion (dev/staging/prod)** | Single user, single environment | Deploy directly to prod; use feature flags if needed |
| **Blue-green/canary deployments** | Complex rollout for single user is overkill | Simple rolling update; ArgoCD rollback if needed |
| **Full E2E test suite in CI** | Expensive, slow, diminishing returns for personal app | Unit + smoke tests; manual E2E when needed |
| **High availability ArgoCD** | HA is for multi-team, multi-tenant | Single replica ArgoCD is fine |
| **Distributed tracing** | Overkill unless debugging microservices latency | Only add if you have multiple services with latency issues |
| **ELK stack for logging** | Resource-heavy; Elasticsearch needs significant memory | Use Loki instead (label-based, lightweight) |
| **Full APM solution** | DataDog/NewRelic-style solutions are enterprise-focused | Prometheus + Grafana + Loki covers personal needs |
| **Secrets management (Vault)** | Complex for single user with few secrets | Kubernetes secrets or sealed-secrets |
| **Policy enforcement (OPA/Gatekeeper)** | You are the only user; no policy conflicts | Skip entirely |
| **Multi-cluster management** | Single cluster, single app | Skip entirely |
| **Cost optimization/FinOps** | Personal project; cost is fixed/minimal | Skip entirely |
| **AI-assisted observability** | Marketing hype; manual review is fine at this scale | Skip entirely |

---

## Feature Dependencies

```
Automated Tests
       |
       v
Lint/Static Analysis --> Build --> Push Image --> Update Git
                                                      |
                                                      v
                                              ArgoCD Auto-Sync
                                                      |
                                                      v
                                              Health Check Pass
                                                      |
                                                      v
                                             Deployment Complete
                                                      |
                                                      v
                                      Metrics/Logs Available in Grafana
```

Key ordering constraints:
1. Tests before build (fail fast)
2. ArgoCD watches Git, so Git update triggers deploy
3. Observability stack must be deployed before app for metrics collection

---

## MVP Recommendation for CI/CD and Observability

For production-grade operations on a personal project, prioritize in this order:

### Phase 1: GitOps Foundation
1. Enable ArgoCD auto-sync with self-healing
2. Add basic health checks

*Rationale:* Eliminates manual `helm upgrade`, establishes GitOps workflow

### Phase 2: Basic Observability
1. Prometheus + Grafana (kube-prometheus-stack helm chart)
2. Loki for log aggregation
3. 3-5 critical alerts (pod crashes, high memory, app down)

*Rationale:* Can't operate what you can't see; minimum viable observability
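
For the first item, a small values file is usually enough to keep kube-prometheus-stack within a personal-project footprint. This is a sketch, not a tested configuration: the retention and memory figures are guesses, and the disabled control-plane scrape targets assume a k3s cluster (as the commit message suggests), where those components do not expose separate endpoints by default. Loki is installed from its own chart and is not shown.

```yaml
# values.yaml sketch for the kube-prometheus-stack chart (sizes are guesses).
grafana:
  enabled: true
alertmanager:
  enabled: true
prometheus:
  prometheusSpec:
    retention: 7d        # short retention keeps disk usage bounded
    retentionSize: 5GB
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 1Gi
# Assumption: k3s, where the control-plane components run in a single
# process and the chart's default scrape targets for them show as down.
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false
```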

### Phase 3: CI Pipeline Hardening
1. Add unit tests to pipeline
2. Add linting/type checking
3. Smoke test after deploy (optional)

*Rationale:* Tests catch bugs before they reach production
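
The first two items can be sketched as a separate `test` job that the existing build job depends on. The npm script names, Node version, and runner label below are assumptions about the project, not taken from the current workflow.

```yaml
# Sketch of a test job for .gitea/workflows/build.yaml.
# Script names, Node version, and runner label are assumptions.
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint        # hypothetical script wrapping ESLint
      - run: npx tsc --noEmit    # type check without emitting files
      - run: npm test            # hypothetical unit test script

  # In the existing build job (name assumed), add:
  #   needs: test
  # so the image is only built and pushed when the checks pass.
```

Keeping the checks in their own job preserves the fail-fast ordering: a lint or test failure stops the run before any image is built.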

### Defer to Later (if ever)
- Application-level custom metrics
- SLO dashboards
- Advanced alerting
- Request tracing
- Extensive E2E tests

---

## Complexity Budget

For a single-user personal project, the total complexity budget should be LOW-MEDIUM:

| Category | Recommended Complexity | Over-Budget Indicator |
|----------|----------------------|----------------------|
| CI Pipeline | LOW | More than 10 min build time; complex test matrix |
| GitOps | LOW | Multi-environment promotion; complex sync policies |
| Metrics | MEDIUM | Custom exporters; high-cardinality metrics |
| Logging | LOW | Full-text search; complex log parsing |
| Alerting | LOW | More than 10 alerts; complex routing |
| Tracing | SKIP | Any tracing for single-service app |

---

## Essential Alerts for Personal Project

Based on best practices, these 5 alerts are sufficient for a single-user app:

| Alert | Condition | Why Critical |
|-------|-----------|--------------|
| **Pod CrashLooping** | restarts > 3 in 15 min | App is failing repeatedly |
| **Pod OOMKilled** | OOM event detected | Memory limits too low or leak |
| **High Memory Usage** | memory > 85% for 5 min | Approaching resource limits |
| **App Unavailable** | probe failures > 3 | Users cannot access app |
| **Disk Running Low** | disk > 80% used | Persistent storage filling up |

**Key principle:** Alerts should be symptom-based and actionable. If an alert fires and you don't need to do anything, remove it.
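
The five conditions above could be expressed as a `PrometheusRule` along the following lines. Treat it as a sketch: the namespaces, deployment name, and `release` label are placeholders, and the availability rule substitutes "no available replicas" for the probe-failure count in the table, since counting probe failures would need an extra exporter.

```yaml
# Sketch only: namespaces, names, and the release label are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: taskplanner-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # assumed Helm release name; must match the chart's rule selector
spec:
  groups:
    - name: taskplanner.essential
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total{namespace="taskplanner"}[15m]) > 3
          labels:
            severity: critical
        - alert: PodOOMKilled
          # Last termination was an OOM kill and the container restarted recently.
          expr: |
            kube_pod_container_status_last_terminated_reason{namespace="taskplanner", reason="OOMKilled"} == 1
            and on (namespace, pod, container)
            increase(kube_pod_container_status_restarts_total{namespace="taskplanner"}[15m]) > 0
          labels:
            severity: critical
        - alert: HighMemoryUsage
          expr: |
            sum by (namespace, pod, container) (container_memory_working_set_bytes{namespace="taskplanner", container!=""})
              /
            sum by (namespace, pod, container) (kube_pod_container_resource_limits{namespace="taskplanner", resource="memory"})
              > 0.85
          for: 5m
          labels:
            severity: warning
        - alert: AppUnavailable
          # Stand-in for "probe failures > 3": fires when no replica is available.
          expr: kube_deployment_status_replicas_available{namespace="taskplanner", deployment="taskplanner"} < 1
          for: 3m
          labels:
            severity: critical
        - alert: DiskRunningLow
          expr: |
            kubelet_volume_stats_used_bytes{namespace="taskplanner"}
              /
            kubelet_volume_stats_capacity_bytes{namespace="taskplanner"}
              > 0.80
          for: 10m
          labels:
            severity: warning
```

The `for` durations on the last two rules are guesses; they exist only to avoid alerting on brief blips.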