docs: complete v2.0 CI/CD and observability research

Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfalls: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Thomas Richter
2026-02-03 03:29:23 +01:00
parent 6cdd5aa8c7
commit 5dbabe6a2d
5 changed files with 2401 additions and 3 deletions

---
*Architecture research for: Personal task/notes web application*
*Researched: 2026-01-29*
---
# v2.0 Architecture: CI/CD and Observability Integration
**Domain:** GitOps CI/CD and Observability Stack
**Researched:** 2026-02-03
**Confidence:** HIGH (verified with official documentation)
## Executive Summary
This section details how ArgoCD, Prometheus, Grafana, and Loki integrate with the existing k3s/Gitea/Traefik architecture. The integration follows established patterns for self-hosted Kubernetes observability stacks, with specific considerations for k3s's lightweight nature and Traefik as the ingress controller.
Key insight: The existing CI/CD foundation (Gitea Actions + ArgoCD Application) is already in place. This milestone adds observability and operational automation rather than building from scratch.
## Current Architecture Overview
```
Internet
|
[Traefik]
(Ingress)
|
+-------------------------+-------------------------+
| | |
task.kube2 git.kube2 (future)
.tricnet.de .tricnet.de argocd/grafana
| |
[TaskPlaner] [Gitea]
(default ns) + Actions
| Runner
| |
[Longhorn PVC] |
(data store) |
v
[Container Registry]
git.kube2.tricnet.de
```
### Existing Components
| Component | Namespace | Purpose | Status |
|-----------|-----------|---------|--------|
| k3s | - | Kubernetes distribution | Running |
| Traefik | kube-system | Ingress controller | Running |
| Longhorn | longhorn-system | Persistent storage | Running |
| cert-manager | cert-manager | TLS certificates | Running |
| Gitea | gitea (assumed) | Git hosting + CI | Running |
| TaskPlaner | default | Application | Running |
| ArgoCD Application | argocd | GitOps deployment | Defined (may need install) |
### Existing CI/CD Pipeline
From `.gitea/workflows/build.yaml`:
1. Push to master triggers Gitea Actions
2. Build Docker image with BuildX
3. Push to Gitea Container Registry
4. Update Helm values.yaml with new image tag
5. Commit with `[skip ci]`
6. ArgoCD detects change and syncs
**Current gap:** ArgoCD may not be installed yet (Application manifest exists but needs ArgoCD server).
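Steps 4-5 of the pipeline typically reduce to a single workflow step; a sketch of what it might look like in `.gitea/workflows/build.yaml` (the step name, the use of `yq`, and the `chart/values.yaml` path are illustrative assumptions, not taken from the actual workflow; Gitea Actions supports the GitHub-compatible `${{ github.sha }}` context):
```yaml
# Illustrative fragment -- paths and step names are assumptions
- name: Update Helm values with new image tag
  env:
    IMAGE_TAG: ${{ github.sha }}
  run: |
    yq -i '.image.tag = strenv(IMAGE_TAG)' chart/values.yaml
    git config user.name "gitea-actions"
    git config user.email "actions@git.kube2.tricnet.de"
    git commit -am "chore: bump image tag to ${IMAGE_TAG} [skip ci]"
    git push
```
The `[skip ci]` marker in the commit message is what prevents this commit from re-triggering the build in step 1.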
## Integration Architecture
### Target State
```
Internet
|
[Traefik]
(Ingress)
|
+----------+----------+----------+----------+----------+
| | | | | |
task.* git.* argocd.* grafana.* (internal)
| | | | |
[TaskPlaner] [Gitea] [ArgoCD] [Grafana] [Prometheus]
| | | | [Loki]
| | | | [Alloy]
| +---webhook---> | |
| | | |
+------ metrics ------+----------+--------->+
+------ logs ---------+---------[Alloy]---->+ (to Loki)
```
### Namespace Strategy
| Namespace | Components | Rationale |
|-----------|------------|-----------|
| `argocd` | ArgoCD server, repo-server, application-controller | Standard convention; ClusterRoleBinding expects this |
| `monitoring` | Prometheus, Grafana, Alertmanager | Consolidate observability; kube-prometheus-stack default |
| `loki` | Loki, Alloy (DaemonSet) | Separate from metrics for resource isolation |
| `default` | TaskPlaner | Existing app deployment |
| `gitea` | Gitea + Actions Runner | Assumed existing |
**Alternative considered:** All observability components in a single namespace
**Decision:** Separate `monitoring` and `loki` because:
- Different scaling characteristics (Alloy is DaemonSet, Prometheus is StatefulSet)
- Easier resource quota management
- Standard community practice
## Component Integration Details
### 1. ArgoCD Integration
**Installation Method:** Helm chart from `argo/argo-cd`
**Integration Points:**
| Integration | How | Configuration |
|-------------|-----|---------------|
| Gitea Repository | HTTPS clone | Repository credential in argocd-secret |
| Gitea Webhook | POST to `/api/webhook` | Cuts sync delay from ArgoCD's ~3-minute default polling interval to seconds |
| Traefik Ingress | IngressRoute or Ingress | `server.insecure=true` to avoid redirect loops |
| TLS | cert-manager annotation | Let's Encrypt via existing cluster-issuer |
**Critical Configuration:**
```yaml
# Helm values for ArgoCD with Traefik
configs:
params:
server.insecure: true # Required: Traefik handles TLS
server:
ingress:
enabled: true
ingressClassName: traefik
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- argocd.kube2.tricnet.de
tls:
- secretName: argocd-tls
hosts:
- argocd.kube2.tricnet.de
```
**Webhook Setup for Gitea:**
1. In ArgoCD secret, set `webhook.gogs.secret` (Gitea uses Gogs-compatible webhooks)
2. In Gitea repository settings, add webhook:
- URL: `https://argocd.kube2.tricnet.de/api/webhook`
- Content type: `application/json`
- Secret: Same as configured in ArgoCD
**Known Limitation:** Webhooks work for Applications but not ApplicationSets with Gitea.
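On the ArgoCD side, the shared secret lives in `argocd-secret`; a minimal sketch (the secret value is a placeholder):
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  # Must match the secret entered in the Gitea webhook settings
  webhook.gogs.secret: <shared-webhook-secret>
```
In practice, patch the existing secret (e.g. `kubectl -n argocd patch secret argocd-secret ...`) rather than applying a fresh manifest, since `argocd-secret` already holds other keys.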
### 2. Prometheus/Grafana Integration (kube-prometheus-stack)
**Installation Method:** Helm chart `prometheus-community/kube-prometheus-stack`
**Integration Points:**
| Integration | How | Configuration |
|-------------|-----|---------------|
| k3s metrics | Exposed kube-* endpoints | k3s config modification required |
| Traefik metrics | ServiceMonitor | Traefik exposes `:9100/metrics` |
| TaskPlaner metrics | ServiceMonitor (future) | App must expose `/metrics` endpoint |
| Grafana UI | Traefik Ingress | Standard Kubernetes Ingress |
**Critical k3s Configuration:**
k3s binds the controller-manager, scheduler, and kube-proxy metrics endpoints to localhost by default. For Prometheus scraping, expose them on 0.0.0.0.
Create/modify `/etc/rancher/k3s/config.yaml`:
```yaml
kube-controller-manager-arg:
- "bind-address=0.0.0.0"
kube-proxy-arg:
- "metrics-bind-address=0.0.0.0"
kube-scheduler-arg:
- "bind-address=0.0.0.0"
```
Then restart k3s: `sudo systemctl restart k3s`
**k3s-specific Helm values:**
```yaml
# Disable etcd monitoring (k3s uses sqlite, not etcd)
defaultRules:
rules:
etcd: false
kubeEtcd:
enabled: false
# Fix endpoint discovery for k3s
kubeControllerManager:
enabled: true
endpoints:
- <k3s-server-ip>
service:
enabled: true
port: 10257
targetPort: 10257
kubeScheduler:
enabled: true
endpoints:
- <k3s-server-ip>
service:
enabled: true
port: 10259
targetPort: 10259
kubeProxy:
enabled: true
endpoints:
- <k3s-server-ip>
service:
enabled: true
port: 10249
targetPort: 10249
# Grafana ingress
grafana:
ingress:
enabled: true
ingressClassName: traefik
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- grafana.kube2.tricnet.de
tls:
- secretName: grafana-tls
hosts:
- grafana.kube2.tricnet.de
```
**ServiceMonitor for TaskPlaner (future):**
Once TaskPlaner exposes `/metrics`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: taskplaner
namespace: monitoring
labels:
release: prometheus # Must match kube-prometheus-stack release
spec:
namespaceSelector:
matchNames:
- default
selector:
matchLabels:
app.kubernetes.io/name: taskplaner
endpoints:
- port: http
path: /metrics
interval: 30s
```
### 3. Loki + Alloy Integration (Log Aggregation)
**Important:** Promtail is deprecated (LTS until Feb 2026, EOL March 2026). Use **Grafana Alloy** instead.
**Installation Method:**
- Loki: Helm chart `grafana/loki` (monolithic mode for single node)
- Alloy: Helm chart `grafana/alloy`
**Integration Points:**
| Integration | How | Configuration |
|-------------|-----|---------------|
| Pod logs | Alloy DaemonSet | Mounts `/var/log/pods` |
| Loki storage | Longhorn PVC or MinIO | Single-binary uses filesystem |
| Grafana datasource | Auto-configured | kube-prometheus-stack integration |
| k3s node logs | Alloy journal reader | journalctl access |
**Deployment Mode Decision:**
| Mode | When to Use | Our Choice |
|------|-------------|------------|
| Monolithic (single-binary) | Small deployments, <100GB/day | **Yes - single node k3s** |
| Simple Scalable | Medium deployments | No |
| Microservices | Large scale, HA required | No |
**Loki Helm values (monolithic):**
```yaml
deploymentMode: SingleBinary
singleBinary:
replicas: 1
persistence:
enabled: true
storageClass: longhorn
size: 10Gi
# Disable components not needed in monolithic
read:
replicas: 0
write:
replicas: 0
backend:
replicas: 0
# Use filesystem storage (not S3/MinIO for simplicity)
loki:
storage:
type: filesystem
schemaConfig:
configs:
- from: "2024-01-01"
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
```
**Alloy DaemonSet Configuration:**
```yaml
# alloy-values.yaml
alloy:
configMap:
create: true
content: |
// Kubernetes logs collection
loki.source.kubernetes "pods" {
targets = discovery.kubernetes.pods.targets
forward_to = [loki.write.default.receiver]
}
// Send to Loki
loki.write "default" {
endpoint {
url = "http://loki.loki.svc.cluster.local:3100/loki/api/v1/push"
}
}
// Kubernetes discovery
discovery.kubernetes "pods" {
role = "pod"
}
```
### 4. Traefik Metrics Integration
Traefik already exposes Prometheus metrics. Enable scraping:
**Option A: ServiceMonitor (if using kube-prometheus-stack)**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: traefik
namespace: monitoring
labels:
release: prometheus
spec:
namespaceSelector:
matchNames:
- kube-system
selector:
matchLabels:
app.kubernetes.io/name: traefik
endpoints:
- port: metrics
path: /metrics
interval: 30s
```
**Option B: Verify Traefik metrics are enabled**
Check that the Traefik args include the following (the k3s-bundled chart exposes metrics on port 9100, matching the table above):
```
--entrypoints.metrics.address=:9100
--metrics.prometheus=true
--metrics.prometheus.entryPoint=metrics
```
## Data Flow Diagrams
### Metrics Flow
```
+------------------+ +------------------+ +------------------+
| TaskPlaner | | Traefik | | k3s core |
| /metrics | | :9100/metrics | | :10249,10257... |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+------------------------+------------------------+
|
v
+-------------------+
| Prometheus |
| (ServiceMonitors) |
+--------+----------+
|
v
+-------------------+
| Grafana |
| (Dashboards) |
+-------------------+
```
### Log Flow
```
+------------------+ +------------------+ +------------------+
| TaskPlaner | | Traefik | | Other Pods |
| stdout/stderr | | access logs | | stdout/stderr |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+------------------------+------------------------+
|
/var/log/pods
|
v
+-------------------+
| Alloy DaemonSet |
| (log collection) |
+--------+----------+
|
v
+-------------------+
| Loki |
| (log storage) |
+--------+----------+
|
v
+-------------------+
| Grafana |
| (log queries) |
+-------------------+
```
### GitOps Flow
```
+------------+ +------------+ +---------------+ +------------+
| Developer | --> | Gitea | --> | Gitea Actions | --> | Container |
| git push | | Repository | | (build.yaml) | | Registry |
+------------+ +-----+------+ +-------+-------+ +------------+
| |
| (update values.yaml)
| |
v v
+------------+ +------------+
| Webhook | ----> | ArgoCD |
| (notify) | | Server |
+------------+ +-----+------+
|
(sync app)
|
v
+------------+
| Kubernetes |
| (deploy) |
+------------+
```
## Build Order (Dependencies)
Based on component dependencies, recommended installation order:
### Phase 1: ArgoCD (no dependencies on observability)
```
1. Install ArgoCD via Helm
- Creates namespace: argocd
- Verify existing Application manifest works
- Configure Gitea webhook
Dependencies: None (Traefik already running)
Validates: GitOps pipeline end-to-end
```
### Phase 2: kube-prometheus-stack (foundational observability)
```
2. Configure k3s metrics exposure
- Modify /etc/rancher/k3s/config.yaml
- Restart k3s
3. Install kube-prometheus-stack via Helm
- Creates namespace: monitoring
- Includes: Prometheus, Grafana, Alertmanager
- Includes: Default dashboards and alerts
Dependencies: k3s metrics exposed
Validates: Basic cluster monitoring working
```
### Phase 3: Loki + Alloy (log aggregation)
```
4. Install Loki via Helm (monolithic mode)
- Creates namespace: loki
- Configure storage with Longhorn
5. Install Alloy via Helm
- DaemonSet in loki namespace
- Configure Kubernetes log discovery
- Point to Loki endpoint
6. Add Loki datasource to Grafana
- URL: http://loki.loki.svc.cluster.local:3100
Dependencies: Grafana from step 3, storage
Validates: Logs visible in Grafana Explore
```
### Phase 4: Application Integration
```
7. Add TaskPlaner metrics endpoint (if not exists)
- Expose /metrics in app
- Create ServiceMonitor
8. Create application dashboards in Grafana
- TaskPlaner-specific metrics
- Request latency, error rates
Dependencies: All previous phases
Validates: Full observability of application
```
## Resource Requirements
| Component | CPU Request | Memory Request | Storage |
|-----------|-------------|----------------|---------|
| ArgoCD (all) | 500m | 512Mi | - |
| Prometheus | 200m | 512Mi | 10Gi (Longhorn) |
| Grafana | 100m | 256Mi | 1Gi (Longhorn) |
| Alertmanager | 50m | 64Mi | 1Gi (Longhorn) |
| Loki | 200m | 256Mi | 10Gi (Longhorn) |
| Alloy (per node) | 100m | 128Mi | - |
**Total additional:** ~1.2 CPU cores, ~1.7Gi RAM, ~22Gi storage
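As a quick sanity check, the column totals can be recomputed from the table:

```python
# Per-component resource requests from the table above
requests = {
    "argocd":       {"cpu_m": 500, "mem_mi": 512, "disk_gi": 0},
    "prometheus":   {"cpu_m": 200, "mem_mi": 512, "disk_gi": 10},
    "grafana":      {"cpu_m": 100, "mem_mi": 256, "disk_gi": 1},
    "alertmanager": {"cpu_m":  50, "mem_mi":  64, "disk_gi": 1},
    "loki":         {"cpu_m": 200, "mem_mi": 256, "disk_gi": 10},
    "alloy":        {"cpu_m": 100, "mem_mi": 128, "disk_gi": 0},  # per node; one node here
}

total_cpu = sum(r["cpu_m"] for r in requests.values()) / 1000   # millicores -> cores
total_mem = sum(r["mem_mi"] for r in requests.values()) / 1024  # Mi -> Gi
total_disk = sum(r["disk_gi"] for r in requests.values())       # Gi

print(f"~{total_cpu:.2f} cores, ~{total_mem:.2f} Gi RAM, {total_disk} Gi storage")
```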
## Security Considerations
### Network Policies
Consider network policies to restrict:
- Prometheus scraping only from monitoring namespace
- Loki ingestion only from Alloy
- Grafana access only via Traefik
### Secrets Management
| Secret | Location | Purpose |
|--------|----------|---------|
| `argocd-initial-admin-secret` | argocd ns | Initial admin password |
| `argocd-secret` | argocd ns | Webhook secrets, repo credentials |
| `grafana-admin` | monitoring ns | Grafana admin password |
### Ingress Authentication
For production, consider:
- ArgoCD: Built-in OIDC/OAuth integration
- Grafana: Built-in auth (local, LDAP, OAuth)
- Prometheus: Traefik BasicAuth middleware (already pattern in use)
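The BasicAuth pattern for Prometheus would look roughly like this (the secret and middleware names are placeholders; the secret's `users` key holds htpasswd-formatted entries):
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: prometheus-basicauth
  namespace: monitoring
spec:
  basicAuth:
    # kubectl -n monitoring create secret generic prometheus-auth \
    #   --from-file=users=<htpasswd file>
    secret: prometheus-auth
```
The Prometheus Ingress then references it via the annotation `traefik.ingress.kubernetes.io/router.middlewares: monitoring-prometheus-basicauth@kubernetescrd`.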
## Anti-Patterns to Avoid
### 1. Skipping k3s Metrics Configuration
**What happens:** Prometheus installs but most dashboards show "No data"
**Prevention:** Configure k3s to expose metrics BEFORE installing kube-prometheus-stack
### 2. Using Promtail Instead of Alloy
**What happens:** Technical debt - Promtail EOL is March 2026
**Prevention:** Use Alloy from the start; migration documentation exists
### 3. Running Loki in Microservices Mode for Small Clusters
**What happens:** Unnecessary complexity, resource overhead
**Prevention:** Monolithic mode for clusters under 100GB/day log volume
### 4. Forgetting server.insecure for ArgoCD with Traefik
**What happens:** Redirect loop (ERR_TOO_MANY_REDIRECTS)
**Prevention:** Always set `configs.params.server.insecure=true` when Traefik handles TLS
### 5. ServiceMonitor Label Mismatch
**What happens:** Prometheus doesn't discover custom ServiceMonitors
**Prevention:** Ensure `release: <helm-release-name>` label matches kube-prometheus-stack release
## Sources
**ArgoCD:**
- [ArgoCD Webhook Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/webhook/)
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [ArgoCD Installation](https://argo-cd.readthedocs.io/en/stable/operator-manual/installation/)
- [Mastering GitOps: ArgoCD and Gitea on Kubernetes](https://blog.stackademic.com/mastering-gitops-a-comprehensive-guide-to-self-hosting-argocd-and-gitea-on-kubernetes-9cdf36856c38)
**Prometheus/Grafana:**
- [kube-prometheus-stack Helm Chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
- [Prometheus on K3s](https://fabianlee.org/2022/07/02/prometheus-installing-kube-prometheus-stack-on-k3s-cluster/)
- [K3s Monitoring Guide](https://github.com/cablespaghetti/k3s-monitoring)
- [ServiceMonitor Explained](https://dkbalachandar.wordpress.com/2025/07/21/kubernetes-servicemonitor-explained-how-to-monitor-services-with-prometheus/)
**Loki/Alloy:**
- [Loki Monolithic Installation](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/)
- [Loki Deployment Modes](https://grafana.com/docs/loki/latest/get-started/deployment-modes/)
- [Migrate from Promtail to Alloy](https://grafana.com/docs/alloy/latest/set-up/migrate/from-promtail/)
- [Grafana Loki 3.4 Release](https://grafana.com/blog/2025/02/13/grafana-loki-3.4-standardized-storage-config-sizing-guidance-and-promtail-merging-into-alloy/)
- [Alloy Replacing Promtail](https://docs-bigbang.dso.mil/latest/docs/adrs/0004-alloy-replacing-promtail/)
**Traefik Integration:**
- [Traefik Metrics with Prometheus](https://traefik.io/blog/capture-traefik-metrics-for-apps-on-kubernetes-with-prometheus)
---
*Last updated: 2026-02-03*