docs: complete v2.0 CI/CD and observability research

Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfalls: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Thomas Richter
2026-02-03 03:29:23 +01:00
parent 6cdd5aa8c7
commit 5dbabe6a2d
5 changed files with 2401 additions and 3 deletions

---
*Architecture research for: Personal task/notes web application*
*Researched: 2026-01-29*
---
# v2.0 Architecture: CI/CD and Observability Integration
**Domain:** GitOps CI/CD and Observability Stack
**Researched:** 2026-02-03
**Confidence:** HIGH (verified with official documentation)
## Executive Summary
This section details how ArgoCD, Prometheus, Grafana, and Loki integrate with the existing k3s/Gitea/Traefik architecture. The integration follows established patterns for self-hosted Kubernetes observability stacks, with specific considerations for k3s's lightweight nature and Traefik as the ingress controller.
Key insight: The existing CI/CD foundation (Gitea Actions + ArgoCD Application) is already in place. This milestone adds observability and operational automation rather than building from scratch.
## Current Architecture Overview
```
Internet
|
[Traefik]
(Ingress)
|
+-------------------------+-------------------------+
| | |
task.kube2 git.kube2 (future)
.tricnet.de .tricnet.de argocd/grafana
| |
[TaskPlaner] [Gitea]
(default ns) + Actions
| Runner
| |
[Longhorn PVC] |
(data store) |
v
[Container Registry]
git.kube2.tricnet.de
```
### Existing Components
| Component | Namespace | Purpose | Status |
|-----------|-----------|---------|--------|
| k3s | - | Kubernetes distribution | Running |
| Traefik | kube-system | Ingress controller | Running |
| Longhorn | longhorn-system | Persistent storage | Running |
| cert-manager | cert-manager | TLS certificates | Running |
| Gitea | gitea (assumed) | Git hosting + CI | Running |
| TaskPlaner | default | Application | Running |
| ArgoCD Application | argocd | GitOps deployment | Defined (may need install) |
### Existing CI/CD Pipeline
From `.gitea/workflows/build.yaml`:
1. Push to master triggers Gitea Actions
2. Build Docker image with BuildX
3. Push to Gitea Container Registry
4. Update Helm values.yaml with new image tag
5. Commit with `[skip ci]`
6. ArgoCD detects change and syncs
**Current gap:** ArgoCD may not be installed yet (Application manifest exists but needs ArgoCD server).
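Steps 4-5 of the pipeline typically reduce to a single workflow step; a sketch of what it might look like in `.gitea/workflows/build.yaml` (the step name, the use of `yq`, and the `chart/values.yaml` path are illustrative assumptions, not taken from the actual workflow; Gitea Actions supports the GitHub-compatible `${{ github.sha }}` context):
```yaml
# Illustrative fragment -- paths and step names are assumptions
- name: Update Helm values with new image tag
  env:
    IMAGE_TAG: ${{ github.sha }}
  run: |
    yq -i '.image.tag = strenv(IMAGE_TAG)' chart/values.yaml
    git config user.name "gitea-actions"
    git config user.email "actions@git.kube2.tricnet.de"
    git commit -am "chore: bump image tag to ${IMAGE_TAG} [skip ci]"
    git push
```
The `[skip ci]` marker in the commit message is what prevents this commit from re-triggering the build in step 1.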
## Integration Architecture
### Target State
```
Internet
|
[Traefik]
(Ingress)
|
+----------+----------+----------+----------+----------+
| | | | | |
task.* git.* argocd.* grafana.* (internal)
| | | | |
[TaskPlaner] [Gitea] [ArgoCD] [Grafana] [Prometheus]
| | | | [Loki]
| | | | [Alloy]
| +---webhook---> | |
| | | |
+------ metrics ------+----------+--------->+
+------ logs ---------+---------[Alloy]---->+ (to Loki)
```
### Namespace Strategy
| Namespace | Components | Rationale |
|-----------|------------|-----------|
| `argocd` | ArgoCD server, repo-server, application-controller | Standard convention; ClusterRoleBinding expects this |
| `monitoring` | Prometheus, Grafana, Alertmanager | Consolidate observability; kube-prometheus-stack default |
| `loki` | Loki, Alloy (DaemonSet) | Separate from metrics for resource isolation |
| `default` | TaskPlaner | Existing app deployment |
| `gitea` | Gitea + Actions Runner | Assumed existing |
**Alternative considered:** All observability components in a single namespace
**Decision:** Separate `monitoring` and `loki` because:
- Different scaling characteristics (Alloy is DaemonSet, Prometheus is StatefulSet)
- Easier resource quota management
- Standard community practice
## Component Integration Details
### 1. ArgoCD Integration
**Installation Method:** Helm chart from `argo/argo-cd`
**Integration Points:**
| Integration | How | Configuration |
|-------------|-----|---------------|
| Gitea Repository | HTTPS clone | Repository credential in argocd-secret |
| Gitea Webhook | POST to `/api/webhook` | Cuts sync delay from ArgoCD's ~3-minute default polling interval to seconds |
| Traefik Ingress | IngressRoute or Ingress | `server.insecure=true` to avoid redirect loops |
| TLS | cert-manager annotation | Let's Encrypt via existing cluster-issuer |
**Critical Configuration:**
```yaml
# Helm values for ArgoCD with Traefik
configs:
params:
server.insecure: true # Required: Traefik handles TLS
server:
ingress:
enabled: true
ingressClassName: traefik
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- argocd.kube2.tricnet.de
tls:
- secretName: argocd-tls
hosts:
- argocd.kube2.tricnet.de
```
**Webhook Setup for Gitea:**
1. In ArgoCD secret, set `webhook.gogs.secret` (Gitea uses Gogs-compatible webhooks)
2. In Gitea repository settings, add webhook:
- URL: `https://argocd.kube2.tricnet.de/api/webhook`
- Content type: `application/json`
- Secret: Same as configured in ArgoCD
**Known Limitation:** Webhooks work for Applications but not ApplicationSets with Gitea.
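On the ArgoCD side, the shared secret lives in `argocd-secret`; a minimal sketch (the secret value is a placeholder):
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  # Must match the secret entered in the Gitea webhook settings
  webhook.gogs.secret: <shared-webhook-secret>
```
In practice, patch the existing secret (e.g. `kubectl -n argocd patch secret argocd-secret ...`) rather than applying a fresh manifest, since `argocd-secret` already holds other keys.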
### 2. Prometheus/Grafana Integration (kube-prometheus-stack)
**Installation Method:** Helm chart `prometheus-community/kube-prometheus-stack`
**Integration Points:**
| Integration | How | Configuration |
|-------------|-----|---------------|
| k3s metrics | Exposed kube-* endpoints | k3s config modification required |
| Traefik metrics | ServiceMonitor | Traefik exposes `:9100/metrics` |
| TaskPlaner metrics | ServiceMonitor (future) | App must expose `/metrics` endpoint |
| Grafana UI | Traefik Ingress | Standard Kubernetes Ingress |
**Critical k3s Configuration:**
k3s binds the controller-manager, scheduler, and kube-proxy metrics endpoints to localhost by default. For Prometheus scraping, expose them on 0.0.0.0.
Create/modify `/etc/rancher/k3s/config.yaml`:
```yaml
kube-controller-manager-arg:
- "bind-address=0.0.0.0"
kube-proxy-arg:
- "metrics-bind-address=0.0.0.0"
kube-scheduler-arg:
- "bind-address=0.0.0.0"
```
Then restart k3s: `sudo systemctl restart k3s`
**k3s-specific Helm values:**
```yaml
# Disable etcd monitoring (k3s uses sqlite, not etcd)
defaultRules:
rules:
etcd: false
kubeEtcd:
enabled: false
# Fix endpoint discovery for k3s
kubeControllerManager:
enabled: true
endpoints:
- <k3s-server-ip>
service:
enabled: true
port: 10257
targetPort: 10257
kubeScheduler:
enabled: true
endpoints:
- <k3s-server-ip>
service:
enabled: true
port: 10259
targetPort: 10259
kubeProxy:
enabled: true
endpoints:
- <k3s-server-ip>
service:
enabled: true
port: 10249
targetPort: 10249
# Grafana ingress
grafana:
ingress:
enabled: true
ingressClassName: traefik
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- grafana.kube2.tricnet.de
tls:
- secretName: grafana-tls
hosts:
- grafana.kube2.tricnet.de
```
**ServiceMonitor for TaskPlaner (future):**
Once TaskPlaner exposes `/metrics`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: taskplaner
namespace: monitoring
labels:
release: prometheus # Must match kube-prometheus-stack release
spec:
namespaceSelector:
matchNames:
- default
selector:
matchLabels:
app.kubernetes.io/name: taskplaner
endpoints:
- port: http
path: /metrics
interval: 30s
```
### 3. Loki + Alloy Integration (Log Aggregation)
**Important:** Promtail is deprecated (LTS until Feb 2026, EOL March 2026). Use **Grafana Alloy** instead.
**Installation Method:**
- Loki: Helm chart `grafana/loki` (monolithic mode for single node)
- Alloy: Helm chart `grafana/alloy`
**Integration Points:**
| Integration | How | Configuration |
|-------------|-----|---------------|
| Pod logs | Alloy DaemonSet | Mounts `/var/log/pods` |
| Loki storage | Longhorn PVC or MinIO | Single-binary uses filesystem |
| Grafana datasource | Auto-configured | kube-prometheus-stack integration |
| k3s node logs | Alloy journal reader | journalctl access |
**Deployment Mode Decision:**
| Mode | When to Use | Our Choice |
|------|-------------|------------|
| Monolithic (single-binary) | Small deployments, <100GB/day | **Yes - single node k3s** |
| Simple Scalable | Medium deployments | No |
| Microservices | Large scale, HA required | No |
**Loki Helm values (monolithic):**
```yaml
deploymentMode: SingleBinary
singleBinary:
replicas: 1
persistence:
enabled: true
storageClass: longhorn
size: 10Gi
# Disable components not needed in monolithic
read:
replicas: 0
write:
replicas: 0
backend:
replicas: 0
# Use filesystem storage (not S3/MinIO for simplicity)
loki:
storage:
type: filesystem
schemaConfig:
configs:
- from: "2024-01-01"
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
```
**Alloy DaemonSet Configuration:**
```yaml
# alloy-values.yaml
alloy:
configMap:
create: true
content: |
// Kubernetes logs collection
loki.source.kubernetes "pods" {
targets = discovery.kubernetes.pods.targets
forward_to = [loki.write.default.receiver]
}
// Send to Loki
loki.write "default" {
endpoint {
url = "http://loki.loki.svc.cluster.local:3100/loki/api/v1/push"
}
}
// Kubernetes discovery
discovery.kubernetes "pods" {
role = "pod"
}
```
### 4. Traefik Metrics Integration
Traefik already exposes Prometheus metrics. Enable scraping:
**Option A: ServiceMonitor (if using kube-prometheus-stack)**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: traefik
namespace: monitoring
labels:
release: prometheus
spec:
namespaceSelector:
matchNames:
- kube-system
selector:
matchLabels:
app.kubernetes.io/name: traefik
endpoints:
- port: metrics
path: /metrics
interval: 30s
```
**Option B: Verify Traefik metrics are enabled**
Check that the Traefik args include the following (the k3s-bundled chart exposes metrics on port 9100, matching the table above):
```
--entrypoints.metrics.address=:9100
--metrics.prometheus=true
--metrics.prometheus.entryPoint=metrics
```
## Data Flow Diagrams
### Metrics Flow
```
+------------------+ +------------------+ +------------------+
| TaskPlaner | | Traefik | | k3s core |
| /metrics | | :9100/metrics | | :10249,10257... |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+------------------------+------------------------+
|
v
+-------------------+
| Prometheus |
| (ServiceMonitors) |
+--------+----------+
|
v
+-------------------+
| Grafana |
| (Dashboards) |
+-------------------+
```
### Log Flow
```
+------------------+ +------------------+ +------------------+
| TaskPlaner | | Traefik | | Other Pods |
| stdout/stderr | | access logs | | stdout/stderr |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+------------------------+------------------------+
|
/var/log/pods
|
v
+-------------------+
| Alloy DaemonSet |
| (log collection) |
+--------+----------+
|
v
+-------------------+
| Loki |
| (log storage) |
+--------+----------+
|
v
+-------------------+
| Grafana |
| (log queries) |
+-------------------+
```
### GitOps Flow
```
+------------+ +------------+ +---------------+ +------------+
| Developer | --> | Gitea | --> | Gitea Actions | --> | Container |
| git push | | Repository | | (build.yaml) | | Registry |
+------------+ +-----+------+ +-------+-------+ +------------+
| |
| (update values.yaml)
| |
v v
+------------+ +------------+
| Webhook | ----> | ArgoCD |
| (notify) | | Server |
+------------+ +-----+------+
|
(sync app)
|
v
+------------+
| Kubernetes |
| (deploy) |
+------------+
```
## Build Order (Dependencies)
Based on component dependencies, recommended installation order:
### Phase 1: ArgoCD (no dependencies on observability)
```
1. Install ArgoCD via Helm
- Creates namespace: argocd
- Verify existing Application manifest works
- Configure Gitea webhook
Dependencies: None (Traefik already running)
Validates: GitOps pipeline end-to-end
```
### Phase 2: kube-prometheus-stack (foundational observability)
```
2. Configure k3s metrics exposure
- Modify /etc/rancher/k3s/config.yaml
- Restart k3s
3. Install kube-prometheus-stack via Helm
- Creates namespace: monitoring
- Includes: Prometheus, Grafana, Alertmanager
- Includes: Default dashboards and alerts
Dependencies: k3s metrics exposed
Validates: Basic cluster monitoring working
```
### Phase 3: Loki + Alloy (log aggregation)
```
4. Install Loki via Helm (monolithic mode)
- Creates namespace: loki
- Configure storage with Longhorn
5. Install Alloy via Helm
- DaemonSet in loki namespace
- Configure Kubernetes log discovery
- Point to Loki endpoint
6. Add Loki datasource to Grafana
- URL: http://loki.loki.svc.cluster.local:3100
Dependencies: Grafana from step 3, storage
Validates: Logs visible in Grafana Explore
```
### Phase 4: Application Integration
```
7. Add TaskPlaner metrics endpoint (if not exists)
- Expose /metrics in app
- Create ServiceMonitor
8. Create application dashboards in Grafana
- TaskPlaner-specific metrics
- Request latency, error rates
Dependencies: All previous phases
Validates: Full observability of application
```
## Resource Requirements
| Component | CPU Request | Memory Request | Storage |
|-----------|-------------|----------------|---------|
| ArgoCD (all) | 500m | 512Mi | - |
| Prometheus | 200m | 512Mi | 10Gi (Longhorn) |
| Grafana | 100m | 256Mi | 1Gi (Longhorn) |
| Alertmanager | 50m | 64Mi | 1Gi (Longhorn) |
| Loki | 200m | 256Mi | 10Gi (Longhorn) |
| Alloy (per node) | 100m | 128Mi | - |
**Total additional:** ~1.2 CPU cores, ~1.7Gi RAM, ~22Gi storage
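As a quick sanity check, the column totals can be recomputed from the table:

```python
# Per-component resource requests from the table above
requests = {
    "argocd":       {"cpu_m": 500, "mem_mi": 512, "disk_gi": 0},
    "prometheus":   {"cpu_m": 200, "mem_mi": 512, "disk_gi": 10},
    "grafana":      {"cpu_m": 100, "mem_mi": 256, "disk_gi": 1},
    "alertmanager": {"cpu_m":  50, "mem_mi":  64, "disk_gi": 1},
    "loki":         {"cpu_m": 200, "mem_mi": 256, "disk_gi": 10},
    "alloy":        {"cpu_m": 100, "mem_mi": 128, "disk_gi": 0},  # per node; one node here
}

total_cpu = sum(r["cpu_m"] for r in requests.values()) / 1000   # millicores -> cores
total_mem = sum(r["mem_mi"] for r in requests.values()) / 1024  # Mi -> Gi
total_disk = sum(r["disk_gi"] for r in requests.values())       # Gi

print(f"~{total_cpu:.2f} cores, ~{total_mem:.2f} Gi RAM, {total_disk} Gi storage")
```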
## Security Considerations
### Network Policies
Consider network policies to restrict:
- Prometheus scraping only from monitoring namespace
- Loki ingestion only from Alloy
- Grafana access only via Traefik
### Secrets Management
| Secret | Location | Purpose |
|--------|----------|---------|
| `argocd-initial-admin-secret` | argocd ns | Initial admin password |
| `argocd-secret` | argocd ns | Webhook secrets, repo credentials |
| `grafana-admin` | monitoring ns | Grafana admin password |
### Ingress Authentication
For production, consider:
- ArgoCD: Built-in OIDC/OAuth integration
- Grafana: Built-in auth (local, LDAP, OAuth)
- Prometheus: Traefik BasicAuth middleware (already pattern in use)
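The BasicAuth pattern for Prometheus would look roughly like this (the secret and middleware names are placeholders; the secret's `users` key holds htpasswd-formatted entries):
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: prometheus-basicauth
  namespace: monitoring
spec:
  basicAuth:
    # kubectl -n monitoring create secret generic prometheus-auth \
    #   --from-file=users=<htpasswd file>
    secret: prometheus-auth
```
The Prometheus Ingress then references it via the annotation `traefik.ingress.kubernetes.io/router.middlewares: monitoring-prometheus-basicauth@kubernetescrd`.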
## Anti-Patterns to Avoid
### 1. Skipping k3s Metrics Configuration
**What happens:** Prometheus installs but most dashboards show "No data"
**Prevention:** Configure k3s to expose metrics BEFORE installing kube-prometheus-stack
### 2. Using Promtail Instead of Alloy
**What happens:** Technical debt - Promtail EOL is March 2026
**Prevention:** Use Alloy from the start; migration documentation exists
### 3. Running Loki in Microservices Mode for Small Clusters
**What happens:** Unnecessary complexity, resource overhead
**Prevention:** Monolithic mode for clusters under 100GB/day log volume
### 4. Forgetting server.insecure for ArgoCD with Traefik
**What happens:** Redirect loop (ERR_TOO_MANY_REDIRECTS)
**Prevention:** Always set `configs.params.server.insecure=true` when Traefik handles TLS
### 5. ServiceMonitor Label Mismatch
**What happens:** Prometheus doesn't discover custom ServiceMonitors
**Prevention:** Ensure `release: <helm-release-name>` label matches kube-prometheus-stack release
## Sources
**ArgoCD:**
- [ArgoCD Webhook Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/webhook/)
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [ArgoCD Installation](https://argo-cd.readthedocs.io/en/stable/operator-manual/installation/)
- [Mastering GitOps: ArgoCD and Gitea on Kubernetes](https://blog.stackademic.com/mastering-gitops-a-comprehensive-guide-to-self-hosting-argocd-and-gitea-on-kubernetes-9cdf36856c38)
**Prometheus/Grafana:**
- [kube-prometheus-stack Helm Chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
- [Prometheus on K3s](https://fabianlee.org/2022/07/02/prometheus-installing-kube-prometheus-stack-on-k3s-cluster/)
- [K3s Monitoring Guide](https://github.com/cablespaghetti/k3s-monitoring)
- [ServiceMonitor Explained](https://dkbalachandar.wordpress.com/2025/07/21/kubernetes-servicemonitor-explained-how-to-monitor-services-with-prometheus/)
**Loki/Alloy:**
- [Loki Monolithic Installation](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/)
- [Loki Deployment Modes](https://grafana.com/docs/loki/latest/get-started/deployment-modes/)
- [Migrate from Promtail to Alloy](https://grafana.com/docs/alloy/latest/set-up/migrate/from-promtail/)
- [Grafana Loki 3.4 Release](https://grafana.com/blog/2025/02/13/grafana-loki-3.4-standardized-storage-config-sizing-guidance-and-promtail-merging-into-alloy/)
- [Alloy Replacing Promtail](https://docs-bigbang.dso.mil/latest/docs/adrs/0004-alloy-replacing-promtail/)
**Traefik Integration:**
- [Traefik Metrics with Prometheus](https://traefik.io/blog/capture-traefik-metrics-for-apps-on-kubernetes-with-prometheus)
---
*Last updated: 2026-02-03*