Compare commits


10 Commits

Author SHA1 Message Date
Thomas Richter
eff251ca70 feat(07-01): update ArgoCD application for internal cluster access
- Change repoURL to internal Gitea cluster service
- Remove inline registry secret placeholder (created via kubectl)
- Registry secret created separately for security
2026-02-03 15:07:40 +01:00
Thomas Richter
54f933b1f7 chore(07-01): add ArgoCD repository secret documentation
- Document taskplaner-repo secret structure for ArgoCD
- Secret created via kubectl (not committed) for security
- Uses internal cluster URL for Gitea access
2026-02-03 15:07:05 +01:00
Thomas Richter
1d4302d5bf docs(07): create phase plan
Phase 07: GitOps Foundation
- 2 plan(s) in 2 wave(s)
- Wave 1: 07-01 (register application)
- Wave 2: 07-02 (verify gitops behavior)
- Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 14:54:41 +01:00
Thomas Richter
c1c46d9581 docs(07): capture phase context
Phase 07: GitOps Foundation
- ArgoCD already installed, UI accessible
- Apply TaskPlanner Application manifest
- Verify sync, self-heal, auto-deploy
2026-02-03 14:50:19 +01:00
Thomas Richter
27ed813413 docs: create milestone v2.0 roadmap (3 phases)
Phases:
7. GitOps Foundation: ArgoCD installation and configuration
8. Observability Stack: Prometheus/Grafana/Loki + alerts
9. CI Pipeline Hardening: Vitest, Playwright, type checking

All 17 requirements mapped to phases.
2026-02-03 14:41:43 +01:00
Thomas Richter
34b1c05146 docs: define milestone v2.0 requirements
17 requirements across 3 categories:
- GitOps: 4 (ArgoCD installation and configuration)
- Observability: 8 (Prometheus/Grafana/Loki stack + app metrics)
- CI Testing: 5 (Vitest, Playwright, type checking)
2026-02-03 13:27:31 +01:00
Thomas Richter
5dbabe6a2d docs: complete v2.0 CI/CD and observability research
Files:
- STACK-v2-cicd-observability.md (ArgoCD, Prometheus, Loki, Alloy)
- FEATURES.md (updated with CI/CD and observability section)
- ARCHITECTURE.md (updated with v2.0 integration architecture)
- PITFALLS-CICD-OBSERVABILITY.md (14 critical/moderate/minor pitfalls)
- SUMMARY-v2-cicd-observability.md (synthesis with roadmap implications)

Key findings:
- Stack: kube-prometheus-stack + Loki monolithic + Alloy (Promtail EOL March 2026)
- Architecture: 3-phase approach - GitOps first, observability second, CI tests last
- Critical pitfalls: ArgoCD TLS redirect loop, Loki disk exhaustion, k3s metrics config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 03:29:23 +01:00
Thomas Richter
6cdd5aa8c7 docs: start milestone v2.0 Production Operations
2026-02-03 03:14:14 +01:00
Thomas Richter
51b4b34c19 feat(ci): add GitOps pipeline with Gitea Actions and ArgoCD
- Add Gitea Actions workflow for building and pushing Docker images
- Configure ArgoCD Application for auto-sync deployment
- Update Helm values to use Gitea container registry
- Add setup documentation for GitOps configuration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 00:05:45 +01:00
Thomas Richter
b205fedde6 fix: remove deleted tags from filter automatically
When a tag is deleted as orphaned, it's now automatically removed
from the active filter selection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 23:16:49 +01:00
18 changed files with 3383 additions and 123 deletions


@@ -0,0 +1,63 @@
name: Build and Push
on:
  push:
    branches:
      - master
      - main
  pull_request:
    branches:
      - master
      - main
env:
  REGISTRY: git.kube2.tricnet.de
  IMAGE_NAME: tho/taskplaner
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to Gitea Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
            type=raw,value=latest,enable=${{ github.ref == 'refs/heads/master' }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max
      - name: Update Helm values with new image tag
        if: github.event_name != 'pull_request'
        run: |
          SHORT_SHA=$(echo "${{ github.sha }}" | cut -c1-7)
          sed -i "s/^  tag:.*/  tag: \"${SHORT_SHA}\"/" helm/taskplaner/values.yaml
          git config user.name "Gitea Actions"
          git config user.email "actions@git.kube2.tricnet.de"
          git add helm/taskplaner/values.yaml
          git commit -m "chore: update image tag to ${SHORT_SHA} [skip ci]" || echo "No changes to commit"
          git push || echo "Push failed - may need to configure git credentials"
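The tag-bump step can be exercised locally before relying on CI. A minimal sketch, assuming the chart's `values.yaml` nests `tag:` under `image:` with two-space indentation:

```shell
# Simulate the workflow's sed-based tag update on a throwaway values.yaml
mkdir -p /tmp/taskplaner-helm
cd /tmp/taskplaner-helm
cat > values.yaml <<'EOF'
image:
  repository: git.kube2.tricnet.de/tho/taskplaner
  tag: "old1234"
EOF
SHORT_SHA="abc1234"   # in CI this comes from ${{ github.sha }}
sed -i "s/^  tag:.*/  tag: \"${SHORT_SHA}\"/" values.yaml
grep 'tag:' values.yaml   # prints the updated tag line
```

If the indentation in the real chart differs, the `^  tag:` anchor must match it exactly, otherwise the sed silently does nothing and the commit step reports "No changes to commit".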


@@ -8,6 +8,17 @@ A personal web app for capturing tasks and thoughts with image attachments, acce
Capture and find anything from any device — especially laptop. If cross-device capture with images doesn't work, nothing else matters.
## Current Milestone: v2.0 Production Operations
**Goal:** Establish production-grade CI/CD pipeline and observability stack for reliable operations.
**Target features:**
- Automated tests in Gitea Actions pipeline
- ArgoCD for GitOps deployment automation
- Prometheus metrics collection
- Grafana dashboards for visibility
- Loki log aggregation
## Requirements
### Validated (v1.0 - Shipped 2026-02-01)
@@ -68,4 +79,4 @@ This project solves a real problem while serving as a vehicle for learning new a
| adapter-node for Docker | Server-side rendering with env prefix support | ✓ Validated |
---
*Last updated: 2026-02-01 after v1.0 milestone completion*
*Last updated: 2026-02-03 after v2.0 milestone started*

.planning/REQUIREMENTS.md

@@ -0,0 +1,101 @@
# Requirements: TaskPlanner v2.0
**Defined:** 2026-02-03
**Core Value:** Production-grade operations — reliable deployments and visibility into system health
## v2.0 Requirements
Requirements for milestone v2.0 Production Operations. Each maps to roadmap phases.
### GitOps
- [ ] **GITOPS-01**: ArgoCD server installed and running in cluster
- [ ] **GITOPS-02**: ArgoCD syncs TaskPlanner deployment from Git automatically
- [ ] **GITOPS-03**: ArgoCD self-heals manual changes to match Git state
- [ ] **GITOPS-04**: ArgoCD UI accessible via Traefik ingress with TLS
### Observability
- [ ] **OBS-01**: Prometheus collects metrics from cluster and applications
- [ ] **OBS-02**: Grafana displays dashboards with cluster metrics
- [ ] **OBS-03**: Loki aggregates logs from all pods
- [ ] **OBS-04**: Alloy DaemonSet collects pod logs and forwards to Loki
- [ ] **OBS-05**: Grafana can query logs via Loki datasource
- [ ] **OBS-06**: Critical alerts configured (pod crashes, disk full, app down)
- [ ] **OBS-07**: Grafana UI accessible via Traefik ingress with TLS
- [ ] **OBS-08**: TaskPlanner exposes /metrics endpoint for Prometheus
### CI Testing
- [ ] **CI-01**: Vitest installed and configured for unit tests
- [ ] **CI-02**: Unit tests run in Gitea Actions pipeline before build
- [ ] **CI-03**: Type checking (svelte-check) runs in pipeline
- [ ] **CI-04**: E2E tests (Playwright) run in pipeline
- [ ] **CI-05**: Pipeline fails fast on test/type errors before build
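CI-02 and CI-05 together mean the test jobs must gate the image build. In Gitea Actions that is expressed with a `needs:` dependency between jobs, sketched here (job names and scripts are assumptions, not the final pipeline):

```yaml
# Sketch: tests gate the build via a job dependency
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run check        # CI-03: type checking
      - run: npm run test:unit    # CI-01/CI-02: unit tests
  build:
    needs: test                   # CI-05: build never starts if tests fail
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # existing Docker build/push steps go here
```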
## Future Requirements
Deferred to later milestones.
### Observability Enhancements
- **OBS-F01**: k3s control plane metrics (scheduler, controller-manager)
- **OBS-F02**: Traefik ingress metrics integration
- **OBS-F03**: SLO/SLI dashboards with error budgets
- **OBS-F04**: Distributed tracing
### CI Enhancements
- **CI-F01**: Vulnerability scanning (Trivy, npm audit)
- **CI-F02**: DORA metrics tracking
- **CI-F03**: Smoke tests on deploy
### GitOps Enhancements
- **GITOPS-F01**: Gitea webhook integration (faster sync)
## Out of Scope
Explicitly excluded — overkill for single-user personal project.
| Feature | Reason |
|---------|--------|
| Multi-environment promotion | Single user, single environment; deploy directly to prod |
| Blue-green/canary deployments | Complex rollout unnecessary for personal app |
| ArgoCD high availability | HA for multi-team, not personal projects |
| ELK stack | Resource-heavy; Loki is lightweight alternative |
| Vault secrets management | Kubernetes secrets sufficient for personal app |
| OPA policy enforcement | Single user has no policy conflicts |
## Traceability
Which phases cover which requirements. Updated during roadmap creation.
| Requirement | Phase | Status |
|-------------|-------|--------|
| GITOPS-01 | Phase 7 | Pending |
| GITOPS-02 | Phase 7 | Pending |
| GITOPS-03 | Phase 7 | Pending |
| GITOPS-04 | Phase 7 | Pending |
| OBS-01 | Phase 8 | Pending |
| OBS-02 | Phase 8 | Pending |
| OBS-03 | Phase 8 | Pending |
| OBS-04 | Phase 8 | Pending |
| OBS-05 | Phase 8 | Pending |
| OBS-06 | Phase 8 | Pending |
| OBS-07 | Phase 8 | Pending |
| OBS-08 | Phase 8 | Pending |
| CI-01 | Phase 9 | Pending |
| CI-02 | Phase 9 | Pending |
| CI-03 | Phase 9 | Pending |
| CI-04 | Phase 9 | Pending |
| CI-05 | Phase 9 | Pending |
**Coverage:**
- v2.0 requirements: 17 total
- Mapped to phases: 17
- Unmapped: 0
---
*Requirements defined: 2026-02-03*
*Last updated: 2026-02-03 — Traceability updated after roadmap creation*


@@ -1,8 +1,13 @@
# Roadmap: TaskPlanner
## Milestones
- **v1.0 MVP** - Phases 1-6 (shipped 2026-02-01)
- 🚧 **v2.0 Production Operations** - Phases 7-9 (in progress)
## Overview
TaskPlanner delivers personal task and notes management with image attachments, accessible from any device via web browser. The roadmap progresses from data foundation through core features (entries, images, tags, search) to containerized deployment, with each phase delivering complete, verifiable functionality that enables the next.
TaskPlanner delivers personal task and notes management with image attachments, accessible from any device via web browser. v1.0 established core functionality. v2.0 adds production-grade operations: GitOps deployment automation via ArgoCD, comprehensive observability via Prometheus/Grafana/Loki, and CI pipeline hardening with automated testing.
## Phases
@@ -12,137 +17,122 @@ TaskPlanner delivers personal task and notes management with image attachments,
Decimal phases appear between their surrounding integers in numeric order.
- [x] **Phase 1: Foundation** - Data model, repository layer, and project structure ✓
- [x] **Phase 2: Core CRUD** - Entry management, quick capture, and responsive UI ✓
- [x] **Phase 3: Images** - Image attachments with mobile camera support ✓
- [x] **Phase 4: Tags & Organization** - Tagging system with pinning and due dates ✓
- [x] **Phase 5: Search** - Full-text search and filtering ✓
- [x] **Phase 6: Deployment** - Docker containerization and production configuration ✓
<details>
<summary>v1.0 MVP (Phases 1-6) - SHIPPED 2026-02-01</summary>
## Phase Details
- [x] **Phase 1: Foundation** - Data model, repository layer, and project structure
- [x] **Phase 2: Core CRUD** - Entry management, quick capture, and responsive UI
- [x] **Phase 3: Images** - Image attachments with mobile camera support
- [x] **Phase 4: Tags & Organization** - Tagging system with pinning and due dates
- [x] **Phase 5: Search** - Full-text search and filtering
- [x] **Phase 6: Deployment** - Docker containerization and production configuration
### Phase 1: Foundation
**Goal**: Data persistence and project structure are ready for feature development
**Depends on**: Nothing (first phase)
**Requirements**: None (foundational — enables all other requirements)
**Success Criteria** (what must be TRUE):
1. SQLite database initializes with schema on first run
2. Unified entries table supports both tasks and thoughts via type field
3. Repository layer provides typed CRUD operations for entries
4. Filesystem storage directory structure exists for future images
5. Development server starts and serves a basic page
**Plans**: 2 plans
Plans:
- [x] 01-01-PLAN.md — SvelteKit project setup with Drizzle schema and unified entries table
- [x] 01-02-PLAN.md — Repository layer with typed CRUD and verification page
**Plans**: 2/2 complete
### Phase 2: Core CRUD
**Goal**: Users can create, manage, and view entries with a responsive, accessible UI
**Depends on**: Phase 1
**Requirements**: CORE-01, CORE-02, CORE-03, CORE-04, CORE-05, CORE-06, CAPT-01, CAPT-02, CAPT-03, UX-01, UX-02, UX-03
**Success Criteria** (what must be TRUE):
1. User can create a new entry specifying task or thought type
2. User can edit entry title, content, and type
3. User can delete an entry with confirmation
4. User can mark a task as complete and see visual indicator
5. User can add notes to an existing entry
6. Quick capture input is visible on main view with one-click submission
7. UI is usable on mobile devices with adequate touch targets
8. Text is readable for older eyes (minimum 16px base font)
**Plans**: 4 plans
Plans:
- [x] 02-01-PLAN.md — Form actions for CRUD operations and accessible base styling
- [x] 02-02-PLAN.md — Entry list, entry cards, and quick capture components
- [x] 02-03-PLAN.md — Inline editing with expand/collapse, auto-save, and completed toggle
- [x] 02-04-PLAN.md — Swipe-to-delete gesture and mobile UX verification
**Plans**: 4/4 complete
### Phase 3: Images
**Goal**: Users can attach, view, and manage images on entries from any device
**Depends on**: Phase 2
**Requirements**: IMG-01, IMG-02, IMG-03, IMG-04
**Success Criteria** (what must be TRUE):
1. User can attach images via file upload on desktop
2. User can attach images via camera capture on mobile
3. User can view attached images inline with entry
4. User can remove image attachments from an entry
5. Images are stored on filesystem (not in database)
**Plans**: 4 plans
Plans:
- [x] 03-01-PLAN.md — Database schema, file storage, thumbnail generation, and API endpoints
- [x] 03-02-PLAN.md — File upload form action and ImageUpload component with drag-drop
- [x] 03-03-PLAN.md — CameraCapture component with getUserMedia and preview/confirm flow
- [x] 03-04-PLAN.md — EntryCard integration with gallery, lightbox, and delete functionality
**Plans**: 4/4 complete
### Phase 4: Tags & Organization
**Goal**: Users can organize entries with tags and quick access features
**Depends on**: Phase 2
**Requirements**: TAG-01, TAG-02, TAG-03, TAG-04, ORG-01, ORG-02, ORG-03
**Success Criteria** (what must be TRUE):
1. User can add multiple tags to an entry
2. User can remove tags from an entry
3. Tag input shows autocomplete suggestions from existing tags
4. Tags are case-insensitive ("work" matches "Work" and "WORK")
5. User can pin/favorite an entry for quick access
6. User can set a due date on a task
7. Pinned entries appear in a dedicated section at top of list
**Plans**: 3 plans
Plans:
- [x] 04-01-PLAN.md — Tags schema with case-insensitive index and tagRepository
- [x] 04-02-PLAN.md — Pin/favorite and due date UI (uses existing schema columns)
- [x] 04-03-PLAN.md — Tag input component with Svelecte autocomplete
**Plans**: 3/3 complete
### Phase 5: Search
**Goal**: Users can find entries through search and filtering
**Depends on**: Phase 2, Phase 4 (tags for filtering)
**Requirements**: SRCH-01, SRCH-02, SRCH-03, SRCH-04
**Success Criteria** (what must be TRUE):
1. User can search entries by text in title and content
2. User can filter entries by tag (single or multiple)
3. User can filter entries by date range
4. User can filter to show only tasks or only thoughts
5. Search results show relevant matches with highlighting
**Plans**: 3 plans
Plans:
- [x] 05-01-PLAN.md — SearchBar and FilterBar components with type definitions
- [x] 05-02-PLAN.md — Filtering logic and text highlighting utilities
- [x] 05-03-PLAN.md — Integration with recent searches and "/" keyboard shortcut
**Plans**: 3/3 complete
### Phase 6: Deployment
**Goal**: Application runs in Docker with persistent data and easy configuration
**Depends on**: Phase 1-5 (all features complete)
**Requirements**: DEPLOY-01, DEPLOY-02, DEPLOY-03, DEPLOY-04
**Plans**: 2/2 complete
</details>
### 🚧 v2.0 Production Operations (In Progress)
**Milestone Goal:** Production-grade operations with GitOps deployment, observability stack, and CI test pipeline
- [ ] **Phase 7: GitOps Foundation** - ArgoCD deployment automation with Git as source of truth
- [ ] **Phase 8: Observability Stack** - Metrics, dashboards, logs, and alerting
- [ ] **Phase 9: CI Pipeline Hardening** - Automated testing before build
## Phase Details
### Phase 7: GitOps Foundation
**Goal**: Deployments are fully automated via Git - push triggers deploy, manual changes self-heal
**Depends on**: Phase 6 (running deployment)
**Requirements**: GITOPS-01, GITOPS-02, GITOPS-03, GITOPS-04
**Success Criteria** (what must be TRUE):
1. Application runs in a Docker container
2. Configuration is provided via environment variables
3. Data persists across container restarts via named volumes
4. Single docker-compose.yml starts the entire application
5. Backup of data directory preserves all entries and images
1. ArgoCD server is running and accessible at argocd.tricnet.be
2. TaskPlanner Application shows "Synced" status in ArgoCD UI
3. Pushing a change to helm/taskplaner/values.yaml triggers automatic deployment within 3 minutes
4. Manually deleting a pod results in ArgoCD restoring it to match Git state
5. ArgoCD UI shows deployment history with sync status for each revision
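The 3-minute window in criterion 3 reflects Argo CD's default reconciliation (polling) interval, which is tunable in the argocd-cm ConfigMap. A sketch:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: 180s   # default; lower it to tighten the sync window
```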
**Plans**: 2 plans
Plans:
- [x] 06-01-PLAN.md — Docker configuration with adapter-node, Dockerfile, and docker-compose.yml
- [x] 06-02-PLAN.md — Health endpoint, environment documentation, and backup script
- [ ] 07-01-PLAN.md — Register TaskPlanner Application with ArgoCD
- [ ] 07-02-PLAN.md — Verify auto-sync and self-heal behavior
### Phase 8: Observability Stack
**Goal**: Full visibility into cluster and application health via metrics, logs, and dashboards
**Depends on**: Phase 7 (ArgoCD can deploy observability stack)
**Requirements**: OBS-01, OBS-02, OBS-03, OBS-04, OBS-05, OBS-06, OBS-07, OBS-08
**Success Criteria** (what must be TRUE):
1. Grafana is accessible at grafana.tricnet.be with pre-built Kubernetes dashboards
2. Prometheus scrapes metrics from TaskPlanner, Traefik, and k3s nodes
3. Logs from all pods are queryable in Grafana Explore via Loki
4. Alert fires when a pod crashes or restarts repeatedly (KubePodCrashLooping)
5. TaskPlanner /metrics endpoint returns Prometheus-format metrics
**Plans**: TBD
Plans:
- [ ] 08-01: kube-prometheus-stack installation (Prometheus + Grafana)
- [ ] 08-02: Loki + Alloy installation for log aggregation
- [ ] 08-03: Critical alerts and TaskPlanner metrics endpoint
### Phase 9: CI Pipeline Hardening
**Goal**: Tests run before build - type errors and test failures block deployment
**Depends on**: Phase 8 (observability shows test/build failures)
**Requirements**: CI-01, CI-02, CI-03, CI-04, CI-05
**Success Criteria** (what must be TRUE):
1. `npm run test:unit` runs Vitest and reports pass/fail
2. `npm run check` runs svelte-check and catches type errors
3. Pipeline fails before Docker build when unit tests fail
4. Pipeline fails before Docker build when type checking fails
5. E2E tests run in pipeline using Playwright Docker image
**Plans**: TBD
Plans:
- [ ] 09-01: Vitest setup and unit test structure
- [ ] 09-02: Pipeline integration with fail-fast behavior
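Criterion 5 above runs Playwright inside a container job. A sketch of what 09-02 might add to the workflow (the image tag and commands are assumptions):

```yaml
# Sketch: E2E job using the Playwright container image
e2e:
  runs-on: ubuntu-latest
  container:
    image: mcr.microsoft.com/playwright:v1.49.0-jammy   # tag assumed
  steps:
    - uses: actions/checkout@v4
    - run: npm ci
    - run: npx playwright test
```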
## Progress
**Execution Order:**
Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6
Phases execute in numeric order: 7 -> 8 -> 9
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Foundation | 2/2 | Complete | 2026-01-29 |
| 2. Core CRUD | 4/4 | Complete | 2026-01-29 |
| 3. Images | 4/4 | Complete | 2026-01-31 |
| 4. Tags & Organization | 3/3 | Complete | 2026-01-31 |
| 5. Search | 3/3 | Complete | 2026-01-31 |
| 6. Deployment | 2/2 | Complete | 2026-02-01 |
| Phase | Milestone | Plans Complete | Status | Completed |
|-------|-----------|----------------|--------|-----------|
| 1. Foundation | v1.0 | 2/2 | Complete | 2026-01-29 |
| 2. Core CRUD | v1.0 | 4/4 | Complete | 2026-01-29 |
| 3. Images | v1.0 | 4/4 | Complete | 2026-01-31 |
| 4. Tags & Organization | v1.0 | 3/3 | Complete | 2026-01-31 |
| 5. Search | v1.0 | 3/3 | Complete | 2026-01-31 |
| 6. Deployment | v1.0 | 2/2 | Complete | 2026-02-01 |
| 7. GitOps Foundation | v2.0 | 0/2 | Planned | - |
| 8. Observability Stack | v2.0 | 0/3 | Not started | - |
| 9. CI Pipeline Hardening | v2.0 | 0/2 | Not started | - |
---
*Roadmap created: 2026-01-29*
*Depth: standard (5-8 phases)*
*Coverage: 31/31 v1 requirements mapped*
*v2.0 phases added: 2026-02-03*
*Phase 7 planned: 2026-02-03*
*Depth: standard*
*v1.0 Coverage: 31/31 requirements mapped*
*v2.0 Coverage: 17/17 requirements mapped*


@@ -5,16 +5,16 @@
See: .planning/PROJECT.md (updated 2026-02-01)
**Core value:** Capture and find anything from any device — especially laptop. If cross-device capture with images doesn't work, nothing else matters.
**Current focus:** Post-v1.0 — awaiting next milestone
**Current focus:** v2.0 Production Operations — Phase 7 (GitOps Foundation)
## Current Position
Phase: N/A (between milestones)
Plan: N/A
Status: MILESTONE v1.0 COMPLETE
Last activity: 2026-02-01 — Completed milestone v1.0
Phase: 7 of 9 (GitOps Foundation)
Plan: 0 of 2 in current phase
Status: Ready to plan
Last activity: 2026-02-03 — Roadmap created for v2.0
Progress: Awaiting `/gsd:new-milestone` for v2 planning
Progress: [██████████████████░░░░░░░░░░░░] 67% (v1.0 complete, v2.0 starting)
## Performance Metrics
@@ -25,6 +25,11 @@ Progress: Awaiting `/gsd:new-milestone` for v2 planning
- Phases: 6
- Requirements satisfied: 31/31
**v2.0 Target:**
- Phases: 3 (7-9)
- Plans estimated: 7
- Requirements: 17
**By Phase (v1.0):**
| Phase | Plans | Total | Avg/Plan |
@@ -42,11 +47,15 @@ Progress: Awaiting `/gsd:new-milestone` for v2 planning
Key decisions from v1.0 are preserved in PROJECT.md.
For v2, new decisions will be logged here as work progresses.
For v2.0, key decisions from research:
- Use Grafana Alloy (not Promtail - EOL March 2026)
- ArgoCD needs server.insecure: true for Traefik TLS termination
- Loki monolithic mode with 7-day retention
- Vitest for unit tests (official Svelte recommendation)
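The `server.insecure` decision translates to a small fragment of the Argo CD Helm values. A sketch, assuming the argo-cd chart's key layout:

```yaml
# Argo CD Helm values fragment: let Traefik terminate TLS
configs:
  params:
    server.insecure: true   # serve plain HTTP behind the ingress, avoiding the redirect loop
```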
### Pending Todos
None — ready for next milestone.
None — ready for Phase 7 planning.
### Blockers/Concerns
@@ -54,10 +63,10 @@ None.
## Session Continuity
Last session: 2026-02-01
Stopped at: Completed milestone v1.0
Last session: 2026-02-03
Stopped at: Roadmap v2.0 created
Resume file: None
---
*State initialized: 2026-01-29*
*Last updated: 2026-02-01 — Milestone v1.0 archived*
*Last updated: 2026-02-03 — v2.0 roadmap created*


@@ -0,0 +1,240 @@
---
phase: 07-gitops-foundation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - argocd/application.yaml
  - argocd/repo-secret.yaml
autonomous: true
must_haves:
  truths:
    - "ArgoCD can access TaskPlanner Git repository"
    - "TaskPlanner Application exists in ArgoCD"
    - "Application shows Synced status"
  artifacts:
    - path: "argocd/repo-secret.yaml"
      provides: "Repository credentials for ArgoCD"
      contains: "argocd.argoproj.io/secret-type: repository"
    - path: "argocd/application.yaml"
      provides: "ArgoCD Application manifest"
      contains: "kind: Application"
  key_links:
    - from: "argocd/application.yaml"
      to: "ArgoCD server"
      via: "kubectl apply"
      pattern: "kind: Application"
    - from: "argocd/repo-secret.yaml"
      to: "Gitea repository"
      via: "repository secret"
      pattern: "secret-type: repository"
---
<objective>
Register TaskPlanner with ArgoCD by creating repository credentials and applying the Application manifest.
Purpose: Enable GitOps workflow where ArgoCD manages TaskPlanner deployment from Git source of truth.
Output: TaskPlanner Application registered in ArgoCD showing "Synced" status.
</objective>
<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/07-gitops-foundation/07-CONTEXT.md
@argocd/application.yaml
@helm/taskplaner/values.yaml
</context>
<tasks>
<task type="auto">
<name>Task 1: Create ArgoCD repository secret for TaskPlanner</name>
<files>argocd/repo-secret.yaml</files>
<action>
Create a Kubernetes Secret for ArgoCD to access the TaskPlanner Gitea repository.
The secret must:
1. Be in namespace `argocd`
2. Have label `argocd.argoproj.io/secret-type: repository`
3. Use internal cluster URL: `http://gitea-http.gitea.svc.cluster.local:3000/tho/taskplaner.git`
4. Use same credentials as existing gitea-repo secret (username: admin)
Create the file `argocd/repo-secret.yaml`:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: taskplaner-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: http://gitea-http.gitea.svc.cluster.local:3000/tho/taskplaner.git
  username: admin
  password: <GET_FROM_EXISTING_SECRET>
```
Get the password from existing gitea-repo secret:
```bash
kubectl get secret gitea-repo -n argocd -o jsonpath='{.data.password}' | base64 -d
```
Do NOT commit or apply the file with a real password. Keep `argocd/repo-secret.yaml` in Git with the placeholder for documentation (or gitignore it), and create the actual secret directly with kubectl, substituting the password from the existing gitea-repo secret:
```bash
PASSWORD=$(kubectl get secret gitea-repo -n argocd -o jsonpath='{.data.password}' | base64 -d)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: taskplaner-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: http://gitea-http.gitea.svc.cluster.local:3000/tho/taskplaner.git
  username: admin
  password: "$PASSWORD"
EOF
```
</action>
<verify>
```bash
kubectl get secret taskplaner-repo -n argocd
kubectl get secret taskplaner-repo -n argocd -o jsonpath='{.metadata.labels}'
```
Should show the secret exists with repository label.
</verify>
<done>Secret `taskplaner-repo` exists in argocd namespace with correct labels and credentials.</done>
</task>
<task type="auto">
<name>Task 2: Update and apply ArgoCD Application manifest</name>
<files>argocd/application.yaml</files>
<action>
Update `argocd/application.yaml` to:
1. Use internal Gitea URL (matches the repo secret)
2. Remove the inline registry secret (it has a placeholder that shouldn't be in Git)
3. Ensure the Application references the correct image pull secret name
Changes needed in application.yaml:
1. Change `repoURL` from `https://git.kube2.tricnet.de/tho/taskplaner.git` to `http://gitea-http.gitea.svc.cluster.local:3000/tho/taskplaner.git`
2. Remove the `---` separated Secret at the bottom (gitea-registry-secret with placeholder)
3. The helm values already reference `gitea-registry-secret` for imagePullSecrets
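For reference, the values fragment that point 3 refers to presumably looks like this (a sketch; the exact layout depends on the chart):

```yaml
imagePullSecrets:
  - name: gitea-registry-secret
```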
The registry secret needs to exist separately. Check if it exists:
```bash
kubectl get secret gitea-registry-secret -n default
```
If it doesn't exist, create it (the helm chart expects it). Get Gitea registry credentials and create:
```bash
# Create the registry secret for image pulls
kubectl create secret docker-registry gitea-registry-secret \
--namespace default \
--docker-server=git.kube2.tricnet.de \
--docker-username=admin \
--docker-password="$(kubectl get secret gitea-repo -n argocd -o jsonpath='{.data.password}' | base64 -d)"
```
Then apply the Application:
```bash
kubectl apply -f argocd/application.yaml
```
</action>
<verify>
```bash
kubectl get application taskplaner -n argocd
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
```
Application should exist and show sync status.
</verify>
<done>ArgoCD Application `taskplaner` exists and ArgoCD begins syncing.</done>
</task>
<task type="auto">
<name>Task 3: Wait for sync and verify healthy status</name>
<files></files>
<action>
Wait for ArgoCD to sync the application and verify it reaches Synced + Healthy status.
```bash
# Wait for sync (up to 5 minutes)
kubectl wait --for=jsonpath='{.status.sync.status}'=Synced application/taskplaner -n argocd --timeout=300s
# Check health status
kubectl get application taskplaner -n argocd -o jsonpath='{.status.health.status}'
# Get full status
kubectl get application taskplaner -n argocd -o wide
```
If sync fails, check:
1. ArgoCD logs: `kubectl logs -n argocd -l app.kubernetes.io/name=argocd-repo-server`
2. Application status: `kubectl describe application taskplaner -n argocd`
3. Repo connectivity: ArgoCD UI Settings -> Repositories
Common issues:
- Repo credentials incorrect: Check taskplaner-repo secret
- Helm chart errors: Check argocd-repo-server logs
- Image pull errors: Check gitea-registry-secret
</action>
<verify>
```bash
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
kubectl get application taskplaner -n argocd -o jsonpath='{.status.health.status}'
```
Should output: `Synced` and `Healthy`
</verify>
<done>Application shows "Synced" status and "Healthy" health in ArgoCD.</done>
</task>
</tasks>
<verification>
Phase success indicators:
1. `kubectl get secret taskplaner-repo -n argocd` returns the secret
2. `kubectl get application taskplaner -n argocd` shows the application
3. Application status is Synced and Healthy
4. ArgoCD UI at argocd.kube2.tricnet.de shows TaskPlanner with green sync status
</verification>
<success_criteria>
- Repository secret created with correct labels
- Application manifest applied successfully
- ArgoCD shows TaskPlanner as Synced
- ArgoCD shows TaskPlanner as Healthy
- Requirements GITOPS-01 (already done) and GITOPS-02 satisfied
</success_criteria>
<output>
After completion, create `.planning/phases/07-gitops-foundation/07-01-SUMMARY.md`
</output>


@@ -0,0 +1,209 @@
---
phase: 07-gitops-foundation
plan: 02
type: execute
wave: 2
depends_on: ["07-01"]
files_modified:
  - helm/taskplaner/values.yaml
autonomous: false
must_haves:
  truths:
    - "Pushing helm changes triggers automatic deployment"
    - "Manual pod deletion triggers ArgoCD self-heal"
    - "ArgoCD UI shows deployment history"
  artifacts:
    - path: "helm/taskplaner/values.yaml"
      provides: "Test change to trigger sync"
  key_links:
    - from: "Git push"
      to: "ArgoCD sync"
      via: "polling (3 min)"
      pattern: "automated sync"
    - from: "kubectl delete pod"
      to: "ArgoCD restore"
      via: "selfHeal: true"
      pattern: "pod restored"
---
<objective>
Verify GitOps workflow: auto-sync on Git push and self-healing on manual cluster changes.
Purpose: Confirm ArgoCD delivers on GitOps promise - Git is source of truth, cluster self-heals.
Output: Verified auto-deploy and self-heal behavior with documentation of tests.
</objective>
<execution_context>
@/home/tho/.claude/get-shit-done/workflows/execute-plan.md
@/home/tho/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/07-gitops-foundation/07-CONTEXT.md
@.planning/phases/07-gitops-foundation/07-01-SUMMARY.md
@argocd/application.yaml
@helm/taskplaner/values.yaml
</context>
<tasks>
<task type="auto">
<name>Task 1: Test auto-sync by pushing a helm change</name>
<files>helm/taskplaner/values.yaml</files>
<action>
Make a visible but harmless change to helm/taskplaner/values.yaml and push to trigger ArgoCD sync.
1. Add or modify a pod annotation that won't affect functionality:
```yaml
podAnnotations:
gitops-test: "verified-YYYYMMDD-HHMMSS"
```
Use current timestamp to make change unique.
2. Commit and push:
```bash
git add helm/taskplaner/values.yaml
git commit -m "test(gitops): verify auto-sync with annotation change"
git push
```
3. Wait for ArgoCD to detect and sync (up to 3 minutes polling interval):
```bash
# Watch for sync
echo "Waiting for ArgoCD to detect change (up to 3 minutes)..."
for i in {1..36}; do
REVISION=$(kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.revision}' 2>/dev/null)
CURRENT_COMMIT=$(git rev-parse HEAD)
if [ "$REVISION" = "$CURRENT_COMMIT" ]; then
echo "Synced to commit: $REVISION"
break
fi
echo "Waiting... ($i/36)"
sleep 5
done
```
4. Verify the pod has the new annotation:
```bash
kubectl get pods -n default -l app.kubernetes.io/name=taskplaner -o jsonpath='{.items[0].metadata.annotations.gitops-test}'
```
</action>
<verify>
```bash
# Verify sync revision matches latest commit
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.revision}'
git rev-parse HEAD
# Should match
# Verify pod annotation
kubectl get pods -n default -l app.kubernetes.io/name=taskplaner -o jsonpath='{.items[0].metadata.annotations.gitops-test}'
# Should show the timestamp from values.yaml
```
</verify>
<done>Git push triggered ArgoCD sync within 3 minutes, pod shows new annotation.</done>
</task>
<task type="auto">
<name>Task 2: Test self-heal by deleting a pod</name>
<files></files>
<action>
Verify ArgoCD's self-heal restores manual changes to match Git state.
1. Get current pod name:
```bash
POD_NAME=$(kubectl get pods -n default -l app.kubernetes.io/name=taskplaner -o jsonpath='{.items[0].metadata.name}')
echo "Current pod: $POD_NAME"
```
2. Delete the pod (simulating manual intervention):
```bash
kubectl delete pod $POD_NAME -n default
```
3. ArgoCD should detect the drift and restore (selfHeal: true in syncPolicy).
Watch for restoration:
```bash
echo "Waiting for ArgoCD to restore pod..."
kubectl get pods -n default -l app.kubernetes.io/name=taskplaner -w &
WATCH_PID=$!
sleep 30
kill $WATCH_PID 2>/dev/null
```
4. Verify new pod is running:
```bash
kubectl get pods -n default -l app.kubernetes.io/name=taskplaner
```
5. Verify ArgoCD still shows Synced (not OutOfSync):
```bash
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
```
Note: The Deployment controller recreates the pod immediately (standard Kubernetes behavior); ArgoCD independently detects the event and confirms the restored state still matches Git. The key verification is therefore that ArgoCD remains Synced rather than flipping to OutOfSync.
</action>
<verify>
```bash
kubectl get pods -n default -l app.kubernetes.io/name=taskplaner -o wide
kubectl get application taskplaner -n argocd -o jsonpath='{.status.sync.status}'
kubectl get application taskplaner -n argocd -o jsonpath='{.status.health.status}'
```
Pod should be running, status should be Synced and Healthy.
</verify>
<done>Pod deletion triggered restore, ArgoCD shows Synced + Healthy status.</done>
</task>
<task type="checkpoint:human-verify" gate="blocking">
<what-built>
GitOps workflow with ArgoCD managing TaskPlanner deployment:
- Repository credentials configured
- Application registered and syncing
- Auto-deploy on Git push verified
- Self-heal on manual changes verified
</what-built>
<how-to-verify>
1. Open ArgoCD UI: https://argocd.kube2.tricnet.de
2. Log in (credentials should be available)
3. Find "taskplaner" application in the list
4. Verify:
- Status shows "Synced" (green checkmark)
- Health shows "Healthy" (green heart)
- Click on the application to see deployment details
- Check "History and Rollback" tab shows recent syncs including the test commit
5. Verify TaskPlanner still works: https://task.kube2.tricnet.de
</how-to-verify>
<resume-signal>Type "approved" if ArgoCD shows TaskPlanner as Synced/Healthy and app works, or describe any issues.</resume-signal>
</task>
</tasks>
<verification>
Phase 7 completion checklist:
1. GITOPS-01: ArgoCD server running - ALREADY DONE (pre-existing)
2. GITOPS-02: ArgoCD syncs TaskPlanner from Git - Verified by sync test
3. GITOPS-03: ArgoCD self-heals manual changes - Verified by pod deletion test
4. GITOPS-04: ArgoCD UI accessible via Traefik - ALREADY DONE (pre-existing)
Success Criteria from ROADMAP.md:
- [x] ArgoCD server is running and accessible at argocd.kube2.tricnet.de
- [ ] TaskPlanner Application shows "Synced" status in ArgoCD UI
- [ ] Pushing a change to helm/taskplaner/values.yaml triggers automatic deployment within 3 minutes
- [ ] Manually deleting a pod results in ArgoCD restoring it to match Git state
- [ ] ArgoCD UI shows deployment history with sync status for each revision
</verification>
<success_criteria>
- Auto-sync test: Git push -> ArgoCD detects -> Pod updated (within 3 min)
- Self-heal test: Pod deleted -> ArgoCD restores -> Status remains Synced
- Human verification: ArgoCD UI shows healthy TaskPlanner with deployment history
- All GITOPS requirements satisfied
</success_criteria>
<output>
After completion, create `.planning/phases/07-gitops-foundation/07-02-SUMMARY.md`
</output>

View File

@@ -0,0 +1,59 @@
# Phase 7: GitOps Foundation - Context
**Gathered:** 2026-02-03
**Status:** Ready for planning
<domain>
## Phase Boundary
Register TaskPlanner with existing ArgoCD installation and verify GitOps workflow. ArgoCD server is already running and accessible — this phase applies the Application manifest and confirms auto-sync, self-heal, and deployment triggering work correctly.
</domain>
<decisions>
## Implementation Decisions
### Infrastructure State
- ArgoCD already installed and running in `argocd` namespace
- UI accessible at argocd.kube2.tricnet.de (TLS configured)
- Gitea repo credentials exist (`gitea-repo` secret) — same user can access TaskPlanner repo
- Application manifest exists at `argocd/application.yaml` with auto-sync and self-heal enabled
### What This Phase Delivers
- Apply existing `argocd/application.yaml` to register TaskPlanner
- Verify Application shows "Synced" status in ArgoCD UI
- Verify auto-deploy: push to helm/taskplaner/values.yaml triggers deployment
- Verify self-heal: manual pod deletion restores to Git state
### Repository Configuration
- Repo URL: https://git.kube2.tricnet.de/tho/taskplaner.git
- Use existing Gitea credentials (same user works for all repos)
- Internal cluster URL may be needed if external URL has issues
### Claude's Discretion
- Whether to add repo credentials specifically for TaskPlanner or rely on existing
- Exact verification test approach
- Any cleanup of placeholder values in application.yaml (e.g., registry secret)
</decisions>
<specifics>
## Specific Ideas
- The `argocd/application.yaml` has a placeholder for registry secret that needs real credentials
- Sync policy already configured: automated prune + selfHeal
- No webhook setup needed for now — 3-minute polling is acceptable
</specifics>
<deferred>
## Deferred Ideas
None — discussion stayed within phase scope.
</deferred>
---
*Phase: 07-gitops-foundation*
*Context gathered: 2026-02-03*

View File

@@ -257,7 +257,7 @@ func (s *LocalStorage) Store(ctx context.Context, file io.Reader) (string, error
| |
| v
| [FTS5 trigger auto-updates index]
| |
| v
v v
[UI shows new note] <--JSON response-- [Return created note]
```
@@ -513,3 +513,621 @@ Based on component dependencies, suggested implementation order:
---
*Architecture research for: Personal task/notes web application*
*Researched: 2026-01-29*
---
# v2.0 Architecture: CI/CD and Observability Integration
**Domain:** GitOps CI/CD and Observability Stack
**Researched:** 2026-02-03
**Confidence:** HIGH (verified with official documentation)
## Executive Summary
This section details how ArgoCD, Prometheus, Grafana, and Loki integrate with the existing k3s/Gitea/Traefik architecture. The integration follows established patterns for self-hosted Kubernetes observability stacks, with specific considerations for k3s's lightweight nature and Traefik as the ingress controller.
Key insight: The existing CI/CD foundation (Gitea Actions + ArgoCD Application) is already in place. This milestone adds observability and operational automation rather than building from scratch.
## Current Architecture Overview
```
                      Internet
                         |
                     [Traefik]
                     (Ingress)
                         |
        +----------------+----------------+
        |                |                |
   task.kube2        git.kube2        (future)
   .tricnet.de      .tricnet.de    argocd/grafana
        |                |
  [TaskPlaner]        [Gitea]
  (default ns)       + Actions
        |              Runner
        |                |
 [Longhorn PVC]          |
  (data store)           v
              [Container Registry]
              git.kube2.tricnet.de
```
### Existing Components
| Component | Namespace | Purpose | Status |
|-----------|-----------|---------|--------|
| k3s | - | Kubernetes distribution | Running |
| Traefik | kube-system | Ingress controller | Running |
| Longhorn | longhorn-system | Persistent storage | Running |
| cert-manager | cert-manager | TLS certificates | Running |
| Gitea | gitea (assumed) | Git hosting + CI | Running |
| TaskPlaner | default | Application | Running |
| ArgoCD Application | argocd | GitOps deployment | Defined (may need install) |
### Existing CI/CD Pipeline
From `.gitea/workflows/build.yaml`:
1. Push to master triggers Gitea Actions
2. Build Docker image with BuildX
3. Push to Gitea Container Registry
4. Update Helm values.yaml with new image tag
5. Commit with `[skip ci]`
6. ArgoCD detects change and syncs
**Current gap:** ArgoCD may not be installed yet (Application manifest exists but needs ArgoCD server).
## Integration Architecture
### Target State
```
                            Internet
                               |
                           [Traefik]
                           (Ingress)
                               |
   +------------+------------+------------+------------+
   |            |            |            |            |
 task.*       git.*      argocd.*    grafana.*    (internal)
   |            |            |            |            |
[TaskPlaner] [Gitea]     [ArgoCD]    [Grafana]   [Prometheus]
   |            |            ^            |       [Loki]
   |            |            |            |       [Alloy]
   |            +--webhook---+            |
   |                                      |
   +-------- metrics -------------------->+  (to Prometheus)
   +-------- logs ---------[Alloy]------->+  (to Loki)
```
### Namespace Strategy
| Namespace | Components | Rationale |
|-----------|------------|-----------|
| `argocd` | ArgoCD server, repo-server, application-controller | Standard convention; ClusterRoleBinding expects this |
| `monitoring` | Prometheus, Grafana, Alertmanager | Consolidate observability; kube-prometheus-stack default |
| `loki` | Loki, Alloy (DaemonSet) | Separate from metrics for resource isolation |
| `default` | TaskPlaner | Existing app deployment |
| `gitea` | Gitea + Actions Runner | Assumed existing |
**Alternative considered:** All observability in single namespace
**Decision:** Separate `monitoring` and `loki` because:
- Different scaling characteristics (Alloy is DaemonSet, Prometheus is StatefulSet)
- Easier resource quota management
- Standard community practice
## Component Integration Details
### 1. ArgoCD Integration
**Installation Method:** Helm chart from `argo/argo-cd`
**Integration Points:**
| Integration | How | Configuration |
|-------------|-----|---------------|
| Gitea Repository | HTTPS clone | Repository credential in argocd-secret |
| Gitea Webhook | POST to `/api/webhook` | Reduces sync delay from 3min to seconds |
| Traefik Ingress | IngressRoute or Ingress | `server.insecure=true` to avoid redirect loops |
| TLS | cert-manager annotation | Let's Encrypt via existing cluster-issuer |
**Critical Configuration:**
```yaml
# Helm values for ArgoCD with Traefik
configs:
params:
server.insecure: true # Required: Traefik handles TLS
server:
ingress:
enabled: true
ingressClassName: traefik
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- argocd.kube2.tricnet.de
tls:
- secretName: argocd-tls
hosts:
- argocd.kube2.tricnet.de
```
**Webhook Setup for Gitea:**
1. In ArgoCD secret, set `webhook.gogs.secret` (Gitea uses Gogs-compatible webhooks)
2. In Gitea repository settings, add webhook:
- URL: `https://argocd.kube2.tricnet.de/api/webhook`
- Content type: `application/json`
- Secret: Same as configured in ArgoCD
**Known Limitation:** Webhooks work for Applications but not ApplicationSets with Gitea.
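The webhook pairing can be sketched as a patch to `argocd-secret` (the key name `webhook.gogs.secret` is the one ArgoCD documents for Gogs-compatible webhooks; the value shown is a placeholder):
```yaml
# Sketch: store the shared webhook secret in argocd-secret.
# ArgoCD reads webhook.gogs.secret to validate Gogs/Gitea webhook payloads.
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  webhook.gogs.secret: "<shared-secret>"  # placeholder; same value as in Gitea webhook settings
```
Apply via `kubectl apply` or `kubectl patch` rather than committing the secret to Git.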
### 2. Prometheus/Grafana Integration (kube-prometheus-stack)
**Installation Method:** Helm chart `prometheus-community/kube-prometheus-stack`
**Integration Points:**
| Integration | How | Configuration |
|-------------|-----|---------------|
| k3s metrics | Exposed kube-* endpoints | k3s config modification required |
| Traefik metrics | ServiceMonitor | Traefik exposes `:9100/metrics` |
| TaskPlaner metrics | ServiceMonitor (future) | App must expose `/metrics` endpoint |
| Grafana UI | Traefik Ingress | Standard Kubernetes Ingress |
**Critical k3s Configuration:**
k3s binds controller-manager, scheduler, and proxy to localhost by default. For Prometheus scraping, expose on 0.0.0.0.
Create/modify `/etc/rancher/k3s/config.yaml`:
```yaml
kube-controller-manager-arg:
- "bind-address=0.0.0.0"
kube-proxy-arg:
- "metrics-bind-address=0.0.0.0"
kube-scheduler-arg:
- "bind-address=0.0.0.0"
```
Then restart k3s: `sudo systemctl restart k3s`
**k3s-specific Helm values:**
```yaml
# Disable etcd monitoring (k3s uses sqlite, not etcd)
defaultRules:
rules:
etcd: false
kubeEtcd:
enabled: false
# Fix endpoint discovery for k3s
kubeControllerManager:
enabled: true
endpoints:
- <k3s-server-ip>
service:
enabled: true
port: 10257
targetPort: 10257
kubeScheduler:
enabled: true
endpoints:
- <k3s-server-ip>
service:
enabled: true
port: 10259
targetPort: 10259
kubeProxy:
enabled: true
endpoints:
- <k3s-server-ip>
service:
enabled: true
port: 10249
targetPort: 10249
# Grafana ingress
grafana:
ingress:
enabled: true
ingressClassName: traefik
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- grafana.kube2.tricnet.de
tls:
- secretName: grafana-tls
hosts:
- grafana.kube2.tricnet.de
```
**ServiceMonitor for TaskPlaner (future):**
Once TaskPlaner exposes `/metrics`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: taskplaner
namespace: monitoring
labels:
release: prometheus # Must match kube-prometheus-stack release
spec:
namespaceSelector:
matchNames:
- default
selector:
matchLabels:
app.kubernetes.io/name: taskplaner
endpoints:
- port: http
path: /metrics
interval: 30s
```
### 3. Loki + Alloy Integration (Log Aggregation)
**Important:** Promtail is deprecated (LTS until Feb 2026, EOL March 2026). Use **Grafana Alloy** instead.
**Installation Method:**
- Loki: Helm chart `grafana/loki` (monolithic mode for single node)
- Alloy: Helm chart `grafana/alloy`
**Integration Points:**
| Integration | How | Configuration |
|-------------|-----|---------------|
| Pod logs | Alloy DaemonSet | Mounts `/var/log/pods` |
| Loki storage | Longhorn PVC or MinIO | Single-binary uses filesystem |
| Grafana datasource | `additionalDataSources` in Helm values | Can also be added manually in Grafana UI |
| k3s node logs | Alloy journal reader | journalctl access |
**Deployment Mode Decision:**
| Mode | When to Use | Our Choice |
|------|-------------|------------|
| Monolithic (single-binary) | Small deployments, <100GB/day | **Yes - single node k3s** |
| Simple Scalable | Medium deployments | No |
| Microservices | Large scale, HA required | No |
**Loki Helm values (monolithic):**
```yaml
deploymentMode: SingleBinary
singleBinary:
replicas: 1
persistence:
enabled: true
storageClass: longhorn
size: 10Gi
# Disable components not needed in monolithic
read:
replicas: 0
write:
replicas: 0
backend:
replicas: 0
# Use filesystem storage (not S3/MinIO for simplicity)
loki:
storage:
type: filesystem
schemaConfig:
configs:
- from: "2024-01-01"
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
```
**Alloy DaemonSet Configuration:**
```yaml
# alloy-values.yaml
alloy:
configMap:
create: true
content: |
// Kubernetes logs collection
loki.source.kubernetes "pods" {
targets = discovery.kubernetes.pods.targets
forward_to = [loki.write.default.receiver]
}
// Send to Loki
loki.write "default" {
endpoint {
url = "http://loki.loki.svc.cluster.local:3100/loki/api/v1/push"
}
}
// Kubernetes discovery
discovery.kubernetes "pods" {
role = "pod"
}
```
### 4. Traefik Metrics Integration
Traefik already exposes Prometheus metrics. Enable scraping:
**Option A: ServiceMonitor (if using kube-prometheus-stack)**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: traefik
namespace: monitoring
labels:
release: prometheus
spec:
namespaceSelector:
matchNames:
- kube-system
selector:
matchLabels:
app.kubernetes.io/name: traefik
endpoints:
- port: metrics
path: /metrics
interval: 30s
```
**Option B: Verify Traefik metrics are enabled**
Check Traefik deployment args include:
```
--entrypoints.metrics.address=:9100
--metrics.prometheus=true
--metrics.prometheus.entryPoint=metrics
```
## Data Flow Diagrams
### Metrics Flow
```
+------------------+   +------------------+   +------------------+
|   TaskPlaner     |   |     Traefik      |   |     k3s core     |
|    /metrics      |   |  :9100/metrics   |   |  :10249,10257... |
+--------+---------+   +--------+---------+   +--------+---------+
         |                      |                      |
         +----------------------+----------------------+
                                |
                                v
                      +-------------------+
                      |    Prometheus     |
                      | (ServiceMonitors) |
                      +--------+----------+
                               |
                               v
                      +-------------------+
                      |      Grafana      |
                      |   (Dashboards)    |
                      +-------------------+
```
### Log Flow
```
+------------------+   +------------------+   +------------------+
|   TaskPlaner     |   |     Traefik      |   |   Other Pods     |
|  stdout/stderr   |   |   access logs    |   |  stdout/stderr   |
+--------+---------+   +--------+---------+   +--------+---------+
         |                      |                      |
         +----------------------+----------------------+
                                |
                          /var/log/pods
                                |
                                v
                      +-------------------+
                      |  Alloy DaemonSet  |
                      |  (log collection) |
                      +--------+----------+
                               |
                               v
                      +-------------------+
                      |       Loki        |
                      |   (log storage)   |
                      +--------+----------+
                               |
                               v
                      +-------------------+
                      |      Grafana      |
                      |   (log queries)   |
                      +-------------------+
```
### GitOps Flow
```
+------------+     +------------+     +---------------+     +------------+
| Developer  | --> |   Gitea    | --> | Gitea Actions | --> | Container  |
|  git push  |     | Repository |     | (build.yaml)  |     |  Registry  |
+------------+     +-----+------+     +-------+-------+     +------------+
                         |                    |
                         |         (update values.yaml)
                         |                    |
                         v                    v
                   +------------+       +------------+
                   |  Webhook   | ----> |   ArgoCD   |
                   |  (notify)  |       |   Server   |
                   +------------+       +-----+------+
                                              |
                                          (sync app)
                                              |
                                              v
                                        +------------+
                                        | Kubernetes |
                                        |  (deploy)  |
                                        +------------+
```
## Build Order (Dependencies)
Based on component dependencies, recommended installation order:
### Phase 1: ArgoCD (no dependencies on observability)
```
1. Install ArgoCD via Helm
- Creates namespace: argocd
- Verify existing Application manifest works
- Configure Gitea webhook
Dependencies: None (Traefik already running)
Validates: GitOps pipeline end-to-end
```
### Phase 2: kube-prometheus-stack (foundational observability)
```
2. Configure k3s metrics exposure
- Modify /etc/rancher/k3s/config.yaml
- Restart k3s
3. Install kube-prometheus-stack via Helm
- Creates namespace: monitoring
- Includes: Prometheus, Grafana, Alertmanager
- Includes: Default dashboards and alerts
Dependencies: k3s metrics exposed
Validates: Basic cluster monitoring working
```
### Phase 3: Loki + Alloy (log aggregation)
```
4. Install Loki via Helm (monolithic mode)
- Creates namespace: loki
- Configure storage with Longhorn
5. Install Alloy via Helm
- DaemonSet in loki namespace
- Configure Kubernetes log discovery
- Point to Loki endpoint
6. Add Loki datasource to Grafana
- URL: http://loki.loki.svc.cluster.local:3100
Dependencies: Grafana from step 3, storage
Validates: Logs visible in Grafana Explore
```
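Step 6 can alternatively be declared in the kube-prometheus-stack values so the datasource survives Grafana restarts; a minimal sketch, assuming the Loki service name and namespace from step 4:
```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki.loki.svc.cluster.local:3100
```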
### Phase 4: Application Integration
```
7. Add TaskPlaner metrics endpoint (if not exists)
- Expose /metrics in app
- Create ServiceMonitor
8. Create application dashboards in Grafana
- TaskPlaner-specific metrics
- Request latency, error rates
Dependencies: All previous phases
Validates: Full observability of application
```
## Resource Requirements
| Component | CPU Request | Memory Request | Storage |
|-----------|-------------|----------------|---------|
| ArgoCD (all) | 500m | 512Mi | - |
| Prometheus | 200m | 512Mi | 10Gi (Longhorn) |
| Grafana | 100m | 256Mi | 1Gi (Longhorn) |
| Alertmanager | 50m | 64Mi | 1Gi (Longhorn) |
| Loki | 200m | 256Mi | 10Gi (Longhorn) |
| Alloy (per node) | 100m | 128Mi | - |
**Total additional:** ~1.2 CPU cores, ~1.7Gi RAM, ~22Gi storage
## Security Considerations
### Network Policies
Consider network policies to restrict:
- Prometheus scraping only from monitoring namespace
- Loki ingestion only from Alloy
- Grafana access only via Traefik
### Secrets Management
| Secret | Location | Purpose |
|--------|----------|---------|
| `argocd-initial-admin-secret` | argocd ns | Initial admin password |
| `argocd-secret` | argocd ns | Webhook secrets, repo credentials |
| `grafana-admin` | monitoring ns | Grafana admin password |
### Ingress Authentication
For production, consider:
- ArgoCD: Built-in OIDC/OAuth integration
- Grafana: Built-in auth (local, LDAP, OAuth)
- Prometheus: Traefik BasicAuth middleware (already pattern in use)
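A sketch of the Traefik BasicAuth middleware mentioned for Prometheus (the middleware and secret names are assumptions; the secret must contain htpasswd-format credentials):
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: prometheus-auth
  namespace: monitoring
spec:
  basicAuth:
    secret: prometheus-basic-auth  # htpasswd-format secret, created via kubectl
```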
## Anti-Patterns to Avoid
### 1. Skipping k3s Metrics Configuration
**What happens:** Prometheus installs but most dashboards show "No data"
**Prevention:** Configure k3s to expose metrics BEFORE installing kube-prometheus-stack
### 2. Using Promtail Instead of Alloy
**What happens:** Technical debt - Promtail EOL is March 2026
**Prevention:** Use Alloy from the start; migration documentation exists
### 3. Running Loki in Microservices Mode for Small Clusters
**What happens:** Unnecessary complexity, resource overhead
**Prevention:** Monolithic mode for clusters under 100GB/day log volume
### 4. Forgetting server.insecure for ArgoCD with Traefik
**What happens:** Redirect loop (ERR_TOO_MANY_REDIRECTS)
**Prevention:** Always set `configs.params.server.insecure=true` when Traefik handles TLS
### 5. ServiceMonitor Label Mismatch
**What happens:** Prometheus doesn't discover custom ServiceMonitors
**Prevention:** Ensure `release: <helm-release-name>` label matches kube-prometheus-stack release
## Sources
**ArgoCD:**
- [ArgoCD Webhook Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/webhook/)
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [ArgoCD Installation](https://argo-cd.readthedocs.io/en/stable/operator-manual/installation/)
- [Mastering GitOps: ArgoCD and Gitea on Kubernetes](https://blog.stackademic.com/mastering-gitops-a-comprehensive-guide-to-self-hosting-argocd-and-gitea-on-kubernetes-9cdf36856c38)
**Prometheus/Grafana:**
- [kube-prometheus-stack Helm Chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
- [Prometheus on K3s](https://fabianlee.org/2022/07/02/prometheus-installing-kube-prometheus-stack-on-k3s-cluster/)
- [K3s Monitoring Guide](https://github.com/cablespaghetti/k3s-monitoring)
- [ServiceMonitor Explained](https://dkbalachandar.wordpress.com/2025/07/21/kubernetes-servicemonitor-explained-how-to-monitor-services-with-prometheus/)
**Loki/Alloy:**
- [Loki Monolithic Installation](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/)
- [Loki Deployment Modes](https://grafana.com/docs/loki/latest/get-started/deployment-modes/)
- [Migrate from Promtail to Alloy](https://grafana.com/docs/alloy/latest/set-up/migrate/from-promtail/)
- [Grafana Loki 3.4 Release](https://grafana.com/blog/2025/02/13/grafana-loki-3.4-standardized-storage-config-sizing-guidance-and-promtail-merging-into-alloy/)
- [Alloy Replacing Promtail](https://docs-bigbang.dso.mil/latest/docs/adrs/0004-alloy-replacing-promtail/)
**Traefik Integration:**
- [Traefik Metrics with Prometheus](https://traefik.io/blog/capture-traefik-metrics-for-apps-on-kubernetes-with-prometheus)
---
*Last updated: 2026-02-03*

View File

@@ -210,5 +210,241 @@ Features to defer until product-market fit is established:
- Evernote features page (verified via WebFetch)
---
*Feature research for: Personal Task/Notes Web App*
*Researched: 2026-01-29*
# CI/CD and Observability Features
**Domain:** CI/CD pipelines and Kubernetes observability for personal project
**Researched:** 2026-02-03
**Context:** Single-user, self-hosted TaskPlanner app with existing basic Gitea Actions pipeline
## Current State
Based on the existing `.gitea/workflows/build.yaml`:
- Build and push Docker images to Gitea Container Registry
- Docker layer caching enabled
- Automatic Helm values update with new image tag
- No tests in pipeline
- No GitOps automation (ArgoCD defined but requires manual sync)
- No observability stack
---
## Table Stakes
Features required for production-grade operations. Missing any of these means the system is incomplete for reliable self-hosting.
### CI/CD Pipeline
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| **Automated tests in pipeline** | Catch bugs before deployment; without tests, pipeline is just a build script | Low | Start with unit tests (70% of test pyramid), add integration tests later |
| **Build caching** | Already have this | - | Using Docker layer cache to registry |
| **Lint/static analysis** | Catch errors early (fail fast principle) | Low | ESLint, TypeScript checking |
| **Pipeline as code** | Already have this | - | Workflow defined in `.gitea/workflows/` |
| **Automated deployment trigger** | Manual `helm upgrade` defeats CI/CD purpose | Low | ArgoCD auto-sync on Git changes |
| **Container image tagging** | Already have this | - | SHA-based tags with `latest` |
### GitOps
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| **Git as single source of truth** | Core GitOps principle; cluster state should match Git | Low | ArgoCD watches Git repo, syncs to cluster |
| **Auto-sync** | Manual sync defeats GitOps purpose | Low | ArgoCD `syncPolicy.automated` (presence of the block enables auto-sync) |
| **Self-healing** | Prevents drift; if someone kubectl edits, ArgoCD reverts | Low | ArgoCD `selfHeal: true` |
| **Health checks** | Know if deployment succeeded | Low | ArgoCD built-in health status |
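The sync-related settings above map to a small fragment of the Application spec, e.g.:
```yaml
# Relevant fragment of an ArgoCD Application spec
spec:
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from Git
      selfHeal: true  # revert manual cluster edits
```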
### Observability
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| **Basic metrics collection** | Know if app is running, resource usage | Medium | Prometheus + kube-state-metrics |
| **Metrics visualization** | Metrics without dashboards are useless | Low | Grafana with pre-built Kubernetes dashboards |
| **Container logs aggregation** | Debug issues without `kubectl logs` | Medium | Loki (lightweight, label-based) |
| **Basic alerting** | Know when something breaks | Low | AlertManager with 3-5 critical alerts |
---
## Differentiators
Features that add significant value but are not strictly required for a single-user personal app. Implement if you want learning/practice or improved reliability.
### CI/CD Pipeline
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| **Smoke tests on deploy** | Verify deployment actually works | Medium | Hit health endpoint after deploy |
| **Build notifications** | Know when builds fail without watching | Low | Slack/Discord/email webhook |
| **DORA metrics tracking** | Track deployment frequency, lead time | Medium | Measure CI/CD effectiveness |
| **Parallel test execution** | Faster feedback on larger test suites | Medium | Only valuable with substantial test suite |
| **Dependency vulnerability scanning** | Catch security issues early | Low | `npm audit`, Trivy for container images |
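The smoke-test idea could look like an extra job in `.gitea/workflows/build.yaml`; a hypothetical sketch (job name, wait time, and URL are assumptions):
```yaml
smoke-test:
  needs: build
  runs-on: ubuntu-latest
  steps:
    - name: Wait for ArgoCD sync window, then hit the app
      run: |
        sleep 180  # cover the 3-minute polling interval
        curl -fsS https://task.kube2.tricnet.de/ > /dev/null
```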
### GitOps
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| **Automated pruning** | Remove resources deleted from Git | Low | ArgoCD `prune: true` |
| **Sync windows** | Control when syncs happen | Low | Useful if you want maintenance windows |
| **Application health dashboard** | Visual cluster state | Low | ArgoCD UI already provides this |
| **Git commit status** | See deployment status in Gitea | Medium | ArgoCD notifications to Git |
### Observability
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| **Application-level metrics** | Track business metrics (tasks created, etc.) | Medium | Custom Prometheus metrics in app |
| **Request tracing** | Debug latency issues | High | OpenTelemetry, Tempo/Jaeger |
| **SLO/SLI dashboards** | Define and track reliability targets | Medium | Error budgets, latency percentiles |
| **Log-based alerting** | Alert on error patterns | Medium | Loki alerting rules |
| **Uptime monitoring** | External availability check | Low | Uptime Kuma or similar |
---
## Anti-Features
Features that are overkill for a single-user personal app. Actively avoid these to prevent over-engineering.
| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| **Multi-environment promotion (dev/staging/prod)** | Single user, single environment | Deploy directly to prod; use feature flags if needed |
| **Blue-green/canary deployments** | Complex rollout for single user is overkill | Simple rolling update; ArgoCD rollback if needed |
| **Full E2E test suite in CI** | Expensive, slow, diminishing returns for personal app | Unit + smoke tests; manual E2E when needed |
| **High availability ArgoCD** | HA is for multi-team, multi-tenant | Single replica ArgoCD is fine |
| **Distributed tracing** | Overkill unless debugging microservices latency | Only add if you have multiple services with latency issues |
| **ELK stack for logging** | Resource-heavy; Elasticsearch needs significant memory | Use Loki instead (label-based, lightweight) |
| **Full APM solution** | DataDog/NewRelic-style solutions are enterprise-focused | Prometheus + Grafana + Loki covers personal needs |
| **Secrets management (Vault)** | Complex for single user with few secrets | Kubernetes secrets or sealed-secrets |
| **Policy enforcement (OPA/Gatekeeper)** | You are the only user; no policy conflicts | Skip entirely |
| **Multi-cluster management** | Single cluster, single app | Skip entirely |
| **Cost optimization/FinOps** | Personal project; cost is fixed/minimal | Skip entirely |
| **AI-assisted observability** | Marketing hype; manual review is fine at this scale | Skip entirely |
---
## Feature Dependencies
```
Automated Tests
      |
      v
Lint/Static Analysis --> Build --> Push Image --> Update Git
                                                      |
                                                      v
                                              ArgoCD Auto-Sync
                                                      |
                                                      v
                                             Health Check Pass
                                                      |
                                                      v
                                            Deployment Complete
                                                      |
                                                      v
                                  Metrics/Logs Available in Grafana
```
Key ordering constraints:
1. Tests before build (fail fast)
2. ArgoCD watches Git, so Git update triggers deploy
3. Observability stack must be deployed before app for metrics collection
---
## MVP Recommendation for CI/CD and Observability
For production-grade operations on a personal project, prioritize in this order:
### Phase 1: GitOps Foundation
1. Enable ArgoCD auto-sync with self-healing
2. Add basic health checks
*Rationale:* Eliminates manual `helm upgrade`, establishes GitOps workflow
### Phase 2: Basic Observability
1. Prometheus + Grafana (kube-prometheus-stack helm chart)
2. Loki for log aggregation
3. 3-5 critical alerts (pod crashes, high memory, app down)
*Rationale:* Can't operate what you can't see; minimum viable observability
### Phase 3: CI Pipeline Hardening
1. Add unit tests to pipeline
2. Add linting/type checking
3. Smoke test after deploy (optional)
*Rationale:* Tests catch bugs before they reach production
### Defer to Later (if ever)
- Application-level custom metrics
- SLO dashboards
- Advanced alerting
- Request tracing
- Extensive E2E tests
---
## Complexity Budget
For a single-user personal project, the total complexity budget should be LOW-MEDIUM:
| Category | Recommended Complexity | Over-Budget Indicator |
|----------|----------------------|----------------------|
| CI Pipeline | LOW | More than 10 min build time; complex test matrix |
| GitOps | LOW | Multi-environment promotion; complex sync policies |
| Metrics | MEDIUM | Custom exporters; high-cardinality metrics |
| Logging | LOW | Full-text search; complex log parsing |
| Alerting | LOW | More than 10 alerts; complex routing |
| Tracing | SKIP | Any tracing for single-service app |
---
## Essential Alerts for Personal Project
Based on best practices, these 5 alerts are sufficient for a single-user app:
| Alert | Condition | Why Critical |
|-------|-----------|--------------|
| **Pod CrashLooping** | restarts > 3 in 15 min | App is failing repeatedly |
| **Pod OOMKilled** | OOM event detected | Memory limits too low or leak |
| **High Memory Usage** | memory > 85% for 5 min | Approaching resource limits |
| **App Unavailable** | probe failures > 3 | Users cannot access app |
| **Disk Running Low** | disk > 80% used | Persistent storage filling up |
**Key principle:** Alerts should be symptom-based and actionable. If an alert fires and you don't need to do anything, remove it.
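The table above can be translated into a PrometheusRule for the operator deployed by kube-prometheus-stack. A minimal sketch covering two of the alerts (expressions, thresholds, and the rule name are illustrative; tune them to your cluster):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: taskplanner-essential   # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: essential
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting repeatedly"
        - alert: HighMemoryUsage
          # Ratio of working-set memory to the container's memory limit
          expr: |
            container_memory_working_set_bytes{container!=""}
              / on(namespace, pod, container)
            kube_pod_container_resource_limits{resource="memory"} > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} memory above 85% of its limit"
```

The remaining alerts (OOMKilled, probe failures, disk usage) follow the same pattern with their own expressions.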
---
## Sources
### CI/CD Best Practices
- [TeamCity CI/CD Guide](https://www.jetbrains.com/teamcity/ci-cd-guide/ci-cd-best-practices/)
- [Spacelift CI/CD Best Practices](https://spacelift.io/blog/ci-cd-best-practices)
- [GitLab CI/CD Best Practices](https://about.gitlab.com/blog/how-to-keep-up-with-ci-cd-best-practices/)
- [AWS CI/CD Best Practices](https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-cicd-litmus/cicd-best-practices.html)
### Observability
- [Kubernetes Observability Trends 2026](https://www.usdsi.org/data-science-insights/kubernetes-observability-and-monitoring-trends-in-2026)
- [Spectro Cloud: Choosing the Right Monitoring Stack](https://www.spectrocloud.com/blog/choosing-the-right-kubernetes-monitoring-stack)
- [ClickHouse: Mastering Kubernetes Observability](https://clickhouse.com/resources/engineering/mastering-kubernetes-observability-guide)
- [Kubernetes Official Observability Docs](https://kubernetes.io/docs/concepts/cluster-administration/observability/)
### ArgoCD/GitOps
- [ArgoCD Auto Sync Documentation](https://argo-cd.readthedocs.io/en/stable/user-guide/auto_sync/)
- [ArgoCD Best Practices](https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/)
- [mkdev: ArgoCD Self-Heal and Sync Windows](https://mkdev.me/posts/argo-cd-self-heal-sync-windows-and-diffing)
### Alerting
- [Sysdig: Alerting on Kubernetes](https://www.sysdig.com/blog/alerting-kubernetes)
- [Groundcover: Kubernetes Alerting](https://www.groundcover.com/kubernetes-monitoring/kubernetes-alerting)
- [Sematext: 10 Must-Have Kubernetes Alerts](https://sematext.com/blog/top-10-must-have-alerts-for-kubernetes/)
### Logging
- [Plural: Loki vs ELK for Kubernetes](https://www.plural.sh/blog/loki-vs-elk-kubernetes/)
- [Loki vs ELK Comparison](https://alexandre-vazquez.com/loki-vs-elk/)
### Testing Pyramid
- [CircleCI: Testing Pyramid](https://circleci.com/blog/testing-pyramid/)
- [Semaphore: Testing Pyramid](https://semaphore.io/blog/testing-pyramid)
- [AWS: Testing Stages in CI/CD](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/testing-stages-in-continuous-integration-and-continuous-delivery.html)
### Homelab/Personal Projects
- [Prometheus and Grafana Homelab Setup](https://unixorn.github.io/post/homelab/homelab-setup-prometheus-and-grafana/)
- [Better Stack: Install Prometheus/Grafana with Helm](https://betterstack.com/community/questions/install-prometheus-and-grafana-on-kubernetes-with-helm/)

# Domain Pitfalls: CI/CD and Observability on k3s
**Domain:** Adding ArgoCD, Prometheus, Grafana, and Loki to existing k3s cluster
**Context:** TaskPlanner on self-hosted k3s with Gitea, Traefik, Longhorn
**Researched:** 2026-02-03
**Confidence:** HIGH (verified with official documentation and community issues)
---
## Critical Pitfalls
Mistakes that cause system instability, data loss, or require significant rework.
### 1. Gitea Webhook JSON Parsing Failure with ArgoCD
**What goes wrong:** ArgoCD receives webhooks from Gitea but fails to parse them with error: `json: cannot unmarshal string into Go struct field .repository.created_at of type int64`. This happens because ArgoCD treats Gitea events as GitHub events instead of Gogs events.
**Why it happens:** Gitea is a fork of Gogs, but ArgoCD parses Gitea payloads with its GitHub webhook structs, which expect different field types. In the GitHub format, `repository.created_at` is an int64 timestamp; Gitea and Gogs send it as a string.
**Consequences:**
- Webhooks fail quietly (ArgoCD logs an error but otherwise continues)
- Must wait for 3-minute polling interval for changes to sync
- False confidence that instant sync is working
**Warning signs:**
- ArgoCD server logs show webhook parsing errors
- Application sync doesn't happen immediately after push
- Webhook delivery shows success in Gitea but no ArgoCD response
**Prevention:**
- Configure webhook with `Gogs` type in Gitea, NOT `Gitea` type
- Test webhook delivery and check ArgoCD server logs: `kubectl logs -n argocd deploy/argocd-server | grep -i webhook`
- Accept the 3-minute polling interval as a fallback (webhooks are an optional enhancement)
**Phase to address:** ArgoCD installation phase - verify webhook integration immediately
**Sources:**
- [ArgoCD Issue #16453 - Forgejo/Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD Issue #20444 - Gitea support lacking](https://github.com/argoproj/argo-cd/issues/20444)
---
### 2. Loki Disk Full with No Size-Based Retention
**What goes wrong:** Loki fills the entire disk because retention is only time-based, not size-based. When disk fills, Loki crashes with "no space left on device" and becomes completely non-functional - Grafana cannot even fetch labels.
**Why it happens:**
- Retention is disabled by default (`compactor.retention-enabled: false`)
- Loki only supports time-based retention (e.g., 7 days), not size-based
- High-volume logging can fill disk before retention period expires
**Consequences:**
- Complete logging system failure
- May affect other pods sharing the same Longhorn volume
- Recovery requires manual cleanup or volume expansion
**Warning signs:**
- Steadily increasing PVC usage visible in `kubectl get pvc`
- Loki compactor logs show no deletion activity
- Grafana queries become slow before complete failure
**Prevention:**
```yaml
# Loki values.yaml
loki:
  compactor:
    retention_enabled: true
    compaction_interval: 10m
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    working_directory: /loki/compactor
  limits_config:
    retention_period: 168h # 7 days - adjust based on disk size
```
- Set a conservative retention period (start with 7 days)
- Run the compactor as a StatefulSet with persistent storage for its marker files
- Set up a Prometheus alert for PVC usage > 80%
- The index period MUST be 24h for retention to work
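The 24h index period is set in the chart's schema configuration. A sketch for the grafana/loki chart, assuming the TSDB index with filesystem storage (the `from` date is illustrative):

```yaml
loki:
  schemaConfig:
    configs:
      - from: "2024-04-01"   # any date before first ingestion
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h        # required for retention to work
```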
**Phase to address:** Loki installation phase - configure retention from day one
**Sources:**
- [Grafana Loki Retention Documentation](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Loki Issue #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)
---
### 3. Prometheus Volume Growth Exceeds Longhorn PVC
**What goes wrong:** Prometheus metrics storage grows beyond PVC capacity. Longhorn volume expansion via CSI can result in a faulted volume that prevents Prometheus from starting.
**Why it happens:**
- Default Prometheus retention is 15 days with no size limit
- kube-prometheus-stack defaults don't match k3s resource constraints
- Longhorn CSI volume expansion has known issues requiring specific procedure
**Consequences:**
- Prometheus pod stuck in pending/crash loop
- Loss of historical metrics
- Longhorn volume in faulted state requiring manual recovery
**Warning signs:**
- Prometheus pod restarts with OOMKilled or disk errors
- `kubectl describe pvc` shows capacity approaching limit
- Longhorn UI shows volume health degraded
**Prevention:**
```yaml
# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: "8GB" # Set explicit size limit
    resources:
      requests:
        memory: 400Mi
      limits:
        memory: 600Mi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 10Gi
```
- Always set both `retention` AND `retentionSize`
- Size PVC with 20% headroom above retentionSize
- Monitor with `prometheus_tsdb_storage_blocks_bytes` metric
- For expansion: stop pod, detach volume, resize, then restart
**Phase to address:** Prometheus installation phase
**Sources:**
- [Longhorn Issue #2222 - Volume expansion faults](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack Issue #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)
---
### 4. ArgoCD + Traefik TLS Termination Redirect Loop
**What goes wrong:** ArgoCD UI becomes inaccessible with redirect loops or connection refused errors when accessed through Traefik. Browser shows ERR_TOO_MANY_REDIRECTS.
**Why it happens:** Traefik terminates TLS and forwards HTTP to ArgoCD. ArgoCD server, configured for TLS by default, responds with 307 redirects to HTTPS, creating infinite loop.
**Consequences:**
- Cannot access ArgoCD UI via ingress
- CLI may work with port-forward but not through ingress
- gRPC connections for CLI through ingress fail
**Warning signs:**
- Browser redirect loop when accessing ArgoCD URL
- `curl -v` shows 307 redirect responses
- Works with `kubectl port-forward` but not via ingress
**Prevention:**
```yaml
# Option 1: ConfigMap (recommended)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.insecure: "true"
---
# Option 2: Traefik IngressRoute for dual HTTP/gRPC
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`argocd.example.com`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    - kind: Rule
      match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c
  tls:
    certResolver: letsencrypt-prod
```
- Set `server.insecure: "true"` in argocd-cmd-params-cm ConfigMap
- Use IngressRoute (not Ingress) for proper gRPC support
- Configure separate routes for HTTP and gRPC with correct priority
**Phase to address:** ArgoCD installation phase - test immediately after ingress setup
**Sources:**
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [Traefik Community - ArgoCD behind Traefik](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)
---
## Moderate Pitfalls
Mistakes that cause delays, debugging sessions, or technical debt.
### 5. ServiceMonitor Not Discovering Targets
**What goes wrong:** Prometheus ServiceMonitors are created but no targets appear in Prometheus. The scrape config shows 0/0 targets up.
**Why it happens:**
- Label selector mismatch between Prometheus CR and ServiceMonitor
- RBAC: Prometheus ServiceAccount lacks permission in target namespace
- Port specified as number instead of name
- ServiceMonitor in different namespace than Prometheus expects
**Prevention:**
```yaml
# Ensure Prometheus CR has permissive selectors
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}          # Select all ServiceMonitors
    serviceMonitorNamespaceSelector: {} # From all namespaces
---
# ServiceMonitor must use port NAME not number
spec:
  endpoints:
    - port: metrics # NOT 9090
```
- Use port name, never port number in ServiceMonitor
- Check RBAC: `kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus -n default`
- Verify label matching: `kubectl get servicemonitor -A --show-labels`
**Phase to address:** Prometheus installation phase, verify with test ServiceMonitor
**Sources:**
- [Prometheus Operator Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [ServiceMonitor not discovered Issue #3383](https://github.com/prometheus-operator/prometheus-operator/issues/3383)
---
### 6. k3s Control Plane Metrics Not Scraped
**What goes wrong:** Prometheus dashboards show no metrics for kube-scheduler, kube-controller-manager, or etcd. These panels appear blank or show "No data."
**Why it happens:** k3s runs control plane components as a single binary, not as pods. Standard kube-prometheus-stack expects to scrape pods that don't exist.
**Prevention:**
```yaml
# kube-prometheus-stack values for k3s
kubeControllerManager:
  enabled: true
  endpoints:
    - 192.168.1.100 # k3s server IP
  service:
    enabled: true
    port: 10257
    targetPort: 10257
kubeScheduler:
  enabled: true
  endpoints:
    - 192.168.1.100
  service:
    enabled: true
    port: 10259
    targetPort: 10259
kubeEtcd:
  enabled: false # k3s uses embedded sqlite/etcd
```
- Explicitly configure control plane endpoints with the k3s server IP(s)
- Disable etcd monitoring if using the embedded database
- Or disable these components entirely for a simpler setup
**Phase to address:** Prometheus installation phase
**Sources:**
- [Prometheus for Rancher K3s Control Plane Monitoring](https://www.spectrocloud.com/blog/enabling-rancher-k3s-cluster-control-plane-monitoring-with-prometheus)
---
### 7. Promtail Not Sending Logs to Loki
**What goes wrong:** Promtail pods are running but no logs appear in Grafana/Loki. Queries return empty results.
**Why it happens:**
- Promtail started before Loki was ready
- Log path configuration doesn't match k3s container runtime paths
- Label selectors don't match actual pod labels
- Network policy blocking Promtail -> Loki communication
**Warning signs:**
- Promtail logs show "dropping target, no labels" or connection errors
- `kubectl logs -n monitoring promtail-xxx` shows retries
- Loki data source health check passes but queries return nothing
**Prevention:**
```yaml
# Verify k3s containerd log paths
promtail:
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              target_label: node
```
- Delete Promtail positions file to force re-read: `kubectl exec -n monitoring promtail-xxx -- rm /tmp/positions.yaml`
- Ensure Loki is healthy before Promtail starts (use init container or sync wave)
- Verify log paths match containerd: `/var/log/pods/*/*/*.log`
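One way to enforce the start ordering is an init container that blocks until Loki's `/ready` endpoint responds. A sketch, assuming a `loki` service in the `monitoring` namespace (adjust the service name to your chart's output):

```yaml
initContainers:
  - name: wait-for-loki
    image: curlimages/curl:8.11.1
    command:
      - sh
      - -c
      # Poll Loki's readiness endpoint until it answers 2xx
      - until curl -sf http://loki.monitoring.svc:3100/ready; do echo waiting for loki; sleep 5; done
```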
**Phase to address:** Loki installation phase
**Sources:**
- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)
---
### 8. ArgoCD Self-Management Bootstrap Chicken-Egg
**What goes wrong:** Attempting to have ArgoCD manage itself creates confusion about what's managing what. Initial mistakes in the ArgoCD Application manifest can lock you out.
**Why it happens:** GitOps can't install ArgoCD if ArgoCD isn't present. After bootstrap, changing ArgoCD's self-managing Application incorrectly can break the cluster.
**Prevention:**
```yaml
# Phase 1: Install ArgoCD manually (kubectl apply or helm)
# Phase 2: Create self-management Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.kube2.tricnet.de/tho/infrastructure.git
    path: argocd
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false # CRITICAL: Don't auto-prune ArgoCD components
      selfHeal: true
```
- Always bootstrap ArgoCD manually first (Helm or kubectl)
- Set `prune: false` for ArgoCD's self-management Application
- Use App of Apps pattern for managed applications
- Keep a local backup of ArgoCD Application manifest
**Phase to address:** ArgoCD installation phase - plan bootstrap strategy upfront
**Sources:**
- [Bootstrapping ArgoCD - Windsock.io](https://windsock.io/bootstrapping-argocd/)
- [Demystifying GitOps - Bootstrapping ArgoCD](https://medium.com/@aaltundemir/demystifying-gitops-bootstrapping-argo-cd-4a861284f273)
---
### 9. Sync Waves Misuse Creating False Dependencies
**What goes wrong:** Over-engineering sync waves creates unnecessary sequential deployments, increasing deployment time and complexity. Or under-engineering leads to race conditions.
**Why it happens:**
- Developers add waves "just in case"
- Misunderstanding that waves order resources within a single Application only
- Not knowing that the default wave is 0 and that waves can be negative
**Prevention:**
```yaml
# Use waves sparingly - only for true dependencies
# Database must exist before app
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1" # First
---
# App deployment
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0" # Default, after database

# Don't create unnecessary chains like:
#   ConfigMap (wave -3) -> Secret (wave -2) -> Service (wave -1) -> Deployment (wave 0)
# These have no real dependency and should all be wave 0
```
- Use waves only for actual dependencies (database before app, CRD before CR)
- Keep wave structure as flat as possible
- Sync waves do NOT work across different ArgoCD Applications
- For cross-Application dependencies, use ApplicationSets with Progressive Syncs
**Phase to address:** Application configuration phase
**Sources:**
- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
---
## Minor Pitfalls
Annoyances that are easily fixed but waste time if not known.
### 10. Grafana Default Password Not Changed
**What goes wrong:** Using default `admin/prom-operator` credentials in production exposes the monitoring stack.
**Prevention:**
```yaml
# kube-prometheus-stack values
grafana:
  adminPassword: "${GRAFANA_ADMIN_PASSWORD}" # From secret
  # Or use existing secret
  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
```
**Phase to address:** Grafana installation phase
---
### 11. Missing open-iscsi for Longhorn
**What goes wrong:** Longhorn volumes fail to attach with cryptic errors.
**Why it happens:** Longhorn requires `open-iscsi` on all nodes, which isn't installed by default on many Linux distributions.
**Prevention:**
```bash
# On each node before Longhorn installation
sudo apt-get install -y open-iscsi
sudo systemctl enable iscsid
sudo systemctl start iscsid
```
**Phase to address:** Pre-installation prerequisites check
**Sources:**
- [Longhorn Prerequisites](https://longhorn.io/docs/latest/deploy/install/#installation-requirements)
---
### 12. ClusterIP Services Not Accessible
**What goes wrong:** After installing monitoring stack, Grafana/Prometheus aren't accessible externally.
**Why it happens:** k3s defaults to ClusterIP for services. Single-node setups need explicit ingress or LoadBalancer configuration.
**Prevention:**
```yaml
# kube-prometheus-stack values
grafana:
  ingress:
    enabled: true
    ingressClassName: traefik
    hosts:
      - grafana.kube2.tricnet.de
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.kube2.tricnet.de
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
```
**Phase to address:** Installation phase - configure ingress alongside deployment
---
### 13. Traefik v3 Breaking Changes for ArgoCD IngressRoute
**What goes wrong:** ArgoCD IngressRoute with gRPC support stops working after Traefik upgrade to v3.
**Why it happens:** Traefik v3 changed header matcher syntax from `Headers()` to `Header()`.
**Prevention:**
```yaml
# Traefik v2 (OLD - broken in v3)
match: Host(`argocd.example.com`) && Headers(`Content-Type`, `application/grpc`)
# Traefik v3 (NEW)
match: Host(`argocd.example.com`) && Header(`Content-Type`, `application/grpc`)
```
- Check Traefik version before applying IngressRoutes
- Test gRPC route after any Traefik upgrade
**Phase to address:** ArgoCD installation phase
**Sources:**
- [ArgoCD Issue #15534 - Traefik v3 docs](https://github.com/argoproj/argo-cd/issues/15534)
---
### 14. k3s Resource Exhaustion with Full Monitoring Stack
**What goes wrong:** Single-node k3s cluster becomes unresponsive after deploying full kube-prometheus-stack.
**Why it happens:**
- kube-prometheus-stack deploys many components (prometheus, alertmanager, grafana, node-exporter, kube-state-metrics)
- Default resource requests/limits are sized for larger clusters
- k3s server process itself needs ~500MB RAM
**Warning signs:**
- Pods stuck in Pending
- OOMKilled events
- Node NotReady status
**Prevention:**
```yaml
# Minimal kube-prometheus-stack for single-node
alertmanager:
  enabled: false # Disable if not using alerts
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 512Mi
grafana:
  resources:
    requests:
      memory: 128Mi
      cpu: 50m
    limits:
      memory: 256Mi
```
- Disable unnecessary components (alertmanager if no alerts configured)
- Set explicit resource limits lower than defaults
- Monitor cluster resources: `kubectl top nodes`
- Consider: 4GB RAM minimum for k3s + monitoring + workloads
**Phase to address:** Prometheus installation phase - right-size from start
**Sources:**
- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)
---
## Phase-Specific Warning Summary
| Phase | Likely Pitfall | Mitigation |
|-------|---------------|------------|
| Prerequisites | #11 Missing open-iscsi | Pre-flight check script |
| ArgoCD Installation | #4 TLS redirect loop, #8 Bootstrap | Test ingress immediately, plan bootstrap |
| ArgoCD + Gitea Integration | #1 Webhook parsing | Use Gogs webhook type, accept polling fallback |
| Prometheus Installation | #3 Volume growth, #5 ServiceMonitor, #6 Control plane, #14 Resources | Configure retention+size, verify RBAC, right-size |
| Loki Installation | #2 Disk full, #7 Promtail | Enable retention day one, verify log paths |
| Grafana Installation | #10 Default password, #12 ClusterIP | Set password, configure ingress |
| Application Configuration | #9 Sync waves | Use sparingly, only for real dependencies |
---
## Pre-Installation Checklist
Before starting installation, verify:
- [ ] open-iscsi installed on all nodes
- [ ] Longhorn healthy with available storage (check `kubectl get nodes` and Longhorn UI)
- [ ] Traefik version known (v2 vs v3 affects IngressRoute syntax)
- [ ] DNS entries configured for monitoring subdomains
- [ ] Gitea webhook type decision (use Gogs type, or accept polling fallback)
- [ ] Disk space planning: Loki retention + Prometheus retention + headroom
- [ ] Memory planning: k3s (~500MB) + monitoring (~1GB) + workloads
- [ ] Namespace strategy decided (monitoring namespace vs default)
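The disk-space line item reduces to simple arithmetic. For example, assuming roughly 500 MiB/day of log ingest (an illustrative figure), 7 days of retention, and 20% headroom:

```bash
# estimated Loki PVC size = daily ingest (MiB) * retention (days) / 1024, plus 20% headroom
awk 'BEGIN { printf "%.1f GiB\n", 500 * 7 / 1024 * 1.2 }'
# prints: 4.1 GiB
```

Run the same calculation for Prometheus using its `retentionSize`, and size each PVC above the result.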
---
## Existing Infrastructure Compatibility Notes
Based on the existing TaskPlanner setup:
**Traefik:** Already in use with cert-manager (letsencrypt-prod). New services should follow the same pattern:
```yaml
annotations:
  cert-manager.io/cluster-issuer: letsencrypt-prod
```
**Longhorn:** Already the storage class. New PVCs should use explicit `storageClassName: longhorn` and consider replica count for single-node (set to 1).
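A sketch of a single-replica StorageClass for this (the name is illustrative; the parameters follow Longhorn's documented options):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single-replica
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"      # no benefit to >1 replica on a single node
  staleReplicaTimeout: "30"
```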
**Gitea:** Repository already configured at `git.kube2.tricnet.de`. ArgoCD Application already exists in `argocd/application.yaml` - don't duplicate.
**Existing ArgoCD Application:** TaskPlanner is already configured with ArgoCD. The monitoring stack should be a separate Application, not added to the existing one.
---
## Sources Summary
### Official Documentation
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/)
- [ArgoCD Sync Phases and Waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/)
- [Grafana Loki Retention](https://grafana.com/docs/loki/latest/operations/storage/retention/)
- [Grafana Loki Troubleshooting](https://grafana.com/docs/loki/latest/operations/troubleshooting/)
- [K3s Resource Profiling](https://docs.k3s.io/reference/resource-profiling)
### Community Issues (Verified Problems)
- [ArgoCD #16453 - Gitea webhook parsing](https://github.com/argoproj/argo-cd/issues/16453)
- [ArgoCD #20444 - Gitea support](https://github.com/argoproj/argo-cd/issues/20444)
- [Loki #5242 - Retention not working](https://github.com/grafana/loki/issues/5242)
- [Longhorn #2222 - Volume expansion](https://github.com/longhorn/longhorn/issues/2222)
- [kube-prometheus-stack #3401 - Resource limits](https://github.com/prometheus-community/helm-charts/issues/3401)
- [Prometheus Operator #3383 - ServiceMonitor discovery](https://github.com/prometheus-operator/prometheus-operator/issues/3383)
### Tutorials and Guides
- [K3S Rocks - ArgoCD](https://k3s.rocks/argocd/)
- [K3S Rocks - Logging](https://k3s.rocks/logging/)
- [Bootstrapping ArgoCD](https://windsock.io/bootstrapping-argocd/)
- [Prometheus ServiceMonitor Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html)
- [Traefik Community - ArgoCD](https://community.traefik.io/t/serving-argocd-behind-traefik-ingress/15901)
---
*Pitfalls research for: CI/CD and Observability on k3s*
*Context: Adding to existing TaskPlanner deployment*
*Researched: 2026-02-03*

# Technology Stack: CI/CD Testing, ArgoCD GitOps, and Observability
**Project:** TaskPlanner v2.0 Production Operations
**Researched:** 2026-02-03
**Scope:** Stack additions for existing k3s-deployed SvelteKit app
## Executive Summary
This research covers three areas: (1) adding tests to the existing Gitea Actions pipeline, (2) ArgoCD for GitOps deployment automation, and (3) Prometheus/Grafana/Loki observability. The existing setup already has ArgoCD configured; research focuses on validating that configuration and adding the observability stack.
**Key finding:** Promtail is EOL on 2026-03-02. Use Grafana Alloy instead for log collection.
---
## 1. CI/CD Testing Stack
### Recommended Stack
| Component | Version | Purpose | Rationale |
|-----------|---------|---------|-----------|
| Playwright | ^1.58.1 (existing) | E2E testing | Already configured, comprehensive browser automation |
| Vitest | ^3.0.0 | Unit/component tests | Official Svelte recommendation for Vite-based projects |
| @testing-library/svelte | ^5.0.0 | Component testing utilities | Streamlined component assertions |
| mcr.microsoft.com/playwright | v1.58.1 | CI browser execution | Pre-installed browsers, eliminates install step |
### Why This Stack
**Playwright (keep existing):** Already configured with `playwright.config.ts` and `tests/docker-deployment.spec.ts`. The existing tests cover critical paths: health endpoint, CSRF-protected form submissions, and data persistence. Extend rather than replace.
**Vitest (add):** Svelte officially recommends Vitest for unit and component testing when using Vite (which SvelteKit uses). Vitest shares Vite's config, eliminating configuration overhead. Jest muscle memory transfers directly.
**NOT recommended:**
- Jest: Requires separate configuration, slower than Vitest, no Vite integration
- Cypress: Overlaps with Playwright; adding both creates maintenance burden
- @vitest/browser with Playwright: Adds complexity; save for later if jsdom proves insufficient
### Gitea Actions Workflow Updates
The existing workflow at `.gitea/workflows/build.yaml` needs a test stage. Gitea Actions uses GitHub Actions syntax.
**Recommended workflow structure:**
```yaml
name: Build and Push

on:
  push:
    branches: [master, main]
  pull_request:
    branches: [master, main]

env:
  REGISTRY: git.kube2.tricnet.de
  IMAGE_NAME: tho/taskplaner

jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: mcr.microsoft.com/playwright:v1.58.1-noble
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Run type check
        run: npm run check
      - name: Run unit tests
        run: npm run test:unit
      - name: Run E2E tests
        run: npm run test:e2e
        env:
          CI: true
  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name != 'pull_request'
    steps:
      # ... existing build steps ...
```
**Key decisions:**
- Use Playwright Docker image to avoid browser installation (saves 2-3 minutes)
- Run tests before build to fail fast
- Only build/push on push to master, not PRs
- Type checking (`svelte-check`) catches errors before runtime
### Package.json Scripts to Add
```json
{
  "scripts": {
    "test": "npm run test:unit && npm run test:e2e",
    "test:unit": "vitest run",
    "test:unit:watch": "vitest",
    "test:e2e": "playwright test",
    "test:e2e:docker": "BASE_URL=http://localhost:3000 playwright test tests/docker-deployment.spec.ts"
  }
}
```
### Installation
```bash
# Add Vitest and testing utilities
npm install -D vitest @testing-library/svelte jsdom
```
### Vitest Configuration
Create `vitest.config.ts`:
```typescript
import { defineConfig } from 'vitest/config';
import { sveltekit } from '@sveltejs/kit/vite';

export default defineConfig({
  plugins: [sveltekit()],
  test: {
    include: ['src/**/*.{test,spec}.{js,ts}'],
    environment: 'jsdom',
    globals: true,
    setupFiles: ['./src/test-setup.ts']
  }
});
```
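With this config, a first unit test can live next to the code it covers. A minimal sketch (the `addDays` helper is hypothetical, standing in for any pure function in the app):

```typescript
// src/lib/date.test.ts - illustrative example
import { describe, expect, it } from 'vitest';

// Hypothetical helper under test
function addDays(date: Date, days: number): Date {
  const result = new Date(date);
  result.setDate(result.getDate() + days);
  return result;
}

describe('addDays', () => {
  it('rolls over month boundaries', () => {
    expect(addDays(new Date('2026-01-30'), 3).toISOString().slice(0, 10)).toBe('2026-02-02');
  });
});
```

Component tests using @testing-library/svelte follow the same file pattern and are picked up by the same `include` glob.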
### Confidence: HIGH
Sources:
- [Svelte Testing Documentation](https://svelte.dev/docs/svelte/testing) - Official recommendation for Vitest
- [Playwright CI Setup](https://playwright.dev/docs/ci-intro) - Docker image and CI best practices
- Existing `playwright.config.ts` in project
---
## 2. ArgoCD GitOps Stack
### Current State
ArgoCD is already configured in `argocd/application.yaml`. The configuration is correct and follows best practices:
```yaml
syncPolicy:
  automated:
    prune: true     # Removes resources deleted from Git
    selfHeal: true  # Reverts manual changes
```
### Recommended Stack
| Component | Version | Purpose | Rationale |
|-----------|---------|---------|-----------|
| ArgoCD Helm Chart | 9.4.0 | GitOps controller | Latest stable, deploys ArgoCD v3.3.0 |
### What's Already Done (No Changes Needed)
1. **Application manifest:** `argocd/application.yaml` correctly points to `helm/taskplaner`
2. **Auto-sync enabled:** `automated.prune` and `selfHeal` are configured
3. **Git-based image tags:** Pipeline updates `values.yaml` with new image tag
4. **Namespace creation:** `CreateNamespace=true` is set
### What May Need Verification
1. **ArgoCD installation:** Verify ArgoCD is actually deployed on the k3s cluster
2. **Repository credentials:** If the Gitea repo is private, ArgoCD needs credentials
3. **Registry secret:** The `gitea-registry-secret` placeholder needs real credentials
### Installation (if ArgoCD not yet installed)
```bash
# Add ArgoCD Helm repository
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

# Install ArgoCD (minimal for single-node k3s)
helm install argocd argo/argo-cd \
  --namespace argocd \
  --create-namespace \
  --set server.service.type=ClusterIP \
  --set configs.params.server\.insecure=true # If behind Traefik TLS termination
```
### Apply Application
```bash
kubectl apply -f argocd/application.yaml
```
### NOT Recommended
- **ArgoCD Image Updater:** Overkill for a single-app deployment; the current approach of updating `values.yaml` in Git is simpler and provides a better audit trail
- **ApplicationSets:** Unnecessary for single environment
- **App of Apps pattern:** Unnecessary complexity for one application
### Confidence: HIGH
Sources:
- [ArgoCD Helm Chart on Artifact Hub](https://artifacthub.io/packages/helm/argo/argo-cd) - Version 9.4.0 confirmed
- [ArgoCD Helm GitHub Releases](https://github.com/argoproj/argo-helm/releases) - Release notes
- Existing `argocd/application.yaml` in project
---
## 3. Observability Stack
### Recommended Stack
| Component | Chart | Version | Purpose |
|-----------|-------|---------|---------|
| kube-prometheus-stack | prometheus-community/kube-prometheus-stack | 81.4.2 | Prometheus + Grafana + Alertmanager |
| Loki | grafana/loki | 6.51.0 | Log aggregation (monolithic mode) |
| Grafana Alloy | grafana/alloy | 1.5.3 | Log collection agent |
### Why This Stack
**kube-prometheus-stack (not standalone Prometheus):** Single chart deploys Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics. Pre-configured with Kubernetes dashboards. This is the standard approach.
**Loki (not ELK/Elasticsearch):** "Like Prometheus, but for logs." Integrates natively with Grafana. Much lower resource footprint than Elasticsearch. Uses same label-based querying as Prometheus.
**Grafana Alloy (not Promtail):** CRITICAL - Promtail reaches End-of-Life on 2026-03-02 (next month). Grafana Alloy is the official replacement. It's based on OpenTelemetry Collector and supports logs, metrics, and traces in one agent.
### NOT Recommended
- **Promtail:** EOL 2026-03-02. Do not install; use Alloy
- **loki-stack Helm chart:** Deprecated, no longer maintained
- **Elasticsearch/ELK:** Resource-heavy, complex, overkill for single-user app
- **Loki microservices mode:** Requires 3+ nodes, object storage; overkill for personal app
- **Separate Prometheus + Grafana charts:** kube-prometheus-stack bundles them correctly
### Architecture
```
                             +------------------+
                             |     Grafana      |
                             | (Dashboards/UI)  |
                             +--------+---------+
                                      |
                     +----------------+-----------------+
                     |                                  |
            +--------v---------+             +----------v---------+
            |    Prometheus    |             |        Loki        |
            |    (Metrics)     |             |       (Logs)       |
            +--------+---------+             +----------+---------+
                     |                                  |
      +--------------+---------------+                  |
      |              |               |                  |
+-----v-----+  +-----v-----+  +------v------+  +--------v---------+
| node-     |  | kube-     |  | TaskPlanner |  |  Grafana Alloy   |
| exporter  |  | state-    |  |  /metrics   |  |  (Log Shipper)   |
|           |  | metrics   |  |             |  |                  |
+-----------+  +-----------+  +-------------+  +------------------+
```
### Installation
```bash
# Add Helm repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Create monitoring namespace
kubectl create namespace monitoring

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml

# Install Loki (monolithic mode for single-node)
helm install loki grafana/loki \
  --namespace monitoring \
  --values loki-values.yaml

# Install Alloy for log collection
helm install alloy grafana/alloy \
  --namespace monitoring \
  --values alloy-values.yaml
```
### Recommended Values Files
#### prometheus-values.yaml (minimal for k3s single-node)
```yaml
# Reduce resource usage for single-node k3s
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: 16GB  # keep below the 20Gi PVC (see pitfall: volume growth)
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 2Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn  # Use existing Longhorn
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 200m
        memory: 256Mi
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi

grafana:
  persistence:
    enabled: true
    storageClassName: longhorn
    size: 5Gi
  # Grafana will be exposed via Traefik
  ingress:
    enabled: true
    ingressClassName: traefik
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - grafana.kube2.tricnet.de
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.kube2.tricnet.de

# Disable components not needed for single-node
kubeControllerManager:
  enabled: false # k3s bundles this differently
kubeScheduler:
  enabled: false # k3s bundles this differently
kubeProxy:
  enabled: false # k3s uses different proxy
```
#### loki-values.yaml (monolithic mode)
```yaml
deploymentMode: SingleBinary

loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h  # must stay 24h for retention to work
  # Retention must be enabled explicitly or Loki fills the disk (see pitfalls)
  compactor:
    retention_enabled: true
    delete_request_store: filesystem
  limits_config:
    retention_period: 168h  # 7 days

singleBinary:
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 1Gi
  persistence:
    enabled: true
    storageClass: longhorn
    size: 10Gi

# Disable components not needed for monolithic
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

# Gateway not needed for internal access
gateway:
  enabled: false
```
#### alloy-values.yaml
```yaml
alloy:
  configMap:
    content: |-
      // Discover and collect logs from all pods
      discovery.kubernetes "pods" {
        role = "pod"
      }

      discovery.relabel "pods" {
        targets = discovery.kubernetes.pods.targets

        rule {
          source_labels = ["__meta_kubernetes_namespace"]
          target_label  = "namespace"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_name"]
          target_label  = "pod"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_container_name"]
          target_label  = "container"
        }
      }

      loki.source.kubernetes "pods" {
        targets    = discovery.relabel.pods.output
        forward_to = [loki.write.local.receiver]
      }

      loki.write "local" {
        endpoint {
          url = "http://loki.monitoring.svc:3100/loki/api/v1/push"
        }
      }
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 256Mi

controller:
  type: daemonset
```
### TaskPlanner Metrics Endpoint
The app needs a `/metrics` endpoint for Prometheus to scrape. SvelteKit options:
1. **prom-client library** (recommended): Standard Prometheus client for Node.js
2. **Custom endpoint**: Simple counter/gauge implementation
Install the client (this adds it to `package.json`):
```bash
npm install prom-client
```
Add ServiceMonitor for Prometheus to scrape TaskPlanner:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: taskplaner
  namespace: monitoring
  labels:
    release: prometheus # Must match Prometheus selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: taskplaner
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```
### Resource Summary
Total additional resource requirements for observability:
| Component | CPU Request | Memory Request | Storage |
|-----------|-------------|----------------|---------|
| Prometheus | 200m | 512Mi | 20Gi |
| Alertmanager | 50m | 64Mi | 5Gi |
| Grafana | 100m | 128Mi | 5Gi |
| Loki | 100m | 256Mi | 10Gi |
| Alloy (per node) | 50m | 64Mi | - |
| **Total** | ~500m | ~1Gi | 40Gi |
This fits comfortably on a single k3s node with 4+ cores and 8GB+ RAM.
### Confidence: HIGH
Sources:
- [kube-prometheus-stack on Artifact Hub](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) - Version 81.4.2
- [Grafana Loki Helm Installation](https://grafana.com/docs/loki/latest/setup/install/helm/) - Monolithic mode guidance
- [Grafana Alloy Kubernetes Deployment](https://grafana.com/docs/alloy/latest/set-up/install/kubernetes/) - Alloy setup
- [Promtail Deprecation Notice](https://grafana.com/docs/loki/latest/send-data/promtail/installation/) - EOL 2026-03-02
- [Migrate from Promtail to Alloy](https://grafana.com/docs/alloy/latest/set-up/migrate/from-promtail/) - Migration guide
---
## Summary: What to Install
### Immediate Actions
| Category | Add | Version | Notes |
|----------|-----|---------|-------|
| Testing | vitest | ^3.0.0 | Unit tests |
| Testing | @testing-library/svelte | ^5.0.0 | Component testing |
| Metrics | prom-client | ^15.0.0 | Prometheus metrics from app |
### Helm Charts to Deploy
| Chart | Repository | Version | Namespace |
|-------|------------|---------|-----------|
| kube-prometheus-stack | prometheus-community | 81.4.2 | monitoring |
| loki | grafana | 6.51.0 | monitoring |
| alloy | grafana | 1.5.3 | monitoring |
### Already Configured (Verify, Don't Re-install)
| Component | Status | Action |
|-----------|--------|--------|
| ArgoCD Application | Configured in `argocd/application.yaml` | Verify ArgoCD is running |
| Playwright | Configured in `playwright.config.ts` | Keep, extend tests |
### Do NOT Install
| Component | Reason |
|-----------|--------|
| Promtail | EOL 2026-03-02, use Alloy instead |
| loki-stack chart | Deprecated, unmaintained |
| Elasticsearch/ELK | Overkill, resource-heavy |
| Jest | Vitest is better for Vite projects |
| ArgoCD Image Updater | Current Git-based approach is simpler |
---
## Helm Repository Commands
```bash
# Add all needed repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
# Verify
helm search repo prometheus-community/kube-prometheus-stack
helm search repo grafana/loki
helm search repo grafana/alloy
helm search repo argo/argo-cd
```
---
## Sources
### Official Documentation
- [Svelte Testing](https://svelte.dev/docs/svelte/testing)
- [Playwright CI Setup](https://playwright.dev/docs/ci-intro)
- [ArgoCD Helm Chart](https://artifacthub.io/packages/helm/argo/argo-cd)
- [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack)
- [Grafana Loki Helm](https://grafana.com/docs/loki/latest/setup/install/helm/)
- [Grafana Alloy](https://grafana.com/docs/alloy/latest/set-up/install/kubernetes/)
### Critical Updates
- [Promtail EOL Notice](https://grafana.com/docs/loki/latest/send-data/promtail/installation/) - EOL 2026-03-02
- [Promtail to Alloy Migration](https://grafana.com/docs/alloy/latest/set-up/migrate/from-promtail/)

---
# Project Research Summary: v2.0 CI/CD and Observability
**Project:** TaskPlanner v2.0 Production Operations
**Domain:** CI/CD Testing, GitOps Deployment, and Kubernetes Observability
**Researched:** 2026-02-03
**Confidence:** HIGH
## Executive Summary
This research covers production-readiness improvements for a self-hosted SvelteKit task management application running on k3s. The milestone adds three capabilities: (1) automated testing in the existing Gitea Actions pipeline, (2) ArgoCD-based GitOps deployment automation, and (3) a complete observability stack (Prometheus, Grafana, Loki). The infrastructure foundation already exists—k3s cluster, Gitea with Actions, Traefik ingress, Longhorn storage, and a defined ArgoCD Application manifest.
**Recommended approach:** Implement in three phases prioritizing operational foundation first. Phase 1 enables GitOps automation (ArgoCD), Phase 2 establishes observability (kube-prometheus-stack + Loki/Alloy), and Phase 3 hardens the CI pipeline with comprehensive testing. This ordering delivers immediate value (hands-off deployments) before adding observability, then solidifies quality gates last. The stack is standard for self-hosted k3s: ArgoCD for GitOps, kube-prometheus-stack for metrics/dashboards, Loki in monolithic mode for logs, and Grafana Alloy for log collection (Promtail is EOL March 2026).
**Key risks:** (1) ArgoCD + Traefik TLS termination requires `server.insecure: true` or redirect loops occur, (2) Loki disk exhaustion without retention configuration (filesystem storage has no size limits), (3) k3s control plane metrics need explicit endpoint configuration, and (4) Gitea webhooks fail JSON parsing with ArgoCD (use polling or accept webhook limitations). All risks have documented mitigations from production k3s deployments.
## Key Findings
### Recommended Stack
**GitOps:** ArgoCD is already configured in `argocd/application.yaml` with correct auto-sync and self-heal policies. The Application manifest exists but ArgoCD server installation is needed. Gitea webhooks to ArgoCD have known JSON parsing issues (Gitea uses Gogs format but ArgoCD expects GitHub); fallback to 3-minute polling is acceptable for single-user workload. ArgoCD Image Updater is unnecessary—the existing pattern of updating `values.yaml` in Git provides better audit trails.
**Observability:** The standard k3s stack is kube-prometheus-stack (single Helm chart bundling Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics), Loki in monolithic SingleBinary mode for logs, and Grafana Alloy for log collection. CRITICAL: Promtail reaches End-of-Life on 2026-03-02 (next month)—use Alloy instead. Loki's monolithic mode uses filesystem storage, appropriate for single-node deployments under 100GB/day log volume. k3s requires explicit configuration to expose control plane metrics (scheduler, controller-manager bind to localhost by default).
**Testing:** Playwright is already configured with E2E tests in `tests/docker-deployment.spec.ts`. Add Vitest for unit/component testing (official Svelte recommendation for Vite-based projects). Use the Playwright Docker image (`mcr.microsoft.com/playwright:v1.58.1-noble`) in Gitea Actions to avoid 2-3 minute browser installation overhead. Run tests before build to fail fast.
**Core technologies:**
- **ArgoCD 3.3.0** (via Helm chart 9.4.0): GitOps deployment automation — already configured, needs installation
- **kube-prometheus-stack 81.4.2**: Bundled Prometheus + Grafana + Alertmanager — standard k3s observability stack
- **Loki 6.51.0** (monolithic mode): Log aggregation — lightweight, label-based like Prometheus
- **Grafana Alloy 1.5.3**: Log collection agent — Promtail replacement (EOL March 2026)
- **Vitest 3.0**: Unit/component tests — official Svelte recommendation, shares Vite config
- **Playwright 1.58.1**: E2E testing — already in use, comprehensive browser automation
### Expected Features
**Must have (table stakes):**
- **Automated tests in CI pipeline** — without tests, pipeline is just a build script; fail fast before deployment
- **GitOps auto-sync** — manual `helm upgrade` defeats CI/CD purpose; Git is single source of truth
- **Self-healing deployments** — ArgoCD reverts manual changes to maintain Git state
- **Basic metrics collection** — Prometheus scraping cluster and app metrics for visibility
- **Metrics visualization** — Grafana dashboards; metrics without visualization are useless
- **Log aggregation** — Loki centralized logging; no more `kubectl logs` per pod
- **Basic alerting** — 3-5 critical alerts (pod crashes, OOM, app down, disk full)
**Should have (differentiators):**
- **Application-level metrics** — custom Prometheus metrics in TaskPlanner (`/metrics` endpoint)
- **Gitea webhook integration** — reduces sync delay from 3min to seconds (accept limitations)
- **Smoke tests on deploy** — verify deployment health after ArgoCD sync
- **k3s control plane monitoring** — scheduler, controller-manager metrics in dashboards
- **Traefik metrics integration** — ingress traffic patterns and latency
**Defer (v2+):**
- **Distributed tracing** — overkill unless debugging microservices latency
- **SLO/SLI dashboards** — error budgets and reliability tracking (nice-to-have for learning)
- **Log-based alerting** — Loki alerting rules beyond basic metrics alerts
- **DORA metrics** — deployment frequency, lead time tracking
- **Vulnerability scanning** — Trivy for container images, npm audit
**Anti-features (actively avoid):**
- **Multi-environment promotion** — single user, single environment; deploy directly to prod
- **Blue-green/canary deployments** — complex rollout for single-user app
- **ArgoCD high availability** — HA for multi-team, not personal projects
- **ELK stack** — resource-heavy; Loki is lightweight alternative
- **Secrets management (Vault)** — overkill; Kubernetes secrets sufficient
- **Policy enforcement (OPA)** — single user has no policy conflicts
### Architecture Approach
The existing architecture has Gitea Actions building Docker images and pushing to Gitea Container Registry, then updating `helm/taskplaner/values.yaml` with the new image tag via Git commit. ArgoCD watches this repository and syncs changes to the k3s cluster. The observability stack integrates via ServiceMonitors (for Prometheus scraping), Alloy DaemonSet (for log collection), and Traefik ingress (for Grafana/ArgoCD UIs).
**Integration points:**
1. **Gitea → ArgoCD**: HTTPS repository clone (credentials in `argocd-secret`), optional webhook (Gogs type), automatic sync on Git changes
2. **Prometheus → Targets**: ServiceMonitors for TaskPlanner, Traefik, k3s control plane; scrapes `/metrics` endpoints every 30s
3. **Alloy → Loki**: DaemonSet reads `/var/log/pods`, forwards to the Loki HTTP endpoint in the `monitoring` namespace
4. **Grafana → Data Sources**: Auto-configured Prometheus and Loki datasources via kube-prometheus-stack integration
5. **Traefik → Ingress**: All UIs (Grafana, ArgoCD) exposed via Traefik with cert-manager TLS
**Namespace strategy:**
- `argocd`: ArgoCD server, repo-server, application-controller (standard convention)
- `monitoring`: Prometheus, Grafana, Alertmanager, plus Loki SingleBinary and the Alloy DaemonSet (kube-prometheus-stack default; matches the install commands in this document — a dedicated `loki` namespace is an option if resource isolation is wanted)
- `default`: TaskPlanner application (existing)
**Major components:**
1. **ArgoCD Server** — GitOps controller; watches Git, syncs to cluster, exposes UI/API
2. **Prometheus** — metrics storage and querying; scrapes targets via ServiceMonitors
3. **Grafana** — visualization layer; queries Prometheus and Loki, displays dashboards
4. **Loki** — log aggregation; receives from Alloy, stores on filesystem, queries via LogQL
5. **Alloy DaemonSet** — log collection; reads pod logs, ships to Loki with Kubernetes labels
6. **kube-state-metrics** — Kubernetes object metrics (pod status, deployments, etc.)
7. **node-exporter** — node-level metrics (CPU, memory, disk, network)
**Data flows:**
- **Metrics**: TaskPlanner/Traefik/k3s expose `/metrics` → Prometheus scrapes → Grafana queries → dashboards display
- **Logs**: Pod stdout/stderr → `/var/log/pods` → Alloy reads → Loki stores → Grafana Explore queries
- **GitOps**: Developer pushes Git → Gitea Actions builds → updates values.yaml → ArgoCD syncs → Kubernetes deploys
- **Observability**: Metrics + Logs converge in Grafana for unified troubleshooting
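The two query paths converge in Grafana; as a concrete illustration (label and metric names assume the Alloy relabel rules and kube-state-metrics defaults from this document):

```
# LogQL (Explore, Loki datasource) — errors from the app container
{namespace="default", container="taskplaner"} |= "error"

# PromQL (Prometheus datasource) — container restarts over the last hour
increase(kube_pod_container_status_restarts_total{namespace="default"}[1h])
```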
### Critical Pitfalls
1. **ArgoCD + Traefik TLS Redirect Loop** — ArgoCD expects HTTPS but Traefik terminates TLS, causing infinite 307 redirects. Set `server.insecure: true` in `argocd-cmd-params-cm` ConfigMap. Use IngressRoute (not Ingress) for proper gRPC support with correct Header matcher syntax.
2. **Loki Disk Exhaustion Without Retention** — Loki fills disk because retention is disabled by default and only supports time-based retention (no size limits). Configure `compactor.retention_enabled: true` with `retention_period: 168h` (7 days). Set up Prometheus alert for PVC > 80% usage. Index period MUST be 24h for retention to work.
3. **Prometheus Volume Growth Exceeds PVC** — Default 15-day retention without size limits causes disk full. Set BOTH `retention: 7d` AND `retentionSize: 8GB`. Size PVC with 20% headroom. Longhorn volume expansion has known issues requiring pod stop, detach, resize, restart procedure.
4. **k3s Control Plane Metrics Not Scraped** — k3s runs scheduler/controller-manager as single binary binding to localhost, not as pods. Modify `/etc/rancher/k3s/config.yaml` to set `bind-address=0.0.0.0` for each component, then restart k3s. Configure explicit endpoints with k3s server IP in kube-prometheus-stack values.
5. **Gitea Webhook JSON Parsing Failure** — ArgoCD treats Gitea webhooks as GitHub events but field types differ (e.g., `repository.created_at` is string in Gitea, int64 in GitHub). Webhooks silently fail with parsing errors in ArgoCD logs. Use Gogs webhook type or accept 3-minute polling interval as fallback.
6. **ServiceMonitor Not Discovering Targets** — Label selector mismatch between Prometheus CR and ServiceMonitor, or RBAC issues. Use port NAME (not number) in ServiceMonitor endpoints. Set `serviceMonitorSelector: {}` for permissive selection. Verify RBAC with `kubectl auth can-i list endpoints`.
7. **k3s Resource Exhaustion** — Full kube-prometheus-stack deploys many components sized for larger clusters. Single-node k3s with 8GB RAM needs explicit resource limits. Disable alertmanager if not using alerts. Set Prometheus to `256Mi` request, Grafana to `128Mi`. Monitor with `kubectl top nodes`.
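For pitfall #4, the k3s side of the fix is a few lines in the server config — a sketch using the standard k3s component-arg passthrough keys (verify against the installed k3s version):

```yaml
# /etc/rancher/k3s/config.yaml — expose control plane metrics off localhost
kube-controller-manager-arg:
  - bind-address=0.0.0.0
kube-scheduler-arg:
  - bind-address=0.0.0.0
kube-proxy-arg:
  - metrics-bind-address=0.0.0.0
etcd-expose-metrics: true  # only relevant with embedded etcd
```

Apply with `systemctl restart k3s`, then point the kube-prometheus-stack endpoint values at the k3s server IP.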
## Implications for Roadmap
Based on research, suggested phase structure prioritizes operational foundation before observability, then CI hardening:
### Phase 1: GitOps Foundation (ArgoCD)
**Rationale:** Eliminates manual `helm upgrade` commands and establishes Git as single source of truth. ArgoCD is the lowest-hanging fruit—Application manifest already exists, just needs server installation. Immediate value: hands-off deployments.
**Delivers:**
- ArgoCD installed via Helm in `argocd` namespace
- Existing `argocd/application.yaml` applied and syncing
- Auto-sync with self-heal enabled (already configured)
- Traefik ingress for ArgoCD UI with TLS
- Health checks showing deployment status
**Addresses:**
- Automated deployment trigger (table stakes from FEATURES.md)
- Git as single source of truth (GitOps principle)
- Self-healing (prevents manual drift)
**Avoids:**
- Pitfall #1: ArgoCD TLS redirect loop (configure `server.insecure: true`)
- Pitfall #5: Gitea webhook parsing (use Gogs type or polling)
**Configuration needed:**
- ArgoCD Helm values with `server.insecure: true`
- Gitea repository credentials in `argocd-secret`
- IngressRoute for ArgoCD UI (Traefik v3 syntax)
- Optional webhook in Gitea (test but accept polling fallback)
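The IngressRoute item above can be sketched as follows, based on the ArgoCD ingress documentation's Traefik pattern; the hostname `argocd.kube2.tricnet.de` is an assumption following this project's domain scheme:

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: argocd-server
  namespace: argocd
spec:
  entryPoints:
    - websecure
  routes:
    # Web UI over plain HTTP to the backend (server.insecure: true, TLS at Traefik)
    - kind: Rule
      match: Host(`argocd.kube2.tricnet.de`)
      priority: 10
      services:
        - name: argocd-server
          port: 80
    # gRPC for the argocd CLI needs h2c to the backend
    - kind: Rule
      match: Host(`argocd.kube2.tricnet.de`) && Header(`Content-Type`, `application/grpc`)
      priority: 11
      services:
        - name: argocd-server
          port: 80
          scheme: h2c
  tls:
    secretName: argocd-server-tls
```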
### Phase 2: Observability Stack (Prometheus/Grafana/Loki)
**Rationale:** Can't operate what you can't see. Establishes visibility before adding CI complexity. Observability enables debugging issues from Phase 1 and provides baseline before Phase 3 changes.
**Delivers:**
- kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
- k3s control plane metrics exposed and scraped
- Pre-built Kubernetes dashboards in Grafana
- Loki in monolithic mode with retention configured
- Alloy DaemonSet collecting pod logs
- 3-5 critical alerts (pod crashes, OOM, disk full, app down)
- Traefik metrics integration
- Ingress for Grafana UI with TLS
**Addresses:**
- Basic metrics collection (table stakes)
- Metrics visualization (table stakes)
- Log aggregation (table stakes)
- Basic alerting (table stakes)
- k3s control plane monitoring (differentiator)
**Avoids:**
- Pitfall #2: Loki disk full (configure retention from day one)
- Pitfall #3: Prometheus volume growth (set retention + size limits)
- Pitfall #4: k3s metrics not scraped (configure endpoints)
- Pitfall #6: ServiceMonitor discovery (verify RBAC, use port names)
- Pitfall #7: Resource exhaustion (right-size for single-node)
**Configuration needed:**
- Modify `/etc/rancher/k3s/config.yaml` to expose control plane metrics
- kube-prometheus-stack values with k3s-specific endpoints and resource limits
- Loki values with retention enabled and monolithic mode
- Alloy values with Kubernetes log discovery pointing to Loki
- ServiceMonitors for Traefik (and future TaskPlanner metrics)
**Sub-phases:**
1. Configure k3s metrics exposure (restart k3s)
2. Install kube-prometheus-stack (Prometheus + Grafana)
3. Install Loki + Alloy (log aggregation)
4. Verify dashboards and create critical alerts
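The "3-5 critical alerts" deliverable can be sketched as a PrometheusRule; deployment name and thresholds are assumptions to adapt:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: taskplaner-critical
  namespace: monitoring
  labels:
    release: prometheus  # must match the Prometheus ruleSelector
spec:
  groups:
    - name: taskplaner.critical
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total{namespace="default"}[15m]) > 3
          for: 5m
          labels: { severity: critical }
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting repeatedly"
        - alert: AppDown
          expr: kube_deployment_status_replicas_available{namespace="default", deployment="taskplaner"} < 1
          for: 5m
          labels: { severity: critical }
          annotations:
            summary: "TaskPlanner has no available replicas"
        - alert: PVCAlmostFull
          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8
          for: 15m
          labels: { severity: warning }
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} above 80% (guards the Loki/Prometheus disk pitfalls)"
```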
### Phase 3: CI Pipeline Hardening (Tests)
**Rationale:** Tests catch bugs before deployment. Comes last because Phases 1-2 provide operational foundation to observe test failures and deployment issues. Playwright already configured; just needs integration into pipeline plus Vitest addition.
**Delivers:**
- Vitest installed for unit/component tests
- Test suite structure established
- Gitea Actions workflow updated with test stage
- Tests run before build (fail fast)
- Playwright Docker image for browser tests (no install overhead)
- Type checking (`svelte-check`) in pipeline
- NPM scripts for local testing
**Addresses:**
- Automated tests in pipeline (table stakes)
- Lint/static analysis (table stakes)
- Pipeline fail-fast principle
**Avoids:**
- Over-engineering with extensive E2E suite (start simple)
- Test complexity that slows iterations
**Configuration needed:**
- Install Vitest + @testing-library/svelte
- Create `vitest.config.ts`
- Update `.gitea/workflows/build.yaml` with test job
- Add NPM scripts for test commands
- Configure test container image
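A minimal sketch of `vitest.config.ts`, assuming a jsdom environment with `@testing-library/svelte` from the stack table (adjust to the repo's actual Vite setup):

```typescript
// vitest.config.ts — reuses the SvelteKit Vite plugin so tests share app config
import { sveltekit } from '@sveltejs/kit/vite';
import { defineConfig } from 'vitest/config';

export default defineConfig({
  plugins: [sveltekit()],
  test: {
    environment: 'jsdom',
    include: ['src/**/*.{test,spec}.{js,ts}'],
    globals: true,
  },
});
```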
**Test pyramid for personal app:**
- Unit tests: 70% (Vitest, fast, isolated)
- Integration tests: 20% (API endpoints, database)
- E2E tests: 10% (Playwright, critical paths only)
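The pipeline change can be sketched as a test job added to `.gitea/workflows/build.yaml`; job and script names are assumptions, and the Playwright image is the one named in the research to skip browser installation:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: mcr.microsoft.com/playwright:v1.58.1-noble  # browsers preinstalled
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run check          # svelte-check type checking
      - run: npx vitest run         # unit/component tests
      - run: npx playwright test    # E2E critical paths
  build:
    needs: test  # fail fast: the image is only built when tests pass
    # existing build/push steps unchanged
```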
### Phase Ordering Rationale
**Why GitOps first:**
- ArgoCD configuration already exists (lowest effort)
- Immediate value: eliminates manual deployment
- Foundation for observing subsequent changes
- No dependencies on other phases
**Why Observability second:**
- Provides visibility into GitOps operations from Phase 1
- Required before adding CI complexity (Phase 3)
- k3s metrics configuration requires cluster restart (minimize disruptions)
- Baseline metrics needed to measure impact of changes
**Why CI Testing last:**
- Tests benefit from observability (can see failures in Grafana)
- GitOps ensures test failures block bad deployments
- Building on working foundation reduces moving parts
- Can iterate on test coverage after core infrastructure solid
**Dependencies respected:**
- Tests before build → CI pipeline structure
- ArgoCD watches Git → Git update triggers deploy
- Observability before app changes → baseline established
- Prometheus before alerts → scraping functional before alerting
### Research Flags
**Phases needing deeper research during planning:**
- **Phase 2.1 (k3s metrics)**: Verify exact k3s version and config file location; k3s installation methods vary
- **Phase 2.3 (Loki retention)**: Confirm disk capacity planning based on actual log volume
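For the Loki capacity-planning flag, a back-of-envelope sizing helper; the compression ratio and headroom factor are assumptions, not measured values:

```typescript
// Estimate the Loki PVC size (GiB) for filesystem storage.
export function lokiDiskGiB(
  mbPerDay: number,      // raw log volume ingested per day
  retentionDays = 7,     // matches retention_period: 168h
  compression = 0.3,     // assumed on-disk compression ratio
  headroom = 1.2,        // 20% headroom, mirroring the Prometheus sizing advice
): number {
  return (mbPerDay * retentionDays * compression * headroom) / 1024;
}

// A single-user app at ~200 MB/day stays far below the 10Gi PVC:
console.log(lokiDiskGiB(200).toFixed(2)); // prints "0.49"
```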
**Phases with standard patterns (skip research-phase):**
- **Phase 1 (ArgoCD)**: Well-documented Helm installation, existing Application manifest, standard Traefik pattern
- **Phase 2.2 (kube-prometheus-stack)**: Standard chart with k3s-specific values, extensive community examples
- **Phase 3 (Testing)**: Playwright already configured, Vitest is official Svelte recommendation
**Research confidence:**
- GitOps: HIGH (official ArgoCD docs + existing config)
- Observability: HIGH (official Helm charts + k3s community guides)
- Testing: HIGH (official Svelte docs + existing Playwright setup)
- Pitfalls: HIGH (verified with GitHub issues and production reports)
## Confidence Assessment
| Area | Confidence | Notes |
|------|------------|-------|
| Stack | HIGH | All components verified with official Helm charts and version numbers. Promtail EOL confirmed from Grafana docs. |
| Features | HIGH | Table stakes derived from CI/CD best practices and Kubernetes observability standards. Anti-features validated against homelab community patterns. |
| Architecture | HIGH | Integration patterns verified with official documentation (ArgoCD, Prometheus Operator, Loki). Namespace strategy follows community conventions. |
| Pitfalls | HIGH | All critical pitfalls sourced from verified GitHub issues with reproduction steps and fixes. k3s-specific issues confirmed from k3s.rocks tutorials. |
**Overall confidence:** HIGH
### Gaps to Address
**Gitea webhook reliability:** Research confirms JSON parsing issues with ArgoCD but workarounds exist (use Gogs type). Need to test in actual environment and decide whether to invest in debugging webhook vs. accepting 3-minute polling. For single-user workload, polling is acceptable.
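If the webhook route is pursued, ArgoCD reads the shared secret for Gogs-type payloads from `argocd-secret`; a sketch of the addition (key name per ArgoCD's webhook documentation, value is a placeholder to merge into the existing secret):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret  # merge this key into the existing secret, do not replace it
  namespace: argocd
stringData:
  webhook.gogs.secret: CHANGE_ME  # must match the secret entered in Gitea's webhook form
```

In Gitea: repository Settings → Webhooks → Add Webhook → Gogs, with target URL `https://<argocd-host>/api/webhook`.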
**k3s version compatibility:** Research assumes recent k3s (v1.27+). Need to verify actual cluster version and k3s installation method (server vs. embedded) affects config file location and metrics exposure. Standard install at `/etc/rancher/k3s/config.yaml` may differ for k3d or other variants.
**Longhorn replica count:** Single-node k3s requires Longhorn replica count set to 1 (default is 3). Verify existing Longhorn configuration handles this correctly for new PVCs created by observability stack.
**Resource capacity:** Research estimates ~1.2 CPU cores and ~1.7GB RAM for observability stack. Verify actual k3s node has headroom beyond existing TaskPlanner, Gitea, Traefik, Longhorn workloads. Minimum 4GB RAM recommended for k3s + monitoring + apps.
**TLS certificate limits:** Adding Grafana and ArgoCD ingresses increases Let's Encrypt certificate count. Verify current usage doesn't approach rate limits (50 certs per domain per week).
## Sources
### Primary (HIGH confidence)
**Official Documentation:**
- [Svelte Testing Documentation](https://svelte.dev/docs/svelte/testing) - Vitest recommendation
- [Playwright CI Setup](https://playwright.dev/docs/ci-intro) - Docker image and best practices
- [ArgoCD Helm Chart](https://artifacthub.io/packages/helm/argo/argo-cd) - Version 9.4.0
- [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) - Version 81.4.2
- [Grafana Loki Helm](https://grafana.com/docs/loki/latest/setup/install/helm/) - Monolithic mode
- [Grafana Alloy](https://grafana.com/docs/alloy/latest/set-up/install/kubernetes/) - Installation and config
- [Promtail EOL Notice](https://grafana.com/docs/loki/latest/send-data/promtail/installation/) - EOL 2026-03-02
- [ArgoCD Ingress Configuration](https://argo-cd.readthedocs.io/en/stable/operator-manual/ingress/) - TLS termination
- [Grafana Loki Retention](https://grafana.com/docs/loki/latest/operations/storage/retention/) - Compactor config
**Verified Issues:**
- [ArgoCD #16453](https://github.com/argoproj/argo-cd/issues/16453) - Gitea webhook parsing failure
- [Loki #5242](https://github.com/grafana/loki/issues/5242) - Retention not working
- [Longhorn #2222](https://github.com/longhorn/longhorn/issues/2222) - Volume expansion issues
- [kube-prometheus-stack #3401](https://github.com/prometheus-community/helm-charts/issues/3401) - Resource limits
- [Prometheus Operator #3383](https://github.com/prometheus-operator/prometheus-operator/issues/3383) - ServiceMonitor discovery
### Secondary (MEDIUM confidence)
**Community Tutorials:**
- [K3S Rocks - ArgoCD](https://k3s.rocks/argocd/) - k3s-specific ArgoCD setup
- [K3S Rocks - Logging](https://k3s.rocks/logging/) - Loki on k3s patterns
- [Prometheus on K3s](https://fabianlee.org/2022/07/02/prometheus-installing-kube-prometheus-stack-on-k3s-cluster/) - k3s control plane configuration
- [K3s Monitoring Guide](https://github.com/cablespaghetti/k3s-monitoring) - Complete k3s observability stack
- [Bootstrapping ArgoCD](https://windsock.io/bootstrapping-argocd/) - Initial setup patterns
- [ServiceMonitor Troubleshooting](https://managedkube.com/prometheus/operator/servicemonitor/troubleshooting/2019/11/07/prometheus-operator-servicemonitor-troubleshooting.html) - Common issues
**Best Practices:**
- [CI/CD Best Practices](https://www.jetbrains.com/teamcity/ci-cd-guide/ci-cd-best-practices/) - Testing pyramid, fail fast
- [Kubernetes Observability](https://www.usdsi.org/data-science-insights/kubernetes-observability-and-monitoring-trends-in-2026) - Stack selection
- [ArgoCD Best Practices](https://argo-cd.readthedocs.io/en/stable/user-guide/best_practices/) - Sync waves, self-management
### Tertiary (LOW confidence)
- None - all research verified with official sources or production issue reports
---
*Research completed: 2026-02-03*
*Ready for roadmap: Yes*
*Files synthesized: STACK-v2-cicd-observability.md, FEATURES.md, ARCHITECTURE.md, PITFALLS-CICD-OBSERVABILITY.md*

---
**File:** `argocd/SETUP.md` (new, 104 lines)
# ArgoCD GitOps Setup for TaskPlaner
This guide sets up automatic deployment of TaskPlaner using GitOps with ArgoCD and Gitea.
## Prerequisites
- Kubernetes cluster access
- Gitea instance with Packages (Container Registry) enabled
- Gitea Actions runner configured
## 1. Install ArgoCD
```bash
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
```
Wait for ArgoCD to be ready:
```bash
kubectl wait --for=condition=available deployment/argocd-server -n argocd --timeout=300s
```
## 2. Configure Gitea Registry Secrets
### For Gitea Actions (push access)
In Gitea repository settings, add these secrets:
- `REGISTRY_USERNAME`: Your Gitea username
- `REGISTRY_PASSWORD`: A Gitea access token with `write:package` scope
### For Kubernetes (pull access)
Create an image pull secret:
```bash
kubectl create secret docker-registry gitea-registry-secret \
  --docker-server=git.kube2.tricnet.de \
  --docker-username=YOUR_USERNAME \
  --docker-password=YOUR_ACCESS_TOKEN \
  -n default
```
## 3. Configure ArgoCD Repository Access
Add the Gitea repository to ArgoCD:
```bash
# Get ArgoCD admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo
# Port forward to access ArgoCD UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Or use CLI
argocd login localhost:8080 --insecure
argocd repo add https://git.kube2.tricnet.de/tho/taskplaner.git \
  --username YOUR_USERNAME \
  --password YOUR_ACCESS_TOKEN
```
## 4. Deploy the ArgoCD Application
```bash
kubectl apply -f argocd/application.yaml
```
Note: the registry pull secret is not embedded in `application.yaml`; it is created separately via `kubectl` (see step 2) so that credentials stay out of Git.
## 5. Verify Deployment
```bash
# Check ArgoCD application status
kubectl get applications -n argocd
# Watch sync status
argocd app get taskplaner
# Check pods
kubectl get pods -l app.kubernetes.io/name=taskplaner
```
## Workflow
1. Push code to `master` branch
2. Gitea Actions builds Docker image and pushes to registry
3. Workflow updates `helm/taskplaner/values.yaml` with new image tag
4. ArgoCD detects change and auto-syncs deployment
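Step 3 above can be sketched as a shell snippet. This is illustrative only: the scratch file is created here purely so the sketch runs standalone, and `NEW_TAG` stands in for the commit SHA a real workflow would use before committing and pushing the change.

```shell
# Sketch of the tag-bump step (illustrative; a real workflow edits the
# repo's actual helm/taskplaner/values.yaml, then commits and pushes).
mkdir -p helm/taskplaner
printf 'image:\n  tag: ""\n' > helm/taskplaner/values.yaml  # stand-in file
NEW_TAG="v1.2.3"  # normally the commit SHA from the CI run
sed -i "s|^  tag: .*|  tag: \"${NEW_TAG}\"|" helm/taskplaner/values.yaml
grep 'tag:' helm/taskplaner/values.yaml
```

ArgoCD then notices the changed manifest on its next poll of the repository and syncs the new tag automatically.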
## Troubleshooting
### Image Pull Errors
```bash
kubectl describe pod -l app.kubernetes.io/name=taskplaner
```
Check if the image pull secret is correctly configured.
### ArgoCD Sync Issues
```bash
argocd app sync taskplaner --force
argocd app logs taskplaner
```
### Actions Runner Issues
```bash
kubectl logs -n gitea -l app=act-runner -c runner
```

argocd/application.yaml Normal file

@@ -0,0 +1,44 @@
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: taskplaner
  namespace: argocd
spec:
  project: default
  source:
    repoURL: http://gitea-http.gitea.svc.cluster.local:3000/tho/taskplaner.git
    targetRevision: HEAD
    path: helm/taskplaner
    helm:
      valueFiles:
        - values.yaml
      parameters:
        - name: image.repository
          value: git.kube2.tricnet.de/tho/taskplaner
        - name: ingress.enabled
          value: "true"
        - name: ingress.className
          value: traefik
        - name: ingress.hosts[0].host
          value: task.kube2.tricnet.de
        - name: ingress.hosts[0].paths[0].path
          value: /
        - name: ingress.hosts[0].paths[0].pathType
          value: Prefix
        - name: ingress.tls[0].secretName
          value: taskplaner-tls
        - name: ingress.tls[0].hosts[0]
          value: task.kube2.tricnet.de
        - name: ingress.annotations.cert-manager\.io/cluster-issuer
          value: letsencrypt-prod
        - name: config.origin
          value: https://task.kube2.tricnet.de
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

argocd/repo-secret.yaml Normal file

@@ -0,0 +1,22 @@
# ArgoCD Repository Secret for TaskPlanner
# This file documents the secret structure. Apply using kubectl, not this file.
#
# To create the secret:
#   PASSWORD=$(kubectl get secret gitea-repo -n argocd -o jsonpath='{.data.password}' | base64 -d)
#   cat <<EOF | kubectl apply -f -
#   apiVersion: v1
#   kind: Secret
#   metadata:
#     name: taskplaner-repo
#     namespace: argocd
#     labels:
#       argocd.argoproj.io/secret-type: repository
#   stringData:
#     type: git
#     url: http://gitea-http.gitea.svc.cluster.local:3000/tho/taskplaner.git
#     username: admin
#     password: "$PASSWORD"
#   EOF
#
# The secret allows ArgoCD to access the TaskPlanner Git repository
# using internal cluster networking (gitea-http.gitea.svc.cluster.local).


@@ -3,12 +3,13 @@
 replicaCount: 1
 
 image:
-  repository: taskplaner
-  pullPolicy: IfNotPresent
+  repository: git.kube2.tricnet.de/tho/taskplaner
+  pullPolicy: Always
   # Overrides the image tag whose default is the chart appVersion
-  tag: ""
+  tag: "latest"
 
-imagePullSecrets: []
+imagePullSecrets:
+  - name: gitea-registry-secret
 
 nameOverride: ""
 fullnameOverride: ""


@@ -13,6 +13,7 @@
   // Transform tags to Svelecte format
   let tagOptions = $derived(availableTags.map((t) => ({ value: t.name, label: t.name })));
+  let availableTagNames = $derived(new Set(availableTags.map((t) => t.name.toLowerCase())));
 
   // Track selected tag names for Svelecte
   let selectedTagNames = $state(filters.tags);
@@ -22,6 +23,14 @@
     selectedTagNames = filters.tags;
   });
 
+  // Remove deleted tags from filter when availableTags changes
+  $effect(() => {
+    const validTags = filters.tags.filter((t) => availableTagNames.has(t.toLowerCase()));
+    if (validTags.length !== filters.tags.length) {
+      onchange({ ...filters, tags: validTags });
+    }
+  });
+
   function handleTypeChange(newType: 'task' | 'thought' | 'all') {
     onchange({ ...filters, type: newType });
   }