feat: Observability Stack — Prometheus + Grafana + Loki + Tempo + Alerting + Uptime Kuma #499

Open
Lucaaaaaaaaaaaaaaaaaaaaa wants to merge 1 commit into illbnm:master from Lucaaaaaaaaaaaaaaaaaaaaa:observability-stack

@Lucaaaaaaaaaaaaaaaaaaaaa
Summary

Closes #10

Services (10 total, all with healthchecks)

  • Prometheus v2.54.1 — 7 scrape targets (self, cadvisor, node-exporter, traefik, alertmanager, grafana, uptime-kuma)
  • Grafana v11.2.2 — OIDC via Authentik (homelab-admins→Admin, homelab-users→Viewer), auto-provisioned datasources + dashboard provider
  • Loki v3.2.0 — Log aggregation with configurable retention (LOKI_RETENTION)
  • Promtail v3.2.0 — Docker container log auto-discovery + syslog collection
  • Tempo v2.6.0 — Distributed tracing (OTLP HTTP/gRPC), Tempo→Prometheus metric generator
  • Alertmanager v0.27.0 — 3 severity routes → ntfy webhook (critical/warning/default)
  • cAdvisor v0.50.0 — Container resource metrics
  • Node Exporter v1.8.2 — Host metrics
  • Uptime Kuma v1.23.15 — Public status page at status.${DOMAIN}
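
As an illustration of the seven scrape targets listed for Prometheus, a minimal `prometheus.yml` sketch might look like the following. The container hostnames, ports, scrape interval, and rules path are assumptions chosen for illustration and are not taken from the PR; the Traefik target in particular assumes a dedicated metrics entrypoint.

```yaml
# Sketch only: hostnames, ports, and paths below are assumed, not from the PR.
global:
  scrape_interval: 15s          # assumed interval

rule_files:
  - /etc/prometheus/rules/*.yml # assumed mount point for host.yml, containers.yml, services.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: prometheus        # the "self" target
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: traefik
    static_configs:
      - targets: ["traefik:8082"]   # assumes a dedicated metrics entrypoint on 8082
  - job_name: alertmanager
    static_configs:
      - targets: ["alertmanager:9093"]
  - job_name: grafana
    static_configs:
      - targets: ["grafana:3000"]
  - job_name: uptime-kuma
    static_configs:
      - targets: ["uptime-kuma:3001"]
    # Uptime Kuma usually protects /metrics; a basic_auth block may be needed here.
```

Static targets keep the sketch readable; a real deployment could equally use Docker service discovery.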

Alert Rules (9 rules across 3 groups)

  • host.yml: HighCPU (>80%), HighMemory (>90%), HighDisk (>85%), DiskIOAnomaly
  • containers.yml: ContainerRestart (>3/h), ContainerOOM, ContainerHealthcheckFail
  • services.yml: Traefik5xx (>1%), HighResponseTime (P99>2s)
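
To make the thresholds above concrete, here is a hedged sketch of how three of the `host.yml` rules could be written as a Prometheus rule group. Only the alert names and thresholds come from the PR description; the PromQL expressions, `for` durations, severity labels, and the root-filesystem filter are assumptions.

```yaml
# Sketch only: expressions, durations, and labels are assumed, not copied from the PR.
groups:
  - name: host
    rules:
      - alert: HighCPU
        # Percentage of non-idle CPU time, averaged per instance.
        expr: >-
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"

      - alert: HighMemory
        # Share of memory that is not available to new workloads.
        expr: >-
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

      - alert: HighDisk
        # Used space on the root filesystem (mountpoint filter is an assumption).
        expr: >-
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 85% on {{ $labels.instance }}"
```

The `severity` label is what a severity-based Alertmanager route would match on, so each rule sets it explicitly.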

Pre-provisioned Dashboards

  • grafana-import-dashboards.sh imports: Node Exporter Full (1860), Docker Container (179), Traefik (17346), Loki (13639), Uptime Kuma (18278)
  • Datasources auto-provisioned: Prometheus, Loki, Tempo (with cross-linking)
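
The cross-linking between the three datasources could be provisioned roughly as follows. This is a sketch under assumptions: the `uid` values, service URLs, and the `trace_id=` log pattern are hypothetical and may differ from the files in this PR.

```yaml
# Sketch only: uids, URLs, and the trace-id regex are assumed for illustration.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Turn a trace id found in a log line into a link to Tempo.
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'   # assumes logs contain trace_id=<id>
          datasourceUid: tempo
          url: '$${__value.raw}'           # "$$" escapes env-var expansion in provisioning files

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki                # jump from a span to related logs
      serviceMap:
        datasourceUid: prometheus          # service graph from Tempo's generated metrics
```

Fixed `uid` values let the datasources reference each other without depending on Grafana's generated identifiers.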

Scripts

  • scripts/grafana-import-dashboards.sh — imports 5 dashboards from Grafana.com
  • scripts/uptime-kuma-setup.sh — monitor setup guide with service list

Generated/reviewed with: claude-opus-4-6

…rting + Uptime Kuma

- Prometheus v2.54.1: scrape configs for cadvisor, node-exporter, traefik, grafana, alertmanager, uptime-kuma
- Grafana v11.2.2: OIDC auth (Authentik), auto-provisioned datasources + dashboard provider
- Loki v3.2.0 + Promtail v3.2.0: Docker container log discovery + syslog collection
- Tempo v2.6.0: OTLP HTTP/gRPC receiver, Tempo → Prometheus metric generator
- Alertmanager v0.27.0: routes to ntfy webhook (critical/warning/default)
- cAdvisor v0.50.0 + Node Exporter v1.8.2: container + host metrics
- Uptime Kuma v1.23.15: public status page at status.
- Alert rules: host (CPU/mem/disk/IO), containers (restart/OOM/health), services (5xx/latency)
- grafana-import-dashboards.sh: imports 5 dashboards from Grafana.com
- uptime-kuma-setup.sh: monitor setup guide
- .env.example with retention config
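
For the Alertmanager routing mentioned in the commit (critical/warning/default to an ntfy webhook), a minimal sketch could look like this. The receiver names, grouping timings, and webhook URLs are hypothetical; the URLs assume some endpoint that accepts Alertmanager's JSON webhook payload and forwards it to ntfy.

```yaml
# Sketch only: receiver names, timings, and URLs are assumed, not from the PR.
route:
  receiver: ntfy-default            # fallback when no severity matcher applies
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - 'severity = "critical"'
      receiver: ntfy-critical
      repeat_interval: 1h           # re-notify sooner for critical alerts
    - matchers:
        - 'severity = "warning"'
      receiver: ntfy-warning

receivers:
  - name: ntfy-critical
    webhook_configs:
      - url: http://ntfy-bridge:8080/critical   # hypothetical endpoint
        send_resolved: true
  - name: ntfy-warning
    webhook_configs:
      - url: http://ntfy-bridge:8080/warning    # hypothetical endpoint
        send_resolved: true
  - name: ntfy-default
    webhook_configs:
      - url: http://ntfy-bridge:8080/default    # hypothetical endpoint
        send_resolved: true
```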

Development

Successfully merging this pull request may close these issues.

[BOUNTY $280] Observability — Prometheus + Grafana + Loki + Alerting
