feat: GoodJob oldest-queued-age, process count, and opt-in per-queue metrics by xrl · Pull Request #373 · discourse/prometheus_exporter

xrl · 2026-05-28T20:35:04Z

What & why

While building Grafana dashboards for a Rails app's GoodJob queues, the built-in GoodJob instrumentation gave us the job-state totals but was missing the three signals we reached for most:

good_job_oldest_queued_age_seconds — how long the oldest ready-to-run job has been waiting. This is the single best backlog/latency signal (a queue can have a small count but be badly behind), and it's what you'd actually alert on.
good_job_processes — number of active GoodJob processes (GoodJob::Process.active.count), so you can see whether workers are actually up.
per-queue breakdown — which queue is backing up, not just that something is.
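As a rough illustration of why oldest-queued age says more than a raw count, here is a minimal sketch in plain Ruby. The `QueuedJob` struct, the `ready_at` field, and the `oldest_queued_age_seconds` helper are hypothetical stand-ins for this example; they are not the PR's implementation, which queries GoodJob's own records.

```ruby
# Hypothetical stand-in for a queued job row; real GoodJob jobs are
# ActiveRecord models, and `ready_at` is an assumed field name.
QueuedJob = Struct.new(:queue_name, :ready_at)

# Age in seconds of the oldest ready-to-run job, or 0.0 when nothing waits.
def oldest_queued_age_seconds(jobs, now: Time.now)
  oldest = jobs.map(&:ready_at).min
  oldest ? [now - oldest, 0.0].max : 0.0
end

now = Time.at(1_000_000)
jobs = [
  QueuedJob.new("default", now - 5),
  QueuedJob.new("mailers", now - 900), # only one job, but badly behind
  QueuedJob.new("default", now - 30),
]

age = oldest_queued_age_seconds(jobs, now: now)
```

Only three jobs are queued here, yet the gauge reports 900 seconds, which is the "small count but badly behind" case described above.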

Opening this to add them and to get your thoughts on the per-queue approach.

Backwards compatibility

The two new gauges are purely additive to the existing collect payload and the server collector's GOOD_JOB_GAUGES.
Per-queue is opt-in via GoodJob.start(per_queue: true) (passed through PeriodicStats.start's **kwargs). Default is false, so existing deployments emit the exact same cluster-wide, unlabelled series as before.
With per_queue: true, the job-state gauges and the oldest-age gauge carry a queue label instead of being emitted as an aggregate (so sum(good_job_queued) still returns the cluster total), while good_job_processes stays unlabelled. A metric never mixes labelled and unlabelled series, so nothing double-counts.

The server collector needed no label logic — it already applies custom_labels, so the instrumentation just sends one object per queue with custom_labels: { queue: ... }.
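The per-queue payload shape can be sketched roughly as follows. The counts, the `per_queue_payloads` helper, the `"good_job"` type string, and the field names are assumptions made for illustration; only the `custom_labels: { queue: ... }` convention comes from the description above.

```ruby
# Hypothetical per-queue counts; in the PR these would come from GoodJob queries.
counts_by_queue = {
  "default" => { queued: 12, running: 3 },
  "mailers" => { queued: 4, running: 1 },
}

# Build one payload object per queue, each tagged with a queue label.
# The type string and field names are assumptions for this sketch.
def per_queue_payloads(counts_by_queue)
  counts_by_queue.map do |queue, counts|
    { type: "good_job", **counts, custom_labels: { queue: queue } }
  end
end

payloads = per_queue_payloads(counts_by_queue)

# Summing the labelled series recovers the cluster-wide total.
cluster_queued = payloads.sum { |p| p[:queued] }
```

Because every series for a given metric carries the queue label, summing across payloads gives the same number the unlabelled aggregate would have reported.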

Tests / docs

Extended test/server/good_job_collector_test.rb (new gauges + a per-queue label assertion).
Added test/instrumentation/good_job_test.rb with a minimal GoodJob double covering both collect (default) and collect_per_queue.
Full suite green (164 runs, 0 failures); rubocop clean.
README GoodJob table + per-queue docs, and a CHANGELOG entry.

Questions for you

Naming — good_job_oldest_queued_age_seconds vs good_job_queue_latency_seconds; good_job_processes vs good_job_active_processes?
Per-queue as a label opt-in (this PR) vs separate metric names — preference?
good_job_processes is guarded with defined?(::GoodJob::Process); happy to adjust the minimum supported GoodJob version if you support older releases.
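For reference, the guard mentioned in the last question could look roughly like this. The `good_job_process_count` method name and the choice to return nil when the constant is missing are assumptions for this sketch; the PR may handle the fallback differently.

```ruby
# Returns the active process count, or nil when the loaded GoodJob version
# does not define GoodJob::Process (or GoodJob is not loaded at all).
# Returning nil is an assumption of this sketch, not confirmed by the PR.
def good_job_process_count
  return nil unless defined?(::GoodJob::Process)

  ::GoodJob::Process.active.count
end
```

`defined?` returns nil rather than raising when `::GoodJob` itself is absent, so the guard is safe to evaluate before the gem is loaded.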

Happy to iterate on any of it.

…eue metrics

Adds two purely-additive global gauges to the GoodJob instrumentation:
- good_job_oldest_queued_age_seconds (queue latency / backlog age)
- good_job_processes (active GoodJob processes)

Adds an opt-in `GoodJob.start(per_queue: true)` that breaks the job-state gauges and oldest-queued-age down by a `queue` label. It defaults to false, so existing (cluster-wide, unlabelled) output is unchanged; when enabled, a metric carries the queue label consistently so sum() still yields the total.

The server collector already applies custom_labels, so the queue label needs no collector change beyond registering the two new gauges.

xrl wants to merge 1 commit into discourse:main from xrl:goodjob-queue-latency-process-and-per-queue-metrics
