feat: GoodJob oldest-queued-age, process count, and opt-in per-queue metrics#373

Open
xrl wants to merge 1 commit into discourse:main from xrl:goodjob-queue-latency-process-and-per-queue-metrics

xrl commented May 28, 2026

What & why

While building Grafana dashboards for a Rails app's GoodJob queues, the built-in GoodJob instrumentation gave us the job-state totals but was missing the three signals we reached for most:

  • good_job_oldest_queued_age_seconds — how long the oldest ready-to-run job has been waiting. This is the single best backlog/latency signal (a queue can have a small count but be badly behind), and it's what you'd actually alert on.
  • good_job_processes — number of active GoodJob processes (GoodJob::Process.active.count), so you can see whether workers are actually up.
  • per-queue breakdown: which queue is backing up, not just that something is.
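To make the oldest-queued-age signal concrete, here is a minimal sketch of the calculation on plain Ruby objects. `QueuedJob` and `oldest_queued_age_seconds` are hypothetical names used only for illustration; they are not GoodJob's model or this PR's implementation, and a real collector would presumably query the jobs table instead.

```ruby
# Hypothetical stand-in for a job row; not GoodJob's actual model.
QueuedJob = Struct.new(:queue_name, :ready_at, keyword_init: true)

# Age in seconds of the oldest job that is ready to run, or 0.0 when none is.
def oldest_queued_age_seconds(jobs, now: Time.now)
  # Jobs scheduled for the future are not waiting yet, so they are ignored.
  ready_times = jobs.map(&:ready_at).select { |t| t <= now }
  return 0.0 if ready_times.empty?

  (now - ready_times.min).to_f
end

now = Time.at(1_000_000)
jobs = [
  QueuedJob.new(queue_name: "default", ready_at: now - 5),
  QueuedJob.new(queue_name: "mailers", ready_at: now - 900),
  QueuedJob.new(queue_name: "default", ready_at: now + 60) # not ready yet
]

puts oldest_queued_age_seconds(jobs, now: now) # prints 900.0
```

Only three jobs are queued here, yet the oldest has waited 15 minutes, which is the "small count but badly behind" case the gauge is meant to expose.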

Opening this to add them and to get your thoughts on the per-queue approach.

Backwards compatibility

  • The two new gauges are purely additive to the existing collect payload and the server collector's GOOD_JOB_GAUGES.
  • Per-queue is opt-in via GoodJob.start(per_queue: true) (passed through PeriodicStats.start's **kwargs). Default is false, so existing deployments emit the exact same cluster-wide, unlabelled series as before.
  • When per_queue: true, the job-state gauges and oldest-age are emitted once per queue with a queue label rather than as one aggregate series (so sum(good_job_queued) still returns the cluster total), while good_job_processes stays unlabelled. A given metric never mixes labelled and unlabelled series, so nothing double-counts.

The server collector needed no label logic — it already applies custom_labels, so the instrumentation just sends one object per queue with custom_labels: { queue: ... }.
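As a rough illustration of the "one object per queue" shape, the sketch below builds payloads from hard-coded counts. The `COUNTS` data, the `per_queue_payloads` helper, and the exact payload keys other than `custom_labels` are assumptions made for the example, not the code in this PR.

```ruby
# Hypothetical per-queue counts; a real collector would query the jobs table.
COUNTS = {
  "default" => { queued: 12, running: 3 },
  "mailers" => { queued: 4, running: 1 }
}.freeze

# Build one payload per queue, each tagged with a queue label.
# The type value and state keys are assumed for illustration.
def per_queue_payloads(counts)
  counts.map do |queue, states|
    { type: "good_job", custom_labels: { queue: queue } }.merge(states)
  end
end

payloads = per_queue_payloads(COUNTS)

# Summing the labelled series recovers the cluster-wide total.
cluster_queued = payloads.sum { |payload| payload[:queued] }

puts payloads.length # prints 2
puts cluster_queued  # prints 16
```

Because every series for a given metric carries the label, summing across queues gives the same number an unlabelled aggregate would have reported.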

Tests / docs

  • Extended test/server/good_job_collector_test.rb (new gauges + a per-queue label assertion).
  • Added test/instrumentation/good_job_test.rb with a minimal GoodJob double covering both collect (default) and collect_per_queue.
  • Full suite green (164 runs, 0 failures); rubocop clean.
  • README GoodJob table + per-queue docs, and a CHANGELOG entry.

Questions for you

  1. Naming — good_job_oldest_queued_age_seconds vs good_job_queue_latency_seconds; good_job_processes vs good_job_active_processes?
  2. Per-queue as a label opt-in (this PR) vs separate metric names — preference?
  3. good_job_processes is guarded with defined?(::GoodJob::Process); happy to adjust the floor if you support older GoodJob.
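For reference on question 3, a `defined?` guard of this kind might look like the sketch below. The `good_job_process_count` helper and the stub `GoodJob::Process` class are hypothetical and exist only so the example runs without the gem; the real `GoodJob::Process.active` returns a database-backed scope rather than an array.

```ruby
# Returns the active process count, or nil when GoodJob::Process is not loaded.
def good_job_process_count
  return nil unless defined?(::GoodJob::Process)

  ::GoodJob::Process.active.count
end

# Before the constant exists, the guard short-circuits instead of raising.
before = good_job_process_count

# Stub standing in for the gem's class, for demonstration only.
module GoodJob
  class Process
    def self.active
      %i[worker_a worker_b]
    end
  end
end

after = good_job_process_count

puts before.inspect # prints nil
puts after          # prints 2
```

Returning nil rather than 0 when the constant is missing is one possible choice; it lets the caller omit the gauge instead of reporting that no workers are up.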

Happy to iterate on any of it.

…eue metrics

Adds two purely-additive global gauges to the GoodJob instrumentation:
- good_job_oldest_queued_age_seconds (queue latency / backlog age)
- good_job_processes (active GoodJob processes)

Adds an opt-in `GoodJob.start(per_queue: true)` that breaks the job-state
gauges and oldest-queued-age down by a `queue` label. It defaults to false,
so existing (cluster-wide, unlabelled) output is unchanged; when enabled, a
metric carries the queue label consistently so sum() still yields the total.

The server collector already applies custom_labels, so the queue label needs
no collector change beyond registering the two new gauges.