feat(promhttp): add CoalesceGather option to deduplicate concurrent Gather calls#1969
feat(promhttp): add CoalesceGather option to deduplicate concurrent Gather calls#1969kakkoyun wants to merge 6 commits into
Conversation
840d68c to
f9e46a2
Compare
…ather calls When a collector's Collect() is slower than the scrape interval, each incoming HTTP request triggers an independent Gather() that spawns its own goroutine pipeline. With no upper bound, this causes goroutine pile-up proportional to (scrape rate / collection time) — the apparent "goroutine leak" reported in prometheus#1477. Add HandlerOpts.CoalesceGather bool. When true, HandlerForTransactional wraps the underlying TransactionalGatherer in a coalescingGatherer that allows only one Gather to run at a time. Concurrent requests join the in-flight cycle and receive the same result once it completes. Reference counting ensures the TransactionalGatherer done() callback is called exactly once, after the last handler finishes encoding. Design decisions: - Per-cycle gatherCycle object (not fields on the struct) prevents the race where a new cycle overwrites result fields while prior waiters are still reading them. - Mutex-based ref counting (not atomic) ensures c.cycle = nil is cleared before cy.done() is called, ruling out double-done on a stale pointer. - close(cy.ready) happens-before <-cy.ready returns (Go memory model), so cy.mfs/err/done are safely readable without additional locking. - Zero overhead when disabled: single if-branch in HandlerForTransactional setup, identical code path when CoalesceGather is false. Fixes prometheus#1477 Signed-off-by: Kemal Akkoyun <kemal.akkoyun@datadoghq.com>
f9e46a2 to
81c415a
Compare
|
@kakkoyun LGTM- I tested with the roughly same approach as for my hack implementation (ref.: ncabatoff/process-exporter@master...initialed85:process-exporter:kakkoyun/goroutine_leak) Worked as expected- with a fake 1s sleep added to the scrape method in process-exporter:
So I think we're in good shape- lots of timing out of requests, no climbing FDs; works nicely. Awesome work! |
| close(cy.ready) | ||
| } | ||
| }() | ||
| cy.mfs, cy.done, cy.err = c.g.Gather() |
There was a problem hiding this comment.
One question: if c.g.Gather() panics, do the joiners actually see the panic, or do they just unblock and return as if everything was ok?
There was a problem hiding this comment.
This is a bit of defensive coding here. They just unblock and clean up. If there is no recover in the chain, it will be printed as any other panic.
Feel free to challenge and propose better alternative. Would you prefer us to capture the error?
There was a problem hiding this comment.
I made a suggestion above to capture the error. This is because I was thinking that HandlerForTransactional basically returns an http.Handler and if I’m reading this correctly https://github.com/golang/go/blob/e2fa8596cf7016d080c51d772c3012f2533e2219/src/net/http/server.go#L84, then the panics are recovered in any case.
vesari
left a comment
There was a problem hiding this comment.
I left just a small suggestion.
| c.cycle = nil | ||
| } | ||
| c.mu.Unlock() | ||
| close(cy.ready) |
There was a problem hiding this comment.
| close(cy.ready) | |
| cy.err = fmt.Errorf("gather panicked") | |
| close(cy.ready) |
| close(cy.ready) | ||
| } | ||
| }() | ||
| cy.mfs, cy.done, cy.err = c.g.Gather() |
There was a problem hiding this comment.
I made a suggestion above to capture the error. This is because I was thinking that HandlerForTransactional basically returns an http.Handler and if I’m reading this correctly https://github.com/golang/go/blob/e2fa8596cf7016d080c51d772c3012f2533e2219/src/net/http/server.go#L84, then the panics are recovered in any case.
What problem does this solve?
Issue #1477 reports apparent goroutine leaks when using
prometheus/client_golang. The root cause is not a WaitGroup bug —Registry.Gather()logic is correct. The real problem:When a collector's
Collect()is slower than the scrape interval, each incoming HTTP scrape triggers an independentGather(). Each call spawns its own goroutine pipeline. With no upper bound, concurrent pipelines grow proportionally to(scrape rate / collection time)— a goroutine pile-up that looks like a leak but is unbounded concurrent work.This was originally investigated by @initialed85 in this comment, who validated the issue and prototyped a "piggybacking" (singleflight) approach. This PR implements that idea cleanly inside the HTTP handler, with zero new public types and zero overhead when disabled.
Solution
Add
HandlerOpts.CoalesceGather bool. Whentrue,HandlerForTransactionalwraps theTransactionalGathererin an unexportedcoalescingGathererthat allows only oneGather()to run at a time. Concurrent HTTP requests arriving while a gather is in-flight join the existing cycle and receive the same result.Usage
Design decisions
HandlerOptsfield, not a public wrapper typegatherCycleobjectc.cycle = nilis cleared beforecy.done()is called, ruling out double-done on a stale pointerclose(cy.ready)as the visibility barriercy.mfs/err/doneare safely readable after<-cy.readywithout additional lockingfalseChanges
prometheus/promhttp/http.gocoalescingGatherer+gatherCycletypes; addHandlerOpts.CoalesceGather; wire intoHandlerForTransactionalprometheus/promhttp/http_test.goTest plan
go test -race -count=1 ./prometheus/promhttp/...— all pass, no data racesgo vet ./prometheus/...— cleanTestCoalesceGatherGoroutineLeakFreeusesgoleak.VerifyNoneto assert no leaked goroutines under concurrent slow-collector loadTestCoalesceGatherDoneCalledExactlyOnceverifies theTransactionalGatherercontract:done()calls ==Gather()calls regardless of concurrencyCloses #1477