Skip to content

tracing: fix BatchSpanProcessor goroutine leak on config reload#7826

Merged
mholt merged 3 commits into
caddyserver:masterfrom
Dean2026:fix/tracing-bsp-goroutine-leak
Jun 18, 2026
Merged

tracing: fix BatchSpanProcessor goroutine leak on config reload#7826
mholt merged 3 commits into
caddyserver:masterfrom
Dean2026:fix/tracing-bsp-goroutine-leak

Conversation

@Dean2026

@Dean2026 Dean2026 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

tracing: fix BatchSpanProcessor goroutine leak on config reload

Description

When the tracing directive is enabled, each config reload leaks one
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue
goroutine. On a server that reloads frequently (e.g. polling a remote
config source every few seconds) this accumulates into tens of thousands
of leaked goroutines over time.

A goroutine dump shows many identical stacks:

goroutine ... [select]:
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(...)
	.../sdk/trace/batch_span_processor.go:327
go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func2()
	.../sdk/trace/batch_span_processor.go:129
created by go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor
	.../sdk/trace/batch_span_processor.go:127

Root cause

tracing keeps a global, reference-counted TracerProvider so it can be
reused across reloads (tracerProvider.getTracerProvider). Caddy reloads
provision the new config before cleaning up the old one, so the counter
never drops to 0 and the provider is correctly reused — Shutdown is
never called, by design.

The problem is on the caller side in newOpenTelemetryWrapper:

traceExporter, err := autoexport.NewSpanExporter(ctx)
...
tracerProvider := globalTracerProvider.getTracerProvider(
    sdktrace.WithBatcher(traceExporter),   // evaluated on every reload
    sdktrace.WithResource(res),
)

sdktrace.WithBatcher(e) is WithSpanProcessor(NewBatchSpanProcessor(e)),
and NewBatchSpanProcessor starts its processQueue goroutine eagerly at
construction time
— not when the option is applied. The option is built
on every Provision (every reload), but getTracerProvider only applies
it when it actually creates a new provider (t.tracerProvider == nil). On
the reuse path the option is silently discarded, so the just-started
BatchSpanProcessor goroutine is orphaned: it is never registered with any
provider and therefore never shut down. Result: one leaked goroutine (plus
a leaked exporter) per reload.

Fix

Defer construction of the exporter/batcher until a new provider is actually
needed. getTracerProvider now takes a buildOpts factory that is invoked
only on the create path, so nothing with a side effect is built on the
reuse path.

The reference counter is now incremented only after the provider is
successfully obtained, preserving the previous semantics where a failed
exporter creation did not affect the counter.

Testing

  • Test_tracersProvider_buildOptsOnlyOnCreate — asserts buildOpts runs
    exactly once across one create + five reuses (the regression guard).
  • Test_tracersProvider_buildOptsError — asserts that on a build error the
    provider stays nil and the counter is not incremented.
  • Existing tracing tests updated for the new signature and still pass.

Verified manually with a reload loop: before the fix, 50 reloads leaked 50
processQueue goroutines; after the fix, 0 are leaked while the provider is
still reused (counter stays at 1).


🤖 AI Disclosure: Claude Opus 4.8 was used to generate the patch, and verified at the staging deployment.

When the `tracing` directive is enabled, each config reload leaks one
`go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue`
goroutine. On a server that reloads frequently (e.g. polling a remote
config source every few seconds) this accumulates into tens of thousands
of leaked goroutines over time.

A goroutine dump shows many identical stacks:

```
goroutine ... [select]:
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(...)
	.../sdk/trace/batch_span_processor.go:327
go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func2()
	.../sdk/trace/batch_span_processor.go:129
created by go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor
	.../sdk/trace/batch_span_processor.go:127
```

`tracing` keeps a global, reference-counted `TracerProvider` so it can be
reused across reloads (`tracerProvider.getTracerProvider`). Caddy reloads
provision the new config *before* cleaning up the old one, so the counter
never drops to 0 and the provider is correctly reused — `Shutdown` is
never called, by design.

The problem is on the caller side in `newOpenTelemetryWrapper`:

```go
traceExporter, err := autoexport.NewSpanExporter(ctx)
...
tracerProvider := globalTracerProvider.getTracerProvider(
    sdktrace.WithBatcher(traceExporter),   // evaluated on every reload
    sdktrace.WithResource(res),
)
```

`sdktrace.WithBatcher(e)` is `WithSpanProcessor(NewBatchSpanProcessor(e))`,
and `NewBatchSpanProcessor` **starts its `processQueue` goroutine eagerly at
construction time** — not when the option is applied. The option is built
on every `Provision` (every reload), but `getTracerProvider` only applies
it when it actually creates a new provider (`t.tracerProvider == nil`). On
the reuse path the option is silently discarded, so the just-started
BatchSpanProcessor goroutine is orphaned: it is never registered with any
provider and therefore never shut down. Result: one leaked goroutine (plus
a leaked exporter) per reload.

Defer construction of the exporter/batcher until a new provider is actually
needed. `getTracerProvider` now takes a `buildOpts` factory that is invoked
only on the create path, so nothing with a side effect is built on the
reuse path.

The reference counter is now incremented only after the provider is
successfully obtained, preserving the previous semantics where a failed
exporter creation did not affect the counter.

- `Test_tracersProvider_buildOptsOnlyOnCreate` — asserts `buildOpts` runs
  exactly once across one create + five reuses (the regression guard).
- `Test_tracersProvider_buildOptsError` — asserts that on a build error the
  provider stays nil and the counter is not incremented.
- Existing tracing tests updated for the new signature and still pass.

Verified manually with a reload loop: before the fix, 50 reloads leaked 50
`processQueue` goroutines; after the fix, 0 are leaked while the provider is
still reused (counter stays at 1).
@CLAassistant

CLAassistant commented Jun 17, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@mholt

mholt commented Jun 17, 2026

Copy link
Copy Markdown
Member

Closing due to missing AI disclosure (PR template deleted/ignored). We can reopen it if this is remedied.

@mholt mholt closed this Jun 17, 2026
@Dean2026

Copy link
Copy Markdown
Contributor Author

Closing due to missing AI disclosure (PR template deleted/ignored). We can reopen it if this is remedied.

I have updated the AI disclosure section, please review it, thanks.

@steadytao

Copy link
Copy Markdown
Member

Code direction seems right from a glance, just deferring exporter/batcher construction until the provider is actually created. One small thing before I reopen is that the PR description mentions Test_tracersProvider_buildOptsError but I only see Test_tracersProvider_buildOptsOnlyOnCreate in the branch. Could you add that build test please?

@Dean2026

Dean2026 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review! I've added Test_tracersProvider_buildOptsError in commit 5754aca.

Ready for you to reopen whenever you'd like. Thanks!

@steadytao

Copy link
Copy Markdown
Member

Please push that commit to your PR branch...

@Dean2026

Copy link
Copy Markdown
Contributor Author

The commit is already on the branch — you can see it here: https://github.com/Dean2026/caddy/commits/fix/tracing-bsp-goroutine-leak/ .

@steadytao steadytao reopened this Jun 18, 2026
@steadytao

Copy link
Copy Markdown
Member

Ah PRs dont pick up commits while closed. The more you know 😁

@steadytao steadytao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The fix defers exporter/batcher construction until a provider is actually created which avoids leaking the unused BatchSpanProcessor on reload. Tests cover the reuse and error paths.

@steadytao steadytao added this to the v2.11.5 milestone Jun 18, 2026
@steadytao steadytao added the bug 🐞 Something isn't working label Jun 18, 2026

@mholt mholt left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alrighty, thank you!

@mholt mholt merged commit 69d6ace into caddyserver:master Jun 18, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug 🐞 Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants