Skip to content

fix(training-agent): create fresh MCP Server per request to eliminate transport-binding race#4089

Draft
bokelley wants to merge 1 commit intomainfrom
claude/issue-4084-tenant-server-per-request-factory
Draft

fix(training-agent): create fresh MCP Server per request to eliminate transport-binding race#4089
bokelley wants to merge 1 commit intomainfrom
claude/issue-4084-tenant-server-per-request-factory

Conversation

@bokelley
Copy link
Copy Markdown
Contributor

@bokelley bokelley commented May 4, 2026

Closes #4084

The tenant router reused a singleton MCP Server per tenant (held by TenantRegistry) and called server.connect(transport) on every HTTP POST. Under concurrent load the MCP SDK rejected re-connection with "Already connected to a transport", producing -32603 Internal server error on ~1/3 of cold-start requests. The compliance heartbeat and storyboard runners targeting the training agent both hit this under back-to-back load.

Fix: Add createServer(tenantId) factory to RegistryHolder in registry.ts. The factory stores a closure per tenant (keyed by cfg.tenantId — same key TenantRegistry uses) that calls createAdcpServerFromPlatform(opts) fresh per invocation. Each HTTP POST in tenantMcpHandler now calls holder.createServer(resolved.tenantId) to get its own Protocol instance, then runs the connect → handle → close lifecycle on that instance. The registry singleton is retained only for resolveByRequest (tenant existence and health checks).

This matches the pattern already used in standalone.ts and the legacy/strict MCP handlers in index.ts, and matches the MCP SDK's stated requirement for stateless mode (sessionIdGenerator: undefined): one Server per connection.

Non-breaking justification: purely internal server lifecycle change — no change to the HTTP interface, JSON-RPC wire format, tenant registration, or any public API surface. Existing clients are unaffected.

Pre-PR review:

  • code-reviewer: approved — closure correctness verified, factory-key fix applied (cfg.tenantId not id), ordering guarantee tight
  • dx-expert: approved — race correctly fixed, resolved.server no longer used for transport ops, comment adequate

Triage-managed PR. This bot does not currently iterate on
review comments or PR conversation threads (only on the source
issue). To unblock:

  • Push fixup commits directly: gh pr checkout <num>
    fix → push.
  • Or re-trigger: comment /triage execute on the source
    issue.

See #3121
for context.

Session: https://claude.ai/code/session_011KzNaC6rJnpxvUe4QCMUkd


Generated by Claude Code

… transport-binding race

Fixes #4084. The tenant router reused a singleton MCP Server per tenant and
called server.connect(transport) on every HTTP POST. Under concurrent load
the SDK rejected re-connection with "Already connected to a transport",
returning -32603 to ~1/3 of cold-start requests.

Fix: add createServer(tenantId) factory to RegistryHolder that calls
createAdcpServerFromPlatform(opts) fresh per invocation. Each HTTP POST
now gets its own Protocol instance for connect→handle→close, matching the
SDK's stateless-mode contract (sessionIdGenerator: undefined).

https://claude.ai/code/session_011KzNaC6rJnpxvUe4QCMUkd
@bokelley bokelley added the claude-triaged Issue has been triaged by the Claude Code triage routine. Remove to re-triage. label May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-triaged Issue has been triaged by the Claude Code triage routine. Remove to re-triage.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

training-agent: tenant router races on transport binding under back-to-back HTTP requests

2 participants