RFC: Horizontal Scaling for vMCP and Proxy Runner #47
4987889 to b9e00fd
Introduces THV-XXXX covering background, problems, scope, high-level solution, and requirements for enabling safe horizontal scale-out of the vmcp and thv-proxyrunner components via externalized Redis session storage and session-aware routing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
b9e00fd to dab8883
- Fix Mermaid \n → <br/> in both diagrams
- Update metadata layer description to include session IDs
- Strengthen re-initialization language ("destructive" not "may not be safe")
- Add current proxyrunner state context to §2.2
- Fix stdio scaling description: about concurrency, not exclusivity
- Add fungibility constraint note to §1.4 and §5.3 R-OP-1
- Fix §3.1: single MCPServer backed by multiple proxyrunner replicas
- Add vMCP scale-in to §3.1 in-scope
- Update §3.2: proxyrunner scale-in only; proxyrunner:StatefulSet N:1 ratio
- Add §3.3 Scaling Summary table
- Update §4.1 diagram to show one:many proxyrunner→backend pods
- Update vMCP session record to backends[] array with per-backend URLs/session IDs
- Simplify proxyrunner session record to session→backend-pod mapping
- Update §4.3 routing to reflect multi-backend session model
- Add §4.6 proxyrunner value proposition note
- Remove redundant R-PR-7
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
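The two session records described in the list above can be pictured roughly as follows. This is a minimal sketch, not the RFC's actual schema: the type names, field names, JSON keys, and the `BackendFor` and `roundTrip` helpers are all hypothetical, and JSON is only assumed as the encoding for the value stored in Redis.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BackendSession is a hypothetical per-backend entry: the URL the vMCP
// routes to and the session ID that backend issued.
type BackendSession struct {
	Name      string `json:"name"`
	URL       string `json:"url"`
	SessionID string `json:"session_id"`
}

// VMCPSessionRecord is a hypothetical vMCP session record carrying a
// backends[] array, one entry per backend behind the virtual server.
type VMCPSessionRecord struct {
	SessionID string           `json:"session_id"`
	Backends  []BackendSession `json:"backends"`
}

// ProxyRunnerSessionRecord is the simpler hypothetical proxyrunner record:
// a plain session-to-backend-pod mapping.
type ProxyRunnerSessionRecord struct {
	SessionID  string `json:"session_id"`
	BackendPod string `json:"backend_pod"`
}

// BackendFor returns the entry for the named backend, if the session has one.
func (r VMCPSessionRecord) BackendFor(name string) (BackendSession, bool) {
	for _, b := range r.Backends {
		if b.Name == name {
			return b, true
		}
	}
	return BackendSession{}, false
}

// roundTrip encodes and decodes a record the way a shared store would
// hand it from one replica to another.
func roundTrip(r VMCPSessionRecord) (VMCPSessionRecord, error) {
	raw, err := json.Marshal(r)
	if err != nil {
		return VMCPSessionRecord{}, err
	}
	var out VMCPSessionRecord
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	rec := VMCPSessionRecord{
		SessionID: "client-session-1",
		Backends: []BackendSession{
			{Name: "github", URL: "http://github-backend:8080", SessionID: "backend-session-a"},
		},
	}
	out, err := roundTrip(rec)
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	if b, ok := out.BackendFor("github"); ok {
		fmt.Println(b.URL, b.SessionID)
	}
}
```

Because the record is self-describing, any vMCP replica that reads it can reach each backend with that backend's own session ID, which is what makes the replicas interchangeable for an established session.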
- §1.1 diagram: use subgraphs to show logical MCPServer boundary (one MCPServer = one proxyrunner Deployment + its StatefulSet)
- §1.4: replace vague "This constraint" with specific statement that a stdio backend couples itself to a specific proxyrunner process
- §2.2: correct current-state description: controller already supports multiple proxyrunner replicas for sse/streamable-http transports; the problem is lack of session-aware routing, not lack of replica support
- §3.2: correct proxyrunner:StatefulSet ratio: each replica manages its own StatefulSet (1:1), not a shared StatefulSet (N:1)
- §3.3: update Scaling Summary table to reflect 1:1 replica:StatefulSet
- §4.1: update architecture diagram to show per-replica StatefulSets
- §4.2: proxyrunner session record now includes identity subject for session hijacking prevention (per session-scoped work THV-0038)
- §5.5: add Security Requirements (R-SEC-1, R-SEC-2) for session hijacking prevention
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All replicas of a proxyrunner Deployment share a single StatefulSet: they converge on the same desired state via Kubernetes server-side apply (field manager: toolhive-container-manager), with no leader election. The previous edit assumed a 1:1 replica:StatefulSet ratio, which is incorrect.
Updated sections:
- §1.1: add explanation of shared StatefulSet and server-side apply mechanics; note stdio replica cap vs sse/streamable-http
- §2.2: correct current-state description: replicas share one StatefulSet; the problem is missing session-to-pod routing
- §3.2: correct ratio back to N:1 (N replicas, 1 StatefulSet)
- §3.3: update Scaling Summary table accordingly
- §4.1: revert architecture diagram to single shared StatefulSet subgraph with multiple pods
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Neither MCPServer nor VirtualMCPServer CRDs have a replicas field; both Deployments and the StatefulSet are hardcoded to 1. Add this as a core deliverable: spec.replicas (proxyrunner/vMCP pod count) and spec.backendReplicas (StatefulSet pod count) for declarative scaling. Explicitly document the one-StatefulSet-per-MCPServer invariant.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
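As a rough illustration of the two fields named above, the following sketch shows how optional replica counts might sit on a spec type and fall back to today's hardcoded value. The `ScalingSpec` type and its methods are hypothetical; only the field names `replicas` and `backendReplicas` come from the commit message, and defaulting unset fields to 1 is an assumption chosen to preserve current behavior.

```go
package main

import "fmt"

// ScalingSpec is a hypothetical subset of an MCPServer spec. Pointer fields
// distinguish "unset" from an explicit value, a common pattern for optional
// CRD fields.
type ScalingSpec struct {
	// Replicas is the proxyrunner/vMCP pod count.
	Replicas *int32 `json:"replicas,omitempty"`
	// BackendReplicas is the pod count of the single backend StatefulSet.
	BackendReplicas *int32 `json:"backendReplicas,omitempty"`
}

// orDefault returns the value behind v, or 1 when the field is unset
// (assumption: unset keeps the current hardcoded behavior).
func orDefault(v *int32) int32 {
	if v == nil {
		return 1
	}
	return *v
}

// EffectiveReplicas is the proxyrunner/vMCP pod count to reconcile toward.
func (s ScalingSpec) EffectiveReplicas() int32 { return orDefault(s.Replicas) }

// EffectiveBackendReplicas is the StatefulSet pod count to reconcile toward.
func (s ScalingSpec) EffectiveBackendReplicas() int32 { return orDefault(s.BackendReplicas) }

func main() {
	three := int32(3)
	spec := ScalingSpec{Replicas: &three} // backendReplicas left unset
	fmt.Println(spec.EffectiveReplicas(), spec.EffectiveBackendReplicas())
}
```

Keeping the two counts separate reflects the invariant in the commit message: scaling the proxyrunner Deployment changes how many replicas share the one StatefulSet, not how many StatefulSets exist.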
- §1.1 diagram: remove replica count labels from nodes
- §3.1: add proxyrunner scale-in (non-stdio) to in scope
- §3.2: note 1:1 StatefulSet ratio as future stdio scaling path
- §3.2: clarify inter-proxyrunner routing is best-effort
- §3.2: replace proxyrunner scale-in out-of-scope bullet with graceful drain and backend StatefulSet scale-in bullets
- §3.3: update table to reflect proxyrunner scale-in is in scope
- §4.1: simplify diagram (no individual pod nodes)
- §5.1: remove R-VMCP-6 (vMCP pod DNS exposure)
- §5.4: fix R-DEP-4 to focus on backend scale-in as disruptive
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ChrisJBurns
left a comment
Looks OK. A couple of comments, and maybe some that are blockers. I can relay the HMAC and subject-only comments to others with the expertise to say whether they are blocking or non-issues. I'm more interested in how the ProxyRunner scales at all without vMCP, or whether we even want to make the more controversial decision to mandate vMCP for use at scale?
Also, should we define what observability looks like for this? Redis ops, routing decisions/cross-pod proxy success, distributed trace propagation across pod boundaries, etc.? Or is that a later thing? (This is fine too.)
Catalogs 16 concrete code changes needed to implement horizontal scaling for vMCP and proxyrunner, organized by component: CRD/operator changes (RC-1 through RC-5), transport session layer (RC-6, RC-7), vMCP session management (RC-8 through RC-10, RC-16), proxyrunner routing (RC-11 through RC-13), operational concerns (RC-14), and security (RC-15). Each change is mapped to requirements from §5 and documents the current state of the code.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… In Review Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eleftherias
left a comment
This makes sense to me, I have no further comments.
Note: I don't have the context to review section 6 (Required Changes), but the rest looks good.
yrobla
left a comment
Left some non-blocking comments, but approving.
THV-0047: Manual Horizontal Scaling for vMCP and Proxy Runner
This RFC defines an approach to enable safe horizontal scale-out of vmcp and thv-proxyrunner by externalizing session state to a shared Redis store and implementing session-aware routing at each layer.

Key Sections
RestoreSession, LRU eviction, backend expiry sync

Related