RFC: Horizontal Scaling for vMCP and Proxy Runner by jerm-dro · Pull Request #47 · stacklok/toolhive-rfcs

jerm-dro · 2026-03-04T22:46:12Z

THV-0047: Manual Horizontal Scaling for vMCP and Proxy Runner

This RFC defines an approach to enable safe horizontal scale-out of vmcp and thv-proxyrunner by externalizing session state to a shared Redis store and implementing session-aware routing at each layer.

Key Sections

- Background (§1): Deployment architecture, session management infrastructure (transport sessions, vMCP MultiSession, auth server Redis pattern), current client-IP affinity workaround, and the inherent constraint of stateful backends.
- Problems (§2): Session-to-pod affinity issues at both vMCP and proxyrunner layers, and current scaling limitations (hardcoded replica counts, no CRD fields).
- Scope (§3): CRD replica fields, Redis session storage, horizontal scale-out/in for vMCP and proxyrunner (SSE + streamable-http), graceful shutdown. Out of scope: stdio scaling, smart routing at vMCP, auto-scaling policy, backend StatefulSet scale-in.
- High-Level Solution (§4): Redis session records (vMCP and proxyrunner), request routing logic at each layer, Redis configuration via CRDs.
- Requirements (§5): Success criteria organized by component (vMCP, proxyrunner, operator, deployment, security).
- Required Changes (§6): 16 concrete code changes mapped to requirements, organized by component:
  - CRD/Operator (RC-1 to RC-5): Replica fields, session storage config, reconciler updates, Redis injection
  - Transport Session Layer (RC-6, RC-7): Redis Storage backend, storage selection wiring
  - vMCP (RC-8 to RC-10, RC-16): Metadata persistence, session reconstruction via RestoreSession, LRU eviction, backend expiry sync
  - Proxyrunner (RC-11 to RC-13): Configurable StatefulSet replicas, session-aware pod routing, LRU eviction
  - Operational (RC-14): Graceful shutdown (SIGTERM handling, terminationGracePeriodSeconds)
  - Security (RC-15): Cross-pod hijack prevention via persisted token binding
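
To make the session model summarized above more concrete, here is a minimal Python sketch of session-aware routing backed by a shared store, with a token binding check for cross-pod hijack prevention. It is not taken from the RFC. A plain dict stands in for Redis, and the names `SessionRecord`, `SessionStore`, `route_request`, the pod names, and the choice of a SHA-256 token hash as the binding are all assumptions made for illustration.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SessionRecord:
    """Hypothetical session record visible to every replica."""
    session_id: str
    backend_pod: str  # pod that holds the live backend connection
    token_hash: str   # binding to the token that created the session


def hash_token(token: str) -> str:
    """Derive the persisted binding (assumption: plain SHA-256 hex digest)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


class SessionStore:
    """In-memory stand-in for the shared Redis store (assumption)."""

    def __init__(self) -> None:
        self._records: Dict[str, SessionRecord] = {}

    def put(self, record: SessionRecord) -> None:
        self._records[record.session_id] = record

    def get(self, session_id: str) -> Optional[SessionRecord]:
        return self._records.get(session_id)


def route_request(store: SessionStore, session_id: str, token: str) -> str:
    """Return the backend pod for a session, whichever replica got the request."""
    record = store.get(session_id)
    if record is None:
        raise KeyError(f"unknown session: {session_id}")
    # Constant-time comparison of the presented token against the stored binding.
    if not hmac.compare_digest(record.token_hash, hash_token(token)):
        raise PermissionError("token does not match the session binding")
    return record.backend_pod


if __name__ == "__main__":
    store = SessionStore()
    store.put(SessionRecord("s1", "backend-0", hash_token("token-a")))
    print(route_request(store, "s1", "token-a"))
```

Because the record lives in the shared store rather than in one replica's memory, any replica can resolve the same session to the same backend pod, and a request presenting a different token is rejected no matter which replica receives it.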

Introduces THV-XXXX covering background, problems, scope, high-level solution, and requirements for enabling safe horizontal scale-out of the vmcp and thv-proxyrunner components via externalized Redis session storage and session-aware routing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rfcs/THV-XXXX-vmcp-proxyrunner-horizontal-scaling.md

- Fix Mermaid \n → <br/> in both diagrams
- Update metadata layer description to include session IDs
- Strengthen re-initialization language ("destructive" not "may not be safe")
- Add current proxyrunner state context to §2.2
- Fix stdio scaling description: about concurrency, not exclusivity
- Add fungibility constraint note to §1.4 and §5.3 R-OP-1
- Fix §3.1: single MCPServer backed by multiple proxyrunner replicas
- Add vMCP scale-in to §3.1 in-scope
- Update §3.2: proxyrunner scale-in only; proxyrunner:StatefulSet N:1 ratio
- Add §3.3 Scaling Summary table
- Update §4.1 diagram to show one:many proxyrunner→backend pods
- Update vMCP session record to backends[] array with per-backend URLs/session IDs
- Simplify proxyrunner session record to session→backend-pod mapping
- Update §4.3 routing to reflect multi-backend session model
- Add §4.6 proxyrunner value proposition note
- Remove redundant R-PR-7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- §1.1 diagram: use subgraphs to show logical MCPServer boundary (one MCPServer = one proxyrunner Deployment + its StatefulSet)
- §1.4: replace vague "This constraint" with specific statement that a stdio backend couples itself to a specific proxyrunner process
- §2.2: correct current-state description: the controller already supports multiple proxyrunner replicas for sse/streamable-http transports; the problem is lack of session-aware routing, not lack of replica support
- §3.2: correct proxyrunner:StatefulSet ratio: each replica manages its own StatefulSet (1:1), not a shared StatefulSet (N:1)
- §3.3: update Scaling Summary table to reflect 1:1 replica:StatefulSet
- §4.1: update architecture diagram to show per-replica StatefulSets
- §4.2: proxyrunner session record now includes identity subject for session hijacking prevention (per session-scoped work THV-0038)
- §5.5: add Security Requirements (R-SEC-1, R-SEC-2) for session hijacking prevention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

All replicas of a proxyrunner Deployment share a single StatefulSet: they converge on the same desired state via Kubernetes server-side apply (field manager: toolhive-container-manager), with no leader election. The previous edit assumed a 1:1 replica:StatefulSet ratio, which is incorrect.

Updated sections:
- §1.1: add explanation of shared StatefulSet and server-side apply mechanics; note stdio replica cap vs sse/streamable-http
- §2.2: correct current-state description: replicas share one StatefulSet; the problem is missing session-to-pod routing
- §3.2: correct ratio back to N:1 (N replicas, 1 StatefulSet)
- §3.3: update Scaling Summary table accordingly
- §4.1: revert architecture diagram to single shared StatefulSet subgraph with multiple pods

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Neither MCPServer nor VirtualMCPServer CRDs have a replicas field; both Deployments and the StatefulSet are hardcoded to 1. Add this as a core deliverable: spec.replicas (proxyrunner/vMCP pod count) and spec.backendReplicas (StatefulSet pod count) for declarative scaling. Explicitly document the one-StatefulSet-per-MCPServer invariant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
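
As a rough illustration of the two fields named in this commit message, the Python sketch below reads `spec.replicas` and `spec.backendReplicas` from a resource represented as a dict. The default of 1 for an unset field mirrors the currently hardcoded value mentioned above, but it is an assumption about the final design, as are the helper name `resolve_replicas` and the validation rule.

```python
from typing import Any, Dict, Tuple

DEFAULT_REPLICAS = 1  # assumption: mirrors the currently hardcoded value


def resolve_replicas(resource: Dict[str, Any]) -> Tuple[int, int]:
    """Return (pod replicas, backend StatefulSet replicas) for a resource dict.

    Hypothetical helper: the field names come from the commit message, while
    the defaulting and validation shown here are illustrative only.
    """
    spec = resource.get("spec") or {}
    replicas = spec.get("replicas", DEFAULT_REPLICAS)
    backend_replicas = spec.get("backendReplicas", DEFAULT_REPLICAS)
    for name, value in (
        ("spec.replicas", replicas),
        ("spec.backendReplicas", backend_replicas),
    ):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    return replicas, backend_replicas


if __name__ == "__main__":
    server = {"spec": {"replicas": 3, "backendReplicas": 2}}
    print(resolve_replicas(server))
```

Keeping the two counts as separate fields matches the distinction the commit message draws between the proxyrunner/vMCP pod count and the StatefulSet pod count.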

rfcs/THV-0047-vmcp-proxyrunner-horizontal-scaling.md

- §1.1 diagram: remove replica count labels from nodes
- §3.1: add proxyrunner scale-in (non-stdio) to in scope
- §3.2: note 1:1 StatefulSet ratio as future stdio scaling path
- §3.2: clarify inter-proxyrunner routing is best-effort
- §3.2: replace proxyrunner scale-in out-of-scope bullet with graceful drain and backend StatefulSet scale-in bullets
- §3.3: update table to reflect proxyrunner scale-in is in scope
- §4.1: simplify diagram (no individual pod nodes)
- §5.1: remove R-VMCP-6 (vMCP pod DNS exposure)
- §5.4: fix R-DEP-4 to focus on backend scale-in as disruptive

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ChrisJBurns

Looks OK. I left a couple of comments, some of which may be blockers. I can relay the HMAC and subject-only comments to others with more expertise to find out whether they are blocking or non-issues. I'm more interested in how the ProxyRunner scales at all without vMCP, or whether we even want to make the more controversial decision to mandate vMCP for use at scale.

Also, should we define what observability looks like for this (Redis ops, routing decisions and cross-pod proxy success, distributed trace propagation across pod boundaries, etc.)? Or is that a later thing? (That is fine too.)

Catalogs 16 concrete code changes needed to implement horizontal scaling for vMCP and proxyrunner, organized by component: CRD/operator changes (RC-1 through RC-5), transport session layer (RC-6, RC-7), vMCP session management (RC-8 through RC-10, RC-16), proxyrunner routing (RC-11 through RC-13), operational concerns (RC-14), and security (RC-15). Each change is mapped to requirements from §5 and documents the current state of the code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

eleftherias

This makes sense to me, I have no further comments.
Note: I don't have the context to review §6 (Required Changes), but the rest looks good.

yrobla

Left some non-blocking comments, but approving.

JAORMX

excellent initiative

jerm-dro force-pushed the jerm/2026-03-04-session-storage branch from 4987889 to b9e00fd on March 4, 2026 22:49

jerm-dro force-pushed the jerm/2026-03-04-session-storage branch from b9e00fd to dab8883 on March 4, 2026 23:05

jerm-dro commented Mar 4, 2026

jerm-dro commented Mar 5, 2026

jerm-dro and others added 3 commits March 4, 2026 18:26

jerm-dro commented Mar 5, 2026

jerm-dro and others added 2 commits March 4, 2026 19:23

tweaks

fd9cf66

eleftherias reviewed Mar 5, 2026

tgrunnagle reviewed Mar 5, 2026

tgrunnagle reviewed Mar 5, 2026

tgrunnagle reviewed Mar 5, 2026

tgrunnagle reviewed Mar 5, 2026

ChrisJBurns reviewed Mar 6, 2026

jerm-dro and others added 3 commits March 7, 2026 11:19

address comments

7344cba

self-review

e6abde1

jerm-dro marked this pull request as ready for review March 7, 2026 20:24

Rename RFC to match PR number (THV-0051 → THV-0047) and set status to In Review
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

9a319c1

jerm-dro requested review from ChrisJBurns, JAORMX, amirejaz, eleftherias, tgrunnagle and yrobla March 7, 2026 20:26

eleftherias reviewed Mar 10, 2026

reyortiz3 reviewed Mar 11, 2026

JAORMX reviewed Mar 12, 2026

yrobla reviewed Mar 12, 2026

yrobla reviewed Mar 12, 2026

yrobla reviewed Mar 16, 2026

yrobla reviewed Mar 16, 2026

yrobla reviewed Mar 16, 2026

yrobla reviewed Mar 16, 2026

yrobla approved these changes Mar 16, 2026

JAORMX approved these changes Mar 17, 2026

jerm-dro merged commit 9ddce6a into main Mar 17, 2026
1 check passed

7 participants
