
Starlark scripted tools for vMCP#51

Open
jerm-dro wants to merge 7 commits into main from jerm/2026-03-06-jerm-starlark

Conversation

@jerm-dro
Contributor

@jerm-dro jerm-dro commented Mar 7, 2026

Summary

Proposes replacing vMCP's declarative composite tools system (DAG + Go templates) with a Starlark scripting engine for multi-step tool workflows.

The current composite tools system hits hard limits: no iteration over results, no dynamic branching, and awkward Go template data flow. Starlark provides iteration, conditional branching, dynamic tool dispatch, and data transformation while maintaining sandboxed execution with no arbitrary I/O.

Key design decisions:

  • Two-builtin error handling: call_tool() halts on error (common case), try_call_tool() returns error info (opt-in handling) — works around Starlark's lack of try/except
  • Parallel system: Runs alongside existing composite tools during migration, then replaces them
  • Small builtin API: call_tool, try_call_tool, retry, elicit, parallel, log
  • Code reuse via load(): Shared helper libraries in .star files
  • V1 and V2 session compatible: Engine sits behind the handler factory
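For illustration, the two-builtin split might read like this in a script — a sketch only, with `call_tool`/`try_call_tool` stubbed in plain Python (the real builtins live in the Go engine), and `get_status`/`flaky_tool` as made-up tool names:

```python
def call_tool(name, args):
    # Stub of the halting builtin: the real one dispatches to a
    # backend MCP server and aborts the script on any tool error.
    if name == "flaky_tool":
        raise RuntimeError("tool failed")
    return {"ok": True, "tool": name}

def try_call_tool(name, args):
    # Stub of the opt-in variant: returns error info instead of halting.
    try:
        return {"result": call_tool(name, args), "error": None}
    except RuntimeError as e:
        return {"result": None, "error": str(e)}

# Common case: any failure aborts the whole workflow.
status = call_tool("get_status", {})

# Opt-in handling: inspect the error and fall back.
attempt = try_call_tool("flaky_tool", {})
fallback = attempt["result"] if attempt["error"] is None else {"ok": False}
```

This mirrors the try/except workaround: scripts that never touch `try_call_tool` get fail-fast behavior for free.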

Why Starlark

  • BSD 3-Clause license (compatible with Apache 2.0)
  • Purpose-built for embedded use in Go (Bazel, Buck2, Tilt, Cirrus CI)
  • Sandboxed by design: no I/O, no network, no OS access
  • Python-like syntax lowers learning curve
  • Alternatives evaluated: Risor (security risk from Go stdlib access), Tengo (unfamiliar syntax), Goja/JS (large attack surface), Wasm (overkill)

🤖 Generated with Claude Code

jerm-dro and others added 2 commits March 6, 2026 17:38
Proposes replacing vMCP's declarative composite tools system (DAG + Go
templates) with a Starlark scripting engine for multi-step tool
workflows. Starlark provides iteration, conditional branching, dynamic
dispatch, and data transformation while maintaining sandboxed execution
with no arbitrary I/O.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jerm-dro and others added 5 commits March 6, 2026 18:58
- Remove V2 session incompatibility bullet (temporary concern)
- Remove V1/V2 session goal (not relevant to RFC scope)
- Add parallel execution to Goals section
- Remove migration tooling phase (unnecessary)
- Remove fuzz tests from testing strategy (unnecessary)
- Simplify migration path (3 phases instead of 4)
- Add docs-website to documentation requirements
- Resolve naming question: scripted and composite are interchangeable
- Remove json.encode/decode open question

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Starlark's sandbox makes it feasible for agents to dynamically
compose and submit scripts to vMCP at runtime — something a
declarative YAML DSL could never support safely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Six practical examples covering structured data manipulation,
returning structured data, JSON-as-string parsing from legacy
servers, fan-out with parallel, error handling patterns, and
elicitation. Includes a callout explaining when to use dict
indexing (tool results) vs attribute access (builtin return values).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kantord
Member

kantord commented Mar 9, 2026

This RFC is incredible! Seems like I'm the first human reviewer and considering that, honestly I'm impressed by how thorough it already is!

I do have some thoughts but honestly, none of them are blockers, I just wanted to give more than a "Looks good to me". I did have to scrape the barrel to think of some things that might be actionable:

  • Perhaps for error handling, it would make sense to distinguish between runtime errors vs. input validation errors vs. output validation errors?
  • Streaming/large data: this may be mostly theoretical based on how MCP is used today, but if workflows start dispatching large fan-outs (e.g., calling a github tool across thousands of repos), the lack of streaming primitives in Starlark could become a limitation, especially when combined with strict memory limitations
  • Human-in-the-loop using elicit() is already covered, but I'm wondering if we also need to think about MCP apps.
  • About the http built-in: I would actually try to rely on MCP tool calls for things like this. This way we could load all of the complexities of access control and whatnot into a single system (MCP).
  • Reusing Starlark code across different scripts could open a can of worms, such as one change generating a cascading failure. Perhaps this is another thing we can push onto MCP, just use MCP tool calls to reuse one vMCP script in another. At least in that case we could have things like schema-based contracts (you get an error if you add a breaking change in an output schema that would generate a domino-effect.) In any case, the feature of load() might be premature without knowing exactly how vMCP would be used in real life.
  • Automated testing for these scripts might be challenging/nearly impossible. A simple example-based test runner would be trivial to add, but mocking MCP calls would be a problem. Perhaps there is a good MCP testing system that we could integrate into Toolhive that would solve these problems?

@jerm-dro
Contributor Author

Thanks for the thoughtful review @kantord 😄

  • Streaming/large data: this may be mostly theoretical based on how MCP is used today, but if workflows start dispatching large fan-outs (e.g., calling a github tool across thousands of repos), the lack of streaming primitives in Starlark could become a limitation, especially when combined with strict memory limitations

This is an important thing to call out. I think it can be solved by building an implementation of Iterable. Something within vMCP would recognize "hey, this is a huge array / dict / paginated response" and turn it into an Iterable.
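A rough sketch of that idea in plain Python — the real version would implement starlark-go's `Iterable` interface on the Go side; `PagedResult` and `fetch_page` are hypothetical names:

```python
class PagedResult:
    """Hypothetical lazy wrapper: yields items page by page instead of
    materializing a huge tool result in script memory."""

    def __init__(self, fetch_page):
        # fetch_page(cursor) -> (items, next_cursor); None cursor = done.
        self.fetch_page = fetch_page

    def __iter__(self):
        cursor = None
        while True:
            items, cursor = self.fetch_page(cursor)
            for item in items:
                yield item
            if cursor is None:
                return

# Two fake pages standing in for a paginated tool response.
pages = {None: ([1, 2], "p2"), "p2": ([3], None)}
result = list(PagedResult(lambda c: pages[c]))  # [1, 2, 3]
```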

  • Human-in-the-loop using elicit() is already covered, but I'm wondering if we also need to think about MCP apps.

Can you say more about what you're imagining? I'm not that familiar with MCP apps.

  • About the http built-in: I would actually try to rely on MCP tool calls for things like this. This way we could load all of the complexities of access control and whatnot into a single system (MCP).

Yes, that makes perfect sense. If users want an http tool, they could add a fetch tool.

  • Reusing Starlark code across different scripts could open a can of worms, such as one change generating a cascading failure. Perhaps this is another thing we can push onto MCP, just use MCP tool calls to reuse one vMCP script in another. At least in that case we could have things like schema-based contracts (you get an error if you add a breaking change in an output schema that would generate a domino-effect.) In any case, the feature of load() might be premature without knowing exactly how vMCP would be used in real life.

Yea, I could see it getting hairy too. If we get to the point people (or agents) are writing so much code that we think reuse is important, then that's a good problem to have. I like the ideas you have though. Let's wait until this problem needs more attention.

  • Automated testing for these scripts might be challenging/nearly impossible. A simple example-based test runner would be trivial to add, but mocking MCP calls would be a problem. Perhaps there is a good MCP testing system that we could integrate into Toolhive that would solve these problems?

Yes, it could be frustrating, especially with the schemas of the underlying MCP servers potentially changing.

Some thoughts here:

  • we could build a thv repl for manually iterating
  • we could add a custom type checker that validates the inputs to tool calls and the use of their responses

I think this is another thing where we have to wait for the problem to arise to know what solution is justified.

@aponcedeleonch aponcedeleonch left a comment
Member

Great RFC — the Starlark choice is well-justified, the builtin API is clean, and the phased rollout is the right approach. Left a few inline comments on typed elicitation, migration tooling, and memory limits. Overall this looks solid.


**`parallel(fns)`** executes a list of zero-argument callables concurrently on the Go side using `errgroup`:
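Those errgroup semantics can be sketched in plain Python with `concurrent.futures` — an illustration of the contract (all results come back, or an error surfaces), not the actual Go implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel(fns):
    # Run zero-argument callables concurrently. Collecting .result()
    # re-raises any callable's exception, loosely mirroring errgroup:
    # either every result is returned or an error propagates.
    with ThreadPoolExecutor(max_workers=max(1, len(fns))) as pool:
        futures = [pool.submit(fn) for fn in fns]
        return [f.result() for f in futures]

results = parallel([lambda: 1 + 1, lambda: "ok"])  # [2, "ok"]
```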


Suggestion: Typed elicitation via schema validation

Today the elicitation handler in ToolHive (pkg/vmcp/composer/elicitation_handler.go) validates response size and depth but does not validate content against the provided JSON Schema — the schema is sent to the client purely for UI rendering.

Since scripts already provide a schema to elicit(), the Go-side builtin could validate decision.content against that schema before returning it to the script. This means every script that uses elicitation can trust .content without defensive type checks.

Concretely, add a validate parameter (default True):

decision = elicit(
    "Approve?",
    schema={
        "type": "object",
        "properties": {
            "reason": {"type": "string"},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["reason"],
    },
    validate=True,  # default
)
# decision.content is guaranteed to match the schema if action == "accept"

The Go side would use a JSON Schema validator (e.g., santhosh-tekuri/jsonschema) to enforce this. On validation failure, the builtin could either re-prompt the client or return a structured error.

This is a small addition to the builtin but makes every script that uses elicitation simpler and safer.
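The Go-side check proposed above amounts to something like this toy validator — purely illustrative (`validate_content` is a made-up name, and a real engine would use a full JSON Schema library such as the suggested santhosh-tekuri/jsonschema, not this subset):

```python
def validate_content(content, schema):
    """Toy check of just the schema features used above (required keys,
    per-property type, enum); returns an error string or None."""
    type_map = {"string": str, "number": (int, float), "object": dict}
    for key in schema.get("required", []):
        if key not in content:
            return "missing required property: " + key
    for key, prop in schema.get("properties", {}).items():
        if key not in content:
            continue
        expected = type_map.get(prop.get("type"))
        if expected and not isinstance(content[key], expected):
            return "property %r has wrong type" % key
        if "enum" in prop and content[key] not in prop["enum"]:
            return "property %r not in enum" % key
    return None

schema = {
    "type": "object",
    "properties": {
        "reason": {"type": "string"},
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["reason"],
}
err = validate_content({"severity": "low"}, schema)
# err == "missing required property: reason"
```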


The Starlark engine is designed to be extensible:

- New builtins can be added without breaking existing scripts

Suggestion: Migration tooling (automated transpiler)

The 3-phase migration plan is good, but Phase 2 (deprecation) would be much smoother with concrete migration tooling. The current composite tool model is a strict subset of what Starlark can express — every construct has a direct translation:

| Composite YAML | Starlark Equivalent |
| --- | --- |
| Sequential steps (`dependsOn` chain) | Sequential `call_tool()` calls |
| Parallel steps (same DAG level) | `parallel([...])` |
| `condition` template | `if` statement |
| `onError: continue` | `try_call_tool()` |
| `onError: retry` | `retry(lambda: call_tool(...))` |
| Elicitation step | `elicit()` |
| `onDecline`/`onCancel` actions | `if decision.action == "decline"` |
| Go template `{{.steps.X.output.Y}}` | Variable assignment: `x = call_tool(...); x["Y"]` |
| `defaultResults` | Default in `try_call_tool` fallback |
| `OutputConfig` properties | Return dict construction |

A transpiler built into vMCP could:

  1. Parse CompositeToolConfig (already done at config load time)
  2. Topologically sort the steps (reuse dag_executor.go's buildExecutionLevels)
  3. Emit Starlark source — sequential calls within levels, parallel() across levels
  4. Convert Go template expressions to Python string formatting
  5. Output the .star file or inline script

This could be exposed as:

  • A thv vmcp migrate-composite <tool-name> CLI command that prints the equivalent Starlark
  • A deprecation warning at config load: "Composite tool 'X' can be migrated to Starlark. Run thv vmcp migrate-composite X to see the equivalent script."
  • Optionally, an automatic in-memory "compilation" where composite tools are internally transpiled to Starlark and executed through the new engine (proving equivalence before asking users to migrate)

This gives users concrete migration commands rather than "rewrite your YAML."
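As a sketch of steps 1–3, here is a toy transpiler over a simplified step model (`transpile` and the dict shape are hypothetical stand-ins, not the actual `CompositeToolConfig`):

```python
def transpile(steps):
    """Toy transpiler: each step is {"name", "tool", "depends_on"}.
    Emits sequential call_tool() lines in topological order; a real
    version would group same-level steps into parallel([...])."""
    done, lines, pending = set(), [], list(steps)
    while pending:
        # Steps whose dependencies are all satisfied are ready to emit.
        ready = [s for s in pending if set(s.get("depends_on", [])) <= done]
        if not ready:
            raise ValueError("dependency cycle in composite steps")
        for s in ready:
            lines.append('%s = call_tool("%s", {})' % (s["name"], s["tool"]))
            done.add(s["name"])
            pending.remove(s)
    return "\n".join(lines)

src = transpile([
    {"name": "b", "tool": "summarize", "depends_on": ["a"]},
    {"name": "a", "tool": "fetch"},
])
# src:
# a = call_tool("fetch", {})
# b = call_tool("summarize", {})
```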

| Supply chain (shared libs) | Libraries are loaded from admin-controlled paths only (no user-supplied paths). Same trust model as the main script — the admin controls both. |
| Script injection | Scripts are defined by administrators in YAML/CRDs, not by end users. Input parameters are passed as structured data, not string-interpolated into script source. |

## Alternatives Considered

Concern: Memory limit mitigation is insufficient

This is listed as High severity but the mitigation (step count as indirect memory bound + Go-level monitoring) has a gap. A script can allocate massive data structures in very few steps:

big = "x" * 1000000000  # ~1GB in a single string operation

Step counting won't catch this because the allocation happens in a single operation.

Consider adding a more direct mitigation:

  • A periodic memory check via starlark.Thread's cancel function — the cancel function is called at each step and could check runtime.MemStats.Alloc against a threshold (e.g., 256MB per script execution). This piggybacks on the existing step-counting mechanism.
  • Or a per-script memory limit row in the Resource Limits table (e.g., Memory per execution | 256 MB | No | Cancel function checks runtime.MemStats).

Note: runtime.MemStats is process-wide, so with concurrent script executions you'd need to track per-goroutine deltas or use a simpler heuristic (abort if total process memory exceeds a threshold). Not perfect, but better than no limit.
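The piggyback idea can be sketched in Python, with `tracemalloc` standing in for `runtime.MemStats` — the names and the 256 MB threshold follow the comment's suggestion, not an existing implementation:

```python
import tracemalloc  # stands in for Go's runtime.MemStats in this sketch

MEMORY_LIMIT = 256 * 1024 * 1024  # 256 MB per execution, as suggested

def check_memory():
    # Mirrors the proposed cancel-function hook: called at each
    # interpreter step, aborts once allocation crosses the limit.
    current, _peak = tracemalloc.get_traced_memory()
    if current > MEMORY_LIMIT:
        raise MemoryError("script exceeded per-execution memory limit")

tracemalloc.start()
check_memory()  # passes while allocation stays under the limit
tracemalloc.stop()
```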
