
Starlark scripted tools for vMCP#51

Open
jerm-dro wants to merge 7 commits into main from jerm/2026-03-06-jerm-starlark

Conversation

@jerm-dro
Contributor

@jerm-dro jerm-dro commented Mar 7, 2026

Summary

Proposes replacing vMCP's declarative composite tools system (DAG + Go templates) with a Starlark scripting engine for multi-step tool workflows.

The current composite tools system hits hard limits: no iteration over results, no dynamic branching, and awkward Go template data flow. Starlark provides iteration, conditional branching, dynamic tool dispatch, and data transformation while maintaining sandboxed execution with no arbitrary I/O.

Key design decisions:

  • Two-builtin error handling: call_tool() halts on error (common case), try_call_tool() returns error info (opt-in handling) — works around Starlark's lack of try/except
  • Parallel system: Runs alongside existing composite tools during migration, then replaces them
  • Small builtin API: call_tool, try_call_tool, retry, elicit, parallel, log
  • Code reuse via load(): Shared helper libraries in .star files
  • V1 and V2 session compatible: Engine sits behind the handler factory
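For illustration, the two-builtin split might read like this in a script — a sketch only, with `call_tool`/`try_call_tool` stubbed in plain Python (the real builtins live in the Go engine), and `get_status`/`flaky_tool` as made-up tool names:

```python
def call_tool(name, args):
    # Stub of the halting builtin: the real one dispatches to a
    # backend MCP server and aborts the script on any tool error.
    if name == "flaky_tool":
        raise RuntimeError("tool failed")
    return {"ok": True, "tool": name}

def try_call_tool(name, args):
    # Stub of the opt-in variant: returns error info instead of halting.
    try:
        return {"result": call_tool(name, args), "error": None}
    except RuntimeError as e:
        return {"result": None, "error": str(e)}

# Common case: any failure aborts the whole workflow.
status = call_tool("get_status", {})

# Opt-in handling: inspect the error and fall back.
attempt = try_call_tool("flaky_tool", {})
fallback = attempt["result"] if attempt["error"] is None else {"ok": False}
```

This mirrors the try/except workaround: scripts that never touch `try_call_tool` get fail-fast behavior for free.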

Why Starlark

  • BSD 3-Clause license (compatible with Apache 2.0)
  • Purpose-built for embedded use in Go (Bazel, Buck2, Tilt, Cirrus CI)
  • Sandboxed by design: no I/O, no network, no OS access
  • Python-like syntax lowers learning curve
  • Alternatives evaluated: Risor (security risk from Go stdlib access), Tengo (unfamiliar syntax), Goja/JS (large attack surface), Wasm (overkill)

🤖 Generated with Claude Code

jerm-dro and others added 2 commits March 6, 2026 17:38
Proposes replacing vMCP's declarative composite tools system (DAG + Go
templates) with a Starlark scripting engine for multi-step tool
workflows. Starlark provides iteration, conditional branching, dynamic
dispatch, and data transformation while maintaining sandboxed execution
with no arbitrary I/O.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jerm-dro and others added 5 commits March 6, 2026 18:58
- Remove V2 session incompatibility bullet (temporary concern)
- Remove V1/V2 session goal (not relevant to RFC scope)
- Add parallel execution to Goals section
- Remove migration tooling phase (unnecessary)
- Remove fuzz tests from testing strategy (unnecessary)
- Simplify migration path (3 phases instead of 4)
- Add docs-website to documentation requirements
- Resolve naming question: scripted and composite are interchangeable
- Remove json.encode/decode open question

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Starlark's sandbox makes it feasible for agents to dynamically
compose and submit scripts to vMCP at runtime — something a
declarative YAML DSL could never support safely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Six practical examples covering structured data manipulation,
returning structured data, JSON-as-string parsing from legacy
servers, fan-out with parallel, error handling patterns, and
elicitation. Includes a callout explaining when to use dict
indexing (tool results) vs attribute access (builtin return values).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kantord
Member

kantord commented Mar 9, 2026

This RFC is incredible! Seems like I'm the first human reviewer and considering that, honestly I'm impressed by how thorough it already is!

I do have some thoughts but honestly, none of them are blockers, I just wanted to give more than a "Looks good to me". I did have to scrape the barrel to think of some things that might be actionable:

  • Perhaps for error handling, it would make sense to distinguish between runtime errors vs. input validation errors vs. output validation errors?
  • Streaming/large data: this may be mostly theoretical based on how MCP is used today, but if workflows start dispatching large fan-outs (e.g., calling a github tool across thousands of repos), the lack of streaming primitives in Starlark could become a limitation, especially when combined with strict memory limitations
  • Human-in-the-loop using elicit() is already covered, but I'm wondering if we also need to think about MCP apps.
  • About the http built-in: I would actually try to rely on MCP tool calls for things like this. This way we could load all of the complexities of access control and whatnot into a single system (MCP).
  • Reusing Starlark code across different scripts could open a can of worms, such as one change generating a cascading failure. Perhaps this is another thing we can push onto MCP, just use MCP tool calls to reuse one vMCP script in another. At least in that case we could have things like schema-based contracts (you get an error if you add a breaking change in an output schema that would generate a domino-effect.) In any case, the feature of load() might be premature without knowing exactly how vMCP would be used in real life.
  • Automated testing for these scripts might be challenging/nearly impossible. A simple example-based test runner would be trivial to add, but mocking MCP calls would be a problem. Perhaps there is a good MCP testing system that we could integrate into Toolhive that would solve these problems?

@jerm-dro
Contributor Author

Thanks for the thoughtful review @kantord 😄

  • Streaming/large data: this may be mostly theoretical based on how MCP is used today, but if workflows start dispatching large fan-outs (e.g., calling a github tool across thousands of repos), the lack of streaming primitives in Starlark could become a limitation, especially when combined with strict memory limitations

This is an important thing to call out. I think it can be solved by building an implementation of Iterable. Something within vMCP would recognize "hey, this is a huge array / dict / paginated response" and turn it into an Iterable.
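A rough sketch of that idea in plain Python — the real version would implement starlark-go's `Iterable` interface on the Go side; `PagedResult` and `fetch_page` are hypothetical names:

```python
class PagedResult:
    """Hypothetical lazy wrapper: yields items page by page instead of
    materializing a huge tool result in script memory."""

    def __init__(self, fetch_page):
        # fetch_page(cursor) -> (items, next_cursor); None cursor = done.
        self.fetch_page = fetch_page

    def __iter__(self):
        cursor = None
        while True:
            items, cursor = self.fetch_page(cursor)
            for item in items:
                yield item
            if cursor is None:
                return

# Two fake pages standing in for a paginated tool response.
pages = {None: ([1, 2], "p2"), "p2": ([3], None)}
result = list(PagedResult(lambda c: pages[c]))  # [1, 2, 3]
```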

  • Human-in-the-loop using elicit() is already covered, but I'm wondering if we also need to think about MCP apps.

Can you say more about what you're imagining? I'm not that familiar with MCP apps.

  • About the http built-in: I would actually try to rely on MCP tool calls for things like this. This way we could load all of the complexities of access control and whatnot into a single system (MCP).

Yes, that makes perfect sense. If users want an http tool, they could add a fetch tool.

  • Reusing Starlark code across different scripts could open a can of worms, such as one change generating a cascading failure. Perhaps this is another thing we can push onto MCP, just use MCP tool calls to reuse one vMCP script in another. At least in that case we could have things like schema-based contracts (you get an error if you add a breaking change in an output schema that would generate a domino-effect.) In any case, the feature of load() might be premature without knowing exactly how vMCP would be used in real life.

Yea, I could see it getting hairy too. If we get to the point people (or agents) are writing so much code that we think reuse is important, then that's a good problem to have. I like the ideas you have though. Let's wait until this problem needs more attention.

  • Automated testing for these scripts might be challenging/nearly impossible. A simple example-based test runner would be trivial to add, but mocking MCP calls would be a problem. Perhaps there is a good MCP testing system that we could integrate into Toolhive that would solve these problems?

Yes, it could be frustrating, especially with the schemas of the underlying MCP servers potentially changing.

Some thoughts here:

  • we could build a thv repl for manually iterating
  • we could add a custom type checker that validates the inputs to tool calls and the use of their responses

I think this is another thing where we have to wait for the problem to arise to know what solution is justified.

@aponcedeleonch aponcedeleonch left a comment
Member

Great RFC — the Starlark choice is well-justified, the builtin API is clean, and the phased rollout is the right approach. Left a few inline comments on typed elicitation, migration tooling, and memory limits. Overall this looks solid.


**`parallel(fns)`** executes a list of zero-argument callables concurrently on the Go side using `errgroup`:
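Those errgroup semantics can be sketched in plain Python with `concurrent.futures` — an illustration of the contract (all results come back, or an error surfaces), not the actual Go implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel(fns):
    # Run zero-argument callables concurrently. Collecting .result()
    # re-raises any callable's exception, loosely mirroring errgroup:
    # either every result is returned or an error propagates.
    with ThreadPoolExecutor(max_workers=max(1, len(fns))) as pool:
        futures = [pool.submit(fn) for fn in fns]
        return [f.result() for f in futures]

results = parallel([lambda: 1 + 1, lambda: "ok"])  # [2, "ok"]
```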


Suggestion: Typed elicitation via schema validation

Today the elicitation handler in ToolHive (pkg/vmcp/composer/elicitation_handler.go) validates response size and depth but does not validate content against the provided JSON Schema — the schema is sent to the client purely for UI rendering.

Since scripts already provide a schema to elicit(), the Go-side builtin could validate decision.content against that schema before returning it to the script. This means every script that uses elicitation can trust .content without defensive type checks.

Concretely, add a validate parameter (default True):

decision = elicit(
    "Approve?",
    schema={
        "type": "object",
        "properties": {
            "reason": {"type": "string"},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["reason"],
    },
    validate=True,  # default
)
# decision.content is guaranteed to match the schema if action == "accept"

The Go side would use a JSON Schema validator (e.g., santhosh-tekuri/jsonschema) to enforce this. On validation failure, the builtin could either re-prompt the client or return a structured error.

This is a small addition to the builtin but makes every script that uses elicitation simpler and safer.
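The Go-side check proposed above amounts to something like this toy validator — purely illustrative (`validate_content` is a made-up name, and a real engine would use a full JSON Schema library such as the suggested santhosh-tekuri/jsonschema, not this subset):

```python
def validate_content(content, schema):
    """Toy check of just the schema features used above (required keys,
    per-property type, enum); returns an error string or None."""
    type_map = {"string": str, "number": (int, float), "object": dict}
    for key in schema.get("required", []):
        if key not in content:
            return "missing required property: " + key
    for key, prop in schema.get("properties", {}).items():
        if key not in content:
            continue
        expected = type_map.get(prop.get("type"))
        if expected and not isinstance(content[key], expected):
            return "property %r has wrong type" % key
        if "enum" in prop and content[key] not in prop["enum"]:
            return "property %r not in enum" % key
    return None

schema = {
    "type": "object",
    "properties": {
        "reason": {"type": "string"},
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["reason"],
}
err = validate_content({"severity": "low"}, schema)
# err == "missing required property: reason"
```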


The Starlark engine is designed to be extensible:

- New builtins can be added without breaking existing scripts

Suggestion: Migration tooling (automated transpiler)

The 3-phase migration plan is good, but Phase 2 (deprecation) would be much smoother with concrete migration tooling. The current composite tool model is a strict subset of what Starlark can express — every construct has a direct translation:

| Composite YAML | Starlark Equivalent |
| --- | --- |
| Sequential steps (`dependsOn` chain) | Sequential `call_tool()` calls |
| Parallel steps (same DAG level) | `parallel([...])` |
| `condition` template | `if` statement |
| `onError: continue` | `try_call_tool()` |
| `onError: retry` | `retry(lambda: call_tool(...))` |
| Elicitation step | `elicit()` |
| `onDecline`/`onCancel` actions | `if decision.action == "decline"` |
| Go template `{{.steps.X.output.Y}}` | Variable assignment: `x = call_tool(...); x["Y"]` |
| `defaultResults` | Default in `try_call_tool` fallback |
| `OutputConfig` properties | Return dict construction |

A transpiler built into vMCP could:

  1. Parse CompositeToolConfig (already done at config load time)
  2. Topologically sort the steps (reuse dag_executor.go's buildExecutionLevels)
  3. Emit Starlark source — sequential calls within levels, parallel() across levels
  4. Convert Go template expressions to Python string formatting
  5. Output the .star file or inline script

This could be exposed as:

  • A thv vmcp migrate-composite <tool-name> CLI command that prints the equivalent Starlark
  • A deprecation warning at config load: "Composite tool 'X' can be migrated to Starlark. Run thv vmcp migrate-composite X to see the equivalent script."
  • Optionally, an automatic in-memory "compilation" where composite tools are internally transpiled to Starlark and executed through the new engine (proving equivalence before asking users to migrate)

This gives users concrete migration commands rather than "rewrite your YAML."
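As a sketch of steps 1–3, here is a toy transpiler over a simplified step model (`transpile` and the dict shape are hypothetical stand-ins, not the actual `CompositeToolConfig`):

```python
def transpile(steps):
    """Toy transpiler: each step is {"name", "tool", "depends_on"}.
    Emits sequential call_tool() lines in topological order; a real
    version would group same-level steps into parallel([...])."""
    done, lines, pending = set(), [], list(steps)
    while pending:
        # Steps whose dependencies are all satisfied are ready to emit.
        ready = [s for s in pending if set(s.get("depends_on", [])) <= done]
        if not ready:
            raise ValueError("dependency cycle in composite steps")
        for s in ready:
            lines.append('%s = call_tool("%s", {})' % (s["name"], s["tool"]))
            done.add(s["name"])
            pending.remove(s)
    return "\n".join(lines)

src = transpile([
    {"name": "b", "tool": "summarize", "depends_on": ["a"]},
    {"name": "a", "tool": "fetch"},
])
# src:
# a = call_tool("fetch", {})
# b = call_tool("summarize", {})
```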

| Supply chain (shared libs) | Libraries are loaded from admin-controlled paths only (no user-supplied paths). Same trust model as the main script — the admin controls both. |
| Script injection | Scripts are defined by administrators in YAML/CRDs, not by end users. Input parameters are passed as structured data, not string-interpolated into script source. |

## Alternatives Considered

Concern: Memory limit mitigation is insufficient

This is listed as High severity but the mitigation (step count as indirect memory bound + Go-level monitoring) has a gap. A script can allocate massive data structures in very few steps:

big = "x" * 1000000000  # ~1GB in a single string operation

Step counting won't catch this because the allocation happens in a single operation.

Consider adding a more direct mitigation:

  • A periodic memory check via starlark.Thread's cancel function — the cancel function is called at each step and could check runtime.MemStats.Alloc against a threshold (e.g., 256MB per script execution). This piggybacks on the existing step-counting mechanism.
  • Or a per-script memory limit row in the Resource Limits table (e.g., Memory per execution | 256 MB | No | Cancel function checks runtime.MemStats).

Note: runtime.MemStats is process-wide, so with concurrent script executions you'd need to track per-goroutine deltas or use a simpler heuristic (abort if total process memory exceeds a threshold). Not perfect, but better than no limit.
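The piggyback idea can be sketched in Python, with `tracemalloc` standing in for `runtime.MemStats` — the names and the 256 MB threshold follow the comment's suggestion, not an existing implementation:

```python
import tracemalloc  # stands in for Go's runtime.MemStats in this sketch

MEMORY_LIMIT = 256 * 1024 * 1024  # 256 MB per execution, as suggested

def check_memory():
    # Mirrors the proposed cancel-function hook: called at each
    # interpreter step, aborts once allocation crosses the limit.
    current, _peak = tracemalloc.get_traced_memory()
    if current > MEMORY_LIMIT:
        raise MemoryError("script exceeded per-execution memory limit")

tracemalloc.start()
check_memory()  # passes while allocation stays under the limit
tracemalloc.stop()
```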
