
perf: parallelize tool server init and reduce LLM retry overhead #139

Closed
wangbinluo wants to merge 16 commits into MiroMindAI:main from wangbinluo:dev_wbl

Conversation

@wangbinluo
Collaborator

@wangbinluo wangbinluo commented Mar 17, 2026

Summary

Addresses evaluation pipeline performance bottlenecks identified in #137.

Three changes targeting the slowest parts of BC benchmark evaluation.

Closes #137

Changes

  • P0: Parallelize MCP tool server initialization. In manager.py, get_all_tool_definitions() now uses asyncio.gather() instead of a sequential for loop (see the sketch after this list).
  • P0: Reduce LLM retry overhead. In openai_client.py, base_wait_time 30s → 10s and max_retries 10 → 5. Prevents 60-90s wasted on retries.
  • P1: httpx connection pooling. search_and_scrape_webpage.py reuses a shared httpx.AsyncClient instead of creating a new one per request (~346 search calls per BC task).
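
The parallel-init change, roughly sketched. get_all_tool_definitions and _get_single_server_tools are names used in this PR; the surrounding attribute names (self.server_configs, server.name) are illustrative and not the exact diff:

import asyncio

async def get_all_tool_definitions(self):
    # Previously: a sequential `for server in self.server_configs: await ...` loop.
    # Now every server's tool listing is started concurrently.
    results = await asyncio.gather(
        *(self._get_single_server_tools(server) for server in self.server_configs),
        return_exceptions=True,  # one slow or failing server no longer blocks the rest
    )
    tool_definitions = {}
    for server, result in zip(self.server_configs, results):
        if isinstance(result, Exception):
            # keep a fallback error entry so the failure stays visible downstream
            tool_definitions[server.name] = {"error": str(result)}
        else:
            tool_definitions[server.name] = result
    return tool_definitions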

Benchmark Results

Tool server initialization time (3 runs on dev_wbl)

Run                   dev_wbl (parallel)
Run 1 (cold start)    29.3s
Run 2                 12.1s
Run 3                 9.6s
Average               17.0s

Comparison with main branch (extracted from existing evaluation logs)

          main (sequential)   dev_wbl (parallel)   Speedup
Average   234s                17.0s                13.8x
Min       22s                 9.6s                 2.3x
Max       945s (15.8 min)     29.3s                32.3x

Note: main branch baseline was extracted from BC-ZH evaluation logs (qwen_xxg_negative_r10_new1_step50, 30 tasks). The high average (234s) includes E2B sandbox queueing under high concurrency (MAX_CONCURRENT=60). Per-server breakdown on main: tool-python avg=145s (max=631s), search avg=69s, jina avg=19s.

Test plan

  • Run bench_init_time.py on dev_wbl — 17.0s avg vs main baseline 234s avg
  • Run BC-ZH evaluation end-to-end and verify no regression in accuracy
  • Verify LLM retry behavior works correctly with reduced parameters
  • Check that Serper API connection pooling doesn't cause stale connection issues (a pooling sketch follows this list)
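
For the pooling check above, a minimal sketch of the shared-client pattern. The function names (get_shared_client, serper_search) and the timeout/limit values are illustrative assumptions, not the PR's exact code:

import httpx

_shared_client: httpx.AsyncClient | None = None

def get_shared_client() -> httpx.AsyncClient:
    # One client per process: keep-alive connections (and TLS sessions) are
    # reused across the ~346 Serper calls a BC task makes.
    global _shared_client
    if _shared_client is None or _shared_client.is_closed:
        _shared_client = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0),
            limits=httpx.Limits(max_connections=20, max_keepalive_connections=10),
        )
    return _shared_client

async def serper_search(query: str, api_key: str) -> dict:
    client = get_shared_client()
    resp = await client.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        json={"q": query},
    )
    resp.raise_for_status()
    return resp.json()

The is_closed check guards against reusing a client that was already closed, which is one way stale-connection issues can surface.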

shawnlimn and others added 13 commits February 4, 2026 16:17
- Fix server_name routing: dynamically parse system prompt to build
  tool→server mapping, auto-correct wrong server_name in LLM responses
- Fix tool_name hallucination: python/python_code → run_python_code
  (only when system prompt defines run_python_code)
- Fix parameter names: code → code_block, add default sandbox_id
- Fix scrape_and_extract_info params: description → info_to_extract
- Add stateless sandbox fallback for invalid sandbox_id

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…oMindAI#137)

- P0: Parallelize MCP tool server initialization with asyncio.gather()
  (saves ~40-50s per task, previously ~71s sequential)
- P0: Reduce LLM retry base_wait_time from 30s to 10s, max_retries from 10 to 5
- P1: Add httpx connection pooling for Serper API requests
  (reuse TCP connections across ~346 search calls per task)

Ref: MiroMindAI#137

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wangbinluo
Collaborator Author

wangbinluo commented Mar 17, 2026

@shawnlimn @xingxuanli Could you review this PR when you get a chance?

This PR addresses the performance bottlenecks identified in #137 with two optimizations:

  1. Parallel tool server init (asyncio.gather) — measured 13.8x speedup (234s → 17s avg)
  2. httpx connection pooling — reuse TCP connections for Serper API (~346 calls per BC task)

Note: PR #138 by @JasonOA888 covers only item 1; this PR is a superset with additional optimization and benchmark data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wangbinluo and others added 2 commits March 17, 2026 14:54
Algorithm team confirmed these values are needed for reliability under
high load with self-hosted sglang servers. Ref: MiroMindAI#137

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…iroMindAI#137)

Previously every tool call (except playwright) spawned a new MCP server
subprocess, initialized it, called the tool, then killed it.  BC tasks
average ~300+ tool calls, so the spawn/teardown overhead adds up.

Introduce PersistentMCPSession that keeps the subprocess alive for the
entire task lifetime.  On connection failure it transparently reconnects
once.  Sessions are cleaned up via close_all_sessions() at task end.

This is the P0 "MCP server connection reuse" item from MiroMindAI#137, estimated
to save 2-5 min per task.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
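
Based on this commit message and the snippets in the review below, the persistent-session shape is roughly as follows. This is a sketch using the MCP Python SDK's stdio_client; error handling and the per-tool-manager session registry are simplified, and the missing cleanup paths are exactly what the review below calls out:

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

class PersistentMCPSession:
    """Keeps one MCP server subprocess alive for the whole task."""

    def __init__(self, server_name: str, params: StdioServerParameters):
        self.server_name = server_name
        self._params = params
        self._client = None
        self._session = None

    async def _connect(self):
        # Spawn the server subprocess once and initialize the session.
        self._client = stdio_client(self._params)
        read, write = await self._client.__aenter__()
        self._session = ClientSession(read, write, sampling_callback=None)
        await self._session.__aenter__()
        await self._session.initialize()

    async def ensure_connected(self):
        if self._session is None:
            await self._connect()

    async def call_tool(self, name: str, arguments: dict):
        await self.ensure_connected()
        try:
            return await self._session.call_tool(name, arguments)
        except Exception:
            # One transparent reconnect, then retry the call.
            await self.close()
            await self.ensure_connected()
            return await self._session.call_tool(name, arguments)

    async def close(self):
        if self._session is not None:
            await self._session.__aexit__(None, None, None)
            self._session = None
        if self._client is not None:
            await self._client.__aexit__(None, None, None)
            self._client = None

close_all_sessions() (a tool-manager method, per the review below) would iterate the live sessions and call close() on each at task end.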
@wangbinluo wangbinluo marked this pull request as draft April 10, 2026 08:26
Member

@Vanint Vanint left a comment


Solid PR — persistent sessions + parallel init is a big perf win, and this supersedes #138 with better exception handling. A few issues to address:

  1. Subprocess leak in _connect if session init fails: __aenter__/__aexit__ are called manually, bypassing async with safety. If session.__aenter__() or initialize() throws, client.__aexit__ is never called and the subprocess leaks:
async def _connect(self):
    ...
    read, write = await self._client.__aenter__()
    try:
        self._session = ClientSession(read, write, sampling_callback=None)
        await self._session.__aenter__()
        await self._session.initialize()
    except Exception:
        await self._client.__aexit__(None, None, None)
        self._client = None
        raise
  2. Sessions not cleaned up on abnormal exit: close_all_sessions() is only called at the end of a successful run_main_agent. If the orchestrator throws mid-execution, subprocesses are leaked. Wrap in try/finally:
try:
    # ... existing run_main_agent logic ...
finally:
    await self.main_agent_tool_manager.close_all_sessions()
    for tm in self.sub_agent_tool_managers.values():
        await tm.close_all_sessions()
  3. Reconnect logic in call_tool is not concurrency-safe: if two coroutines fail concurrently, both enter close() → ensure_connected(), potentially tearing down a connection the other just established. Currently the orchestrator calls tools sequentially so this won't bite, but PersistentMCPSession looks like a reusable component — worth either adding a lock around the reconnect path or documenting the single-caller assumption (a lock sketch follows this list).

  4. Reconnect failure lacks context: If the retry in call_tool also fails, the exception propagates without any indication that a reconnect was attempted. Consider logging or wrapping the error (e.g., raise RuntimeError(f"Reconnect to {self.server_name} also failed") from e).

  5. get_all_tool_definitions connects twice per server: _get_single_server_tools opens a temporary connection to list tools, then execute_tool_call creates a new persistent session. Intentional (to avoid holding long-lived connections during init)? If so, a brief comment would help. If not, could reuse the persistent session for the initial tool listing too.

  6. Re: #138: This PR covers everything in #138 with better exception handling (uses zip to preserve server name, retains fallback error entries). If this lands, #138 can be closed.
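
On points 3 and 4, a minimal sketch of the locked reconnect path. It assumes a self._reconnect_lock = asyncio.Lock() created in __init__; the other names come from this thread:

async def call_tool(self, name, arguments):
    await self.ensure_connected()
    session = self._session
    try:
        return await session.call_tool(name, arguments)
    except Exception:
        async with self._reconnect_lock:
            # Rebuild only if another coroutine hasn't already replaced the broken session.
            if self._session is session:
                await self.close()
                await self._connect()
        try:
            return await self._session.call_tool(name, arguments)
        except Exception as e:
            raise RuntimeError(f"Reconnect to {self.server_name} also failed") from e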

Fix these problems, and this is good to merge.

