[Bug] Fix Batch API echoing request body instead of model responses #2117

Open
ianliuy wants to merge 2 commits into vllm-project:main from ianliuy:fix/issue-1802

Conversation

Contributor

ianliuy commented Apr 17, 2026

Pull Request Description

Fix a bug where the Batch API's JobDriver writes the original request body into the response body field instead of the actual model inference response. The batch job completes with status_code: 200 but the output JSONL echoes back the input (model, messages, max_tokens) rather than the chat completion result (choices, usage, etc.).

Root cause: The default InferenceEngineClient base class (job_driver.py:37-41) was a test stub that silently echoed request_data back as the "response." When JobDriver received no inference client (inference_client=None), it fell back to this stub without any warning, causing batch outputs to contain request bodies instead of model responses.
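For illustration, a minimal sketch of what such an echo stub amounts to (paraphrased, not the repository's exact code; the synchronous signature is an assumption):

class InferenceEngineClient:
    def inference_request(self, request_data: dict) -> dict:
        # Pre-fix behavior: the request body comes straight back as the
        # "response", so batch outputs carry (model, messages, max_tokens)
        # instead of (choices, usage, ...).
        return request_data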

Fix:

  • Make InferenceEngineClient.inference_request() raise NotImplementedError to prevent the base class from being used as a silent echo stub in production (a sketch of the resulting split follows this list)
  • Extract the echo behavior into an explicit MockInferenceEngineClient subclass for test usage
  • Add a targeted WARNING log in execute_worker() when the mock client is used for actual inference execution (prepare/finalize paths are not affected)
  • Update test_inference_client_integration.py to use MockInferenceEngineClient explicitly and add a regression test for the NotImplementedError guard
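A minimal sketch of the resulting split (illustrative only; class and method names follow the PR description, the signature is assumed):

class InferenceEngineClient:
    def inference_request(self, request_data: dict) -> dict:
        # Fail loudly: the base class can no longer act as a silent echo stub.
        raise NotImplementedError("Configure a real inference engine client.")

class MockInferenceEngineClient(InferenceEngineClient):
    def inference_request(self, request_data: dict) -> dict:
        # The echo behavior now lives in an explicitly named, test-only subclass.
        return request_data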

Testing:

  • All 4 tests in test_inference_client_integration.py pass (including the new test_base_client_raises_not_implemented; a sketch follows this list)
  • Verified locally via WSL with pytest
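A regression test for the guard could look roughly like this (the test name follows the PR description; the import path and the synchronous signature are assumptions):

import pytest

from job_driver import InferenceEngineClient  # hypothetical import path

def test_base_client_raises_not_implemented():
    # The base class must fail loudly rather than echo the request back.
    with pytest.raises(NotImplementedError):
        InferenceEngineClient().inference_request({"model": "m", "messages": []})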

Related Issues

Resolves: #1802

cc @Jeffwan @zhangjyr

The default InferenceEngineClient base class silently echoed request_data
back as the inference response. When JobDriver received no inference client
(inference_client=None), it fell back to this echo stub, causing batch
outputs to contain original request bodies instead of model responses.

Fix:
- Make InferenceEngineClient.inference_request() raise NotImplementedError
  to prevent silent echo in production
- Extract echo behavior into explicit MockInferenceEngineClient subclass
  for test usage
- Add WARNING log in JobDriver.__init__() when falling back to mock client
- Update test_inference_client_integration.py to use MockInferenceEngineClient
  and add test for NotImplementedError guard

Resolves: vllm-project#1802

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Contributor

gemini-code-assist Bot left a comment


Code Review

This pull request refactors the InferenceEngineClient into an abstract base class that raises a NotImplementedError and introduces a MockInferenceEngineClient for testing purposes. The JobDriver is updated to default to the mock client when none is provided, and a warning is logged if the mock client is used during worker execution. Review feedback suggests that defaulting to a mock client in production code could lead to silent failures; it is recommended to leave the client as None by default and implement explicit runtime checks to ensure a valid client is configured before execution.

Comment on lines 95 to 98
 if inference_client is None:
-    self._inference_client = InferenceEngineClient()
+    self._inference_client = MockInferenceEngineClient()
 else:
     self._inference_client = inference_client
Contributor


medium

Defaulting to a mock client in production code can lead to silent failures or incorrect behavior (like the echoing bug this PR addresses) if the configuration is missing. It is safer to leave the inference client as None by default and explicitly handle the missing client in methods that require it. This forces the caller to provide a valid client for execution tasks while still allowing the JobDriver to be used for non-inference tasks like preparation and finalization.

Suggested change
-if inference_client is None:
-    self._inference_client = MockInferenceEngineClient()
-else:
-    self._inference_client = inference_client
+self._inference_client = inference_client
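Read together with the runtime check proposed further below, the recommendation amounts to roughly this shape (a sketch of the reviewer's suggestion, not repository code; names are taken from the surrounding diff):

class JobDriver:
    def __init__(self, inference_client=None):
        # Construction stays permissive: prepare/finalize paths don't need
        # an engine, so None is allowed here.
        self._inference_client = inference_client

    def execute_worker(self, job_id):
        # Methods that actually run inference must verify a client exists.
        if self._inference_client is None:
            raise RuntimeError(f"Inference client not configured for job {job_id}.")
        ...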

Collaborator


Agree. Using a mock client doesn't fully solve the problem; I would rather raise a ValueError if the client is None.

Comment on lines +168 to +174
if isinstance(self._inference_client, MockInferenceEngineClient):
    logger.warning(
        "Using MockInferenceEngineClient for batch execution. "
        "Inference responses will echo request data instead of model responses. "
        "Configure a real LLM engine endpoint for production use.",
        job_id=job_id,
    )  # type: ignore[call-arg]
Contributor


medium

Since the inference client is now optional by default, we should explicitly check for its presence before starting the worker execution. This prevents accidental use of a missing client and provides a clear error message. We can also keep the warning for the mock client to alert users if it's being used in a context where a real client might be expected.

Suggested change
-if isinstance(self._inference_client, MockInferenceEngineClient):
-    logger.warning(
-        "Using MockInferenceEngineClient for batch execution. "
-        "Inference responses will echo request data instead of model responses. "
-        "Configure a real LLM engine endpoint for production use.",
-        job_id=job_id,
-    )  # type: ignore[call-arg]
+if self._inference_client is None:
+    raise RuntimeError(f"Inference client not configured for job {job_id}. Execution cannot proceed.")
+if isinstance(self._inference_client, MockInferenceEngineClient):
+    logger.warning(
+        "Using MockInferenceEngineClient for batch execution. "
+        "Inference responses will echo request data instead of model responses. "
+        "Ensure this is intended for testing purposes.",
+        job_id=job_id,
+    )  # type: ignore[call-arg]

Collaborator


Same feedback as Gemini: explicitly throwing an error is better.



Development

Successfully merging this pull request may close these issues.

Batch requests completed successfully but model responses are empty.
