[Bug] Fix Batch API echoing request body instead of model responses #2117
ianliuy wants to merge 2 commits into
Conversation
The default InferenceEngineClient base class silently echoed request_data back as the inference response. When JobDriver received no inference client (inference_client=None), it fell back to this echo stub, causing batch outputs to contain original request bodies instead of model responses.

Fix:
- Make InferenceEngineClient.inference_request() raise NotImplementedError to prevent silent echo in production
- Extract echo behavior into explicit MockInferenceEngineClient subclass for test usage
- Add WARNING log in JobDriver.__init__() when falling back to mock client
- Update test_inference_client_integration.py to use MockInferenceEngineClient and add test for NotImplementedError guard

Resolves: vllm-project#1802

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
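A minimal sketch of the refactor described in this commit message. The class and method names follow the PR text, but the exact signature (sync vs. async, return type) is assumed here for illustration:

```python
class InferenceEngineClient:
    """Base inference client: no longer a silent echo stub."""

    def inference_request(self, request_data: dict) -> dict:
        # Fail loudly instead of echoing the request back in production.
        raise NotImplementedError(
            "InferenceEngineClient must be subclassed with a real "
            "LLM engine endpoint; refusing to echo request data."
        )


class MockInferenceEngineClient(InferenceEngineClient):
    """Explicit test double that preserves the old echo behavior."""

    def inference_request(self, request_data: dict) -> dict:
        # Echo the request body back, as the old default did implicitly.
        return request_data
```

With this split, the echo behavior is only reachable through a class whose name makes the intent unmistakable.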
Code Review
This pull request refactors the InferenceEngineClient into an abstract base class that raises a NotImplementedError and introduces a MockInferenceEngineClient for testing purposes. The JobDriver is updated to default to the mock client when none is provided, and a warning is logged if the mock client is used during worker execution. Review feedback suggests that defaulting to a mock client in production code could lead to silent failures; it is recommended to leave the client as None by default and implement explicit runtime checks to ensure a valid client is configured before execution.
```diff
 if inference_client is None:
-    self._inference_client = InferenceEngineClient()
+    self._inference_client = MockInferenceEngineClient()
 else:
     self._inference_client = inference_client
```
Defaulting to a mock client in production code can lead to silent failures or incorrect behavior (like the echoing bug this PR addresses) if the configuration is missing. It is safer to leave the inference client as None by default and explicitly handle the missing client in methods that require it. This forces the caller to provide a valid client for execution tasks while still allowing the JobDriver to be used for non-inference tasks like preparation and finalization.
Agree. Using a mock client doesn't fully solve the problem; I would rather raise a ValueError if the client is None.
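A sketch of that fail-fast alternative. The constructor shape is assumed from the diff above; the real `JobDriver.__init__()` takes additional arguments:

```python
class JobDriver:
    def __init__(self, inference_client=None):
        # Reviewer suggestion: refuse to construct without a real client
        # rather than silently substituting a mock that echoes request
        # bodies back as "responses".
        if inference_client is None:
            raise ValueError(
                "inference_client is required; pass a configured "
                "InferenceEngineClient instead of None."
            )
        self._inference_client = inference_client
```

The trade-off noted elsewhere in this review: raising in `__init__()` also blocks non-inference uses of JobDriver (preparation, finalization), which is why the other suggestion moves the check into the execution path instead.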
```python
if isinstance(self._inference_client, MockInferenceEngineClient):
    logger.warning(
        "Using MockInferenceEngineClient for batch execution. "
        "Inference responses will echo request data instead of model responses. "
        "Configure a real LLM engine endpoint for production use.",
        job_id=job_id,
    )  # type: ignore[call-arg]
```
Since the inference client is now optional by default, we should explicitly check for its presence before starting the worker execution. This prevents accidental use of a missing client and provides a clear error message. We can also keep the warning for the mock client to alert users if it's being used in a context where a real client might be expected.
Suggested change:

```diff
-if isinstance(self._inference_client, MockInferenceEngineClient):
-    logger.warning(
-        "Using MockInferenceEngineClient for batch execution. "
-        "Inference responses will echo request data instead of model responses. "
-        "Configure a real LLM engine endpoint for production use.",
-        job_id=job_id,
-    )  # type: ignore[call-arg]
+if self._inference_client is None:
+    raise RuntimeError(f"Inference client not configured for job {job_id}. Execution cannot proceed.")
+if isinstance(self._inference_client, MockInferenceEngineClient):
+    logger.warning(
+        "Using MockInferenceEngineClient for batch execution. "
+        "Inference responses will echo request data instead of model responses. "
+        "Ensure this is intended for testing purposes.",
+        job_id=job_id,
+    )  # type: ignore[call-arg]
```
Same feedback as the Gemini review: explicitly throwing an error is better.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull Request Description
Fix a bug where the Batch API's `JobDriver` writes the original request body into the response `body` field instead of the actual model inference response. The batch job completes with `status_code: 200`, but the output JSONL echoes back the input (`model`, `messages`, `max_tokens`) rather than the chat completion result (`choices`, `usage`, etc.).

Root cause: The default `InferenceEngineClient` base class (job_driver.py:37-41) was a test stub that silently echoed `request_data` back as the "response." When `JobDriver` received no inference client (`inference_client=None`), it fell back to this stub without any warning, causing batch outputs to contain request bodies instead of model responses.

Fix:
- Make `InferenceEngineClient.inference_request()` raise `NotImplementedError` to prevent the base class from being used as a silent echo stub in production
- Extract the echo behavior into an explicit `MockInferenceEngineClient` subclass for test usage
- Log a warning in `execute_worker()` when the mock client is used for actual inference execution (prepare/finalize paths are not affected)
- Update `test_inference_client_integration.py` to use `MockInferenceEngineClient` explicitly and add a regression test for the `NotImplementedError` guard

Testing:
- Tests in `test_inference_client_integration.py` pass (including the new `test_base_client_raises_not_implemented`)

Related Issues
Resolves: #1802
cc @Jeffwan @zhangjyr
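The `NotImplementedError` regression test mentioned in the Testing notes might look like the following. This is a hedged sketch: the real test lives in test_inference_client_integration.py and imports the actual class, while a local stand-in base class is defined here so the snippet runs on its own:

```python
# Local stand-in for the class under test; the real InferenceEngineClient
# is defined in job_driver.py and imported by the actual test module.
class InferenceEngineClient:
    def inference_request(self, request_data: dict) -> dict:
        raise NotImplementedError("configure a real inference client")


def test_base_client_raises_not_implemented():
    # The base class must refuse to serve requests instead of echoing them.
    try:
        InferenceEngineClient().inference_request(
            {"model": "test", "messages": []}
        )
    except NotImplementedError:
        return  # expected: the guard fired
    raise AssertionError("base class should raise NotImplementedError")
```

In a pytest-based suite the try/except would typically be written as `pytest.raises(NotImplementedError)` instead.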