updated tests #383

shreymodi1 · 2025-12-18T02:13:44Z

Note

Updates streaming compliance benchmarks and rollout plumbing.

Default model set to glm-4p6. Imports pytest and adds @pytest.mark.skipif so the reasoning and multi-tool tests run only when EP_SUPPORTS_REASONING and EP_SUPPORTS_MULTIPLE_TOOL_CALLS indicate support.
Adds the helper _maybe_add_reasoning_effort and updates _build_completion_params_from_payload to pass reasoning_effort only when supported; many tests now use these helpers.
Adds raw_output: True to test params. The rollout moves raw_output into extra_body and captures the response's raw_output in row.execution_metadata.raw_output.
Model updates: ExecutionMetadata gains a raw_output field to store provider extras.
Tests now include raw_output and make reasoning checks and metrics conditional on support, with minor param tweaks across streaming and non-streaming cases.

Written by Cursor Bugbot for commit da015c5.
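
For readers without the diff open, the gating and parameter plumbing summarized in the note might look roughly like the sketch below. It is an illustration under assumptions, not the code from this PR: the function names `env_flag`, `add_reasoning_effort_if_supported`, and `move_raw_output_to_extra_body` are hypothetical, the set of accepted truthy strings is a guess, and the real `_maybe_add_reasoning_effort` and rollout code may have different signatures and behavior.

```python
import os
from typing import Any, Dict, Optional


def env_flag(name: str, default: bool = False) -> bool:
    """Hypothetical helper: read a boolean capability flag from the environment.

    How the PR actually parses EP_SUPPORTS_REASONING is not shown in the
    summary; treating "1", "true", "yes", "on" as truthy is an assumption.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def add_reasoning_effort_if_supported(
    params: Dict[str, Any], effort: Optional[str], supported: bool
) -> Dict[str, Any]:
    """Return a copy of params that carries reasoning_effort only when supported."""
    out = dict(params)
    if supported and effort is not None:
        out["reasoning_effort"] = effort
    else:
        # Drop the key so an unsupported model never receives it.
        out.pop("reasoning_effort", None)
    return out


def move_raw_output_to_extra_body(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params with a top-level raw_output moved under extra_body."""
    out = dict(params)
    if "raw_output" in out:
        extra = dict(out.get("extra_body") or {})
        extra["raw_output"] = out.pop("raw_output")
        out["extra_body"] = extra
    return out
```

With a flag such as `SUPPORTS_REASONING = env_flag("EP_SUPPORTS_REASONING")`, a test could then be gated with `@pytest.mark.skipif(not SUPPORTS_REASONING, reason="model does not support reasoning")`. Both helpers return copies rather than mutating their input, so a shared parametrized payload is not altered between test cases.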

eval_protocol/benchmarks/test_glm_streaming_compliance.py

eval_protocol/pytest/default_single_turn_rollout_process.py

eval_protocol/benchmarks/test_glm_streaming_compliance.py

dphuang2

Let's avoid test-specific environment variables in favor of globally configurable environment variables.

dphuang2 · 2025-12-22T19:22:05Z

updated tests

ffadf69

cursor bot reviewed Dec 18, 2025

eval_protocol/benchmarks/test_glm_streaming_compliance.py (outdated)

update2

1e60a66