fix(tests): clean up process groups on timeout#910
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b97e81f56ac8a850910d94546e956803c309d7c9Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b97e81f56ac8a850910d94546e956803c309d7c9Last updated for commit: |
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
WalkthroughThis PR enhances test infrastructure timeout and signal handling across three test execution contexts. It introduces a process-group-aware signal helper, implements watchdog-based timeout enforcement (via timers and threads), adds process title management for component integration tests, and updates subprocess management to escalate signals through process groups. ChangesProcess Group Signal Handling and Timeout Enforcement
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/conftest.py (1)
328-334:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFix Python 3.10 incompatibility with
process_group=0.
asyncio.create_subprocess_exec(..., process_group=0)requires Python 3.11+, but the project supports Python 3.10 (requires-python = ">=3.10"). This will raiseTypeErroron 3.10. Either increase the minimum Python version to 3.11 inpyproject.toml, or implement a version check with a fallback for 3.10.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/conftest.py` around lines 328 - 334, The call to asyncio.create_subprocess_exec currently unconditionally passes process_group=0 which is only supported on Python 3.11+, causing a TypeError on 3.10; update the code around the create_subprocess_exec call to detect Python version (via sys.version_info) and only include process_group=0 when sys.version_info >= (3,11), otherwise call asyncio.create_subprocess_exec without the process_group kwarg (preserving stdout, stderr, env), or alternatively raise/update pyproject.toml to require >=3.11; make the change where the symbol process = await asyncio.create_subprocess_exec(...) is defined so behavior is conditional at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/ci/test_docs_end_to_end/test_runner.py`:
- Around line 345-374: The watchdog can race and send SIGKILL to a recycled PID;
modify the _kill_on_timeout/watchdog logic to use a shared boolean flag
protected by a Lock (e.g., a variable like process_finished and a
threading.Lock) that _kill_on_timeout checks under the lock before calling
os.killpg(proc.pid, signal.SIGKILL), and set that flag to True inside the
try/finally (before calling watchdog.cancel()) in the block that handles
aiperf_process (references: _kill_on_timeout, watchdog, aiperf_process,
proc.pid); this ensures the timer callback will skip killing if the process has
already been marked finished.
---
Outside diff comments:
In `@tests/integration/conftest.py`:
- Around line 328-334: The call to asyncio.create_subprocess_exec currently
unconditionally passes process_group=0 which is only supported on Python 3.11+,
causing a TypeError on 3.10; update the code around the create_subprocess_exec
call to detect Python version (via sys.version_info) and only include
process_group=0 when sys.version_info >= (3,11), otherwise call
asyncio.create_subprocess_exec without the process_group kwarg (preserving
stdout, stderr, env), or alternatively raise/update pyproject.toml to require
>=3.11; make the change where the symbol process = await
asyncio.create_subprocess_exec(...) is defined so behavior is conditional at
runtime.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 68955a6e-87e4-418c-83fe-e98e58f38c14
📒 Files selected for processing (4)
tests/ci/test_docs_end_to_end/test_runner.pytests/component_integration/conftest.pytests/component_integration/test_process_title.pytests/integration/conftest.py
|
@dynamo-ops review |
Clean up AIPerf child process groups when test timeouts fire, while preserving Python 3.10 subprocess compatibility and guarding docs runner watchdog kills against late callbacks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
1ffccdf to
b97e81f
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
| try: | ||
| with redirect_stdout(stdout_tee), redirect_stderr(stderr_tee): | ||
| app(full_args) | ||
| except SystemExit as e: |
There was a problem hiding this comment.
The watchdog can set timed_out and trigger AIPerf's asyncio SIGINT handler, after which the CLI exits via SystemExit and this branch returns its exit code instead of raising TimeoutError. Fix: check the watchdog state after app()/SystemExit and raise TimeoutError whenever it fired, or use a timeout mechanism that does not rely on process-wide SIGINT.
| "Process did not exit after SIGTERM, sending SIGKILL to process group" | ||
| ) | ||
| process.kill() | ||
| _killpg(process, signal.SIGKILL) |
There was a problem hiding this comment.
After sending SIGKILL to the process group, this timeout path raises without awaiting process.wait() or process.communicate(), leaving the group leader unreaped and subprocess resources open. Fix: await process termination after _killpg(process, signal.SIGKILL) before raising.
Summary
Test plan
uv run pytest -n auto tests/component_integration/test_process_title.pyuv run pytest -n auto tests/ci/test_docs_end_to_end/test_runner.py(no tests collected; helper module)uv run pytest -n auto tests/ci/test_docs_end_to_end(no tests collected; helper-only directory)🤖 Generated with Claude Code
Summary by CodeRabbit