Skip to content

fix(tests): clean up process groups on timeout#910

Open
ajcasagrande wants to merge 1 commit into
mainfrom
ajc/proc-name
Open

fix(tests): clean up process groups on timeout#910
ajcasagrande wants to merge 1 commit into
mainfrom
ajc/proc-name

Conversation

@ajcasagrande
Copy link
Copy Markdown
Contributor

@ajcasagrande ajcasagrande commented May 11, 2026

Summary

  • Clean up AIPerf child process groups when component and integration test timeouts fire.
  • Add process-title coverage for group cleanup behavior.
  • Route docs end-to-end runner cleanup through the same timeout-safe process handling.

Test plan

  • uv run pytest -n auto tests/component_integration/test_process_title.py
  • uv run pytest -n auto tests/ci/test_docs_end_to_end/test_runner.py (no tests collected; helper module)
  • uv run pytest -n auto tests/ci/test_docs_end_to_end (no tests collected; helper-only directory)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Implemented timeout enforcement for test execution to prevent indefinite hangs and improve reliability.
    • Enhanced process management with improved signal handling for subprocess cleanup.
    • Added process title configuration for integration tests to improve debugging visibility.

Review Change Stack

@github-actions github-actions Bot added the fix label May 11, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 11, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b97e81f56ac8a850910d94546e956803c309d7c9

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b97e81f56ac8a850910d94546e956803c309d7c9

Last updated for commit: b97e81fBrowse code

@ajcasagrande ajcasagrande self-assigned this May 11, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 11, 2026

Warning

Rate limit exceeded

@ajcasagrande has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 40 minutes and 44 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 25fa9aeb-2776-4143-9d6b-3bd2f6a35a92

📥 Commits

Reviewing files that changed from the base of the PR and between 7c44085 and b97e81f.

📒 Files selected for processing (4)
  • tests/ci/test_docs_end_to_end/test_runner.py
  • tests/component_integration/conftest.py
  • tests/component_integration/test_process_title.py
  • tests/integration/conftest.py

Walkthrough

This PR enhances test infrastructure timeout and signal handling across three test execution contexts. It introduces a process-group-aware signal helper, implements watchdog-based timeout enforcement (via timers and threads), adds process title management for component integration tests, and updates subprocess management to escalate signals through process groups.

Changes

Process Group Signal Handling and Timeout Enforcement

Layer / File(s) Summary
Foundation Helper
tests/integration/conftest.py
_killpg(process, sig) helper sends signals to subprocess process groups; suppress import added for ProcessLookupError handling.
Process Title Infrastructure
tests/component_integration/conftest.py
COMPONENT_INTEGRATION_PROCESS_TITLE constant and autouse package-scoped fixture set process title; threading import added; mock_os_exit updated to use _raise_system_exit helper.
Docs End-to-End Timeout
tests/ci/test_docs_end_to_end/test_runner.py
Imports expanded to include signal.SIGKILL and suppress; AIPerf subprocess started with start_new_session=True; watchdog threading.Timer sends SIGKILL to process group on timeout with try/finally cleanup.
Component Integration Timeout
tests/component_integration/conftest.py
aiperf_runner fixture docstring describes in-process synchronous runner; watchdog daemon thread sends SIGINT to current process on timeout; KeyboardInterrupt converted to TimeoutError with timeout duration message; watchdog cleanly stopped via threading.Event and join in finally block.
Integration Subprocess Management
tests/integration/conftest.py
aiperf_runner and AIPerfSignalCLI subprocesses created with process_group=0; timeout escalation sends process-group SIGTERM after SIGINT, then process-group SIGKILL on further timeout; cancellation paths updated to use process-group SIGKILL via _killpg.
Process Title Validation
tests/component_integration/test_process_title.py
New test module adds two integration tests: one validating that _set_component_integration_process_title() sets title to aiperf component_integration_test; one confirming BaseService._set_process_title() is no-op (disabled) in component integration context.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 A watchdog guard for every test,
Signals flow through process groups blessed,
Timeouts tamed with timers keen,
Process titles in between,
Safe termination, clean and best!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(tests): clean up process groups on timeout' directly and precisely summarizes the primary change across all modified files—implementing proper cleanup of AIPerf child process groups during test timeouts.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/conftest.py (1)

328-334: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix Python 3.10 incompatibility with process_group=0.

asyncio.create_subprocess_exec(..., process_group=0) requires Python 3.11+, but the project supports Python 3.10 (requires-python = ">=3.10"). This will raise TypeError on 3.10. Either increase the minimum Python version to 3.11 in pyproject.toml, or implement a version check with a fallback for 3.10.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/conftest.py` around lines 328 - 334, The call to
asyncio.create_subprocess_exec currently unconditionally passes process_group=0
which is only supported on Python 3.11+, causing a TypeError on 3.10; update the
code around the create_subprocess_exec call to detect Python version (via
sys.version_info) and only include process_group=0 when sys.version_info >=
(3,11), otherwise call asyncio.create_subprocess_exec without the process_group
kwarg (preserving stdout, stderr, env), or alternatively raise/update
pyproject.toml to require >=3.11; make the change where the symbol process =
await asyncio.create_subprocess_exec(...) is defined so behavior is conditional
at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/ci/test_docs_end_to_end/test_runner.py`:
- Around line 345-374: The watchdog can race and send SIGKILL to a recycled PID;
modify the _kill_on_timeout/watchdog logic to use a shared boolean flag
protected by a Lock (e.g., a variable like process_finished and a
threading.Lock) that _kill_on_timeout checks under the lock before calling
os.killpg(proc.pid, signal.SIGKILL), and set that flag to True inside the
try/finally (before calling watchdog.cancel()) in the block that handles
aiperf_process (references: _kill_on_timeout, watchdog, aiperf_process,
proc.pid); this ensures the timer callback will skip killing if the process has
already been marked finished.

---

Outside diff comments:
In `@tests/integration/conftest.py`:
- Around line 328-334: The call to asyncio.create_subprocess_exec currently
unconditionally passes process_group=0 which is only supported on Python 3.11+,
causing a TypeError on 3.10; update the code around the create_subprocess_exec
call to detect Python version (via sys.version_info) and only include
process_group=0 when sys.version_info >= (3,11), otherwise call
asyncio.create_subprocess_exec without the process_group kwarg (preserving
stdout, stderr, env), or alternatively raise/update pyproject.toml to require
>=3.11; make the change where the symbol process = await
asyncio.create_subprocess_exec(...) is defined so behavior is conditional at
runtime.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 68955a6e-87e4-418c-83fe-e98e58f38c14

📥 Commits

Reviewing files that changed from the base of the PR and between 79de74e and 7c44085.

📒 Files selected for processing (4)
  • tests/ci/test_docs_end_to_end/test_runner.py
  • tests/component_integration/conftest.py
  • tests/component_integration/test_process_title.py
  • tests/integration/conftest.py

Comment thread tests/ci/test_docs_end_to_end/test_runner.py Outdated
@ajcasagrande
Copy link
Copy Markdown
Contributor Author

@dynamo-ops review

Clean up AIPerf child process groups when test timeouts fire, while preserving Python 3.10 subprocess compatibility and guarding docs runner watchdog kills against late callbacks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

try:
with redirect_stdout(stdout_tee), redirect_stderr(stderr_tee):
app(full_args)
except SystemExit as e:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The watchdog can set timed_out and trigger AIPerf's asyncio SIGINT handler, after which the CLI exits via SystemExit and this branch returns its exit code instead of raising TimeoutError. Fix: check the watchdog state after app()/SystemExit and raise TimeoutError whenever it fired, or use a timeout mechanism that does not rely on process-wide SIGINT.

"Process did not exit after SIGTERM, sending SIGKILL to process group"
)
process.kill()
_killpg(process, signal.SIGKILL)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After sending SIGKILL to the process group, this timeout path raises without awaiting process.wait() or process.communicate(), leaving the group leader unreaped and subprocess resources open. Fix: await process termination after _killpg(process, signal.SIGKILL) before raising.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants