fix(ci): export MALLOC_ARENA_MAX=2 before pytest for component_integration #950

Open
ajcasagrande wants to merge 1 commit into main from ajc/fix-tests
Conversation

@ajcasagrande
Contributor

@ajcasagrande ajcasagrande commented May 16, 2026

Summary

  • Ubuntu CI (3.11/3.12/3.13) for unit + component_integration tests has been failing since #912 (feat: YAML-native v2 config + adaptive sweep orchestrator with BO & search recipes); macOS is fine. Symptom: ~66/83 systematic FAILEDs the moment any test calls app(...) in-process, ending in an xdist INTERNALERROR: list.remove(x): x not in list (the signature of a worker dying mid-run).
  • Root cause: os.environ.setdefault("MALLOC_ARENA_MAX", "2") at tests/component_integration/conftest.py:33 is a no-op for the running pytest worker — glibc reads MALLOC_ARENA_MAX once at process startup, before Python imports run. Component_integration runs aiperf in-process (no subprocesses to inherit the var), so setting it from inside the conftest never actually capped arenas.
  • #912 pulled in BoTorch / Optuna / scipy / torch, which pushed the working set past what glibc's default 8×NCPU arenas can fit on the 2-CPU GitHub runners, so xdist workers crashed.
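
To make the arena arithmetic concrete, here is a rough sketch of the upper bounds involved. It assumes glibc's documented default of 8 arenas per core on 64-bit systems (2 per core on 32-bit) and assumes `-n auto` starts one xdist worker per CPU; the helper names are illustrative and not part of aiperf. These are ceilings, since glibc only creates arenas on demand.

```python
def default_arena_limit(ncpu: int, long_size_bytes: int = 8) -> int:
    """Default glibc arena ceiling per process when MALLOC_ARENA_MAX is unset."""
    per_core = 2 if long_size_bytes == 4 else 8
    return per_core * ncpu


def total_arenas(workers: int, per_process_limit: int) -> int:
    """Upper bound on arenas summed over all worker processes."""
    return workers * per_process_limit


ncpu = 2      # runner size quoted in this PR
workers = 2   # assumption: `-n auto` starts one worker per CPU

uncapped = total_arenas(workers, default_arena_limit(ncpu))
capped = total_arenas(workers, 2)  # MALLOC_ARENA_MAX=2 in every worker
print(f"uncapped ceiling: {uncapped} arenas, capped ceiling: {capped} arenas")
```

Under these assumptions the cap lowers the ceiling from 32 arenas to 4 across the run.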

Fix

Prepend MALLOC_ARENA_MAX=2 to the four pytest invocations in Makefile that target tests/component_integration/ (test-ci, test-component-integration, test-component-integration-ci, test-component-integration-verbose), so the var is already in the shell env when pytest starts and spawns its workers. Rewrite the misleading conftest comment to point at the Makefile.
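
For anyone launching the suite from a script rather than the Makefile, the same idea can be sketched in Python: put the variable in the environment of the new process before it starts. `run_with_arena_cap` is a hypothetical helper, and the probe command below only prints the variable; it stands in for a real pytest command line.

```python
import os
import subprocess
import sys


def run_with_arena_cap(cmd: list[str], cap: int = 2) -> subprocess.CompletedProcess:
    """Run cmd with MALLOC_ARENA_MAX already present when the process starts."""
    env = {**os.environ, "MALLOC_ARENA_MAX": str(cap)}
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)


# Probe process: reports what it sees in its own environment.
probe = [sys.executable, "-c", "import os; print(os.environ.get('MALLOC_ARENA_MAX'))"]
result = run_with_arena_cap(probe)
print(result.stdout.strip())
```

Because the variable is part of the environment passed at process creation, glibc in the child can see it during its own startup, which is the property the Makefile prefix relies on.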

tests/integration/conftest.py is left alone — that suite spawns aiperf as a subprocess, and the conftest setdefault correctly propagates the var to children.
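
The subprocess case can be illustrated with a small sketch: a value written to `os.environ` in the parent is inherited by processes started afterwards, and `setdefault` leaves an existing value alone. The probe command is a stand-in for an aiperf subprocess, not the real invocation.

```python
import os
import subprocess
import sys

PROBE = [sys.executable, "-c", "import os; print(os.environ.get('MALLOC_ARENA_MAX'))"]


def child_value() -> str:
    """Start a child process and return the MALLOC_ARENA_MAX value it sees."""
    done = subprocess.run(PROBE, capture_output=True, text=True, check=True)
    return done.stdout.strip()


# Case 1: variable unset in the parent, so setdefault supplies it for children.
os.environ.pop("MALLOC_ARENA_MAX", None)
os.environ.setdefault("MALLOC_ARENA_MAX", "2")
fresh = child_value()

# Case 2: variable already set (e.g. by the shell), so setdefault does not override it.
os.environ["MALLOC_ARENA_MAX"] = "4"
os.environ.setdefault("MALLOC_ARENA_MAX", "2")
preset = child_value()

print(fresh, preset)
```

The running parent process is unaffected in both cases; only children started after the assignment pick the value up.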

Test plan

  • MALLOC_ARENA_MAX=2 uv run pytest tests/component_integration/cli/ -m component_integration -n auto → 15/15 passing locally
  • make -n test-ci and make -n test-component-integration-ci show MALLOC_ARENA_MAX=2 in the expanded pytest command
  • CI on this PR shows Ubuntu jobs green
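
The `make -n` check above could be automated along these lines. This is only a sketch: `lines_missing_cap` is a hypothetical helper, the sample text is made up rather than real output from this repository's Makefile, and commands split across lines with backslash continuations are not handled.

```python
def lines_missing_cap(dry_run_output: str,
                      suite: str = "tests/component_integration") -> list[str]:
    """Return pytest command lines for the suite that lack MALLOC_ARENA_MAX=2."""
    missing = []
    for line in dry_run_output.splitlines():
        targets_suite = "pytest" in line and suite in line
        if targets_suite and "MALLOC_ARENA_MAX=2" not in line:
            missing.append(line.strip())
    return missing


# Hypothetical dry-run output, for illustration only.
sample = """\
MALLOC_ARENA_MAX=2 uv run pytest tests/component_integration/ -m component_integration -n auto
uv run pytest tests/component_integration/ -m component_integration -v
uv run pytest tests/unit/
"""

for line in lines_missing_cap(sample):
    print("missing cap:", line)
```

In practice the input would come from capturing `make -n <target>` for each of the four targets.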

Summary by CodeRabbit

  • Chores
    • Updated test environment configuration for component integration tests.

…integration

glibc reads MALLOC_ARENA_MAX at process startup, so setting it from
inside tests/component_integration/conftest.py via os.environ.setdefault
was a no-op for the running pytest worker — by then glibc had already
initialized its arenas.

Component_integration runs aiperf in-process (no subprocesses to
inherit the var), so the only effective place to set it is the shell
env before pytest starts. Prepend MALLOC_ARENA_MAX=2 to the four
pytest invocations in the Makefile that target tests/component_integration/
(test-ci, test-component-integration, test-component-integration-ci,
test-component-integration-verbose), and rewrite the conftest comment
to reflect reality.
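
One way to check whether a variable was present when the process started, as opposed to being added later through os.environ, is to read /proc/self/environ on Linux, which reflects the environment at exec time and is not updated by later changes. The sketch below assumes a Linux /proc filesystem and returns None where it is unavailable; the helper names and the demo variable are illustrative.

```python
import os
from pathlib import Path
from typing import Optional


def parse_environ_block(raw: bytes) -> dict[str, str]:
    """Parse NUL-separated KEY=VALUE entries as found in /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[os.fsdecode(key)] = os.fsdecode(value)
    return env


def startup_env_value(name: str) -> Optional[str]:
    """Value of name in the environment this process was started with, if readable."""
    try:
        raw = Path("/proc/self/environ").read_bytes()
    except OSError:
        return None  # not Linux, or /proc is not accessible
    return parse_environ_block(raw).get(name)


# A variable set after startup is visible in os.environ but not in the startup block.
os.environ["GR_DEMO_SET_LATE"] = "1"  # hypothetical variable name
print(os.environ.get("GR_DEMO_SET_LATE"), startup_env_value("GR_DEMO_SET_LATE"))
```

A conftest could use a check like this to warn when MALLOC_ARENA_MAX was only set in-process; that would be an addition beyond what this PR changes.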

This unbroke Ubuntu CI (3.11/3.12/3.13). macOS was unaffected because
it uses a different allocator. The latent issue surfaced when #912
pulled in BoTorch/Optuna/scipy/torch and pushed the working set past
what glibc's default 8×NCPU arenas could fit on 2-CPU runners,
crashing xdist workers (visible as 66/83 systematic FAILED + final
xdist INTERNALERROR "list.remove(x): x not in list").

Note: tests/integration/conftest.py keeps its setdefault — that suite
spawns aiperf as subprocesses, so the var correctly propagates to
children.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@github-actions

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2cf5b4ccc14fa941d4bcc67b4d32708a7e0d1be1

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2cf5b4ccc14fa941d4bcc67b4d32708a7e0d1be1

Last updated for commit: 2cf5b4c

@github-actions github-actions Bot added the fix label May 16, 2026
@ajcasagrande ajcasagrande enabled auto-merge (squash) May 16, 2026 00:39
@coderabbitai

coderabbitai Bot commented May 16, 2026

Walkthrough

This PR configures component integration test targets to run pytest with the MALLOC_ARENA_MAX=2 environment variable, which limits glibc memory arena allocation. The Makefile targets are updated consistently across test-ci, test-component-integration, test-component-integration-ci, and test-component-integration-verbose. The conftest.py comment is clarified to document that the Makefile export is authoritative, with the in-file setdefault serving as a fallback.

Changes

Memory Arena Configuration for Component Integration Tests

| Layer / File(s) | Summary |
| --- | --- |
| Makefile pytest environment variable (Makefile) | Test targets test-ci, test-component-integration, test-component-integration-ci, and test-component-integration-verbose now set MALLOC_ARENA_MAX=2 for pytest invocations, preserving all existing coverage, verbosity, selection flags, markers, parallelism, and exit-code accumulation. |
| Configuration documentation (tests/component_integration/conftest.py) | Comment clarified to document that glibc reads MALLOC_ARENA_MAX at process startup, with Makefile export as the authoritative source and the setdefault acting as fallback for alternate pytest invocations. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A hare hops through the test arena,
With MALLOC set to two, all serene-a,
The Makefile spoke clear, both near and far,
Each pytest command now shines like a star! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: exporting MALLOC_ARENA_MAX=2 before pytest for component_integration tests in CI, which is the core fix addressing the Ubuntu CI test failures. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/component_integration/conftest.py`:
- Line 31: Replace the ambiguous multiplication character in the comment that
reads "8×NCPU" with a plain ASCII 'x' so it reads "8xNCPU" to avoid Ruff RUF003;
update the comment text in the same location (the comment containing "default
8×NCPU arenas blows out RAM in 2-CPU CI runners. glibc reads this") accordingly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e1ca39c3-7961-4879-9f90-32b364cabb94

📥 Commits

Reviewing files that changed from the base of the PR and between bac953d and 2cf5b4c.

📒 Files selected for processing (2)
  • Makefile
  • tests/component_integration/conftest.py

# integration conftest carries the same setting (gotcha 2026-04-21).
# xdist workers under heavy `-n auto` load. Component_integration runs aiperf
# in-process with full Pydantic / msgspec / tokenizer / torch imports, so the
# default 8×NCPU arenas blows out RAM in 2-CPU CI runners. glibc reads this

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace ambiguous × character to avoid Ruff RUF003 warning.

Use plain x (8xNCPU) in this comment to avoid ambiguous Unicode lint warnings.

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 31-31: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)
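
As a small illustration of what the lint rule is flagging, the sketch below uses the standard `unicodedata` module to tell the multiplication sign (U+00D7) apart from ASCII `x` and then applies the suggested replacement. It is not how Ruff implements RUF003, only a way to see the difference.

```python
import unicodedata

comment = "# default 8×NCPU arenas blows out RAM in 2-CPU CI runners."

# Collect characters whose Unicode name marks them as the multiplication sign.
flagged = [ch for ch in comment if unicodedata.name(ch, "") == "MULTIPLICATION SIGN"]

# Apply the suggested fix: plain ASCII 'x' instead of U+00D7.
fixed = comment.replace("\u00d7", "x")

print(flagged)
print(fixed)
```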

dynamo-ops
dynamo-ops previously approved these changes May 16, 2026
dynamo-ops
dynamo-ops previously approved these changes May 16, 2026
@dynamo-ops dynamo-ops dismissed stale reviews from themself May 16, 2026 01:05

Duplicate approval (review lock race condition — now fixed)
