scripts/build_backgrounds_chrombpnet.py: --gpu N silently overrides outer CUDA_VISIBLE_DEVICES #72

@lucapinello

Background

`scripts/build_backgrounds_chrombpnet.py:200` does:

```python
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)
```

unconditionally — clobbering any pre-set `CUDA_VISIBLE_DEVICES` from the calling shell. So this parallel-launch pattern (suggested in the handoff at `audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md` for sharded multi-GPU runs):

```bash
# Terminal 1 (intended GPU 0):
CUDA_VISIBLE_DEVICES=0 python scripts/build_backgrounds_chrombpnet.py --gpu 0 ...

# Terminal 2 (intended GPU 1):
CUDA_VISIBLE_DEVICES=1 python scripts/build_backgrounds_chrombpnet.py --gpu 0 ...
```

…doesn't work as expected: both jobs end up on physical GPU 0, fighting for memory, because each script invocation overrides `CUDA_VISIBLE_DEVICES=N` from its own `--gpu N` arg.
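The clobbering can be illustrated without a GPU. The sketch below is not taken from the script: `apply_gpu_flag` is a hypothetical stand-in for the assignment at line 200, and a plain dict stands in for `os.environ`.

```python
def apply_gpu_flag(env: dict, gpu: int) -> dict:
    """Mimic the current behaviour: --gpu always replaces the caller's value."""
    updated = dict(env)  # copy so the caller's mapping is left untouched
    updated["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return updated


# The two launches from the handoff pattern, both passing --gpu 0.
terminal_1 = apply_gpu_flag({"CUDA_VISIBLE_DEVICES": "0"}, gpu=0)
terminal_2 = apply_gpu_flag({"CUDA_VISIBLE_DEVICES": "1"}, gpu=0)

# Both environments now name the same physical device.
print(terminal_1["CUDA_VISIBLE_DEVICES"], terminal_2["CUDA_VISIBLE_DEVICES"])
```

The second launch asked for GPU 1 from the shell, but the value the process actually sees is the one derived from `--gpu`.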

Reproduction

During PR #70's Phase 1 redo, two parallel invocations:

  • `CUDA_VISIBLE_DEVICES=0 ... --part variants --assay ATAC_DNASE --gpu 0`
  • `CUDA_VISIBLE_DEVICES=1 ... --part baselines --assay ATAC_DNASE --gpu 0`

…both ended up on physical GPU 0. The second job OOM'd at `MaxAllocSize: 327706624` (~312 MB free between the first job's allocation and a third user's job on the same GPU).

Workaround for the rebuild was to pass `--gpu 1` explicitly to the second invocation and skip the outer `CUDA_VISIBLE_DEVICES` setting — but this is the opposite of what an experienced cluster user would expect.

Suggested fix

```python
# Honour pre-set CUDA_VISIBLE_DEVICES if present; only set from --gpu when nothing
# was passed in via env.
if "CUDA_VISIBLE_DEVICES" not in os.environ:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)
```

Or warn loudly when the script is overriding an existing env var:

```python
if "CUDA_VISIBLE_DEVICES" in os.environ and os.environ["CUDA_VISIBLE_DEVICES"] != str(args.gpu):
    logger.warning(
        "Overriding caller CUDA_VISIBLE_DEVICES=%s with --gpu=%s",
        os.environ["CUDA_VISIBLE_DEVICES"], args.gpu,
    )
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)
```

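The first option is cleaner and matches how `CUDA_VISIBLE_DEVICES` is normally used on shared machines: a value set by the calling shell or scheduler decides which physical GPUs the process may see, and `--gpu` only applies when nothing was set.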
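The two options can also be combined into one helper that is testable without a GPU. This is a minimal sketch, not the script's code: `select_visible_devices` is a hypothetical name, and it assumes `--gpu` is parsed as an integer with a default, so the helper cannot tell an explicit `--gpu 0` from the default.

```python
import logging
import os

logger = logging.getLogger(__name__)


def select_visible_devices(env, gpu):
    """Honour a caller-set CUDA_VISIBLE_DEVICES; fall back to --gpu otherwise.

    `env` is any mutable mapping (os.environ in the real script, a dict in tests).
    Returns the value that ends up in env["CUDA_VISIBLE_DEVICES"].
    """
    preset = env.get("CUDA_VISIBLE_DEVICES")
    if preset is None:
        # Nothing came in from the caller, so --gpu decides.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    elif preset != str(gpu):
        # The caller's choice wins; say so instead of switching devices silently.
        logger.warning(
            "Keeping caller CUDA_VISIBLE_DEVICES=%s; ignoring --gpu=%s",
            preset, gpu,
        )
    return env["CUDA_VISIBLE_DEVICES"]
```

In the script this would be called as `select_visible_devices(os.environ, args.gpu)` before any library initialises CUDA, since the variable is only read at that point.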

Why it matters

The handoff's sharded-multi-GPU invocation pattern requires a workaround that's non-obvious. Anyone following the handoff with `CUDA_VISIBLE_DEVICES=N` will silently land on the wrong GPU.
