[dynamicemb] Add spawn-based multi-worker DataLoader option to MovieL…#412

Merged
shijieliu merged 2 commits into NVIDIA:main from jiashuy:spawn_multi-worker_dataloader
Jun 1, 2026

@jiashuy jiashuy commented Jun 1, 2026

…ens example

Add an opt-in --num_workers argument to the dynamicemb MovieLens example so data loading can be offloaded to subprocesses.

The main process initializes a CUDA context (torch.cuda.set_device in init_runtime), so the default fork start method cannot be used for DataLoader workers: a forked child inherits an unusable copy of the CUDA context and raises "RuntimeError: initialization error" on its first CUDA touch. When --num_workers > 0 the DataLoader is therefore built with the spawn start method (plus persistent_workers), which gives each worker a fresh interpreter with no inherited CUDA state. Workers only do CPU work (dataset reads + collate_fn); the host-to-device copy stays in the main process.

  • Add build_dataloader() helper and route train/dump/load/inc_dump through it.
  • Remove the leftover module-level rank_print override that referenced an undefined module-scope local_rank; the rank-aware print is installed in init_runtime(). This matters now that spawn workers re-import the module.
  • Document --num_workers usage and the spawn rationale in the example README.
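
The gating described above can be sketched roughly as follows. This is not the PR's `build_dataloader()`; the helper name `dataloader_worker_kwargs` is hypothetical, and it only builds the worker-related keyword arguments so the sketch runs without PyTorch installed. The keys `num_workers`, `multiprocessing_context`, and `persistent_workers` are existing `torch.utils.data.DataLoader` parameters.

```python
from typing import Any, Dict


def dataloader_worker_kwargs(num_workers: int) -> Dict[str, Any]:
    """Return the worker-related DataLoader kwargs (hypothetical helper)."""
    if num_workers < 0:
        raise ValueError(f"num_workers must be >= 0, got {num_workers}")

    kwargs: Dict[str, Any] = {"num_workers": num_workers}
    if num_workers > 0:
        # The parent process already holds a CUDA context, so workers are
        # started with spawn (fresh interpreter) instead of fork.
        kwargs["multiprocessing_context"] = "spawn"
        # Spawned workers are expensive to start; keep them alive across epochs.
        kwargs["persistent_workers"] = True
    return kwargs


# Assumed usage, with dataset, sampler, and collate_fn defined elsewhere:
#   DataLoader(dataset, sampler=sampler, collate_fn=collate_fn,
#              **dataloader_worker_kwargs(args.num_workers))
```

Leaving both spawn options out when `num_workers` is 0 keeps the single-process path identical to the previous behavior, and `persistent_workers=True` is only accepted by `DataLoader` when there is at least one worker.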

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR adds opt-in spawn-based multi-worker DataLoader support to the dynamicemb MovieLens example via a new --num_workers argument, and fixes a latent bug where builtins.print was overridden at module scope using an undefined local_rank variable. Both changes are well-motivated: the spawn fix avoids CUDA context inheritance failures in forked workers, and the module-scope print fix prevents worker reimports from crashing before RuntimeContext is initialized.

  • Introduces build_dataloader() to centralize all four DataLoader construction sites (train, dump, load, inc_dump), with spawn context and persistent_workers=True enabled only when num_workers > 0. Input validation (< 0 guard) is present.
  • Removes the broken module-level rank_print override; rank-aware printing is now installed exclusively inside init_runtime(), which is never called in spawn workers.
  • README documents the CUDA/fork incompatibility, the spawn trade-offs, and a recommended CPU-core ceiling for --num_workers.

Confidence Score: 5/5

Safe to merge — the changes are a focused, self-contained refactor with no altered training logic and a correct spawn-based DataLoader implementation.

The module-level print override that referenced undefined local_rank is cleanly removed, the build_dataloader() helper correctly gates spawn/persistent_workers on num_workers > 0, the dataset and collate_fn are picklable, the `if __name__ == '__main__'` guard is in place, and argument validation is present. No logic changes to the model, optimizer, or distributed training path.
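
The picklability requirement exists because spawn sends the dataset and `collate_fn` to each worker through pickle. A small check along these lines, assuming a hypothetical `is_picklable` helper and a toy `collate_fn`, shows the difference between a module-level function and a lambda:

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def collate_fn(samples):
    # Toy stand-in for the example's collate function. Being defined at
    # module level, it is pickled by qualified name, which a spawn worker
    # can resolve after re-importing the module.
    return list(samples)


# A lambda has no importable name, so pickle cannot serialize it.
lambda_is_picklable = is_picklable(lambda batch: batch)
```

The same reasoning explains the `if __name__ == '__main__'` guard: a spawned worker re-imports the script, and without the guard the import would re-run the training entry point.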

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| corelib/dynamicemb/example/example.py | Adds --num_workers arg with spawn-based DataLoader, removes broken module-level rank_print override, and centralizes DataLoader construction in build_dataloader(). Logic is correct: negative guard, picklable dataset, module-level collate_fn, and `if __name__ == '__main__'` guard are all in place. |
| corelib/dynamicemb/example/README.md | Documents --num_workers usage, explains the CUDA context / fork incompatibility, and notes the performance trade-offs of spawn for small datasets. Clear and accurate. |

Sequence Diagram

sequenceDiagram
    participant Main as Main Process
    participant IS as init_runtime()
    participant BD as build_dataloader()
    participant W as Spawn Worker(s)
    participant DL as DataLoader (main)

    Main->>IS: init_runtime()
    IS->>Main: "torch.cuda.set_device(local_rank) + builtins.print = rank_print"
    Main->>BD: build_dataloader(dataset, sampler, args)
    BD->>DL: "DataLoader(multiprocessing_context=spawn, persistent_workers=True) [if num_workers>0]"
    DL-->>W: spawn fresh interpreter per worker
    Note over W: re-imports example.py, init_runtime() NOT called, no CUDA context
    loop Each batch
        DL->>W: __getitem__ + collate_fn (CPU only)
        W-->>DL: KeyedJaggedTensor, labels tensor
        DL-->>Main: batch (host tensors)
        Main->>Main: .to(device) — H2D copy in main process
    end

Reviews (2): Last reviewed commit: "[dynamicemb] Reject negative --num_worke..."

Comment thread corelib/dynamicemb/example/example.py
Comment thread corelib/dynamicemb/example/example.py

jiashuy commented Jun 1, 2026

Test passed

A negative --num_workers passed argparse (it is a valid int) but failed the
`args.num_workers > 0` spawn guard, so the value was handed straight to
DataLoader, which raised a confusing "num_workers option should be
non-negative" error that looked like an internal failure. Validate in
parse_args() and fail fast with a clear argument error instead.
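
The fail-fast validation described in this commit message might look roughly like the sketch below. The default of 0, the help text, and the error wording are assumptions; only the `--num_workers` flag, `parse_args()`, and the non-negative check come from the PR.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="MovieLens example (sketch)")
    parser.add_argument(
        "--num_workers",
        type=int,
        default=0,  # assumed default: opt-in, no worker subprocesses
        help="Number of DataLoader worker processes (0 loads in the main process).",
    )
    args = parser.parse_args(argv)

    # argparse accepts any int, including negatives, so reject them here with
    # a normal usage error instead of letting DataLoader raise later.
    if args.num_workers < 0:
        parser.error(f"--num_workers must be >= 0, got {args.num_workers}")
    return args
```

`parser.error()` prints the usage line plus the message to stderr and exits with status 2, so a bad value is reported the same way as any other malformed argument.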

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jiashuy jiashuy requested a review from shijieliu June 1, 2026 02:19
@shijieliu shijieliu merged commit 3c05d0a into NVIDIA:main Jun 1, 2026