[dynamicemb] Add spawn-based multi-worker DataLoader option to MovieL… by jiashuy · Pull Request #412 · NVIDIA/recsys-examples

jiashuy · 2026-06-01T01:52:05Z

…ens example

Add an opt-in --num_workers argument to the dynamicemb MovieLens example so data loading can be offloaded to subprocesses.

The main process initializes a CUDA context (torch.cuda.set_device in init_runtime), so the default fork start method cannot be used for DataLoader workers: a forked child inherits an unusable copy of the CUDA context and raises "RuntimeError: initialization error" on its first CUDA touch. When --num_workers > 0 the DataLoader is therefore built with the spawn start method (plus persistent_workers), which gives each worker a fresh interpreter with no inherited CUDA state. Workers only do CPU work (dataset reads + collate_fn); the host-to-device copy stays in the main process.

- Add build_dataloader() helper and route train/dump/load/inc_dump through it.
- Remove the leftover module-level rank_print override that referenced an undefined module-scope local_rank; the rank-aware print is installed in init_runtime(). This matters now that spawn workers re-import the module.
- Document --num_workers usage and the spawn rationale in the example README.
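
The gating described above can be sketched roughly as follows. This is not the PR's build_dataloader(); it is a minimal, torch-free illustration that only assembles the keyword arguments one might pass to torch.utils.data.DataLoader. The helper name dataloader_kwargs and the batch_size parameter are assumptions made for the example.

```python
def dataloader_kwargs(num_workers: int, batch_size: int) -> dict:
    """Build DataLoader keyword arguments (hypothetical helper, not from the PR).

    The spawn context and persistent workers are only requested when
    worker subprocesses are actually used (num_workers > 0).
    """
    if num_workers < 0:
        raise ValueError(f"num_workers must be non-negative, got {num_workers}")

    kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    if num_workers > 0:
        # "spawn" starts each worker in a fresh interpreter, so no CUDA
        # state is inherited from the parent process.
        kwargs["multiprocessing_context"] = "spawn"
        # Keep workers alive across epochs to amortize the spawn start-up cost.
        kwargs["persistent_workers"] = True
    return kwargs


# Intended use (assumes torch is installed and dataset/sampler/collate_fn exist):
#   loader = DataLoader(dataset, sampler=sampler, collate_fn=collate_fn,
#                       **dataloader_kwargs(args.num_workers, args.batch_size))
```

Leaving both options out when num_workers is 0 keeps the single-process path identical to a plain DataLoader call.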

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

…ens example

Add an opt-in --num_workers argument to the dynamicemb MovieLens example so data loading can be offloaded to subprocesses.

The main process initializes a CUDA context (torch.cuda.set_device in init_runtime), so the default fork start method cannot be used for DataLoader workers: a forked child inherits an unusable copy of the CUDA context and raises "RuntimeError: initialization error" on its first CUDA touch. When --num_workers > 0 the DataLoader is therefore built with the spawn start method (plus persistent_workers), which gives each worker a fresh interpreter with no inherited CUDA state. Workers only do CPU work (dataset reads + collate_fn); the host-to-device copy stays in the main process.

- Add build_dataloader() helper and route train/dump/load/inc_dump through it.
- Remove the leftover module-level rank_print override that referenced an undefined module-scope local_rank; the rank-aware print is installed in init_runtime(). This matters now that spawn workers re-import the module.
- Document --num_workers usage and the spawn rationale in the example README.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps · 2026-06-01T01:56:04Z

Greptile Summary

This PR adds opt-in spawn-based multi-worker DataLoader support to the dynamicemb MovieLens example via a new --num_workers argument, and fixes a latent bug where builtins.print was overridden at module scope using an undefined local_rank variable. Both changes are well-motivated: the spawn fix avoids CUDA context inheritance failures in forked workers, and the module-scope print fix prevents worker reimports from crashing before RuntimeContext is initialized.

- Introduces build_dataloader() to centralize all four DataLoader construction sites (train, dump, load, inc_dump), with spawn context and persistent_workers=True enabled only when num_workers > 0. Input validation (< 0 guard) is present.
- Removes the broken module-level rank_print override; rank-aware printing is now installed exclusively inside init_runtime(), which is never called in spawn workers.
- README documents the CUDA/fork incompatibility, the spawn trade-offs, and a recommended CPU-core ceiling for --num_workers.

Confidence Score: 5/5

Safe to merge — the changes are a focused, self-contained refactor with no altered training logic and a correct spawn-based DataLoader implementation.

The module-level print override that referenced undefined local_rank is cleanly removed, the build_dataloader() helper correctly gates spawn/persistent_workers on num_workers > 0, the dataset and collate_fn are picklable, the if __name__ == '__main__' guard is in place, and argument validation is present. No logic changes to the model, optimizer, or distributed training path.
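
The picklability and guard points matter because spawn workers receive the dataset and collate_fn by pickling, and they re-import the main module. A small stdlib-only sketch of both requirements is below; the toy collate_fn and the is_picklable helper are illustrative assumptions, not code from example.py.

```python
import pickle


def collate_fn(batch):
    """Toy module-level collate: split (feature, label) pairs into two lists.

    Defined at module scope so a worker can locate it by name after
    re-importing the module.
    """
    features = [item[0] for item in batch]
    labels = [item[1] for item in batch]
    return features, labels


def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps (illustrative helper)."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    # The guard keeps this block from running when a spawn worker
    # re-imports the module.
    print("module-level function picklable:", is_picklable(collate_fn))
    print("lambda picklable:", is_picklable(lambda batch: batch))
```

Functions are pickled by reference to their qualified name, which is why a lambda or a nested function is a poor fit for collate_fn under spawn.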

No files require special attention.

Important Files Changed

Filename	Overview
corelib/dynamicemb/example/example.py	Adds --num_workers arg with spawn-based DataLoader, removes broken module-level rank_print override, and centralizes DataLoader construction in build_dataloader(). Logic is correct: negative guard, picklable dataset, module-level collate_fn, and if __name__ == '__main__' guard are all in place.
corelib/dynamicemb/example/README.md	Documents --num_workers usage, explains the CUDA context / fork incompatibility, and notes the performance trade-offs of spawn for small datasets. Clear and accurate.

Sequence Diagram

sequenceDiagram
    participant Main as Main Process
    participant IS as init_runtime()
    participant BD as build_dataloader()
    participant W as Spawn Worker(s)
    participant DL as DataLoader (main)

    Main->>IS: init_runtime()
    IS->>Main: "torch.cuda.set_device(local_rank) + builtins.print = rank_print"
    Main->>BD: build_dataloader(dataset, sampler, args)
    BD->>DL: "DataLoader(multiprocessing_context=spawn, persistent_workers=True) [if num_workers>0]"
    DL-->>W: spawn fresh interpreter per worker
    Note over W: re-imports example.py, init_runtime() NOT called, no CUDA context
    loop Each batch
        DL->>W: __getitem__ + collate_fn (CPU only)
        W-->>DL: KeyedJaggedTensor, labels tensor
        DL-->>Main: batch (host tensors)
        Main->>Main: .to(device) — H2D copy in main process
    end

Reviews (2): Last reviewed commit: "[dynamicemb] Reject negative --num_worke..."

jiashuy · 2026-06-01T02:13:35Z

Test passed

A negative --num_workers passed argparse (it is a valid int) but failed the `args.num_workers > 0` spawn guard, so the value was handed straight to DataLoader, which raised a confusing "num_workers option should be non-negative" error that looked like an internal failure. Validate in parse_args() and fail fast with a clear argument error instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
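
A fail-fast check of this kind might look like the sketch below. It is an assumption-laden stand-in for the example's parse_args(): the default of 0, the help text, and the error wording are all hypothetical, and the real parser defines many more arguments.

```python
import argparse


def parse_args(argv=None):
    """Parse only --num_workers (hypothetical stand-in for the real parser)."""
    parser = argparse.ArgumentParser(description="num_workers validation sketch")
    parser.add_argument(
        "--num_workers",
        type=int,
        default=0,  # assumed default: opt-in, so no worker subprocesses
        help="number of DataLoader worker subprocesses (0 = load in main process)",
    )
    args = parser.parse_args(argv)
    if args.num_workers < 0:
        # parser.error() prints usage plus the message to stderr and exits
        # with status 2, so the failure reads as a bad argument.
        parser.error(f"--num_workers must be >= 0, got {args.num_workers}")
    return args
```

Validating right after parsing keeps the error next to the flag that caused it, instead of surfacing later from DataLoader construction.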

jiashuy mentioned this pull request Jun 1, 2026

[dynamicemb] Fix Torch 2.11 DataLoader worker initialization in MovieLens example #368

Closed

3 tasks

greptile-apps Bot reviewed Jun 1, 2026


Comment thread corelib/dynamicemb/example/example.py

Comment thread corelib/dynamicemb/example/example.py

jiashuy requested a review from shijieliu June 1, 2026 02:19

shijieliu approved these changes Jun 1, 2026


shijieliu merged commit 3c05d0a into NVIDIA:main Jun 1, 2026

shijieliu merged 2 commits into NVIDIA:main from jiashuy:spawn_multi-worker_dataloader
