[dynamicemb] Add spawn-based multi-worker DataLoader option to MovieL…#412
…ens example
Add an opt-in --num_workers argument to the dynamicemb MovieLens example so data loading can be offloaded to subprocesses.
The main process initializes a CUDA context (torch.cuda.set_device in init_runtime), so the default fork start method cannot be used for DataLoader workers: a forked child inherits an unusable copy of the CUDA context and raises "RuntimeError: initialization error" on its first CUDA touch. When --num_workers > 0 the DataLoader is therefore built with the spawn start method (plus persistent_workers), which gives each worker a fresh interpreter with no inherited CUDA state. Workers only do CPU work (dataset reads + collate_fn); the host-to-device copy stays in the main process.
- Add build_dataloader() helper and route train/dump/load/inc_dump through it.
- Remove the leftover module-level rank_print override that referenced an undefined module-scope local_rank; the rank-aware print is installed in init_runtime(). This matters now that spawn workers re-import the module.
- Document --num_workers usage and the spawn rationale in the example README.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
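The gating described above (spawn and persistent_workers only when --num_workers > 0) can be sketched as a small helper that builds the worker-related keyword arguments for torch.utils.data.DataLoader. This is a minimal sketch, not the PR's build_dataloader(): the name dataloader_worker_kwargs is hypothetical, and the helper returns a plain dict so it runs without PyTorch installed.

```python
from typing import Any, Dict


def dataloader_worker_kwargs(num_workers: int) -> Dict[str, Any]:
    """Worker-related kwargs for torch.utils.data.DataLoader.

    Hypothetical helper for illustration; the PR's build_dataloader()
    may be structured differently.
    """
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    kwargs: Dict[str, Any] = {"num_workers": num_workers}
    if num_workers > 0:
        # spawn starts a fresh interpreter per worker, so no CUDA state
        # from the main process is inherited.
        kwargs["multiprocessing_context"] = "spawn"
        # Keep workers alive between epochs instead of re-spawning them.
        kwargs["persistent_workers"] = True
    return kwargs


# Intended use (requires PyTorch, so shown as a comment only):
#   loader = DataLoader(dataset, sampler=sampler, collate_fn=collate_fn,
#                       **dataloader_worker_kwargs(args.num_workers))
```

With num_workers == 0 neither extra option is set, which matters because DataLoader rejects persistent_workers=True when there are no workers.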
Greptile Summary
This PR adds opt-in spawn-based multi-worker
Confidence Score: 5/5
Safe to merge: the changes are a focused, self-contained refactor with no altered training logic and a correct spawn-based DataLoader implementation. The module-level print override that referenced undefined local_rank is cleanly removed, the build_dataloader() helper correctly gates spawn/persistent_workers on num_workers > 0, the dataset and collate_fn are picklable, the `if __name__ == '__main__'` guard is in place, and argument validation is present. No logic changes to the model, optimizer, or distributed training path. No files require special attention.
Sequence Diagram
sequenceDiagram
participant Main as Main Process
participant IS as init_runtime()
participant BD as build_dataloader()
participant W as Spawn Worker(s)
participant DL as DataLoader (main)
Main->>IS: init_runtime()
IS->>Main: "torch.cuda.set_device(local_rank) + builtins.print = rank_print"
Main->>BD: build_dataloader(dataset, sampler, args)
BD->>DL: "DataLoader(multiprocessing_context=spawn, persistent_workers=True) [if num_workers>0]"
DL-->>W: spawn fresh interpreter per worker
Note over W: re-imports example.py, init_runtime() NOT called, no CUDA context
loop Each batch
DL->>W: __getitem__ + collate_fn (CPU only)
W-->>DL: KeyedJaggedTensor, labels tensor
DL-->>Main: batch (host tensors)
Main->>Main: .to(device) — H2D copy in main process
end
Reviews (2): Last reviewed commit: "[dynamicemb] Reject negative --num_worke..."
Test passed
A negative --num_workers passed argparse (it is a valid int) but failed the `args.num_workers > 0` spawn guard, so the value was handed straight to DataLoader, which raised a confusing "num_workers option should be non-negative" error that looked like an internal failure. Validate in parse_args() and fail fast with a clear argument error instead.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
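The fail-fast validation described in this commit could look roughly like the following. It is a sketch under assumptions: the default of 0, the help text, and the exact error wording are illustrative and not taken from the PR, and the real parse_args() defines many more arguments.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="MovieLens example (sketch)")
    # Assumed default: 0 keeps data loading in the main process.
    parser.add_argument(
        "--num_workers",
        type=int,
        default=0,
        help="number of DataLoader worker subprocesses (0 disables workers)",
    )
    args = parser.parse_args(argv)
    if args.num_workers < 0:
        # parser.error() prints the usage line plus this message to stderr
        # and exits with status 2, like any other argparse failure.
        parser.error(f"--num_workers must be >= 0, got {args.num_workers}")
    return args
```

Using parser.error() keeps the failure in the same form as other bad-argument errors, so the user sees the usage line rather than a traceback from inside DataLoader.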