Switch video classification example to MTP pipeline with optimized concurrency by mthrok · Pull Request #1494 · facebookresearch/spdl

mthrok · 2026-05-29T20:28:23Z

Convert `build_pipeline` from single-process to Multi-Threading in subprocess (MTP) architecture, based on configurations discovered by autoresearch (78 experiments, 12 hours of automated optimization).

Architecture change:

- Split pipeline into backend subprocess (fetch → disaggregate → demux) and frontend main process (NVDEC decode → aggregate → collate)
- Demuxed packets are serialized across the process boundary via pickle, isolating CPU-intensive demux work from CUDA kernel scheduling in the training process
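
The split described above can be sketched with the standard library alone. This is not the SPDL implementation: `fake_demux`, `fake_decode`, `run`, `queue_size`, and `batch_size` are hypothetical names used for illustration, and the `fork` start method is assumed only so the sketch runs as a single script on Linux or macOS (a CUDA training process would more likely use `spawn`).

```python
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor

_END = None  # sentinel marking the end of the stream


def fake_demux(clip_id):
    """Hypothetical stand-in for CPU-side demuxing; returns a picklable record."""
    return {"clip_id": clip_id, "payload": bytes(8)}


def backend(clip_ids, queue, num_demux_threads):
    """Runs in the subprocess: demux on a thread pool, ship results to the parent."""
    with ThreadPoolExecutor(max_workers=num_demux_threads) as pool:
        # map() yields results in input order.
        for packets in pool.map(fake_demux, clip_ids):
            queue.put(packets)  # multiprocessing queues pickle each item
    queue.put(_END)


def fake_decode(packets):
    """Hypothetical stand-in for NVDEC decoding in the main process."""
    return packets["clip_id"]


def run(clip_ids, num_demux_threads=3, batch_size=4, queue_size=8):
    """Frontend: receive demuxed records, decode, and aggregate into batches."""
    ctx = mp.get_context("fork")  # assumption: keeps the sketch single-file
    queue = ctx.Queue(maxsize=queue_size)
    proc = ctx.Process(target=backend, args=(clip_ids, queue, num_demux_threads))
    proc.start()

    batches, batch = [], []
    for packets in iter(queue.get, _END):  # stop when the sentinel arrives
        batch.append(fake_decode(packets))
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)  # keep the final partial batch

    proc.join()
    return batches
```

Calling `run(list(range(10)))` returns the ten clip ids grouped into batches of four, with the demux work done entirely in the child process.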

Key parameter changes:

- Add `--num-demux-threads` argument to control demux concurrency independently from decode
- Increase frontend sink buffer from 3 to 5 for smoother NVDEC timing jitter absorption
- Disable automatic GC during training steps; run `gc.collect()` between epochs instead
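
The garbage-collection change in the last item can be sketched as follows. This is a minimal illustration of the pattern, not the code from the PR; `train` and `step_fn` are hypothetical names.

```python
import gc


def train(num_epochs, steps_per_epoch, step_fn):
    """Run training steps with automatic GC off, collecting once per epoch."""
    was_enabled = gc.isenabled()
    gc.disable()  # no automatic collection pauses inside training steps
    try:
        for epoch in range(num_epochs):
            for step in range(steps_per_epoch):
                step_fn(epoch, step)
            gc.collect()  # explicit collection between epochs
    finally:
        if was_enabled:
            gc.enable()  # restore the caller's GC setting
```

Reference counting still frees most objects immediately while automatic collection is disabled; only reference cycles wait for the explicit `gc.collect()` call.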

Results:
The winning autoresearch configuration (`--subclip-duration 0.5 --num-decode-threads 7 --num-demux-threads 3`) achieved 6.6x throughput improvement on Kinetics-400 with R3D-18 (1x8 H100 grandteton): 195 → 1,294 samples/s (3,120 → 20,704 fps).
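
As a quick consistency check on the reported figures, the arithmetic can be reproduced directly. The frames-per-sample value is derived here from the reported numbers; the PR description does not state it.

```python
# Reported throughput before and after the change.
baseline_sps, tuned_sps = 195, 1294    # samples per second
baseline_fps, tuned_fps = 3120, 20704  # frames per second

speedup = tuned_sps / baseline_sps
frames_per_sample_before = baseline_fps / baseline_sps
frames_per_sample_after = tuned_fps / tuned_sps

print(f"speedup: {speedup:.1f}x")
print(f"frames per sample: {frames_per_sample_before:g} -> {frames_per_sample_after:g}")
```

Both fps figures correspond to the same number of frames per sample, so the samples/s and fps speedups are identical.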

The most impactful finding was that reducing demux concurrency from 8 to 3 threads yielded a 3.4x throughput jump due to memory-bandwidth contention at higher thread counts.


meta-codesync · 2026-05-29T20:31:51Z

This pull request has been imported. If you are a Meta employee, you can view this in D106857243. (Because this pull request was imported automatically, there will not be any future comments.)

meta-cla Bot added the CLA Signed label May 29, 2026

mthrok marked this pull request as ready for review May 29, 2026 20:34

mthrok enabled auto-merge (squash) May 29, 2026 20:34

mthrok merged commit fa5e934 into main May 29, 2026
11 of 12 checks passed

mthrok deleted the opt-video-classification branch May 29, 2026 20:52
