Switch video classification example to MTP pipeline with optimized concurrency #1494

Merged
mthrok merged 1 commit into main from opt-video-classification on May 29, 2026

@mthrok mthrok commented May 29, 2026

Convert `build_pipeline` from a single-process pipeline to a Multi-Threading in subprocess (MTP) architecture, based on configurations discovered by autoresearch (78 experiments, 12 hours of automated optimization on MAST).

Architecture change:

  • Split pipeline into backend subprocess (fetch → disaggregate → demux) and frontend main process (NVDEC decode → aggregate → collate)
  • Demuxed packets are serialized across the process boundary via pickle, isolating CPU-intensive demux work from CUDA kernel scheduling in the training process
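A minimal sketch of the backend/frontend split described above, using only the Python standard library. It is not the code from this PR: `demux`, `decode`, `backend`, `run_pipeline`, and `queue_size` are hypothetical names, the "packets" are placeholder byte strings, and the `fork` start method is assumed (POSIX only).

```python
import multiprocessing as mp
import pickle
from concurrent.futures import ThreadPoolExecutor


def demux(path):
    # Stand-in for CPU-intensive demuxing: produce fake packets for one source.
    return {"src": path, "packets": [f"{path}:pkt{i}".encode() for i in range(3)]}


def backend(paths, num_demux_threads, out_queue):
    # Backend subprocess: demux with a small thread pool, then pickle each
    # result explicitly before it crosses the process boundary.
    with ThreadPoolExecutor(max_workers=num_demux_threads) as pool:
        for item in pool.map(demux, paths):
            out_queue.put(pickle.dumps(item))
    out_queue.put(None)  # sentinel: no more items


def decode(item):
    # Stand-in for the frontend decode stage (NVDEC in the real example).
    return (item["src"], len(item["packets"]))


def run_pipeline(paths, num_demux_threads=3, queue_size=5):
    # Assumption: "fork" is available, so no __main__ guard is needed here.
    ctx = mp.get_context("fork")
    queue = ctx.Queue(maxsize=queue_size)
    proc = ctx.Process(target=backend, args=(paths, num_demux_threads, queue))
    proc.start()
    results = []
    # Frontend (main process): unpickle and decode until the sentinel arrives.
    while (blob := queue.get()) is not None:
        results.append(decode(pickle.loads(blob)))
    proc.join()
    return results
```

Calling `run_pipeline(["a.mp4", "b.mp4"])` runs the demux threads in a child process and the decode stand-in in the caller, so the CPU-bound stage never shares an interpreter with the training loop.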

Key parameter changes:

  • Add --num-demux-threads argument to control demux concurrency independently from decode
  • Increase frontend sink buffer from 3 to 5 for smoother NVDEC timing jitter absorption
  • Disable automatic GC during training steps; run gc.collect() between epochs instead
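The GC change in the last bullet can be sketched as follows. This is an illustration under assumptions, not the PR's training loop: `train` and `step_fn` are hypothetical names, and restoring the previous GC state on exit is a design choice of this sketch.

```python
import gc


def train(num_epochs, steps_per_epoch, step_fn):
    # Remember whether automatic GC was on so it can be restored afterwards.
    was_enabled = gc.isenabled()
    gc.disable()  # no automatic collection pauses during training steps
    try:
        for epoch in range(num_epochs):
            for step in range(steps_per_epoch):
                step_fn(epoch, step)
            gc.collect()  # explicit collection between epochs
    finally:
        if was_enabled:
            gc.enable()
```

Reference counting still frees most objects immediately while automatic GC is off; only reference cycles wait for the explicit `gc.collect()` call.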

Results:
The winning autoresearch configuration (--subclip-duration 0.5 --num-decode-threads 7 --num-demux-threads 3) achieved a 6.6x throughput improvement on Kinetics-400 with R3D-18 (1x8 H100 grandteton): 195 → 1,294 samples/s (3,120 → 20,704 fps).
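As a quick check of the reported figures, the arithmetic below recomputes the speedup and the frames-per-sample ratio implied by the samples/s and fps numbers. The variable names are illustrative; the input values are the ones quoted above.

```python
samples_before, samples_after = 195, 1294   # samples/s
fps_before, fps_after = 3120, 20704         # frames/s

# Throughput ratio between the new and old configuration.
speedup = samples_after / samples_before

# Frames per sample implied by each pair of numbers.
frames_per_sample_before = fps_before / samples_before
frames_per_sample_after = fps_after / samples_after

print(f"speedup: {speedup:.1f}x")
print(f"frames per sample: {frames_per_sample_before:g} / {frames_per_sample_after:g}")
```

Both pairs imply the same number of frames per sample (16), so the samples/s and fps figures are consistent with each other.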

The most impactful finding was that reducing demux concurrency from 8 to 3 threads yielded a 3.4x throughput jump, because higher thread counts caused memory-bandwidth contention.

@meta-cla meta-cla Bot added the CLA Signed label May 29, 2026
meta-codesync Bot commented May 29, 2026

This pull request has been imported. If you are a Meta employee, you can view this in D106857243. (Because this pull request was imported automatically, there will not be any future comments.)

@mthrok mthrok marked this pull request as ready for review May 29, 2026 20:34
@mthrok mthrok enabled auto-merge (squash) May 29, 2026 20:34
@mthrok mthrok merged commit fa5e934 into main May 29, 2026
11 of 12 checks passed
@mthrok mthrok deleted the opt-video-classification branch May 29, 2026 20:52