Switch video classification example to MTP pipeline with optimized concurrency #1494
Merged
Convert `build_pipeline` from single-process to Multi-Threading in subprocess (MTP) architecture, based on configurations discovered by autoresearch (78 experiments, 12 hours of automated optimization on MAST).

**Architecture change:**

- Split the pipeline into a backend subprocess (fetch → disaggregate → demux) and a frontend main process (NVDEC decode → aggregate → collate)
- Demuxed packets are serialized across the process boundary via pickle, isolating CPU-intensive demux work from CUDA kernel scheduling in the training process

**Key parameter changes:**

- Add `--num-demux-threads` argument to control demux concurrency independently from decode
- Increase the frontend sink buffer from 3 to 5 to better absorb NVDEC timing jitter
- Disable automatic GC during training steps; run `gc.collect()` between epochs instead

**Results:**

The winning autoresearch configuration (`--subclip-duration 0.5 --num-decode-threads 7 --num-demux-threads 3`) achieved a **6.6x throughput improvement** on Kinetics-400 with R3D-18 (1x8 H100 grandteton): 195 → 1,294 samples/s (3,120 → 20,704 fps). The most impactful finding was that reducing demux concurrency from 8 to 3 threads yielded a 3.4x throughput jump due to memory-bandwidth contention at higher thread counts.
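For readers unfamiliar with the layout, here is a minimal sketch of the backend/frontend split and the GC policy described above. It is not the `build_pipeline` code from this PR: it uses only the Python standard library, and `demux`, `decode`, `run_epoch`, and `train` are hypothetical stand-ins (no real demuxing or NVDEC decoding happens). The defaults mirror the reported values (3 demux threads, 7 decode threads, sink buffer of 5). The `fork` start method is an assumption made for brevity and is Unix-only.

```python
import gc
import multiprocessing as mp
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def demux(clip_id):
    # Placeholder for container demuxing: one fake 4-byte packet per clip.
    return {"clip": clip_id, "packets": bytes([clip_id % 256]) * 4}


def backend(clip_ids, packet_q, num_demux_threads, num_sentinels):
    # Subprocess side: demux on a thread pool. mp.Queue pickles each item
    # as it crosses the process boundary.
    with ThreadPoolExecutor(max_workers=num_demux_threads) as pool:
        for item in pool.map(demux, clip_ids):
            packet_q.put(item)
    for _ in range(num_sentinels):
        packet_q.put(None)  # one end-of-stream marker per decode worker


def decode(item):
    # Placeholder for NVDEC decode: expand packet bytes into "frames".
    return {"clip": item["clip"], "frames": list(item["packets"])}


def run_epoch(clip_ids, num_demux_threads=3, num_decode_threads=7, sink_buffer=5):
    clip_ids = list(clip_ids)
    ctx = mp.get_context("fork")  # assumption: Unix-only, chosen for brevity
    packet_q = ctx.Queue()
    proc = ctx.Process(
        target=backend,
        args=(clip_ids, packet_q, num_demux_threads, num_decode_threads),
    )
    proc.start()  # start the backend before any frontend threads exist

    sink = queue.Queue(maxsize=sink_buffer)  # bounded frontend sink buffer

    def decode_worker():
        # Frontend side: pull pickled packets, decode, push to the sink.
        while (item := packet_q.get()) is not None:
            sink.put(decode(item))

    workers = [threading.Thread(target=decode_worker) for _ in range(num_decode_threads)]
    for w in workers:
        w.start()

    # The consumer (the training loop) takes exactly one result per clip.
    results = [sink.get() for _ in clip_ids]
    for w in workers:
        w.join()
    proc.join()
    return results


def train(num_epochs, clip_ids):
    seen = []
    gc.disable()  # no automatic collections during training steps
    try:
        for _ in range(num_epochs):
            seen.append(sorted(r["clip"] for r in run_epoch(clip_ids)))
            gc.collect()  # explicit collection between epochs
    finally:
        gc.enable()
    return seen
```

Calling `train(2, range(10))` runs two epochs over ten fake clips. Results arrive from the decode workers in arbitrary order, so the sketch sorts clip ids before recording them.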
This pull request has been imported. If you are a Meta employee, you can view this in D106857243. (Because this pull request was imported automatically, there will not be any future comments.)