
Bounded MPMC Queue

A bounded, multi-producer multi-consumer (MPMC) queue in Rust — std only, zero external dependencies for the core implementation.

Two implementations are provided side by side, both exposing the same interface (sketched below):

  1. MutexQueue — a Mutex + Condvar baseline (simple, correct, no unsafe)
  2. LockFreeQueue — a lock-free ring buffer based on Vyukov's algorithm (high-throughput, cache-friendly)
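
A minimal sketch of what that shared interface might look like (the trait name and exact signatures are assumptions; the real definition lives in src/lib.rs):

```rust
// Hypothetical interface sketch; see src/lib.rs for the actual trait.
pub trait BoundedQueue<T> {
    fn try_push(&self, item: T) -> Result<(), T>; // non-blocking; Err returns the item when full
    fn try_pop(&self) -> Option<T>;               // non-blocking; None when empty
    fn push(&self, item: T);                      // blocks (with backoff) while full
    fn pop(&self) -> T;                           // blocks (with backoff) while empty
}
```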

Design Decisions

| Decision | Rationale |
|---|---|
| Ring buffer | O(1) push/pop; sequential access is cache-line friendly |
| Power-of-two capacity | Modulo becomes a bitmask AND (index & mask), eliminating expensive division on the hot path |
| Cache-line padding (128 B) | head and tail live in separate 128-byte-aligned structs to prevent false sharing; 128 B also guards against Intel's adjacent cache-line prefetcher (relevant on the i5-12450H used for benchmarking) |
| Exponential backoff (spin → yield → sleep) | Blocking push/pop use progressive backoff to balance latency against CPU usage |
| Minimum capacity of 2 | A single-slot ring buffer can livelock, so the minimum capacity is 2 |
| Two implementations | The mutex version is the correctness reference; the lock-free version is the performance target |
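
To make the padding and indexing rows concrete, here is a minimal sketch (type and field names are illustrative, not the crate's actual definitions):

```rust
use std::sync::atomic::AtomicUsize;

// 128-byte alignment keeps each cursor on its own pair of 64-byte
// cache lines, which also defeats the adjacent-line prefetcher.
#[repr(align(128))]
struct CachePadded<T>(T);

struct Cursors {
    head: CachePadded<AtomicUsize>, // advanced by producers
    tail: CachePadded<AtomicUsize>, // advanced by consumers
}

// With capacity a power of two, mask = capacity - 1 turns the modulo
// into a single AND on the hot path:
fn slot_index(pos: usize, mask: usize) -> usize {
    pos & mask // equivalent to pos % capacity, without the division
}
```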

unsafe Justification

There are exactly two unsafe operations in LockFreeQueue, both in the data path:

1. Writing to UnsafeCell<MaybeUninit<T>> in try_push

```rust
unsafe { (*slot.value.get()).write(item); }
```

Why this is sound: The CAS on head ensures that exactly one producer wins exclusive ownership of the slot at position head. No other thread can access this slot because:

  • Other producers see a different head after the CAS
  • Consumers will not read the slot until we advance slot.sequence with a Release store

2. Reading from UnsafeCell<MaybeUninit<T>> in try_pop

```rust
let item = unsafe { (*slot.value.get()).assume_init_read() };
```

Why this is sound: assume_init_read() performs a bitwise move of the value out of the cell without dropping anything in place, leaving the slot logically uninitialized again. The read itself is sound because the consumer's Acquire load of slot.sequence pairs with the producer's Release store, guaranteeing the data was fully written and is visible before we read it.

Send + Sync impl

```rust
unsafe impl<T: Send> Send for LockFreeQueue<T> {}
unsafe impl<T: Send> Sync for LockFreeQueue<T> {}
```

Why this is sound: T: Send guarantees values can move between threads. The atomic protocol ensures no two threads ever access the same slot's UnsafeCell simultaneously.
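
For context, the types these operations touch presumably look something like the following sketch (field names follow the snippets above; the real definitions live in src/lockfree_queue.rs):

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

// One ring-buffer slot: `sequence` is the per-slot handshake between
// producers and consumers; `value` stays uninitialized until a
// producer wins the slot via the head CAS and writes into it.
struct Slot<T> {
    sequence: AtomicUsize,
    value: UnsafeCell<MaybeUninit<T>>,
}

struct LockFreeQueue<T> {
    buffer: Box<[Slot<T>]>, // length is a power of two
    head: AtomicUsize,      // producer cursor (cache-padded in the real code)
    tail: AtomicUsize,      // consumer cursor (likewise padded)
}
```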


Memory Ordering

| Operation | Ordering | Reason |
|---|---|---|
| head/tail CAS | Relaxed | The CAS only needs atomicity; the data hand-off is ordered by the slot's sequence, not by the cursors. |
| slot.sequence load | Acquire | Ensures all writes by the producer (including the data write) are visible to the consumer. |
| slot.sequence store | Release | Ensures the data write happens-before the sequence update and becomes visible to other threads. |
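
Putting the table together with the two unsafe snippets, a try_push in the style of Vyukov's algorithm might look like this (a sketch continuing the types above, not a verbatim copy of the crate's code):

```rust
use std::sync::atomic::Ordering;

impl<T> LockFreeQueue<T> {
    fn try_push(&self, item: T) -> Result<(), T> {
        let mask = self.buffer.len() - 1; // capacity is a power of two
        let mut pos = self.head.load(Ordering::Relaxed);
        loop {
            let slot = &self.buffer[pos & mask];
            let seq = slot.sequence.load(Ordering::Acquire); // pairs with a consumer's Release
            let diff = seq as isize - pos as isize;
            if diff == 0 {
                // Slot is free and expects this position; try to claim it.
                match self.head.compare_exchange_weak(
                    pos,
                    pos.wrapping_add(1),
                    Ordering::Relaxed, // the CAS itself needs only atomicity
                    Ordering::Relaxed,
                ) {
                    Ok(_) => {
                        // Exclusive ownership of the slot (justification 1 above).
                        unsafe { (*slot.value.get()).write(item) };
                        // Publish: the data write happens-before this Release store.
                        slot.sequence.store(pos.wrapping_add(1), Ordering::Release);
                        return Ok(());
                    }
                    Err(current) => pos = current, // lost the race; retry at the new head
                }
            } else if diff < 0 {
                return Err(item); // slot still holds unconsumed data: queue is full
            } else {
                pos = self.head.load(Ordering::Relaxed); // another producer got ahead
            }
        }
    }
}
```

try_pop mirrors this on tail: it waits for seq == pos + 1, claims the position with the same Relaxed CAS, moves the value out with assume_init_read() (justification 2), and frees the slot for the next lap with a Release store of pos + capacity.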

Building & Running

```bash
# Run all tests (38 tests across both implementations)
cargo test

# Run benchmarks (HTML reports → target/criterion/)
cargo bench

# Run a specific benchmark group
cargo bench -- symmetric/cap_256
```

Test Results

All 38 tests pass across both implementations — tested via a macro that runs every scenario against MutexQueue and LockFreeQueue with identical expectations.
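
The parameterisation might look roughly like this (macro and scenario names are illustrative, not the actual contents of src/tests.rs):

```rust
// Illustrative only: each scenario is stamped out once per implementation,
// so MutexQueue and LockFreeQueue face identical expectations.
macro_rules! queue_tests {
    ($mod_name:ident, $queue:ty) => {
        mod $mod_name {
            use super::*;

            #[test]
            fn fifo_order() {
                let q = <$queue>::new(4);
                q.try_push(1).unwrap();
                q.try_push(2).unwrap();
                assert_eq!(q.try_pop(), Some(1));
                assert_eq!(q.try_pop(), Some(2));
            }
        }
    };
}

queue_tests!(mutex_queue_tests, MutexQueue<u32>);
queue_tests!(lockfree_queue_tests, LockFreeQueue<u32>);
```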


Test Coverage

| # | Test Suite | What it actually catches |
|---|---|---|
| 1 | Basic FIFO ordering | Sequential push/pop logic and array-indexing bugs |
| 2 | try_push / try_pop semantics | Non-blocking return-value correctness (returns Err/None instantly) |
| 3 | Single-producer single-consumer correctness | Baseline data loss or corruption without multi-thread contention |
| 4 | MPMC stress — no lost or duplicated items | ABA problems, race conditions, and skipped sequence numbers (8P/8C, 400k items) |
| 5 | Asymmetric workloads (many-to-few & few-to-many) | Edge cases in backpressure (16P/2C) and consumer starvation (2P/16C) |
| 6 | Blocking semantics: producer blocks when full | Ensures the backoff/spin logic actually parks instead of overwriting data |
| 7 | Blocking semantics: consumer blocks when empty | Ensures consumers don't read uninitialized memory while waiting |
| 8 | Interleaved try_push/try_pop with concurrent push/pop | Races between the blocking and non-blocking APIs mutating the same atomic state |
| 9 | Small-capacity stress test (capacity = 1) | Prevents single-slot livelocks (forces internal capacity to 2) |
| 10 | Rapid wrap-around (modulo test) | Bitmask correctness (pos & mask) and sequence-overflow handling |
| 11 | Drop checking for MaybeUninit via non-Copy types | Validates the custom Drop impl prevents leaks of heap-owning types like String |
| 12 | Contention burst | How well the CAS retry loop handles massive simultaneous collisions |

Benchmark Results

Environment:

  • CPU: Intel Core i5-12450H (8 Cores, 12 Threads)
  • OS: Windows Subsystem for Linux (WSL2)
  • Build: Rust 1.x (Release Profile)

Measured using Criterion.rs — 100 samples per configuration, 50,000 ops per producer thread per iteration.

Symmetric Workloads

Capacity 64

| Threads | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 1P / 1C | 23.5 M/s | 4.8 M/s | 4.9× |
| 2P / 2C | 13.7 M/s | 1.3 M/s | 10.5× |
| 4P / 4C | 12.8 M/s | 618 K/s | 20.7× |
| 8P / 8C | 12.5 M/s | 585 K/s | 21.4× |
| 16P / 16C | 12.0 M/s | 660 K/s | 18.2× |

The violin plot shows lock-free times tightly clustered near zero, while the mutex times spread out to 2.5 seconds at 16 threads:

Capacity 64 — Violin + Comparison

Capacity 256

| Threads | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 1P / 1C | 34.5 M/s | 5.5 M/s | 6.3× |
| 2P / 2C | 21.8 M/s | 2.4 M/s | 9.1× |
| 4P / 4C | 14.7 M/s | 1.4 M/s | 10.5× |
| 8P / 8C | 12.5 M/s | 1.5 M/s | 8.3× |
| 16P / 16C | 12.5 M/s | 1.8 M/s | 6.9× |

Capacity 256 — Violin + Comparison

Capacity 1024

| Threads | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 1P / 1C | 63.7 M/s | 6.3 M/s | 10.1× |
| 2P / 2C | 24.4 M/s | 4.6 M/s | 5.3× |
| 4P / 4C | 16.4 M/s | 2.4 M/s | 6.8× |
| 8P / 8C | 12.7 M/s | 2.2 M/s | 5.8× |
| 16P / 16C | 13.0 M/s | 2.0 M/s | 6.5× |

Capacity 1024 — Violin + Comparison

Asymmetric Workloads

| Configuration | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 16P / 2C (cap 256) | 12.8 M/s | 268 K/s | 47.8× |
| 2P / 16C (cap 256) | 12.4 M/s | 263 K/s | 47.1× |
| 8P / 2C (cap 64) | 12.1 M/s | 344 K/s | 35.2× |

The asymmetric violin plot makes the disparity most dramatic — lock-free handles all three configurations in under 130ms, while the mutex 16P/2C case takes nearly 6 seconds per iteration:

Asymmetric — Violin + Comparison

Analysis

  1. Lock-free scales flat. Throughput stays in the 12–13 M/s range from 4 to 16 thread pairs regardless of capacity. The atomic CAS protocol distributes contention across slots instead of funneling everything through a single lock.

  2. Larger capacity helps lock-free disproportionately. At capacity 1024 with 1 thread pair, lock-free reaches 63.7 M/s — 10× faster than mutex. The ring buffer's sequential memory layout is cache-friendly, letting the CPU prefetch slots instead of stalling on main memory.

  3. Mutex convoy effect is severe. At cap 64 with 4+ threads, mutex throughput collapses below 700 K/s and never recovers. If a consumer is delayed by OS scheduling noise, fast producers quickly fill the queue and all park on the not_full condition variable. When a consumer finally pops an item and calls notify_one(), exactly one producer wakes and must immediately compete to reacquire the mutex; if another running thread has already claimed the lock, the woken producer goes straight back to sleep. The result is a cascade of context switches that burns CPU cycles without making progress (a sketch of this push path follows the list).

  4. Asymmetric workloads expose mutex worst-case. The 16P/2C mutex case at 268 K/s (vs 12.8 M/s lock-free) is a 47.8× gap — the worst ratio in the entire benchmark suite. 16 producers hammering a mutex while 2 consumers struggle to acquire it creates pathological convoy behavior.
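
For reference, a minimal sketch of the Mutex + Condvar push path described in point 3 (field names are illustrative, not the crate's actual definitions); the not_full wait/notify pair is where the convoy forms:

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

struct MutexQueue<T> {
    inner: Mutex<VecDeque<T>>,
    not_full: Condvar,
    not_empty: Condvar,
    capacity: usize,
}

impl<T> MutexQueue<T> {
    fn push(&self, item: T) {
        let mut buf = self.inner.lock().unwrap();
        while buf.len() == self.capacity {
            // Producers pile up here once the queue fills.
            buf = self.not_full.wait(buf).unwrap();
        }
        buf.push_back(item);
        // Wakes one consumer, which must still win the mutex before it
        // can pop — the hand-off that degenerates into a convoy.
        self.not_empty.notify_one();
    }
}
```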


Project Structure

.
├── Cargo.toml
├── README.md
├── LICENSE
├── assets/                  # Benchmark graphs and test screenshots
├── src/
│   ├── lib.rs               # Trait definition, module declarations
│   ├── mutex_queue.rs       # Mutex + Condvar implementation
│   ├── lockfree_queue.rs    # Lock-free ring buffer (Vyukov)
│   └── tests.rs             # Integration tests (parameterised for both impls)
└── benches/
    └── throughput.rs         # Criterion benchmarks
