A bounded, multi-producer multi-consumer (MPMC) queue in Rust — std only, zero external dependencies for the core implementation.
Two implementations are provided side-by-side:

- `MutexQueue` — Mutex + Condvar baseline (simple, correct, no `unsafe`)
- `LockFreeQueue` — lock-free ring buffer based on Vyukov's algorithm (high-throughput, cache-friendly)
| Decision | Rationale |
|---|---|
| Ring buffer | O(1) push/pop, cache-line friendly sequential access |
| Power-of-two capacity | Modulo → bitmask AND (index & mask), eliminating expensive division on the hot path |
| Cache-line padding (128B) | head and tail on separate 128-byte aligned structs to prevent false sharing — 128B guards against Intel's adjacent-sector prefetch (relevant on the i5-12450H used for benchmarking) |
| Exponential backoff (spin → yield → sleep) | Blocking push/pop use progressive backoff to balance latency vs CPU usage |
| Minimum capacity of 2 | A single-slot ring buffer can livelock under this protocol, so requested capacities are rounded up to at least 2 |
| Two implementations | The mutex version is a correctness reference; the lock-free version is the performance target |
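The power-of-two and minimum-capacity decisions combine naturally. The sketch below illustrates the idea; `effective_capacity` is an illustrative name, not necessarily the project's API:

```rust
// Illustrative helper (name is an assumption): round the requested capacity
// up to a power of two, with a floor of 2 to avoid single-slot livelock.
fn effective_capacity(requested: usize) -> usize {
    requested.max(2).next_power_of_two()
}

// With a power-of-two capacity, `pos % cap` equals `pos & (cap - 1)`,
// so the hot path replaces a division with a single AND.
fn wrap(pos: usize, mask: usize) -> usize {
    pos & mask
}
```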
There are exactly two `unsafe` operations on the data path in `LockFreeQueue`, plus the `Send`/`Sync` impls:
```rust
unsafe { (*slot.value.get()).write(item); }
```

**Why this is sound:** The CAS on `head` ensures that exactly one producer wins exclusive ownership of the slot at position `head`. No other thread can access this slot because:

- Other producers see a different `head` after the CAS
- Consumers will not read the slot until we advance `slot.sequence` with a `Release` store
```rust
let item = unsafe { (*slot.value.get()).assume_init_read() };
```

**Why this is sound:** `assume_init_read()` moves the value out of the cell by a bitwise copy, without running a destructor on the slot's old memory. The read itself is sound because the consumer's `Acquire` load of `slot.sequence` pairs with the producer's `Release` store, guaranteeing the data was fully written before we read it.
```rust
unsafe impl<T: Send> Send for LockFreeQueue<T> {}
unsafe impl<T: Send> Sync for LockFreeQueue<T> {}
```

**Why this is sound:** `T: Send` guarantees values can move between threads. The atomic protocol ensures no two threads ever access the same slot's `UnsafeCell` simultaneously.
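Putting the pieces above together, a minimal Vyukov-style sketch might look like the following. This is a sketch under stated assumptions, not the project's actual code: field and method names mirror the snippets above, and a real implementation also needs a `Drop` impl that drains unconsumed items (omitted here).

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering::{Acquire, Relaxed, Release}};

struct Slot<T> {
    sequence: AtomicUsize,
    value: UnsafeCell<MaybeUninit<T>>,
}

pub struct VyukovSketch<T> {
    buffer: Box<[Slot<T>]>,
    mask: usize,
    head: AtomicUsize, // next position producers claim
    tail: AtomicUsize, // next position consumers claim
}

unsafe impl<T: Send> Send for VyukovSketch<T> {}
unsafe impl<T: Send> Sync for VyukovSketch<T> {}

impl<T> VyukovSketch<T> {
    pub fn new(capacity: usize) -> Self {
        let cap = capacity.max(2).next_power_of_two();
        let buffer = (0..cap)
            .map(|i| Slot {
                // Slot i initially advertises that it is free for position i.
                sequence: AtomicUsize::new(i),
                value: UnsafeCell::new(MaybeUninit::uninit()),
            })
            .collect();
        VyukovSketch { buffer, mask: cap - 1, head: AtomicUsize::new(0), tail: AtomicUsize::new(0) }
    }

    pub fn try_push(&self, item: T) -> Result<(), T> {
        let mut pos = self.head.load(Relaxed);
        loop {
            let slot = &self.buffer[pos & self.mask];
            let diff = slot.sequence.load(Acquire).wrapping_sub(pos) as isize;
            if diff == 0 {
                // Slot is free for this position: try to claim it exclusively.
                match self.head.compare_exchange_weak(pos, pos + 1, Relaxed, Relaxed) {
                    Ok(_) => {
                        // unsafe op #1: we won the CAS, so no other thread
                        // touches this slot until the Release store below.
                        unsafe { (*slot.value.get()).write(item) };
                        slot.sequence.store(pos + 1, Release); // publish to consumers
                        return Ok(());
                    }
                    Err(current) => pos = current, // lost the race; retry at new head
                }
            } else if diff < 0 {
                return Err(item); // slot still holds a previous lap's value: full
            } else {
                pos = self.head.load(Relaxed); // another producer moved past us
            }
        }
    }

    pub fn try_pop(&self) -> Option<T> {
        let mut pos = self.tail.load(Relaxed);
        loop {
            let slot = &self.buffer[pos & self.mask];
            let diff = slot.sequence.load(Acquire).wrapping_sub(pos.wrapping_add(1)) as isize;
            if diff == 0 {
                match self.tail.compare_exchange_weak(pos, pos + 1, Relaxed, Relaxed) {
                    Ok(_) => {
                        // unsafe op #2: the Acquire load above paired with the
                        // producer's Release store, so the value is fully written.
                        let item = unsafe { (*slot.value.get()).assume_init_read() };
                        // Mark the slot free for the next lap around the ring.
                        slot.sequence.store(pos.wrapping_add(self.mask + 1), Release);
                        return Some(item);
                    }
                    Err(current) => pos = current,
                }
            } else if diff < 0 {
                return None; // producer hasn't published this position yet: empty
            } else {
                pos = self.tail.load(Relaxed);
            }
        }
    }
}
```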
| Operation | Ordering | Reason |
|---|---|---|
| `head`/`tail` CAS | `Relaxed` | The CAS only needs to be atomic; cross-thread visibility of the data is carried by the sequence numbers, not by `head`/`tail`. |
| `slot.sequence` load | `Acquire` | Ensures all writes by the producer (including the data write) are visible to the consumer. |
| `slot.sequence` store | `Release` | Ensures the data write happens-before the sequence update and becomes visible to other threads. |
```sh
# Run all tests (38 tests across both implementations)
cargo test

# Run benchmarks (HTML reports → target/criterion/)
cargo bench

# Run a specific benchmark group
cargo bench -- symmetric/cap_256
```

All 38 tests pass across both implementations — tested via a macro that runs every scenario against `MutexQueue` and `LockFreeQueue` with identical expectations.
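The macro itself isn't shown here, but its shape is presumably something like the sketch below, where `Channel`, `with_capacity`, and `StandIn` are illustrative stand-ins rather than the project's real names:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Illustrative trait over which the scenarios are written; the real suite
// would implement it for both MutexQueue and LockFreeQueue.
trait Channel<T>: Sized {
    fn with_capacity(cap: usize) -> Self;
    fn try_push(&self, item: T) -> Result<(), T>;
    fn try_pop(&self) -> Option<T>;
}

// Stand-in implementation so this sketch is self-contained.
struct StandIn<T> {
    inner: Mutex<VecDeque<T>>,
    cap: usize,
}

impl<T> Channel<T> for StandIn<T> {
    fn with_capacity(cap: usize) -> Self {
        StandIn { inner: Mutex::new(VecDeque::new()), cap }
    }
    fn try_push(&self, item: T) -> Result<(), T> {
        let mut q = self.inner.lock().unwrap();
        if q.len() == self.cap { Err(item) } else { q.push_back(item); Ok(()) }
    }
    fn try_pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}

// One invocation per implementation stamps out identical scenarios.
// In the real suite each generated function would carry #[test].
macro_rules! queue_suite {
    ($runner:ident, $ty:ty) => {
        fn $runner() {
            // Scenario: FIFO ordering
            let q = <$ty as Channel<i32>>::with_capacity(4);
            for i in 0..4 { assert!(q.try_push(i).is_ok()); }
            for i in 0..4 { assert_eq!(q.try_pop(), Some(i)); }
            // Scenario: try_push fails instantly when full
            let q = <$ty as Channel<i32>>::with_capacity(2);
            assert!(q.try_push(0).is_ok());
            assert!(q.try_push(1).is_ok());
            assert!(q.try_push(2).is_err());
        }
    };
}

queue_suite!(run_standin_suite, StandIn<i32>);
```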
| # | Test Suite | What it actually catches |
|---|---|---|
| 1 | Basic FIFO ordering | Sequential push/pop logic and array indexing bugs. |
| 2 | `try_push` / `try_pop` semantics | Non-blocking return-value correctness (returns `Err`/`None` immediately). |
| 3 | Single-producer single-consumer correctness | Baseline data loss or corruption without multi-thread contention. |
| 4 | MPMC stress — no lost or duplicated items | ABA problems, race conditions, and skipped sequence numbers (8P/8C, 400k items). |
| 5 | Asymmetric workloads (many-to-few & few-to-many) | Edge cases in backpressure (16P/2C) and core starvation (2P/16C). |
| 6 | Blocking semantics: producer blocks when full | Ensures the backoff/spin logic properly halts execution instead of overwriting data. |
| 7 | Blocking semantics: consumer blocks when empty | Ensures consumers don't read uninitialized memory while waiting. |
| 8 | Interleaved `try_push`/`try_pop` with concurrent `push`/`pop` | Race conditions between blocking and non-blocking APIs mutating the same atomic state. |
| 9 | Small capacity stress test (capacity = 1) | Prevents single-slot livelocks (forces internal capacity to 2). |
| 10 | Rapid wrap-around (Modulo test) | Bitwise mask correctness (pos & mask) and sequence overflow handling. |
| 11 | Drop checking for `MaybeUninit` via non-`Copy` types | Validates the custom `Drop` impl prevents leaks of heap-backed types like `String`. |
| 12 | Contention burst | Tests how well the CAS retry loop handles massive, simultaneous collisions. |
Environment:
- CPU: Intel Core i5-12450H (8 Cores, 12 Threads)
- OS: Windows Subsystem for Linux (WSL2)
- Build: Rust 1.x (Release Profile)
Measured using Criterion.rs — 100 samples per configuration, 50,000 ops per producer thread per iteration.
Symmetric workload, capacity 64:

| Threads | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 1P / 1C | 23.5 M/s | 4.8 M/s | 4.9× |
| 2P / 2C | 13.7 M/s | 1.3 M/s | 10.5× |
| 4P / 4C | 12.8 M/s | 618 K/s | 20.7× |
| 8P / 8C | 12.5 M/s | 585 K/s | 21.4× |
| 16P / 16C | 12.0 M/s | 660 K/s | 18.2× |
The violin plot shows lock-free times tightly clustered near 0, while mutex spreads out to 2.5 seconds at 16 threads:
Symmetric workload, capacity 256:

| Threads | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 1P / 1C | 34.5 M/s | 5.5 M/s | 6.3× |
| 2P / 2C | 21.8 M/s | 2.4 M/s | 9.1× |
| 4P / 4C | 14.7 M/s | 1.4 M/s | 10.5× |
| 8P / 8C | 12.5 M/s | 1.5 M/s | 8.3× |
| 16P / 16C | 12.5 M/s | 1.8 M/s | 6.9× |
Symmetric workload, capacity 1024:

| Threads | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 1P / 1C | 63.7 M/s | 6.3 M/s | 10.1× |
| 2P / 2C | 24.4 M/s | 4.6 M/s | 5.3× |
| 4P / 4C | 16.4 M/s | 2.4 M/s | 6.8× |
| 8P / 8C | 12.7 M/s | 2.2 M/s | 5.8× |
| 16P / 16C | 13.0 M/s | 2.0 M/s | 6.5× |
| Configuration | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 16P / 2C (cap 256) | 12.8 M/s | 268 K/s | 47.8× |
| 2P / 16C (cap 256) | 12.4 M/s | 263 K/s | 47.1× |
| 8P / 2C (cap 64) | 12.1 M/s | 344 K/s | 35.2× |
The asymmetric violin plot makes the disparity most dramatic — lock-free handles all three configurations in under 130ms, while the mutex 16P/2C case takes nearly 6 seconds per iteration:
- **Lock-free scales flat.** Throughput stays in the 12–13 M/s range from 4 to 16 thread pairs regardless of capacity. The atomic CAS protocol distributes contention across slots instead of funneling everything through a single lock.

- **Larger capacity helps lock-free disproportionately.** At capacity 1024 with 1 thread pair, lock-free reaches 63.7 M/s — 10× faster than mutex. The ring buffer's sequential memory layout is cache-friendly, letting the CPU prefetch slots instead of stalling on main memory.

- **The mutex convoy effect is severe.** At cap 64 with 4+ threads, mutex throughput collapses below 700 K/s and never recovers. If a consumer is delayed by OS scheduling, fast producers fill the queue and all park on the `not_full` condition variable. When a consumer finally pops an item and calls `notify_one()`, exactly one producer wakes and must immediately compete to reacquire the mutex; if another active thread has already claimed the lock, the woken producer goes straight back to sleep. The result is a cascading loop of wasted context switches.

- **Asymmetric workloads expose the mutex worst case.** The 16P/2C mutex case at 268 K/s (vs 12.8 M/s lock-free) is a 47.8× gap — the worst ratio in the entire benchmark suite. 16 producers hammering a mutex while 2 consumers struggle to acquire it creates pathological convoy behavior.
```
.
├── Cargo.toml
├── README.md
├── LICENSE
├── assets/               # Benchmark graphs and test screenshots
├── src/
│   ├── lib.rs            # Trait definition, module declarations
│   ├── mutex_queue.rs    # Mutex + Condvar implementation
│   ├── lockfree_queue.rs # Lock-free ring buffer (Vyukov)
│   └── tests.rs          # Integration tests (parameterised for both impls)
└── benches/
    └── throughput.rs     # Criterion benchmarks
```





