A bounded, multi-producer multi-consumer (MPMC) queue in Rust — std only, zero external dependencies for the core implementation.
Two implementations are provided side-by-side:

- `MutexQueue` — Mutex + Condvar baseline (simple, correct, no `unsafe`)
- `LockFreeQueue` — lock-free ring buffer based on Vyukov's algorithm (high-throughput, cache-friendly)
| Decision | Rationale |
|---|---|
| Ring buffer | O(1) push/pop, cache-line friendly sequential access |
| Power-of-two capacity | Modulo → bitmask AND (index & mask), eliminating expensive division on the hot path |
| Cache-line padding (128B) | head and tail on separate 128-byte aligned structs to prevent false sharing — 128B guards against Intel's adjacent-sector prefetch (relevant on the i5-12450H used for benchmarking) |
| Exponential backoff (spin → yield → sleep) | Blocking push/pop use progressive backoff to balance latency vs CPU usage |
| Minimum capacity of 2 | A single-slot ring buffer can livelock under this protocol, so requested capacities are rounded up to at least 2 |
| Two implementations | The mutex version is a correctness reference; the lock-free version is the performance target |
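The power-of-two and minimum-capacity decisions combine naturally. The sketch below illustrates the idea; `effective_capacity` is an illustrative name, not necessarily the project's API:

```rust
// Illustrative helper (name is an assumption): round the requested capacity
// up to a power of two, with a floor of 2 to avoid single-slot livelock.
fn effective_capacity(requested: usize) -> usize {
    requested.max(2).next_power_of_two()
}

// With a power-of-two capacity, `pos % cap` equals `pos & (cap - 1)`,
// so the hot path replaces a division with a single AND.
fn wrap(pos: usize, mask: usize) -> usize {
    pos & mask
}
```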
There are exactly two `unsafe` operations on the data path in `LockFreeQueue`, plus the `Send`/`Sync` impls:
```rust
unsafe { (*slot.value.get()).write(item); }
```

**Why this is sound:** The CAS on `head` ensures that exactly one producer wins exclusive ownership of the slot at position `head`. No other thread can access this slot because:

- Other producers see a different `head` after the CAS
- Consumers will not read the slot until we advance `slot.sequence` with a `Release` store
```rust
let item = unsafe { (*slot.value.get()).assume_init_read() };
```

**Why this is sound:** `assume_init_read()` moves the value out of the cell by a bitwise copy, without running a destructor on the slot's old memory. The read itself is sound because the consumer's `Acquire` load of `slot.sequence` pairs with the producer's `Release` store, guaranteeing the data was fully written before we read it.
```rust
unsafe impl<T: Send> Send for LockFreeQueue<T> {}
unsafe impl<T: Send> Sync for LockFreeQueue<T> {}
```

**Why this is sound:** `T: Send` guarantees values can move between threads. The atomic protocol ensures no two threads ever access the same slot's `UnsafeCell` simultaneously.
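Putting the pieces above together, a minimal Vyukov-style sketch might look like the following. This is a sketch under stated assumptions, not the project's actual code: field and method names mirror the snippets above, and a real implementation also needs a `Drop` impl that drains unconsumed items (omitted here).

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering::{Acquire, Relaxed, Release}};

struct Slot<T> {
    sequence: AtomicUsize,
    value: UnsafeCell<MaybeUninit<T>>,
}

pub struct VyukovSketch<T> {
    buffer: Box<[Slot<T>]>,
    mask: usize,
    head: AtomicUsize, // next position producers claim
    tail: AtomicUsize, // next position consumers claim
}

unsafe impl<T: Send> Send for VyukovSketch<T> {}
unsafe impl<T: Send> Sync for VyukovSketch<T> {}

impl<T> VyukovSketch<T> {
    pub fn new(capacity: usize) -> Self {
        let cap = capacity.max(2).next_power_of_two();
        let buffer = (0..cap)
            .map(|i| Slot {
                // Slot i initially advertises that it is free for position i.
                sequence: AtomicUsize::new(i),
                value: UnsafeCell::new(MaybeUninit::uninit()),
            })
            .collect();
        VyukovSketch { buffer, mask: cap - 1, head: AtomicUsize::new(0), tail: AtomicUsize::new(0) }
    }

    pub fn try_push(&self, item: T) -> Result<(), T> {
        let mut pos = self.head.load(Relaxed);
        loop {
            let slot = &self.buffer[pos & self.mask];
            let diff = slot.sequence.load(Acquire).wrapping_sub(pos) as isize;
            if diff == 0 {
                // Slot is free for this position: try to claim it exclusively.
                match self.head.compare_exchange_weak(pos, pos + 1, Relaxed, Relaxed) {
                    Ok(_) => {
                        // unsafe op #1: we won the CAS, so no other thread
                        // touches this slot until the Release store below.
                        unsafe { (*slot.value.get()).write(item) };
                        slot.sequence.store(pos + 1, Release); // publish to consumers
                        return Ok(());
                    }
                    Err(current) => pos = current, // lost the race; retry at new head
                }
            } else if diff < 0 {
                return Err(item); // slot still holds a previous lap's value: full
            } else {
                pos = self.head.load(Relaxed); // another producer moved past us
            }
        }
    }

    pub fn try_pop(&self) -> Option<T> {
        let mut pos = self.tail.load(Relaxed);
        loop {
            let slot = &self.buffer[pos & self.mask];
            let diff = slot.sequence.load(Acquire).wrapping_sub(pos.wrapping_add(1)) as isize;
            if diff == 0 {
                match self.tail.compare_exchange_weak(pos, pos + 1, Relaxed, Relaxed) {
                    Ok(_) => {
                        // unsafe op #2: the Acquire load above paired with the
                        // producer's Release store, so the value is fully written.
                        let item = unsafe { (*slot.value.get()).assume_init_read() };
                        // Mark the slot free for the next lap around the ring.
                        slot.sequence.store(pos.wrapping_add(self.mask + 1), Release);
                        return Some(item);
                    }
                    Err(current) => pos = current,
                }
            } else if diff < 0 {
                return None; // producer hasn't published this position yet: empty
            } else {
                pos = self.tail.load(Relaxed);
            }
        }
    }
}
```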
| Operation | Ordering | Reason |
|---|---|---|
| `head`/`tail` CAS | `Relaxed` | The CAS only needs to be atomic; cross-thread visibility of the data is carried by the sequence numbers, not by `head`/`tail`. |
| `slot.sequence` load | `Acquire` | Ensures all writes by the producer (including the data write) are visible to the consumer. |
| `slot.sequence` store | `Release` | Ensures the data write happens-before the sequence update and becomes visible to other threads. |
```sh
# Run all tests (38 tests across both implementations)
cargo test

# Run benchmarks (HTML reports → target/criterion/)
cargo bench

# Run a specific benchmark group
cargo bench -- symmetric/cap_256
```

All 38 tests pass across both implementations — tested via a macro that runs every scenario against `MutexQueue` and `LockFreeQueue` with identical expectations.
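The macro itself isn't shown here, but its shape is presumably something like the sketch below, where `Channel`, `with_capacity`, and `StandIn` are illustrative stand-ins rather than the project's real names:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Illustrative trait over which the scenarios are written; the real suite
// would implement it for both MutexQueue and LockFreeQueue.
trait Channel<T>: Sized {
    fn with_capacity(cap: usize) -> Self;
    fn try_push(&self, item: T) -> Result<(), T>;
    fn try_pop(&self) -> Option<T>;
}

// Stand-in implementation so this sketch is self-contained.
struct StandIn<T> {
    inner: Mutex<VecDeque<T>>,
    cap: usize,
}

impl<T> Channel<T> for StandIn<T> {
    fn with_capacity(cap: usize) -> Self {
        StandIn { inner: Mutex::new(VecDeque::new()), cap }
    }
    fn try_push(&self, item: T) -> Result<(), T> {
        let mut q = self.inner.lock().unwrap();
        if q.len() == self.cap { Err(item) } else { q.push_back(item); Ok(()) }
    }
    fn try_pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}

// One invocation per implementation stamps out identical scenarios.
// In the real suite each generated function would carry #[test].
macro_rules! queue_suite {
    ($runner:ident, $ty:ty) => {
        fn $runner() {
            // Scenario: FIFO ordering
            let q = <$ty as Channel<i32>>::with_capacity(4);
            for i in 0..4 { assert!(q.try_push(i).is_ok()); }
            for i in 0..4 { assert_eq!(q.try_pop(), Some(i)); }
            // Scenario: try_push fails instantly when full
            let q = <$ty as Channel<i32>>::with_capacity(2);
            assert!(q.try_push(0).is_ok());
            assert!(q.try_push(1).is_ok());
            assert!(q.try_push(2).is_err());
        }
    };
}

queue_suite!(run_standin_suite, StandIn<i32>);
```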
| # | Test Suite | What it actually catches |
|---|---|---|
| 1 | Basic FIFO ordering | Sequential push/pop logic and array indexing bugs. |
| 2 | `try_push` / `try_pop` semantics | Non-blocking return-value correctness (returns `Err`/`None` immediately). |
| 3 | Single-producer single-consumer correctness | Baseline data loss or corruption without multi-thread contention. |
| 4 | MPMC stress — no lost or duplicated items | ABA problems, race conditions, and skipped sequence numbers (8P/8C, 400k items). |
| 5 | Asymmetric workloads (many-to-few & few-to-many) | Edge cases in backpressure (16P/2C) and core starvation (2P/16C). |
| 6 | Blocking semantics: producer blocks when full | Ensures the backoff/spin logic properly halts execution instead of overwriting data. |
| 7 | Blocking semantics: consumer blocks when empty | Ensures consumers don't read uninitialized memory while waiting. |
| 8 | Interleaved `try_push`/`try_pop` with concurrent `push`/`pop` | Race conditions between blocking and non-blocking APIs mutating the same atomic state. |
| 9 | Small capacity stress test (capacity = 1) | Prevents single-slot livelocks (forces internal capacity to 2). |
| 10 | Rapid wrap-around (Modulo test) | Bitwise mask correctness (pos & mask) and sequence overflow handling. |
| 11 | Drop checking for `MaybeUninit` via non-`Copy` types | Validates the custom `Drop` impl prevents leaks of heap-backed types like `String`. |
| 12 | Contention burst | Tests how well the CAS retry loop handles massive, simultaneous collisions. |
Environment:
- CPU: Intel Core i5-12450H (8 Cores, 12 Threads)
- OS: Windows Subsystem for Linux (WSL2)
- Build: Rust 1.x (Release Profile)
Measured using Criterion.rs — 100 samples per configuration, 50,000 ops per producer thread per iteration.
Symmetric workload, capacity 64:

| Threads | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 1P / 1C | 23.5 M/s | 4.8 M/s | 4.9× |
| 2P / 2C | 13.7 M/s | 1.3 M/s | 10.5× |
| 4P / 4C | 12.8 M/s | 618 K/s | 20.7× |
| 8P / 8C | 12.5 M/s | 585 K/s | 21.4× |
| 16P / 16C | 12.0 M/s | 660 K/s | 18.2× |
The violin plot shows lock-free times tightly clustered near 0, while mutex spreads out to 2.5 seconds at 16 threads:
Symmetric workload, capacity 256:

| Threads | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 1P / 1C | 34.5 M/s | 5.5 M/s | 6.3× |
| 2P / 2C | 21.8 M/s | 2.4 M/s | 9.1× |
| 4P / 4C | 14.7 M/s | 1.4 M/s | 10.5× |
| 8P / 8C | 12.5 M/s | 1.5 M/s | 8.3× |
| 16P / 16C | 12.5 M/s | 1.8 M/s | 6.9× |
Symmetric workload, capacity 1024:

| Threads | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 1P / 1C | 63.7 M/s | 6.3 M/s | 10.1× |
| 2P / 2C | 24.4 M/s | 4.6 M/s | 5.3× |
| 4P / 4C | 16.4 M/s | 2.4 M/s | 6.8× |
| 8P / 8C | 12.7 M/s | 2.2 M/s | 5.8× |
| 16P / 16C | 13.0 M/s | 2.0 M/s | 6.5× |
| Configuration | Lock-free | Mutex | Speedup |
|---|---|---|---|
| 16P / 2C (cap 256) | 12.8 M/s | 268 K/s | 47.8× |
| 2P / 16C (cap 256) | 12.4 M/s | 263 K/s | 47.1× |
| 8P / 2C (cap 64) | 12.1 M/s | 344 K/s | 35.2× |
The asymmetric violin plot makes the disparity most dramatic — lock-free handles all three configurations in under 130ms, while the mutex 16P/2C case takes nearly 6 seconds per iteration:
- **Lock-free scales flat.** Throughput stays in the 12–13 M/s range from 4 to 16 thread pairs regardless of capacity. The atomic CAS protocol distributes contention across slots instead of funneling everything through a single lock.

- **Larger capacity helps lock-free disproportionately.** At capacity 1024 with 1 thread pair, lock-free reaches 63.7 M/s — 10× faster than mutex. The ring buffer's sequential memory layout is cache-friendly, letting the CPU prefetch slots instead of stalling on main memory.

- **The mutex convoy effect is severe.** At cap 64 with 4+ threads, mutex throughput collapses below 700 K/s and never recovers. If a consumer is delayed by OS scheduling, fast producers fill the queue and all park on the `not_full` condition variable. When a consumer finally pops an item and calls `notify_one()`, exactly one producer wakes and must immediately compete to reacquire the mutex; if another active thread has already claimed the lock, the woken producer goes straight back to sleep. The result is a cascading loop of wasted context switches.

- **Asymmetric workloads expose the mutex worst case.** The 16P/2C mutex case at 268 K/s (vs 12.8 M/s lock-free) is a 47.8× gap — the worst ratio in the entire benchmark suite. 16 producers hammering a mutex while 2 consumers struggle to acquire it creates pathological convoy behavior.
```
.
├── Cargo.toml
├── README.md
├── LICENSE
├── assets/               # Benchmark graphs and test screenshots
├── src/
│   ├── lib.rs            # Trait definition, module declarations
│   ├── mutex_queue.rs    # Mutex + Condvar implementation
│   ├── lockfree_queue.rs # Lock-free ring buffer (Vyukov)
│   └── tests.rs          # Integration tests (parameterised for both impls)
└── benches/
    └── throughput.rs     # Criterion benchmarks
```





