nvme: deadlock enabling doorbell buffer while doing I/O #1080
Conversation
uh, I spent more time confused about why the lock ordering didn't bite us before than noticing that this "doesn't build".
re:
is it possible that the Cosmos are just like, too fast to hit it for whatever reason? i don’t like that explanation…
(force-pushed: 2e2b31d to f6f2724)
I don't think so, mostly because this has also been on dogfood for like two (?) months and as far as I know no one's VMs there have seen an issue anything like this. we're definitely launching more VMs and testing their bootedness in PHD more often and specifically than people looking at their dogfood VMs, but.. I'm more wondering if there's something interesting to how the Alpine image is submitting I/Os such that it's doing work while configuring doorbell buffers. I assume (but don't know!) that's OK by the NVMe spec but I'll take a closer look at all that tomorrow..
OH yeah, that's plausible. Hm.
hawkw
left a comment
Okay, so I think the fix is correct. I wondered a bit if there might be a nicer way to avoid future bugs here --- let me know if you think this makes sense? I'm okay with approving the PR now without having done any of that, since it fixes the bug, but I wanted to bring it up.
lib/propolis/src/hw/nvme/queue.rs
Outdated
```rust
// Lock the SubQueueState early to order consistently with MemAccessor;
// in all cases we acquire the queue state lock before working with
// memory.
let mut state = self.state.lock();
```
Should there maybe be a similar comment on, say, the declaration of SubQueue::state that documents this relationship? I like that it's been written down here, but I wonder about future code that might also go to use the MemAccessor and might not see this comment.
I wondered if we might want to consider changing this up more substantially so that the lock and the MemAccessor are always acquired together...could we just put the MemAccessor inside the lock, forcing you to lock first? AFAICT that ought to work, though it might be a pain to shuffle things around...
honestly yeah this comment should just be up on that field.
In theory moving acc_mem into QueueState would be straightforward (the shuffling is a bit funky but just "follow the types"), but QueueGuard<'_, CompQueueState>::db_buf_write and friends take a &mut while writing state, and state.state.mem_acc.access_borrow() takes an immutable borrow of the QueueGuard to get to the mutex and the lock guard. so that's unfortunate.
I've settled on just writing this out more appropriately on acc_mem (and making it non-pub, since it doesn't need to be these days, so it's easier to trip over the "be careful about lock ordering" note)
along the way I noticed a really weird case that does seem fine.. CompQueue::reserve_entry does a self.state.lock() but gets a &MemCtx argument, so the caller must have called either acc_mem.access() or acc_mem.access_borrow() and provided a reference. so that's a lock ordering violation, right? technically not: the provided MemCtx comes from the caller's acc_mem node, which is the accessor node for the calling SubQueue that is reserving space. so it's not two locks in the same structure, they're two different queues' locks. very uncomfortable, but fine.
providing the caller's acc_mem avoids a POSIX mutex lock/unlock. that would happen once for each submitted I/O, so this is all quite hot and the more obvious code would be measurably worse for throughput.
yeah, i think this is good, thanks for adding the docs
hawkw
left a comment
great cool awesome okay let's go
```rust
/// This queue's memory accessor node.
///
/// Be careful about lock ordering when using this accessor; access_borrow()
/// holds this node's lock. If a user of this queue state requires both
/// `access_borrow()` and `QueueInner`, the protocol is to lock queue
/// state first and this accessor second.
```
```rust
// Lock the SubQueueState early to conform to lock ordering requirement;
// see docs on QueueState::acc_mem.
let mut state = self.state.lock();
```
perhaps we should say this in the other functions that do the same thing? or not. i don't care.
in all the other cases the caller of self.state.lock() does something with the queue before maybe having to do stuff with memory; this one's "unnatural" if you're going just by who uses what first, so it gets the extra note.
NVMe (2.0) section 5.8
mostly to prove to myself this did happen and doesn't anymore,

```rust
#[phd_testcase]
async fn reboot_a_hundred_times(ctx: &TestCtx) {
    let mut vm = ctx.spawn_default_vm("api_reboot_test").await?;
    vm.launch().await?;
    vm.wait_to_boot().await?;

    use std::time::SystemTime;
    let mut last_boot = SystemTime::now();
    for i in 0..100 {
        eprintln!("reboot {}", i);
        vm.reset().await?;
        vm.wait_to_boot().await?;
        let booted_at = SystemTime::now();
        eprintln!(
            " ... took {}ms",
            booted_at.duration_since(last_boot).unwrap().as_millis()
        );
        last_boot = booted_at;
    }
}
```

this eventually got a VM that was stuck after 67 resets without this patch, and has booted 100 times with the patch. still no idea why this would be more likely on Gimlet than the random (also Milan) other system that tests had previously been on, but I feel a little less sketched about the whole thing with the straightforward reproducer actually hitting it eventually.

ps: useless datapoint: the test Alpine image boots in 3.65 seconds with a debug propolis on my workstation
9a3a93a introduced a bug in the previously-okay SubQueue::pop; before, acc_mem.access() would go through MemAccessor, acquire a reference on the underlying MemCtx (an Arc refcount bump), drop the lock, and continue. immediately after, SubQueue::pop would lock the SubQueue's state and actually perform the pop.
this ordering is backwards from set_db_buf, which locks the SubQueue's state then conditionally performs acc_mem.access() if the buffer is set up as part of an import.
after 9a3a93a, SubQueue::pop uses access_borrow() to avoid contention on the refcount for MemCtx, at the cost of retaining the lock on the SubQueue's MemAccessor node while taking the lock on the same queue's SubQueueState. at this point a concurrent SubQueue::pop and SubQueue::set_db_buf could deadlock; one holds the queue state lock, the other holds the accessor node lock, and both will try to take the other.
resolve this tension by picking a consistent ordering for the SubQueue state and acc_mem locks: SubQueueState first, then acc_mem.{access,_borrow}(). this was already the ordering in CompQueue::push and both set_db_buf, so pop() gets the trivial change.
this fixes #1035 as experienced in https://github.com/oxidecomputer/propolis/runs/66881499333 but I'm left very confused; why did this only come up after @hawkw switched to running these tests on the lab Gimlet? this issue has been present since the commit was landed on Jan 9, and we'd been testing Propolis on Cosmos with it for at least a few weeks before.