nvme: deadlock enabling doorbell buffer while doing I/O #1080
Conversation
uh, I spent more time confused about why the lock ordering didn't bite us before than noticing that this "doesn't build".
re:
is it possible that the Cosmos are just like, too fast to hit it for whatever reason? i don’t like that explanation…
(force-pushed: 2e2b31d to f6f2724)
I don't think so, mostly because this has also been on dogfood for like two (?) months and as far as I know no one's VMs there have seen an issue anything like this. we're definitely launching more VMs and testing their bootedness in PHD more often and specifically than people looking at their dogfood VMs, but.. I'm more wondering if there's something interesting to how the Alpine image is submitting I/Os such that it's doing work while configuring doorbell buffers. I assume (but don't know!) that's OK by the NVMe spec but I'll take a closer look at all that tomorrow..
OH yeah, that's plausible. Hm.
hawkw
left a comment
Okay, so I think the fix is correct. I wondered a bit if there might be a nicer way to avoid future bugs here --- let me know if you think this makes sense? I'm okay with approving the PR now without having done any of that, since it fixes the bug, but I wanted to bring it up.
lib/propolis/src/hw/nvme/queue.rs
Outdated
```rust
// Lock the SubQueueState early to order consistently with MemAccessor;
// in all cases we acquire the queue state lock before working with
// memory.
let mut state = self.state.lock();
```
Should there maybe be a similar comment on, say, the declaration of SubQueue::state that documents this relationship? I like that it's been written down here, but I wonder about future code that might also go to use the MemAccessor and might not see this comment.
I wondered if we might want to consider changing this up more substantially so that the lock and the MemAccessor are always acquired together...could we just put the MemAccessor inside the lock, forcing you to lock first? AFAICT that ought to work, though it might be a pain to shuffle things around...
honestly yeah this comment should just be up on that field.
In theory moving acc_mem into QueueState would be straightforward (the shuffling is a bit funky but just "follow the types"), but QueueGuard<'_, CompQueueState>::db_buf_write and friends take a &mut while writing state, and state.state.mem_acc.access_borrow() takes an immutable borrow of the QueueGuard to get to the mutex and the lock guard. so that's unfortunate.
I've settled on just writing this out more appropriately on acc_mem (and making it non-pub, since it doesn't need to be these days, so it's easier to trip over the "be careful about lock ordering" note)
along the way I noticed a really weird case that does seem fine.. CompQueue::reserve_entry does a self.state.lock() but gets a &MemCtx argument, so the caller must have called either acc_mem.access() or acc_mem.access_borrow() and provided a reference. so that's a lock ordering violation, right? technically not: the provided MemCtx comes from the caller's acc_mem node, which is the accessor node for the calling SubQueue that is reserving space. so it's not two locks in the same structure, they're two different queues' locks. very uncomfortable, but fine.
providing the caller's acc_mem avoids a POSIX mutex lock/unlock. that would happen once for each submitted I/O, so this is all quite hot and the more obvious code would be measurably worse for throughput.
yeah, i think this is good, thanks for adding the docs
hawkw
left a comment
great cool awesome okay let's go
```rust
/// This queue's memory accessor node.
///
/// Be careful about lock ordering when using this accessor; access_borrow()
/// holds this node's lock. If a user of this queue state requires both
/// `access_borrow()` and `QueueInner`, the protocol is to lock queue
/// state first and this accessor second.
```
```rust
// Lock the SubQueueState early to conform to lock ordering requirement;
// see docs on QueueState::acc_mem.
let mut state = self.state.lock();
```
perhaps we should say this in the other functions that do the same thing? or not. i don't care.
in all the other cases the caller of self.state.lock() does something with the queue before maybe having to do stuff with memory; this one's "unnatural" if you're going just by who uses what first, so it gets the extra note.
NVMe (2.0) section 5.8
mostly to prove to myself this did happen and doesn't anymore,

```rust
#[phd_testcase]
async fn reboot_a_hundred_times(ctx: &TestCtx) {
    let mut vm = ctx.spawn_default_vm("api_reboot_test").await?;
    vm.launch().await?;
    vm.wait_to_boot().await?;

    use std::time::SystemTime;
    let mut last_boot = SystemTime::now();
    for i in 0..100 {
        eprintln!("reboot {}", i);
        vm.reset().await?;
        vm.wait_to_boot().await?;
        let booted_at = SystemTime::now();
        eprintln!(
            " ... took {}ms",
            booted_at.duration_since(last_boot).unwrap().as_millis()
        );
        last_boot = booted_at;
    }
}
```

this eventually got a VM that was stuck after 67 resets without this patch, and has booted 100 times with the patch. still no idea why this would be more likely on Gimlet than the random (also Milan) other system that tests had previously been on, but I feel a little less sketched about the whole thing with the straightforward reproducer actually hitting it eventually.

ps: useless datapoint: the test Alpine image boots in 3.65 seconds with a debug propolis on my workstation
9a3a93a introduced a bug in the previously-okay SubQueue::pop; before, acc_mem.access() would go through MemAccessor, acquire a reference on the underlying MemCtx (an Arc refcount bump), drop the lock, and continue. immediately after, SubQueue::pop would lock the SubQueue's state and actually perform the pop.
this ordering is backwards from set_db_buf, which locks the SubQueue's state then conditionally performs acc_mem.access() if the buffer is set up as part of an import.
after 9a3a93a, SubQueue::pop uses access_borrow() to avoid contention on the refcount for MemCtx, at the cost of retaining the lock on the SubQueue's MemAccessor node while taking the lock on the same queue's SubQueueState. at this point a concurrent SubQueue::pop and SubQueue::set_db_buf could deadlock; one holds the queue state lock, the other holds the accessor node lock, and both will try to take the other.
resolve this tension by picking a consistent ordering for the SubQueue state and acc_mem locks: SubQueueState first, then acc_mem.{access,_borrow}(). this was already the ordering in CompQueue::push and both set_db_buf, so pop() gets the trivial change.
this fixes #1035 as experienced in https://github.com/oxidecomputer/propolis/runs/66881499333 but I'm left very confused; why did this only come up after @hawkw switched to running these tests on the lab Gimlet? this issue has been present since the commit was landed on Jan 9, and we'd been testing Propolis on Cosmos with it for at least a few weeks before.