bug: Avahi dead lock#78
Conversation
sthelen-enqs
left a comment
There was a problem hiding this comment.
This looks like a real issue in the current code to me, thanks for finding and analyzing it.
I made some comments on some simplifications I think we could do to keep the code from getting unnecessarily complex. Mainly, I don't think the loop is necessary if this will already be handled by the reannounceWithNewInterfaces. Would you agree, or was there a reason you added the loop anyway?
| // Compare-and-retry loop: snapshot ifaceIndexes, perform DBus calls | ||
| // without holding a.mux (to avoid deadlock with chanListener), then | ||
| // verify the snapshot is still current. If interfaces changed during | ||
| // the DBus phase, discard the work and retry once with fresh data. | ||
| // The announceMux guarantees no external Announce can interleave, so | ||
| // a single retry suffices — further changes are left to the existing | ||
| // reannounceWithNewInterfaces mechanism. | ||
| for attempt := 0; ; attempt++ { |
There was a problem hiding this comment.
Is there a point to running a retry loop here if reannounceWithNewInterfaces will just call this again anyway?
There was a problem hiding this comment.
You are right to question this. This was my attempt to solve a race condition when an interface changes during an Announce's DBus phase. However, with Announce being called right after any interface update this isn't really a problem in my opinion. The current implementation just makes it overly complex. I'll remove the corresponding code.
| a.mux.Lock() | ||
| if a.manualShutdown { | ||
| a.mux.Unlock() | ||
| return fmt.Errorf("mdns: avahi provider is shut down") | ||
| } | ||
| ifaceIndexes := make([]int32, len(a.ifaceIndexes)) | ||
| copy(ifaceIndexes, a.ifaceIndexes) | ||
| a.mux.Unlock() | ||
| return err |
There was a problem hiding this comment.
Does checking for manualShutdown here give us any advantage? manualShutdown could be set by a goroutine on every other line in this function so we need to be ensure that all calls here are safe while Shutdown is blocked on our mux anyway.
A race condition could mean that manualShutdown is set the cycle after we unlock the mux here which doesn't give us an advantage (and makes the code more complex)
Not checking manualShutdown here would let us reuse getIfaceIndexes here
There was a problem hiding this comment.
You are right, the check doesn't give us any advantage. I'll remove it and also use getIfaceIndexes, as you suggest.
Thank you for the review.
| // Release the lock before freeing the old entry group, as | ||
| // EntryGroupFree may block on dBUS and we don't want to hold | ||
| // the mutex during that time. |
There was a problem hiding this comment.
Can we keep this comment?
There was a problem hiding this comment.
I'll restore the comment
|
We can merge this once the conflict is addressed |
Hi,
conducting further tests using pull request #76 I discovered a secondary dead-lock within the implementation of Avahi. I analyzed the issue using a dump of eebus-go. Here are my findings and the chain of events:
Goroutine A (the RPC handler calling Announce) is blocked on the go-avahi Server's mutex inside EntryGroupNew, not on its own state. Tracing the other waiters in the log:
Goroutine B (chanListener) is in AvahiProvider.chanListener → processService → getIfaceIndexes, waiting on AvahiProvider.mu. That is the provider's own mutex — and AvahiProvider.Announce (goroutine A) took that mutex on entry (see avahi.go:210) and is still holding it while stuck inside EntryGroupNew.
Goroutine C (DBus signal dispatcher) is in go-avahi.(*ServiceBrowser).DispatchSignal → runtime.chansend, parked trying to hand a DBus signal to the service-browser channel whose reader is goroutine B (chanListener) — which itself is parked on AvahiProvider.mu.
So the chain is:
Goroutine A — RPC handler (Announce)
holds AvahiProvider.mu
holds hub.muxReg
waits go-avahi Server.mu <-- inside EntryGroupNew
Goroutine B — chanListener
waits AvahiProvider.mu <-- held by A
Goroutine C — DBus signal dispatcher
waits send on ServiceBrowser channel <-- reader is B
Goroutines D, E, … — other RPC handlers
wait hub.muxReg <-- held by A
DBus replies to EntryGroupNew are delivered over this same signal plumbing. With the dispatcher (C) parked on a full channel and the consumer (B) parked on AvahiProvider.mu, the reply that would unblock EntryGroupNew can never arrive. That is a hard deadlock, not a slow DBus — it will never recover on its own.
It'd be great if you could have a look and confirm my findings and the corresponding fix. Thank you.
PS: Please note that the proposed fix is based on my other open pull request #76 and should be merged after it.