bug: Avahi dead lock by randomdudebunchofnumbers · Pull Request #78 · enbility/ship-go

randomdudebunchofnumbers · 2026-04-20T08:42:44Z

Hi,

conducting further tests using pull request #76 I discovered a secondary dead-lock within the implementation of Avahi. I analyzed the issue using a dump of eebus-go. Here are my findings and the chain of events:

Goroutine A (the RPC handler calling Announce) is blocked on the go-avahi Server's mutex inside EntryGroupNew, not on its own state. Tracing the other waiters in the log:

Goroutine B (chanListener) is in AvahiProvider.chanListener → processService → getIfaceIndexes, waiting on AvahiProvider.mu. That is the provider's own mutex — and AvahiProvider.Announce (goroutine A) took that mutex on entry (see avahi.go:210) and is still holding it while stuck inside EntryGroupNew.
Goroutine C (DBus signal dispatcher) is in go-avahi.(*ServiceBrowser).DispatchSignal → runtime.chansend, parked trying to hand a DBus signal to the service-browser channel whose reader is goroutine B (chanListener) — which itself is parked on AvahiProvider.mu.
So the chain is:

Goroutine A — RPC handler (Announce)
holds AvahiProvider.mu
holds hub.muxReg
waits go-avahi Server.mu <-- inside EntryGroupNew

Goroutine B — chanListener
waits AvahiProvider.mu <-- held by A

Goroutine C — DBus signal dispatcher
waits send on ServiceBrowser channel <-- reader is B

Goroutines D, E, … — other RPC handlers
wait hub.muxReg <-- held by A

DBus replies to EntryGroupNew are delivered over this same signal plumbing. With the dispatcher (C) parked on a full channel and the consumer (B) parked on AvahiProvider.mu, the reply that would unblock EntryGroupNew can never arrive. That is a hard deadlock, not a slow DBus — it will never recover on its own.

It'd be great if you could have a look and confirm my findings and the corresponding fix. Thank you.

PS: Please note that the proposed fix is based on my other open pull request #76 and should be merged after it.

coveralls · 2026-04-20T08:46:11Z

coverage: 94.86% (+0.05%) from 94.807% — randomdudebunchofnumbers:bug/avahi_dead_lock into enbility:dev

sthelen-enqs

This looks like a real issue in the current code to me, thanks for finding and analyzing it.
I made some comments on some simplifications I think we could do to keep the code from getting unnecessarily complex. Mainly, I don't think the loop is necessary if this will already be handled by the reannounceWithNewInterfaces. Would you agree, or was there a reason you added the loop anyway?

sthelen-enqs · 2026-04-20T09:03:10Z

+	// Compare-and-retry loop: snapshot ifaceIndexes, perform DBus calls
+	// without holding a.mux (to avoid deadlock with chanListener), then
+	// verify the snapshot is still current. If interfaces changed during
+	// the DBus phase, discard the work and retry once with fresh data.
+	// The announceMux guarantees no external Announce can interleave, so
+	// a single retry suffices — further changes are left to the existing
+	// reannounceWithNewInterfaces mechanism.
+	for attempt := 0; ; attempt++ {


Is there a point to running a retry loop here if reannounceWithNewInterfaces will just call this again anyway?

You are right to question this. This was my attempt to solve a race condition when an interface changes during an Announce's DBus phase. However, with Announce being called right after any interface update this isn't really a problem in my opinion. The current implementation just makes it overly complex. I'll remove the corresponding code.

sthelen-enqs · 2026-04-20T09:08:20Z

+		a.mux.Lock()
+		if a.manualShutdown {
+			a.mux.Unlock()
+			return fmt.Errorf("mdns: avahi provider is shut down")
+		}
+		ifaceIndexes := make([]int32, len(a.ifaceIndexes))
+		copy(ifaceIndexes, a.ifaceIndexes)
 		a.mux.Unlock()
-		return err


Does checking for manualShutdown here give us any advantage? manualShutdown could be set by a goroutine on every other line in this function so we need to be ensure that all calls here are safe while Shutdown is blocked on our mux anyway.
A race condition could mean that manualShutdown is set the cycle after we unlock the mux here which doesn't give us an advantage (and makes the code more complex)

Not checking manualShutdown here would let us reuse getIfaceIndexes here

You are right, the check doesn't give us any advantage. I'll remove it and also use getIfaceIndexes, as you suggest.

Thank you for the review.

sthelen-enqs · 2026-04-20T09:11:11Z

-	// Release the lock before freeing the old entry group, as
-	// EntryGroupFree may block on dBUS and we don't want to hold
-	// the mutex during that time.


Can we keep this comment?

I'll restore the comment

…eIndexes

sthelen-enqs · 2026-06-29T11:28:09Z

We can merge this once the conflict is addressed

Martin.Beims added 5 commits April 16, 2026 22:20

added tests to expose the issue

fdaf2a9

release muxReg before invoking the mDNS call

598d97b

fixed CI race condition

5bc7895

added unit tests to expose the avahi-dead-lock issue

de7ca78

fix: don't hold AvahiProvider.mu across DBus calls in Announce

0be88ca

DerAndereAndi approved these changes Apr 20, 2026

View reviewed changes

sthelen-enqs reviewed Apr 20, 2026

View reviewed changes

sthelen-enqs mentioned this pull request Apr 20, 2026

Hub.SetAutoAccept stalls SHIP stack: mDNS unannounce/announce runs under muxReg write lock #75

Open

sthelen-enqs added the bug Something isn't working label Apr 20, 2026

Martin.Beims added 3 commits April 20, 2026 20:28

remove retry when interfaces change during dbus phase of announce method

849e74e

restored comment

285a70e

removed check for manualShutdown and replaced it with call to getIfac…

a03c606

…eIndexes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bug: Avahi dead lock#78

bug: Avahi dead lock#78
randomdudebunchofnumbers wants to merge 8 commits into
enbility:devfrom
randomdudebunchofnumbers:bug/avahi_dead_lock

randomdudebunchofnumbers commented Apr 20, 2026

Uh oh!

coveralls commented Apr 20, 2026 •

edited

Loading

Uh oh!

sthelen-enqs left a comment

Uh oh!

sthelen-enqs Apr 20, 2026

Uh oh!

randomdudebunchofnumbers Apr 20, 2026

Uh oh!

sthelen-enqs Apr 20, 2026

Uh oh!

randomdudebunchofnumbers Apr 20, 2026

Uh oh!

sthelen-enqs Apr 20, 2026

Uh oh!

randomdudebunchofnumbers Apr 20, 2026

Uh oh!

sthelen-enqs commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

randomdudebunchofnumbers commented Apr 20, 2026

Uh oh!

coveralls commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sthelen-enqs left a comment

Choose a reason for hiding this comment

Uh oh!

sthelen-enqs Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

randomdudebunchofnumbers Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

sthelen-enqs Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

randomdudebunchofnumbers Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

sthelen-enqs Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

randomdudebunchofnumbers Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

sthelen-enqs commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coveralls commented Apr 20, 2026 •

edited

Loading