Wayland: resolve session lock client crashes on sleep, unlock, and output changes#829
Wayland: resolve session lock client crashes on sleep, unlock, and output changes#829Luke458 wants to merge 1 commit into
Conversation
|
Update: Added a second commit to prevent a crash during sleep/DPMS transitions. When display outputs are temporarily disconnected or in a transition state, updateSurfaces() is now prevented from creating lock surfaces on placeholder or invalid screens. This avoids triggering the qFatal check in the QSWaylandSessionLockSurface constructor. |
|
upon further testing, this is an improvement but it doesn't fully resolve the wake from suspend/dpms off crashing. It prevents all wayland clients from crashing but quickshell itself will still crash :/ |
|
Okay final update (I hope), after the second commit there were still 3 issues remaining. The first issue involved a screen state transition abort caused by a The second issue was a Wayland protocol error: The third issue was another protocol error: |
92a0fdd to
1ea83fc
Compare
|
FINAL FINAL commits this time to fix the last couple of issues. Firstly, just fixed the CI failure with qt 6.6.3 as described in the commit. More importantly, resolved a double-free crash with the error |
|
After extensive debugging and validation, I've identified why this crash may be highly reproducible on systems running The issue is triggered by how quickly the open-source AMD GPU stack handles display power transitions. Unlike some proprietary driver setups(Nvidia), Similarly, Hyprland is extremely strict about Wayland object state and the moment unlock_and_destroy is processed, Hyprland immediately invalidates the surface IDs. If the client sends a destroy or commit on those IDs a split-second later, Hyprland raises a fatal protocol error ( invalid object or Null buffer attached ) and kills the client. This exposes a race condition in Quickshell’s client-side surface lifecycle. If the client calls The final fix resolves this by enforcing a strict teardown order. In After that, the window is destroyed synchronously with The destructor also now begins with an early-return guard, Finally, |
|
AvengeMedia/DankMaterialShell#1468 (comment) Many thanks to @bzbetty for testing and proposing this final change ^^ The original reason for switching to synchronous
The current unlock sequence is:
By the time the deferred deletion from step 2 actually executes, the lock role has already been removed in step 1 and the session has already been unlocked in step 3. As a result, any null-buffer commits generated by Qt's normal window-hide or teardown routines are no longer subject to the restrictions of the |
3cc99df to
79f088d
Compare
outfoxxed
left a comment
There was a problem hiding this comment.
Sorry if I missed some reasoning for these. I took a look through qtbase and I'm not seeing the null surface commit on close either.
| this->manager->unlock(); | ||
|
|
||
| for (auto* surface: this->surfaces) { | ||
| surface->deleteLater(); | ||
| if (surface->ext != nullptr) { | ||
| surface->ext->destroySurface(); | ||
| } | ||
| } | ||
|
|
||
| for (auto* surface: this->surfaces) { | ||
| surface->deleteLater(); | ||
| } | ||
| this->surfaces.clear(); | ||
|
|
||
| this->manager->unlock(); |
There was a problem hiding this comment.
According to https://wayland.app/protocols/ext-session-lock-v1#ext_session_lock_v1:request:unlock_and_destroy and https://wayland.app/protocols/ext-session-lock-v1#ext_session_lock_surface_v1:request:destroy, it should be entirely legal and even preferable destroy the surfaces after the lock.
After this request is made, lock surfaces created through this object should be destroyed by the client as they will no longer be used by the compositor.
There was a problem hiding this comment.
While the spec says the client should destroy the surfaces after unlocking, doing so asynchronously in Qt's event loop (via deleteLater()) triggers a race condition.
If we send unlock_and_destroy first, the compositor immediately destroys the server-side lock objects and surfaces. When Qt's event loop subsequently runs the destructors for the window/surfaces, the client tries to send ext_session_lock_surface_v1.destroy. Because the compositor has already torn them down, this request targets a non-existent object ID and triggers a fatal Wayland connection crash (protocol error: invalid object).
Explicitly calling destroySurface() on the extension prior to unlock_and_destroy ensures the client cleanly tears down its Wayland surface role while they are still valid on the server.
There was a problem hiding this comment.
the compositor immediately destroys the server-side lock objects and surfaces
Because the compositor has already torn them down, this request targets a non-existent object ID and triggers a fatal Wayland connection crash (protocol error: invalid object).
This shouldn't happen for the same reasons mentioned in the surface destructor, only on a double destroy of the same surface.
I don't have a problem with moving destroy here if there is some case which commits a null surface like you mentioned but it should be after the unlock.
| QSWaylandSessionLock* lock = nullptr; | ||
| QPointer<QObject> lock = nullptr; |
There was a problem hiding this comment.
Why would this be a pointer to QObject+casting over QsWaylandSessionLock, and what destroy sequence is making the QPointer valuable?
There was a problem hiding this comment.
We need a QPointer because when the session is unlocked, the parent QSWaylandSessionLock is deleted. Since the surfaces themselves are deleted asynchronously (deleteLater()), their destructors run afterwards and need to check if the parent lock is still alive (via this->ext->lock->active()) to avoid a use-after-free segfault on a dangling raw pointer.
Using QPointer<QObject> and casting inside the source files is a workaround to pass compilation under older Qt versions (like 6.6.3 in CI). In those versions, instantiating QPointer<T> on an incomplete, forward-declared type fails compile-time checks. Because QSWaylandSessionLock relies on a generated Wayland header (qwayland-ext-session-lock-v1.h), we cannot include its definition in the public session_lock.hpp header without breaking the include paths of other targets. Using QPointer<QObject> avoids this but please let me know if there's something I'm missing here.
There was a problem hiding this comment.
Since the surfaces themselves are deleted asynchronously (deleteLater()), their destructors run afterwards and need to check if the parent lock is still alive (via this->ext->lock->active()) to avoid a use-after-free segfault on a dangling raw pointer.
since we resolved that thread this should no longer be true
d92828d to
6247351
Compare
…anges - Destroys Wayland lock surface roles synchronously prior to sending unlock_and_destroy. - Avoids creating lock surfaces on invalid/placeholder/removed outputs to prevent early initialization failures. - Fixes Qt 6.6.3 compilation compatibility when QPointer uses incomplete forward-declared types. - Reverts memory cleanup to asynchronous deleteLater() for QML safety.
6247351 to
66b5c24
Compare
|
Hi @bzbetty Sorry to bother you, do you mind rebuilding and testing on your side again? I've simplified the PR to more strictly align with the wayland protocol spec, but I want to make sure it doesn't reintroduce any real-world crashes or race conditions on your end. Seems to be working on my end after testing wake from suspend, dpms off/on and monitor unplug/plug while in sleep and while awake. |
Under the ext-session-lock-v1 protocol, calling unlock_and_destroy immediately invalidates and destroys all associated lock surface objects on the compositor.
In WlSessionLock::unlock(), the manager was unlocked before deleting the surfaces. Because surface->deleteLater() is asynchronous, the destructors ran in the next event loop tick and sent ext_session_lock_surface_v1.destroy requests to the compositor for object IDs that the compositor had already discarded, leading to a fatal invalid object Wayland protocol error and session crash.
Similarly, when screens are removed, deleting them asynchronously can trigger the same race condition.
This PR resolves the issue by synchronously deleting the surfaces first (delete surface;) before unlocking the session, conforming to the correct destruction order.
Fixes #540
Fixes #786
Fixes #733
Fixes #727
Fixes #608
Fixes #571
Fixes #406
Fixes #366
Fixes #368
Fixes #370
Fixes #361
Fixes #832