Refactor server state machine to fix various race conditions and lifetime management issues#502
Merged
Conversation
…time management issues that led to crashes due to accessing deleted objects.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR refactors the server-side async RPC state machine and interop layer to improve lifetime management and reduce shutdown-related race conditions/crashes (per #501), particularly by keeping server/call objects alive until all completion-queue work and LabVIEW user cleanup has completed.
Changes:
- Reworked server CQ processing to rely on CQ shutdown/drain semantics and introduced CQ tag wrappers that hold
shared_ptrto keep call objects alive through async callbacks. - Refactored server call handling (
CallData) to useshared_ptrownership, serialized read/write operations, and added explicit “finish” paths from LabVIEW vs CQ completion. - Added an LV cleanup proc for call cleanup, and fixed a missing
breakin cluster serialization.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/grpc_server.h |
Updates server/call types for shared_ptr lifetime + adds CQ tag wrapper types. |
src/grpc_server.cc |
Refactors CQ loop to drain via _cq->Next() until shutdown, removing manual drain logic. |
src/event_data.cc |
Implements new CallData state machine, read/write serialization, CQ tag behaviors, and finish handling. |
src/grpc_interop.h |
Adds declaration for LV cleanup proc used by call cleanup. |
src/grpc_interop.cc |
Switches server creation to shared_ptr, routes request/response APIs through CallData, and adds cleanup proc plumbing. |
src/cluster_copier.cc |
Adds missing break in Fixed32Value serialization switch. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- public inheritance from CallDataBase
yash-ni
approved these changes
Mar 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR refactors the server state machine to fix a number of race conditions and lifetime management issues that can lead to crashes.
Implementation / Design
GenericMethodDataandCallDataobjects as they were effectively the same thing and had similar lifetimes. TheEventDataandIMessageElementMetadataOwnerbase class/interfaces fromGenericMethodDatawere not being utilized and were deleted as part of the merge. This solved/simplified some lifetime management issues that could lead to crashes since there were back pointers between the two objects that were not ref counted. This led to accessing deleted objects in some cases.CallDatanow implementsstd::enable_shared_from_this<CallData>so it can implement proper ref counting semantics with shared pointers. Since theCallDataneeds to stick around until the user closes the reference and it is no longer being referenced from the completion queue, we now always register callbacks with the completion queue using "tag" wrappers around theCallData. These tags reference theCallDatavia shared pointers and then delete themselves at the end of the callbacks. A shared pointer for theCallDatais also registered with thePointerManagerwhich ensures the reference remains valid until closed from user code.CallDatalifetime is controlled by user code, it can live much longer than the actual underlying RPC. To combat this leading to crashes during shutdown of the server, theCallDatanow maintains a shared pointer reference to thegRPCLabVIEWServer. This prevents the server and by extension the completion queue from getting cleaned up until the call is cleaned up by user code. This is necessary if you want to callCallData::IsCancelledafter shutdown which is used to prevent read/write calls. It also closes race conditions where the server may shutdown immediately after read/write calls proceed and then try to use the completion queue to get notified when the read/write is complete. These calls will complete immediately if the completion queue is shutdown as long as it is still valid and hasn't been deleted.CallDataare now protected by a mutex. This handles user code making multiple read/write calls in parallel which is not supported by the underlying stream we are using or the fact thatCallDataonly maintains a single message instance for managing each of the request/response data. The serialization code to move data between the LV cluster and protobuf message has also been moved into these methods so it is protected by the mutex. The "tags" used for read/write operations now mirror each other which simplifies state management and threading for the overallCallDatasince updates to theWritingResponsestate cannot race with other state transitions. TheWritingResponsestate has been deleted.CallDataso they get cleaned up even if the user aborts the VI and does not shutdown cleanly. We really should be registering cleanup procs for all objects registered with thePointerManager, but that is too much noise to include with this change.Testing
I manually tested using measurement plugins which utilize server streaming calls which remain open until canceled by the client. Since the crash doesn't happen until shutdown/unload of the library, automated testing of this scenario is difficult. I verified the following sequences do not crash: