Skip to content

Refactor server state machine to fix various race conditions and lifetime management issues#502

Merged
jasonmreding merged 4 commits into
masterfrom
users/jreding/RefactorServerStateMachine
Mar 31, 2026
Merged

Refactor server state machine to fix various race conditions and lifetime management issues#502
jasonmreding merged 4 commits into
masterfrom
users/jreding/RefactorServerStateMachine

Conversation

@jasonmreding
Copy link
Copy Markdown
Collaborator

@jasonmreding jasonmreding commented Mar 19, 2026

Description

This PR refactors the server state machine to fix a number of race conditions and lifetime management issues that can lead to crashes.

Implementation / Design

  • Merged the GenericMethodData and CallData objects as they were effectively the same thing and had similar lifetimes. The EventData and IMessageElementMetadataOwner base class/interfaces from GenericMethodData were not being utilized and were deleted as part of the merge. This solved/simplified some lifetime management issues that could lead to crashes since there were back pointers between the two objects that were not ref counted. This led to accessing deleted objects in some cases.
  • CallData now implements std::enable_shared_from_this<CallData> so it can implement proper ref counting semantics with shared pointers. Since the CallData needs to stick around until the user closes the reference and it is no longer being referenced from the completion queue, we now always register callbacks with the completion queue using "tag" wrappers around the CallData. These tags reference the CallData via shared pointers and then delete themselves at the end of the callbacks. A shared pointer for the CallData is also registered with the PointerManager which ensures the reference remains valid until closed from user code.
  • Since the CallData lifetime is controlled by user code, it can live much longer than the actual underlying RPC. To combat this leading to crashes during shutdown of the server, the CallData now maintains a shared pointer reference to the gRPCLabVIEWServer. This prevents the server and by extension the completion queue from getting cleaned up until the call is cleaned up by user code. This is necessary if you want to call CallData::IsCancelled after shutdown which is used to prevent read/write calls. It also closes race conditions where the server may shutdown immediately after read/write calls proceed and then try to use the completion queue to get notified when the read/write is complete. These calls will complete immediately if the completion queue is shutdown as long as it is still valid and hasn't been deleted.
  • Reads/Writes on CallData are now protected by a mutex. This handles user code making multiple read/write calls in parallel which is not supported by the underlying stream we are using or the fact that CallData only maintains a single message instance for managing each of the request/response data. The serialization code to move data between the LV cluster and protobuf message has also been moved into these methods so it is protected by the mutex. The "tags" used for read/write operations now mirror each other which simplifies state management and threading for the overall CallData since updates to the WritingResponse state cannot race with other state transitions. The WritingResponse state has been deleted.
  • Registered cleanup proc for CallData so they get cleaned up even if the user aborts the VI and does not shutdown cleanly. We really should be registering cleanup procs for all objects registered with the PointerManager, but that is too much noise to include with this change.

Testing

I manually tested using measurement plugins which utilize server streaming calls which remain open until canceled by the client. Since the crash doesn't happen until shutdown/unload of the library, automated testing of this scenario is difficult. I verified the following sequences do not crash:

  • Start the service, invoke RPC from client, cancel from client, shutdown service, and then close the LV project.
  • Start the service, invoke RPC from client, shutdown service, and then close the LV project. The client displays an UNAVAILABLE error.

…time management issues that led to crashes due to accessing deleted objects.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the server-side async RPC state machine and interop layer to improve lifetime management and reduce shutdown-related race conditions/crashes (per #501), particularly by keeping server/call objects alive until all completion-queue work and LabVIEW user cleanup has completed.

Changes:

  • Reworked server CQ processing to rely on CQ shutdown/drain semantics and introduced CQ tag wrappers that hold shared_ptr to keep call objects alive through async callbacks.
  • Refactored server call handling (CallData) to use shared_ptr ownership, serialized read/write operations, and added explicit “finish” paths from LabVIEW vs CQ completion.
  • Added an LV cleanup proc for call cleanup, and fixed a missing break in cluster serialization.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/grpc_server.h Updates server/call types for shared_ptr lifetime + adds CQ tag wrapper types.
src/grpc_server.cc Refactors CQ loop to drain via _cq->Next() until shutdown, removing manual drain logic.
src/event_data.cc Implements new CallData state machine, read/write serialization, CQ tag behaviors, and finish handling.
src/grpc_interop.h Adds declaration for LV cleanup proc used by call cleanup.
src/grpc_interop.cc Switches server creation to shared_ptr, routes request/response APIs through CallData, and adds cleanup proc plumbing.
src/cluster_copier.cc Adds missing break in Fixed32Value serialization switch.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/grpc_server.h Outdated
Comment thread src/grpc_interop.cc Outdated
Comment thread src/grpc_server.cc
Comment thread src/event_data.cc
- public inheritance from CallDataBase
@jasonmreding jasonmreding requested review from bkeryan and yash-ni March 20, 2026 00:14
@ni ni deleted a comment from Copilot AI Mar 31, 2026
@jasonmreding jasonmreding merged commit 7900920 into master Mar 31, 2026
5 checks passed
@jasonmreding jasonmreding deleted the users/jreding/RefactorServerStateMachine branch March 31, 2026 00:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LV 2025 crashes when closing projects after running a service and then canceling a RPC from the client

3 participants