Skip to content

relpsess: skip graceful close for half-open clients#289

Open
rgerhards wants to merge 2 commits intomasterfrom
fix/issue-6547-relp-shutdown-crash
Open

relpsess: skip graceful close for half-open clients#289
rgerhards wants to merge 2 commits intomasterfrom
fix/issue-6547-relp-shutdown-crash

Conversation

@rgerhards
Copy link
Copy Markdown
Member

@rgerhards rgerhards commented Apr 14, 2026

Summary

  • avoid graceful client disconnect during teardown unless the session is already in a state that can legally send RELP close traffic
  • stop emitting the redundant wait-state callback when the session is already broken
  • add a regression covering a client connect attempt against a plain TCP peer

Why

relpCltConnect() can destruct a client session that only reached INIT_CMD_SENT when the remote peer accepts TCP but never completes a RELP handshake. The old destructor always tried the graceful client disconnect path, which waits for READY_TO_SEND and triggers the extra error waiting on required session state callback during teardown. That callback path is what shows up in rsyslog issue #6547 during omrelp shutdown.

Closes rsyslog/rsyslog#6547


Summary by cubic

Skips graceful client disconnect for half‑open sessions and removes duplicate wait‑state errors during teardown. Adds server session lifecycle callbacks to librelp; resolves rsyslog/rsyslog#6547 seen during omrelp shutdown.

  • Bug Fixes

    • Only attempt client graceful close in READY_TO_SEND/WINDOW_FULL; skip for half‑open states.
    • Do not emit the wait‑state error when the session is already broken; just mark BROKEN.
    • Tear down sessions with an explicit reason and handle engine shutdown cleanly.
  • New Features

    • Added server callbacks: relpEngineSetOnSessOpen, relpEngineSetOnSessClose, relpEngineSetOnSessOpenFail (documented in headers).
    • Engine now emits onSessOpen on successful handshake, onSessOpenFail on handshake failure, and onSessClose with a reason on any removal.
    • Tests: receiver logs callbacks via --sessfile; new session-callbacks.sh, invalid-open.py, and send-dummyserver-no-waitstate-errmsg.sh regressions.

Written for commit c8faf19. Summary will update on new commits.

rgerhards and others added 2 commits January 4, 2026 13:25
Expose open/close/open-failed callbacks for server sessions.
Document callback semantics in librelp headers.
Emit close callbacks on engine teardown.
Extend test receiver plus add session-callback test harness.
Client teardown currently calls relpSessCltDoDisconnect() for any client session that is not already DISCONNECTED or BROKEN. When relpCltConnect() tears down a session that only made it to INIT_CMD_SENT against a plain TCP peer, that destructor path waits for READY_TO_SEND, converts the already-broken state into a second generic wait-state error, and surfaces destructor-time onErr callbacks to callers such as rsyslog omrelp shutdown.

Only READY_TO_SEND and WINDOW_FULL can still send a valid RELP close frame, so gate graceful client disconnect to those states. When relpSessWaitState() already returns RELP_RET_SESSION_BROKEN, mark the session broken without emitting the extra 'error waiting on required session state' callback again.

Add a regression that connects the sender to dummyserver.py, confirms relpCltConnect() still fails, and asserts the bogus wait-state error no longer appears.

Closes rsyslog/rsyslog#6547

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces session-level callbacks for RELP session lifecycle events, including successful opens, failures, and closures. It adds new API functions to librelp.h, updates the engine to trigger these callbacks, and includes comprehensive tests to verify the new functionality. In src/relpsess.c, the logic for handling session state timeouts and breakages was refined, though feedback suggests further guarding against redundant error reporting when the session is already in a broken state.

Comment thread src/relpsess.c
Comment on lines +624 to 631
if(iRet == RELP_RET_TIMED_OUT || relpEngineShouldStop(pThis->pEngine)) {
/* the session is broken! */
callOnErr(pThis, (char*) "error waiting on required session state, session broken",
RELP_RET_SESSION_BROKEN);
pThis->sessState = eRelpSessState_BROKEN;
} else if(iRet == RELP_RET_SESSION_BROKEN) {
pThis->sessState = eRelpSessState_BROKEN;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current logic still allows for redundant callOnErr invocations if relpEngineShouldStop is true but the session was already broken (and reported) by relpSessRcvData. Additionally, it suppresses legitimate error reporting if iRet is RELP_RET_SESSION_BROKEN due to a poll failure (which hasn't been reported yet). A more robust approach is to guard the entire block by checking if the session is already in the BROKEN state.

	if(pThis->sessState != eRelpSessState_BROKEN) {
		if(iRet == RELP_RET_TIMED_OUT || iRet == RELP_RET_SESSION_BROKEN || relpEngineShouldStop(pThis->pEngine)) {
			/* the session is broken! */
			callOnErr(pThis, (char*) "error waiting on required session state, session broken",
				RELP_RET_SESSION_BROKEN);
			pThis->sessState = eRelpSessState_BROKEN;
		}
	}

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 10 files

Confidence score: 3/5

  • There is some meaningful regression risk: in src/relpsess.c, RELP_RET_SESSION_BROKEN no longer invokes onErr, so poll/select failures in relpSessWaitState() can silently break sessions without error reporting.
  • src/copen.c may emit a spurious onSessOpenFail after a duplicate open on an already-open session, which can confuse failure handling and user-visible state transitions.
  • The issue in src/librelp.h is documentation-only but important: onSessClose can run during relpEngineDestruct(), not just relpEngineRun(), so callback thread-context assumptions may be wrong.
  • Pay close attention to src/relpsess.c, src/copen.c, and src/librelp.h - error callback regression, false open-failure signaling, and incorrect thread-context documentation.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/librelp.h">

<violation number="1" location="src/librelp.h:250">
P2: The `onSessClose` docs promise the wrong thread context: shutdown can invoke this callback from `relpEngineDestruct()`, not only from `relpEngineRun()`.</violation>
</file>

<file name="src/relpsess.c">

<violation number="1" location="src/relpsess.c:629">
P2: `RELP_RET_SESSION_BROKEN` no longer triggers `onErr` here, so poll/select failures in `relpSessWaitState()` now break the session without reporting the error.</violation>
</file>

<file name="src/copen.c">

<violation number="1" location="src/copen.c:174">
P2: Guard `onSessOpenFail` so it only fires for sessions that never reached the open state; otherwise a duplicate `open` request reports a spurious open failure on an already-open session.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread src/librelp.h
* The callback fires when a session is removed from the engine, regardless
* of the close cause (protocol close, I/O error, or engine shutdown). The
* reason parameter contains the relpRetVal that triggered teardown; engine
* shutdown uses RELP_RET_SESSION_CLOSED. The callback runs on the
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The onSessClose docs promise the wrong thread context: shutdown can invoke this callback from relpEngineDestruct(), not only from relpEngineRun().

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/librelp.h, line 250:

<comment>The `onSessClose` docs promise the wrong thread context: shutdown can invoke this callback from `relpEngineDestruct()`, not only from `relpEngineRun()`.</comment>

<file context>
@@ -220,6 +220,62 @@ relpRetVal relpEngineSetOnErr(relpEngine_t *pThis,
+ * The callback fires when a session is removed from the engine, regardless
+ * of the close cause (protocol close, I/O error, or engine shutdown). The
+ * reason parameter contains the relpRetVal that triggered teardown; engine
+ * shutdown uses RELP_RET_SESSION_CLOSED. The callback runs on the
+ * relpEngineRun() thread and pSess is valid only during the callback.
+ *
</file context>
Fix with Cubic

Comment thread src/relpsess.c
callOnErr(pThis, (char*) "error waiting on required session state, session broken",
RELP_RET_SESSION_BROKEN);
pThis->sessState = eRelpSessState_BROKEN;
} else if(iRet == RELP_RET_SESSION_BROKEN) {
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: RELP_RET_SESSION_BROKEN no longer triggers onErr here, so poll/select failures in relpSessWaitState() now break the session without reporting the error.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/relpsess.c, line 629:

<comment>`RELP_RET_SESSION_BROKEN` no longer triggers `onErr` here, so poll/select failures in `relpSessWaitState()` now break the session without reporting the error.</comment>

<file context>
@@ -613,15 +619,15 @@ relpSessWaitState(relpSess_t *const pThis, const relpSessState_t stateExpected,
 		callOnErr(pThis, (char*) "error waiting on required session state, session broken",
 			RELP_RET_SESSION_BROKEN);
 		pThis->sessState = eRelpSessState_BROKEN;
+	} else if(iRet == RELP_RET_SESSION_BROKEN) {
+		pThis->sessState = eRelpSessState_BROKEN;
 	}
</file context>
Fix with Cubic

Comment thread src/copen.c
relpOffersDestruct(&pSrvOffers);

if(iRet != RELP_RET_OK) {
if(pSess->pEngine->onSessOpenFail != NULL) {
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Guard onSessOpenFail so it only fires for sessions that never reached the open state; otherwise a duplicate open request reports a spurious open failure on an already-open session.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/copen.c, line 174:

<comment>Guard `onSessOpenFail` so it only fires for sessions that never reached the open state; otherwise a duplicate `open` request reports a spurious open failure on an already-open session.</comment>

<file context>
@@ -168,6 +171,9 @@ BEGINcommand(S, Init)
 		relpOffersDestruct(&pSrvOffers);
 	
 	if(iRet != RELP_RET_OK) {
+		if(pSess->pEngine->onSessOpenFail != NULL) {
+			pSess->pEngine->onSessOpenFail(pSess->pUsr, pSess, iRet);
+		}
</file context>
Suggested change
if(pSess->pEngine->onSessOpenFail != NULL) {
if(!pSess->bServerConnOpen && pSess->pEngine->onSessOpenFail != NULL) {
Fix with Cubic

rgerhards added a commit to rgerhards/rsyslog that referenced this pull request Apr 14, 2026
librelp can emit onErr callbacks while tearing down a client session that only reached INIT_CMD_SENT against a plain TCP peer. omrelp used to store wrkrInstanceData_t as the librelp user pointer, so those teardown-time callbacks depended on worker-lifetime state while freeWrkrInstance() was already dismantling the worker.

rsyslog already keeps the shared action instance alive through worker cleanup, so pass instanceData to librelp instead and log from that stable callback context. Add null-pointer fallbacks in onErr() and onAuthErr() so callback logging stays safe even if librelp invokes them without a user pointer.

Add a shutdown regression built around a dumb TCP listener that accepts the socket and holds it open long enough for omrelp to remain in a partial RELP session. With that test in place, the rsyslog workaround was rebuilt and validated against an unpatched librelp checkout to confirm distros can carry the mitigation before the librelp fix is available.

Closes rsyslog#6547

See also rsyslog/librelp#289

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[General]: rsyslog crash on shutdown due to a broken relp session

1 participant