Skip to content

test(self-talk): Replace superstitious and timing-racy sleep() with proper primitives#5855

Open
alexw91 wants to merge 1 commit into
aws:mainfrom
alexw91:fix-slow-tests-v4
Open

test(self-talk): Replace superstitious and timing-racy sleep() with proper primitives#5855
alexw91 wants to merge 1 commit into
aws:mainfrom
alexw91:fix-slow-tests-v4

Conversation

@alexw91
Copy link
Copy Markdown
Contributor

@alexw91 alexw91 commented Apr 29, 2026

Goal

Replace ad-hoc sleep() synchronization with proper primitives across eight fork-based self-talk / IO tests, removing it entirely where it was superstitious and replacing it with deterministic synchronization where it was load-bearing.

Why

See: #5704

Eight fork-based tests used hardcoded sleep() calls between the parent (TLS server) and child (TLS client) processes. The sleeps fell into three categories:

  1. Superstitious pre-negotiate sleeps: "give the server a chance to listen" in mock_client before s2n_negotiate. Unnecessary because the socketpair is valid from the moment of fork() and the kernel buffers writes until the parent reads.
  2. Superstitious pre-cleanup sleeps: "give the server a chance to avoid a sigpipe" before tearing down the socket. Unnecessary because SIGPIPE is already ignored process-wide via signal(SIGPIPE, SIG_IGN) in s2n_io_pair_init, inherited across fork().
  3. Load-bearing sleeps: actually waiting for a state transition in the other process. These were either races (could fire before the event they were meant to synchronize with) or guesses at how long the state transition would take.

Problems with this: runtime bloat (most tests had 2-8 seconds of sleep per run), and potential flakes on slow or loaded CI where the chosen sleep duration could be too short.

How

Superstitious sleeps removed. All pre-negotiate and pre-cleanup sleep(1) calls in mock_client-style functions across all eight tests, replaced with a short comment explaining why the sleep was unnecessary.

Load-bearing sleep in s2n_self_talk_nonblocking_test.c replaced with waitpid(pid, &status, WUNTRACED). The parent previously sleep(1)'d before kill(pid, SIGCONT) to give the child time to raise(SIGSTOP). WUNTRACED causes waitpid to return when the child transitions to the stopped state, so the parent is guaranteed to only send SIGCONT after the child has stopped. This also fixes a latent race: on a loaded system the parent's SIGCONT could arrive before the child had called SIGSTOP.

Load-bearing sleeps in s2n_self_talk_broken_pipe_test.c replaced with a pair of control pipes. The test needs two process-to-process events:

  • Parent must wait for child to shut down its read side before calling s2n_send (otherwise no EPIPE). Replaced with read(ready_pipe).
  • Child must wait for parent to finish its EPIPE assertions before tearing down the socket (otherwise race with final s2n_shutdown). Replaced with read(done_pipe).

Per-file breakdown:

File Sleeps removed Replacement
s2n_self_talk_client_hello_cb_test.c 2 (all superstitious)
s2n_self_talk_nonblocking_test.c 3 2 superstitious + 1 sleep(1)waitpid WUNTRACED
s2n_self_talk_broken_pipe_test.c 4 2 superstitious + sleep(2) / sleep(1) → control pipes
s2n_self_talk_session_id_test.c 4 (all superstitious: 1 pre-negotiate + 3 pre-cleanup)
s2n_self_talk_tls13_test.c 2 (all superstitious)
s2n_self_talk_tls12_test.c 2 (all superstitious)
s2n_self_talk_alerts_test.c 1 (superstitious, called 3x via outer loop)
s2n_mem_allocator_test.c 2 (all superstitious, called 2x via outer loop)

Callouts

  • Left in place: the load-bearing nanosleep() calls in self_talk_tls12_test, self_talk_tls13_test, and s2n_mem_allocator_test that exercise the dynamic-record-size inactivity-timeout reset. Those genuinely need real elapsed time. A follow-up could replace them with a mock-clock override on conn->config->wall_clock, but that's a larger change worth doing separately.
  • Also left in place: the sleep(1) in s2n_locking_test.c, which is inside a #if !(S2N_OPENSSL_VERSION_AT_LEAST(1, 1, 0)) compat block that only runs on libcrypto < 1.1.0. The sleep there gives a thread time to block on a mutex before asserting the mutex is held, and there's no cheap primitive to replace it in that context.
  • The WUNTRACED change in nonblocking_test and the control-pipe change in broken_pipe_test are correctness improvements, not just speedups. The original sleeps were races that happened to work in practice on fast machines but could fail under load.
  • No product code is touched. Only the eight test files.

Testing

Ran each test 100 times in a tight loop after the changes. All 800 runs passed.

Runtime on an M4 MacBook Pro:

Test Before After Speedup
s2n_self_talk_client_hello_cb_test 19.54 sec 2.81 sec ~7x
s2n_self_talk_nonblocking_test 19.25 sec 4.75 sec ~4x
s2n_self_talk_broken_pipe_test 11.36 sec 2.88 sec ~4x
s2n_self_talk_tls12_test 8.54 sec 5.41 sec ~1.6x
s2n_self_talk_session_id_test 6.12 sec 2.69 sec ~2.3x
s2n_self_talk_tls13_test 5.56 sec 4.08 sec ~1.4x
s2n_self_talk_alerts_test 5.76 sec 2.86 sec ~2.0x
s2n_mem_allocator_test 9.14 sec 5.45 sec ~1.7x

Related

Refs #5704

…roper primitives

Eight fork-based self-talk / IO tests used hardcoded sleep() calls as
ad-hoc synchronization between the parent (TLS server) and child (TLS
client) processes. Three categories of sleeps were present:

1. Superstitious pre-negotiate sleeps: "give the server a chance to
   listen". Unnecessary because the socketpair underlying the io_pair
   is valid from the moment of fork(); the kernel buffers writes until
   the parent starts reading.

2. Superstitious pre-cleanup sleeps: "give the server a chance to avoid
   a sigpipe". Unnecessary because SIGPIPE is ignored process-wide via
   signal(SIGPIPE, SIG_IGN) in s2n_io_pair_init, and the disposition is
   inherited across fork().

3. Load-bearing sleeps implementing ad-hoc synchronization. These were
   either a race (parent's pre-SIGCONT sleep in nonblocking_test, which
   could fire before the child had called raise(SIGSTOP)) or a guess at
   how long the other process would take to reach a given state
   (broken_pipe_test's sleeps around the half-closed-pipe assertions).
   Both flavors are replaced with proper primitives.

Changes per file:
- s2n_self_talk_client_hello_cb_test.c: removed 2 superstitious sleeps
  (pre-negotiate and pre-cleanup).
- s2n_self_talk_nonblocking_test.c: removed 2 superstitious pre-negotiate
  sleeps; replaced the parent's pre-SIGCONT sleep with
  waitpid(pid, &status, WUNTRACED). WUNTRACED causes waitpid to return
  when the child transitions to the stopped state, eliminating a
  real race where SIGCONT could arrive before SIGSTOP.
- s2n_self_talk_broken_pipe_test.c: removed 2 superstitious sleeps;
  replaced the two load-bearing sleeps (child's sleep(2) for "let
  parent observe EPIPE" and parent's sleep(1) for "let child shut
  down read side") with a pair of control pipes. The child writes to
  ready_pipe after shutting down its read side; parent blocks on
  read(ready_pipe) before its s2n_send. Parent writes to done_pipe
  after completing its assertions; child blocks on read(done_pipe)
  before teardown. Fully deterministic handoff.
- s2n_self_talk_session_id_test.c: removed 4 superstitious sleeps
  (1 pre-negotiate + 3 pre-cleanup).
- s2n_self_talk_tls13_test.c: removed 2 superstitious sleeps.
- s2n_self_talk_tls12_test.c: removed 2 superstitious sleeps.
- s2n_self_talk_alerts_test.c: removed 1 superstitious pre-negotiate
  sleep (called 3 times via the outer loop).
- s2n_mem_allocator_test.c: removed 2 superstitious sleeps (called
  twice via the outer loop).

All eight tests were run 100x in a flakiness loop after changes with
no failures (800 total runs).

Runtime on an M4 MacBook Pro:
- s2n_self_talk_client_hello_cb_test: 19.54s ->  2.81s  (~7x faster)
- s2n_self_talk_nonblocking_test:     19.25s ->  4.75s  (~4x faster)
- s2n_self_talk_broken_pipe_test:     11.36s ->  2.88s  (~4x faster)
- s2n_self_talk_tls12_test:            8.54s ->  5.41s  (~1.6x faster)
- s2n_self_talk_session_id_test:       6.12s ->  2.69s  (~2.3x faster)
- s2n_self_talk_tls13_test:            5.56s ->  4.08s  (~1.4x faster)
- s2n_self_talk_alerts_test:           5.76s ->  2.86s  (~2.0x faster)
- s2n_mem_allocator_test:              9.14s ->  5.45s  (~1.7x faster)

Refs: aws#5704
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant