Skip to content

fix(pulse): possible cgroups race condition.#1048

Open
Zenithar wants to merge 3 commits intomainfrom
zenithar/chaos-controller/pulse_mode_fixes
Open

fix(pulse): possible cgroups race condition.#1048
Zenithar wants to merge 3 commits intomainfrom
zenithar/chaos-controller/pulse_mode_fixes

Conversation

@Zenithar
Copy link
Copy Markdown
Contributor

@Zenithar Zenithar commented Mar 5, 2026

Summary

Fix pulse mode for memory and CPU pressure disruptions. Three root causes were identified and fixed:

Bug 1: Child subprocesses enter their own pulse loop (node-level)

In cli/injector/main.go, the switch statement routes node-level disruptions into the pulse loop without checking if the current process is a child subprocess. The parent spawns a child (memory-stress/cpu-stress), but that child inherits the --level node flag and enters its own independent pulse loop — both parent and child try to pulse, causing chaotic behavior.

Fix: Added parentPID == 0 && guard so only the parent process enters the node-level pulse loop.

Bug 2: Pulse args leak to child subprocesses

In injector/factory.go, CreateCmdArgs() passes --pulse-active-duration, --pulse-dormant-duration, and --pulse-initial-delay to child subprocesses. If PulseInitialDelay >= PulseActiveDuration, the child subprocess is killed by its own pulse timer before it ever allocates memory or starts stressing CPU. The child thinks it should pulse, but pulsing is the parent's responsibility (clean → re-inject cycle).

Fix: Zero out pulse args (PulseActiveDuration, PulseDormantDuration, PulseInitialDelay) before passing them to the child subprocess command.

Bug 3: Clean() doesn't wait for child process to exit

In injector/memory_pressure.go and injector/cpu_pressure.go, Clean() sends SIGTERM and returns immediately. During pulse mode, the parent calls Clean() then Inject() in quick succession. The old child may still be dying (holding mmap'd memory in the cgroup) when the new child starts. The new child reads cgroup memory usage, sees it's at/above target, calculates targetBytes <= 0, and skips allocation — producing no stress during the next pulse.

Fix: Added a Done() channel to BackgroundCmd that closes when the child process exits. Clean() now blocks on <-Done() (with 5s timeout) before returning, ensuring cgroup accounting reflects freed memory before re-injection.

Changed files

  • cli/injector/main.go — Guard node-level pulse loop with parentPID == 0
  • injector/factory.go — Zero out pulse args for child subprocesses
  • command/command.go — Add Done() channel to BackgroundCmd interface
  • command/background_cmd_mock.go — Add Done() mock method
  • injector/memory_pressure.go — Wait for child exit in Clean()
  • injector/cpu_pressure.go — Wait for child exit in Clean()
  • injector/cpu_pressure_test.go — Add Done() mock expectation
  • injector/memory_pressure_test.go — Add Done() mock expectation

Test plan

  • Unit tests pass (command, injector, services packages)
  • Regenerate mocks with make generate-mocks
  • Run full test suite (make test)
  • E2E: deploy memory pressure disruption with pulse mode, verify pulses cycle correctly
  • E2E: deploy CPU pressure disruption with pulse mode, verify pulses cycle correctly

🤖 Generated with Claude Code

@datadog-datadog-prod-us1-2
Copy link
Copy Markdown

datadog-datadog-prod-us1-2 Bot commented Mar 5, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 45.45%
Overall Coverage: 38.43% (-0.06%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: d6e6ff7 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

@Zenithar Zenithar self-assigned this Mar 5, 2026
@Zenithar Zenithar marked this pull request as ready for review April 13, 2026 07:24
@Zenithar Zenithar requested a review from a team as a code owner April 13, 2026 07:24
Comment thread injector/cpu_pressure.go Outdated
@aymericDD
Copy link
Copy Markdown
Contributor

Could you deploy this PR on staging and show if it solves the issue?

@Zenithar Zenithar force-pushed the zenithar/chaos-controller/pulse_mode_fixes branch from 2c71181 to d6e6ff7 Compare April 13, 2026 09:22
Comment thread injector/injector.go

// stopAndWaitForBackgroundCmd stops a background command and waits for the process to fully exit
// before returning. This prevents cgroup race conditions during pulse mode re-injection.
func stopAndWaitForBackgroundCmd(log *zap.SugaredLogger, backgroundCmd command.BackgroundCmd, cancel context.CancelFunc) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think codex highlighted an interesting behavior:

  • [P1] Return an error when the background process never reaches Done()/Users/aymeric.daurelle/go/src/github.com/DataDog/chaos-controller/injector/injector.go:86-90 If the stress subprocess does not exit within 5 seconds after Stop() (for example under heavy CPU/memory pressure, or if it ignores SIGTERM until its context is canceled), this helper only logs a warning and still returns nil. clean() will then treat cleanup as successful and may immediately reinject while the previous process is still running, which recreates the overlap/race this change is trying to avoid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants