feat(disruption): disk full injection. by Zenithar · Pull Request #1058 · DataDog/chaos-controller

Zenithar · 2026-04-08T15:07:05Z

What does this PR do?

Adds new functionality

Adds a new diskFull disruption kind that genuinely fills a target pod volume using the fallocate(2) syscall, causing real ENOSPC errors on all subsequent write operations. This fills a gap: the existing disruptions (DiskPressure throttles I/O, DiskFailure uses eBPF on openat only) do not simulate actual disk exhaustion that is visible to monitoring and to every syscall.

Features

Volume fill via ballast file: Creates a ballast file via fallocate(2) syscall (instant, O(1) on ext4/xfs) to genuinely consume disk space. Falls back to writing zeros on unsupported filesystems.
Safety: 1Mi minimum free space floor (overridable via unsafeMode.allowDiskFullNoFloor). Pod-level only. Webhook warning for ephemeral-storage eviction risk.
Pure Go fallocate: Vendored fallocate/ package (adapted from detailyang/go-fallocate, MIT) — no dependency on fallocate or dd binaries in the injector image.
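The "preallocate, else write zeros" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `fillBallast`, `preallocFunc`, and `errPreallocUnsupported` are hypothetical names, and the preallocation step is injected as a function so the sketch stays portable (on Linux it could wrap `syscall.Fallocate`).

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errPreallocUnsupported is a hypothetical sentinel standing in for the
// "operation not supported" errors a preallocation call may return.
var errPreallocUnsupported = errors.New("preallocation not supported")

// preallocFunc reserves size bytes for f. It is injected so that the
// sketch does not depend on a platform-specific syscall.
type preallocFunc func(f *os.File, size int64) error

// fillBallast grows f to size bytes. It prefers preallocation and falls
// back to writing zeros only when preallocation is reported as unsupported.
func fillBallast(f *os.File, size int64, prealloc preallocFunc) error {
	err := prealloc(f, size)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errPreallocUnsupported) {
		return fmt.Errorf("preallocate ballast: %w", err)
	}
	// Fallback: stream zeros in 1 MiB chunks so memory use stays bounded.
	zeros := make([]byte, 1<<20)
	for written := int64(0); written < size; {
		chunk := int64(len(zeros))
		if remaining := size - written; remaining < chunk {
			chunk = remaining
		}
		n, werr := f.Write(zeros[:chunk])
		written += int64(n)
		if werr != nil {
			return fmt.Errorf("write ballast: %w", werr)
		}
	}
	return nil
}

// demoFallback exercises the fallback path with a stub that always reports
// "unsupported" and returns the resulting file size.
func demoFallback(size int64) (int64, error) {
	f, err := os.CreateTemp("", "ballast-*")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	unsupported := func(*os.File, int64) error { return errPreallocUnsupported }
	if err := fillBallast(f, size, unsupported); err != nil {
		return 0, err
	}
	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	n, err := demoFallback(3<<20 + 123)
	fmt.Println(n, err)
}
```

Falling back only on an explicit "unsupported" error, rather than on any failure, keeps genuine errors such as a full or read-only filesystem visible to the caller.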

How it differs from existing disruptions

| Disruption | Mechanism | ENOSPC on writes? | Visible to `df`/monitoring? |
| --- | --- | --- | --- |
| Disk Pressure | Cgroup blkio throttling | No | No |
| Disk Failure | eBPF on `openat` only | Only on file open | No |
| Disk Full (new) | Real space allocation | Yes (all syscalls) | Yes |
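Because the new disruption produces real ENOSPC errors, a target application can be checked for correct handling with ordinary error inspection. The sketch below shows one way a Go service might recognise the condition; `isDiskFull` and `simulatedWriteError` are illustrative names and are not part of this PR.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isDiskFull reports whether err, or any error it wraps, is ENOSPC.
func isDiskFull(err error) bool {
	return errors.Is(err, syscall.ENOSPC)
}

// simulatedWriteError builds an error with the same shape that file
// operations in the os package return: a *os.PathError wrapping an errno.
func simulatedWriteError() error {
	return &os.PathError{Op: "write", Path: "/data/app.log", Err: syscall.ENOSPC}
}

func main() {
	// Wrapping with %w keeps the errno reachable through errors.Is.
	err := fmt.Errorf("flush buffer: %w", simulatedWriteError())
	fmt.Println(isDiskFull(err))
	fmt.Println(isDiskFull(os.ErrNotExist))
}
```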

Example

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: disk-full-test
spec:
  selector:
    app: my-service
  count: 1
  level: pod
  duration: 10m
  diskFull:
    path: "/data"
    capacity: "95%"
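To make the `capacity` field concrete, here is a rough sketch of how a percentage could be turned into a ballast size. It assumes that `capacity: "95%"` means "fill the volume until 95% of its total size is used" and applies the 1Mi free-space floor mentioned under Safety. The helper names (`parsePercent`, `ballastSize`), the accepted range of 1 to 100, and the sample volume numbers are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minFreeFloor is the 1Mi minimum free space described in the PR.
const minFreeFloor = 1 << 20

// parsePercent turns a string such as "95%" into 95.
// The accepted range (1 to 100) is an assumption of this sketch.
func parsePercent(s string) (uint64, error) {
	if !strings.HasSuffix(s, "%") {
		return 0, fmt.Errorf("capacity %q must end with %%", s)
	}
	p, err := strconv.ParseUint(strings.TrimSuffix(s, "%"), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("capacity %q: %w", s, err)
	}
	if p < 1 || p > 100 {
		return 0, fmt.Errorf("capacity %q must be between 1%% and 100%%", s)
	}
	return p, nil
}

// ballastSize returns how many bytes to allocate so that used space reaches
// percent of total, while never leaving less than floor bytes free.
// It returns 0 when the volume is already at or beyond the target.
func ballastSize(total, free, percent, floor uint64) uint64 {
	// Divide before multiplying to avoid overflow on very large volumes.
	targetFree := total - total/100*percent
	if targetFree < floor {
		targetFree = floor
	}
	if free <= targetFree {
		return 0
	}
	return free - targetFree
}

func main() {
	percent, err := parsePercent("95%")
	if err != nil {
		fmt.Println("invalid capacity:", err)
		return
	}
	// Hypothetical volume: 100 MiB total, 60 MiB currently free.
	size := ballastSize(100<<20, 60<<20, percent, minFreeFloor)
	fmt.Printf("ballast size: %d bytes\n", size)
}
```

With these sample numbers the target leaves 5 MiB free, so the ballast would take 55 MiB of the 60 MiB currently available.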

Code Quality Checklist

The documentation is up to date.
My code is sufficiently commented and passes continuous integration checks.
I have signed my commit (see Contributing Docs).

Testing

I leveraged continuous integration testing
- by adding new unit tests.
I manually tested the following steps:
- locally.
- as a canary deployment to a cluster.

Test coverage

Spec validation: capacity/remaining mutual exclusivity, boundary values, GenerateArgs, Explain
Injector: creation, inject with capacity/remaining, dry-run, remaining > available (skip), inject+clean round trip, idempotent cleanup
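The mutual-exclusivity rule covered by the spec validation tests could look roughly like the sketch below. `DiskFullSpec` here is a hypothetical stand-in for the type in `api/v1beta1/disk_full.go`: the `Path` and `Capacity` fields follow the YAML example, while the `Remaining` field name, its sample value, and the "exactly one must be set" rule are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// DiskFullSpec is a hypothetical mirror of the diskFull block in the CRD.
type DiskFullSpec struct {
	Path      string
	Capacity  string // target fill level, e.g. "95%"
	Remaining string // free space to leave; the value format is an assumption
}

// Validate rejects specs that set both capacity and remaining, or neither.
func (s DiskFullSpec) Validate() error {
	if s.Path == "" {
		return errors.New("diskFull.path must be set")
	}
	if s.Capacity != "" && s.Remaining != "" {
		return errors.New("diskFull.capacity and diskFull.remaining are mutually exclusive")
	}
	if s.Capacity == "" && s.Remaining == "" {
		return errors.New("one of diskFull.capacity or diskFull.remaining must be set")
	}
	return nil
}

func main() {
	specs := []DiskFullSpec{
		{Path: "/data", Capacity: "95%"},
		{Path: "/data", Capacity: "95%", Remaining: "50Mi"},
		{Path: "/data"},
	}
	for _, s := range specs {
		fmt.Printf("%+v -> %v\n", s, s.Validate())
	}
}
```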

Files changed (24 files, ~1350 lines)

| Component | Files |
| --- | --- |
| CRD spec + validation | `api/v1beta1/disk_full.go`, `disruption_types.go`, `disruption_webhook.go`, `safemode.go` |
| Injector | `injector/disk_full.go` (ballast file via fallocate) |
| CLI | `cli/injector/disk_full.go`, `cli/injector/main.go` |
| fallocate package | `fallocate/` (4 platform-specific files, adapted from go-fallocate MIT) |
| Safemode | `safemode/safemode_disk_full.go`, `safemode/safemode.go` |
| Types | `types/types.go` (`DisruptionKindDiskFull`) |
| Docs | `docs/disk_full.md`, `docs/disruption_catalogue.md` |
| Tests | `api/v1beta1/disk_full_test.go`, `injector/disk_full_test.go` |

Signed-off-by: Thibault NORMAND <thibault.normand@datadoghq.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Thibault NORMAND <me@zenithar.org>

datadog-prod-us1-4 · 2026-04-08T15:18:49Z

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
• Patch Coverage: 61.73%
• Overall Coverage: 39.04% (+0.55%)

This comment will be updated automatically if new data arrives.

🔗 Commit SHA: fb46e35

aymericDD · 2026-04-16T08:01:50Z

The diskFull disruption creates a ballast file on the host filesystem via the injector pod. The injector pod mounts the host root at /mnt/host, but that mount has ReadOnly: true, which was correct for all existing injectors (network, CPU, etc.) that only read from the host. diskFull must write to the host, so it fails with a read-only file system error (EROFS, not ENOSPC) before it even starts.

Root cause

services/chaospod.go:573 — the host VolumeMount unconditionally sets ReadOnly: true:

     {
         Name:      "host",
         MountPath: "/mnt/host",
         ReadOnly:  true,   // ← must be false for diskFull
     },

Fix

Add a hostWritable bool parameter to generateChaosPodSpec

File: services/chaospod.go:466

Change signature:

     func (m *chaosPodService) generateChaosPodSpec(..., hostWritable bool) corev1.PodSpec {

Inside the function, use the parameter:

     {
         Name:      "host",
         MountPath: "/mnt/host",
         ReadOnly:  !hostWritable,
     },

Pass kind == DisruptionKindDiskFull from the call site

File: services/chaospod.go:332

     Spec: m.generateChaosPodSpec(
         targetNodeName,
         terminationGracePeriod,
         activeDeadlineSeconds,
         args,
         hostPathDirectory,
         hostPathFile,
         kind == chaostypes.DisruptionKindDiskFull,  // hostWritable
     ),

 **Critical files**

 - `services/chaospod.go` — only file to modify
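As a quick way to check the proposed behaviour across disruption kinds, the suggestion can be reduced to a standalone sketch. `VolumeMount` below is a local stand-in for the three Kubernetes fields used above, and the kind strings are placeholders; the real values of `DisruptionKindDiskFull` and the other kinds are not shown in this thread.

```go
package main

import "fmt"

// VolumeMount is a local stand-in for the corev1.VolumeMount fields
// that appear in the snippets above.
type VolumeMount struct {
	Name      string
	MountPath string
	ReadOnly  bool
}

// Placeholder kind values; the real constants live in types/types.go.
const (
	kindDiskFull = "disk-full"
	kindNetwork  = "network-disruption"
)

// needsWritableHost mirrors the `kind == DisruptionKindDiskFull` check
// suggested for the call site.
func needsWritableHost(kind string) bool {
	return kind == kindDiskFull
}

// hostVolumeMount builds the host mount, read-only unless hostWritable is set.
func hostVolumeMount(hostWritable bool) VolumeMount {
	return VolumeMount{
		Name:      "host",
		MountPath: "/mnt/host",
		ReadOnly:  !hostWritable,
	}
}

func main() {
	for _, kind := range []string{kindNetwork, kindDiskFull} {
		m := hostVolumeMount(needsWritableHost(kind))
		fmt.Printf("%s: readOnly=%t\n", kind, m.ReadOnly)
	}
}
```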

Zenithar · 2026-04-16T08:17:09Z

Many thanks for the deep investigation. I will fix that ASAP. I still have concerns about making the complete host filesystem writable just to write a ballast file in a dedicated directory. It would allow someone with access to the pod to alter the disrupted pod or node for purposes other than the expected disruption.

I will propose a security gate.

aymericDD · 2026-04-16T10:02:41Z

Could you also create an example file to test the disruption locally?

examples/disk_full.yaml

# Unless explicitly stated otherwise all files in this repository are licensed
# under the Apache License Version 2.0.
# This product includes software developed at Datadog (https://www.datadoghq.com/).
# Copyright 2026 Datadog, Inc.

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: disk-full
  namespace: chaos-demo
spec:
  level: pod
  selector:
    service: demo-curl
  count: 1
  duration: 10m
  diskFull:
    path: "/mnt/data"
    capacity: "95%"

aymericDD · 2026-04-16T10:03:23Z

Could you also update examples/complete.yaml, please?

aymericDD · 2026-04-16T10:05:02Z

Could you also update docs/README.md to add a link to the docs/disk_full.md disruption page, please?

… address PR comments. Add diskFull to 5 missing registration points in validateGlobalDisruptionScope (at-least-one-kind check, ContainerFailure/NodeFailure/PodReplacement compatibility, OnInit compatibility), DisruptionCount(), and Explain(). Add writable shadow mount for the target path in chaos pod spec so the injector can write ballast files while keeping /mnt/host read-only. Add capacity mode test coverage, disk_full example, complete.yaml entry, and docs/README.md link. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(disruption): disk full injection.

6109573

Signed-off-by: Thibault NORMAND <thibault.normand@datadoghq.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Thibault NORMAND <me@zenithar.org>

Zenithar force-pushed the zenithar/chaos-controller/disk_full_disruption branch from d238abd to 6109573 on April 8, 2026 15:11

test(disruption): add safety net to prevent filling the CI runner.

499ca66

Zenithar self-assigned this Apr 9, 2026

feat(disk-full): remove eBPF interception.

a8c2006

Zenithar marked this pull request as ready for review April 13, 2026 07:25

Zenithar requested a review from a team as a code owner April 13, 2026 07:25

Zenithar marked this pull request as draft April 16, 2026 09:35

aymericDD reviewed Apr 16, 2026

Comment thread api/v1beta1/disruption_types.go Outdated

Comment thread api/v1beta1/disruption_types.go Outdated

Comment thread api/v1beta1/disruption_types.go

Comment thread cli/injector/disk_full.go

Comment thread injector/disk_full.go

Zenithar wants to merge 4 commits into main from zenithar/chaos-controller/disk_full_disruption
