feat(disruption): disk full injection. #1058

Draft
Zenithar wants to merge 4 commits into main from zenithar/chaos-controller/disk_full_disruption

Conversation

@Zenithar
Contributor

@Zenithar Zenithar commented Apr 8, 2026

What does this PR do?

  • Adds new functionality

Adds a new diskFull disruption kind that genuinely fills a target pod volume using the fallocate(2) syscall, causing real ENOSPC errors on all subsequent write operations. This fills a gap where existing disruptions (DiskPressure = I/O throttling, DiskFailure = eBPF on openat only) don't simulate actual disk exhaustion visible to monitoring and all syscalls.

Features

  • Volume fill via ballast file: Creates a ballast file via fallocate(2) syscall (instant, O(1) on ext4/xfs) to genuinely consume disk space. Falls back to writing zeros on unsupported filesystems.
  • Safety: 1Mi minimum free space floor (overridable via unsafeMode.allowDiskFullNoFloor). Pod-level only. Webhook warning for ephemeral-storage eviction risk.
  • Pure Go fallocate: Vendored fallocate/ package (adapted from detailyang/go-fallocate, MIT) — no dependency on fallocate or dd binaries in the injector image.
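The ballast mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the PR's injector code: it uses the standard library's Linux-only syscall.Fallocate rather than the vendored fallocate/ package, and the names allocateBallast, removeBallast and demoBallast are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// allocateBallast reserves size bytes for the file at path. It tries
// fallocate(2) first and falls back to writing zeros when the filesystem
// does not support it. Linux-only: syscall.Fallocate is not defined elsewhere.
func allocateBallast(path string, size int64) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	// Mode 0 allocates real blocks and extends the file size.
	err = syscall.Fallocate(int(f.Fd()), 0, 0, size)
	if err == nil {
		return nil
	}
	if !errors.Is(err, syscall.EOPNOTSUPP) && !errors.Is(err, syscall.ENOSYS) {
		return err
	}

	// Fallback: write zeros in 1 MiB chunks.
	chunk := make([]byte, 1<<20)
	for remaining := size; remaining > 0; {
		n := int64(len(chunk))
		if remaining < n {
			n = remaining
		}
		if _, err := f.Write(chunk[:n]); err != nil {
			return err
		}
		remaining -= n
	}
	return f.Sync()
}

// removeBallast deletes the ballast file; a missing file is not an error,
// which makes cleanup safe to call more than once.
func removeBallast(path string) error {
	err := os.Remove(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	return err
}

// demoBallast allocates a 1 MiB ballast in a temporary directory, cleans it
// up twice, and returns the size the file had while it existed.
func demoBallast() (int64, error) {
	dir, err := os.MkdirTemp("", "ballast-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "ballast")
	if err := allocateBallast(path, 1<<20); err != nil {
		return 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	for i := 0; i < 2; i++ { // second call exercises the idempotent path
		if err := removeBallast(path); err != nil {
			return 0, err
		}
	}
	return info.Size(), nil
}

func main() {
	size, err := demoBallast()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ballast size:", size)
}
```

Treating only EOPNOTSUPP and ENOSYS as "unsupported" is a design choice of this sketch; any other error (including ENOSPC itself) is returned to the caller rather than retried with the slower zero-writing path.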

How it differs from existing disruptions

| Disruption | Mechanism | ENOSPC on writes? | Visible to df/monitoring? |
|---|---|---|---|
| Disk Pressure | Cgroup blkio throttling | No | No |
| Disk Failure | eBPF on openat only | Only on file open | No |
| Disk Full (new) | Real space allocation | Yes (all syscalls) | Yes |
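From the point of view of a target application, the "ENOSPC on writes" column means that write errors unwrap to syscall.ENOSPC. A small sketch of how a Go service might detect that condition during a disk-full experiment; isDiskFull and sampleWriteError are hypothetical helpers, and the path is only illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isDiskFull reports whether err was ultimately caused by ENOSPC.
func isDiskFull(err error) bool {
	return errors.Is(err, syscall.ENOSPC)
}

// sampleWriteError builds the kind of error os.File.Write returns when the
// underlying volume has no space left (illustrative path).
func sampleWriteError() error {
	return &os.PathError{Op: "write", Path: "/data/app.log", Err: syscall.ENOSPC}
}

func main() {
	// Errors are often wrapped by the application before they are inspected.
	err := fmt.Errorf("flush buffer: %w", sampleWriteError())
	fmt.Println(isDiskFull(err))            // true
	fmt.Println(isDiskFull(os.ErrNotExist)) // false
}
```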

Example

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: disk-full-test
spec:
  selector:
    app: my-service
  count: 1
  level: pod
  duration: 10m
  diskFull:
    path: "/data"
    capacity: "95%"
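To make the capacity field concrete, here is a hedged sketch of how a ballast size could be derived from it. It assumes that capacity is the target percentage of the volume that should end up used and that the 1Mi floor caps the result; neither detail is confirmed by the PR text, and ballastSize is a hypothetical helper, not the injector's function.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minFreeBytes is the 1Mi safety floor mentioned in the PR description.
const minFreeBytes int64 = 1 << 20

// ballastSize returns how many bytes to allocate so that used space reaches
// the requested percentage of total, while always leaving at least floor
// bytes free. It returns 0 when the volume is already at or above the target.
func ballastSize(capacity string, total, used, floor int64) (int64, error) {
	if !strings.HasSuffix(capacity, "%") {
		return 0, fmt.Errorf("capacity %q must end with %%", capacity)
	}
	pct, err := strconv.Atoi(strings.TrimSuffix(capacity, "%"))
	if err != nil || pct <= 0 || pct > 100 {
		return 0, fmt.Errorf("capacity %q must be a percentage in (0, 100]", capacity)
	}

	// Divide first to avoid overflow on very large volumes.
	target := total / 100 * int64(pct)
	ballast := target - used

	// Never consume the last floor bytes of free space.
	if maxBallast := total - used - floor; ballast > maxBallast {
		ballast = maxBallast
	}
	if ballast < 0 {
		ballast = 0
	}
	return ballast, nil
}

func main() {
	const mib = int64(1 << 20)
	// 100 MiB volume with 20 MiB already used, filled to 95%.
	n, err := ballastSize("95%", 100*mib, 20*mib, minFreeBytes)
	fmt.Println(n/mib, err)
}
```

With these assumptions, a 100 MiB volume that already holds 20 MiB needs a 75 MiB ballast to reach 95%, and a request for 100% is capped at 79 MiB so that 1 MiB stays free.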

Code Quality Checklist

  • The documentation is up to date.
  • My code is sufficiently commented and passes continuous integration checks.
  • I have signed my commit (see Contributing Docs).

Testing

  • I leveraged continuous integration testing
    • by adding new unit tests.
  • I manually tested the following steps:
    • locally.
    • as a canary deployment to a cluster.

Test coverage

  • Spec validation: capacity/remaining mutual exclusivity, boundary values, GenerateArgs, Explain
  • Injector: creation, inject with capacity/remaining, dry-run, remaining > available (skip), inject+clean round trip, idempotent cleanup
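The capacity/remaining mutual exclusivity covered by the spec tests can be illustrated with a simplified validator. The diskFullSpec type below is a hypothetical stand-in with assumed field names; the real spec in api/v1beta1/disk_full.go may differ, and boundary-value checks are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// diskFullSpec is a hypothetical, simplified stand-in for the PR's spec type.
type diskFullSpec struct {
	Path      string
	Capacity  string // e.g. "95%"
	Remaining string // e.g. "50Mi"
}

// validate enforces that a path is set and that exactly one of Capacity or
// Remaining is provided.
func (s diskFullSpec) validate() error {
	if s.Path == "" {
		return errors.New("path is required")
	}
	switch {
	case s.Capacity != "" && s.Remaining != "":
		return errors.New("capacity and remaining are mutually exclusive")
	case s.Capacity == "" && s.Remaining == "":
		return errors.New("one of capacity or remaining is required")
	}
	return nil
}

func main() {
	specs := []diskFullSpec{
		{Path: "/data", Capacity: "95%"},
		{Path: "/data", Capacity: "95%", Remaining: "50Mi"},
		{Path: "/data"},
	}
	for _, s := range specs {
		fmt.Printf("%+v -> %v\n", s, s.validate())
	}
}
```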

Files changed (24 files, ~1350 lines)

| Component | Files |
|---|---|
| CRD spec + validation | api/v1beta1/disk_full.go, disruption_types.go, disruption_webhook.go, safemode.go |
| Injector | injector/disk_full.go (ballast file via fallocate) |
| CLI | cli/injector/disk_full.go, cli/injector/main.go |
| fallocate package | fallocate/ (4 platform-specific files, adapted from go-fallocate MIT) |
| Safemode | safemode/safemode_disk_full.go, safemode/safemode.go |
| Types | types/types.go (DisruptionKindDiskFull) |
| Docs | docs/disk_full.md, docs/disruption_catalogue.md |
| Tests | api/v1beta1/disk_full_test.go, injector/disk_full_test.go |

Signed-off-by: Thibault NORMAND <thibault.normand@datadoghq.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Thibault NORMAND <me@zenithar.org>
@Zenithar Zenithar force-pushed the zenithar/chaos-controller/disk_full_disruption branch from d238abd to 6109573 Compare April 8, 2026 15:11
@datadog-prod-us1-4

datadog-prod-us1-4 Bot commented Apr 8, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 61.73%
Overall Coverage: 39.04% (+0.55%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: fb46e35

@Zenithar Zenithar self-assigned this Apr 9, 2026
@Zenithar Zenithar marked this pull request as ready for review April 13, 2026 07:25
@Zenithar Zenithar requested a review from a team as a code owner April 13, 2026 07:25
@aymericDD
Contributor

The diskFull disruption creates a ballast file on the host filesystem via the injector pod. The injector pod mounts the host root at /mnt/host, but that mount has ReadOnly: true. That was correct for all existing injectors (network, CPU, etc.), which only read from the host. diskFull must write to the host, so it fails with a read-only file system error (EROFS, not ENOSPC) before the injection even starts.

Root cause

services/chaospod.go:573 — the host VolumeMount unconditionally sets ReadOnly: true:

     {
         Name:      "host",
         MountPath: "/mnt/host",
         ReadOnly:  true,   // ← must be false for diskFull
     },

Fix

  1. Add a hostWritable bool parameter to generateChaosPodSpec

File: services/chaospod.go:466

Change signature:

     func (m *chaosPodService) generateChaosPodSpec(..., hostWritable bool) corev1.PodSpec {

Inside the function, use the parameter:

     {
         Name:      "host",
         MountPath: "/mnt/host",
         ReadOnly:  !hostWritable,
     },

  2. Pass kind == DisruptionKindDiskFull from the call site

File: services/chaospod.go:332

     Spec: m.generateChaosPodSpec(
         targetNodeName,
         terminationGracePeriod,
         activeDeadlineSeconds,
         args,
         hostPathDirectory,
         hostPathFile,
         kind == chaostypes.DisruptionKindDiskFull,  // hostWritable
     ),

Critical files

  • services/chaospod.go: only file to modify

@Zenithar
Contributor Author

Zenithar commented Apr 16, 2026

Many thanks for the deep investigation. I will fix that ASAP. I still have concerns about making the entire host filesystem writable just to write a ballast file in one dedicated directory. It would allow anyone with access to the pod to alter the disrupted pod or node for purposes other than the intended disruption.

I will propose a security gate.

@Zenithar Zenithar marked this pull request as draft April 16, 2026 09:35
@aymericDD
Contributor

Could you also add an example file for testing the disruption locally:

example/disk_full.yaml

# Unless explicitly stated otherwise all files in this repository are licensed
# under the Apache License Version 2.0.
# This product includes software developed at Datadog (https://www.datadoghq.com/).
# Copyright 2026 Datadog, Inc.

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: disk-full
  namespace: chaos-demo
spec:
  level: pod
  selector:
    service: demo-curl
  count: 1
  duration: 10m
  diskFull:
    path: "/mnt/data"
    capacity: "95%"

@aymericDD
Contributor

Could you also update examples/complete.yaml, please?

@aymericDD
Contributor

Could you also update docs/README.md to add a link to the docs/disk_full.md disruption page, please?

Comment thread api/v1beta1/disruption_types.go Outdated
Comment thread api/v1beta1/disruption_types.go Outdated
Comment thread api/v1beta1/disruption_types.go
Comment thread cli/injector/disk_full.go
Comment thread injector/disk_full.go
… address PR comments.

Add diskFull to 5 missing registration points in validateGlobalDisruptionScope
(at-least-one-kind check, ContainerFailure/NodeFailure/PodReplacement
compatibility, OnInit compatibility), DisruptionCount(), and Explain().

Add writable shadow mount for the target path in chaos pod spec so the
injector can write ballast files while keeping /mnt/host read-only.

Add capacity mode test coverage, disk_full example, complete.yaml entry,
and docs/README.md link.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
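
The writable shadow mount mentioned in the commit message above could look roughly like the following. This is a sketch under stated assumptions: volumeMount is a local stand-in for the three corev1.VolumeMount fields quoted earlier in the thread, and the mount name and path for the target ("disk-full-target", "/mnt/disk-full-target") are hypothetical, since the commit message does not spell them out.

```go
package main

import "fmt"

// volumeMount mirrors the corev1.VolumeMount fields used in this thread.
// It is a local stand-in so the sketch needs no Kubernetes dependency.
type volumeMount struct {
	Name      string
	MountPath string
	ReadOnly  bool
}

// chaosPodMounts keeps the host root mounted read-only and, only when a
// diskFull target path is given, adds one extra writable mount scoped to
// that target. The extra mount's name and path are hypothetical.
func chaosPodMounts(targetPath string) []volumeMount {
	mounts := []volumeMount{
		{Name: "host", MountPath: "/mnt/host", ReadOnly: true},
	}
	if targetPath != "" {
		mounts = append(mounts, volumeMount{
			Name:      "disk-full-target",
			MountPath: "/mnt/disk-full-target",
			ReadOnly:  false,
		})
	}
	return mounts
}

func main() {
	for _, m := range chaosPodMounts("/data") {
		fmt.Printf("%s -> %s (readOnly=%v)\n", m.Name, m.MountPath, m.ReadOnly)
	}
}
```

Compared with flipping ReadOnly on the whole /mnt/host mount, scoping write access to a single extra mount limits what the injector pod can modify, which is the concern raised earlier in the conversation.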