Fix worker-runner communication on cgroup v2 + containerd #194

Draft
hanthor wants to merge 2 commits into buildbarn:main from hanthor:main

hanthor commented May 3, 2026

Problem

On cgroup v2 systems with containerd (Debian 12+, Ubuntu 22.04+, k3s with containerd), Unix sockets in shared emptyDir volumes are not visible across containers in the same pod due to namespace isolation.

This causes bb_worker to fail with:

dial unix /worker/runner: connect: no such file or directory

Even though bb_runner successfully creates the socket in the shared volume, the worker container cannot see it.

Solution

Switch worker→runner communication from Unix socket to TCP:

  1. Runner config: listenPaths: ["/worker/runner"] → listenAddresses: [":50051"]

    • listenPaths is Unix-socket-only; listenAddresses uses TCP
  2. Worker config: address: "unix:///worker/runner" → address: "127.0.0.1:50051"

    • IPv4 explicitly to avoid IPv6 resolution to ::1
  3. Worker deployment:

    • Added TCP readiness probe for runner container (tcp-socket :50051)
    • Fixed cache directory permissions (0700 → 0777) for the nobody user
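
For anyone who wants to check the runner port by hand, the TCP readiness probe boils down to "can a TCP connection be opened to the port". Below is a minimal Python sketch of that check. The helper name `tcp_ready` and the one-second timeout are assumptions made for illustration; they are not part of this PR or of Buildbarn.

```python
import socket


def tcp_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper: it mimics what a TCP-socket readiness probe
    checks, namely that something is accepting connections on the port.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError
        # or a timeout) when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 50051 is the runner port used in this PR.
    print("runner reachable:", tcp_ready("127.0.0.1", 50051))
```

Run from inside the worker container, this should report whether the runner is accepting connections on the shared loopback interface.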

Why TCP?

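Both containers are in the same pod and therefore share a network namespace, so 127.0.0.1 in the worker container reaches the runner container. TCP avoids the cgroup v2 isolation issue entirely while keeping the same process isolation model: the worker and runner still run in separate containers.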
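
The reason for writing 127.0.0.1 rather than a hostname can be illustrated with a short Python sketch. A literal IPv4 address resolves to the IPv4 family only, whereas a name such as localhost may resolve to ::1 as well, depending on the host's resolver configuration. The helper name `address_families` is an assumption made for illustration.

```python
import socket


def address_families(host: str, port: int) -> set:
    """Return the set of address families that (host, port) resolves to.

    Hypothetical helper for illustration only.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {info[0] for info in infos}


if __name__ == "__main__":
    # A literal IPv4 address never resolves to an IPv6 address.
    print(address_families("127.0.0.1", 50051))
    # "localhost" may include AF_INET6 (::1); the result is system-dependent.
    print(address_families("localhost", 50051))
```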

Testing

Verified on k3s v1.33.0+k3s1 with containerd on Debian 12:

  • 8 worker pods all 2/2 Ready
  • Worker→runner TCP connection successful
  • No readiness check failures

On cgroup v2 systems with containerd (e.g. Debian 12+, Ubuntu 22.04+),
Unix sockets in shared emptyDir volumes are not visible across containers
in the same pod due to namespace isolation.

This fix switches the bb_runner<->bb_worker communication from Unix
socket to TCP:

1. runner config: listenPaths ['/worker/runner'] -> listenAddresses [':50051']
   (listenPaths is Unix-only, listenAddresses is TCP)

2. worker config: endpoint address 'unix:///worker/runner' -> '127.0.0.1:50051'
   (use IPv4 explicitly to avoid IPv6 resolution issues)

3. worker deployment: add TCP readiness probe for runner container
   and fix cache directory permissions (0700 -> 0777 for nobody user)

hanthor marked this pull request as draft May 3, 2026 04:33

Updated all manifests and configurations to use catthehacker/ubuntu:act-24.04
(Ubuntu 24.04 LTS - Noble Numbat) instead of 22.04. Ubuntu 24.04 is the current
LTS release with better long-term support and updated tooling.

Changes:
- Renamed worker and runner configs: ubuntu22-04 -> ubuntu24-04
- Updated runner container image: act-22.04 -> act-24.04
- Updated image digest to point to 24.04 build
- Updated all deployment/service selectors and labels
- Updated kustomization references

The 24.04 LTS is stable and well-tested, with better support for modern
build tools compared to the older 22.04 LTS.