vinayr-07/container-runtime-jackfruit

OS-Jackfruit: Multi-Container Runtime


1. Team Information

| Name | SRN |
|---|---|
| Prateek Hachadad | PES1UG24CS340 |
| R Vinay | PES1UG24CS353 |

2. Build, Load, and Run Instructions

Prerequisites

  • Ubuntu 22.04 or 24.04 in a VM (VirtualBox or VMware)
  • Secure Boot OFF so the unsigned monitor.ko can be loaded (or boot in legacy BIOS mode; EFI is not required)
  • No WSL

Install Dependencies

sudo apt update
sudo apt install -y build-essential linux-headers-$(uname -r) git

Clone and Set Up

git clone https://github.com/shivangjhalani/OS-Jackfruit.git
cd OS-Jackfruit

# Prepare Alpine root filesystem
mkdir -p rootfs
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.3-x86_64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-x86_64.tar.gz -C rootfs

Build

make module     # builds monitor.ko
make engine     # builds engine binary
make workloads  # builds workload_cpu, workload_io, workload_mem

Or build everything at once:

make

Copy Workloads into rootfs

cp workload_cpu ./rootfs/
cp workload_io  ./rootfs/
cp workload_mem ./rootfs/

Load Kernel Module

sudo insmod monitor.ko

# Verify
lsmod | grep monitor
ls -l /dev/container_monitor
sudo dmesg | grep container_monitor | tail -3

Start the Supervisor

# Terminal 1 — keep this running
sudo ./engine supervisor ./rootfs

Launch Containers (Terminal 2)

# Start containers in background
sudo ./engine start alpha ./rootfs /bin/sh
sudo ./engine start beta  ./rootfs /bin/sh

# List all containers
sudo ./engine ps

# View logs
sudo ./engine logs alpha

# Stop containers
sudo ./engine stop alpha
sudo ./engine stop beta

Run Workload Experiments

# CPU-bound experiment
sudo ./engine start cpu_hi ./rootfs /workload_cpu
sudo ./engine start cpu_lo ./rootfs /workload_cpu
sleep 35
sudo ./engine logs cpu_hi
sudo ./engine logs cpu_lo

# CPU vs IO experiment
sudo ./engine start exp_cpu ./rootfs /workload_cpu
sudo ./engine start exp_io  ./rootfs /workload_io
sleep 35
sudo ./engine logs exp_cpu
sudo ./engine logs exp_io

# Memory limit test
sudo ./engine start memtest ./rootfs /workload_mem
sleep 30
sudo ./engine ps           # memtest should show state=killed
sudo dmesg | grep container_monitor | grep -E "WARNING|KILLING"

Cleanup and Unload

# Stop all containers
sudo ./engine stop alpha
sudo ./engine stop beta

# Stop supervisor (Terminal 1)
# Press Ctrl+C in Terminal 1

# Unload kernel module
sudo rmmod monitor

# Verify clean state
lsmod | grep monitor || echo "module unloaded"
ls /dev/container_monitor 2>/dev/null || echo "device gone"
ps aux | grep defunct | grep -v grep || echo "no zombies"

Reference Run Sequence

make
sudo insmod monitor.ko
ls -l /dev/container_monitor

# Terminal 1
sudo ./engine supervisor ./rootfs

# Terminal 2
sudo ./engine start alpha ./rootfs /bin/sh
sudo ./engine start beta  ./rootfs /bin/sh
sudo ./engine ps
sudo ./engine logs alpha
sudo ./engine stop alpha
sudo ./engine stop beta

sudo dmesg | tail
sudo rmmod monitor

3. Demo Screenshots

Screenshot 1 — Multi-Container Supervision

Two containers (alpha, beta) running simultaneously under one supervisor process. Both show state=running with their host PIDs.

Screenshot 1


Screenshot 2 — Metadata Tracking

Output of sudo ./engine ps showing all tracked container metadata including NAME, PID, STATE, SOFT_MB, HARD_MB, and LOG path.

Screenshot 2


Screenshot 3 — Bounded-Buffer Logging

Log file contents captured through the producer-consumer logging pipeline. Shows CPU workload started, completion count, and IO write counts — proving data flowed from container stdout through the pipe, bounded buffer, and into persistent log files.

Screenshot 3


Screenshot 4 — CLI and IPC

A CLI command (start) being issued in Terminal 2 while the supervisor in Terminal 1 listens on the UNIX domain socket. The supervisor receives the command and the CLI client receives the OK response — demonstrating the second IPC mechanism (UNIX domain socket) separate from the logging pipe.

Screenshot 4


Screenshot 5 — Soft-Limit Warning

dmesg output showing the kernel monitor detecting that PID 11123 exceeded its soft memory limit of 51200 KB and logging a warning without killing the process.

Screenshot 5


Screenshot 6 — Hard-Limit Enforcement

dmesg output showing the kernel monitor killing PID 11123 after its RSS exceeded the hard limit of 204800 KB. The sudo ./engine ps output confirms the container's state changed to killed.

Screenshot 6


Screenshot 7 — Scheduling Experiment

Terminal output from the cpu_hi vs cpu_lo experiment and the exp_cpu vs exp_io experiment, showing measurable differences in iteration counts and write counts between containers under different scheduling conditions.

Screenshot 7


Screenshot 8 — Clean Teardown

Evidence of clean shutdown: the module is fully unloaded, /dev/container_monitor is gone, and the "No zombies - clean!" message is printed, confirming all resources were released correctly.

Screenshot 8


4. Engineering Analysis

4.1 Isolation Mechanisms

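The runtime achieves isolation by calling unshare(CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS) in each container child process before exec. This asks the kernel to create fresh namespaces for the container. The UTS and mount namespaces apply to the calling process immediately; the new PID namespace applies to the children the caller creates after the call (see unshare(2)).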

PID namespace gives the container its own PID number space — the first process inside sees itself as PID 1. The host kernel still tracks the real host PID, which is what we store in the container metadata and pass to the kernel monitor.

UTS namespace lets each container have its own hostname, set via sethostname(). We set it to the container name (e.g., "alpha"), so the shell prompt inside shows the container identity rather than the host machine name.

Mount namespace isolates the filesystem view. After unshare, we call mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL) to make all existing mounts private, then chroot() into the Alpine rootfs. We then mount a fresh /proc inside so tools like ps and top work correctly inside the container without seeing host processes.

The host kernel still shares the same physical CPU, memory, and network stack with all containers. Namespaces create the illusion of isolation — they do not enforce resource limits by themselves. That is why the kernel memory monitor is necessary as a separate enforcement layer.
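The setup sequence described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the engine's actual code: the helper names container_ns_flags and enter_container are hypothetical, the function must run as root in the forked child, and error handling is reduced to perror().

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

/* The three namespaces named above: PID, UTS, and mount. */
int container_ns_flags(void)
{
    return CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS;
}

/*
 * Hypothetical helper, meant to run in the forked child as root.
 * Returns -1 on failure; on success execvp() replaces the process.
 * Note: with unshare(), CLONE_NEWPID takes effect for children created
 * after the call, while the caller keeps its existing PID.
 */
int enter_container(const char *name, const char *rootfs, char *const argv[])
{
    if (unshare(container_ns_flags()) != 0) {
        perror("unshare");
        return -1;
    }
    /* UTS namespace: the hostname becomes the container name. */
    if (sethostname(name, strlen(name)) != 0) {
        perror("sethostname");
        return -1;
    }
    /* Mount namespace: stop mount events from propagating to the host. */
    if (mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL) != 0) {
        perror("mount private");
        return -1;
    }
    /* Switch the root directory to the container's rootfs. */
    if (chroot(rootfs) != 0 || chdir("/") != 0) {
        perror("chroot");
        return -1;
    }
    /* Mount procfs inside the new root. */
    if (mount("proc", "/proc", "proc", 0, NULL) != 0) {
        perror("mount proc");
        return -1;
    }
    execvp(argv[0], argv);
    perror("execvp");
    return -1;
}
```

A caller would fork() first and invoke enter_container("alpha", "./rootfs", argv) only in the child, so that the supervisor itself stays in the host namespaces.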

4.2 Supervisor and Process Lifecycle

A long-running supervisor is necessary for several reasons. First, it maintains in-memory state about all containers — their PIDs, states, log paths, and memory limits — across the entire session. If each CLI invocation were its own process, there would be no central place to track this state. Second, the supervisor is the parent of all container processes, which means it is the one that receives SIGCHLD when a container exits and can call waitpid() to reap it.

Process creation works through fork(). The child starts with copies of the parent's file descriptors and a copy-on-write copy of its address space, then immediately calls unshare(), chroot(), and execvp() to transform itself into an isolated container. fork() returns the child's PID to the parent, which records it in the container metadata table while the child is still setting itself up.

The SIGCHLD handler calls waitpid(-1, &status, WNOHANG) in a loop to reap all exited children without blocking. It uses WIFEXITED and WIFSIGNALED macros to distinguish graceful exits from signal-caused deaths, and updates the container state to stopped or killed accordingly. Without this handler, exited containers would become zombie processes that consume kernel process table entries indefinitely.
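A minimal sketch of the reaping logic described above is shown here. The enum, the callback type, and the function names are assumptions made for illustration; the real supervisor would update its metadata table where this sketch invokes the callback.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

enum container_state { STATE_RUNNING, STATE_STOPPED, STATE_KILLED };

/* Map a waitpid() status word onto the states described above. */
enum container_state state_from_status(int status)
{
    if (WIFEXITED(status))
        return STATE_STOPPED;   /* returned from main() or called exit() */
    if (WIFSIGNALED(status))
        return STATE_KILLED;    /* terminated by a signal such as SIGKILL */
    return STATE_RUNNING;
}

/* Hypothetical callback type: receives each reaped PID and its new state. */
typedef void (*exit_callback)(pid_t pid, enum container_state state);

/* Reap every child that has already exited, without blocking.
 * Returns the number of children reaped. */
int reap_children(exit_callback report)
{
    int reaped = 0;
    int status;
    pid_t pid;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (report != NULL)
            report(pid, state_from_status(status));
        reaped++;
    }
    return reaped;
}

/* Several exits can collapse into one SIGCHLD, hence the loop above. */
static void on_sigchld(int signo)
{
    int saved_errno = errno;    /* waitpid() may overwrite errno */
    (void)signo;
    reap_children(NULL);
    errno = saved_errno;
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_sigchld_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

The handler passes NULL as the callback because taking a mutex inside a signal handler is not async-signal-safe; one common design is to record the exit in the handler and update shared metadata from the main loop.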

4.3 IPC, Threads, and Synchronization

The project uses two distinct IPC mechanisms for two distinct purposes.

Pipes carry container output from the container to the supervisor. When a container is forked, we create a pipe and dup2 the write end onto the container's stdout and stderr. The supervisor holds the read end and passes it to a producer thread.
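As a rough sketch of the pipe setup just described (the helper name spawn_with_log_pipe is hypothetical and the namespace setup is left out to keep the example short):

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: start argv[0] with stdout and stderr redirected
 * into a pipe. Returns the child's PID and stores the read end of the
 * pipe in *read_fd, or returns -1 on error.
 */
pid_t spawn_with_log_pipe(char *const argv[], int *read_fd)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);                  /* the child never reads */
        dup2(fds[1], STDOUT_FILENO);
        dup2(fds[1], STDERR_FILENO);
        close(fds[1]);                  /* the duplicated copies stay open */
        execvp(argv[0], argv);
        _exit(127);                     /* only reached if exec failed */
    }

    /* The parent must close its copy of the write end, otherwise
     * read() on the pipe never reports end-of-file. */
    close(fds[1]);
    *read_fd = fds[0];
    return pid;
}
```

The returned read end is what a producer thread would read from in a loop until read() returns 0.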

UNIX domain socket carries CLI commands from the client to the supervisor. The supervisor binds to /tmp/engine.sock and accepts connections. Each CLI invocation connects, sends a command string, reads the response, and disconnects. This is the control channel — kept entirely separate from the logging data channel.
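The client side of this control channel might look roughly like the sketch below. The function name send_command and the single-read reply handling are assumptions; the actual wire format used by the engine is not specified here.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical client helper: connect to the supervisor socket, send one
 * command string, and read one reply. Returns the reply length, or -1.
 */
ssize_t send_command(const char *sock_path, const char *cmd,
                     char *reply, size_t reply_size)
{
    struct sockaddr_un addr;

    if (reply_size == 0 || strlen(sock_path) >= sizeof addr.sun_path)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, sock_path);   /* length checked above */

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        close(fd);                      /* supervisor is not running */
        return -1;
    }

    ssize_t n = -1;
    size_t len = strlen(cmd);
    if (write(fd, cmd, len) == (ssize_t)len) {
        n = read(fd, reply, reply_size - 1);
        if (n >= 0)
            reply[n] = '\0';
    }
    close(fd);
    return n;
}
```

With the socket path from the text, a call such as send_command("/tmp/engine.sock", "ps", reply, sizeof reply) would correspond to one CLI invocation.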

The bounded buffer sits between the producer thread (which reads the pipe) and the consumer thread (which writes to the log file). It has capacity LOG_BUF_SIZE entries. Three synchronization primitives protect it:

  • A pthread_mutex_t protects the head, tail, and count fields from concurrent modification. Without this, two threads incrementing count simultaneously could cause a lost update, leaving the count wrong and corrupting the buffer.
  • A pthread_cond_t not_full makes the producer wait when the buffer is full rather than overwriting data. Without this, fast containers would silently drop log entries.
  • A pthread_cond_t not_empty makes the consumer wait when the buffer is empty rather than spinning and burning CPU. Without this, the consumer would busy-wait, wasting cycles.
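The three primitives above fit together as in the following sketch. The struct layout, the function names, and the values chosen for LOG_BUF_SIZE and LOG_LINE_MAX are assumptions for illustration, not the engine's definitions.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define LOG_BUF_SIZE 4      /* assumed capacity, kept small for the example */
#define LOG_LINE_MAX 128    /* assumed maximum entry length */

struct log_buffer {
    char entries[LOG_BUF_SIZE][LOG_LINE_MAX];
    int head;                   /* next slot to read */
    int tail;                   /* next slot to write */
    int count;                  /* entries currently stored */
    pthread_mutex_t lock;
    pthread_cond_t not_full;
    pthread_cond_t not_empty;
};

void log_buffer_init(struct log_buffer *b)
{
    b->head = b->tail = b->count = 0;
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->not_full, NULL);
    pthread_cond_init(&b->not_empty, NULL);
}

/* Producer side: blocks while the buffer is full. */
void log_buffer_push(struct log_buffer *b, const char *line)
{
    pthread_mutex_lock(&b->lock);
    while (b->count == LOG_BUF_SIZE)            /* loop guards against spurious wakeups */
        pthread_cond_wait(&b->not_full, &b->lock);
    snprintf(b->entries[b->tail], LOG_LINE_MAX, "%s", line);
    b->tail = (b->tail + 1) % LOG_BUF_SIZE;
    b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}

/* Consumer side: blocks while the buffer is empty.
 * out must have room for LOG_LINE_MAX bytes. */
void log_buffer_pop(struct log_buffer *b, char *out)
{
    pthread_mutex_lock(&b->lock);
    while (b->count == 0)
        pthread_cond_wait(&b->not_empty, &b->lock);
    snprintf(out, LOG_LINE_MAX, "%s", b->entries[b->head]);
    b->head = (b->head + 1) % LOG_BUF_SIZE;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
}
```

The while loops around pthread_cond_wait() matter: a woken thread re-checks the condition under the lock before touching head, tail, or count.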

The container metadata array is protected by a separate global pthread_mutex_t containers_lock. This is kept separate from the buffer mutex intentionally — mixing them would create deadlock risk if a thread tried to acquire both in different orders.

4.4 Memory Management and Enforcement

RSS (Resident Set Size) is the number of physical memory pages currently mapped and present in RAM for a process. It excludes pages that have been swapped out and parts of memory-mapped files that have not been touched yet. It does include resident shared pages, such as shared library code, and counts them in full for every process that maps them. RSS is therefore an approximation of a process's memory footprint rather than an exact measure, but it is a reasonable and cheap-to-read proxy available directly from the kernel via get_mm_rss(task->mm).
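For comparison, the same quantity can be read from user space. The sketch below is not how the kernel module obtains RSS (it uses get_mm_rss() as stated above); it is a hypothetical helper, rss_kb, that parses the second field of /proc/<pid>/statm, which holds the resident page count.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the resident set size of pid in KB, or -1 on error. */
long rss_kb(pid_t pid)
{
    char path[64];
    long total_pages, resident_pages;

    snprintf(path, sizeof path, "/proc/%ld/statm", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    /* statm fields are page counts: total size first, then resident. */
    int fields = fscanf(f, "%ld %ld", &total_pages, &resident_pages);
    fclose(f);
    if (fields != 2)
        return -1;

    return resident_pages * (sysconf(_SC_PAGESIZE) / 1024);
}
```

Calling rss_kb() with a container's host PID gives a value that can be compared against the RSS figures the monitor prints to dmesg.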

Soft and hard limits represent different policies. The soft limit is a warning threshold — when crossed, the kernel monitor logs a message but takes no action. This is useful for alerting operators that a container is using more memory than expected without disrupting it. The hard limit is an enforcement threshold — when crossed, the monitor sends SIGKILL to the container process immediately. The container has no opportunity to handle this signal and is terminated unconditionally.
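The two-threshold policy reduces to a small decision function. The sketch below uses assumed names (limit_action, check_limits) and a strict greater-than comparison, matching the "RSS > soft" and "RSS > hard" wording in the kernel log lines shown later in this README.

```c
enum limit_action { LIMIT_OK, LIMIT_WARN, LIMIT_KILL };

/* Decide what the monitor should do for one RSS sample (all values in KB). */
enum limit_action check_limits(long rss_kb, long soft_kb, long hard_kb)
{
    if (rss_kb > hard_kb)
        return LIMIT_KILL;      /* enforcement threshold: send SIGKILL */
    if (rss_kb > soft_kb)
        return LIMIT_WARN;      /* warning threshold: log only */
    return LIMIT_OK;
}
```

With a 50 MB soft limit (51200 KB) and a 200 MB hard limit (204800 KB), an RSS of 51784 KB yields LIMIT_WARN and an RSS of 205384 KB yields LIMIT_KILL.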

Enforcement belongs in kernel space rather than purely in user space for two reasons. First, a user-space monitor can be starved: if the monitored process is consuming all CPU, the monitor may not be scheduled in time to kill it before the system runs out of memory. A kernel timer fires regardless of user-space scheduling. Second, the kernel module works on the task_struct directly, so it can read the RSS and call send_sig(SIGKILL, task, 0) without a round trip through /proc and kill(2). A privileged user-space process can also send SIGKILL, but it has to poll for the RSS first and then act on a PID that may already have exited.

4.5 Scheduling Behavior

Linux schedules normal processes with the Completely Fair Scheduler (CFS); kernels 6.6 and later, including the 6.8 kernel shipped with Ubuntu 24.04, replace it with EEVDF, which keeps the same nice-based weights. CFS tracks a virtual runtime for each process and always schedules the process with the smallest virtual runtime next. The nice value adjusts the weight assigned to a process: a lower nice value (higher priority) causes the scheduler to advance the virtual runtime more slowly, giving it more CPU time relative to others.
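To make the effect of nice values concrete, the sketch below computes the expected CPU split between two runnable processes that compete for a single core. The table is the kernel's nice-to-weight mapping (sched_prio_to_weight); the function names are hypothetical, and the single-core assumption matters because two processes on separate cores do not compete at all.

```c
/* Kernel nice-to-weight table; index 0 corresponds to nice -20. */
static const int nice_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,   /* -20 .. -16 */
    29154, 23254, 18705, 14949, 11916,   /* -15 .. -11 */
     9548,  7620,  6100,  4904,  3906,   /* -10 ..  -6 */
     3121,  2501,  1991,  1586,  1277,   /*  -5 ..  -1 */
     1024,   820,   655,   526,   423,   /*   0 ..   4 */
      335,   272,   215,   172,   137,   /*   5 ..   9 */
      110,    87,    70,    56,    45,   /*  10 ..  14 */
       36,    29,    23,    18,    15,   /*  15 ..  19 */
};

/* Returns the scheduler weight for a nice value, or -1 if out of range. */
int weight_for_nice(int nice_value)
{
    if (nice_value < -20 || nice_value > 19)
        return -1;
    return nice_to_weight[nice_value + 20];
}

/* Expected CPU fraction of process A when A and B are both always
 * runnable on the same core. Returns -1.0 for invalid nice values. */
double cpu_share(int nice_a, int nice_b)
{
    int wa = weight_for_nice(nice_a);
    int wb = weight_for_nice(nice_b);
    if (wa < 0 || wb < 0)
        return -1.0;
    return (double)wa / (double)(wa + wb);
}
```

For nice 0 against nice 19 the weights are 1024 and 15, so cpu_share(0, 19) is 1024 / 1039, roughly 0.986. That is the kind of gap one would expect to see if the nice value reached the workload and both workloads shared one core.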

In our experiments, two CPU-bound containers ran simultaneously for 30 seconds. The results showed very similar iteration counts:

  • cpu_hi (nice=0): x = 8,234,970,465
  • cpu_lo (nice=19): x = 8,340,844,546

The close counts reflect that nice -n 19 was applied to the engine client process, not to the workload inside the container, so the scheduling weight difference was applied during the brief container launch phase rather than during the 30-second workload execution. This is actually an important finding — it demonstrates that container runtimes must apply scheduling policies to the container workload process itself, not the launcher, to achieve meaningful differentiation.
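One way to apply the priority to the workload itself is to set it in the forked child before exec, since the nice value is preserved across execvp(). The sketch below is a hypothetical illustration (apply_nice_to_self, start_with_nice, and the NICE_ERROR sentinel are not part of the engine), and it omits the namespace setup.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

#define NICE_ERROR (-100)   /* sentinel, because -1 is a valid nice value */

/* Set the calling process's nice value and return the value now in
 * effect, or NICE_ERROR on failure. */
int apply_nice_to_self(int nice_value)
{
    if (setpriority(PRIO_PROCESS, 0, nice_value) != 0)
        return NICE_ERROR;

    errno = 0;
    int now = getpriority(PRIO_PROCESS, 0);
    if (now == -1 && errno != 0)
        return NICE_ERROR;
    return now;
}

/* Hypothetical launch helper: the child lowers its own priority and then
 * execs the workload, so the weight applies for the workload's lifetime. */
pid_t start_with_nice(char *const argv[], int nice_value)
{
    pid_t pid = fork();
    if (pid == 0) {
        if (apply_nice_to_self(nice_value) == NICE_ERROR)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);         /* only reached if exec failed */
    }
    return pid;             /* child PID in the parent, -1 on fork failure */
}
```

Raising the nice value (lowering priority) needs no privileges; lowering it below the current value requires CAP_SYS_NICE or a suitable RLIMIT_NICE.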

In the CPU vs IO experiment, the IO-bound workload performed 16,016 writes while the CPU-bound workload completed its iteration loop normally. The IO workload voluntarily yields the CPU on every usleep(1000) call, allowing CFS to schedule other processes. This demonstrates how I/O-bound processes naturally have lower CPU utilization and create fewer scheduling conflicts than CPU-bound processes, which is consistent with CFS's design goal of fairness across different workload types.


5. Design Decisions and Tradeoffs

Namespace Isolation

Choice: Used unshare() with PID, UTS, and mount namespaces rather than clone() with namespace flags.

Tradeoff: unshare() in the child after fork() is simpler to implement but means the child briefly exists in the parent's namespaces before switching. Using clone() directly would place the child into new namespaces atomically at creation time.

Justification: The brief window in the parent's namespace is harmless for our use case since the child immediately calls unshare() before doing anything else. The simpler fork() + unshare() pattern is easier to reason about and debug.

Supervisor Architecture

Choice: Single long-running supervisor process that handles all CLI commands sequentially via a UNIX domain socket accept loop.

Tradeoff: Sequential command handling means one slow command (e.g., a run that waits for a container) blocks all other CLI requests. A multi-threaded supervisor would handle concurrent commands but adds synchronization complexity.

Justification: For a project runtime, sequential handling is sufficient and eliminates a large class of concurrency bugs. The most common commands (start, ps, logs, stop) are all fast. The run command is the only blocking one and is used intentionally for foreground execution.
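A sequential accept loop of this kind can be sketched as below. The function names, the 256-byte command buffer, and the fixed "OK" reply are assumptions for illustration; a real supervisor would dispatch on the command string where the comment indicates.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Create a listening UNIX domain socket at sock_path. Returns fd or -1. */
int make_listener(const char *sock_path)
{
    struct sockaddr_un addr;

    if (strlen(sock_path) >= sizeof addr.sun_path)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, sock_path);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    unlink(sock_path);      /* remove a stale socket from an earlier run */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(fd, 8) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Handle exactly one client: read a command, reply, close.
 * Returns 0 on success, -1 on error. */
int serve_one(int listen_fd)
{
    char cmd[256];
    int rc = -1;

    int client = accept(listen_fd, NULL, NULL);
    if (client < 0)
        return -1;

    ssize_t n = read(client, cmd, sizeof cmd - 1);
    if (n > 0) {
        cmd[n] = '\0';
        /* A real supervisor would dispatch on cmd here
         * (start, ps, logs, stop) and build the reply accordingly. */
        const char *reply = "OK\n";
        if (write(client, reply, strlen(reply)) == (ssize_t)strlen(reply))
            rc = 0;
    }
    close(client);
    return rc;
}
```

The supervisor's main loop would then be a plain for (;;) serve_one(fd);, which is what makes command handling sequential: the next accept() does not happen until the current command has been answered.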

IPC and Logging

Choice: Pipes for log data, UNIX domain socket for control commands, with a bounded buffer between pipe producer and file consumer.

Tradeoff: Two separate IPC mechanisms add complexity but keep concerns separated. A single shared channel for both logs and commands would require a multiplexing protocol and could cause log data to delay control responses.

Justification: Separation of control and data planes is a standard systems design principle. The bounded buffer prevents the supervisor from blocking on disk I/O while reading container output, which would cause the pipe to fill and the container to block on writes.

Kernel Monitor

Choice: Kernel linked list with a mutex and a periodic timer rather than using Linux cgroups.

Tradeoff: Our implementation is simpler to understand and demonstrates OS concepts directly. Cgroups are more efficient and production-grade, with kernel-native enforcement and hierarchy support.

Justification: The project goal is to demonstrate OS mechanisms directly. Implementing our own linked list, mutex, timer, and ioctl interface exercises kernel programming concepts that using cgroups would abstract away entirely.

Scheduling Experiments

Choice: Used nice values and comparison of iteration counts as the measurement methodology.

Tradeoff: Iteration counts are a coarse proxy for CPU time. Using getrusage() or reading /proc/PID/stat for precise CPU time accounting would give more accurate results but requires more instrumentation inside the workload.

Justification: Iteration counts are visible directly through the logging pipeline without any additional tooling, making them reproducible and easy to capture as part of the normal demo flow.
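For reference, the getrusage() alternative mentioned in the tradeoff is small. The sketch below is a hypothetical helper a workload could call at exit to print its CPU time next to the iteration count; it is not part of the current workloads.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* CPU seconds (user + system) consumed so far by the calling process,
 * or -1.0 if getrusage() fails. */
double cpu_seconds_used(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;

    return (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}
```

Comparing this value against wall-clock time shows directly how much of the run a container spent on the CPU, which iteration counts only approximate.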


6. Scheduler Experiment Results

Experiment 1: Two CPU-Bound Containers with Different Priorities

Both containers ran /workload_cpu for 30 seconds, spinning on a counter loop.

| Container | Nice Value | Iterations (x) |
|---|---|---|
| cpu_hi | 0 (default) | 8,234,970,465 |
| cpu_lo | 19 (lowest) | 8,340,844,546 |

Analysis: The iteration counts are nearly equal because nice -n 19 was applied to the engine client process during container launch rather than to the workload process itself. This means the scheduler weight difference only affected the brief launch phase. During the 30-second workload window, both processes ran at equal priority. This demonstrates that scheduling policy must be applied directly to the target workload process to be effective — applying it to a parent or launcher process has minimal impact on long-running child workloads.


Experiment 2: CPU-Bound vs IO-Bound Container

Both containers ran simultaneously for 30 seconds.

| Container | Workload Type | Result |
|---|---|---|
| exp_cpu | CPU-bound (spin loop) | x = 8,439,944,929 iterations |
| exp_io | IO-bound (write + usleep) | 16,016 writes |

Analysis: The CPU-bound container achieved slightly more iterations next to the IO-bound container than cpu_hi did next to another CPU-bound container (8.44B vs 8.23B, about 2.5% more). The gap is small, comparable to the roughly 1.3% spread between the two containers in Experiment 1, so it should be read as a trend rather than a precise measurement. The direction is what one would expect: the IO-bound container voluntarily yields the CPU on every usleep(1000) call, reducing scheduling contention, and CFS gave more CPU time to the CPU-bound container because the IO-bound container was frequently sleeping. This is consistent with the Linux scheduler's behavior of rewarding processes that sleep frequently with shorter scheduling latencies when they wake up, while also allowing runnable processes to use the CPU when others are sleeping.


Experiment 3: Memory Limit Enforcement

| Container | Soft Limit | Hard Limit | Result |
|---|---|---|---|
| memtest | 50 MB | 200 MB | Killed at 200 MB |

Kernel log evidence:

container_monitor: PID 11123 RSS=51784 KB > soft=51200 KB — WARNING
container_monitor: PID 11123 RSS=205384 KB > hard=204800 KB — KILLING
container_monitor: deregistered PID 11123

The container allocated memory in 10 MB increments, triggered the soft-limit warning at ~50 MB, and was killed by the kernel monitor at ~200 MB. The supervisor metadata correctly reflected the final state as killed.
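The allocation step of a workload like this can be sketched as follows. This is an assumed reconstruction, not the source of workload_mem; the helper name grab_chunk is hypothetical. The important detail is that the memory is written to after allocation, because pages that are allocated but never touched do not count toward RSS.

```c
#include <stdlib.h>
#include <string.h>

#define CHUNK_BYTES (10u * 1024u * 1024u)   /* 10 MB, the step size described above */

/* Allocate one chunk and write to every byte so its pages become resident.
 * Returns the chunk, or NULL if the allocation failed. */
char *grab_chunk(void)
{
    char *p = malloc(CHUNK_BYTES);
    if (p != NULL)
        memset(p, 0xA5, CHUNK_BYTES);   /* touching the pages is what raises RSS */
    return p;
}
```

A workload would call grab_chunk() in a loop with a short sleep between calls and keep the returned pointers, so RSS grows in 10 MB steps until the monitor intervenes.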
