This project demonstrates how to build a container by hand from low-level Linux kernel features. A container is essentially an ordinary process that is isolated and constrained by several kernel mechanisms: namespaces, cgroups, chroot/pivot_root, seccomp, and capabilities.
Namespaces isolate global system resources. Each namespace wraps one kind of global resource so that the processes inside it see their own private instance of that resource.
Linux provides several namespace types:
- PID Namespace: Isolates process IDs. The first process in a new PID namespace gets PID 1.
- Network Namespace: Isolates network stack (interfaces, routing tables, firewall rules).
- Mount Namespace: Isolates filesystem mount points.
- UTS Namespace: Isolates hostname and domain name.
- IPC Namespace: Isolates inter-process communication resources.
- User Namespace: Isolates user and group IDs.
- Cgroup Namespace: Virtualizes the view of the cgroup hierarchy, so a process sees its own cgroup as the root.
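- Time Namespace: Isolates the monotonic and boot-time clocks (CLOCK_MONOTONIC, CLOCK_BOOTTIME) through per-namespace offsets; the wall-clock time (CLOCK_REALTIME) is not virtualized.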
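
One way to observe these namespaces is through the symlinks in /proc/<pid>/ns, whose targets have the form type:[inode]. Two processes share a namespace when the inode numbers match. The following is a minimal sketch and not part of run-container.sh; the helper names ns_inode and list_namespaces are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from run-container.sh.

# Extract the inode number from a namespace link target such as "pid:[4026531836]".
ns_inode() {
    local target="$1"
    target=${target#*\[}    # strip everything up to and including "["
    target=${target%\]}     # strip the trailing "]"
    printf '%s\n' "$target"
}

# Print "<type> <inode>" for each namespace of a process (default: this shell).
list_namespaces() {
    local pid="${1:-$$}" link
    for link in /proc/"$pid"/ns/*; do
        [ -L "$link" ] || continue   # skip when /proc is unavailable or the glob did not match
        printf '%s %s\n' "${link##*/}" "$(ns_inode "$(readlink "$link")")"
    done
}
```

Comparing the output of `readlink /proc/self/ns/uts` on the host with the same command inside `sudo unshare --uts sh` should show two different inode numbers, since the second shell runs in a new UTS namespace.
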
Control groups (cgroups) limit and account for the resource usage of groups of processes. They can control:
- CPU: Limit CPU shares and bandwidth.
- Memory: Limit memory usage and generate OOM notifications.
- Devices: Control access to devices.
- Pids: Limit number of processes.
- IO: Control block device I/O.
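
As a rough sketch of how such limits are applied, the snippet below assumes the cgroup v2 unified hierarchy, where each limit is a plain file (memory.max, cpu.max, pids.max) inside the cgroup's directory. Whether run-container.sh uses cgroup v1 or v2 is not stated here, and the function names cpu_max_value and write_limits are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; assumes the cgroup v2 file layout.

# Convert a CPU percentage into a cpu.max value: "<quota> <period>" in microseconds.
cpu_max_value() {
    local percent="$1" period="${2:-100000}"
    printf '%d %d\n' $(( percent * period / 100 )) "$period"
}

# Write memory, CPU, and PID limits into a cgroup directory.
# On a real host the directory lives under /sys/fs/cgroup and writing requires root.
write_limits() {
    local dir="$1" mem_bytes="$2" cpu_percent="$3" max_pids="$4"
    printf '%s\n' "$mem_bytes" > "$dir/memory.max"
    cpu_max_value "$cpu_percent" > "$dir/cpu.max"
    printf '%s\n' "$max_pids" > "$dir/pids.max"
}
```

On a real host this might look like `mkdir /sys/fs/cgroup/demo`, then `write_limits /sys/fs/cgroup/demo 268435456 50 64`, then `echo $$ > /sys/fs/cgroup/demo/cgroup.procs`, all as root. Here demo is a placeholder name, and the memory, cpu, and pids controllers must be enabled in the parent's cgroup.subtree_control.
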
Filesystem isolation restricts a process's view of the filesystem hierarchy:
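- chroot: Changes the root directory of the calling process; child processes inherit it. The mount table is left untouched, and a process that holds CAP_SYS_CHROOT can escape.
- pivot_root: Swaps the root mount of the current mount namespace for a new one, so the old root can be unmounted entirely. More involved than chroot, but a more complete way to change the root filesystem.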
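
The pivot_root approach can be sketched as follows. This is an illustration under stated assumptions and not the method run-container.sh uses (the script uses chroot): it assumes the util-linux mount and pivot_root commands, root privileges inside a new mount namespace, and a new root that already contains umount and rmdir. The function names and the .old_root directory name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from run-container.sh.

# Create the empty directories a minimal root filesystem usually needs.
make_rootfs_skeleton() {
    local root="$1" dir
    for dir in bin dev etc proc sys tmp; do
        mkdir -p "$root/$dir"
    done
}

# Switch the root mount to $1 with pivot_root, then detach the old root.
# Needs root inside a new mount namespace (for example under `unshare --mount`);
# the new root must already provide umount and rmdir (for example via BusyBox).
enter_rootfs() {
    local root="$1"
    mount --make-rprivate /               # keep mount events from propagating to the host
    mount --bind "$root" "$root"          # pivot_root requires the new root to be a mount point
    cd "$root" || return 1
    mkdir -p .old_root
    pivot_root . .old_root || return 1    # the old root is now visible at /.old_root
    cd / || return 1
    umount -l /.old_root                  # lazily detach the host filesystem
    rmdir /.old_root
}
```

Only make_rootfs_skeleton is safe to run as an unprivileged user; enter_rootfs changes the root of the calling shell and is meant to be called from a script that has already entered its own mount namespace.
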
Seccomp restricts the system calls a process can make, reducing the attack surface:
- Strict Mode: Allows only the read(), write(), _exit(), and sigreturn() syscalls; any other syscall, including exit_group(), kills the process.
- Filter Mode: Uses BPF filters to allow/deny specific syscalls.
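
The mode a process is running under can be checked from the Seccomp field of /proc/<pid>/status, where 0 means disabled, 1 strict, and 2 filter. The sketch below only inspects the mode and does not install a filter; the function names are hypothetical and not taken from run-container.sh.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from run-container.sh.

# Map the numeric Seccomp field from /proc/<pid>/status to a readable name.
seccomp_mode_name() {
    case "$1" in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

# Report the seccomp mode of a process (default: this shell).
seccomp_mode_of() {
    local pid="${1:-$$}" mode
    mode=$(awk '/^Seccomp:/ {print $2}' "/proc/$pid/status" 2>/dev/null) || mode=""
    seccomp_mode_name "${mode:-unknown}"
}
```

Calling `seccomp_mode_of` in an ordinary login shell typically reports disabled, while a process started under a BPF-based profile reports filter.
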
Capabilities divide privileged operations (traditionally associated with root) into distinct units:
- CAP_CHOWN: Allow changing file ownership.
- CAP_NET_BIND_SERVICE: Allow binding to ports below 1024.
- CAP_SYS_ADMIN: Allow various administrative operations.
- And many more...
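
Each capability corresponds to one bit in the masks the kernel reports in /proc/<pid>/status (for example CapEff for the effective set). The following sketch decodes such a mask for the three capabilities listed above, whose bit numbers are 0, 10, and 21. The helper names cap_bit, has_cap, and effective_caps are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from run-container.sh.

# Bit number of a capability (only the three named in this README).
cap_bit() {
    case "$1" in
        CAP_CHOWN) echo 0 ;;
        CAP_NET_BIND_SERVICE) echo 10 ;;
        CAP_SYS_ADMIN) echo 21 ;;
        *) return 1 ;;
    esac
}

# Succeed if the hexadecimal mask $1 contains capability $2.
has_cap() {
    local mask="$1" bit
    bit=$(cap_bit "$2") || return 2    # unknown capability name
    mask=${mask#0x}                    # accept masks with or without a 0x prefix
    [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]
}

# Print the effective capability mask of a process (default: the caller).
effective_caps() {
    awk '/^CapEff:/ {print $2}' "/proc/${1:-self}/status"
}
```

For instance, `has_cap "$(effective_caps)" CAP_SYS_ADMIN && echo "still privileged"` gives a quick check of whether dropping capabilities had the intended effect.
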
Our demonstration script (run-container.sh) implements the following:
- Creates cgroups for resource limiting (memory, CPU, PIDs).
- Sets up a network namespace with virtual interfaces for networking.
- Prepares a seccomp profile to restrict dangerous syscalls.
- Sets up mount points for the container.
- Uses unshare to create isolated namespaces.
- Configures capabilities by dropping unnecessary ones.
- Uses chroot for filesystem isolation.
- Runs a test program inside the isolated container.
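
The namespace and chroot steps from this list can be combined in a single unshare invocation. The sketch below is an assumption about how that might look, not an excerpt from run-container.sh: it relies on the util-linux unshare command and an already prepared root filesystem, and it leaves out the cgroup, networking, seccomp, and capability steps. The names run_isolated and DRY_RUN are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch; the real run-container.sh may be structured differently.

# Run a command in new mount, UTS, IPC, network, and PID namespaces, chrooted into $1.
# Set DRY_RUN=echo to print the command instead of executing it (execution needs root).
run_isolated() {
    local rootfs="$1"
    shift
    ${DRY_RUN:-} unshare --mount --uts --ipc --net --pid --fork \
        chroot "$rootfs" "$@"
}
```

Because `--pid` only affects children of the calling process, `--fork` is needed so that the chrooted command becomes PID 1 of the new PID namespace. Running `DRY_RUN=echo run_isolated ./rootfs /bin/sh` prints the command line without requiring root; ./rootfs is a placeholder path.
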
# Run the container demo (requires root)
sudo ./run-container.sh

Production container engines such as Docker, containerd, and CRI-O use the same underlying Linux kernel features but add:
- Image management (layers, registries)
- Advanced networking options
- Volume management
- Resource orchestration
- Logging and monitoring
- Security hardening
This project focuses on the core isolation mechanisms for educational purposes.
Suggested next steps:
- Study the Linux kernel documentation for namespaces, cgroups, seccomp, and capabilities.
- Explore container runtimes like runc
- Learn about OCI (Open Container Initiative) specifications
- Experiment with container orchestration (Kubernetes)