thuanpham582002/container-deep-dive

Manual Container Implementation

This project demonstrates how to build a container by hand from low-level Linux kernel features. A container is essentially an ordinary process that the kernel isolates and constrains through several mechanisms: namespaces, cgroups, chroot/pivot_root, seccomp, and capabilities.

Key Concepts

1. Namespaces

Namespaces isolate global system resources. Each namespace wraps one kind of global resource so that the processes inside it see their own private instance of that resource.

Linux provides several namespace types:

  • PID Namespace: Isolates process IDs. The first process in a new PID namespace gets PID 1.
  • Network Namespace: Isolates network stack (interfaces, routing tables, firewall rules).
  • Mount Namespace: Isolates filesystem mount points.
  • UTS Namespace: Isolates hostname and domain name.
  • IPC Namespace: Isolates inter-process communication resources.
  • User Namespace: Isolates user and group IDs.
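  • Cgroup Namespace: Virtualizes the view of the cgroup hierarchy, so a process sees its own cgroup as the root.
  • Time Namespace: Isolates the offsets of the monotonic and boot-time clocks (CLOCK_MONOTONIC, CLOCK_BOOTTIME). The wall-clock time (CLOCK_REALTIME) is not isolated.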
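To see namespace membership concretely, the kernel exposes one symbolic link per namespace type under /proc/<pid>/ns. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper names ns_id and same_ns are hypothetical and are not part of run-container.sh.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting namespace membership via /proc.

# Print the namespace identifier of a process, e.g. "uts:[4026531838]".
# $1 = PID, $2 = namespace type (pid, net, mnt, uts, ipc, user, cgroup, time)
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# Succeed if two processes share the namespace of the given type.
# $1 = first PID, $2 = second PID, $3 = namespace type
same_ns() {
    [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
}
```

For example, `same_ns $$ 1 net` succeeds when the current shell shares its network namespace with PID 1. Reading another user's /proc/<pid>/ns entries generally requires root.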

2. Control Groups (cgroups)

Cgroups limit and account for resource usage of process groups. They can control:

  • CPU: Limit CPU shares and bandwidth.
  • Memory: Limit memory usage and generate OOM notifications.
  • Devices: Control access to devices.
  • Pids: Limit number of processes.
  • IO: Control block device I/O.
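As an illustration of how such limits are set, the sketch below writes memory, CPU, and PID limits through the cgroup v2 file interface (memory.max, cpu.max, pids.max). It assumes the unified cgroup v2 hierarchy; run-container.sh may do this differently or use cgroup v1. The function name write_limits and its parameters are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write cgroup v2 limits into a cgroup directory.
# $1 = cgroup directory (e.g. /sys/fs/cgroup/demo)
# $2 = memory limit in bytes
# $3 = CPU limit as a percentage of one CPU
# $4 = maximum number of processes
write_limits() (
    cg=$1
    mem_bytes=$2
    cpu_percent=$3
    max_pids=$4

    # cpu.max takes "<quota> <period>" in microseconds.
    period=100000
    quota=$(( period * cpu_percent / 100 ))

    # On a real cgroup filesystem, mkdir creates the cgroup itself.
    mkdir -p "$cg" || exit 1
    echo "$mem_bytes" > "$cg/memory.max" || exit 1
    echo "$quota $period" > "$cg/cpu.max" || exit 1
    echo "$max_pids" > "$cg/pids.max" || exit 1
)
```

Calling `write_limits /sys/fs/cgroup/demo 104857600 50 32` as root would cap the group at 100 MiB of memory, half a CPU, and 32 processes, provided the memory, cpu, and pids controllers are enabled in the parent's cgroup.subtree_control. A process joins the group when its PID is written to the group's cgroup.procs file.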

3. Filesystem Isolation (chroot/pivot_root)

Filesystem isolation restricts a process's view of the filesystem hierarchy:

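  • chroot: Changes the root directory of a process and its children. It does not touch the mount table, so a privileged process can escape it.
  • pivot_root: Swaps the root mount of the current mount namespace for a new one, so the old root can be unmounted entirely. It takes more steps than chroot but gives more complete isolation.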
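The pivot_root sequence has more steps than chroot, so a sketch may help. This is a hedged example and not the project's own method (the demo script uses chroot). It assumes root privileges, a fresh mount namespace (for example one created with `unshare --mount --fork`), and a root filesystem that contains umount and rmdir. The function name pivot_into is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: switch the root filesystem with pivot_root.
# Must run as root inside a new mount namespace.
# $1 = directory holding the new root filesystem
pivot_into() {
    new_root=$1
    if [ ! -d "$new_root" ]; then
        echo "pivot_into: $new_root is not a directory" >&2
        return 1
    fi

    # Keep mount changes from propagating back to the host namespace.
    mount --make-rprivate / || return 1
    # pivot_root needs new_root to be a mount point; binding the
    # directory onto itself turns it into one.
    mount --bind "$new_root" "$new_root" || return 1
    cd "$new_root" || return 1
    mkdir -p .old_root || return 1
    # Swap roots: new_root becomes /, the old root moves to /.old_root.
    pivot_root . .old_root || return 1
    cd / || return 1
    # Lazily detach the old root so the host filesystem is unreachable.
    umount -l /.old_root || return 1
    rmdir /.old_root
}
```

Unlike chroot, this leaves no mounted reference to the host's root inside the namespace once the old root has been detached.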

4. Seccomp (Secure Computing Mode)

Seccomp restricts the system calls a process can make, reducing the attack surface:

  • Strict Mode: Allows only the read(), write(), _exit(), and sigreturn() syscalls; any other syscall kills the process.
  • Filter Mode: Uses BPF filters to allow/deny specific syscalls.
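The mode a process is running under can be checked without special tooling: the Seccomp field of /proc/<pid>/status holds 0 (disabled), 1 (strict), or 2 (filter). The sketch below assumes a kernel that exposes this field; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for reading a process's seccomp mode.

# Translate the numeric Seccomp field into a readable name.
seccomp_mode_name() {
    case "$1" in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

# Print the raw Seccomp field for a PID from /proc/<pid>/status.
seccomp_mode_of() {
    awk '/^Seccomp:/ { print $2 }' "/proc/$1/status"
}
```

Running `seccomp_mode_name "$(seccomp_mode_of $$)"` inside the container is a quick way to confirm that a filter was actually applied.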

5. Linux Capabilities

Capabilities divide privileged operations (traditionally associated with root) into distinct units:

  • CAP_CHOWN: Allow changing file ownership.
  • CAP_NET_BIND_SERVICE: Allow binding to ports below 1024.
  • CAP_SYS_ADMIN: Allow various administrative operations.
  • And many more...
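Capability sets are stored as bitmasks, visible as hexadecimal values in the CapEff, CapPrm, and CapBnd fields of /proc/<pid>/status. The sketch below decodes such a mask for the three capabilities listed above, using their bit numbers from the kernel headers (CAP_CHOWN is bit 0, CAP_NET_BIND_SERVICE is bit 10, CAP_SYS_ADMIN is bit 21). The function name decode_caps is hypothetical, and the table is deliberately incomplete.

```shell
#!/bin/sh
# Hypothetical helper: decode a hexadecimal capability mask.
# Only the three capabilities named in this README are checked.
# $1 = hex mask without the 0x prefix, e.g. 0000000000000401
decode_caps() (
    mask=$(( 0x$1 ))
    # Each entry is "<bit number>:<capability name>".
    for entry in 0:cap_chown 10:cap_net_bind_service 21:cap_sys_admin; do
        bit=${entry%%:*}
        name=${entry#*:}
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "$name"
        fi
    done
)

# Example: a mask with bits 0 and 10 set.
decode_caps 0000000000000401
# prints:
# cap_chown
# cap_net_bind_service
```

To inspect the current shell, the mask can be read with `awk '/^CapEff:/ { print $2 }' /proc/$$/status` and passed to the function.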

Implementation Details

Our demonstration script (run-container.sh) implements the following:

  1. Creates cgroups for resource limiting (memory, CPU, PIDs).
  2. Sets up a network namespace with virtual interfaces for networking.
  3. Prepares a seccomp profile to restrict dangerous syscalls.
  4. Sets up mount points for the container.
  5. Uses unshare to create isolated namespaces.
  6. Configures capabilities by dropping unnecessary ones.
  7. Uses chroot for filesystem isolation.
  8. Runs a test program inside the isolated container.
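The networking, namespace, and chroot steps above can be sketched roughly as follows. This is not the contents of run-container.sh; it is a minimal outline under stated assumptions: the interface names, the 10.200.0.0/24 addresses, and the function names run and launch are all hypothetical. By default it only prints the commands (dry run), because executing them requires root.

```shell
#!/bin/sh
# Hypothetical outline of the network, namespace, and chroot steps.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# $1 = container name (also used as the network namespace name)
# $2 = path to the root filesystem
launch() {
    name=$1
    rootfs=$2

    # Create a network namespace and a veth pair connecting it to the host.
    run ip netns add "$name"
    run ip link add veth-host type veth peer name "veth-$name"
    run ip link set "veth-$name" netns "$name"
    run ip addr add 10.200.0.1/24 dev veth-host
    run ip link set veth-host up
    run ip netns exec "$name" ip addr add 10.200.0.2/24 dev "veth-$name"
    run ip netns exec "$name" ip link set "veth-$name" up
    run ip netns exec "$name" ip link set lo up

    # Inside that network namespace, unshare the remaining namespaces
    # and chroot into the root filesystem to start a shell.
    run ip netns exec "$name" unshare --fork --pid --mount --uts --ipc \
        chroot "$rootfs" /bin/sh
}
```

Running `launch demo /tmp/rootfs` prints the plan for review; setting `DRY_RUN=0` as root executes it. Linux interface names are limited to 15 characters, so the container name needs to be short. The cgroup, seccomp, and capability steps are omitted here.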

Usage

# Run the container demo (requires root)
sudo ./run-container.sh

Differences from Production Container Engines

Production container engines like Docker, containerd, or CRI-O use the same underlying Linux kernel features but add:

  • Image management (layers, registries)
  • Advanced networking options
  • Volume management
  • Resource orchestration
  • Logging and monitoring
  • Security hardening

This project focuses on the core isolation mechanisms for educational purposes.

Further Learning

  • Study the Linux documentation for these features, starting with the namespaces(7), cgroups(7), capabilities(7), and seccomp(2) man pages
  • Explore container runtimes like runc
  • Learn about OCI (Open Container Initiative) specifications
  • Experiment with container orchestration (Kubernetes)
