Continuing from the last section, here we’ll give the kernel the ability to load programs and run them in a restricted “user mode”, while talking to the kernel using system calls (“syscall”s).
As with many other parts of this guide, this follows closely what MOROS does.
The standard executable format on Unix-like operating systems is ELF.
We use the object crate to parse the ELF format
object = { version = "0.27.1", default-features = false, features = ["read"] }
then in process.rs
use object::{Object, ObjectSegment};
To create a simple executable, create a file src/bin/hello.rs
#![no_std]
#![no_main]
use core::panic::PanicInfo;
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
loop {}
}
#[no_mangle]
pub unsafe extern "sysv64" fn _start() -> ! {
loop {}
}
then compile with
$ cargo build
which should create an executable target/x86_64-blog_os/debug/hello
In process.rs
create a new function to create a new user thread,
taking as input an ELF file binary in an array. We’re going to include
our hello
executable in the kernel for now, and pass it to this
function. This first checks for the expected ELF header, and then
uses the object crate to parse the data. ELF files are organised into
segments with a size, a starting memory address and (usually) some
data to be loaded. For now we’ll just print the segment addresses:
pub fn new_user_thread(bin: &[u8]) -> Result<usize, &'static str> {
// Check the header
const ELF_MAGIC: [u8; 4] = [0x7f, b'E', b'L', b'F'];
if bin[0..4] != ELF_MAGIC {
return Err("Expected ELF binary");
}
// Use the object crate to parse the ELF file
// https://crates.io/crates/object
if let Ok(obj) = object::File::parse(bin) {
let entry_point = obj.entry();
println!("Entry point: {:#016X}", entry_point);
for segment in obj.segments() {
println!("Section {:?} : {:#016X}", segment.name(), segment.address());
}
return Ok(0);
}
Err("Could not parse ELF")
}
In main.rs
we now include the hello
executable using the
include_bytes macro, and call new_user_thread
:
process::new_user_thread(include_bytes!("../target/x86_64-blog_os/debug/hello"));
Running this with
$ cargo run --bin blog_os
should produce a result like
Entry point: 0x00000000201120 Section Ok(None) : 0x00000000200000 Section Ok(None) : 0x00000000201120
Unfortunately this entry point virtual memory address is in the same range as the kernel. The user program has to be loaded at these memory addresses to work correctly, but if we do that we will overwrite part of the kernel.
To handle this we can try to use separate page tables for kernel and users, so they can have the same virtual memory address but different physical addresses. This means frequently switching page tables (and flushing the TLB). In addition interrupt handlers must be mapped in the user virtual memory because page tables are not changed when an interrupt occurs. Instead all operating systems keep at least some kernel memory mapped in a reserved part of virtual memory: Linux is a “high half” operating system, where the high half of virtual memory is reserved for kernel use. Note that the kernel pages can in principle only be accessed from ring 0 (kernel), but the Meltdown security vulnerability allowed this to be bypassed on some processors (Intel x86 & some ARM Cortex). Linux can use Kernel page-table isolation (KPTI) to keep only minimal interrupt handlers mapped while in user mode and mitigate this vulnerability.
To change the address of the user code, we can use the GNU ld linker,
which has options to control the virtual address where code and data
segments are loaded. Choosing a memory range which is not
currently unused e.g above 0x5000000
, we can build userspace programs
with this makefile
rule:
.PHONY: user
# Compile user programs in src/bin
user: user/hello
user/% : src/bin/%.rs makefile
cargo rustc --release --bin $* -- \
-C linker-flavor=ld \
-C link-args="-Ttext-segment=5000000 -Trodata-segment=5100000" \
-C relocation-model=static
mkdir -p user
cp target/x86_64-blog_os/release/$* user/
which will build the hello
executable and copy it into a user/
directory. The main.rs
code can be changed to point to this new location:
process::new_user_thread(include_bytes!("../user/hello"));
While we’re at it, we can add another rule:
run : user
cargo run --bin blog_os
so running make run
will build everything and run. Check that the new entry
point is correct.
[Note: In section 22 we will eventually do the more sensible thing and shift the kernel to high memory addresses, rather than user programs]
To load the ELF into this new virtual memory address we need to create
entries in the page table. Note that we have to use segment.size()
when allocating memory, because there can be segments with non-zero
size but no data (so data.len()
is zero). BSS segments, for example,
are used to allocate space for uninitialised static variables.
for segment in obj.segments() {
let segment_address = segment.address() as u64;
println!("Section {:?} : {:#016X}", segment.name(), segment_address);
// Allocate memory in the pagetable
if memory::allocate_pages(user_page_table_ptr,
VirtAddr::new(segment_address), // Start address
segment.size() as u64, // Size (bytes)
PageTableFlags::PRESENT |
PageTableFlags::WRITABLE |
PageTableFlags::USER_ACCESSIBLE).is_err() {
return Err("Could not allocate memory");
}
if let Ok(data) = segment.data() {
// Copy data
let dest_ptr = segment_address as *mut u8;
for (i, value) in data.iter().enumerate() {
unsafe {
let ptr = dest_ptr.add(i);
core::ptr::write(ptr, *value);
}
}
}
}
To add a little security to our ELF user code loader we can define a
range of allowed addresses in process.rs
const USER_CODE_START: u64 = 0x5000000;
const USER_CODE_END: u64 = 0x80000000;
then before allocating memory in new_user_thread
we can check that
the memory is in the allowed range, returning an error if it is
not. Remember to change page table back before returning. We should
also free the new page tables, but haven’t added functions to do that
yet.
let start_address = VirtAddr::new(segment_address);
let end_address = start_address + segment.size() as u64;
if (start_address < VirtAddr::new(USER_CODE_START))
|| (end_address >= VirtAddr::new(USER_CODE_END)) {
return Err("ELF segment outside allowed range");
}
if memory::allocate_pages(...)
We could also check that the data length is not bigger than the size of the segment.
Having loaded data into memory we now need to create a Thread
struct,
similar to the new_kernel_thread
function
Following this blog by Nikos Filippakis, and borrowing some code from MOROS, we are now going to switch programs to user mode.
First we add some segment entries to the Global Descriptor Table (GDT) for user code and data segments:
static ref GDT: (GlobalDescriptorTable, Selectors) = {
let mut gdt = GlobalDescriptorTable::new();
let code_selector = gdt.add_entry(Descriptor::kernel_code_segment());
let data_selector = gdt.add_entry(Descriptor::kernel_data_segment());
let tss_selector = gdt.add_entry(Descriptor::tss_segment(
unsafe {tss_reference()}));
let user_code_selector = gdt.add_entry(Descriptor::user_code_segment()); // new
let user_data_selector = gdt.add_entry(Descriptor::user_data_segment()); // new
(gdt, Selectors { code_selector, data_selector, tss_selector,
user_code_selector, user_data_selector}) // new
};
struct Selectors {
code_selector: SegmentSelector,
data_selector: SegmentSelector,
tss_selector: SegmentSelector,
user_data_selector: SegmentSelector, // new
user_code_selector: SegmentSelector // new
}
According to this post the actual code and data segments are obsolete and not used, but the code segment (CS register) sets the processor privilege level (“ring”). It also seems to be important to set the Stack Segment (SS) to avoid General Protection Faults. The order of the segments in the GDT does not seem to matter if interrupts are going to be used for system calls. The order may be important if the faster (and more recent) syscall/sysret mechanism is used.
As we have a get_kernel_segments
function, we can add a function to get
the user segment selectors:
pub fn get_user_segments() -> (SegmentSelector, SegmentSelector) {
(GDT.1.user_code_selector, GDT.1.user_data_selector)
}
context.cs = code_selector.0 as usize; // Code segment flags
context.ss = data_selector.0 as usize; // Without this we get a GPF
Setting the CS register without also setting the SS register results
in a General Protection Fault on the iretq
instruction. Fixing this we get
a different error:
New process PID: 0x00000000000000, rip: 0x00000005001000 Kernel stack: 0x00444444440068 - 0x00444444442068 Context: 0x000444444441FE8 Thread stack: 0x00444444442068 - 0x00444444447068 RSP: 0x00444444447068 EXCEPTION: PAGE FAULT Accessed Address: VirtAddr(0x444444447060) Error Code: PROTECTION_VIOLATION | CAUSED_BY_WRITE | USER_MODE InterruptStackFrame { instruction_pointer: VirtAddr( 0x5001000, ), code_segment: 51, cpu_flags: 0x246, stack_pointer: VirtAddr( 0x444444447068, ), stack_segment: 43, }
The error code (USER_MODE
flag) means that we’re running in user
mode (Ring 3)! Unfortunately our code has tried to write to an
address that it’s not allowed to: It tried to write to
0x444444447060
which is in the thread stack address range
(0x00444444442068 - 0x00444444447068
). The error occurred because we
are allocating the stacks on the kernel heap with Vec
objects, and
those kernel pages are not accessible to user programs.
// Allocate pages for the user stack
const USER_STACK_START: u64 = 0x5002000;
memory::allocate_pages(user_page_table_ptr,
VirtAddr::new(USER_STACK_START), // Start address
USER_STACK_SIZE as u64, // Size (bytes)
PageTableFlags::PRESENT |
PageTableFlags::WRITABLE |
PageTableFlags::USER_ACCESSIBLE);
context.rsp = (USER_STACK_START as usize) + USER_STACK_SIZE;
Now the userspace code runs! Until we press a key. Then we get:
EXCEPTION: PAGE FAULT Accessed Address: VirtAddr(0xfffffffffffffff8) Error Code: CAUSED_BY_WRITE InterruptStackFrame { instruction_pointer: VirtAddr( 0x5001000, ), code_segment: 51, cpu_flags: 0x202, stack_pointer: VirtAddr( 0x5007000, ), stack_segment: 43 }
The accessed address is 8 bytes below address 0, and the access occurred in kernel mode (no USER_MODE flag).
Ensure that the keyboard interrupt handler has a valid kernel stack.
In interrupts.rs
:
idt[InterruptIndex::Keyboard.as_usize()]
.set_handler_fn(keyboard_interrupt_handler)
.set_stack_index(gdt::KEYBOARD_INTERRUPT_INDEX); // new
and in gdt.rs
pub const KEYBOARD_INTERRUPT_INDEX: u16 = 0;
Right now the user process can’t do much because printing to screen requires ring 0 (kernel) privileges. It has to ask the kernel to perform this task and many others. Every operating system therefore has a system call interface, for example this is the Linux syscall table.
First we need to enable syscalls, and specify the function to be called.
In a new file syscalls.rs
we’re going to need some assembly code:
use core::arch::asm;
Then define some constants which refer to the Model Specific Registers (MSRs) used to control syscalls:
const MSR_STAR: usize = 0xc0000081;
const MSR_LSTAR: usize = 0xc0000082;
const MSR_FMASK: usize = 0xc0000084;
Define a function which will be called when a syscall occurs:
#[naked]
extern "C" fn handle_syscall() {
// Empty for now
}
Then an init
function to set up syscalls to call this function
pub fn init() {
let handler_addr = handle_syscall as *const () as u64;
unsafe {
// Assembly code to go here
}
}
There are four steps needed to set this up: (1) enable the syscall
and sysret opcodes by setting the last bit in the MSR IA32_EFER,
which has code 0xC0000080
:
asm!("mov ecx, 0xC0000080",
"rdmsr",
"or eax, 1",
"wrmsr");
When a syscall is made we need to disable interrupts. Step (2)
is therefore to use FMASK
MSR to appliy a mask to the RFLAGS
when a syscall occurs:
asm!("xor rdx, rdx",
"mov rax, 0x200",
"wrmsr",
in("rcx") MSR_FMASK);
Step (3) is to set the LSTAR
MSR to the address of the handler
which gets called:
asm!("mov rdx, rax",
"shr rdx, 32",
"wrmsr",
in("rax") handler_addr,
in("rcx") MSR_LSTAR);
Finally (4) is to set the segment selectors (i.e. ring 0 or ring 3)
which get changed when syscall
and sysret
are executed:
asm!(
"xor rax, rax",
"mov rdx, 0x230008",
"wrmsr",
in("rcx") MSR_STAR);
The value 0x230008
specifies that selectors 8, 16 are used for
syscall (going to kernel code) and 43, 51 for sysret (returning to
user code).
Now to call our (empty) syscall handler, modify src/bin/hello.rs
so that it now executes syscall
:
#[no_mangle]
pub unsafe extern "sysv64" fn _start() -> ! {
asm!("syscall");
loop {}
}
Try running this, to ensure that everything is working so far.
Now we can make the syscall handler do something, but to do that we
need to save the registers so we can restore them afterwards. In
future we will want to distinguish between cases where we will return
to the same process, and cases where we will want to switch to a
different process. We’ll also want to change stack so that we’re not
messing with, or leaking kernel data into, the user’s stack. For now
we’ll just push registers on the user’s stack in the body of the
handle_syscall()
function.
Since naked functions can only contain a single asm
block, it’s
probably best to do the minimum necessary to get to Rust code.
#[naked]
extern "C" fn handle_syscall() {
unsafe {
asm!(
// Here should switch stack to avoid messing with user stack
// backup registers for sysretq
"push rcx",
"push r11",
"push rbp",
"push rbx", // save callee-saved registers
"push r12",
"push r13",
"push r14",
"push r15",
// Call the rust handler
"call {sys_write}",
"pop r15", // restore callee-saved registers
"pop r14",
"pop r13",
"pop r12",
"pop rbx",
"pop rbp", // restore stack and registers for sysretq
"pop r11",
"pop rcx",
"sysretq", // back to userland
sys_write = sym sys_write,
options(noreturn));
}
}
where the sym
keyword is replaced with the address of the
symbol (i.e function in this case) by the linker. The sys_write
function
will just print something so we can see if it’s working:
extern "C" fn sys_write() {
println!("write");
}
Try running again, now should see “write” appear.
To be able to do anything useful, we need to be able to pass parameters to our syscall, typically through registers though perhaps also on the stack. From this summary of Linux syscalls, it can be seen that Linux does this in two stages: First a syscall function is selected by setting the RAX register. Then other registers are used to pass parameters to the syscall function. The order of these parameters (rdi, rsi, rdx, r10, r8, r9) is slightly different from the System V ABI and C calling conventions (rdi, rsi, rdx, rcx, r8, r9) because the RCX register is used to store the caller’s instruction pointer.
Linux uses a call table to choose which function to call: The RAX
register is the index into an array of function pointers.
At some point we’ll need to implement something like this in Rust,
but for now we’ll just implement a simple conditional.
Replacing call sys_write
with:
"cmp rax, 0", // if rax == 0 {
"jne 1f",
"call {sys_read}", // sys_read();
"1: cmp rax, 1", // } if rax == 1 {
"jne 2f",
"call {sys_write}", // sys_write();
"2: ", // }
and get the addresses of both functions in the asm!
macro:
sys_read = sym sys_read, // new
sys_write = sym sys_write,
and add the other syscall function:
extern "C" fn sys_read() {
println!("read");
}
Now we can modify the hello.rs
userland code, setting
the rax
register to select which syscall to run:
asm!("mov rax, 1", // write
"syscall");
Now we have called the syscall function, we can use the other
registers to pass parameters. To start with we’ll use sys_write
to
print strings. Then we’ll be able to print debugging information from
user programs.
We can change the sys_write
function to accept two arguments,
which will be in the RDI and RSI registers:
extern "C" fn sys_write(ptr: *mut u8, len:usize) {
// Body to go here...
}
The first argument (in RDI) is the pointer to the start of the string,
and the second (in RSI) is its length. Both of these arguments should
be thoroughly checked before use, as user code may be malfunctioning
or malicious. All len
bytes of the string must be in the
user’s memory range, for example, and not part of kernel memory.
For now we’ll just check that len
is not zero, and then convert
the pointer and length to a slice and then an str
to be printed:
extern "C" fn sys_write(ptr: *mut u8, len:usize) {
// Check all inputs: Does ptr -> ptr+len lie entirely in user address space?
if len == 0 {
return;
}
// Convert raw pointer to a slice
let u8_slice = unsafe {slice::from_raw_parts(ptr, len)};
if let Ok(s) = str::from_utf8(u8_slice) {
println!("Write '{}'", s);
} // else error
}
Let’s try calling this with a string: In ==src/bin/hello.rs=
the _start
function becomes:
#[no_mangle]
pub unsafe extern "sysv64" fn _start() -> ! {
let s = "hello";
unsafe {
asm!("mov rax, 1", // syscall function
"syscall",
in("rdi") s.as_ptr(), // First argument
in("rsi") s.len()); // Second argument
}
loop {}
}
When run you should now see “Write: ‘hello’” appear!
Note: We can’t use hlt
inside the loop because this is a privileged
instruction (needs to run in ring 0). Linux has a sched_yield()
syscall, which a user thread can call if no work needs to be done. If
every process is sleeping, waiting, or calls this function, then the
kernel calls hlt to save power.
Finally we can wrap this syscall up in a function and make the interface
nicer for the user by implementing the print
and println
macros. First
wrap up the syscall in a write_str
function, implementing the fmt:Write
trait
on an empty type we define:
use core::format_args;
use core::fmt;
struct Writer {}
impl fmt::Write for Writer {
fn write_str(&mut self, s: &str) -> fmt::Result {
unsafe {
asm!("mov rax, 1", // syscall function
"syscall",
in("rdi") s.as_ptr(), // First argument
in("rsi") s.len()); // Second argument
}
Ok(())
}
}
then a function which calls this with format arguments:
pub fn _print(args: fmt::Arguments) {
use core::fmt::Write;
Writer{}.write_fmt(args).unwrap();
}
and then the macros to call this function:
macro_rules! print {
($($arg:tt)*) => {
_print(format_args!($($arg)*));
};
}
macro_rules! println {
() => (print!("\n"));
($fmt:expr) => (print!(concat!($fmt, "\n")));
($fmt:expr, $($arg:tt)*) => (print!(
concat!($fmt, "\n"), $($arg)*));
}
Eventually we will want to define these in a standard library which
all user programs can use (section 10), but this will be enough for
testing for now. Our hello.rs
start function can now be simplified
to:
#[no_mangle]
pub unsafe extern "sysv64" fn _start() -> ! {
print!("Hello from user world! {}", 42);
loop{}
}