Skip to content

Latest commit

 

History

History
576 lines (496 loc) · 21.8 KB

04-more-syscalls.org

File metadata and controls

576 lines (496 loc) · 21.8 KB

Syscalls for thread control

Now we’re going to add system calls to enable user threads to create new threads, and exit cleanly. We’ll learn about the swapgs and wrmsr instructions.

Switching stack on Syscall

Syscall handlers have a problem because they need to preserve all registers, while switching to a kernel stack and saving the user stack pointer somewhere. Reasons to avoid using the user stack include:

  1. Avoiding a stack overflow while in ring 0, which could overwrite anything or just lead to a page fault while in kernel space. Instead we switch to a known good stack location.
  2. Not leaving kernel data on the user stack, potentially leaking information which a malicious program could use.

A common solution is to use the GS segment register. In 64-bit mode this register contains a 64-bit address, and memory can be accessed as an offset from this address. For example

mov gs:0x24, rsp

copies the stack pointer (RSP) to a virtual memory address calculated by taking the value in GS and adding 0x24. If we can set the value of GS, then we can use it to save the user stack pointer to a memory address and retrieve a kernel stack pointer.

There are two ways to access the GS and FS 64-bit registers directly:

  1. The wrmsr instruction writes to Model Specific Registers, and was used in an earlier section to set up syscalls. It can only be executed in ring 0. According to the AMD documentation the FS base address is MSR 0xC000_0100 and the GS base address is 0xC000_0101.
  2. The wrfsbase and wrgsbase instructions write to the FS and GS base registers, and can be executed in any privilege level. These need to be enabled by setting a bit in the CR4 control register. There is an LWN article on enabling FSGSBASE in Linux.

Note that push, pop and mov only access the lower 32 bits of the GS and FS registers.

There is a third way to set the GS register indirectly, which allows the kernel to store an address in a location that is hidden from user programs. This Kernel GS base address is MSR 0xC0000102. The swapgs instruction swaps the values in the user and kernel GS base registers. What we can do is:

  1. Store a memory address in the kernel GS base register
  2. Allow users to modify the GS register
  3. On entering the syscall handler, execute swapgs to swap the user and kernel GS values
  4. Use GS to access memory, saving the user stack pointer and loading a kernel stack pointer.
  5. Before going back to user code swap the GS registers back

As discussed in this OSdev page a common strategy is to swap the GS registers on every transition between user and kernel code, at the start and end of syscalls and interrupt handlers. That page notes that there are some problems with this approach: If an interrupt is allowed to occur during a syscall, while the GS registers are swapped, then they will be swapped again! To handle this there are ways to detect whether the interrupt occurred while in kernel code.

The approach we’ll use here is to avoid this problem by only using swapgs in the syscall handler, keeping the user GS register loaded in the rest of the kernel code. There may be a good reason why this is a bad idea, but I haven’t found it yet.

Setting kernel GS base

A good choice for the kernel GS base register is to point to the Task State Segment (TSS) table. This table is different for each CPU core, so will still work if (when?) we support multiple cores, and we already store kernel stack pointers in the TSS for use in the timer interrupt context switch. The Redox OS syscall handler does something like this.

We need a function to get the address of the TSS. In gdt.rs add a function

pub fn tss_address() -> u64 {
    let tss_ptr = &*TSS.lock() as *const TaskStateSegment;
    tss_ptr as u64
}

then in syscalls.rs we define the MSR number:

const MSR_KERNEL_GS_BASE: usize = 0xC0000102;

and then the init function can put the TSS address in the kernel GS base register:

asm!(
    // Want to move RDX into MSR but wrmsr takes EDX:EAX i.e. EDX
    // goes to high 32 bits of MSR, and EAX goes to low order bits
    // https://www.felixcloutier.com/x86/wrmsr
    "mov eax, edx",
    "shr rdx, 32", // Shift high bits into EDX
    "wrmsr",
    in("rcx") MSR_KERNEL_GS_BASE,
    in("rdx") gdt::tss_address()
);

This puts the TSS address in RDX and writes it to the MSR. It looks more complicated than it is because wrmsr uses two 32-bit registers to set the 64-bit address: The low 32 bits of the address are moved to EAX, and then the high 32 bits in RDX are shifted down to the 32 bits in EDX. wrmsr takes these two pieces and puts them back together in the MSR.

Using SWAPGS in syscall handler

When the syscall handler starts, we want to save the user RSP, and load a kernel stack pointer. The kernel stack pointer for the current process is stored in the TSS, entry number gdt::TIMER_INTERRUPT_INDEX which is currently set to 1. There are 7 available interrupt stack entries, so we can use one of them to temporarily store the user stack pointer. In gdt.rs we’ll define a constant:

pub const SYSCALL_TEMP_INDEX: u16 = 2;

which we can use in syscalls.rs at the start of the handle_syscall function:

asm!("swapgs",
     "mov gs:{tss_temp}, rsp", // save user RSP
     "mov rsp, gs:{tss_timer}" // load kernel RSP
     ...
     tss_timer = const(0x24 + gdt::TIMER_INTERRUPT_INDEX * 8),
     tss_temp = const(0x24 + gdt::SYSCALL_TEMP_INDEX * 8),

The offset of the interrupt stack index (0x24) is determined from the Task State Segment layout.

This kernel stack is also used by the timer interrupt for context switches. If we want to allow context switches while handling a syscall, then we need to make sure that syscalls use a different part of the kernel stack. The kernel stack is two pages (8k) so we can move the pointer by an offset and have enough space:

const SYSCALL_KERNEL_STACK_OFFSET: u64 = 1024;

which is applied to rsp:

asm!(...
     "sub rsp, {ks_offset}",
     ...
     ks_offset = const(SYSCALL_KERNEL_STACK_OFFSET));

We can now save the user stack pointer onto the kernel stack, and swap the GS registers back:

asm!(...
     "push gs:{tss_temp}", // user RSP
     "swapgs"
     ...
     ks_offset = const(SYSCALL_KERNEL_STACK_OFFSET));

The handle_syscall() syscall entry function now starts with:

#[naked]
extern "C" fn handle_syscall() {
    unsafe {
        asm!(
            "swapgs", // Put the TSS address into GS (stored in syscalls::init)
            "mov gs:{tss_temp}, rsp", // Save user stack pointer in TSS entry

            "mov rsp, gs:{tss_timer}", // Get kernel stack pointer
            "sub rsp, {ks_offset}", // Use a different location than timer interrupt

            // Create an Exception stack frame
            "sub rsp, 8", // To be replaced with SS
            "push gs:{tss_temp}", // User stack pointer
            "swapgs", // Put TSS address back

After that we can create the rest of the Context struct.

Creating a Context struct in syscall

When a thread fork syscall is made, a new thread context must be made that is the same as the original thread, and can be put in the scheduler. The easiest way to do this is to capture a Context in syscall in the same way that we do in a timer interrupt.

The Context struct is defined in interrupts.rs. Because the stack grows downwards in memory we start at the end of the struct (ss and rsp fields), and store the values in order until we get to the top (r14 and r15). We’ve already pushed the user stack (rsp) but need to reserve space before that for the stack segment (SS). We therefore subtract 8 (bytes) from rsp to make space before the rsp value, and we have to do a similar thing for CS. Other differences from the interrupt handler code are that syscall stores the user instruction pointer (rip) in rcx, and RFLAGS in r11. The assembly code in the naked handle_syscall() function so far looks like:

asm!(
    "swapgs", // Put the TSS address into GS (stored in syscalls::init)
    "mov gs:{tss_temp}, rsp", // Save user stack pointer in TSS entry

    "mov rsp, gs:{tss_timer}", // Get kernel stack pointer
    "sub rsp, {ks_offset}", // Use a different location than timer interrupt

    // Create an Exception stack frame
    "sub rsp, 8", // To be replaced with SS
    "push gs:{tss_temp}", // User stack pointer
    "swapgs", // Put TSS address back

    "push r11", // Caller's RFLAGS
    "sub rsp, 8",  // CS
    "push rcx", // Caller's RIP

    "push rax",
    "push rbx",
    "push rcx",
    "push rdx",

    "push rdi",
    "push rsi",
    "push rbp",
    "push r8",

    "push r9",
    "push r10",
    "push r11",
    "push r12",

    "push r13",
    "push r14",
    "push r15",

Dispatching syscalls

We now have a thread Context and need to decide which syscall function to call. In section 2 we used some conditionals and called the sys_read or sys_write functions from the handle_syscalls assembly code. Linux uses a jump table, an array of function pointers, to dispatch syscalls; to do this in Rust we’ll use a two-stage method: handle_syscalls will call a new (Rust) function dispatch_syscalls, which will then use match to call the separate syscall functions.

To pass parameters to dispatch_syscalls we’ll use the x86_64 C calling convention: The first six function parameters are in rdi, rsi, rdx, rcx, r8 and r9 registers. We’re going to use five of these, for the Context address; the syscall number; and then three syscall parameters which the user can store in rdi, rsi and rdx:

"mov r8, rdx", // Fifth argument <- Syscall third argument
"mov rcx, rsi", // Fourth argument <- Syscall second argument
"mov rdx, rdi", // Third argument <- Syscall first argument
"mov rsi, rax", // Second argument is the syscall number
"mov rdi, rsp", // First argument is the Context address
"call {dispatch_fn}",
...
dispatch_fn = sym dispatch_syscall,

In the dispatch_syscall function we need to finish the Context that we’ve created: In handle_syscalls we didn’t set the stack or code segments. If this Context is used to return to a thread via an interrupt (e.g. in a context switch) then those segments will be wrong and we’ll probably get a General Protection Fault. To set these values we’ll use:

extern "C" fn dispatch_syscall(
    context_ptr: *mut Context,
    syscall_id: u64,
    arg1: u64, arg2: u64, arg3: u64) {

    let context = unsafe{&mut *context_ptr};

    // Set the CS and SS segment selectors
    let (code_selector, data_selector) =
          gdt::get_user_segments();
    context.cs = code_selector.0 as usize;
    context.ss = data_selector.0 as usize;
    ...

After that it’s just a match to choose which syscall function to call:

...
    match syscall_id {
        SYSCALL_FORK_THREAD => process::fork_current_thread(context),
        SYSCALL_DEBUG_WRITE => sys_debug_write(arg1 as *const u8, arg2 as usize),
        _ => println!("Unknown syscall {:?} {} {} {}",
                      context_ptr, syscall_id, arg1, arg2)
    }
 }

The SYSCALL_FORK_THREAD branch calls a new function fork_current_thread which we’ll add to process.rs.

Kernel code to fork threads

To create a new thread in the current process we need to:

  1. Allocate a new user stack and a new kernel stack for the new thread
  2. Create a new Thread object, with a reference counted pointer (Arc) to the same shared Process object as the caller.
  3. Set the return values so that the original thread can be distinguished from the new thread. Here we’ll do this by setting the rdi register to 0 in the new thread, and to the (non-zero) thread ID (tid) in the original thread.

The code to do this is in kernel/src/process.rs:

pub fn fork_current_thread(current_context: &mut Context) {
    if let Some(current_thread) = CURRENT_THREAD.read().as_ref() {

        // Allocate user stack
        let page_table_ptr = memory::active_pagetable_ptr();
        if let Ok((user_stack_start, user_stack_end)) = memory::allocate_user_stack(page_table_ptr) {
            let new_thread = {
                // Create a new kernel stack
                let kernel_stack = Vec::with_capacity(KERNEL_STACK_SIZE);
                let kernel_stack_start = VirtAddr::from_ptr(kernel_stack.as_ptr());
                let kernel_stack_end = (kernel_stack_start + KERNEL_STACK_SIZE).as_u64();

                Box::new(Thread {
                    tid: unique_id(),
                    process: current_thread.process.clone(), // Shared state
                    page_table_physaddr: current_thread.page_table_physaddr, // Shared page table
                    kernel_stack,
                    kernel_stack_end,
                    user_stack_end,
                    context: kernel_stack_end - INTERRUPT_CONTEXT_SIZE as u64,
                })
            };

            let new_context = unsafe {&mut *(new_thread.context as *mut Context)};
            *new_context = current_context.clone(); // Copy of caller

            // Set new stack pointer
            new_context.rsp = new_thread.user_stack_end as usize;

            // Set return values in rax
            new_context.rax = 0; // No error
            new_context.rdi = 0; // Indicates that this is the new thread
            current_context.rax = 0; // No error
            current_context.rdi = new_thread.tid as usize;

            RUNNING_QUEUE.write().push_back(new_thread);
        } else {
            // Failed to allocate user stack
            current_context.rax = syscalls::SYSCALL_ERROR_MEMALLOC; // Error code
        }
    } else {
        // Somehow no current thread
        current_context.rax = 2; // Error code
    }
}

Note that the new context is a clone (copy) of the calling thread’s context, including the return instruction pointer. Both threads will return from this syscall at the same point in the user code. We therefore need to be quite careful to make sure that we don’t accidentally share or double-free variables in user code.

Exiting threads

Now we have the syscall dispatch code in place, adding more syscalls becomes quite easy: We create a new function, choose a syscall number, and add it to the match in dispatch_syscalls:

pub const SYSCALL_EXIT_THREAD: u64 = 1;
...
// in dispatch_syscall()
    match syscall_id {
        SYSCALL_EXIT_THREAD => process::exit_current_thread(context),
        ...
    }

Exiting a thread is quite straightforward: We take ownership of the running Thread object, and then drop it. The drop implementation will then take care of freeing resources the thread is holding:

pub fn exit_current_thread(_current_context: &mut Context) {
    {
        let mut current_thread = CURRENT_THREAD.write();

        if let Some(_thread) = current_thread.take() {
            // Drop thread, freeing stacks. If this is the last thread
            // in this process, memory and page tables will be freed
            // in the Process drop() function
        }
    }
    // Can't return from this syscall, so this thread now waits for a
    // timer interrupt to switch context.
    unsafe {
        asm!("sti",
             "2:",
             "hlt",
             "jmp 2b");
    }
}

User code to spawn new threads

The above kernel code will create a new user thread with the same registers and page table as the caller, including instruction pointer (rip), but with different stack pointer (rsp) and zero in rdi. To use it we need a user function to call syscall with SYSCALL_FORK_THREAD in rax, and then treat the two threads which return differently.

  1. The new thread has a new stack, so we can’t rely on any local variables (which may be stored in registers or on the stack) or return from any function.
  2. The Rust compiler may add code to the start and end of asm blocks, and assumes that an asm block which is entered is left once (or never for noreturn blocks), not twice.

We therefore need to make sure that the new thread never leaves the asm block where it is created; What happens in the asm block stays in the asm block.

The solution used here is to detect the new thread (which has both rax and rdi equal to zero), call a user-provided function, and when that returns use the SYSCALL_EXIT_THREAD syscall to stop the thread. That syscall never returns, so the new thread never leaves the asm block.

This is in euralios_std/src/syscalls.rs.

pub fn thread_spawn(
    func: extern "C" fn(usize) -> (),
    param: usize
) -> Result<u64, SyscallError> {

    let tid: u64;
    let errcode: u64;
    unsafe {
        asm!("syscall",
             // rax = 0 indicates no error
             "cmp rax, 0",
             "jnz 2f",
             // rdi = 0 for new thread
             "cmp rdi, 0",
             "jnz 2f",
             // New thread
             "mov rdi, r9", // Function argument
             "call r8",
             "mov rax, 1", // exit_current_thread syscall
             "syscall",
             // New thread never leaves this asm block
             "2:",
             in("rax") SYSCALL_FORK_THREAD,
             in("r8") func,
             in("r9") param,
             lateout("rax") errcode,
             lateout("rdi") tid,
             out("rcx") _,
             out("r11") _);
    }
    if errcode != 0 {
        return Err(SyscallError(errcode));
    }
    Ok(tid)
}

(I was not very consistent with fork vs spawn naming. Sorry). This thread_spawn function takes a function pointer with C calling convention, and a parameter which will be stored in the first argument (rdi). In section 6 once we’ve got user-space memory allocation we can use this to pass a closure’s context and provide the Rust threading interface.

User code to exit from a thread

The implementation of thread_exit is simpler because we just call the SYSCALL_EXIT_THREAD syscall and never return:

pub fn thread_exit() -> ! {
    unsafe {
        asm!("syscall",
             in("rax") SYSCALL_EXIT_THREAD,
             options(noreturn));
    }
}

User panic handler

We can also use these syscalls to write a better panic handler in the user program hello.rs:

#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    println!("User panic: {}", info);
    syscalls::thread_exit();
}

So now when a user thread panics it will print the error message and exit.

Next steps

When a thread exits we should be able to recover any memory it was allocated and re-use it for other threads or processes. Unfortunately at the moment our frame allocator doesn’t allow frames to be free’d. In the next section we’ll write a new frame allocator which will allow us to free memory when it’s not needed.

Appendix: Security issues

Security issues: https://fuchsia.dev/fuchsia-src/concepts/kernel/sysret_problem

#+LABEL sec:jump_table

Appendix: Syscall jump table

This is an alternative approach which I don’t think is optimal, but is here in case it’s helpful.

In C we can create a static array of function pointers (addresses), so that functions can be called when indexing into this array. This is used in Linux (for example) to enable fast lookup of a function pointer from a syscall number.

In rust this seems to be difficult: Function pointers aren’t known at compile time (only link time), and so attempting to cast a function to a u64 statically doesn’t compile. The usual trick of using lazy_static also doesn’t work because we need to know the address of the array at link time.

The closest I’ve found so far is to define a static mutable array (highly discouraged!) in syscalls.rs:

const SYSCALL_NUMBER: usize = 2;
static mut SYSCALL_HANDLERS : [u64; SYSCALL_NUMBER]
  = [0; SYSCALL_NUMBER];

In the init() function we can populate this array:

unsafe {
    SYSCALL_HANDLERS = [
        sys_read as u64,
        sys_write as u64
    ];
}

Now the syscall handler code can be simplified: It first checks that the syscall number (in rax) is in range, and if so looks up the handler address in the SYSCALL_HANDLERS table:

asm!(
    ...,
    "push r15",

    "cmp rax, {syscall_max}",
    "jge 1f",  // Out of range
    "mov rax, [{syscall_handlers} + 8*rax]", // Lookup handler address
    "call rax",
    "1: ",

    "pop r15",
    ...,
    syscall_handlers = sym SYSCALL_HANDLERS,
    syscall_max = const SYSCALL_NUMBER,
    options(noreturn)
);

To be able to use the const argument to asm we need to add this feature to the top of lib.rs with #![feature(asm_const)].