Now we’re going to add system calls to enable user threads to create new threads, and exit cleanly. We’ll learn about the swapgs and wrmsr instructions.
Syscall handlers have a problem because they need to preserve all registers, while switching to a kernel stack and saving the user stack pointer somewhere. Reasons to avoid using the user stack include:
- Avoiding a stack overflow while in ring 0, which could overwrite anything or just lead to a page fault while in kernel space. Instead we switch to a known good stack location.
- Not leaving kernel data on the user stack, potentially leaking information which a malicious program could use.
A common solution is to use the GS segment register. In 64-bit mode this register contains a 64-bit address, and memory can be accessed as an offset from this address. For example
mov gs:0x24, rsp
copies the stack pointer (RSP) to a virtual memory address calculated by taking the value in GS and adding 0x24. If we can set the value of GS, then we can use it to save the user stack pointer to a memory address and retrieve a kernel stack pointer.
There are two ways to access the GS and FS 64-bit registers directly:
- The wrmsr instruction writes to Model Specific Registers, and was used in an earlier section to set up syscalls. It can only be executed in ring 0. According to the AMD documentation the FS base address is MSR 0xC000_0100 and the GS base address is 0xC000_0101.
- The wrfsbase and wrgsbase instructions write to the FS and GS base registers, and can be executed in any privilege level. These need to be enabled by setting a bit in the CR4 control register. There is an LWN article on enabling FSGSBASE in Linux.
Note that push, pop and mov only access the lower 32 bits of the GS and FS registers.
There is a third way to set the GS register indirectly, which allows the kernel to store an address in a location that is hidden from user programs. This Kernel GS base address is MSR 0xC0000102. The swapgs instruction swaps the values in the user and kernel GS base registers. What we can do is:
- Store a memory address in the kernel GS base register
- Allow users to modify the GS register
- On entering the syscall handler, execute swapgs to swap the user and kernel GS values
- Use GS to access memory, saving the user stack pointer and loading a kernel stack pointer.
- Before going back to user code swap the GS registers back
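The steps above can be pictured with a toy model that runs on any host. This is only a sketch: the `GsRegisters` struct and `syscall_round_trip` function are made up for illustration and do not touch real MSRs, and the addresses in any usage are arbitrary.

```rust
/// Toy model of the two GS base registers; not real hardware access.
#[derive(Debug, Clone, Copy, PartialEq)]
struct GsRegisters {
    gs_base: u64,        // what `gs:` addressing currently uses
    kernel_gs_base: u64, // hidden value held in MSR 0xC0000102
}

impl GsRegisters {
    /// Model of `swapgs`: exchange the active and hidden base addresses.
    fn swapgs(&mut self) {
        core::mem::swap(&mut self.gs_base, &mut self.kernel_gs_base);
    }
}

/// Walk through one syscall: swap in, use GS, swap back out.
/// Returns the base address that was active inside the handler.
fn syscall_round_trip(regs: &mut GsRegisters) -> u64 {
    regs.swapgs(); // entry: the kernel value becomes active
    let active = regs.gs_base; // the handler would address memory via this base
    regs.swapgs(); // exit: the user value is restored
    active
}
```

An even number of swaps always restores the user value, which is why an unexpected extra swap in between causes trouble.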
As discussed in this OSdev page a common strategy is to swap the GS registers on every transition between user and kernel code, at the start and end of syscalls and interrupt handlers. That page notes that there are some problems with this approach: If an interrupt is allowed to occur during a syscall, while the GS registers are swapped, then they will be swapped again! To handle this there are ways to detect whether the interrupt occurred while in kernel code.
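One common way to do that detection is to look at the code segment selector saved in the interrupt stack frame: its low two bits hold the privilege level of the interrupted code. The sketch below assumes the handler can read that saved CS value; the function name is hypothetical and this check is not used in the kernel described here.

```rust
/// Returns true if the saved code-segment selector belongs to ring 3,
/// i.e. the interrupt arrived while user code was running.
/// The low two bits of a selector hold the privilege level.
fn interrupted_user_code(saved_cs: u64) -> bool {
    (saved_cs & 0b11) == 3
}
```

A handler following the swap-on-every-transition strategy would only execute swapgs when this returns true.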
The approach we’ll use here is to avoid this problem by only using swapgs in the syscall handler, keeping the user GS register loaded in the rest of the kernel code. There may be a good reason why this is a bad idea, but I haven’t found it yet.
A good choice for the kernel GS base register is to point to the Task State Segment (TSS) table. This table is different for each CPU core, so will still work if (when?) we support multiple cores, and we already store kernel stack pointers in the TSS for use in the timer interrupt context switch. The Redox OS syscall handler does something like this.
We need a function to get the address of the TSS. In gdt.rs add a function:
pub fn tss_address() -> u64 {
    let tss_ptr = &*TSS.lock() as *const TaskStateSegment;
    tss_ptr as u64
}
then in syscalls.rs we define the MSR number:
const MSR_KERNEL_GS_BASE: usize = 0xC0000102;
and then the init function can put the TSS address in the kernel GS base register:
asm!(
    // Want to move RDX into MSR but wrmsr takes EDX:EAX i.e. EDX
    // goes to high 32 bits of MSR, and EAX goes to low order bits
    // https://www.felixcloutier.com/x86/wrmsr
    "mov eax, edx",
    "shr rdx, 32", // Shift high bits into EDX
    "wrmsr",
    in("rcx") MSR_KERNEL_GS_BASE,
    // RDX is modified by the shift, so tell the compiler it is clobbered
    inout("rdx") gdt::tss_address() => _,
    // EAX is overwritten with the low 32 bits
    out("rax") _
);
This puts the TSS address in RDX and writes it to the MSR. It looks more complicated than it is because wrmsr uses two 32-bit registers to set the 64-bit address: The low 32 bits of the address are moved to EAX, and then the high 32 bits in RDX are shifted down to the 32 bits in EDX. wrmsr takes these two pieces and puts them back together in the MSR.
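The same split can be written in plain Rust, which may be easier to follow than the register shuffling. This is a small host-side sketch; the function names are illustrative and are not part of the kernel.

```rust
/// Split a 64-bit value into the (EDX, EAX) halves that `wrmsr` expects.
fn split_for_wrmsr(value: u64) -> (u32, u32) {
    let high = (value >> 32) as u32; // goes in EDX
    let low = value as u32; // goes in EAX (the cast keeps the low 32 bits)
    (high, low)
}

/// Recombine the two halves the way the CPU does when writing the MSR.
fn combine_from_wrmsr(high: u32, low: u32) -> u64 {
    ((high as u64) << 32) | (low as u64)
}
```

Passing the two halves as separate `u32` operands to `asm!` would be an alternative to shifting inside the assembly block.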
When the syscall handler starts, we want to save the user RSP, and load a kernel stack pointer. The kernel stack pointer for the current process is stored in the TSS, entry number gdt::TIMER_INTERRUPT_INDEX which is currently set to 1. There are 7 available interrupt stack entries, so we can use one of them to temporarily store the user stack pointer. In gdt.rs we’ll define a constant:
pub const SYSCALL_TEMP_INDEX: u16 = 2;
which we can use in syscalls.rs at the start of the handle_syscall function:
asm!("swapgs",
     "mov gs:{tss_temp}, rsp", // save user RSP
     "mov rsp, gs:{tss_timer}" // load kernel RSP
     ...
     tss_timer = const(0x24 + gdt::TIMER_INTERRUPT_INDEX * 8),
     tss_temp = const(0x24 + gdt::SYSCALL_TEMP_INDEX * 8),
The offset of the interrupt stack table within the TSS (0x24) is determined from the Task State Segment layout; each entry is 8 bytes, so entry number N is at 0x24 + N * 8.
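The 0x24 figure can be checked on a host machine by writing out the 64-bit TSS layout and measuring where the interrupt stack table starts. This is a sketch: `TssLayout` and the helper functions are illustrative names, and the struct is written by hand from the architecture's TSS format rather than taken from the kernel sources.

```rust
/// Layout of the 64-bit Task State Segment, written out for illustration.
#[allow(dead_code)]
#[repr(C, packed(4))]
struct TssLayout {
    reserved_1: u32,
    privilege_stack_table: [u64; 3],
    reserved_2: u64,
    interrupt_stack_table: [u64; 7],
    reserved_3: u64,
    reserved_4: u16,
    iomap_base: u16,
}

/// Byte offset of the interrupt stack table from the start of the TSS.
fn ist_base_offset() -> usize {
    let tss = TssLayout {
        reserved_1: 0,
        privilege_stack_table: [0; 3],
        reserved_2: 0,
        interrupt_stack_table: [0; 7],
        reserved_3: 0,
        reserved_4: 0,
        iomap_base: 0,
    };
    // addr_of! takes raw addresses, which is allowed for packed fields
    let base = core::ptr::addr_of!(tss) as usize;
    let ist = core::ptr::addr_of!(tss.interrupt_stack_table) as usize;
    ist - base
}

/// Byte offset of one interrupt stack table entry (zero-based index).
fn ist_entry_offset(index: usize) -> usize {
    ist_base_offset() + index * 8
}
```

With the indices used in this section, entry 1 (the timer stack) is at 0x2C and entry 2 (the temporary slot) is at 0x34.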
This kernel stack is also used by the timer interrupt for context switches. If we want to allow context switches while handling a syscall, then we need to make sure that syscalls use a different part of the kernel stack. The kernel stack is two pages (8k) so we can move the pointer by an offset and have enough space:
const SYSCALL_KERNEL_STACK_OFFSET: u64 = 1024;
which is applied to rsp:
asm!(...
"sub rsp, {ks_offset}",
...
ks_offset = const(SYSCALL_KERNEL_STACK_OFFSET));
We can now save the user stack pointer onto the kernel stack, and swap the GS registers back:
asm!(...
"push gs:{tss_temp}", // user RSP
"swapgs"
...
ks_offset = const(SYSCALL_KERNEL_STACK_OFFSET));
The handle_syscall() syscall entry function now starts with:
#[naked]
extern "C" fn handle_syscall() {
    unsafe {
        asm!(
            "swapgs", // Put the TSS address into GS (stored in syscalls::init)
            "mov gs:{tss_temp}, rsp", // Save user stack pointer in TSS entry
            "mov rsp, gs:{tss_timer}", // Get kernel stack pointer
            "sub rsp, {ks_offset}", // Use a different location than timer interrupt
            // Create an Exception stack frame
            "sub rsp, 8", // To be replaced with SS
            "push gs:{tss_temp}", // User stack pointer
            "swapgs", // Put TSS address back
After that we can create the rest of the Context struct.
When a thread fork syscall is made, a new thread context must be made that is the same as the original thread, and can be put in the scheduler. The easiest way to do this is to capture a Context in syscall in the same way that we do in a timer interrupt.
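To make the target layout concrete, here is a rough picture of such a register frame. It is a hypothetical reconstruction: the field order is inferred from the order in which the handler pushes values (last pushed ends up at the lowest address), and the real Context struct in interrupts.rs may differ in field types or details.

```rust
/// Hypothetical reconstruction of the saved register frame.
/// Fields run from the lowest address (pushed last) to the highest.
#[allow(dead_code)]
#[derive(Default)]
#[repr(C)]
struct ContextSketch {
    // General purpose registers, pushed by the handler
    r15: u64, r14: u64, r13: u64, r12: u64,
    r11: u64, r10: u64, r9: u64, r8: u64,
    rbp: u64, rsi: u64, rdi: u64, rdx: u64,
    rcx: u64, rbx: u64, rax: u64,
    // Exception stack frame, in the order the CPU pushes it on an interrupt
    rip: u64, cs: u64, rflags: u64, rsp: u64, ss: u64,
}

/// Returns (total size, offset of rip, offset of rsp) in bytes.
fn context_offsets() -> (usize, usize, usize) {
    let c = ContextSketch::default();
    let base = core::ptr::addr_of!(c) as usize;
    let rip = core::ptr::addr_of!(c.rip) as usize - base;
    let rsp = core::ptr::addr_of!(c.rsp) as usize - base;
    (core::mem::size_of::<ContextSketch>(), rip, rsp)
}
```

Twenty 8-byte slots give a 160 byte frame, so the handler must perform exactly twenty pushes (or equivalent stack adjustments) before the frame can be treated as a Context.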
The Context struct is defined in interrupts.rs. Because the stack grows downwards in memory we start at the end of the struct (ss and rsp fields), and store the values in order until we get to the top (r14 and r15). We’ve already pushed the user stack pointer (rsp) but need to reserve space before that for the stack segment (SS). We therefore subtract 8 (bytes) from rsp to make space before the rsp value, and we have to do a similar thing for CS. Other differences from the interrupt handler code are that syscall stores the user instruction pointer (rip) in rcx, and RFLAGS in r11. The assembly code in the naked handle_syscall() function so far looks like:
asm!(
    "swapgs", // Put the TSS address into GS (stored in syscalls::init)
    "mov gs:{tss_temp}, rsp", // Save user stack pointer in TSS entry
    "mov rsp, gs:{tss_timer}", // Get kernel stack pointer
    "sub rsp, {ks_offset}", // Use a different location than timer interrupt
    // Create an Exception stack frame
    "sub rsp, 8", // To be replaced with SS
    "push gs:{tss_temp}", // User stack pointer
    "swapgs", // Put TSS address back
    "push r11", // Caller's RFLAGS
    "sub rsp, 8", // CS
    "push rcx", // Caller's RIP
    "push rax",
    "push rbx",
    "push rcx",
    "push rdx",
    "push rdi",
    "push rsi",
    "push rbp",
    "push r8",
    "push r9",
    "push r10",
    "push r11",
    "push r12",
    "push r13",
    "push r14",
    "push r15",
We now have a thread Context and need to decide which syscall function to call. In section 2 we used some conditionals and called the sys_read or sys_write functions from the handle_syscall assembly code. Linux uses a jump table, an array of function pointers, to dispatch syscalls; in Rust we’ll instead use a two-stage method: handle_syscall will call a new (Rust) function dispatch_syscall, which will then use match to call the separate syscall functions.
To pass parameters to dispatch_syscall we’ll use the x86_64 C calling convention: The first six function parameters are in the rdi, rsi, rdx, rcx, r8 and r9 registers. We’re going to use five of these, for the Context address; the syscall number; and then three syscall parameters which the user can store in rdi, rsi and rdx:
"mov r8, rdx", // Fifth argument <- Syscall third argument
"mov rcx, rsi", // Fourth argument <- Syscall second argument
"mov rdx, rdi", // Third argument <- Syscall first argument
"mov rsi, rax", // Second argument is the syscall number
"mov rdi, rsp", // First argument is the Context address
"call {dispatch_fn}",
...
dispatch_fn = sym dispatch_syscall,
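The order of those five moves matters: each destination register is only overwritten after its old value has been copied elsewhere. A small host-side simulation makes this easy to check. The `Regs` struct and `shuffle_for_dispatch` function are illustrative only and model just the registers involved.

```rust
/// Just the registers touched by the argument shuffle.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Regs {
    rax: u64,
    rdi: u64,
    rsi: u64,
    rdx: u64,
    rcx: u64,
    r8: u64,
    rsp: u64,
}

/// Apply the same moves, in the same order, as the assembly above.
fn shuffle_for_dispatch(mut r: Regs) -> Regs {
    r.r8 = r.rdx; // fifth argument  <- syscall third argument
    r.rcx = r.rsi; // fourth argument <- syscall second argument
    r.rdx = r.rdi; // third argument  <- syscall first argument
    r.rsi = r.rax; // second argument <- syscall number
    r.rdi = r.rsp; // first argument  <- Context address
    r
}
```

Reversing the order (for example writing rdi first) would destroy the first syscall argument before it had been copied into rdx.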
In the dispatch_syscall function we need to finish the Context that we’ve created: In handle_syscall we didn’t set the stack or code segments. If this Context is used to return to a thread via an interrupt (e.g. in a context switch) then those segments will be wrong and we’ll probably get a General Protection Fault. To set these values we’ll use:
extern "C" fn dispatch_syscall(
    context_ptr: *mut Context,
    syscall_id: u64,
    arg1: u64, arg2: u64, arg3: u64) {
    let context = unsafe{&mut *context_ptr};
    // Set the CS and SS segment selectors
    let (code_selector, data_selector) =
        gdt::get_user_segments();
    context.cs = code_selector.0 as usize;
    context.ss = data_selector.0 as usize;
    ...
After that it’s just a match to choose which syscall function to call:
    ...
    match syscall_id {
        SYSCALL_FORK_THREAD => process::fork_current_thread(context),
        SYSCALL_DEBUG_WRITE => sys_debug_write(arg1 as *const u8, arg2 as usize),
        _ => println!("Unknown syscall {:?} {} {} {}",
                      context_ptr, syscall_id, arg1, arg2)
    }
}
The SYSCALL_FORK_THREAD branch calls a new function fork_current_thread which we’ll add to process.rs.
To create a new thread in the current process we need to:
- Allocate a new user stack and a new kernel stack for the new thread
- Create a new Thread object, with a reference counted pointer (Arc) to the same shared Process object as the caller.
- Set the return values so that the original thread can be distinguished from the new thread. Here we’ll do this by setting the rdi register to 0 in the new thread, and to the (non-zero) thread ID (tid) in the original thread.
The code to do this is in kernel/src/process.rs:
pub fn fork_current_thread(current_context: &mut Context) {
    if let Some(current_thread) = CURRENT_THREAD.read().as_ref() {
        // Allocate user stack
        let page_table_ptr = memory::active_pagetable_ptr();
        if let Ok((user_stack_start, user_stack_end)) = memory::allocate_user_stack(page_table_ptr) {
            let new_thread = {
                // Create a new kernel stack
                let kernel_stack = Vec::with_capacity(KERNEL_STACK_SIZE);
                let kernel_stack_start = VirtAddr::from_ptr(kernel_stack.as_ptr());
                let kernel_stack_end = (kernel_stack_start + KERNEL_STACK_SIZE).as_u64();
                Box::new(Thread {
                    tid: unique_id(),
                    process: current_thread.process.clone(), // Shared state
                    page_table_physaddr: current_thread.page_table_physaddr, // Shared page table
                    kernel_stack,
                    kernel_stack_end,
                    user_stack_end,
                    context: kernel_stack_end - INTERRUPT_CONTEXT_SIZE as u64,
                })
            };
            let new_context = unsafe {&mut *(new_thread.context as *mut Context)};
            *new_context = current_context.clone(); // Copy of caller
            // Set new stack pointer
            new_context.rsp = new_thread.user_stack_end as usize;
            // Set return values in rax (error code) and rdi (thread ID)
            new_context.rax = 0; // No error
            new_context.rdi = 0; // Indicates that this is the new thread
            current_context.rax = 0; // No error
            current_context.rdi = new_thread.tid as usize;
            RUNNING_QUEUE.write().push_back(new_thread);
        } else {
            // Failed to allocate user stack
            current_context.rax = syscalls::SYSCALL_ERROR_MEMALLOC; // Error code
        }
    } else {
        // Somehow no current thread
        current_context.rax = 2; // Error code
    }
}
Note that the new context is a clone (copy) of the calling thread’s context, including the return instruction pointer. Both threads will return from this syscall at the same point in the user code. We therefore need to be quite careful to make sure that we don’t accidentally share or double-free variables in user code.
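The return-value convention can be isolated from the kernel details and tried on a host. In this sketch `MiniContext` is a cut-down stand-in for the real Context, keeping only the four fields that matter here, and `fork_contexts` is a hypothetical helper rather than a kernel function.

```rust
/// Cut-down stand-in for the kernel Context: only the fields fork touches.
#[derive(Debug, Clone, PartialEq)]
struct MiniContext {
    rip: u64,
    rsp: u64,
    rax: u64,
    rdi: u64,
}

/// Build the child's context from the parent's and set both return values.
/// The parent is updated in place; the child context is returned.
fn fork_contexts(parent: &mut MiniContext, child_stack_end: u64, child_tid: u64) -> MiniContext {
    let mut child = parent.clone(); // same registers, same return address
    child.rsp = child_stack_end; // but its own stack
    child.rax = 0; // no error
    child.rdi = 0; // zero marks the new thread
    parent.rax = 0; // no error
    parent.rdi = child_tid; // non-zero thread ID marks the original thread
    child
}
```

Both contexts keep the same rip, so the only way user code can tell the threads apart after the syscall is by inspecting rdi.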
Now we have the syscall dispatch code in place, adding more syscalls becomes quite easy: We create a new function, choose a syscall number, and add it to the match in dispatch_syscall:
pub const SYSCALL_EXIT_THREAD: u64 = 1;
...
// in dispatch_syscall()
match syscall_id {
SYSCALL_EXIT_THREAD => process::exit_current_thread(context),
...
}
Exiting a thread is quite straightforward: We take ownership of the running Thread object, and then drop it. The drop implementation will then take care of freeing resources the thread is holding:
pub fn exit_current_thread(_current_context: &mut Context) {
    {
        let mut current_thread = CURRENT_THREAD.write();
        if let Some(_thread) = current_thread.take() {
            // Drop thread, freeing stacks. If this is the last thread
            // in this process, memory and page tables will be freed
            // in the Process drop() function
        }
    }
    // Can't return from this syscall, so this thread now waits for a
    // timer interrupt to switch context.
    unsafe {
        asm!("sti",
             "2:",
             "hlt",
             "jmp 2b");
    }
}
The above kernel code will create a new user thread with the same registers and page table as the caller, including instruction pointer (rip), but with different stack pointer (rsp) and zero in rdi.
To use it we need a user function to call syscall with SYSCALL_FORK_THREAD in rax, and then treat the two threads which return differently.
- The new thread has a new stack, so we can’t rely on any local variables (which may be stored in registers or on the stack) or return from any function.
- The Rust compiler may add code to the start and end of asm blocks, and assumes that an asm block which is entered is left once (or never for noreturn blocks), not twice.
We therefore need to make sure that the new thread never leaves the asm block where it is created; what happens in the asm block stays in the asm block.
The solution used here is to detect the new thread (which has both rax and rdi equal to zero), call a user-provided function, and when that returns use the SYSCALL_EXIT_THREAD syscall to stop the thread. That syscall never returns, so the new thread never leaves the asm block.
This is in euralios_std/src/syscalls.rs.
pub fn thread_spawn(
    func: extern "C" fn(usize) -> (),
    param: usize
) -> Result<u64, SyscallError> {
    let tid: u64;
    let errcode: u64;
    unsafe {
        asm!("syscall",
             // rax = 0 indicates no error
             "cmp rax, 0",
             "jnz 2f",
             // rdi = 0 for new thread
             "cmp rdi, 0",
             "jnz 2f",
             // New thread
             "mov rdi, r9", // Function argument
             "call r8",
             "mov rax, 1", // exit_current_thread syscall
             "syscall",
             // New thread never leaves this asm block
             "2:",
             in("rax") SYSCALL_FORK_THREAD,
             in("r8") func,
             in("r9") param,
             lateout("rax") errcode,
             lateout("rdi") tid,
             out("rcx") _,
             out("r11") _);
    }
    if errcode != 0 {
        return Err(SyscallError(errcode));
    }
    Ok(tid)
}
(I was not very consistent with fork vs spawn naming. Sorry).
This thread_spawn function takes a function pointer with C calling convention, and a parameter which will be stored in the first argument (rdi). In section 6 once we’ve got user-space memory allocation we can use this to pass a closure’s context and provide the Rust threading interface.
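As a preview of how a closure could travel through that single usize parameter, here is one possible sketch. It assumes a heap allocator is available (so it does not work in the user programs at this stage), and `closure_to_param` and `closure_trampoline` are hypothetical names, not functions from euralios_std.

```rust
/// Box a closure and turn it into a plain integer that fits in `param`.
/// The double Box gives a thin pointer to the (fat) trait object pointer.
fn closure_to_param<F: FnOnce() + Send + 'static>(f: F) -> usize {
    let boxed: Box<Box<dyn FnOnce() + Send>> = Box::new(Box::new(f));
    Box::into_raw(boxed) as usize
}

/// Function with the C calling convention that a spawn call could be given.
/// `param` must come from `closure_to_param` and be passed here exactly once.
extern "C" fn closure_trampoline(param: usize) {
    // Safety: see the requirement on `param` above.
    let outer = unsafe { Box::from_raw(param as *mut Box<dyn FnOnce() + Send>) };
    let inner: Box<dyn FnOnce() + Send> = *outer;
    inner(); // run the closure; its captured state is dropped afterwards
}
```

A spawn wrapper would then pass `closure_trampoline` as the function pointer and the value returned by `closure_to_param` as the parameter.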
The implementation of thread_exit is simpler because we just call the SYSCALL_EXIT_THREAD syscall and never return:
pub fn thread_exit() -> ! {
    unsafe {
        asm!("syscall",
             in("rax") SYSCALL_EXIT_THREAD,
             options(noreturn));
    }
}
We can also use these syscalls to write a better panic handler in the user program hello.rs:
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    println!("User panic: {}", info);
    syscalls::thread_exit();
}
So now when a user thread panics it will print the error message and exit.
When a thread exits we should be able to recover any memory it was allocated and re-use it for other threads or processes. Unfortunately at the moment our frame allocator doesn’t allow frames to be freed. In the next section we’ll write a new frame allocator which will allow us to free memory when it’s not needed.
Security issues: https://fuchsia.dev/fuchsia-src/concepts/kernel/sysret_problem
Dispatching syscalls through a jump table is an alternative approach which I don’t think is optimal, but is described here in case it’s helpful.
In C we can create a static array of function pointers (addresses), so that functions can be called when indexing into this array. This is used in Linux (for example) to enable fast lookup of a function pointer from a syscall number.
In Rust this seems to be difficult: Function pointers aren’t known at compile time (only link time), and so attempting to cast a function to a u64 statically doesn’t compile. The usual trick of using lazy_static also doesn’t work because we need to know the address of the array at link time.
The closest I’ve found so far is to define a static mutable array (highly discouraged!) in syscalls.rs:
const SYSCALL_NUMBER: usize = 2;
static mut SYSCALL_HANDLERS : [u64; SYSCALL_NUMBER]
= [0; SYSCALL_NUMBER];
In the init() function we can populate this array:
unsafe {
    SYSCALL_HANDLERS = [
        sys_read as u64,
        sys_write as u64
    ];
}
Now the syscall handler code can be simplified: It first checks that the syscall number (in rax) is in range, and if so looks up the handler address in the SYSCALL_HANDLERS table:
asm!(
    ...,
    "push r15",
    "cmp rax, {syscall_max}",
    "jae 1f", // Out of range (unsigned compare, so huge values are rejected too)
    "mov rax, [{syscall_handlers} + 8*rax]", // Lookup handler address
    "call rax",
    "1: ",
    "pop r15",
    ...,
    syscall_handlers = sym SYSCALL_HANDLERS,
    syscall_max = const SYSCALL_NUMBER,
    options(noreturn)
);
To be able to use the const argument to asm we need to add this feature to the top of lib.rs with #![feature(asm_const)].
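For comparison, a table of typed function pointers (rather than u64 addresses) can be declared as an ordinary immutable static, which avoids both the cast and the static mut. The sketch below is a possible alternative that is not used in this kernel: the stub handlers, the `SyscallHandler` type and the `dispatch` function are all hypothetical, and whether such a table is convenient to index from the naked assembly handler has not been checked here.

```rust
/// Signature shared by every handler in the table (an assumption).
type SyscallHandler = fn(u64, u64) -> u64;

// Stand-in handlers so that the example compiles on its own.
fn sys_read_stub(_arg1: u64, _arg2: u64) -> u64 {
    1
}
fn sys_write_stub(_arg1: u64, len: u64) -> u64 {
    len
}

/// Immutable table; function items coerce to function pointers here.
static SYSCALL_TABLE: [SyscallHandler; 2] = [sys_read_stub, sys_write_stub];

/// Bounds-checked lookup and call, done in Rust rather than assembly.
/// Returns None for an unknown syscall number. Assumes a 64-bit usize.
fn dispatch(id: u64, arg1: u64, arg2: u64) -> Option<u64> {
    let handler = SYSCALL_TABLE.get(id as usize)?;
    Some(handler(arg1, arg2))
}
```

Because the lookup uses `get`, an out-of-range syscall number yields `None` instead of reading past the end of the table.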