Kernel Development Learning Pipeline


Lecture 9 - 10 October 2023

Topics covered:

Syscalls: end to end (cont. from L08)

Syscalls: end-to-end

At the end of L08 we had followed the syscall process up to the point where the processor jumps to the entry_SYSCALL_64 function in entry_64.S, which is written in assembly. As documented here, for performance reasons the only actions taken by the syscall instruction besides elevating privilege to ring 0 are:

Saving the return address before jumping to the kernel handler:
- The current instruction pointer rip, which already points at the instruction following syscall, is copied into rcx
- The address of the kernel handler is loaded from the LSTAR model specific register into rip
Saving the current processor flags before resetting them to a known value:
- The current flags RFLAGS are copied into r11
- Every flag whose bit is set in the FMASK model specific register is then cleared in RFLAGS
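
The register effects listed above can be sketched as a small C model. This is only an illustration, not kernel code: struct cpu_model and model_syscall are hypothetical names, and the model covers just the registers mentioned in the list.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the few registers the syscall instruction touches. */
struct cpu_model {
    uint64_t rip;    /* instruction pointer */
    uint64_t rcx;    /* receives the return address */
    uint64_t r11;    /* receives the saved flags */
    uint64_t rflags; /* processor flags */
    uint64_t lstar;  /* MSR: address of the kernel entry point */
    uint64_t fmask;  /* MSR: flag bits to clear on entry */
    int cpl;         /* current privilege level (ring number) */
};

/*
 * Apply the effects of `syscall` to the model.
 * next_rip is the address of the instruction that follows `syscall`.
 */
static void model_syscall(struct cpu_model *cpu, uint64_t next_rip)
{
    cpu->rcx = next_rip;        /* save the return address */
    cpu->rip = cpu->lstar;      /* jump to the kernel handler */
    cpu->r11 = cpu->rflags;     /* save the userspace flags */
    cpu->rflags &= ~cpu->fmask; /* clear every flag selected by the mask */
    cpu->cpl = 0;               /* now running in ring 0 */
}
```

Note that nothing else is touched: rsp and every other general purpose register still hold userspace values when the handler starts running.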

Within the entry_SYSCALL_64 handler function, it is the responsibility of the kernel to save any other userspace state that it wishes to restore when the syscall returns. In the case of the Linux kernel, all of the normal CPU registers need to be saved.

However, this presents a problem because essentially all CPU instructions involve manipulating the values stored in the CPU registers. We want to save the data somewhere in memory, but we can’t even load a fixed pointer into a register to move data into that memory location because that would clobber one of the values we need to save.

Fortunately the designers of the CPU built in an escape hatch for this exact problem: the swapgs instruction.

On x86, gs is a special type of register called a “segment” register. Segment registers were historically added to make it easier to access more than 64K of memory on Intel’s 16-bit 8086 CPU. General purpose registers could only store 16-bit pointers, and segmentation filled in the higher order bits to determine the full address, depending on what the instruction was doing with the pointer (using it to access code, or data, or the stack, etc.).

Segmentation is no longer a concern on 64-bit systems, where the registers can easily store pointers to orders of magnitude more virtual addresses than any computer could have physical RAM. The segment registers still exist on modern CPUs, though, and they have picked up a new function: storing a pointer for accessing thread local data. The gs register holds a pointer to a block of memory reserved for thread specific data. Any instruction can use this pointer by setting the gs prefix on a memory access and providing the desired offset into the thread specific data as the “address”. The CPU adds the base address from gs to the offset supplied in the instruction, and the access lands in the thread local data.

The special swapgs instruction (line 91) allows ring 0 (kernel) code to atomically exchange the current gs base with a value the kernel previously stored in a model specific register. On syscall entry this loads a well known pointer to per CPU data into gs and parks the old userspace gs value in that same MSR, so that a second swapgs can restore it before returning to userspace.
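
A rough way to picture this is a C model that treats swapgs as a plain exchange between the active gs base and one kernel-held slot, and treats a gs-prefixed access as base plus offset. The struct and function names here are hypothetical and the addresses in any usage are made up; this is a sketch of the idea, not of the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two values involved in swapgs. */
struct gs_model {
    uint64_t gs_base;        /* base used by gs-prefixed memory accesses */
    uint64_t kernel_gs_base; /* value parked in the model specific register */
};

/* Exchange the active gs base with the parked value. */
static void model_swapgs(struct gs_model *cpu)
{
    uint64_t tmp = cpu->gs_base;
    cpu->gs_base = cpu->kernel_gs_base;
    cpu->kernel_gs_base = tmp;
}

/* Address actually accessed by an instruction that uses gs:offset. */
static uint64_t gs_effective_address(const struct gs_model *cpu, uint64_t offset)
{
    return cpu->gs_base + offset;
}
```

Because the exchange needs no general purpose register, it gives the handler a usable pointer without clobbering any userspace value, and running it a second time puts everything back.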

The handler code can then use scratch space allocated in the per CPU block to save the userspace stack pointer (line 93) and replace rsp with a pointer to the kernel stack (line 95). Once that has been completed, the rest of the registers can be saved by just pushing them onto the kernel stack. The values are pushed in a specific order (lines 100-109) so that the overall footprint of the data on the stack matches the layout of a struct pt_regs.

This means that after all the pushing, rsp is a valid struct pt_regs * pointer. It is copied into rdi (line 112) to become the first argument, and the syscall number in rax is copied into rsi (line 114) to become the second argument, of the call to the C function do_syscall_64.
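
The trick of pushing values so that the stack pointer ends up pointing at a struct can be sketched in C. struct mini_regs below is a hypothetical four-field stand-in (the real struct pt_regs has many more fields), and push and save_regs are illustrative helpers. The stack grows toward lower addresses, so the last field of the struct has to be pushed first.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct pt_regs. */
struct mini_regs {
    uint64_t ax;    /* lowest address: pushed last */
    uint64_t ip;
    uint64_t flags;
    uint64_t sp;    /* highest address: pushed first */
};

/* Model of a push on a stack that grows toward lower addresses. */
static uint64_t *push(uint64_t *sp, uint64_t value)
{
    sp -= 1;
    *sp = value;
    return sp;
}

/*
 * Push the saved values in reverse field order. The final stack pointer
 * then points at memory laid out exactly like a struct mini_regs.
 */
static struct mini_regs *save_regs(uint64_t *stack_top, uint64_t user_sp,
                                   uint64_t flags, uint64_t ip, uint64_t ax)
{
    uint64_t *sp = stack_top;

    sp = push(sp, user_sp);
    sp = push(sp, flags);
    sp = push(sp, ip);
    sp = push(sp, ax);
    return (struct mini_regs *)sp;
}
```

The caller can then read regs->ip or regs->sp by name without any copying, which is what lets the assembly hand its stack pointer straight to C code.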

From this point things are simpler. The kernel attempts to interpret the syscall number as a 64-bit syscall by calling do_syscall_x64, and we can assume this succeeds if we are calling from 64-bit code.

The meat of that function is verifying that the syscall number is in range (line 48), then looking up the function pointer for that syscall number in an array and calling it. The return value is saved into the entry for the ax register in the struct pt_regs, which will be restored when the kernel code returns (line 50).
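
The range check and table lookup described above follow a common dispatch pattern, sketched here under simplifying assumptions. struct call_regs, handler_table and dispatch are hypothetical names and the two handlers are toys; only the shape of the logic mirrors the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical saved-register block: ax holds the result, di the first argument. */
struct call_regs {
    long ax;
    long di;
};

typedef long (*handler_fn)(const struct call_regs *);

static long handler_double(const struct call_regs *regs) { return regs->di * 2; }
static long handler_negate(const struct call_regs *regs) { return -regs->di; }

/* Table indexed by call number. */
static const handler_fn handler_table[] = { handler_double, handler_negate };
#define NR_HANDLERS (sizeof(handler_table) / sizeof(handler_table[0]))

/*
 * Look up and run the handler for nr, storing its result in regs->ax.
 * Returns false and leaves regs untouched when nr is out of range.
 */
static bool dispatch(struct call_regs *regs, unsigned long nr)
{
    if (nr < NR_HANDLERS) {
        regs->ax = handler_table[nr](regs);
        return true;
    }
    return false;
}
```

Writing the result into the saved register block rather than returning it directly is what makes the value show up in rax once the registers are restored.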

The sys_call_table array is generated using a technique called an X macro. During the kernel build process, a header file is generated from the syscall table. The header invokes a macro __SYSCALL (which is not defined within the header itself) once for each syscall, passing the syscall's number and entry point as arguments.

The __SYSCALL macro can be given whatever definition the user wants, and then the header can be included to programmatically generate invocations of that specific version of the macro for each syscall. Within arch/x86/entry/syscall_64.c the __SYSCALL macro is defined twice. The first time (lines 10-12) it uses the syscall name in the argument to form a declaration for a function named __x64_sys_something that takes a const struct pt_regs * argument. The second time (lines 14-18) it fills the sys_call_table variable with pointers to each of those functions in the right order.
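
Here is a minimal sketch of the X macro technique in a single file. Since a self-contained example cannot include a generated header, the list macro DEMO_SYSCALLS stands in for it; that substitution, and every demo_ name, is an assumption made for illustration.

```c
#include <assert.h>

/* Stand-in for the generated header: one X(...) invocation per entry. */
#define DEMO_SYSCALLS(X) \
    X(0, demo_read)      \
    X(1, demo_write)     \
    X(2, demo_open)

/* First definition of X: declare one function per entry. */
#define X(nr, sym) static long sym(long arg);
DEMO_SYSCALLS(X)
#undef X

/* Second definition of X: fill a table with pointers to those functions. */
#define X(nr, sym) [nr] = sym,
static long (*const demo_table[])(long) = { DEMO_SYSCALLS(X) };
#undef X

/* Toy bodies for the declared functions. */
static long demo_read(long arg)  { return arg + 100; }
static long demo_write(long arg) { return arg + 200; }
static long demo_open(long arg)  { return arg + 300; }
```

The list of entries is written once, so the declarations and the table cannot drift out of sync when an entry is added.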

These wrapper functions are defined as part of the SYSCALL_DEFINE macro. The __X64_SYS_STUBx macro generates a function named __x64_sys_whatever that takes a struct pt_regs pointer and whose body just calls another wrapper, whose name starts with __se_, passing the real syscall args. These are obtained by the SC_X86_64_REGS_TO_ARGS macro, which converts the list of arguments into accesses of regs->register for each register in the appropriate order.

The __se_ wrapper (short for sign extension) has to do with 32-bit compatibility and is generated in place by the SYSCALL_DEFINE macro (line 233). It finally calls into a function with the prefix __do_. The header of that function sits right at the end of the macro expansion (line 240), so its body is supplied by the code within the curly braces that follows a given SYSCALL_DEFINE invocation.
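
The way a macro can end on a function header and borrow the braces that follow it is easier to see in a reduced form. The sketch below is a hypothetical two-argument analogue, not the kernel's macro: DEMO_SYSCALL_DEFINE2, struct stub_regs and the demo_ prefixes are all made up, and the choice of di and si for the arguments is an assumption of the example.

```c
#include <assert.h>

/* Hypothetical saved-register block holding the first two arguments. */
struct stub_regs {
    unsigned long di;
    unsigned long si;
};

/*
 * Generates three functions for `name`:
 *   demo_x64_sys_name: unpacks the arguments from the register block
 *   demo_se_sys_name:  takes every argument as long, casts to the real types
 *   demo_do_sys_name:  the real body, supplied by the braces after the macro
 */
#define DEMO_SYSCALL_DEFINE2(name, t1, a1, t2, a2)                    \
    static long demo_do_sys_##name(t1 a1, t2 a2);                     \
    static long demo_se_sys_##name(long a1, long a2)                  \
    {                                                                 \
        return demo_do_sys_##name((t1)a1, (t2)a2);                    \
    }                                                                 \
    static long demo_x64_sys_##name(const struct stub_regs *regs)     \
    {                                                                 \
        return demo_se_sys_##name((long)regs->di, (long)regs->si);    \
    }                                                                 \
    static long demo_do_sys_##name(t1 a1, t2 a2)

/* The braces below become the body of demo_do_sys_add. */
DEMO_SYSCALL_DEFINE2(add, int, a, int, b)
{
    return (long)a + b;
}
```

Calling demo_x64_sys_add with a register block therefore runs through both wrappers before reaching the code written between the braces.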

At this point it is running the code in that block and the syscall has officially begun :)