Syscalls: End-to-End

Part I

All previously discussed exceptions were just that: exceptional.

The kernel handled them behind the scenes, invisibly to the user, or they were the result of a bug, in which case the kernel sent the program a fatal signal (e.g. SIGILL, SIGSEGV, or SIGFPE).

Today we demonstrate the mechanism by which your code can intentionally call a system function running in kernel mode; in other words, a system call.

We want to know how the userspace invocation ultimately connects back to the SYSCALL_DEFINE macro that you hunted down, traced, and wrote a history report on during E1.

Let’s follow the path of a syscall on x86-64.

First of all, a user program invokes the syscall instruction.
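For concreteness, here is a minimal sketch (mine, not from the kernel or libc) of doing exactly that from C with GCC/Clang inline assembly, bypassing the libc wrapper: the syscall number goes in rax, the first three arguments in rdi, rsi, and rdx, and the instruction itself clobbers rcx and r11.

	/* Sketch: call write(1, "hello\n", 6) via the raw syscall instruction.
	 * 1 is __NR_write on x86-64. */
	static long raw_write(int fd, const void *buf, unsigned long count)
	{
		long ret;
		asm volatile("syscall"
			     : "=a"(ret)                                    /* return value comes back in rax */
			     : "a"(1L), "D"((long)fd), "S"(buf), "d"(count) /* rax, rdi, rsi, rdx */
			     : "rcx", "r11", "memory");                     /* clobbered by the instruction */
		return ret;
	}

	int main(void)
	{
		return raw_write(1, "hello\n", 6) < 0;
	}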

When this instruction executes, the CPU immediately elevates the privilege level and jumps to the address stored in the LSTAR MSR.

Who set the value of this LSTAR Model Specific Register?

In the syscall_init function, we find this line:

	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

This sets the value of LSTAR to the address of the entry_SYSCALL_64 function.
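wrmsrl is a thin kernel helper around the privileged wrmsr instruction. Roughly (ignoring the kernel's tracing and paravirt hooks; this is only an illustrative sketch, not the real helper), it boils down to:

	/* wrmsr takes the MSR index in ecx and the 64-bit value split across
	 * edx:eax. It is a privileged instruction, so this only works in ring 0. */
	static inline void wrmsrl_sketch(unsigned int msr, unsigned long long val)
	{
		asm volatile("wrmsr"
			     : /* no outputs */
			     : "c"(msr), "a"((unsigned int)val), "d"((unsigned int)(val >> 32)));
	}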

entry_SYSCALL_64 is not C code :) It is defined in arch/x86/entry/entry_64.S.

The .S suffix indicates this is not just regular assembly code, but code that must be run through cpp in order to resolve macros in the file. The kernel uses these macros to make it easier to write the code correctly and facilitate interoperability with the rest of the kernel.

At this time, the kernel is executing in a privileged mode, but all of the registers refer to userspace data.

Part II

At the end of Part I we had discussed the syscall process up to the point where the processor jumps to the entry_SYSCALL_64 function within entry_64.S, which is written in assembly. For performance reasons, the only actions taken by the syscall instruction besides elevating privilege to ring 0 are: saving the address of the following instruction into rcx, saving rflags into r11, loading rip from the LSTAR MSR, and masking rflags according to the FMASK MSR. Notably, it does not switch stacks, and it does not save any other registers.

Within the entry_SYSCALL_64 handler, it is the responsibility of the kernel to save any other userspace state that it wishes to restore when the syscall returns. In the case of the Linux kernel, that means all of the general purpose CPU registers.

However, this presents a problem, because essentially every CPU instruction manipulates values stored in CPU registers. We want to save the data somewhere in memory, but we can't even load a fixed pointer into a register to address that memory, because doing so would clobber one of the values we need to save.

Fortunately, the designers of the CPU built in an escape hatch for this exact problem: the swapgs instruction.

On x86, gs is a special type of register called a “segment” register. Segment registers were historically added to give Intel’s 16-bit 8086 access to more than 64K of memory: the general purpose registers could only hold 16-bit pointers, and a segment register supplied the correct higher order bits of the full address, with the CPU choosing which segment register to use based on what the instruction was doing with the pointer (fetching code, accessing data, using the stack, etc.).

Segmentation is no longer a concern on 64-bit systems, where a register can easily hold a pointer into a virtual address space orders of magnitude larger than any computer's physical RAM, but the segment registers still exist on modern CPUs and have picked up a new function: storing a base pointer for thread local data. The gs register holds a pointer to a block of memory reserved for thread specific data, and any instruction can reach into it by adding a gs prefix to a memory access and supplying the desired offset into the block as the “address”. The CPU adds the base address held in gs to the offset supplied by the instruction, and the access lands in the thread local block.
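To make the addressing mode concrete, here is a small sketch (mine; the offset's meaning depends entirely on how the per-thread or per-cpu block was laid out, so treat it as a placeholder) of a gs-relative load:

	/* Sketch: read a 64-bit value stored at a given offset from the gs base.
	 * The CPU adds the gs base to the offset for us. */
	static unsigned long read_gs_u64(unsigned long offset)
	{
		unsigned long value;
		asm volatile("movq %%gs:(%1), %0" : "=r"(value) : "r"(offset));
		return value;
	}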

The special swapgs instruction (line 91) allows ring 0 (kernel) code to atomically exchange the gs base with a well known value the kernel previously stashed in the KERNEL_GS_BASE model specific register, which holds a pointer to per-cpu data. After the swap, the old gs base from userspace sits in that MSR, so it can be restored later by another swapgs on the way out.

The handler code can then use scratch space allocated in the per-cpu block to save the userspace stack pointer (line 93) and replace rsp with a pointer to the kernel stack (line 95). Once that has been done, the rest of the registers can be saved by simply pushing them onto the kernel stack. The values are pushed in a specific order (lines 100-109) so that the overall footprint of the data on the stack matches the layout of a struct pt_regs.
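For reference, the x86-64 struct pt_regs looks roughly like this (abbreviated from arch/x86/include/asm/ptrace.h; comments mine). The stack grows downward, so the entry code pushes ss first and r15 last, leaving the last value pushed at the lowest address, lined up with the struct's first field:

	struct pt_regs {
		unsigned long r15, r14, r13, r12;
		unsigned long bp, bx;
		unsigned long r11, r10, r9, r8;
		unsigned long ax, cx, dx, si, di;
		unsigned long orig_ax;               /* the syscall number, as passed in rax */
		unsigned long ip, cs, flags, sp, ss;
	};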

This means that after all the pushing, rsp is a valid struct pt_regs * pointer. It can be copied into rdi (line 112) to become the first argument, and the syscall number in rax can be copied into rsi (line 114) to become the second, when the handler calls the C function do_syscall_64.

From this point things are simpler: the kernel attempts to interpret the syscall number as a 64-bit syscall by calling do_syscall_x64, and we can assume this succeeds since we are calling from 64-bit code.

The meat of that function is verifying that the syscall number is in range (line 48), looking up the function pointer for that number in an array, calling it, and saving the return value into the entry for the ax register in the struct pt_regs that will be restored when the kernel returns to userspace (line 50).
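Sketched in C and condensed (this assumes the kernel's struct pt_regs, sys_call_table, and NR_syscalls are in scope, and omits details such as speculation hardening that vary by kernel version), the dispatch step looks roughly like this:

	/* Bounds-check the syscall number, call the stub for that number, and
	 * store its return value where userspace's rax will be restored from. */
	static bool do_syscall_x64_sketch(struct pt_regs *regs, int nr)
	{
		unsigned int unr = nr;

		if (unr < NR_syscalls) {
			regs->ax = sys_call_table[unr](regs);
			return true;
		}
		return false;   /* not a valid 64-bit syscall number */
	}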

The sys_call_table array is generated using a technique called an X macro. During the kernel build, a header file is generated from the syscall table that invokes a macro named __SYSCALL (which is deliberately not defined within the header) once for each syscall, with the syscall's number and entry point as arguments.

The __SYSCALL macro can be given whatever definition the includer wants, and the header can then be included to programmatically generate an invocation of that particular definition for each syscall. Within arch/x86/entry/syscall_64.c the __SYSCALL macro is defined twice: the first time (lines 10-12) it uses the syscall name to form a declaration of a function named __x64_sys_something that takes a const struct pt_regs * argument, and the second time (lines 14-18) it fills the sys_call_table variable with pointers to each of those functions in the right order.
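To see the technique in isolation, here is a small self-contained imitation (a fake three-entry syscall list and a stand-in pt_regs; the kernel's generated header and table follow the same shape, just much larger):

	#include <stdio.h>

	struct pt_regs { long di, si, dx; };    /* stand-in for the kernel struct */

	/* Imagine this list is the generated header: one __SYSCALL(nr, entry)
	 * invocation per syscall, with __SYSCALL left undefined. */
	#define SYSCALL_LIST        \
		__SYSCALL(0, sys_read)  \
		__SYSCALL(1, sys_write) \
		__SYSCALL(2, sys_open)

	/* First expansion: one function per entry (the kernel produces
	 * declarations of the real __x64_sys_* stubs here). */
	#define __SYSCALL(nr, sym) \
		static long __x64_##sym(const struct pt_regs *regs) \
		{ (void)regs; printf(#sym " called\n"); return 0; }
	SYSCALL_LIST
	#undef __SYSCALL

	/* Second expansion: put each function's address at its syscall number. */
	typedef long (*sys_call_ptr_t)(const struct pt_regs *);
	static const sys_call_ptr_t sys_call_table[] = {
	#define __SYSCALL(nr, sym) [nr] = __x64_##sym,
		SYSCALL_LIST
	#undef __SYSCALL
	};

	int main(void)
	{
		struct pt_regs regs = { 0 };
		sys_call_table[1](&regs);   /* dispatch "sys_write" by number */
		return 0;
	}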

These wrapper functions are defined as part of the SYSCALL_DEFINE macro. The __X64_SYS_STUBx macro generates a function named __x64_sys_whatever that takes a struct pt_regs * and whose body just calls another wrapper, whose name starts with __se_, with the real syscall arguments. Those arguments are produced by the SC_X86_64_REGS_TO_ARGS macro, which converts the argument list into an access to regs->di, regs->si, and so on, one saved register for each argument in the appropriate order.

The __se_ wrapper (short for sign extension) exists for 32-bit compatibility and is generated in place by the SYSCALL_DEFINE macro (line 233). Finally, it calls into a function with the prefix __do_, whose header sits right at the end of the wrapper (line 240) and whose body is supplied by the code within the curly braces that follow a given SYSCALL_DEFINE invocation.
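Putting the wrapper chain together: a condensed, out-of-kernel sketch of what SYSCALL_DEFINE3(write, ...) generates might look like the following. The function names follow the real naming scheme, but the types and bodies are simplified stand-ins so the sketch compiles on its own:

	#include <stdio.h>

	struct pt_regs { unsigned long di, si, dx; };   /* stand-in for the kernel struct */

	/* __do_sys_write: the body the author wrote inside the SYSCALL_DEFINE braces */
	static long __do_sys_write(unsigned int fd, const char *buf, unsigned long count)
	{
		(void)buf;
		printf("write(fd=%u, count=%lu)\n", fd, count);
		return (long)count;
	}

	/* __se_sys_write: narrows/sign-extends each raw register value to its declared type */
	static long __se_sys_write(unsigned long fd, unsigned long buf, unsigned long count)
	{
		return __do_sys_write((unsigned int)fd, (const char *)buf, count);
	}

	/* __x64_sys_write: what the dispatcher calls; SC_X86_64_REGS_TO_ARGS becomes
	 * the regs->di, regs->si, regs->dx argument list below */
	long __x64_sys_write(const struct pt_regs *regs)
	{
		return __se_sys_write(regs->di, regs->si, regs->dx);
	}

	int main(void)
	{
		char msg[] = "hi";
		struct pt_regs regs = { .di = 1, .si = (unsigned long)msg, .dx = 2 };
		__x64_sys_write(&regs);
		return 0;
	}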

At this point the kernel is running the code in that block, and the syscall has officially begun :)