A system call is one of the few ways a user program can request services from the kernel. You have probably used functions like read() and write(). The C library implements these as thin wrappers around the corresponding system calls, leaving as little work as possible to the kernel. When a library function makes the system call, the processor switches to a higher privilege mode and the kernel does the work on behalf of the user in a 'user context'.
Making the actual system call is different from calling a regular function. The processor has to enter a privileged mode, and a plain function call into kernel space cannot do that: it would let user code jump into the kernel address space, which would be a security hole. Instead, Linux uses a software interrupt (interrupt number 0x80 on x86). Each system call is identified by an index called the system call number; on 32-bit Linux 3.11.1 the numbers are listed in /usr/include/asm/unistd_32.h.
To make a system call, we place the system call number in the %eax register and the operands in %ebx, %ecx and so on. On issuing software interrupt 0x80, the processor switches to kernel mode and starts executing the corresponding interrupt handler. The handler, which is called 'system_call' in the kernel's 32-bit x86 entry code, saves the state of the user program and calls the requested system call with the given arguments. After checking the user's permissions, the system call does its job and returns, leaving the return value in the %eax register. It is the responsibility of the system call to check that the addresses provided by the user are valid and that the user may access them.
Doing a system call from user space
Here is a piece of code that makes the getpid() system call using inline assembly.
#include <stdio.h> // for printf()
#include <sys/syscall.h> // for SYS_getpid
int _ret, _sys_call_no;
void test_syscall() {
_sys_call_no = SYS_getpid;
asm("pushal;" // save all registers
"mov _sys_call_no, %eax;" // put system call no. in %eax
"int $0x80;" // sw interrupt
"mov %eax, _ret;" // get result
"popal;"); // load saved registers
printf("%d\n", _ret);
}
int main() {
test_syscall();
return 0;
}
glibc provides syscall(), a function that does the above job for us. Here is our code using it.
#include <stdio.h> // for printf()
#include <sys/syscall.h> // for SYS_getpid
#include <unistd.h> // for syscall()
void test_syscall() {
printf("%ld\n", syscall(SYS_getpid));
}
As an interesting example, let's make the chdir() system call. Note that this time we pass the kernel an address in user space as the argument.
#include <stdio.h> // for printf()
#include <sys/syscall.h> // for SYS_chdir (12 on 32-bit x86)
int _ret;
char _path[] = ".."; // path for chdir()
void test_syscall() {
asm("pushal;" // save all registers
"mov $12, %eax;" // system call no. of chdir in %eax
"lea _path, %ebx;"
"int $0x80;"
"mov %eax, _ret;"
"popal;");
printf("%d\n", _ret);
}
Doing a system call from kernel space
Let's call the chdir() system call again, in a similar manner, but this time from kernel space using a module. (You can read about writing a module in [1].)
#include <linux/module.h> // for module_init(), printk()
int _ret = 0;
char _path[] = "..";
static int __init kernel_syscall_init(void) {
asm("pushal;"
"mov $12, %eax;"
"lea _path, %ebx;"
"int $0x80;"
"mov %eax, _ret;"
"popal;");
printk("%d\n", _ret);
return 0;
}
static void __exit kernel_syscall_exit(void) {
}
MODULE_LICENSE("GPL");
module_init(kernel_syscall_init);
module_exit(kernel_syscall_exit);
You can automate the build process with a script; read about it in [2].
What happened when we ran the module? It returned a negative number, an error! But why, when the same code ran correctly from user space?
The system call checks whether the buffer it was given is a legal address. By default, an address in the user address range (0 to 3 GB in the standard kernel configuration) is considered valid, and an address in the kernel address space (3 GB to 4 GB) is not. When we invoke the system call from kernel space, we must keep this check from failing, because the virtual address of our path buffer lies in kernel space, above the 3 GB mark. Here is how it is done.
#include <asm/uaccess.h> // for get_fs(), set_fs()
#include <linux/module.h> // for module_init(), printk()
int _ret = 0;
char _path[] = "..";
static int __init kernel_syscall_init(void) {
mm_segment_t saved_fs = get_fs();
set_fs(get_ds());
asm("pushal;"
"mov $12, %eax;"
"lea _path, %ebx;"
"int $0x80;"
"mov %eax, _ret;"
"popal;");
printk("%d\n", _ret);
set_fs(saved_fs);
return 0;
}
static void __exit kernel_syscall_exit(void) {
}
MODULE_LICENSE("GPL");
module_init(kernel_syscall_init);
module_exit(kernel_syscall_exit);
Working now. :)
What is this get_fs(), get_ds(), set_fs() stuff?
The addr_limit field describes the highest legal virtual address for such arguments, and get_fs() and set_fs() are macros that read and modify it. get_ds() returns the limit that must be used for kernel addresses. It is important to restore addr_limit to its original value afterwards; otherwise the user-space program that triggered this code might keep the raised limit.
I know it is hard to believe that the modifiers for addr_limit would be called get_fs() and set_fs(). Here is the code to prove that I am right!
#include <asm/uaccess.h> // for get_fs(), set_fs()
#include <linux/module.h> // for module_init(), printk()
int _fs_register, _ds_register;
static int __init kernel_syscall_init(void) {
// Initial values.
mm_segment_t addr_limit = current_thread_info()->addr_limit;
mm_segment_t fs_macro_value = get_fs();
asm("mov %fs, _fs_register;" // FS register to _fs_register
"mov %ds, _ds_register;"); // DS register to _ds_register
printk("addr_limit = %lu\n", addr_limit.seg);
printk("fs_macro_value = %lu\n", fs_macro_value.seg);
printk("_fs_register = %d\n", _fs_register);
printk("_ds_register = %d\n", _ds_register);
// Values after set_fs(get_ds()).
mm_segment_t saved_fs = get_fs();
set_fs(get_ds());
addr_limit = current_thread_info()->addr_limit;
fs_macro_value = get_fs();
asm("mov %fs, _fs_register;" // FS register to _fs_register
"mov %ds, _ds_register;"); // DS register to _ds_register
printk("addr_limit = %lu\n", addr_limit.seg);
printk("fs_macro_value = %lu\n", fs_macro_value.seg);
printk("_fs_register = %d\n", _fs_register);
printk("_ds_register = %d\n", _ds_register);
// Restored values.
set_fs(saved_fs);
addr_limit = current_thread_info()->addr_limit;
fs_macro_value = get_fs();
asm("mov %fs, _fs_register;" // FS register to _fs_register
"mov %ds, _ds_register;"); // DS register to _ds_register
printk("addr_limit = %lu\n", addr_limit.seg);
printk("fs_macro_value = %lu\n", fs_macro_value.seg);
printk("_fs_register = %d\n", _fs_register);
printk("_ds_register = %d\n", _ds_register);
return 0;
}
static void __exit kernel_syscall_exit(void) {
}
MODULE_LICENSE("GPL");
module_init(kernel_syscall_init);
module_exit(kernel_syscall_exit);
[623233.082407] addr_limit = 3221225472
[623233.082432] fs_macro_value = 3221225472
[623233.082458] _fs_register = 216
[623233.082480] _ds_register = 123
[623233.082507] addr_limit = 4294967295
[623233.082526] fs_macro_value = 4294967295
[623233.082546] _fs_register = 216
[623233.082565] _ds_register = 123
[623233.082589] addr_limit = 3221225472
[623233.082607] fs_macro_value = 3221225472
[623233.082626] _fs_register = 216
[623233.082645] _ds_register = 123
It turns out that FS, an extra segment register on x86, was once used by Linux to hold the address of the user-space data segment while running in kernel mode. That is no longer done, and the name survives only as a legacy in a few macros.
This is not a good thing to do
The Linux developers disapprove of this, so they have not made it convenient for us to write modules that bypass the expected privilege restrictions. This keeps the kernel cleaner and more secure.
[1] http://pointer-overloading.blogspot.com/2013/09/a-hello-world-linux-kernel-module.html
[2] http://pointer-overloading.blogspot.com/2013/09/linux-python-script-to-build-linux.html