Advanced Operating Systems (LV 7680)
Debugging SOS
When developing an operating system on top of seL4 you do not have the luxury of using a debugger such as gdb. Your best bet is a combination of dprintf
and objdump
. If you are doing the project on a Linux machine (using our linux toolchain) you can also use the backtrace function, combined with a script for parsing objdump output to obtain a primitive stack trace.
Debugging Faults
All exceptions result in a context switch to the seL4 kernel. The kernel exports fault handling policies to user level by sending an IPC to the appropriate IPC endpoint (see Chapter 5 of the seL4 manual). For applications such as ttytest, the assigned endpoint will usually be the IPC endpoint that SOS waits on in the syscallloop. SOS on the other hand is initialised with no fault endpoint and hence is expected to ensure that it does not cause a fault.
SOS faults
Below is an example of code that triggers a page fault from the SOS main function.
int main(void)
{
init_muslc();
/* bootinfo was set as an environment variable in _sel4_start */
char *bi_string = getenv("bootinfo");
ZF_LOGF_IF(!bi_string, "Could not parse bootinfo from env.");
/* register the location of the unwind_tables -- this is required for
* backtrace() to work */
__register_frame(&__eh_frame_start);
seL4_BootInfo *boot_info;
if (sscanf(bi_string, "%p", &boot_info) != 1) {
ZF_LOGF("bootinfo environment value '%s' was not valid.", bi_string);
}
debug_print_bootinfo(boot_info);
printf("\nSOS Starting...\n");
NAME_THREAD(seL4_CapInitThreadTCB, "SOS:root");
// cause a fault at 0xDEADBEEF
*((char*)0xDEADBEEF) = 0;
When this code runs, the following output is displayed in the picocom
terminal:
SOS Starting...
Caught cap fault in send phase at address (nil)
while trying to handle:
vm fault on data at address 0xdeadbeef with status 0x92000045
in thread 0xffffff807efd6400 "SOS:root" at address 0x407c50
With stack:
0x4e59c0: 0x0
0x4e59c8: 0x48966c
0x4e59d0: 0x4e5a78
0x4e59d8: 0x0
0x4e59e0: 0x0
0x4e59e8: 0x0
0x4e59f0: 0x5ea000
0x4e59f8: 0x4895ec
0x4e5a00: 0x4e5a78
0x4e5a08: 0x4ac025
0x4e5a10: 0x48965c
0x4e5a18: 0x0
0x4e5a20: 0x4e5af0
0x4e5a28: 0x0
0x4e5a30: 0x0
0x4e5a38: 0x0
What's going on here? First, a page fault is triggered by the dereference of invalid address 0xDEADBEEF
. Then, seL4 attempts to send a page fault IPC to the fault handling endpoint of SOS. However, SOS doesn't have a fault handling endpoint, as it is set to the default value of 0, an empty slot in the cspace. As a result, a double fault is sent, which means the thread is stopped by the kernel, and in debug mode, the above output is produced.
From the output, observe that SOS caused a vm fault with the following important values:
- The fault address,
0xdeadbeef
. - The exception status register,
0x92000045
- The address of the instruction that faulted,
0x407c50
The fault address
The fault address is the invalid address which caused a vm fault (a TLB translation error). This is 0xdeadbeef, the address we tried to write to. Note that invalid addresses will not always trigger a vm fault: if you are unlucky enough to accidentally derefence a mapped address no fault will occur.
The exception status register
The exception status register is referred to as far_el1 on 64-bit ARM architectures (aarch64) and DFSR on 32-bit ARM architectures (aarch32). To decode the value, look at section D.10.4 of the ARMv8 Architecture Reference Manual
The fault address register
The most useful address is the fault address register, which contains the value of the instruction pointer at the time of the fault. This value can then be looked up with the objdump
tool to to determine the context of fault.
Application faults
When an application thread is created, an appropriate fault handler IPC endpoint should be set in the thread's TCB (see Chapter 5 of the seL4 manual). If the application causes a fault and IPC will be sent to the assigned fault endpoint. Message registers provide important information regarding the cause and location of the fault. objdump
may then be used to determine the context of fault.
Using objdump
Whenever you cause a fault, you should be able to see the current program counter printed out. You can use objdump to find where the offending code is. Pass to objdump the location of the binary that faulted: in this case, it's the sos binary, located in the build directory:
aarch64-linux-gnu-objdump -Dlx build/projects/aos/sos/sos | less
You can now use the searching facility in less to search for the faulting address. In this case I find the following fragment of output:
407c44: 97fff6dc bl 4057b4
/home/mmust/aos-2020-test/build/../projects/aos/sos/src/main.c:516
407c48: d297dde0 mov x0, #0xbeef // #48879
407c4c: f2bbd5a0 movk x0, #0xdead, lsl #16
407c50: 3900001f strb wzr, [x0]
/home/mmust/aos-2020-test/build/../projects/aos/sos/src/main.c:519
407c54: f9401ba1 ldr x1, [x29,#48]
We can see that the error has occurred on code at line 516 of main.c
, exactly where we triggered a fault by trying to write to 0xDEADBEEF
.
If the fault occurs in an application (such as tty_test
or sosh
) rather than SOS, the following command should be used, with APP_NAME
substituted for the name of the application you are debugging:
aarch64-linux-gnu-objdump -Dlx projects/aos/apps/APP_NAME/APP_NAME | less
More on objdump
objdump
is a very handy utility for working out exactly what is where in an executable so you can work out what exactly is going wrong.
The two standard incantations for objdump
are:
aarch64-linux-gnu-objdump -dl binary_file | less
and
aarch64-linux-gnu-objdump -lx binary_file | less
The first command (-dl
) disassembles the text segment and shows you all the instructions and at what address. Using this information you can find out things such as:
What does my function compile to?
eg. Am I compiling the right function? Do the instructions make sense?
Why is this instruction crashing?
Look at the fsr for the type of fault, and use objdump to find the exact instruction causing the fault.
NB: In case it's not yet obvious, you will need to get up to speed on your ARM aarch64 assembly and not be afraid to get your hands dirty if you want to minimise the time you spend debugging.
The second objdump command (-lx
) is useful for when addresses appear inside an object file but outside of the text segment. This is especially useful when debugging ELF loading. The -lx
option displays section and symbol information. Further options can be added to dump data segments etc. man objdump
is your friend.
Generating a Stack Trace
Warning: Backtrace only supported for the Linux toolchain
We have provided a function print_backtrace
in SOS which allows you to print a primitive stacktrace from anywhere in SOS. Note that backtrace does not work across random jumps, including from within muslc system calls.
You can call print_backtrace()
from within any arbitrary place in your code. Backtrace will traverse the eh_frame
constructed by gcc
during compilation, which provides stack offsets, and print out the addresses of each function in the call stack (as long as gcc can recognise the function call) This can be useful if you need to know what your code is doing.
Consider this example, where print_backtrace
is called in the network_init
function. The output is as follows:
SOS Started!
test_cspace@tests.c:53 Test cspace
libsel4muslcsys: Error attempting syscall 233
run_tests@tests.c:116 Root CSpace test passed!
test_cspace@tests.c:53 Test cspace
run_tests@tests.c:124 Single level cspace test passed!
test_cspace@tests.c:53 Test cspace
libsel4muslcsys: Error attempting syscall 233
run_tests@tests.c:137 Double level cspace test passed!
run_tests@tests.c:141 DMA test passed!
Network init
Backtracing stack PCs:
0x408fa4
0x409288
0x407a24
0x418fec
Usage of the backtrace script is shown below.
$ ./projects/aos/backtrace.py build/projects/aos/sos/sos
Enter whitespace separated addresses, <Ctrl>+D to process and exit
>
> 0x408fa4
0x409288
0x407a24
0x418fec
> > > > Processing....
--------------------------------------------------------------------------------
0 408fa4 in print_backtrace
408f9c: 52800141 mov w1, #0xa // #10
408fa0: 94000868 bl 40b140
408fa4: b9006ba0 str w0, [x29,#104]
408fa8: b9406ba0 ldr w0, [x29,#104]
408fac: 7100001f cmp w0, #0x0
--------------------------------------------------------------------------------
1 409288 in network_init
409280: f9000fa0 str x0, [x29,#24]
409284: 97ffff38 bl 408f64
409288: d2a00022 mov x2, #0x10000 // #65536
40928c: d2b92821 mov x1, #0xc9410000 // #3376480256
409290: f9400fa0 ldr x0, [x29,#24]
--------------------------------------------------------------------------------
2 407a24 in main_continued
407a1c: 911f8000 add x0, x0, #0x7e0
407a20: 94000616 bl 409278
407a24: 900004a0 adrp x0, 49b000
407a28: 91112000 add x0, x0, #0x448
407a2c: 94021680 bl 48d42c
--------------------------------------------------------------------------------
3 418fec in utils_run_on_stack
418fe4: aa0203e0 mov x0, x2
418fe8: d63f0020 blr x1
418fec: 9100029f mov sp, x20
418ff0: aa0003e0 mov x0, x0
418ff4: f90027a0 str x0, [x29,#72]