Probably the worst bug to debug
Posted: Tue Dec 23, 2014 9:27 pm
For the past 2 days, I was debugging my code because it caused a very strange segmentation fault, strange because it did not happen while my code was executing, but it was happening inside exit().
Firstly, this rules out some of the most common cases:
1. Accessing an invalid memory region / NULL pointer / already freed region (then I would've got the segfault pretty earlier)
2. Double free (same as above)
3. Accessing read-only-memory (e.g. Code Segment -- no, same as above)
etc..
Since the program is crashing during exit, this probably indicates that I might've caused a stack overrun, by doing an operation to a buffer created on stack, such that the size specified during the operation is greater than what actually the buffer is. For this to happen the crash must've happened very early in the program. Out of ideas, I went to #osdev, where sortie asked me if calling "exit(0)" causes a segfault, -- and it did, which means that this isn't a problem with the stack, since explicitly calling exit shouldn't require the return address to be on stack, but rather my 'exit' is somehow crashing. I searched on google for possible cases, one of them being memory and heap corruption. I ran a bunch of tools valgrind, gdb, electric fence, GCC's built-in address sanitizer, etc. and none of them actually told me what actually was causing the segfault, bringing me to a point where I considered that my libc had bugs. GDB was interesting though, as the program would work fine in most cases.
Then I updated my glibc, same result, upgraded my compiler, same result, and I went to point of upgrading my entire distro, well uh, same result.
Since a lot of people on #osdev, ##c, and #glibc suggested me to re-run my code under valgrind, I finally decided to pay serious attention to it's output, and it pointed me to an internal glibc function: _IO_flush_all_lockp, under __run_exit_handlers, which is called by exit @ genops.c. Unfortunately, I couldn't find much info about it, except another person having the same problem, but due to using uninitialized pointers.
It was 2 days already, I finally decided to take a look at the glib sources, and found my "_IO_flush_all_lockp", looking a what it was doing which seemed to flush all open streams, and then locking them, I realised my fault.
Now that's pretty clean, debuggers won't suspect a thing, since I'm just freeing memory someone (in this case the kernel perhaps), allocated to me, GCC wouldn't warn me, since I could always allocate a "FILE*" pointer myself, for some unusual reason. Really hard to detect if your source files are long, worse even, it was a typo. And poor libc is trying to access that, since the stream is still open, and BOOM.
The most annoying part of this bug was that it'd randomly happen, for example making small "so-called fixes" in the program, would make it look like the bug disappeared, but after a few runs, it'd appear again.
/me checks again to see if the bug is actually solved.
Firstly, this rules out some of the most common cases:
1. Accessing an invalid memory region / NULL pointer / already freed region (then I would've got the segfault pretty earlier)
2. Double free (same as above)
3. Accessing read-only-memory (e.g. Code Segment -- no, same as above)
etc..
Since the program is crashing during exit, this probably indicates that I might've caused a stack overrun, by doing an operation to a buffer created on stack, such that the size specified during the operation is greater than what actually the buffer is. For this to happen the crash must've happened very early in the program. Out of ideas, I went to #osdev, where sortie asked me if calling "exit(0)" causes a segfault, -- and it did, which means that this isn't a problem with the stack, since explicitly calling exit shouldn't require the return address to be on stack, but rather my 'exit' is somehow crashing. I searched on google for possible cases, one of them being memory and heap corruption. I ran a bunch of tools valgrind, gdb, electric fence, GCC's built-in address sanitizer, etc. and none of them actually told me what actually was causing the segfault, bringing me to a point where I considered that my libc had bugs. GDB was interesting though, as the program would work fine in most cases.
Then I updated my glibc, same result, upgraded my compiler, same result, and I went to point of upgrading my entire distro, well uh, same result.
Since a lot of people on #osdev, ##c, and #glibc suggested me to re-run my code under valgrind, I finally decided to pay serious attention to it's output, and it pointed me to an internal glibc function: _IO_flush_all_lockp, under __run_exit_handlers, which is called by exit @ genops.c. Unfortunately, I couldn't find much info about it, except another person having the same problem, but due to using uninitialized pointers.
It was 2 days already, I finally decided to take a look at the glib sources, and found my "_IO_flush_all_lockp", looking a what it was doing which seemed to flush all open streams, and then locking them, I realised my fault.
Code: Select all
FILE* fp = fopen("filename.ext");
....
free(fp); /** error: Must be fclose **/
Now that's pretty clean, debuggers won't suspect a thing, since I'm just freeing memory someone (in this case the kernel perhaps), allocated to me, GCC wouldn't warn me, since I could always allocate a "FILE*" pointer myself, for some unusual reason. Really hard to detect if your source files are long, worse even, it was a typo. And poor libc is trying to access that, since the stream is still open, and BOOM.
The most annoying part of this bug was that it'd randomly happen, for example making small "so-called fixes" in the program, would make it look like the bug disappeared, but after a few runs, it'd appear again.
/me checks again to see if the bug is actually solved.