Archive
SystemTap monitoring ptrace activity
I’d like to present a real and hopefully interesting example: how I recently used SystemTap to help with a complex ptrace issue. I was trying to track down an issue which involved multiple processes, and one of those was ptrace-attached to the other. In this situation, it would be perfectly fine to attach a debugger to the controlling process, but since ptrace is exclusive, I could not have attached again to that child process. Where traditional tools are left out, SystemTap may still help!
For my problem, in a little more detail, the ptracer process was watching for forks in the child. When a fork occurred, it executed a bit of instrumentation in both the original child and the new fork, then allowed both to continue. My bug was that the fork return value in the caller (as seen after the instrumentation) should have been the new PID, but it was always 56. A bit of corruption, it seemed, but always the same 56. Except, even more strangely, on some systems once auditd started, this changed to -1 with errno
set to ENOSYS
, despite the fork having truly succeeded.
My “aha!” moment was when I realized that on x86_64, SYS_clone
happens to be 56, and further the syscall number on entry and its return value on exit both use the same %rax
register. The ptrace controller needs to save and restore register state when running instrumentation, so it seemed likely here was the culprit somehow. Now this is something easily traced with SystemTap, even across processes and kernel activity. I only needed to watch fork, ptrace, and the wait4 that gets ptrace signals. Here’s that script I used:
# NB: TASK_COMM_LEN is 16, including '\0', so truncate to 15 chars global mutator = "test_driver" global mutatee = "test4_2.dyn_g++" function trace(str) { printf("%16s[%5d] %s\n", execname(), pid(), str) } probe syscall.fork { if (execname() == mutatee) trace(sprintf("-> %s (%s)", name, argstr)) } probe syscall.fork.return { if (execname() == mutatee) trace(sprintf("<- %s = %s", name, retstr)) } global ptrace_watch probe begin { ptrace_watch[7] = 1 # PTRACE_CONT ptrace_watch[12] = 1 # PTRACE_GETREGS ptrace_watch[13] = 1 # PTRACE_SETREGS ptrace_watch[24] = 1 # PTRACE_SYSCALL } probe syscall.ptrace { if (execname() == mutator && request in ptrace_watch) trace(sprintf("%s (%s)", name, argstr)) } probe syscall.wait4.return { if (execname() == mutator) trace(sprintf("%s = %s %s", name, retstr, status_str)) }
This is a relatively simple script, with just a little logic to filter the events down, and then it prints a trace at each point. Even with that filtering, I got more output than I care to flood in a blog post, but here’s the excerpt of the bug in action:
test4_2.dyn_g++[18304] -> fork () test_driver[18287] wait4 = 18305 WSTOPSIG=SIGSTOP test_driver[18287] wait4 = 18304 WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8 test_driver[18287] ptrace (PTRACE_GETREGS, 18304, data=0x7fff291f0990) test_driver[18287] ptrace (PTRACE_CONT, 18304, SIG_0) test4_2.dyn_g++[18304] <- fork = 18305 test_driver[18287] wait4 = 18304 WSTOPSIG=SIGTRAP test_driver[18287] ptrace (PTRACE_SETREGS, 18304, data=0x7fff291f09c0)
This showed me that my guess was pretty much on the mark! I have a pet peeve against blind bug fixes, where some programmers will just take a guess at a problem and make changes, without really confirming that they’ve understood what’s happening. So it’s gratifying to me to see this real data showing how things went down. There’s the fork entered and returned in one process, including the proper return value (at least from the kernel perspective). The PTRACE_EVENT_FORK
came during the middle of the fork, so when the ptracer grabbed current registers, %rax
was still the entry value. The fork returned the right value to the temporary instrumentation path (that didn’t care about it). When the ptracer went to restore registers to get back to where fork should have returned, it would be restoring the syscall’s entry value — 56.
It’s a tricky situation, made trickier by the fact that I could reproduce the bug on RHEL5 and Fedora, but RHEL6 was unaffected. But with the same script, I could see that RHEL6 waited until after the internal kernel fork returned to deliver the PTRACE_EVENT_FORK
stop, so the PTRACE_GETREGS
just happened to be from a more convenient place. (The behavior difference is due to utrace, which is a whole can of worms intertwined with ptrace internals.)
My plan to fix it was to introduce a PTRACE_SYSCALL
call after the event, so the kernel fork would complete but still trap before returning to userspace. As with tracking down the bug, the proof of a fix requires data too:
test4_2.dyn_g++[19488] -> fork () test_driver[19471] wait4 = 19488 WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8 test_driver[19471] ptrace (PTRACE_SYSCALL, 19488, SIG_0) test4_2.dyn_g++[19488] <- fork = 19489 test_driver[19471] wait4 = 19489 WSTOPSIG=SIGSTOP test_driver[19471] wait4 = 19488 WSTOPSIG=TRACESYSGOOD test_driver[19471] ptrace (PTRACE_GETREGS, 19488, data=0x7fff07d26e90) test_driver[19471] ptrace (PTRACE_CONT, 19488, SIG_0) test_driver[19471] wait4 = 19488 WSTOPSIG=SIGTRAP test_driver[19471] ptrace (PTRACE_SETREGS, 19488, data=0x7fff07d26ec0
That’s a better looking sequence of events, and indeed the fork is finally showing the proper PID even after instrumentation. It worked on all systems which previously reproduced the bug, and even the unaffected RHEL6 was fine with the added wait. Fixed and confirmed!
Finally, I should admit, the problem was not so cleanly found and resolved as I might make it sound. Like many debugging sessions, I went through a lot steps before I’d narrowed it down this far. I read a lot of both application code and kernel code, and I wrote quite a few one-off SystemTap scripts watching behavior in different areas. (Perhaps a journal of this process might be interesting some other time…) So I don’t mean to imply that immediately SystemTap magically shined a light right where I needed to look — but it did make it possible for me to see into an otherwise opaque problem.