Posts Tagged ‘SystemTap’

SystemTap monitoring ptrace activity

November 25, 2013 2 comments

I’d like to present a real and hopefully interesting example: how I recently used SystemTap to help with a complex ptrace issue. I was trying to track down an issue which involved multiple processes, and one of those was ptrace-attached to the other. In this situation, it would be perfectly fine to attach a debugger to the controlling process, but since ptrace is exclusive, I could not have attached again to that child process.  Where traditional tools are left out, SystemTap may still help!

For my problem, in a little more detail, the ptracer process was watching for forks in the child. When a fork occurred, it executed a bit of instrumentation in both the original child and the new fork, then allowed both to continue. My bug was that the fork return value in the caller (as seen after the instrumentation) should have been the new PID, but it was always 56. A bit of corruption, it seemed, but always the same 56. Except, even more strangely, on some systems once auditd started, this changed to -1 with errno set to ENOSYS, despite the fork having truly succeeded.

My “aha!” moment was when I realized that on x86_64, SYS_clone happens to be 56, and further the syscall number on entry and its return value on exit both use the same %rax register. The ptrace controller needs to save and restore register state when running instrumentation, so it seemed likely here was the culprit somehow. Now this is something easily traced with SystemTap, even across processes and kernel activity. I only needed to watch fork, ptrace, and the wait4 that gets ptrace signals. Here’s that script I used:

# NB: TASK_COMM_LEN is 16, including '\0', so truncate to 15 chars
global mutator = "test_driver"
global mutatee = "test4_2.dyn_g++"

function trace(str) {
    printf("%16s[%5d] %s\n", execname(), pid(), str)

probe syscall.fork {
  if (execname() == mutatee)
    trace(sprintf("-> %s (%s)", name, argstr))

probe syscall.fork.return {
  if (execname() == mutatee)
    trace(sprintf("<- %s = %s", name, retstr))

global ptrace_watch
probe begin {
  ptrace_watch[7] = 1 # PTRACE_CONT
  ptrace_watch[12] = 1 # PTRACE_GETREGS
  ptrace_watch[13] = 1 # PTRACE_SETREGS
  ptrace_watch[24] = 1 # PTRACE_SYSCALL

probe syscall.ptrace {
  if (execname() == mutator && request in ptrace_watch)
    trace(sprintf("%s (%s)", name, argstr))

probe syscall.wait4.return {
  if (execname() == mutator)
    trace(sprintf("%s = %s  %s", name, retstr, status_str))

This is a relatively simple script, with just a little logic to filter the events down, and then it prints a trace at each point. Even with that filtering, I got more output than I care to flood in a blog post, but here’s the excerpt of the bug in action:

 test4_2.dyn_g++[18304] -> fork ()
     test_driver[18287] wait4 = 18305  WSTOPSIG=SIGSTOP
     test_driver[18287] wait4 = 18304  WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8
     test_driver[18287] ptrace (PTRACE_GETREGS, 18304, data=0x7fff291f0990)
     test_driver[18287] ptrace (PTRACE_CONT, 18304, SIG_0)
 test4_2.dyn_g++[18304] <- fork = 18305
     test_driver[18287] wait4 = 18304  WSTOPSIG=SIGTRAP
     test_driver[18287] ptrace (PTRACE_SETREGS, 18304, data=0x7fff291f09c0)

This showed me that my guess was pretty much on the mark!  I have a pet peeve against blind bug fixes, where some programmers will just take a guess at a problem and make changes, without really confirming that they’ve understood what’s happening.  So it’s gratifying to me to see this real data showing how things went down.  There’s the fork entered and returned in one process, including the proper return value (at least from the kernel perspective).  The PTRACE_EVENT_FORK came during the middle of the fork, so when the ptracer grabbed current registers, %rax was still the entry value. The fork returned the right value to the temporary instrumentation path (that didn’t care about it). When the ptracer went to restore registers to get back to where fork should have returned, it would be restoring the syscall’s entry value — 56.

It’s a tricky situation, made trickier by the fact that I could reproduce the bug on RHEL5 and Fedora, but RHEL6 was unaffected.  But with the same script, I could see that RHEL6 waited until after the internal kernel fork returned to deliver the PTRACE_EVENT_FORK stop, so the PTRACE_GETREGS just happened to be from a more convenient place.  (The behavior difference is due to utrace, which is a whole can of worms intertwined with ptrace internals.)

My plan to fix it was to introduce a PTRACE_SYSCALL call after the event, so the kernel fork would complete but still trap before returning to userspace.  As with tracking down the bug, the proof of a fix requires data too:

 test4_2.dyn_g++[19488] -> fork ()
     test_driver[19471] wait4 = 19488  WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8
     test_driver[19471] ptrace (PTRACE_SYSCALL, 19488, SIG_0)
 test4_2.dyn_g++[19488] <- fork = 19489
     test_driver[19471] wait4 = 19489  WSTOPSIG=SIGSTOP
     test_driver[19471] wait4 = 19488  WSTOPSIG=TRACESYSGOOD
     test_driver[19471] ptrace (PTRACE_GETREGS, 19488, data=0x7fff07d26e90)
     test_driver[19471] ptrace (PTRACE_CONT, 19488, SIG_0)
     test_driver[19471] wait4 = 19488  WSTOPSIG=SIGTRAP
     test_driver[19471] ptrace (PTRACE_SETREGS, 19488, data=0x7fff07d26ec0

That’s a better looking sequence of events, and indeed the fork is finally showing the proper PID even after instrumentation.  It worked on all systems which previously reproduced the bug, and even the unaffected RHEL6 was fine with the added wait.  Fixed and confirmed!

Finally, I should admit, the problem was not so cleanly found and resolved as I might make it sound.  Like many debugging sessions, I went through a lot steps before I’d narrowed it down this far.  I read a lot of both application code and kernel code, and I wrote quite a few one-off SystemTap scripts watching behavior in different areas.  (Perhaps a journal of this process might be interesting some other time…)  So I don’t mean to imply that immediately SystemTap magically shined a light right where I needed to look — but it did make it possible for me to see into an otherwise opaque problem.


Hacking Linux Filenames

April 8, 2009 2 comments

I recently read an LWN article on David A. Wheeler’s essay, “Fixing Unix/Linux/POSIX Filenames.”  The gist is that he thinks the filename rules are too permissive — we have ‘/’ as the path separator, and a raw 0 terminates the path, but anything else is fair game.  On the surface, this has a certain beautiful simplicity to it.  However, there are characters that have special meaning depending on the context, so almost any code that actually tries to interpret a filename will have to add a lot of complexity to be robust.  The essay delves into many ways that things can go wrong.

Filenames have been this way in for a long time though, and I don’t expect that this will change officially anytime soon.  Still, my day job now is developing SystemTap, and this sort of problem is one of many sorts that SystemTap can address.  Here’s a script to show how a system administrator could patch the kernel with their own addendum to the filename rules:

#!/usr/bin/stap -g
# badname.stp
# Prevent the creation of files with undesirable names.

# return non-zero if the filename should be blocked
function filter:long (name:string)
  return euid() && isinstr(name, "XXX")

global squash_inode_permission
probe kernel.function("may_create@fs/namei.c")
  # screen out the conditions which may_create will fail anyway
  if ($child->d_inode || $dir->i_flags & 16) next

  # check that the new file meets our naming rules
  if (filter(kernel_string($child->d_name->name)))
    squash_inode_permission[tid()] = 1
probe kernel.function("inode_permission@fs/namei.c").return !,
  if (!$return && squash_inode_permission[tid()])
    $return = -13 # -EACCES (Permission denied)
  delete squash_inode_permission[tid()]

The script starts by defining a filter function.  It first check whether the effective user ID is non-zero, so the root user can bypass the filter.  Then, for the prude admins out there, I’ve chosen to block filenames that contain the string “XXX”.  I intentionally kept this part small for this example, but you could easily write a function covering all of the new rules that Wheeler suggests.

After that is a probe on the may_create function, which is what the kernel calls to validate permissions for new files.  We can call our filtering function from here to see if the filename is OK, but since may_create is an inline, we don’t have a direct way to influence its result.  The last thing may_create does though is copy the result of inode_permission (or permission in earlier kernels), which we can override.  So, we save the filtering decision in a global, and then in a return probe on inode_permission, we can change the successful $return code to our own error value.  Now, any attempt to create a file that doesn’t pass our rules will get an error of “Permission denied”.

This sort of script is really just a band-aid, and it doesn’t do anything to deal with files that already have “bad” names.  Still, I hope this is an interesting example of how easily one can modify kernel behavior with SystemTap.  This script can be a starting point to define and try out your own filename rules, and changes can be reloaded and tested without ever having to reboot.  Once your policy has been decided, you can configure the script to load as soon as the system boots, so you’re always running with your improved filename rules, even across kernel upgrades.

It’s powerful stuff, but don’t let it get to your head… 🙂

%d bloggers like this: