Home > Technology > SystemTap monitoring ptrace activity

SystemTap monitoring ptrace activity

November 25, 2013

I’d like to present a real and hopefully interesting example: how I recently used SystemTap to help with a complex ptrace issue. I was trying to track down an issue which involved multiple processes, and one of those was ptrace-attached to the other. In this situation, it would be perfectly fine to attach a debugger to the controlling process, but since ptrace is exclusive, I could not have attached again to that child process.  Where traditional tools are left out, SystemTap may still help!

For my problem, in a little more detail, the ptracer process was watching for forks in the child. When a fork occurred, it executed a bit of instrumentation in both the original child and the new fork, then allowed both to continue. My bug was that the fork return value in the caller (as seen after the instrumentation) should have been the new PID, but it was always 56. A bit of corruption, it seemed, but always the same 56. Except, even more strangely, on some systems once auditd started, this changed to -1 with errno set to ENOSYS, despite the fork having truly succeeded.

My “aha!” moment was when I realized that on x86_64, SYS_clone happens to be 56, and further the syscall number on entry and its return value on exit both use the same %rax register. The ptrace controller needs to save and restore register state when running instrumentation, so it seemed likely here was the culprit somehow. Now this is something easily traced with SystemTap, even across processes and kernel activity. I only needed to watch fork, ptrace, and the wait4 that gets ptrace signals. Here’s that script I used:

# NB: TASK_COMM_LEN is 16, including '\0', so truncate to 15 chars
global mutator = "test_driver"
global mutatee = "test4_2.dyn_g++"

function trace(str) {
    printf("%16s[%5d] %s\n", execname(), pid(), str)
}

probe syscall.fork {
  if (execname() == mutatee)
    trace(sprintf("-> %s (%s)", name, argstr))
}

probe syscall.fork.return {
  if (execname() == mutatee)
    trace(sprintf("<- %s = %s", name, retstr))
}

global ptrace_watch
probe begin {
  ptrace_watch[7] = 1 # PTRACE_CONT
  ptrace_watch[12] = 1 # PTRACE_GETREGS
  ptrace_watch[13] = 1 # PTRACE_SETREGS
  ptrace_watch[24] = 1 # PTRACE_SYSCALL
}

probe syscall.ptrace {
  if (execname() == mutator && request in ptrace_watch)
    trace(sprintf("%s (%s)", name, argstr))
}

probe syscall.wait4.return {
  if (execname() == mutator)
    trace(sprintf("%s = %s  %s", name, retstr, status_str))
}

This is a relatively simple script, with just a little logic to filter the events down, and then it prints a trace at each point. Even with that filtering, I got more output than I care to flood in a blog post, but here’s the excerpt of the bug in action:

 test4_2.dyn_g++[18304] -> fork ()
     test_driver[18287] wait4 = 18305  WSTOPSIG=SIGSTOP
     test_driver[18287] wait4 = 18304  WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8
     test_driver[18287] ptrace (PTRACE_GETREGS, 18304, data=0x7fff291f0990)
     test_driver[18287] ptrace (PTRACE_CONT, 18304, SIG_0)
 test4_2.dyn_g++[18304] <- fork = 18305
     test_driver[18287] wait4 = 18304  WSTOPSIG=SIGTRAP
     test_driver[18287] ptrace (PTRACE_SETREGS, 18304, data=0x7fff291f09c0)

This showed me that my guess was pretty much on the mark!  I have a pet peeve against blind bug fixes, where some programmers will just take a guess at a problem and make changes, without really confirming that they’ve understood what’s happening.  So it’s gratifying to me to see this real data showing how things went down.  There’s the fork entered and returned in one process, including the proper return value (at least from the kernel perspective).  The PTRACE_EVENT_FORK came during the middle of the fork, so when the ptracer grabbed current registers, %rax was still the entry value. The fork returned the right value to the temporary instrumentation path (that didn’t care about it). When the ptracer went to restore registers to get back to where fork should have returned, it would be restoring the syscall’s entry value — 56.

It’s a tricky situation, made trickier by the fact that I could reproduce the bug on RHEL5 and Fedora, but RHEL6 was unaffected.  But with the same script, I could see that RHEL6 waited until after the internal kernel fork returned to deliver the PTRACE_EVENT_FORK stop, so the PTRACE_GETREGS just happened to be from a more convenient place.  (The behavior difference is due to utrace, which is a whole can of worms intertwined with ptrace internals.)

My plan to fix it was to introduce a PTRACE_SYSCALL call after the event, so the kernel fork would complete but still trap before returning to userspace.  As with tracking down the bug, the proof of a fix requires data too:

 test4_2.dyn_g++[19488] -> fork ()
     test_driver[19471] wait4 = 19488  WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8
     test_driver[19471] ptrace (PTRACE_SYSCALL, 19488, SIG_0)
 test4_2.dyn_g++[19488] <- fork = 19489
     test_driver[19471] wait4 = 19489  WSTOPSIG=SIGSTOP
     test_driver[19471] wait4 = 19488  WSTOPSIG=TRACESYSGOOD
     test_driver[19471] ptrace (PTRACE_GETREGS, 19488, data=0x7fff07d26e90)
     test_driver[19471] ptrace (PTRACE_CONT, 19488, SIG_0)
     test_driver[19471] wait4 = 19488  WSTOPSIG=SIGTRAP
     test_driver[19471] ptrace (PTRACE_SETREGS, 19488, data=0x7fff07d26ec0

That’s a better looking sequence of events, and indeed the fork is finally showing the proper PID even after instrumentation.  It worked on all systems which previously reproduced the bug, and even the unaffected RHEL6 was fine with the added wait.  Fixed and confirmed!

Finally, I should admit, the problem was not so cleanly found and resolved as I might make it sound.  Like many debugging sessions, I went through a lot steps before I’d narrowed it down this far.  I read a lot of both application code and kernel code, and I wrote quite a few one-off SystemTap scripts watching behavior in different areas.  (Perhaps a journal of this process might be interesting some other time…)  So I don’t mean to imply that immediately SystemTap magically shined a light right where I needed to look — but it did make it possible for me to see into an otherwise opaque problem.

About these ads
  1. portl4t
    February 22, 2014 at 2:52 AM

    namespace nk {
    struct Pda {
    int t;
    };
    };

    class Dummy {
    public:
    struct State {
    int a;
    };
    };

    class Ep {

    public:
    Ep(): r(7)
    {
    ds.a = 169;
    pd.t = 200;
    }

    ~Ep(){}

    int mb(int n)
    {
    return n * (n – 1);
    }

    public:
    int r;
    Dummy::State ds;
    nk::Pda pd;
    };

    ///////////////////

    Hi, I probe function Ep::mb with systemtap, and I can get Ep::r like this:
    r = @cast(longlong_arg(1), “Ep”)->r

    but, how can I get Ep::ds.a and Ep::pd.t ? I try to get them like this:
    pd = &@cast(longlong_arg(1), “Ep”)->pd
    t = @cast(pd, “nk::Pda”)->t

    ds = &@cast(longlong_arg(1), “Ep”)->ds
    a = @cast(ds, “Dummy::State”)->a

    They don’t work, I got semantic error : type definition ‘Dummy::State’ not found in xx operator :’@cast’ at xxx.

    • February 22, 2014 at 10:35 AM

      @portl4t – Since your question is not really related to this blog post, I’d rather you emailed the mailing list, systemtap@sourceware.org. Please at least include the version of systemtap, the compiler version and options you use, and a complete example script. My first guess is that you’re running systemtap older than 1.7, which fixed bug 12136.

      Note, you can also chain dereferences like ->ds->a or ->pd->t, which should work even on an older systemtap that doesn’t know how to cast :: types properly. It’s always -> in systemtap scripts, even for members that would use . in C/C++.

  1. No trackbacks yet.
Comments are closed.
Follow

Get every new post delivered to your Inbox.

%d bloggers like this: