SystemTap monitoring ptrace activity
I’d like to present a real and hopefully interesting example: how I recently used SystemTap to help with a complex ptrace issue. I was trying to track down a bug involving multiple processes, where one was ptrace-attached to the other. In this situation it would be perfectly fine to attach a debugger to the controlling process, but since ptrace attachment is exclusive, I could not also attach to the traced child. Where traditional tools are locked out, SystemTap may still help!
For my problem, in a little more detail: the ptracer process was watching for forks in the child. When a fork occurred, it executed a bit of instrumentation in both the original child and the new fork, then allowed both to continue. My bug was that the fork return value in the caller (as seen after the instrumentation) should have been the new PID, but it was always 56. A bit of corruption, it seemed, but always the same 56. Even more strangely, on some systems once auditd started, this changed to -1 with errno set to ENOSYS, despite the fork having truly succeeded.
My “aha!” moment came when I realized that on x86_64, SYS_clone happens to be 56, and furthermore that the syscall number on entry and its return value on exit both use the same %rax register. The ptrace controller needs to save and restore register state when running instrumentation, so it seemed likely the culprit was somewhere in there. This is something easily traced with SystemTap, even across processes and kernel activity. I only needed to watch fork, ptrace, and the wait4 that receives ptrace events. Here’s the script I used:
# NB: TASK_COMM_LEN is 16, including '\0', so truncate to 15 chars
global mutator = "test_driver"
global mutatee = "test4_2.dyn_g++"

function trace(str) {
  printf("%16s[%5d] %s\n", execname(), pid(), str)
}

probe syscall.fork {
  if (execname() == mutatee)
    trace(sprintf("-> %s (%s)", name, argstr))
}

probe syscall.fork.return {
  if (execname() == mutatee)
    trace(sprintf("<- %s = %s", name, retstr))
}

global ptrace_watch
probe begin {
  ptrace_watch[7] = 1   # PTRACE_CONT
  ptrace_watch[12] = 1  # PTRACE_GETREGS
  ptrace_watch[13] = 1  # PTRACE_SETREGS
  ptrace_watch[24] = 1  # PTRACE_SYSCALL
}

probe syscall.ptrace {
  if (execname() == mutator && request in ptrace_watch)
    trace(sprintf("%s (%s)", name, argstr))
}

probe syscall.wait4.return {
  if (execname() == mutator)
    trace(sprintf("%s = %s %s", name, retstr, status_str))
}
This is a relatively simple script: a little logic to filter the events down, and a printed trace at each point. Even with that filtering, I got more output than I care to flood a blog post with, but here’s the excerpt showing the bug in action:
 test4_2.dyn_g++[18304] -> fork ()
     test_driver[18287] wait4 = 18305 WSTOPSIG=SIGSTOP
     test_driver[18287] wait4 = 18304 WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8
     test_driver[18287] ptrace (PTRACE_GETREGS, 18304, data=0x7fff291f0990)
     test_driver[18287] ptrace (PTRACE_CONT, 18304, SIG_0)
 test4_2.dyn_g++[18304] <- fork = 18305
     test_driver[18287] wait4 = 18304 WSTOPSIG=SIGTRAP
     test_driver[18287] ptrace (PTRACE_SETREGS, 18304, data=0x7fff291f09c0)
This showed me that my guess was pretty much on the mark! I have a pet peeve about blind bug fixes, where programmers just take a guess at a problem and make changes without really confirming that they’ve understood what’s happening. So it’s gratifying to see real data showing how things went down. There’s the fork entered and returned in one process, including the proper return value (at least from the kernel’s perspective). The PTRACE_EVENT_FORK stop came in the middle of the fork, so when the ptracer grabbed the current registers, %rax still held the syscall entry value. The fork returned the right value into the temporary instrumentation path (which didn’t care about it). But when the ptracer restored registers to get back to where fork should have returned, it restored the syscall’s entry value: 56.
It’s a tricky situation, made trickier by the fact that I could reproduce the bug on RHEL5 and Fedora, but RHEL6 was unaffected. With the same script, I could see that RHEL6 waited until after the internal kernel fork returned to deliver the PTRACE_EVENT_FORK stop, so the PTRACE_GETREGS just happened to read from a more convenient place. (The behavior difference is due to utrace, which is a whole can of worms intertwined with ptrace internals.)
My plan to fix it was to issue a PTRACE_SYSCALL after the event, so the kernel fork would complete but still trap before returning to userspace. As with tracking down the bug, proving the fix requires data too:
 test4_2.dyn_g++[19488] -> fork ()
     test_driver[19471] wait4 = 19488 WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8
     test_driver[19471] ptrace (PTRACE_SYSCALL, 19488, SIG_0)
 test4_2.dyn_g++[19488] <- fork = 19489
     test_driver[19471] wait4 = 19489 WSTOPSIG=SIGSTOP
     test_driver[19471] wait4 = 19488 WSTOPSIG=TRACESYSGOOD
     test_driver[19471] ptrace (PTRACE_GETREGS, 19488, data=0x7fff07d26e90)
     test_driver[19471] ptrace (PTRACE_CONT, 19488, SIG_0)
     test_driver[19471] wait4 = 19488 WSTOPSIG=SIGTRAP
     test_driver[19471] ptrace (PTRACE_SETREGS, 19488, data=0x7fff07d26ec0)
That’s a better-looking sequence of events, and indeed the fork finally shows the proper PID even after instrumentation. The fix worked on all systems that previously reproduced the bug, and even the unaffected RHEL6 was fine with the added wait. Fixed and confirmed!
Finally, I should admit that the problem was not found and resolved quite as cleanly as I might make it sound. Like many debugging sessions, I went through a lot of steps before I’d narrowed it down this far. I read a lot of both application code and kernel code, and I wrote quite a few one-off SystemTap scripts watching behavior in different areas. (Perhaps a journal of that process might be interesting some other time…) So I don’t mean to imply that SystemTap immediately shined a magical light right where I needed to look, but it did make it possible for me to see into an otherwise opaque problem.
Defining your own Git commands
When you use a command prompt long enough, you’ll find yourself repeating some common command sequences. If you value your time, you’ll probably start saving shell aliases and even full shell scripts to simplify these things.
The same is true with using Git: you’ll probably find yourself repeating some long commands, and wish for an easier way. You may also just have some tricky Git recipes that you want to remember. Well sure enough, you can write Git aliases too, as well as add your own scripts as git subcommands. The Git wiki has a pretty good page describing aliases, so I’ll let you read that, and there are examples too. Here are a few more examples that I felt are worth sharing.
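For reference, each alias below is an entry in the [alias] section of your ~/.gitconfig; you can edit that file directly, or add an entry from the shell (the st alias here is only an illustration):

$ git config --global alias.st status

That command writes st = status into the [alias] section for you.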
up = remote update
This first one is just pure laziness. A git remote update is the same as git fetch for all configured remotes, and now git up can do the same with less typing.
new = log ..@{upstream}
pend = log @{upstream}..
These show how the current branch differs from its upstream tracking branch (often origin/master, but it could be anything). My version of git new is different from the example on the Git wiki, instead showing what commits you’ve fetched but not yet pulled/merged. Then git pend is the opposite, showing what commits are pending to go out in a push.
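A typical round-trip with these might look like the following (a usage sketch, assuming the current branch has an upstream configured):

git fetch    # update remote-tracking branches without touching your work
git new      # commits now upstream that you haven't merged yet
git pend     # local commits that a push would send out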
log1 = log --abbrev-commit --pretty=oneline
logf = log --pretty=fuller
These are just shortcuts for formatted logs: git log1 for one line per commit, and git logf for the fuller format that includes committer info (instead of just author info). The log1 options could also be condensed down to --oneline these days, but that wasn’t available when I first wrote this alias.
lol = log --graph --decorate --pretty=oneline --abbrev-commit
lola = log --graph --decorate --pretty=oneline --abbrev-commit --all
More logging tricks, and for these I’ll give credit to the blog where I found them. These add ASCII-art graph lines to help visualize merges, and they also label tags and branch heads.
show-patch = format-patch --stdout -1
This works like git show, dumping the most recent commit to stdout, but with format-patch it uses email-style headers. That makes it easier to apply the patch elsewhere with git am, preserving the commit message, author, and date. It also includes binary diffs by default.
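For example, to carry the latest commit over to another clone of the repository (a sketch; the patch file name and paths are arbitrary):

git show-patch > /tmp/fix.patch     # dump HEAD as an email-style patch
cd ../other-clone
git am /tmp/fix.patch               # reapply with message, author, and date intact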
make = !test -d \"$(git rev-parse --show-toplevel)\" && make -C \"$_\"
This runs make from the top of the git directory, no matter what your current working directory is, so you can just git make anywhere. The test -d beforehand is a precaution to make sure you’re really within a git path; otherwise you’d get both a git error and make complaining about an empty -C argument.
Once you start getting into complicated “!...” sequences, it may make sense to break one out into its own file. As long as you name it starting with “git-”, like git-unique-abbrev in my last post, and save it somewhere in your PATH, you can call it as you would any other git command, like git unique-abbrev. Another good example is this git merge-ff script on StackOverflow, which lets you safely fast-forward any branch without affecting your current work tree.
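As a minimal sketch of the pattern, here’s a hypothetical git-hello; nothing about it is special beyond the git- prefix and being executable on your PATH:

#!/bin/bash
# ~/bin/git-hello: run as "git hello" from anywhere inside a repository
top="$(git rev-parse --show-toplevel)" || exit
echo "hello from $top"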
The last one I’ll leave you with is back to shell aliases:
alias gcd='test -d "$(git rev-parse --show-toplevel)" && cd "$_"'
This changes your current directory to the top of the git tree, using the same mechanism as I did for git make. It has to be a shell alias, not a git command or a script, because it’s changing the working directory of the shell itself. Don’t forget cd - to jump back, too.
How short can Git abbreviate?
How short can a Git hash be abbreviated? A full SHA-1 hash is 160 bits, 40 hex characters, which is a bit long to talk about and retype when you’re not just copy-pasting. Therefore most people will use just the first few characters to refer to commits, but how safe is this?
I know of a couple of posts where Linus Torvalds laments Git’s default hash abbreviation of 7 characters. That gives you 28 unique bits, almost 270 million possibilities, which seems like it should be plenty. Even the Linux kernel, one of the biggest repositories I know of, has fewer than 4 million objects. However, since hash functions like SHA-1 have an approximately random distribution, you easily can and will run into the birthday problem at this level.
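To put rough numbers on that: the expected number of colliding pairs is about n²/(2m), the same square approximation used again below, and for 4 million objects at 7 hex characters (m = 16^7) it is already enormous:

$ awk 'BEGIN { n = 4e6; m = 16^7; printf "expected colliding pairs: %.0f\n", n*n / (2*m) }'
expected colliding pairs: 29802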
As Linus noted in that second link, Git will now make sure that the abbreviated hashes it prints are currently long enough to be unique, but that’s no guarantee against future collisions. He recommends kernel developers use 12 characters, which got me curious how to evaluate how much headroom this leaves for any given repository. One way to get a basic idea is to ask Git to abbreviate all commit objects as much as possible. For example, on linux.git:
$ git rev-list --all --abbrev=0 --abbrev-commit | wc -L
11
To see which commits require 11-character abbreviations:
$ git rev-list --all --abbrev=0 --abbrev-commit | grep -E '.{11,}'
8b82547e33e
3ee5014185b
21949f00a02
And indeed, if you try a command like git show on any of those with just 10 characters, it will complain that the name is ambiguous. Notice that none of those three are near each other, though, which means each collision must have been with some other object type, like a tree or blob. In fact, git log is happy with 10 characters on these, as it only has to disambiguate among revisions.
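You can see the difference for yourself (a sketch; the exact error wording varies by Git version):

$ git show 8b82547e33    # any object type can match, so 10 characters fail
error: short SHA1 8b82547e33 is ambiguous
$ git log -1 8b82547e33  # only revisions are candidates here, so this works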
How about a histogram of commit abbreviation lengths:
$ git rev-list --all --abbrev=0 --abbrev-commit | awk '{ a[length] += 1 } END { for (len in a) print len, a[len] }'
5 1771
6 286066
7 106897
8 7899
9 494
10 27
11 3
In fact, most commits here are just fine with that 7-character abbreviation, and a few even manage with 5, but there’s a tail of commits requiring up to 11. So 12 does seem like a reasonable suggestion to leave some headroom, though I don’t think it’s quite the “later generations” headroom that Linus wanted.
This is all fine for commit objects, but I didn’t find a way to make Git print this minimized abbreviation for all object types. So I wrote a quick script to run through the various lengths:
#!/bin/bash
# git-unique-abbrev
OBJECTS="$(mktemp)"
git rev-list --all --objects | cut -c1-40 | sort >"$OBJECTS"
printf "%d objects\n" $(wc -l <"$OBJECTS")
for abbrev in $(seq 4 40); do
  DUPES="$(mktemp)"
  uniq -D -w $abbrev <"$OBJECTS" >"$DUPES"
  count=$(wc -l <"$DUPES")
  acount=$(uniq -w $abbrev <"$DUPES" | wc -l)
  printf "%2d: %d / %d\n" $abbrev $count $acount
  test $count -eq 0 && cat "$OBJECTS"
  mv "$DUPES" "$OBJECTS"
  test $count -eq 0 && break
done
rm -f "$OBJECTS"
On linux.git:
$ git-unique-abbrev
3253824 objects
 4: 3253824 / 65536
 5: 3107326 / 854621
 6: 573758 / 277590
 7: 39577 / 19748
 8: 2609 / 1304
 9: 160 / 80
10: 12 / 6
11: 2 / 1
12: 0 / 0
d597639e2036f04f0226761e2d818b31f2db7820
d597639e203a100156501df8a0756fd09573e2de
Each line reports how many objects are ambiguous at that abbreviation length and how many such prefixes there are. The minimum abbreviation Git allows is 4, which is totally saturated in this repository: every object is ambiguous, and all 65536 possible prefixes occur. The default 7 does disambiguate most objects, but 1.2% still remain. At 8 characters, with 2609 objects sharing 1304 prefixes, we can infer there’s even still a triplicate around. We really do require 12 characters now to disambiguate all objects. The last two lines are the final pair of duplicates, and git cat-file -t will tell you that both are tree objects.
For a probabilistic view, we can use the square approximation from Wikipedia: p(n; m) ≈ n²/(2m), where n is the number of random values and m is the number of possible values; for c hex characters we should use m = 16^c. For n = 3253824, this gives approximately a 30% chance of a collision at 11 characters, and only 1.9% at 12. So it’s not all that strange to have gotten our d597639e203 result for 11 after all.
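Those percentages are easy to check from the shell (a quick awk sketch of the same approximation):

$ awk 'BEGIN { n = 3253824; for (c = 11; c <= 12; c++) printf "%d chars: %.1f%%\n", c, 100 * n*n / (2 * 16^c) }'
11 chars: 30.1%
12 chars: 1.9%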
What’s the takeaway from all this? If you are a kernel developer, or use a Git repository of similar size, you should definitely set core.abbrev to at least 12, and maybe more to keep some real headroom. For smaller projects, you can try some of the commands above to see where you stand, but it may be prudent for everyone to get used to using longer hashes everywhere. Finally, when you reference a Git hash for posterity, e.g. in another commit message, I’d recommend always using the full 40-character value.
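Setting that is a one-liner:

$ git config --global core.abbrev 12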
Boot Time versus LVM Snapshots
Linux tip: LVM snapshots are a handy feature, but they will destroy your boot time if they diverge very far.
I do a lot of development and testing in virtual machines. Sometimes if I want to try something disruptive, I’ll do it in a clone of some base image. Using LVM snapshots makes it easy and quick to create the new image.
But recently I noticed my workstation was taking abnormally long to boot. The command “systemd-analyze blame” told me that fedora-storage-init.service alone was taking over two minutes. While I imagine this might happen occasionally with fsck, it certainly shouldn’t take so long every time.
I finally found that the culprit was the LVM vgchange command which initializes the volumes; apparently this can take quite a while with snapshots, since it has to enumerate the extents. I had a few old snapshotted machines I had used for testing, and the base images they originated from had since diverged quite a bit from simple update activity.
I didn’t need those tests anymore, so I deleted them and rebooted, and this time I booted in only 15 seconds, which even includes a few seconds to enter the encryption passphrase! That’s an order of magnitude faster; or rather, now I know the LVM snapshots were causing an order-of-magnitude slowdown.
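For the record, the cleanup itself is simple: lvs lists each snapshot alongside its origin volume and how full it is, and lvremove deletes one (the volume names here are just examples):

$ sudo lvs                          # snapshots show their Origin and usage
$ sudo lvremove /dev/vg0/test-snap  # remove a snapshot you no longer need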
Lesson learned, don’t keep LVM snapshots around for a long time. It’s too bad, because this capability is otherwise very useful…
pdiff: Comparing process output
The diff command is an essential piece of any programmer’s or administrator’s toolbox for showing the difference between two files; if not that tool exactly, then one of the many like it. But sometimes I want to compare the output of two commands, often really the same command operating on different inputs. You could save the output of each to separate files and diff those, or in a shell like bash you can feed process substitutions, “<(cmd)”, directly to diff. For example, when I was recently dealing with an elfutils bug with prelinked binaries, I ran a command like this:
diff <(eu-readelf -hlS foo) <(eu-readelf -hlS foo.prelink)
That works just fine, though it’s a little redundant, and it gets really bad as the command grows more complex. And if you need to adjust the command, you have to do it in both places, or the diff falls apart. So after getting annoyed by this, I decided to scratch the itch and write this “pdiff” script:
#!/bin/bash

CMD=
CMDARG=
DIFF=

OPTS=$(getopt -o d:gvc:I: -n pdiff -- "$@") || exit
eval set -- "$OPTS"
while true; do
  case "$1" in
    -d) DIFF="$2"; shift 2;;
    -g) DIFF=gvimdiff; shift;;
    -v) DIFF=vimdiff; shift;;
    -c) CMD="$2"; shift 2;;
    -I) CMDARG="$2"; shift 2;;
    --) shift; break;;
    *) echo "Internal error!" >&2; exit 2;;
  esac
done

function run() {
  local cmd
  test -n "$CMDARG" && cmd="${CMD/$CMDARG/$@}" || cmd="$CMD $@"
  echo "# $cmd"
  eval $cmd
}

case $# in
  2) exec ${DIFF:=diff -u -p} <(run $1) <(run $2);;
  3) exec ${DIFF:=diff3} <(run $1) <(run $2) <(run $3);;
  4) test -n "$DIFF" && exec $DIFF <(run $1) <(run $2) <(run $3) <(run $4);;&
  *) echo "Unable to diff $# args" >&2; exit 1;;
esac
It even has option processing: -v opens the diff in Vim, -g in gVim, and -d specifies a custom diff command. Then -c gives the shared command, and -I lets you specify a replacement string within the command (rather than placing arguments at the end). Any remaining arguments are fed to separate invocations of the command for comparison. That’s well beyond the one-liner I started with, but now I can reuse it to make future comparisons easier. The first command I gave is now simply:
pdiff -c 'eu-readelf -hlS' foo foo.prelink
That command can now get more complex, with compound pipelines or whatever else I need, without being such a pain to compose.
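The -I option earns its keep when the varying part isn’t at the end of the command. A hypothetical example, using '{}' as the placeholder string:

pdiff -c 'eu-readelf -x .text {} | head -50' -I '{}' foo foo.prelink

Each argument is substituted for '{}' in turn, so this compares the start of the .text hexdump of the two binaries.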
I did a little searching afterward and found that there are already other tools called pdiff, like here and here, so if I ever want to publish this more formally I should find a new name. But for now, it meets my needs, and maybe it can be useful for someone else too.
Hacking Linux Filenames
I recently read an LWN article on David A. Wheeler’s essay, “Fixing Unix/Linux/POSIX Filenames.” The gist is that he thinks the filename rules are too permissive: ‘/’ is the path separator, a raw 0 terminates the path, and anything else is fair game. On the surface, this has a certain beautiful simplicity to it. However, many characters have special meaning depending on the context, so almost any code that actually tries to interpret a filename has to add a lot of complexity to be robust. The essay delves into many ways that things can go wrong.
Filenames have been this way for a long time though, and I don’t expect the rules to change officially anytime soon. Still, my day job now is developing SystemTap, and this sort of problem is one of many that SystemTap can address. Here’s a script showing how a system administrator could patch the kernel with their own addendum to the filename rules:
#!/usr/bin/stap -g
# badname.stp
# Prevent the creation of files with undesirable names.

# return non-zero if the filename should be blocked
function filter:long (name:string)
{
  return euid() && isinstr(name, "XXX")
}

global squash_inode_permission
probe kernel.function("may_create@fs/namei.c")
{
  # screen out the conditions which may_create will fail anyway
  if ($child->d_inode || $dir->i_flags & 16)
    next

  # check that the new file meets our naming rules
  if (filter(kernel_string($child->d_name->name)))
    squash_inode_permission[tid()] = 1
}

probe kernel.function("inode_permission@fs/namei.c").return !,
      kernel.function("permission@fs/namei.c").return
{
  if (!$return && squash_inode_permission[tid()])
    $return = -13 # -EACCES (Permission denied)
  delete squash_inode_permission[tid()]
}
The script starts by defining a filter function. It first checks whether the effective user ID is non-zero, so the root user can bypass the filter. Then, for the prude admins out there, I’ve chosen to block filenames that contain the string “XXX”. I intentionally kept this part small for the example, but you could easily extend the function to cover all of the new rules that Wheeler suggests.
After that is a probe on the may_create function, which is what the kernel calls to validate permissions for new files. We can call our filtering function from here to see if the filename is OK, but since may_create is inlined, we don’t have a direct way to influence its result. The last thing may_create does, though, is return the result of inode_permission (or permission in earlier kernels), which we can override. So we save the filtering decision in a global, and then in a return probe on inode_permission we change a successful $return code to our own error value. Now, any attempt to create a file that doesn’t pass our rules will get an error of “Permission denied”.
This sort of script is really just a band-aid, and it does nothing about files that already have “bad” names. Still, I hope this is an interesting example of how easily one can modify kernel behavior with SystemTap. This script can be a starting point to define and try out your own filename rules, and changes can be reloaded and tested without ever having to reboot. Once your policy has been decided, you can configure the script to load as soon as the system boots, so you’re always running with your improved filename rules, even across kernel upgrades.
It’s powerful stuff, but don’t let it get to your head… 🙂
Free Wireless
I made an upgrade on my netbook today. Can you spot the difference?
If you guessed that it’s now 100% open-source compatible, you are correct!
Even though I ordered the Linux package, Dell cheaped out with a Broadcom wireless card that doesn’t have very good Linux support. Broadcom does have drivers available, but you have to download and compile the wrapper yourself, and downloading a new driver and all of the kernel-devel packages is a little harder when your network is not connected…
So to replace the Broadcom, I ordered an Intel 3945ABG card, because it is well supported in Linux. The kernel already includes the right drivers, so Fedora works right out of the box, even booting off of a live CD. And not only are the drivers open-source, they even work better: my connection time shrank from 20-30 seconds down to about 5 seconds. I can live with that!
So to replace the Broadcom, I ordered an Intel 3945ABG card, because it is well supported in Linux. The kernel has the right drivers already included, so Fedora works right out of the box, even booting off of a live cd. And not only are the drivers open-source, but they even work better. My connection time shrunk from 20-30 seconds down to about 5 seconds. I can live with that!