Archive
Add new warnings to GCC with Python
I found myself getting annoyed lately with a certain C++ anti-pattern: functions which take string parameters by value unnecessarily. The pattern itself isn’t a huge sin, and can have legitimate uses, but often functions would be just as happy with a string const&
parameter, so copying it is a waste. No matter how the optimized string copy is, it won’t be better than a simple reference. Then throw more complex classes in the mix, and it would be good have a tool or diagnostic to tell when things are copied unnecessarily.
My first thought was to try this with elfutils libdw, or perhaps Dyninst SymtabAPI. It shouldn’t be that hard to iterate debuginfo for all the functions in a binary, and scan their parameters for direct class or struct types. I may still try this, actually, just as a case study for relatively simple example code. However, one of my coworkers suggested this idea would also work as a GCC plugin, so I tried it with gcc-python-plugin and found it embarrassingly easy. Voila!
import gcc def on_pass_execution(p, fn): # This pass is called fairly early on, per-function, after the # CFG has been built. Skip every other pass. if p.name != '*warn_function_return': return # I don't want notices for system headers. # (is data like -isystem available?) if fn.decl.location.file.startswith('/usr/'): return for parm, parmtype in zip(fn.decl.arguments, fn.decl.type.argument_types): if type(parmtype) == gcc.RecordType: gcc.inform(parm.location, 'parameter type is not trivial') gcc.register_callback(gcc.PLUGIN_PASS_EXECUTION, on_pass_execution)
I started with the hello-world example, which already shows how to hook every function and extract more info than I even need. You’ll see that my case is even shorter in code, just checking the rough parameter type and issuing a notice. In practice it looks like this:
$ cat test.cc #include <string> size_t len(std::string s) { return s.size(); } $ gcc-with-python3 script.py -c test.cc test.cc: In function ‘size_t len(std::string)’: test.cc:2:24: note: parameter type is not trivial size_t len(std::string s) { return s.size(); } ^
I got this working with my whole build just by tweaking make options:
make CXXLD=g++ CXX="gcc-with-python3 /path/to/script.py"
Pull that into Vim quickfix, and I’m now easily browsing through every instance of non-trivial parameter copies in my code!
Part of the reason this was so easy was that I just flag every direct class/struct parameter, regardless of whether an option like const&
would fit. This might be a little more intelligent if it scanned the CFG to see if the parameter really could be constant. It could also relax a bit on what’s flagged, perhaps just types for which std::is_trivially_copyable
returns false
in C++11. But for just scratching an itch, I’m amazed at how quick this simple plugin was to write.
SystemTap monitoring ptrace activity
I’d like to present a real and hopefully interesting example: how I recently used SystemTap to help with a complex ptrace issue. I was trying to track down an issue which involved multiple processes, and one of those was ptrace-attached to the other. In this situation, it would be perfectly fine to attach a debugger to the controlling process, but since ptrace is exclusive, I could not have attached again to that child process. Where traditional tools are left out, SystemTap may still help!
For my problem, in a little more detail, the ptracer process was watching for forks in the child. When a fork occurred, it executed a bit of instrumentation in both the original child and the new fork, then allowed both to continue. My bug was that the fork return value in the caller (as seen after the instrumentation) should have been the new PID, but it was always 56. A bit of corruption, it seemed, but always the same 56. Except, even more strangely, on some systems once auditd started, this changed to -1 with errno
set to ENOSYS
, despite the fork having truly succeeded.
My “aha!” moment was when I realized that on x86_64, SYS_clone
happens to be 56, and further the syscall number on entry and its return value on exit both use the same %rax
register. The ptrace controller needs to save and restore register state when running instrumentation, so it seemed likely here was the culprit somehow. Now this is something easily traced with SystemTap, even across processes and kernel activity. I only needed to watch fork, ptrace, and the wait4 that gets ptrace signals. Here’s that script I used:
# NB: TASK_COMM_LEN is 16, including '\0', so truncate to 15 chars global mutator = "test_driver" global mutatee = "test4_2.dyn_g++" function trace(str) { printf("%16s[%5d] %s\n", execname(), pid(), str) } probe syscall.fork { if (execname() == mutatee) trace(sprintf("-> %s (%s)", name, argstr)) } probe syscall.fork.return { if (execname() == mutatee) trace(sprintf("<- %s = %s", name, retstr)) } global ptrace_watch probe begin { ptrace_watch[7] = 1 # PTRACE_CONT ptrace_watch[12] = 1 # PTRACE_GETREGS ptrace_watch[13] = 1 # PTRACE_SETREGS ptrace_watch[24] = 1 # PTRACE_SYSCALL } probe syscall.ptrace { if (execname() == mutator && request in ptrace_watch) trace(sprintf("%s (%s)", name, argstr)) } probe syscall.wait4.return { if (execname() == mutator) trace(sprintf("%s = %s %s", name, retstr, status_str)) }
This is a relatively simple script, with just a little logic to filter the events down, and then it prints a trace at each point. Even with that filtering, I got more output than I care to flood in a blog post, but here’s the excerpt of the bug in action:
test4_2.dyn_g++[18304] -> fork () test_driver[18287] wait4 = 18305 WSTOPSIG=SIGSTOP test_driver[18287] wait4 = 18304 WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8 test_driver[18287] ptrace (PTRACE_GETREGS, 18304, data=0x7fff291f0990) test_driver[18287] ptrace (PTRACE_CONT, 18304, SIG_0) test4_2.dyn_g++[18304] <- fork = 18305 test_driver[18287] wait4 = 18304 WSTOPSIG=SIGTRAP test_driver[18287] ptrace (PTRACE_SETREGS, 18304, data=0x7fff291f09c0)
This showed me that my guess was pretty much on the mark! I have a pet peeve against blind bug fixes, where some programmers will just take a guess at a problem and make changes, without really confirming that they’ve understood what’s happening. So it’s gratifying to me to see this real data showing how things went down. There’s the fork entered and returned in one process, including the proper return value (at least from the kernel perspective). The PTRACE_EVENT_FORK
came during the middle of the fork, so when the ptracer grabbed current registers, %rax
was still the entry value. The fork returned the right value to the temporary instrumentation path (that didn’t care about it). When the ptracer went to restore registers to get back to where fork should have returned, it would be restoring the syscall’s entry value — 56.
It’s a tricky situation, made trickier by the fact that I could reproduce the bug on RHEL5 and Fedora, but RHEL6 was unaffected. But with the same script, I could see that RHEL6 waited until after the internal kernel fork returned to deliver the PTRACE_EVENT_FORK
stop, so the PTRACE_GETREGS
just happened to be from a more convenient place. (The behavior difference is due to utrace, which is a whole can of worms intertwined with ptrace internals.)
My plan to fix it was to introduce a PTRACE_SYSCALL
call after the event, so the kernel fork would complete but still trap before returning to userspace. As with tracking down the bug, the proof of a fix requires data too:
test4_2.dyn_g++[19488] -> fork () test_driver[19471] wait4 = 19488 WSTOPSIG=SIGTRAP | PTRACE_EVENT_FORK << 8 test_driver[19471] ptrace (PTRACE_SYSCALL, 19488, SIG_0) test4_2.dyn_g++[19488] <- fork = 19489 test_driver[19471] wait4 = 19489 WSTOPSIG=SIGSTOP test_driver[19471] wait4 = 19488 WSTOPSIG=TRACESYSGOOD test_driver[19471] ptrace (PTRACE_GETREGS, 19488, data=0x7fff07d26e90) test_driver[19471] ptrace (PTRACE_CONT, 19488, SIG_0) test_driver[19471] wait4 = 19488 WSTOPSIG=SIGTRAP test_driver[19471] ptrace (PTRACE_SETREGS, 19488, data=0x7fff07d26ec0
That’s a better looking sequence of events, and indeed the fork is finally showing the proper PID even after instrumentation. It worked on all systems which previously reproduced the bug, and even the unaffected RHEL6 was fine with the added wait. Fixed and confirmed!
Finally, I should admit, the problem was not so cleanly found and resolved as I might make it sound. Like many debugging sessions, I went through a lot steps before I’d narrowed it down this far. I read a lot of both application code and kernel code, and I wrote quite a few one-off SystemTap scripts watching behavior in different areas. (Perhaps a journal of this process might be interesting some other time…) So I don’t mean to imply that immediately SystemTap magically shined a light right where I needed to look — but it did make it possible for me to see into an otherwise opaque problem.
Defining your own Git commands
When you use a command prompt long enough, you’ll find yourself repeating some common command sequences. If you value your time, you’ll probably start saving shell aliases and even full shell scripts to simplify these things.
The same is true with using Git: you’ll probably find yourself repeating some long commands, and wish for an easier way. You may also just have some tricky Git recipes that you want to remember. Well sure enough, you can write Git aliases too, as well as add your own scripts as git subcommands. The Git wiki has a pretty good page describing aliases, so I’ll let you read that, and there are examples too. Here are a few more examples that I felt are worth sharing.
up = remote update
This first one is just pure laziness. A git remote update
is the same as git fetch
for all configured remotes, and now git up
can do the same with less typing.
new = log ..@{upstream} pend = log @{upstream}..
These show how the current branch differs from the upstream tracking branch (often origin/master
, but could be anything). My version of git new
is different than the example on the Git wiki, instead showing what commits you’ve fetched but not yet pulled/merged. Then git pend
is the opposite, showing what commits are pending to go out in a push.
log1 = log --abbrev-commit --pretty=oneline logf = log --pretty=fuller
These are just shortcuts for formatted logs; git log1
for one line per commit, and git logf
for the fuller format that includes committer info (instead of just author). The log1
options could also be condensed down to --oneline
these days, but that wasn’t available when I first wrote this alias.
lol = log --graph --decorate --pretty=oneline --abbrev-commit lola = log --graph --decorate --pretty=oneline --abbrev-commit --all
More logging tricks, and for these I’ll give credit to the blog where I found them. These add ASCII-art graph lines to help visualize merges, and they also label tags and branch heads.
show-patch = format-patch --stdout -1
This works like git show
, dumping the most recent commit on stdout, but with format-patch
it uses email-style headers. This makes it easier to elsewhere use git am
on that patch with preserved commit message, author, and date. It also includes binary diffs by default.
make = !test -d \"$(git rev-parse --show-toplevel)\" && make -C \"$_\"
This runs make from the top of the git directory, no matter what your current working directory is, so you can just git make
anywhere. The test -d
beforehand is just a precaution to make sure you’re really within a git path, otherwise you’ll get both a git error and make complaining of an empty -C
argument.
Once you start getting into complicated “!...
” sequences, it may make sense to break it out into its own file. As long as you name it starting with “git-
“, like git-unique-abbrev
in my last post, and save it somewhere in your PATH
, then you can call it as you would any other git command like git unique-abbrev
. Another good example is this git merge-ff
script on StackOverflow, which lets you safely fast-forward any branch without affecting your current work tree.
The last one I’ll leave you with is back to shell aliases:
alias gcd='test -d "$(git rev-parse --show-toplevel)" && cd "$_"'
This changes your current directory to the top of the git tree, using the same mechanism as I did for git make
. This has to be a shell alias, not a git command or a script, because it’s changing the working directory of the shell. Don’t forget cd -
to jump back too.
How short can Git abbreviate?
How short can a Git hash be abbreviated? A full SHA-1 hash is 160 bits, 40 hex characters, which is a bit long to talk about and retype when you’re not just copy-pasting. Therefore most people will use just the first few characters to refer to commits, but how safe is this?
I know of a couple posts where Linus Torvalds laments Git’s default hash abbreviation of 7 characters. This gives you 28 unique bits, almost 270 million possibilities, which seems like it should be plenty. Even with the Linux kernel, one of the biggest repositories I know, there are fewer than 4 million objects. However, since hash functions like SHA-1 have approximately random distribution, you easily can and will run into the birthday problem at this level.
As Linus noted in that second link, Git will now make sure that the abbreviated hashes it prints are currently long enough to be unique, but that’s no guarantee against future collisions. He recommends kernel developers use 12 characters, which got me curious how to evaluate how much headroom this leaves for any given repository. One way to get a basic idea is to ask Git to abbreviate all commit objects as much as possible. For example, on linux.git
:
$ git rev-list --all --abbrev=0 --abbrev-commit | wc -L 11
To see which commits require 11-character abbreviations:
$ git rev-list --all --abbrev=0 --abbrev-commit | grep -E '.{11,}' 8b82547e33e 3ee5014185b 21949f00a02
And indeed, if you try a command like git show
on any of those with just 10 characters, it will complain that this is ambiguous. Notice that none of those 3 are near each other though, which means that the collision must have been with some other object type, like a tree or blob. In fact, git log
is happy with 10 characters on these, as it only has to disambiguate among revisions.
How about a histogram of commit abbreviation lengths:
$ git rev-list --all --abbrev=0 --abbrev-commit | awk '{ a[length] += 1 } END { for (len in a) print len, a[len] }' 5 1771 6 286066 7 106897 8 7899 9 494 10 27 11 3
In fact, most commits here are just fine with that 7-character abbreviation, even a few as low as 5, but there’s that tail end of commits which require up to 11. So 12 does seem like a reasonable suggestion to leave some headroom, but I don’t think it’s quite the “later generations” headroom that Linus wanted.
This is all fine for commit objects, but I didn’t find a way for Git to print this minimized abbreviation on all object types. So, I wrote a quick script to run through various lengths:
#!/bin/bash # git-unique-abbrev OBJECTS="$(mktemp)" git rev-list --all --objects | cut -c1-40 | sort >"$OBJECTS" printf "%d objects\n" $(wc -l <"$OBJECTS") for abbrev in $(seq 4 40); do DUPES="$(mktemp)" uniq -D -w $abbrev <"$OBJECTS" >"$DUPES" count=$(wc -l <"$DUPES") acount=$(uniq -w $abbrev <"$DUPES" | wc -l) printf "%2d: %d / %d\n" $abbrev $count $acount test $count -eq 0 && cat "$OBJECTS" mv "$DUPES" "$OBJECTS" test $count -eq 0 && break done rm -f "$OBJECTS"
On linux.git
:
$ git-unique-abbrev 3253824 objects 4: 3253824 / 65536 5: 3107326 / 854621 6: 573758 / 277590 7: 39577 / 19748 8: 2609 / 1304 9: 160 / 80 10: 12 / 6 11: 2 / 1 12: 0 / 0 d597639e2036f04f0226761e2d818b31f2db7820 d597639e203a100156501df8a0756fd09573e2de
Each line is reporting how many total objects are ambiguous at that abbreviation and how many such abbreviations there are. The minimum Git allows is 4, which is totally saturated in this repository — every object is ambiguous, and all 65536 possible prefixes are included. The default 7 does disambiguate most objects, but there’s still 1.2% remaining. At 8 characters, with 2609 objects and 1304 common prefixes, we can infer there’s even still a triplicate around. We really do require 12 characters now to disambiguate all objects. The last lines are the final set of duplicates, and git cat-file -t
will tell you these are both tree objects.
For a probabilistic view, we can use the square approximation from Wikipedia: , where n is the number of random values, m is the number of possible values, and here we should use
for c characters. For n = 3253824, this gives us approximately 30% chance of colliding at 11 characters, and only 1.9% at 12. So it’s not all that strange to have gotten our
d597639e203
result for 11 after all.
What’s the takeaway from all this? If you are a kernel developer, or use a Git repository of similar size, you should definitely set Git to core.abbrev
of at least 12, and maybe more to have some real headroom. For smaller projects, you can try some of the commands above to see where you stand, but it may just be prudent for everyone to get used to using longer hashes everywhere. Finally, when you reference a Git hash for posterity, e.g. in another commit message, I’d recommend always using the full value.