Posts Tagged ‘probability’

How short can Git abbreviate?

November 10, 2013 3 comments

How short can a Git hash be abbreviated?  A full SHA-1 hash is 160 bits, 40 hex characters, which is a bit long to talk about and retype when you’re not just copy-pasting.  Therefore most people will use just the first few characters to refer to commits, but how safe is this?

I know of a couple posts where Linus Torvalds laments Git’s default hash abbreviation of 7 characters.  This gives you 28 unique bits, almost 270 million possibilities, which seems like it should be plenty.  Even with the Linux kernel, one of the biggest repositories I know, there are fewer than 4 million objects.  However, since hash functions like SHA-1 have approximately random distribution, you easily can and will run into the birthday problem at this level.

As Linus noted in that second link, Git will now make sure that the abbreviated hashes it prints are currently long enough to be unique, but that’s no guarantee against future collisions.  He recommends kernel developers use 12 characters, which got me curious how to evaluate how much headroom this leaves for any given repository.  One way to get a basic idea is to ask Git to abbreviate all commit objects as much as possible.  For example, on linux.git:

$ git rev-list --all --abbrev=0 --abbrev-commit | wc -L

To see which commits require 11-character abbreviations:

$ git rev-list --all --abbrev=0 --abbrev-commit | grep -E '.{11,}'

And indeed, if you try a command like git show on any of those with just 10 characters, it will complain that this is ambiguous. Notice that none of those 3 are near each other though, which means that the collision must have been with some other object type, like a tree or blob. In fact, git log is happy with 10 characters on these, as it only has to disambiguate among revisions.

How about a histogram of commit abbreviation lengths:

$ git rev-list --all --abbrev=0 --abbrev-commit |
  awk '{ a[length] += 1 } END { for (len in a) print len, a[len] }'
5 1771
6 286066
7 106897
8 7899
9 494
10 27
11 3

In fact, most commits here are just fine with that 7-character abbreviation, even a few as low as 5, but there’s that tail end of commits which require up to 11. So 12 does seem like a reasonable suggestion to leave some headroom, but I don’t think it’s quite the “later generations” headroom that Linus wanted.

This is all fine for commit objects, but I didn’t find a way for Git to print this minimized abbreviation on all object types. So, I wrote a quick script to run through various lengths:

# git-unique-abbrev

git rev-list --all --objects | cut -c1-40 | sort >"$OBJECTS"
printf "%d objects\n" $(wc -l <"$OBJECTS")
for abbrev in $(seq 4 40); do
    uniq -D -w $abbrev <"$OBJECTS" >"$DUPES"
    count=$(wc -l <"$DUPES")
    acount=$(uniq -w $abbrev <"$DUPES" | wc -l)
    printf "%2d: %d / %d\n" $abbrev $count $acount
    test $count -eq 0 && cat "$OBJECTS"
    mv "$DUPES" "$OBJECTS"
    test $count -eq 0 && break
rm -f "$OBJECTS"

On linux.git:

$ git-unique-abbrev 
3253824 objects
 4: 3253824 / 65536
 5: 3107326 / 854621
 6: 573758 / 277590
 7: 39577 / 19748
 8: 2609 / 1304
 9: 160 / 80
10: 12 / 6
11: 2 / 1
12: 0 / 0

Each line is reporting how many total objects are ambiguous at that abbreviation and how many such abbreviations there are. The minimum Git allows is 4, which is totally saturated in this repository — every object is ambiguous, and all 65536 possible prefixes are included.  The default 7 does disambiguate most objects, but there’s still 1.2% remaining.  At 8 characters, with 2609 objects and 1304 common prefixes, we can infer there’s even still a triplicate around.  We really do require 12 characters now to disambiguate all objects.  The last lines are the final set of duplicates, and git cat-file -t will tell you these are both tree objects.

For a probabilistic view, we can use the square approximation from Wikipedia: p(n) \approx n^2/2m, where n is the number of random values, m is the number of possible values, and here we should use m=2^{4c} for c characters.  For n = 3253824, this gives us approximately 30% chance of colliding at 11 characters, and only 1.9% at 12. So it’s not all that strange to have gotten our d597639e203 result for 11 after all.

What’s the takeaway from all this? If you are a kernel developer, or use a Git repository of similar size, you should definitely set Git to core.abbrev of at least 12, and maybe more to have some real headroom.  For smaller projects, you can try some of the commands above to see where you stand, but it may just be prudent for everyone to get used to using longer hashes everywhere.  Finally, when you reference a Git hash for posterity, e.g. in another commit message, I’d recommend always using the full value.

%d bloggers like this: