### Archive

Posts Tagged ‘probability’

## How short can Git abbreviate?

How short can a Git hash be abbreviated?  A full SHA-1 hash is 160 bits, 40 hex characters, which is a bit long to talk about and retype when you’re not just copy-pasting.  Therefore most people will use just the first few characters to refer to commits, but how safe is this?

I know of a couple posts where Linus Torvalds laments Git’s default hash abbreviation of 7 characters.  This gives you 28 unique bits, almost 270 million possibilities, which seems like it should be plenty.  Even with the Linux kernel, one of the biggest repositories I know, there are fewer than 4 million objects.  However, since hash functions like SHA-1 have approximately random distribution, you easily can and will run into the birthday problem at this level.

As Linus noted in that second link, Git will now make sure that the abbreviated hashes it prints are currently long enough to be unique, but that’s no guarantee against future collisions.  He recommends kernel developers use 12 characters, which got me curious how to evaluate how much headroom this leaves for any given repository.  One way to get a basic idea is to ask Git to abbreviate all commit objects as much as possible.  For example, on `linux.git`:

```\$ git rev-list --all --abbrev=0 --abbrev-commit | wc -L
11```

To see which commits require 11-character abbreviations:

```\$ git rev-list --all --abbrev=0 --abbrev-commit | grep -E '.{11,}'
8b82547e33e
3ee5014185b
21949f00a02```

And indeed, if you try a command like `git show` on any of those with just 10 characters, it will complain that this is ambiguous. Notice that none of those 3 are near each other though, which means that the collision must have been with some other object type, like a tree or blob. In fact, `git log` is happy with 10 characters on these, as it only has to disambiguate among revisions.

How about a histogram of commit abbreviation lengths:

```\$ git rev-list --all --abbrev=0 --abbrev-commit |
awk '{ a[length] += 1 } END { for (len in a) print len, a[len] }'
5 1771
6 286066
7 106897
8 7899
9 494
10 27
11 3```

In fact, most commits here are just fine with that 7-character abbreviation, even a few as low as 5, but there’s that tail end of commits which require up to 11. So 12 does seem like a reasonable suggestion to leave some headroom, but I don’t think it’s quite the “later generations” headroom that Linus wanted.

This is all fine for commit objects, but I didn’t find a way for Git to print this minimized abbreviation on all object types. So, I wrote a quick script to run through various lengths:

```#!/bin/bash
# git-unique-abbrev

OBJECTS="\$(mktemp)"
git rev-list --all --objects | cut -c1-40 | sort >"\$OBJECTS"
printf "%d objects\n" \$(wc -l <"\$OBJECTS")
for abbrev in \$(seq 4 40); do
DUPES="\$(mktemp)"
uniq -D -w \$abbrev <"\$OBJECTS" >"\$DUPES"
count=\$(wc -l <"\$DUPES")
acount=\$(uniq -w \$abbrev <"\$DUPES" | wc -l)
printf "%2d: %d / %d\n" \$abbrev \$count \$acount
test \$count -eq 0 && cat "\$OBJECTS"
mv "\$DUPES" "\$OBJECTS"
test \$count -eq 0 && break
done
rm -f "\$OBJECTS"
```

On `linux.git`:

```\$ git-unique-abbrev
3253824 objects
4: 3253824 / 65536
5: 3107326 / 854621
6: 573758 / 277590
7: 39577 / 19748
8: 2609 / 1304
9: 160 / 80
10: 12 / 6
11: 2 / 1
12: 0 / 0
d597639e2036f04f0226761e2d818b31f2db7820
d597639e203a100156501df8a0756fd09573e2de```

Each line is reporting how many total objects are ambiguous at that abbreviation and how many such abbreviations there are. The minimum Git allows is 4, which is totally saturated in this repository — every object is ambiguous, and all 65536 possible prefixes are included.  The default 7 does disambiguate most objects, but there’s still 1.2% remaining.  At 8 characters, with 2609 objects and 1304 common prefixes, we can infer there’s even still a triplicate around.  We really do require 12 characters now to disambiguate all objects.  The last lines are the final set of duplicates, and `git cat-file -t` will tell you these are both tree objects.

For a probabilistic view, we can use the square approximation from Wikipedia: $p(n) \approx n^2/2m$, where n is the number of random values, m is the number of possible values, and here we should use $m=2^{4c}$ for c characters.  For n = 3253824, this gives us approximately 30% chance of colliding at 11 characters, and only 1.9% at 12. So it’s not all that strange to have gotten our `d597639e203` result for 11 after all.

What’s the takeaway from all this? If you are a kernel developer, or use a Git repository of similar size, you should definitely set Git to `core.abbrev` of at least 12, and maybe more to have some real headroom.  For smaller projects, you can try some of the commands above to see where you stand, but it may just be prudent for everyone to get used to using longer hashes everywhere.  Finally, when you reference a Git hash for posterity, e.g. in another commit message, I’d recommend always using the full value.