Posts Tagged ‘Git’

Defining your own Git commands

November 17, 2013 Comments off

When you use a command prompt long enough, you’ll find yourself repeating some common command sequences.  If you value your time, you’ll probably start saving shell aliases and even full shell scripts to simplify these things.

The same is true with using Git: you’ll probably find yourself repeating some long commands, and wish for an easier way.  You may also just have some tricky Git recipes that you want to remember.  Well sure enough, you can write Git aliases too, as well as add your own scripts as git subcommands.  The Git wiki has a pretty good page describing aliases, so I’ll let you read that, and there are examples too.  Here are a few more examples that I felt are worth sharing.

up = remote update

This first one is just pure laziness.  A git remote update is the same as git fetch for all configured remotes, and now git up can do the same with less typing.

new = log ..@{upstream}
pend = log @{upstream}..

These show how the current branch differs from the upstream tracking branch (often origin/master, but could be anything).  My version of git new is different than the example on the Git wiki, instead showing what commits you’ve fetched but not yet pulled/merged.  Then git pend is the opposite, showing what commits are pending to go out in a push.

log1 = log --abbrev-commit --pretty=oneline
logf = log --pretty=fuller

These are just shortcuts for formatted logs; git log1 for one line per commit, and git logf for the fuller format that includes committer info (instead of just author).  The log1 options could also be condensed down to --oneline these days, but that wasn’t available when I first wrote this alias.

lol = log --graph --decorate --pretty=oneline --abbrev-commit
lola = log --graph --decorate --pretty=oneline --abbrev-commit --all

More logging tricks, and for these I’ll give credit to the blog where I found them.  These add ASCII-art graph lines to help visualize merges, and they also label tags and branch heads.

show-patch = format-patch --stdout -1

This works like git show, dumping the most recent commit on stdout, but with format-patch it uses email-style headers.  This makes it easier to elsewhere use git am on that patch with preserved commit message, author, and date.  It also includes binary diffs by default.

make = !test -d \"$(git rev-parse --show-toplevel)\" && make -C \"$_\"

This runs make from the top of the git directory, no matter what your current working directory is, so you can just git make anywhere.  The test -d beforehand is just a precaution to make sure you’re really within a git path, otherwise you’ll get both a git error and make complaining of an empty -C argument.

Once you start getting into complicated “!...” sequences, it may make sense to break it out into its own file.  As long as you name it starting with “git-“, like git-unique-abbrev in my last post, and save it somewhere in your PATH, then you can call it as you would any other git command like git unique-abbrev.  Another good example is this git merge-ff script on StackOverflow, which lets you safely fast-forward any branch without affecting your current work tree.

The last one I’ll leave you with is back to shell aliases:

alias gcd='test -d "$(git rev-parse --show-toplevel)" && cd "$_"'

This changes your current directory to the top of the git tree, using the same mechanism as I did for git make.  This has to be a shell alias, not a git command or a script, because it’s changing the working directory of the shell.  Don’t forget cd - to jump back too.


How short can Git abbreviate?

November 10, 2013 3 comments

How short can a Git hash be abbreviated?  A full SHA-1 hash is 160 bits, 40 hex characters, which is a bit long to talk about and retype when you’re not just copy-pasting.  Therefore most people will use just the first few characters to refer to commits, but how safe is this?

I know of a couple posts where Linus Torvalds laments Git’s default hash abbreviation of 7 characters.  This gives you 28 unique bits, almost 270 million possibilities, which seems like it should be plenty.  Even with the Linux kernel, one of the biggest repositories I know, there are fewer than 4 million objects.  However, since hash functions like SHA-1 have approximately random distribution, you easily can and will run into the birthday problem at this level.

As Linus noted in that second link, Git will now make sure that the abbreviated hashes it prints are currently long enough to be unique, but that’s no guarantee against future collisions.  He recommends kernel developers use 12 characters, which got me curious how to evaluate how much headroom this leaves for any given repository.  One way to get a basic idea is to ask Git to abbreviate all commit objects as much as possible.  For example, on linux.git:

$ git rev-list --all --abbrev=0 --abbrev-commit | wc -L

To see which commits require 11-character abbreviations:

$ git rev-list --all --abbrev=0 --abbrev-commit | grep -E '.{11,}'

And indeed, if you try a command like git show on any of those with just 10 characters, it will complain that this is ambiguous. Notice that none of those 3 are near each other though, which means that the collision must have been with some other object type, like a tree or blob. In fact, git log is happy with 10 characters on these, as it only has to disambiguate among revisions.

How about a histogram of commit abbreviation lengths:

$ git rev-list --all --abbrev=0 --abbrev-commit |
  awk '{ a[length] += 1 } END { for (len in a) print len, a[len] }'
5 1771
6 286066
7 106897
8 7899
9 494
10 27
11 3

In fact, most commits here are just fine with that 7-character abbreviation, even a few as low as 5, but there’s that tail end of commits which require up to 11. So 12 does seem like a reasonable suggestion to leave some headroom, but I don’t think it’s quite the “later generations” headroom that Linus wanted.

This is all fine for commit objects, but I didn’t find a way for Git to print this minimized abbreviation on all object types. So, I wrote a quick script to run through various lengths:

# git-unique-abbrev

git rev-list --all --objects | cut -c1-40 | sort >"$OBJECTS"
printf "%d objects\n" $(wc -l <"$OBJECTS")
for abbrev in $(seq 4 40); do
    uniq -D -w $abbrev <"$OBJECTS" >"$DUPES"
    count=$(wc -l <"$DUPES")
    acount=$(uniq -w $abbrev <"$DUPES" | wc -l)
    printf "%2d: %d / %d\n" $abbrev $count $acount
    test $count -eq 0 && cat "$OBJECTS"
    mv "$DUPES" "$OBJECTS"
    test $count -eq 0 && break
rm -f "$OBJECTS"

On linux.git:

$ git-unique-abbrev 
3253824 objects
 4: 3253824 / 65536
 5: 3107326 / 854621
 6: 573758 / 277590
 7: 39577 / 19748
 8: 2609 / 1304
 9: 160 / 80
10: 12 / 6
11: 2 / 1
12: 0 / 0

Each line is reporting how many total objects are ambiguous at that abbreviation and how many such abbreviations there are. The minimum Git allows is 4, which is totally saturated in this repository — every object is ambiguous, and all 65536 possible prefixes are included.  The default 7 does disambiguate most objects, but there’s still 1.2% remaining.  At 8 characters, with 2609 objects and 1304 common prefixes, we can infer there’s even still a triplicate around.  We really do require 12 characters now to disambiguate all objects.  The last lines are the final set of duplicates, and git cat-file -t will tell you these are both tree objects.

For a probabilistic view, we can use the square approximation from Wikipedia: p(n) \approx n^2/2m, where n is the number of random values, m is the number of possible values, and here we should use m=2^{4c} for c characters.  For n = 3253824, this gives us approximately 30% chance of colliding at 11 characters, and only 1.9% at 12. So it’s not all that strange to have gotten our d597639e203 result for 11 after all.

What’s the takeaway from all this? If you are a kernel developer, or use a Git repository of similar size, you should definitely set Git to core.abbrev of at least 12, and maybe more to have some real headroom.  For smaller projects, you can try some of the commands above to see where you stand, but it may just be prudent for everyone to get used to using longer hashes everywhere.  Finally, when you reference a Git hash for posterity, e.g. in another commit message, I’d recommend always using the full value.

%d bloggers like this: