
Highlights from Git 2.31

The open source Git project just released Git 2.31 with features and bug fixes from 85 contributors, 23 of them new. Last time we caught up with you, Git 2.29 had just been released. Two versions later, let’s take a look at the most interesting features and changes that have happened since.

Introducing git maintenance

Picture this: you’re at your terminal, writing commits, pulling from another repository, and pushing up the results when, all of a sudden, you’re greeted by this unfriendly message:

Auto packing the repository for optimum performance. You may also
run "git gc" manually. See "git help gc" for more information.

…and, you’re stuck. Now you’ve got to wait for Git to finish running git gc --auto before you can get back to work.

What happened here? In the course of normal use, Git writes lots of data: objects, packfiles, references, and the like. Some of those paths are optimized for write performance. For example, it’s much quicker to write a single “loose” object, but it’s faster to read a packfile.

To keep you productive, Git makes a trade-off: in general, it optimizes for the write path while you’re working, pausing every so often to reorganize its internal data structures into a form that is more efficient to read, which pays off in the long run.

Git has its own heuristics about when is a good time to perform this “pause,” but sometimes those heuristics trigger a blocking git gc at the worst possible time. You could manage these data structures yourself, but you might not want to invest the time figuring out when and how to do that.

Starting in Git 2.31, you can get the best of both worlds with background maintenance. This cross-platform feature allows Git to keep your repository healthy while not blocking any of your interactions. In particular, this will improve your git fetch times by pre-fetching the latest objects from your remotes once an hour.

Getting started with background maintenance couldn’t be easier. Simply navigate your terminal to any repository you want to enable background maintenance on, and run the following:

$ git maintenance start

…and Git will take care of the rest. Besides pre-fetching the latest objects once an hour, Git will make sure that its own data is organized, too. It will update its commit-graph file once an hour, and pack any loose objects (as well as incrementally repack packed objects) nightly.

Read more about this feature in the git maintenance documentation and learn how to customize it with maintenance.* config options. If you have any trouble, you can check the troubleshooting documentation.
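As a rough sketch of that customization, the commands below run two maintenance tasks once in the foreground and then change one schedule through the maintenance.* options. The scratch repository, the file name, and the identity passed to git commit are placeholders chosen for illustration. The task and option names come from the git maintenance documentation.

```shell
# Create a scratch repository so the example touches nothing real.
repo=$(mktemp -d)
git init -q "$repo"
echo "hello" >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
  commit -q -m "initial commit"

# Run individual maintenance tasks once, in the foreground.
git -C "$repo" maintenance run --task=loose-objects --task=commit-graph

# Ask scheduled maintenance to refresh the commit-graph daily instead of hourly.
git -C "$repo" config maintenance.commit-graph.schedule daily

# Show every maintenance.* option set in this repository.
git -C "$repo" config --get-regexp '^maintenance\.'
```

Running tasks by hand like this is mostly useful for seeing what a task does. Once git maintenance start has been run, the same tasks happen on their schedule without blocking you.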


On-disk reverse indexes

You may know that Git stores all data as “objects”: commits, trees, and blobs (which store the contents of individual files). For efficiency, Git puts many objects into packfiles, which are essentially a concatenated stream of objects (this same stream is also how objects are transferred by git fetch and git push). In order to efficiently access individual objects, Git generates an index for each packfile. Each of these .idx files allows quick conversion of an object id into its byte offset within the packfile.

What happens when you want to go in the other direction? In particular, if all Git knows is what byte it’s looking at in some packfile, how does it go about figuring out which object that byte is part of?

To accomplish this, Git uses an aptly-named reverse index: an opaque mapping between locations in a packfile and the object each location is a part of. Prior to Git 2.31, there was no on-disk format for reverse indexes (like there is for the .idx file), so Git had to generate the reverse index and hold it in memory each time it needed one. This roughly boils down to generating an array of object-position pairs, and then sorting that array by position (for the curious, the exact details can be found here).

But this takes time. In the case of repositories with large packfiles, this can take a lot of time. To better understand the scale, consider an experiment which compares the time it takes to print the size of an object with the time it takes to print that object’s contents. To simply print an object’s contents, Git uses the forward index to locate the desired object in a pack, and then it reassembles and prints out its contents. But to print an object’s size in a packfile, Git needs to locate not just the object we want to measure, but the object immediately following it, and then subtract the two positions to find out how much space it’s using. To find the position of the first byte in the adjacent object, Git needs to use the reverse index.
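To make the idea concrete, here is a minimal sketch that imitates a reverse index from the command line. This is not how Git implements it internally. It relies on git show-index, which prints one "offset oid (crc)" line per object in an .idx file. The scratch repository and commit identity are placeholders.

```shell
# Build a scratch repository with a few commits, then pack everything.
repo=$(mktemp -d)
git init -q "$repo"
for i in 1 2 3; do
  echo "revision $i" >"$repo/file.txt"
  git -C "$repo" add file.txt
  git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "commit $i"
done
git -C "$repo" repack -a -d -q

# The forward index (.idx) is ordered by object id. Sorting its entries
# numerically by offset reorders them by pack position, which is the
# "array of pairs sorted by position" that a reverse index boils down to.
idx=$(ls "$repo"/.git/objects/pack/pack-*.idx)
revindex=$(git show-index <"$idx" | sort -n)

# An object's on-disk size is the next object's offset minus its own.
# The last object is skipped here: it ends at the pack trailer instead.
sizes=$(printf '%s\n' "$revindex" |
  awk 'NR > 1 { print prev_oid, $1 - prev_off } { prev_off = $1; prev_oid = $2 }')
printf '%s\n' "$sizes"
```

Each printed size should agree with what git cat-file --batch-check="%(objectsize:disk)" reports for the same object id, since both come from subtracting adjacent offsets.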

In this experiment, printing the size of an object is more than 62 times slower than printing that object’s entire contents. You can try this at home with hyperfine by running:

$ git rev-parse HEAD >tip
$ hyperfine --warmup=3 \
  'git cat-file --batch <tip' \
  'git cat-file --batch-check="%(objectsize:disk)" <tip'

In 2.31, Git gained the ability to serialize the reverse index into a new, on-disk format with the .rev extension. After generating an on-disk reverse index and repeating the above experiment, our results now show that it takes roughly the same amount of time to print an object’s contents as it does its size.

Observant readers may ask themselves why Git even needs to bother using a reverse index. After all, if you can print the contents of an object, then surely printing that object’s size is no more difficult than counting how many bytes you wrote when printing the contents. But this depends on the size of the object. If it’s enormous, then counting up all of its bytes is much more expensive than simply subtracting.

Reverse indexes can help beyond synthetic experiments like these: when sending objects for a fetch or push, the reverse index is used to send object bytes directly from disk. Having a reverse index computed ahead of time makes this process run faster.

Git doesn’t generate .rev files by default yet, but you can experiment with them yourself by running git config pack.writeReverseIndex true, and then repacking your repository (with git repack -Ad). We have been using these at GitHub for the past couple of months to enable dramatic improvements in many different Git operations.


Tidbits

  • We’ve talked on this blog before about the commit-graph file. It’s an incredibly useful serialization of common information about commits, like which parents they have, what their root tree is, and so on. (For a more detailed exposition, see the blog post series beginning here.) Commit graphs also store information about a commit’s generation number, which can be used to accelerate many kinds of commit walks. Git 2.31 introduces a new kind of generation number, which can improve performance further in certain situations. These patches were contributed by Abhishek Kumar, a Google Summer of Code student. [source]
  • In recent versions of Git, it has become easier to change the default name for the main branch in a new repository with the init.defaultBranch configuration. Git has always tried to check out the branch at the HEAD of your remote (i.e., if the remote’s default branch was “foo”, then git clone would try to check out foo locally), but this hasn’t worked with empty repositories. In Git 2.31, this now works with empty repositories, too. Now if you are cloning a newly-created repository locally to start writing the first patches, your local copy will respect the default branch name set by the remote, even if there aren’t any commits yet. [source]
  • On the topic of renaming things, Git 2.30 makes it easier to change the name of another default: a repository’s first remote. When you clone a repository with git clone, the first remote initialized is always named “origin”. Prior to Git 2.30, your options for renaming this were limited to running git remote rename origin <newname>. Git 2.30 allows you to configure a different name to be chosen by default, instead of always using “origin”. To give it a try for yourself, set the clone.defaultRemoteName configuration. [source]
  • When a repository grows large, it can be hard to figure out which branches are responsible. In Git 2.31, git rev-list now has a --disk-usage option which is both simpler and faster than using the existing tools to sum up object sizes. The examples section of the rev-list manual shows off some uses (and check out the source link below for timings and to see the “old” way of doing it). [source]
  • You may have used Git’s -G<regex> option to find commits which modified a line matching a particular pattern (e.g., git log -G'foo\(' will look for changes that added, removed, or modified calls to the foo() function). But you may also want to ignore lines matching a certain pattern. Git 2.30 introduces -I<regex>, which lets you ignore changes in lines matching a regular expression. For instance, git log -p -I'//' would show the patch for each commit, but omit any hunks that only touched comment lines (those containing //). [source]
  • In preparation for replacing the merge backend, rename detection has been substantially optimized. You can read more about these changes from their author in Optimizing git’s merge machinery, #1, and Optimizing git’s merge machinery, #2.

That’s just a sample of changes from the last couple of releases. For more, check out the release notes for 2.30 and 2.31, or any previous version in the Git repository.
