Hands-On Git Garbage Collection

Introduction

In this post, I would like to explain how the Git garbage collector works. Before we jump into the topic, I will briefly explain what a garbage collector is, and we can move on from there.

Garbage collection is a well-known software practice: the garbage collector tries to reclaim memory occupied by objects that are no longer in use by the program.

The Git garbage collector runs a number of tasks within the repository, such as compressing file revisions (to reduce disk space and increase performance) and removing unreachable objects, which may have been created from orphaned or inaccessible commits, for example when performing git reset or git rebase. In an effort to preserve history and avoid data loss, Git will not delete detached commits immediately; they are kept until they expire and a later garbage collection removes them.

An object (blob, tree or commit) is identified by its SHA and stored at .git/objects/<first two characters of the SHA>/<remaining characters of the SHA>. A repository therefore holds many object files, each named after the SHA of a commit, a blob or a tree.
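To make the storage layout concrete, here is a small sketch that writes a single blob into a throwaway repository and derives the path where Git keeps it. The temporary directory and the sample content are assumptions made purely for illustration.

```shell
# Create a scratch repository (hypothetical location, made with mktemp).
repo=$(mktemp -d)
cd "$repo"
git init -q

# Write a blob into the object store and capture its object ID.
sha=$(echo "hello gc" | git hash-object -w --stdin)

# The first two hex characters are the directory, the rest is the file name.
dir=$(printf '%s' "$sha" | cut -c1-2)
file=$(printf '%s' "$sha" | cut -c3-)
path=".git/objects/$dir/$file"

ls -l "$path"
git cat-file -t "$sha"   # prints the object type
```

The file on disk is a compressed loose object, so it is not readable as plain text; git cat-file is the way to inspect it.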

It is generally recommended to run this task on a regular basis within each repository to maintain good disk space utilization and good operating system performance.

A little bit of basics!

Git is a content-addressable file system: it takes content and stores it in files addressable by hashes.
A repository is simply the database containing all the information needed to retain and manage the revisions and history of a project. Within a repository, Git maintains two primary data structures: the object store and the index.

The object store is designed to be copied during a clone operation, as part of the mechanism that supports a distributed version control system.

The index is transitory information that is private to a repository and can be created and modified as needed. It is also used as a staging area between your working directory and your repository: you can use it to build up a set of changes that you want to commit together. When you create a commit, what gets committed is what is currently in the index, while changes that are not staged remain in the working directory.

A blob is used to store file data; it is basically the content of a file you modify.

A tree is basically a directory: it references blobs (files) and other trees (sub-directories).

A commit object holds metadata for each change introduced in the repository, including the author, committer, commit date and log message.
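The relationship between the three object types can be sketched with a single commit in a scratch repository. The repository location, the identity and the file name notes.txt are placeholders chosen for this example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "some data" > notes.txt               # hypothetical file name
git add notes.txt
git commit -q -m "add notes"

commit=$(git rev-parse HEAD)               # the commit object
tree=$(git rev-parse 'HEAD^{tree}')        # the tree the commit points to
blob=$(git rev-parse HEAD:notes.txt)       # the blob holding the file data

git cat-file -t "$commit"
git cat-file -t "$tree"
git cat-file -t "$blob"

git cat-file -p "$commit"   # tree ID, author, committer and message
git cat-file -p "$tree"     # mode, type, object ID and name of each entry
git cat-file -p "$blob"     # the file content itself
```

Reading the three cat-file -p outputs from top to bottom follows the same chain Git walks when it checks out a commit: commit to tree, tree to blob.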

Why?

Imagine a scenario where you are contributing code alongside many other developers, and you are responsible for a specific feature of the project. You start developing the feature and you check it in. However, you realize that it is not working as you expected, so you make a few more modifications and check in again. After that check-in, you also realize that a few conditions are not met, so you modify the code again and check in for the third time.

Let’s say that, to avoid cluttering the log with commits full of typos and small changes, you run git reset --hard HEAD^ and throw out your last commit. When you do a reset, the commit you threw out goes into a dangling state: it is still in Git’s database, waiting for a later git gc execution to clean it up.

Git gc tries not to delete objects that are referenced anywhere in your repository. It keeps not only objects referenced by your current set of branches and tags, but also objects referenced by the index, remote-tracking branches, refs saved by git filter-branch in refs/original/, and reflogs (which may reference commits in branches that were later amended or rewound).
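The reset scenario can be reproduced in a scratch repository to confirm that the discarded commit is still in the object store. This is only a sketch: the file name, commit messages and identity are made up for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "version one" > feature.txt
git add feature.txt
git commit -q -m "first attempt"
echo "version two, with a typo" > feature.txt
git commit -q -am "second attempt"

discarded=$(git rev-parse HEAD)
git reset -q --hard HEAD^

# The commit is gone from the branch history...
git log --oneline

# ...but the object still exists, and the reflog still references it.
git cat-file -t "$discarded"
git reflog

# When reflogs are ignored, fsck lists the unreferenced objects.
git fsck --no-reflogs
```

Because the reflog still references the commit, it can be recovered with git reset --hard followed by its SHA until the reflog entry expires.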

How does git gc actually work?

Before the execution, git gc checks several git config values, and these values determine the rest of the actions that git gc will perform.
Behind the scenes, git gc actually executes commands such as git prune, git repack, git pack-refs and git rerere. The high-level responsibility of these commands is to identify any Git objects that are outside the threshold levels set in the git gc configuration. Once identified, these objects are compressed or pruned.

The git prune command is a utility that cleans up unreachable or orphaned Git objects. Unreachable objects are those that are inaccessible from any ref. Any commit that cannot be accessed through a branch, tag or other reference is considered unreachable, and git prune runs when garbage collection runs. Git prune is often considered a child command of git gc.
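As a sketch of pruning in action, the following throws away a commit and then forces it out of the object store straight away. Expiring the reflog and passing --prune=now bypass the usual grace periods, so this is something to try only in a scratch repository like the one assumed here; the file name and identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "version one" > feature.txt
git add feature.txt
git commit -q -m "keep me"
echo "version two, to be discarded" > feature.txt
git commit -q -am "throw me away"

discarded=$(git rev-parse HEAD)
git reset -q --hard HEAD^

# Drop the reflog entries that still reference the discarded commit,
# then prune immediately instead of waiting for the expiry period.
git reflog expire --expire=now --all
git gc -q --prune=now

# cat-file -e exits non-zero when the object does not exist.
if git cat-file -e "$discarded" 2>/dev/null; then
  status="still present"
else
  status="pruned"
fi
echo "$status"
```

Without the reflog expire step the commit would survive, since reflog entries count as references.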

How do I set up git garbage collection?

One thing that many engineers are not aware of is that git gc runs automatically after several frequently used commands, such as git pull, git merge, git rebase and git commit. However, you can change how garbage collection behaves for your repository by modifying its settings.

There are several parameters that can be set to tell git gc what to do; let’s have a look at a few of them.

git config gc.reflogExpire
Defaults to 90 days; it sets how long records in a branch reflog are kept.

git config gc.reflogExpireUnreachable
Defaults to 30 days; it sets how long inaccessible reflog records are kept.

git config gc.pruneExpire
Defaults to 2 weeks; it sets how long an inaccessible object is preserved before pruning.

git config gc.worktreePruneExpire
Defaults to 3 months; it sets how long a stale working tree is preserved before being deleted.
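For illustration, these parameters can be set per repository with git config and read back afterwards. The values below are arbitrary examples, not recommendations, and the scratch repository is an assumption of this sketch.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Example values only; pick expiry windows that suit your workflow.
git config gc.reflogExpire "60 days"
git config gc.reflogExpireUnreachable "14 days"
git config gc.pruneExpire "1.week.ago"
git config gc.worktreePruneExpire "1.month.ago"

# Read the settings back from the repository's local configuration.
git config --get-regexp '^gc\.'
```

Adding --global to the git config calls would apply the same settings to every repository of the current user instead of just this one.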

Hands-on git garbage collection!!

When you start out with a repository, you most likely have loose objects. As their number grows, storing them individually becomes inefficient, so they are stored together in a pack file and are then called packed objects. Let’s play around with garbage collection to see how it works.

First, let’s initialize a new repository called “garbage-collections-repo”. In order to do that, you can simply create a folder and issue git init inside the folder.

% cd garbage-collections-repo 
% git init
Initialized empty Git repository in garbage-collections-repo/.git

Now, let’s create a file in the repository and add some data to it, then, commit it to the repo:

% git add .
% git commit -m "first file added"
[master (root-commit) 710785e] first file added
1 file changed, 780000 insertions(+)
create mode 100644 test_file.txt
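The step that creates test_file.txt is not shown in the listing above. One way to reproduce a comparable setup, assuming any large text file will do, is sketched below. The scratch directory, the filler text and the use of seq are assumptions, so file sizes and SHA-1 values will differ from the ones shown in this post.

```shell
repo=$(mktemp -d)      # stand-in for garbage-collections-repo
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Generate 780000 lines of filler text (the content is arbitrary).
seq 1 780000 | sed 's/^/sample line number /' > test_file.txt

git add .
git commit -q -m "first file added"

wc -l < test_file.txt
find .git/objects -type f
```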

Let’s have a look at how many objects we have in .git/objects folder:

% find .git/objects -type f
.git/objects/be/a84ad3b397bf51e91b665622dd86fd7c5c42b7
.git/objects/7c/4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3
.git/objects/71/0785ef7078f5b91bedfc373f01490aa1aa95b1

As you can see above, we have 3 objects and the SHA-1 for each object. With that information you can now check the type of the object and the contents of the object using git cat-file command:

% git cat-file -t bea84ad3b397bf51e91b665622dd86fd7c5c42b7
blob
% git cat-file -t 7c4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3
tree
% git cat-file -t 710785ef7078f5b91bedfc373f01490aa1aa95b1
commit

When we modify the file now and commit it again, Git will create a new blob object that contains the whole file, so the file is effectively duplicated every time we commit a change to it. To show you this, I have added a few lines to the file created in this repository and committed the change.

Getting the list of objects again, we can see three new objects: a commit, a tree and a blob.

% find .git/objects -type f
.git/objects/be/a84ad3b397bf51e91b665622dd86fd7c5c42b7
.git/objects/fc/548b4d62913128be722907c0020e96b01b4f24
.git/objects/7c/4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3
.git/objects/39/4206dfec163458463e58971a38f4219476dda3
.git/objects/ef/2827ecd79df3bfd08bceebe79f9bbb666f4e98
.git/objects/71/0785ef7078f5b91bedfc373f01490aa1aa95b1

This is the new tree object. The file content changed, so its blob has a new SHA-1, and the tree that references it had to be written again for the new commit:

% git cat-file -t ef2827ecd79df3bfd08bceebe79f9bbb666f4e98
tree

and we can see the blob object:

% git cat-file -t fc548b4d62913128be722907c0020e96b01b4f24
blob

You can also see all of this information with git cat-file --batch-check --batch-all-objects:

% git cat-file --batch-check --batch-all-objects
394206dfec163458463e58971a38f4219476dda3 commit 268
710785ef7078f5b91bedfc373f01490aa1aa95b1 commit 205
7c4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3 tree 41
bea84ad3b397bf51e91b665622dd86fd7c5c42b7 blob 27320049
ef2827ecd79df3bfd08bceebe79f9bbb666f4e98 tree 41
fc548b4d62913128be722907c0020e96b01b4f24 blob 27320181

As you can see from the output above, a new blob object, “fc548b4d”, has been created by the second commit for the same file. Each of these objects takes up space on disk; in this case we have two blob objects with pretty much the same content, each about 27 MB in size before compression.

This is a small example and you are unlikely to see the impact in a small repository. However, as the repository, its files and the project grow larger, this is likely to become a problem.

Now we will see how we can reduce the size of the repository by running git gc manually. Remember that Git runs it automatically after certain commands; however, it is good practice to check your repository and clean it up, or to tune git gc with the parameters shown above, so that the repository stays small and saves disk space.

To execute Git garbage collection, you just need to run the git gc command.

% git gc 
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 4 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0), pack-reused 0

Then we can check the objects again:

% git cat-file --batch-check --batch-all-objects
394206dfec163458463e58971a38f4219476dda3 commit 268
710785ef7078f5b91bedfc373f01490aa1aa95b1 commit 205
7c4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3 tree 41
bea84ad3b397bf51e91b665622dd86fd7c5c42b7 blob 27320049
ef2827ecd79df3bfd08bceebe79f9bbb666f4e98 tree 41
fc548b4d62913128be722907c0020e96b01b4f24 blob 27320181

From the output above you cannot really tell that the objects are no longer stored as individual object files, so we can check another way:

% find .git/objects/ -type f
.git/objects//pack/pack-029170b2e410e3fd9c202f13536b84e6ff7489c8.pack
.git/objects//pack/pack-029170b2e410e3fd9c202f13536b84e6ff7489c8.idx
.git/objects//info/commit-graph
.git/objects//info/packs

From the output above, you can see that there are no loose object files anymore; the directory now contains a pack file and a pack index file. The .idx file is a binary file that contains a sorted list of object names together with their offsets in the pack file, so that Git can locate an object inside the pack quickly. It should not be confused with the Git index described earlier, which records the list of paths and their object names and serves as the staging area for the next tree object to be committed.

The pack file is also a binary file, and it contains all the objects that we had before. Git has the ability to merge multiple objects into single files, called pack files. A pack file is basically multiple objects stored with an efficient delta compression scheme as a single compressed file; you can think of it as something like a zip file from which Git can extract objects efficiently when needed. You can see the content of the pack file with the following command:

% git verify-pack -v .git/objects/pack/pack-029170b2e410e3fd9c202f13536b84e6ff7489c8.pack 
394206dfec163458463e58971a38f4219476dda3 commit 268 176 12
710785ef7078f5b91bedfc373f01490aa1aa95b1 commit 205 134 188
fc548b4d62913128be722907c0020e96b01b4f24 blob 27320181 4932876 322
ef2827ecd79df3bfd08bceebe79f9bbb666f4e98 tree 41 51 4933198
7c4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3 tree 41 52 4933249
bea84ad3b397bf51e91b665622dd86fd7c5c42b7 blob 1422 720 4933301 1 fc548b4d62913128be722907c0020e96b01b4f24
non delta: 5 objects
chain length = 1: 1 object
.git/objects/pack/pack-029170b2e410e3fd9c202f13536b84e6ff7489c8.pack: ok

You can see that the blob bea84ad now has a totally different size compared to before (1422 bytes instead of 27320049) because it is stored as a delta against fc548b4d: the pack contains only the difference between the two versions rather than a duplicate of the entire file, which saves disk space. The last column of that line shows the SHA-1 of the base object that the delta is built on.
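A quick way to compare a repository before and after garbage collection is git count-objects -v, which reports loose objects under "count" and packed objects under "in-pack". The sketch below repeats the experiment on a smaller, made-up file in a scratch repository; the file name, content and identity are assumptions.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Two commits of a large, mostly identical file (filler content).
seq 1 200000 > big.txt
git add big.txt
git commit -q -m "first version"
echo "one more line" >> big.txt
git commit -q -am "second version"

echo "before gc:"
git count-objects -v
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

git gc -q

echo "after gc:"
git count-objects -v
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
```

The "size" and "size-pack" lines in the same output give the disk usage in KiB for loose and packed objects, which makes the saving from delta compression visible without inspecting the pack by hand.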

Final words

As said before, some git commands already trigger git gc, but it is still worth checking the size of your repository and how Git is storing the files, and running git gc to ensure you are not wasting disk space, especially when working on larger projects where disk space can become a problem.

I hope you enjoyed this post and feel free to comment and send me a message if you like it =D!!

Senior Devops Cloud Support Engineer at AWS