Hands-On Git Garbage Collection


In this post, I would like to explain how Git's garbage collector works. Before we jump into the topic, I will briefly explain what a garbage collector is, and we can move on from there.

Garbage collection is a well-known software practice: a garbage collector tries to reclaim memory occupied by objects that are no longer in use by the program.

Git's garbage collector runs a number of housekeeping tasks within the repository, such as compressing file revisions (to reduce disk space and increase performance) and removing unreachable objects, which may have been created from orphaned or inaccessible commits, for example when performing git reset or git rebase. In an effort to preserve history and avoid data loss, Git will not delete detached commits right away: they are kept as long as something such as the reflog still references them and their grace period has not expired.

Every object (blob, tree or commit) is identified by a SHA hash and, while it is a loose object, is stored in a path such as .git/objects/b1/<remaining characters of the SHA>, where b1 is the first two characters of the hash. My own repository, for example, holds multiple objects there, each named after the SHA of a commit, blob or tree.
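As a rough illustration of that layout, the sketch below writes one blob into a throwaway repository and derives the path of its loose object from the hash. The temporary directory and the sample content are placeholders of mine, not the repository discussed in this post.

```shell
set -eu
# Throwaway repository so nothing touches your real work.
repo=$(mktemp -d)
cd "$repo"
git init -q

# -w writes the content into the object store and prints its object ID.
id=$(echo 'example content' | git hash-object -w --stdin)

# A loose object lives under a directory named after the first two
# characters of the ID; the file name is the rest of the ID.
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
path=".git/objects/$dir/$file"

echo "object id:   $id"
echo "stored at:   $path"
echo "object type: $(git cat-file -t "$id")"
```

This only holds for loose objects; once objects are packed they no longer appear as individual files.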

It is generally recommended to run this task on a regular basis within each repository to keep disk usage low and Git operations fast.

A little bit of basics!

Git is a content-addressable file system: it takes content and stores it in files that are addressable by their hashes.
A repository is simply the database containing all the information needed to retain and manage the revisions and history of a project. Within a repository, Git maintains two primary data structures: the object store and the index.
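To make the "content-addressable" idea concrete, here is a minimal sketch that recomputes a blob's ID by hand. It assumes the default SHA-1 object format and that a sha1sum binary is available (on macOS, shasum would be the usual substitute); the sample content is arbitrary.

```shell
set -eu
content='hello gc'

# Ask Git which ID it would assign to this content as a blob.
git_id=$(printf '%s\n' "$content" | git hash-object --stdin)

# Recompute it manually: SHA-1 over the header "blob <size>" plus a NUL
# byte, followed by the content itself.
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')
manual_id=$( { printf 'blob %s\0' "$size"; printf '%s\n' "$content"; } | sha1sum | cut -d' ' -f1 )

echo "git:    $git_id"
echo "manual: $manual_id"
```

Because the ID depends only on the content, two files with identical content end up as a single blob in the object store.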

The object store is designed to be copied efficiently during a clone operation, as part of the mechanism that supports a distributed version control system.

The index is transitory information that is private to a repository and can be created or modified as needed. The index is also used as a staging area between your working directory and your repository. You can use it to build up a set of changes that you want to commit together: when you create a commit, what gets committed is what is currently in the index, while changes that were not staged stay in the working directory.
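The difference between the index and the working directory can be seen with a small experiment. This is only a sketch in a temporary repository; the file name app.txt and the user identity are made up for the example.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "version 1" > app.txt
git add app.txt              # "version 1" is now staged in the index
echo "version 2" > app.txt   # the working directory moves on, unstaged

# Read the staged blob through the index entry for app.txt.
staged_id=$(git ls-files --stage app.txt | cut -d' ' -f2)
staged=$(git cat-file -p "$staged_id")
working=$(cat app.txt)

git commit -q -m "commit what is staged"
committed=$(git show HEAD:app.txt)

echo "index:     $staged"
echo "worktree:  $working"
echo "committed: $committed"
```

The commit records the staged content, while the later edit is still only in the working directory.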

A blob is used to store file data; basically, this is the content of the file you modify.

A tree is basically a directory: it references other trees and blobs (sub-directories and files).

A commit object holds metadata for each change introduced in the repository, including the author, committer, commit date and log message.
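The three object types described above can be inspected with git cat-file. The following sketch creates one commit in a temporary repository and looks at the commit, its tree and the blob behind a file; notes.txt and the user identity are placeholders.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "some data" > notes.txt
git add notes.txt
git commit -q -m "add notes"

# Resolve the three objects that this single commit created.
commit_id=$(git rev-parse HEAD)
tree_id=$(git rev-parse 'HEAD^{tree}')
blob_id=$(git rev-parse HEAD:notes.txt)

commit_type=$(git cat-file -t "$commit_id")
tree_type=$(git cat-file -t "$tree_id")
blob_type=$(git cat-file -t "$blob_id")
echo "$commit_type $tree_type $blob_type"

# -p pretty-prints the content of each object.
git cat-file -p "$commit_id"   # tree, author, committer, message
git cat-file -p "$tree_id"     # one entry pointing at the blob
git cat-file -p "$blob_id"     # the file content itself
```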


Imagine a scenario where you are contributing code alongside many other developers, and you are responsible for a specific feature of the project. You start developing the feature and check it in. However, you realize that it is not working as you expected, so you make a few more modifications and check in again. After that check-in, you also realize that a few conditions are not met, so you modify the code again and check in for the third time.

Let’s say that, to keep your log free of commits with typos and small fixes, you run git reset --hard HEAD^ and throw out your last commit. When you do a reset, the commit you threw out is no longer reachable from your branch, but it is still in Git’s database (and still recorded in the reflog), waiting for a future git gc execution to clean it up once it has expired.
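The scenario above can be reproduced in a few lines. This is a sketch in a temporary repository with three made-up commits; it shows that the commit thrown away by the reset is still stored and still recorded in the reflog, even though no branch contains it.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Three check-ins of the same (hypothetical) feature file.
for n in 1 2 3; do
  echo "attempt $n" > feature.txt
  git add feature.txt
  git commit -q -m "feature attempt $n"
done

dropped=$(git rev-parse HEAD)
git reset -q --hard HEAD^      # throw away the last commit

# The object is still in the database...
dropped_type=$(git cat-file -t "$dropped")
# ...no branch contains it any more...
branches=$(git branch --contains "$dropped")
# ...but the reflog remembers where HEAD was before the reset.
previous_head=$(git rev-parse 'HEAD@{1}')

echo "type of dropped commit: $dropped_type"
echo "branches containing it: ${branches:-none}"
echo "HEAD@{1}:               $previous_head"
```

As long as that reflog entry exists, the commit can be recovered, for example with git reset --hard 'HEAD@{1}'.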

Git gc tries not to delete objects that are referenced anywhere in your repository. It keeps not only objects referenced by your current set of branches and tags, but also objects referenced by the index, remote-tracking branches, refs saved by git filter-branch in refs/original/, and reflogs (which may reference commits in branches that were later amended or rewound).

How does git gc actually work?

Before it runs, git gc checks several git config values, and these values determine the rest of the actions that git gc will perform.
Behind the scenes, git gc actually executes other commands such as git prune, git repack, git pack-refs and git rerere. The high-level responsibility of these commands is to identify any Git objects that fall outside the threshold levels set in the git gc configuration. Once identified, these objects are then compressed or pruned.

The git prune command is a utility that cleans up unreachable or orphaned Git objects. Unreachable objects are those that are inaccessible from any ref: for example, a commit that cannot be reached through a branch, a tag or the reflog is considered unreachable. Git prune runs as part of garbage collection, so it is often considered a child command of git gc.
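To see pruning in isolation, the sketch below creates an object that nothing references and then removes it. It assumes a temporary repository; the orphan blob is written directly with git hash-object so that no ref or reflog entry points at it, and --expire=now is used only to skip the usual grace period for the demonstration.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "kept" > kept.txt
git add kept.txt
git commit -q -m "reachable commit"

# Write a blob that no commit, tree, index entry or reflog refers to.
orphan=$(echo "nobody references me" | git hash-object -w --stdin)

# fsck reports it as unreachable.
unreachable=$(git fsck --unreachable 2>/dev/null)
echo "$unreachable"

# Dry run first (-n): list what would be removed without deleting anything.
would_prune=$(git prune -n --expire=now)
echo "would prune: $would_prune"

# Now actually prune and check whether the object still exists.
git prune --expire=now
if git cat-file -e "$orphan" 2>/dev/null; then
  still_there=yes
else
  still_there=no
fi
echo "orphan still present: $still_there"
```

In normal use you rarely need to call git prune yourself; git gc invokes it with the configured expiry.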

How do I set up git garbage collection?

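One thing that many engineers are not aware of is that Git runs git gc --auto automatically after several frequently used commands such as git pull, git merge, git rebase and git commit. In that mode it first checks whether housekeeping is needed (for example, whether there are too many loose objects or too many packs) and exits without doing anything if it is not. If you would like to change how garbage collection behaves for your repository, you can modify the following settings: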
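The automatic mode is governed by thresholds (gc.auto for loose objects and gc.autoPackLimit for packs), so on a small repository it usually does nothing. The sketch below, run in a temporary repository with a made-up file, counts loose objects before and after git gc --auto to show that; gc.autoDetach is switched off only so the command stays in the foreground.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git config gc.autoDetach false   # run in the foreground for this demo

echo "tiny" > tiny.txt
git add tiny.txt
git commit -q -m "tiny commit"

# Show the thresholds if they are set explicitly; otherwise Git uses defaults.
git config --get gc.auto          || echo "gc.auto not set (Git default applies)"
git config --get gc.autoPackLimit || echo "gc.autoPackLimit not set (Git default applies)"

before=$(git count-objects -v | awk '/^count:/ {print $2}')
git gc --auto
after=$(git count-objects -v | awk '/^count:/ {print $2}')
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')

echo "loose objects before: $before"
echo "loose objects after:  $after"
echo "packs:                $packs"
```

With only a handful of objects the thresholds are not reached, so the loose objects stay where they are; a plain git gc would pack them regardless.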

git config gc.reflogExpire
Default is 90 days and it is used to set how long records in a branch reflog should be kept.

git config gc.reflogExpireUnreachable
Default is 30 days and it is used to set how long inaccessible reflog records should be kept.

git config gc.pruneExpire
Default is 2 weeks and it sets how long an inaccessible object will be preserved before pruning.

git config gc.worktreePruneExpire
Default is 3 months and it sets how long a stale working tree will be preserved before being deleted.
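Running git config with only the key, as above, prints the current value (nothing if it is unset). To change a setting, pass a value as well. The values below are arbitrary examples chosen for illustration, not recommendations, and they are applied to a temporary repository.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# Example values only; pick retention periods that suit your project.
git config gc.reflogExpire "60 days ago"
git config gc.reflogExpireUnreachable "14 days ago"
git config gc.pruneExpire "1 week ago"
git config gc.worktreePruneExpire "1 month ago"

# List every gc.* setting stored for this repository.
git config --get-regexp '^gc\.'

prune_expire=$(git config --get gc.pruneExpire)
echo "gc.pruneExpire is now: $prune_expire"
```

Without --global these settings are written to the repository's own .git/config, so other repositories are not affected.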

Hands-on git garbage collection!!

When you start out with a repository, you most likely have only loose objects. As their number grows, storing them individually becomes inefficient, so Git stores them together in a pack file; these are called packed objects. Let’s play around with garbage collection to see how it works.

First, let’s initialize a new repository called “garbage-collections-repo”. To do that, simply create a folder and run git init inside it.

% cd garbage-collections-repo 
% git init
Initialized empty Git repository in garbage-collections-repo/.git
% git add .
% git commit -m "first file added"
[master (root-commit) 710785e] first file added
1 file changed, 780000 insertions(+)
create mode 100644 test_file.txt
% find .git/objects -type f
% git cat-file -t bea84ad3b397bf51e91b665622dd86fd7c5c42b7
% git cat-file -t 7c4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3
% git cat-file -t 710785ef7078f5b91bedfc373f01490aa1aa95b1
% find .git/objects -type f
% git cat-file -t ef2827ecd79df3bfd08bceebe79f9bbb666f4e98
% git cat-file -t fc548b4d62913128be722907c0020e96b01b4f24
% git cat-file --batch-check --batch-all-objects
394206dfec163458463e58971a38f4219476dda3 commit 268
710785ef7078f5b91bedfc373f01490aa1aa95b1 commit 205
7c4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3 tree 41
bea84ad3b397bf51e91b665622dd86fd7c5c42b7 blob 27320049
ef2827ecd79df3bfd08bceebe79f9bbb666f4e98 tree 41
fc548b4d62913128be722907c0020e96b01b4f24 blob 27320181
% git gc 
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 4 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0), pack-reused 0
% git cat-file --batch-check --batch-all-objects
394206dfec163458463e58971a38f4219476dda3 commit 268
710785ef7078f5b91bedfc373f01490aa1aa95b1 commit 205
7c4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3 tree 41
bea84ad3b397bf51e91b665622dd86fd7c5c42b7 blob 27320049
ef2827ecd79df3bfd08bceebe79f9bbb666f4e98 tree 41
fc548b4d62913128be722907c0020e96b01b4f24 blob 27320181
% find .git/objects/ -type f
% git verify-pack -v .git/objects/pack/pack-029170b2e410e3fd9c202f13536b84e6ff7489c8.pack 
394206dfec163458463e58971a38f4219476dda3 commit 268 176 12
710785ef7078f5b91bedfc373f01490aa1aa95b1 commit 205 134 188
fc548b4d62913128be722907c0020e96b01b4f24 blob 27320181 4932876 322
ef2827ecd79df3bfd08bceebe79f9bbb666f4e98 tree 41 51 4933198
7c4bf2db7246f042aaae85d8dd1a7cb6fdb6f9c3 tree 41 52 4933249
bea84ad3b397bf51e91b665622dd86fd7c5c42b7 blob 1422 720 4933301 1 fc548b4d62913128be722907c0020e96b01b4f24
non delta: 5 objects
chain length = 1: 1 object
.git/objects/pack/pack-029170b2e410e3fd9c202f13536b84e6ff7489c8.pack: ok
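
If you want to repeat the experiment without reading object IDs by hand, git count-objects gives a quick summary of loose versus packed objects. The script below is a sketch under my own assumptions: a temporary repository, a generated test_file.txt that is much smaller than the one in the transcript above, and two commits that touch the same file.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First version of the file, then a small change on top of it.
seq 1 2000 > test_file.txt
git add test_file.txt
git commit -q -m "first file added"
echo "one more line" >> test_file.txt
git commit -q -am "file modified"

# Two commits, two trees and two blobs are loose at this point.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

git gc -q

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
in_pack=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')

echo "loose before gc: $loose_before"
echo "loose after gc:  $loose_after"
echo "in pack:         $in_pack (in $packs pack)"

# Inspect the pack; the trailing lines summarize delta chains.
git verify-pack -v .git/objects/pack/*.pack | tail -n 3
```

As in the transcript, the objects themselves do not disappear: they move from individual files into a single pack, where similar blobs can be stored as deltas.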

Final words

As said before, some Git commands already trigger git gc. Still, it is a good idea to check the size of your repository and how Git is storing its files, and to run git gc to make sure you are not wasting disk space, especially when working on larger projects where disk space can become a problem.

I hope you enjoyed this post. Feel free to comment or send me a message if you liked it =D!!

Senior Devops Cloud Support Engineer at AWS