The Git's Guts
by Mikołaj Karebski @mkarebski
and Paweł Lipski @plipski
Also included: a bunch of prevalent misconceptions and nifty everyday tricks!
Contact - Mikołaj Karebski
Software Engineer (Kotlin, Golang)
mail: mkarebski@virtuslab.com
github: github.com/mkarebski
Contact - Paweł Lipski
mail: plipski@virtuslab.com
github: github.com/PawelLipski
Agenda
- Commits
- Objects
- Branches
- Tags
- Reflogs
- Garbage collection
- Contact & Questions
M
Commits
P
Commits
Question: what properties of a commit (other than message) can you think of?
P
Commits
Parent(s) of Commit
Number of parents | How it can be created |
---|---|
Zero | root commit(s!) of the repo |
One | well... just "git commit" |
Two | regular merge (with 1 branch) |
Three... or more (WTF?!) | octopus merge (with 2+ branches) |
P
Commits
Parent(s) of Commit
P
GitHub's Octocat, named after octopus merges
Commits
Parent(s) of Commit
$ git log -1 2cde51f
commit 2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
Merge: 7471c5c c097d5f 74c375c 04c3a85 5095f55 4f53477
2f54d2a 56d37d8 192043c f467a0f bbe5803 3990c51 d754fa9
516ea4b 69ae848 25c1a63 f52c919 111bd7b aafa85e dd407a3
71467e4 0f7f3d1 8778ac6 0406a40 308a0f3 2650bc4 8cb7a36
323702b ef74940 3cec159 72aa62b 328089a 11db0da e1771bc
f60e547 a010ff6 5e81543 58381da 626bcac 38136bd 06b2bd2
8c5178f 8e6ad35 008ef94 f58c4fc4 2309d67 5c15371 b65ab73
26090a8 9ea6fbc 2c48643 1769267 f3f9a60 f25cf34 3f30026
fbbf7fe c3e8494 e40e0b5 50c9697 6358711 0112b62 a0a0591
b888edb d44008b 9a199b8 784cbf8
Author: Mark Brown <[email redacted for privacy]>
Date: Thu Jan 2 13:01:55 2014 +0000
Merge remote-tracking branches [65 remote branch names]
P
Commits
Tree vs DAG (directed acyclic graph)
P
Commits
Prevalent Misconception #1
In the general case, the commit structure is not a tree.
Since a commit can have more than one parent, the structure is in fact a directed acyclic graph (DAG).
Only in the special case where the entire repository contains no merge commits do the commits still form a tree.
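To make the parent-count table and the tree-vs-DAG point concrete, here is a minimal Python sketch. The commit ids and the helper names `kind` and `has_no_merges` are invented for illustration and are not part of git.

```python
# Toy history: commit id -> list of parent ids. All ids are invented.
HISTORY = {
    "a": [],               # root commit
    "b": ["a"],            # plain "git commit"
    "c": ["a"],
    "d": ["a"],
    "m": ["b", "c"],       # regular merge (two parents)
    "o": ["b", "c", "d"],  # octopus merge (three or more parents)
}


def kind(parents):
    """Classify a commit by its number of parents."""
    count = len(parents)
    if count == 0:
        return "root"
    if count == 1:
        return "ordinary"
    if count == 2:
        return "merge"
    return "octopus merge"


def has_no_merges(history):
    """True when no commit has more than one parent, i.e. the DAG degenerates to a tree."""
    return all(len(parents) <= 1 for parents in history.values())


for commit_id, parents in HISTORY.items():
    print(commit_id, kind(parents))
print("tree-shaped:", has_no_merges(HISTORY))
```

As soon as a single merge commit appears, `has_no_merges` returns False and the history can only be described as a DAG.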
P
Commits
Committer vs Author
Author + author date: set only once, when the commit is first created
Committer + commit date: updated every time the commit is "rewritten": amend, rebase, cherry-pick, ...
P
Commits
Nifty Everyday Trick #1
$ git log --pretty=fuller
commit 1ff36a94530ed96ae9cf41147922985337555f10 (HEAD -> some-branch, origin/some-branch)
Author: Pawel Lipski <plipski@virtuslab.com>
AuthorDate: Thu Jan 24 19:00:38 2019 +0100
Commit: Someone Else <selse@virtuslab.com>
CommitDate: Sat Jan 26 01:41:48 2019 +0100
Craft a bunch of nifty hacks
commit 17cbd52bd16e89d96d10e51558ffb45351f17cd8 (develop)
Merge: 3c6020295 08b753152
Author: Someone Else <selse@virtuslab.com>
AuthorDate: Fri Jan 25 11:31:21 2019 +0000
Commit: Pawel Lipski <plipski@virtuslab.com>
CommitDate: Fri Jan 25 11:31:21 2019 +0000
P
Commits
Prevalent Misconception #2
The committer and author names and emails are not verified in any way.
Users can specify basically any author and committer; it is just a matter of setting the right git config.
There are other mechanisms for verifying authorship (signed tags/commits)... but remember that SHA-1 was SHAttered in Feb 2017 :/
P
Commits
Prevalent Misconception #2
P
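# globally, for every repository of the current user
git config --global user.name "John Doe"
git config --global user.email "john@doe.org"
# for the given repository
git config user.name "John Doe"
git config user.email "john@doe.org"
# per operation (--author overrides the author only; the committer still comes from the config)
....
git commit --author="John Doe <john@doe.org>" --no-edit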
Commits
Committer vs Author
P
Objects
P
- Commits
- Trees
- Blobs
- Tags
Objects
The Real Guts of Git
P
Objects
P
Remember, Linus Torvalds is primarily an OS/filesystem guy!
The underlying git storage is basically a very specialized FS... concepts like files, directories, symbolic links and file permissions are all reflected to some extent.
.git folder contents
Objects
.git folder contents
P
Someone really sucks at naming stuff...
Objects
.git folder contents
Is called | Should rather be called |
---|---|
.git/refs/heads/ | .git/refs/local_branches/ |
.git/refs/remotes/ | .git/refs/remote_branches/ |
.git/HEAD | .git/refs/HEAD (???) |
.git/logs/ | .git/reflogs/ |
.git/objects/, .git/index | that's ok :) |
P
Objects
Internal structure
Objects are basically deflated/zlib-compressed text (for commits) or binary data (for trees/blobs)...
...../.git/objects/xx/[a-f0-9]{38}
eg.:
...../.git/objects/c1/70510a828fd2c6d35f943b2a27b51605e5a450
M
Objects
Internal structure
DIY trick (available out of the box on most Linux distros):
$ pigz -d < .git/objects/c1/70510a828fd2c6d35f943b2a27b51605e5a450
commit 261<zero-byte>tree 3ee08f945d2d00b1be1c02e99bfd907eaa03ca19
parent d9229b06110638b0cc9c3dd143324d32c51229f8
author Pawel Lipski <pawel.p.lipski@gmail.com> 1552768066 +0100
committer Pawel Lipski <pawel.p.lipski@gmail.com> 1553727732 +0100
Migrate codebase to Python 3 (#35)
M
Objects
Internal structure
The SHA-1 hash that identifies each object is computed over the uncompressed contents (header included), though:
pigz -d < .git/objects/c1/70510a828fd2c6d35f943b2a27b51605e5a450
<some contents...>
pigz -d < .git/objects/c1/70510a828fd2c6d35f943b2a27b51605e5a450 | sha1sum
c170510a828fd2c6d35f943b2a27b51605e5a450 -
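The same steps as the pigz pipeline above can be sketched in Python with only the standard library. The function names `read_loose` and `loose_path` are made up for this sketch, and the sample object is built in memory instead of being read from a real repository.

```python
import hashlib
import zlib


def read_loose(stored: bytes):
    """Inflate a loose object and return (object id, type, body)."""
    raw = zlib.decompress(stored)
    # Layout of the inflated bytes: "<type> <decimal size>" + zero byte + body
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    # The id is the SHA-1 of the inflated bytes, header included
    object_id = hashlib.sha1(raw).hexdigest()
    return object_id, obj_type, body


def loose_path(object_id: str) -> str:
    """First two hex digits name the directory, the remaining 38 name the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


# Stand-in for the bytes of a file under .git/objects (contents are invented)
stored = zlib.compress(b"blob 5\x00hello")
object_id, obj_type, body = read_loose(stored)
print(obj_type, body, loose_path(object_id))
```

Hashing the compressed bytes instead would give a different value, which is why the slide pipes the inflated output into sha1sum.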
M
Objects
Back to commits
Commits are just objects stored in .git/objects!
To view object contents in human-readable form, use the plumbing command git cat-file -p <object-hash>...
$ git cat-file -p c170510a
tree 3ee08f945d2d00b1be1c02e99bfd907eaa03ca19
parent d9229b06110638b0cc9c3dd143324d32c51229f8
author Pawel Lipski <pawel.p.lipski@gmail.com> 1552768066 +0100
committer Pawel Lipski <pawel.p.lipski@gmail.com> 1553727732 +0100
Migrate codebase to Python 3 (#35)
Note the parent commit hash(es) and the tree hash (3ee08f94)...
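Since a commit body is plain text, pulling out the tree, the parent(s), the author and the committer takes only a few lines. The sketch below is a simplified reader, not git's own parser: `parse_commit` is a hypothetical helper, it ignores any other headers, and it assumes the header block is separated from the message by one blank line.

```python
def parse_commit(text: str) -> dict:
    """Split a commit body into its header fields and the message."""
    headers, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)   # zero, one or many parents
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


# The commit shown on the slide, with the blank line before the message
SAMPLE = (
    "tree 3ee08f945d2d00b1be1c02e99bfd907eaa03ca19\n"
    "parent d9229b06110638b0cc9c3dd143324d32c51229f8\n"
    "author Pawel Lipski <pawel.p.lipski@gmail.com> 1552768066 +0100\n"
    "committer Pawel Lipski <pawel.p.lipski@gmail.com> 1553727732 +0100\n"
    "\n"
    "Migrate codebase to Python 3 (#35)\n"
)

commit = parse_commit(SAMPLE)
print(commit["tree"], commit["parents"])
```

The author and committer timestamps differ in this sample, which is the author date vs commit date distinction from the earlier slides showing up in the raw object.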
M
Objects
Trees
$ git cat-file -p 3ee08f94
100644 blob 88d4bacf2148a890a659544ba4c71293bc40ea6b .gitignore
100644 blob 8b2fda574589bb659e8ad17ffcd27a71977f226c .stestr.conf
100644 blob a7e2d1f420c6422ef20f9a22324ba29f9e1381f7 .travis.yml
100644 blob b06b13a98898da336b8273273a726d14840ad829 ISSUE_TEMPLATE.md
100644 blob 31da76bac0577cef7711016ab298847c5e403138 LICENSE
100644 blob 22ee63f4ab01807b3e336fec3ed3fe50fe9271cb Makefile
100644 blob 101276cc905258d268086aba8066a1369ecfc6e6 README.md
100644 blob 5e9e418b32df4cac805d1a125872c9b1f48ebfce RELEASE_NOTES.md
040000 tree ce13624259dbf791569f8a41526a6a54fa868ac7 completion
040000 tree bcbd4adea0a0c171bfb834072cf187fac9fe33aa git_machete
040000 tree 9d938d03bd0f0fbf7b357abef7710b214c9f29ea hook_samples
.....
M
Objects
Trees
Trees group blobs and subtrees together and give them names: a blob holds only file contents, so the filename and mode live in the tree entry that points to it.
tree == snapshot
(tree != changeset)
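Unlike commits, a raw tree body is binary: each entry is the mode and name as text, a zero byte, and then the 20 raw bytes of the SHA-1. The sketch below encodes two entries from the listing above and parses them back. `encode_entry` and `parse_tree` are illustrative names, and the raw form stores a directory mode as 40000 (git cat-file -p pads it to 040000 for display).

```python
def encode_entry(mode: str, name: str, object_id: str) -> bytes:
    """One raw tree entry: "<mode> <name>", zero byte, 20-byte binary SHA-1."""
    return f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(object_id)


def parse_tree(body: bytes):
    """Walk a raw tree body and return (mode, name, hex id) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        nul = body.index(b"\x00", pos)
        mode, name = body[pos:nul].decode().split(" ", 1)
        object_id = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, object_id))
        pos = nul + 21
    return entries


body = (
    encode_entry("100644", ".gitignore", "88d4bacf2148a890a659544ba4c71293bc40ea6b")
    + encode_entry("40000", "completion", "ce13624259dbf791569f8a41526a6a54fa868ac7")
)
for mode, name, object_id in parse_tree(body):
    print(mode, name, object_id)
```

Each entry names either a blob (a file) or another tree (a subdirectory), so a single root tree hash pins down a whole snapshot.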
M
Objects
Prevalent Misconception #3
Git in principle does not store commits as changesets - even though that's what you see in diff/log!
Git generally stores snapshots.
Changesets (deltas) are only used for optimization (packs/packfiles) in long-term storage and also generated on the fly when pushing/pulling.
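A rough sketch of how a changeset can be derived on the fly from two snapshots: if each snapshot is reduced to a mapping from path to blob id, the diff is just a comparison of the two mappings. The paths echo the demo repository used in this talk, the blob ids are invented placeholders, and `changeset` is a hypothetical helper rather than anything git exposes.

```python
def changeset(old: dict, new: dict) -> dict:
    """Compare two snapshots (path -> blob id) and report what changed."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(path for path in old.keys() & new.keys()
                      if old[path] != new[path])
    return {"added": added, "removed": removed, "modified": modified}


# Two snapshots; the blob ids are short made-up placeholders
first = {"greeting.txt": "blob-1"}
second = {"greeting.txt": "blob-2", "parting.txt": "blob-3"}

print(changeset(first, second))
```

Files whose blob id is identical in both snapshots are skipped without reading their contents, which is part of why storing full snapshots stays cheap.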
M
Objects
Nifty not-so-Everyday Trick #2
git init
echo 'hello world' > greeting.txt
git add greeting.txt
git commit -m 'initial commit'
git tag R1 -m R1
echo 'bye bye' > parting.txt
git add parting.txt
git commit -m 'added parting'
echo 'welcome' > greeting.txt
git add greeting.txt
M
Objects
Blobs
File contents prepended with the header
"blob" <space> <decimal-size> <zero-byte>
and then zlib-compressed as a whole
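The blob layout can be reproduced with the Python standard library. This is a minimal sketch with invented helper names (`blob_object`, `blob_id`); the input is the `hello world` line that the demo script earlier writes to greeting.txt.

```python
import hashlib
import zlib


def blob_object(contents: bytes) -> bytes:
    """Header "blob <decimal size>" + zero byte, followed by the file contents."""
    return b"blob " + str(len(contents)).encode() + b"\x00" + contents


def blob_id(contents: bytes) -> str:
    """The object id is the SHA-1 of the uncompressed header + contents."""
    return hashlib.sha1(blob_object(contents)).hexdigest()


contents = b"hello world\n"            # what echo 'hello world' writes
stored = zlib.compress(blob_object(contents))  # bytes kept on disk

print(blob_id(contents))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Note that neither the filename nor the file mode takes part in the hash, so two files with identical contents share a single blob.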
M
Objects
Prevalent Misconception #4
Even though git stores whole snapshots (rather than just diffs), it generally doesn't take a lot of space to keep the entire repository.
For example, Mozilla reduced their repository size from 12GB to 300MB when they switched from svn to git.
M
Objects
Tags
To be continued later...
M
Branches
P
Branches
Local
Pointers stored in .git/refs/heads
Should really be called .git/refs/local_branches
P
Branches
Local
P
Branches
Prevalent Misconception #5
mikolaj@mikolaj:~/repos/git_internal/.git/refs/heads$ ls
master
mikolaj@mikolaj:~/repos/git_internal/.git/refs/heads$ file master
master: ASCII text
mikolaj@mikolaj:~/repos/git_internal/.git/refs/heads$ cat master
95474fce125caefa931e066f702bccf2821b3fbd
P
Branches
Prevalent Misconception #5
Branches don't really contain commits (that's not Mercurial/SVN/...).
They just point to a commit specified by its SHA-1.
Since commits have their parent(s), those parents have their parents etc., for each branch we can find a set of commits reachable from the commit it points to.
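The "set of commits reachable from a branch" is an ordinary graph traversal over parent links. A small sketch with an invented history and invented branch names; `reachable` is a hypothetical helper.

```python
def reachable(parents: dict, tip: str) -> set:
    """All commits that can be reached from `tip` by following parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return seen


# Invented history: commit id -> parent ids
PARENTS = {"a": [], "b": ["a"], "c": ["b"], "x": ["b"]}
# A branch is nothing more than a name pointing at one commit
BRANCHES = {"master": "c", "feature": "x"}

for name, tip in BRANCHES.items():
    print(name, sorted(reachable(PARENTS, tip)))
```

Commits "a" and "b" show up under both branches: they are not copied into either, they are simply reachable from both tips.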
P
Branches
HEAD
P
Pointer to the current commit, stored in .git/HEAD
It holds either a reference to a branch or a bare commit SHA (detached HEAD)
$ cat .git/HEAD
ref: refs/heads/refactor/python-3
$ git checkout HEAD~1
Note: checking out 'HEAD~1'.
You are in 'detached HEAD' state. You can look around, ...
.....
$ cat .git/HEAD
d9229b06110638b0cc9c3dd143324d32c51229f8
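Telling the two states of .git/HEAD apart takes one string check. The sketch below uses the two file contents shown above; `parse_head` is an illustrative name, not a git API.

```python
def parse_head(content: str):
    """Return ("branch", ref name) or ("detached", commit SHA)."""
    content = content.strip()
    if content.startswith("ref: "):
        return ("branch", content[len("ref: "):])
    return ("detached", content)


print(parse_head("ref: refs/heads/refactor/python-3\n"))
print(parse_head("d9229b06110638b0cc9c3dd143324d32c51229f8\n"))
```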
Branches
Remote
Pointers stored in .git/refs/remotes
Should really be called .git/refs/remote_branches
The word remote typically denotes a remote repository, not a remote branch.
P
Branches
Prevalent Misconception #6
Remote branches (.git/refs/remotes) don't strictly reflect the current state of remote repository.
They simply store the state as of the latest fetch/pull.
Of course this still can be up to date if nothing has been modified (e.g. pushed) in the remote repository in the meantime.
P
Branches
Nifty Everyday Trick #3
Remove from the local repo the remote branches that no longer exist in the remote repo!
Nothing is removed from the remote repo itself.
$ git fetch --prune
From bitbucket.org:your-org/your-repo
- [deleted] (none) -> origin/your-old-branch
- [deleted] (none) -> origin/your-other-old-branch
- [deleted] (none) -> origin/someone-elses-branch-you-only-checked-out-to-do-review
P
Tags
M
Tags
Tags are fixed pointers, while branches are moving pointers.
Lightweight
Lightweight tags are stored as plain references:
the file .git/refs/tags/<tag_name> contains the hash of the tagged commit
M
Tags
Annotated
Annotated tags are stored as objects (zlib-compressed, similar to commits) in .git/objects
As with lightweight tags, a pointer is stored in the .git/refs/tags directory, but it points to the annotated tag object rather than directly to the commit.
M
Reflogs
M
Reflogs
git log walks the acyclic graph of commits by following parent references.
git reflog lists every commit that the given reference (a branch or HEAD) has pointed to, newest first, for as long as the entries have not expired.
The reflogs are stored in .git/logs and are mostly used to recover after a git reset or git checkout goes wrong.
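The files under .git/logs are plain text, one line per movement of the reference: old SHA, new SHA, who moved it, a timestamp with timezone, then a tab and the message. The sketch below parses one such line. The identity and timestamp in the sample are made up, and `parse_reflog_line` is a hypothetical helper.

```python
def parse_reflog_line(line: str) -> dict:
    """Split one line of a file under .git/logs into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    identity, timestamp, timezone = rest.rsplit(" ", 2)
    return {"old": old, "new": new, "identity": identity,
            "timestamp": int(timestamp), "timezone": timezone,
            "message": message}


# Sample line: an all-zero old SHA marks the creation of the reference.
# The identity and timestamp below are invented for illustration.
LINE = (
    "0000000000000000000000000000000000000000 "
    "888bef1b83b1990a7039e0f0c20e7f82cf946637 "
    "Jane Doe <jane@example.com> 1548856051 +0100"
    "\tcommit (initial): Initial shopping list\n"
)

entry = parse_reflog_line(LINE)
print(entry["new"][:7], entry["message"])
```

The file lists entries oldest first, so git reflog effectively shows it in reverse, with HEAD@{0} being the last line.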
M
Reflogs
mikolaj@pop-os:~/git-prez/repo1$ echo "Milk" >> shopping.txt
mikolaj@pop-os:~/git-prez/repo1$ echo "Bananas" >> shopping.txt
mikolaj@pop-os:~/git-prez/repo1$ git add .
mikolaj@pop-os:~/git-prez/repo1$ git commit -m "Initial shopping list"
[master (root-commit) 888bef1] Initial shopping list
1 file changed, 2 insertions(+)
create mode 100644 shopping.txt
mikolaj@pop-os:~/git-prez/repo1$ git log
commit 888bef1b83b1990a7039e0f0c20e7f82cf946637 (HEAD -> master)
Author: Mikolaj Karebski <mikolaj.karebski@tesco.com>
Date: Wed Jan 30 14:47:31 2019 +0100
Initial shopping list
mikolaj@pop-os:~/git-prez/repo1$ git reflog ######### shorthand for: git reflog HEAD
888bef1 (HEAD -> master) HEAD@{0}: commit (initial): Initial shopping list
M
Reflogs
mikolaj@pop-os:~/git-prez/repo1$ echo "Oranges" >> shopping.txt
mikolaj@pop-os:~/git-prez/repo1$ echo "Chocolate" >> shopping.txt
mikolaj@pop-os:~/git-prez/repo1$ git add .
mikolaj@pop-os:~/git-prez/repo1$ git commit -m "Add Oranges & chocolate to shopping list"
[master 276d1b0] Add Oranges & chocolate to shopping list
1 file changed, 2 insertions(+)
mikolaj@pop-os:~/git-prez/repo1$ git log
commit 276d1b037e7287ba7448c968b9763afa4e3654cf (HEAD -> master)
Author: Mikolaj Karebski <mikolaj.karebski@tesco.com>
Date: Wed Jan 30 14:49:49 2019 +0100
Add Oranges & chocolate to shopping list
commit 888bef1b83b1990a7039e0f0c20e7f82cf946637
Author: Mikolaj Karebski <mikolaj.karebski@tesco.com>
Date: Wed Jan 30 14:47:31 2019 +0100
Initial shopping list
mikolaj@pop-os:~/git-prez/repo1$ git reflog
276d1b0 (HEAD -> master) HEAD@{0}: commit: Add Oranges & chocolate to shopping list
888bef1 HEAD@{1}: commit (initial): Initial shopping list
M
Reflogs
mikolaj@pop-os:~/git-prez/repo1$ git rebase -i --root ######## <-- squash
[detached HEAD 5cbaa64] Initial shopping list
Date: Wed Jan 30 14:47:31 2019 +0100
1 file changed, 4 insertions(+)
create mode 100644 shopping.txt
Successfully rebased and updated refs/heads/master.
mikolaj@pop-os:~/git-prez/repo1$ git log
commit 5cbaa64b345631d02b7166261bcd9bb061ccd8b2 (HEAD -> master)
Author: Mikolaj Karebski <mikolaj.karebski@tesco.com>
Date: Wed Jan 30 14:47:31 2019 +0100
Initial shopping list
Add Oranges & chocolate to shopping list
mikolaj@pop-os:~/git-prez/repo1$ git reflog
5cbaa64 (HEAD -> master) HEAD@{0}: rebase -i (finish): returning to refs/heads/master
5cbaa64 (HEAD -> master) HEAD@{1}: rebase -i (squash): Initial shopping list
255e373 HEAD@{2}: rebase -i (pick): Initial shopping list
251a141 HEAD@{3}: rebase -i (pick): Initial shopping list
fd86ee4 HEAD@{4}: rebase -i (start): checkout fd86ee436ed1d3b655d4edb62239afe9f77f66a7
276d1b0 HEAD@{5}: commit: Add Oranges & chocolate to shopping list
888bef1 HEAD@{6}: commit (initial): Initial shopping list
M
Reflogs
mikolaj@pop-os:~/git-prez/repo1$ git reset --hard 276d1b0
HEAD is now at 276d1b0 Add Oranges & chocolate to shopping list
mikolaj@pop-os:~/git-prez/repo1$ git log
commit 276d1b037e7287ba7448c968b9763afa4e3654cf (HEAD -> master)
Author: Mikolaj Karebski <mikolaj.karebski@tesco.com>
Date: Wed Jan 30 14:49:49 2019 +0100
Add Oranges & chocolate to shopping list
commit 888bef1b83b1990a7039e0f0c20e7f82cf946637
Author: Mikolaj Karebski <mikolaj.karebski@tesco.com>
Date: Wed Jan 30 14:47:31 2019 +0100
Initial shopping list
mikolaj@pop-os:~/git-prez/repo1$ cat shopping.txt
Milk
Bananas
Oranges
Chocolate
M
Reflogs
mikolaj@pop-os:~/git-prez/repo1$ git reflog
276d1b0 (HEAD -> master) HEAD@{0}: reset: moving to 276d1b0
5cbaa64 HEAD@{1}: rebase -i (finish): returning to refs/heads/master
5cbaa64 HEAD@{2}: rebase -i (squash): Initial shopping list
255e373 HEAD@{3}: rebase -i (pick): Initial shopping list
251a141 HEAD@{4}: rebase -i (pick): Initial shopping list
fd86ee4 HEAD@{5}: rebase -i (start): checkout fd86ee436ed1d3b655d4edb62239afe9f77f66a7
276d1b0 (HEAD -> master) HEAD@{6}: commit: Add Oranges & chocolate to shopping list
888bef1 HEAD@{7}: commit (initial): Initial shopping list
M
Reflogs
Prevalent Misconception #7
Commit amend, rebase, cherry-pick etc. don't really modify any history.
They simply create a brand new history based on the existing one.
The old history is still available via the reflogs until the entries expire and are GC'ed (by default after 90 days, or 30 days for entries no longer reachable from the current tip of the ref).
M
Garbage collection
M
Garbage collection
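git gc removes unreachable objects: those not reachable from any branch, tag or reflog entry (by default only once they are older than 2 weeks).
Loose objects are a different concept: objects stored as individual files under .git/objects rather than inside a packfile.
GC also packs loose objects into packfiles (where they can also be stored as deltas, not only as snapshots!) and expires old reflog entries.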
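The reachability rule behind pruning can be sketched as a mark phase: start from every branch tip and every reflog entry, mark everything reachable, and treat the rest as garbage. The commit ids below come from the shopping-list session earlier (intermediate rebase commits are left out for brevity); `garbage` is an invented helper and the sketch ignores trees, blobs, tags and grace periods.

```python
def garbage(parents: dict, roots: set) -> set:
    """Commits not reachable from any root (branch tips, reflog entries)."""
    live = set()
    stack = list(roots)
    while stack:
        commit = stack.pop()
        if commit in live:
            continue
        live.add(commit)
        stack.extend(parents[commit])
    return set(parents) - live


# State after "git reset --hard 276d1b0": commit id -> parent ids
PARENTS = {
    "888bef1": [],
    "276d1b0": ["888bef1"],
    "5cbaa64": [],          # the squashed commit, no longer on master
}
BRANCH_TIPS = {"276d1b0"}
REFLOG_ENTRIES = {"888bef1", "276d1b0", "5cbaa64"}

print(garbage(PARENTS, BRANCH_TIPS | REFLOG_ENTRIES))  # set()
print(garbage(PARENTS, BRANCH_TIPS))                   # {'5cbaa64'}
```

While the reflog still mentions the squashed commit it stays alive; once that entry expires, the commit becomes unreachable and a later gc is free to remove it.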
M
Questions
M
git machete
P