Git Internals

So what does "git" even mean?

The name "git" was given by Linus Torvalds when he wrote the very first version. He described the tool as "the stupid content tracker" and the name as (depending on your mood):

1. random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of "get" may or may not be relevant.

2. stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.

3. "global information tracker": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.

4. "goddamn idiotic truckload of sh*t": when it breaks

Why learn more about git?

  • Git is a tool you'll be using every day
  • Git's architecture is quite beautiful: it's an excellent example that can inform your design of everything from data structures to database schemas
  • "De-mystify" an area of perceived complexity
  • Major nerd cred

Your working directory

  • In git parlance, we'll refer to your project's files and folders as the working directory or working copy
    • That is, everything outside of the .git directory
  • When you run git commands, git uses the contents of the project's .git directory to rewrite your working directory so that it matches the project as it was at the chosen point in time

The .git directory

  • Whenever you run git init, you create a .git directory; the folder that contains it becomes the root of the working copy it controls
  • The .git folder is not magical - let's explore it now!
.git/
    HEAD
    config      
    hooks/     
    objects/
    branches/
    description
    info/
    refs/
        heads/
            master
        tags/
// We're only going to care about these for now

.git/
    HEAD
    objects/
    refs/
        heads/
            master

// there will also be a file called index
// which will be important, but that's it!

Git Objects

  • All Git Objects are stored in the objects/ directory inside your .git directory
  • A Git Object itself is a directory + a file
    • The name of the directory is the first two characters of the object's SHA1 hash (computed over a short header plus the object's contents)
    • The name of the file is the remaining 38 characters of the SHA1 hash
  • The Git Objects we'll work with represent one of three things - a commit, a tree (a directory in your project), or a blob (the contents of a file in your project)
    • Git has a fourth object type, the annotated tag, which we won't cover here
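
As a rough illustration of the naming scheme above, this Python sketch computes an object's hash and the path it would be stored at. The helper names `object_id` and `object_path` are made up for this example; the `<type> <size>` plus NUL byte header is what Git prepends to the contents before hashing.

```python
import hashlib


def object_id(obj_type: str, content: bytes) -> str:
    """Hash a Git-style object: '<type> <size>' + NUL byte + content."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def object_path(oid: str) -> str:
    """First two hex characters name the directory, the other 38 the file."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


empty_blob = object_id("blob", b"")
print(empty_blob)               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(object_path(empty_blob))  # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the name is derived from the contents, two files with identical contents map to the same object.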

git add (step 1)

  • Creates a new blob object in the objects directory
  • The content of the file in the objects directory, the blob, is just a zlib-compressed copy of your file's contents (prefixed with a short header)
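
A minimal sketch of this first step, assuming a "loose" object layout and writing into a throwaway directory instead of a real repository. `write_blob` is a hypothetical helper, not a Git API.

```python
import hashlib
import os
import tempfile
import zlib


def write_blob(git_dir: str, content: bytes) -> str:
    """Store content as a blob under git_dir/objects and return its hash."""
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    oid = hashlib.sha1(store).hexdigest()
    folder = os.path.join(git_dir, "objects", oid[:2])
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, oid[2:])
    if not os.path.exists(path):  # identical content is only stored once
        with open(path, "wb") as f:
            f.write(zlib.compress(store))
    return oid


# Try it against a temporary directory standing in for .git/
with tempfile.TemporaryDirectory() as tmp:
    oid = write_blob(tmp, b"console.log('hi');\n")
    print(oid, os.listdir(os.path.join(tmp, "objects")))
```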

git cat-file -p {hash}

  • Allows you to view the contents of a git object
  • The -p stands for "pretty-print": git formats the output based on the object's type
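
To make the "view the contents" idea concrete, here is a hedged sketch of unpacking a loose object by hand. `read_object` is a hypothetical name, and unlike the real command it does no type-specific formatting and does not handle objects stored in packfiles.

```python
import zlib


def read_object(compressed: bytes):
    """Return (type, body) for the raw bytes of a loose object file."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\0")  # header ends at the first NUL
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body


# Stand-in for the bytes of a file under .git/objects/
sample = zlib.compress(b"blob 6\0hello\n")
print(read_object(sample))  # ('blob', b'hello\n')
```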

git add (step 2)

  • Creates an entry for the file in the index 
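  • Each entry in the index records an added file's path and the hash of its blob (plus some metadata, such as the file mode); the file itself is binary, so the listing below is schematic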
  • The index will be used when you commit
foo.js 4f83hd2...
bar.js d83heud....

git ls-files -s

  • View the contents of the index
  • The -s stands for stage, because you're listing the files that have been staged
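
One way to picture the index is as a mapping from path to (mode, hash). The sketch below prints such a mapping in the `<mode> <hash> <stage>` tab `<path>` layout that `git ls-files -s` uses; the dictionary model and `format_index` are assumptions for illustration, since the real index is a binary file.

```python
def format_index(index: dict) -> str:
    """Render {path: (mode, hash)} in the layout of `git ls-files -s`."""
    lines = []
    for path in sorted(index):
        mode, oid = index[path]
        stage = 0  # 0 is the normal stage; other values appear during conflicts
        lines.append(f"{mode} {oid} {stage}\t{path}")
    return "\n".join(lines)


# Hypothetical staged files; both reuse the empty blob's hash.
EMPTY = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
index = {"foo.js": ("100644", EMPTY), "bar.js": ("100644", EMPTY)}
print(format_index(index))
```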

git commit (step 1)

  • Creates a tree based on the index, which represents your project in its current state
  • A tree is another git object
    • It basically represents a directory in your project (including your root directory)
    • The contents of a tree file are a list of references to blobs (your files) or trees (nested directories)
100644 blob 2e65efe2a145dda7ee51d1741299f848e5bf752e foo.txt
100644 blob 56a6051ca2b02b04ef92d5150c9ef600403cb1de bar.txt
040000 tree dda7eeca2b02b04145dda7eec9ef600403cb1def baz
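
The listing above is the human-readable view. On disk, each tree entry is stored as `<mode> <name>`, a NUL byte, then the 20 raw bytes of the hash. The sketch below serializes entries that way; `tree_body` and `tree_id` are hypothetical helpers, and the sort is simplified to plain name order (Git compares directory names slightly differently).

```python
import hashlib


def tree_body(entries):
    """Serialize (mode, name, hex_hash) entries into a tree object's body."""
    body = b""
    # Simplification: sort by name only.
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
    return body


def tree_id(entries) -> str:
    """Hash the body with a 'tree <size>' header, like any other object."""
    body = tree_body(entries)
    header = f"tree {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


print(tree_id([]))  # the empty tree: 4b825dc642cb6eb9a060e54bf8d69288fbee4904
entries = [
    ("100644", "foo.txt", "2e65efe2a145dda7ee51d1741299f848e5bf752e"),
    ("100644", "bar.txt", "56a6051ca2b02b04ef92d5150c9ef600403cb1de"),
]
print(tree_id(entries))
```

In the stored form a nested directory uses mode `40000`; the pretty-printed view pads it to `040000`.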

git commit (step 2)

  • Creates a commit object
  • Contains the hash of a tree (your project root), some metadata (author, committer, timestamps), and your message!
tree ffe298c3ce8bb07326f888907996eaa48d266db4
author Zeke <zeke@fullstackacademy.com> 1424798436 -0500
committer Zeke <zeke@fullstackacademy.com> 1424798436 -0500

a1
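
A commit object is plain text, so it can be assembled with string formatting. This is a sketch under that assumption; `commit_body` and `commit_id` are hypothetical names, and the values are the ones from the example above.

```python
import hashlib


def commit_body(tree: str, who: str, when: str, message: str) -> bytes:
    """Build the text of a first (parentless) commit."""
    lines = [
        f"tree {tree}",
        f"author {who} {when}",
        f"committer {who} {when}",
        "",  # a blank line separates the headers from the message
        message,
    ]
    return ("\n".join(lines) + "\n").encode()


def commit_id(body: bytes) -> str:
    """Hash the body with a 'commit <size>' header, like any other object."""
    header = f"commit {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


body = commit_body(
    "ffe298c3ce8bb07326f888907996eaa48d266db4",
    "Zeke <zeke@fullstackacademy.com>",
    "1424798436 -0500",
    "a1",
)
print(body.decode())
print(commit_id(body))
```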

git commit (step 3)

  • Points the current branch to your commit object
  • The HEAD file says which file in refs/heads represents your current branch
  • And then that file holds the hash of your commit object
HEAD:

ref: refs/heads/master

refs/heads/master:

87d7fgb3bfd8sfb3bfd8db2ndf...
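
Following the two files by hand looks roughly like the sketch below. `resolve_head` is a hypothetical helper that only understands loose ref files (it ignores packed refs), and it runs against a fake layout in a temporary directory.

```python
import os
import tempfile


def resolve_head(git_dir: str):
    """Return the commit hash HEAD leads to, or None if there is none yet."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        return head  # "detached" HEAD: the file holds a commit hash directly
    ref_path = os.path.join(git_dir, *head[len("ref: "):].split("/"))
    if not os.path.exists(ref_path):
        return None  # the branch has no commits yet
    with open(ref_path) as f:
        return f.read().strip()


# Build a tiny fake .git layout to try it out.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "refs", "heads"))
    with open(os.path.join(tmp, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    print(resolve_head(tmp))  # None: master does not point anywhere yet
    with open(os.path.join(tmp, "refs", "heads", "master"), "w") as f:
        f.write("774b54a193d6cfdd081e581a007d2e11f784b9fe\n")
    print(resolve_head(tmp))  # 774b54a193d6cfdd081e581a007d2e11f784b9fe
```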

git commit again!

  • What about when you commit a second time?
  • The only difference is that, when you create your commit object, you follow HEAD to find the previous commit and record its hash in the new commit object as its parent
tree ce72afb5ff229a39f6cce47b00d1b0ed60fe3556
parent 774b54a193d6cfdd081e581a007d2e11f784b9fe 
author Zeke <zeke@fullstackacademy.com> 1424798436 -0500
committer Zeke <zeke@fullstackacademy.com> 1424798436 -0500

a1
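
Because every commit names its parent, history is just a chain you can follow backwards. The sketch below walks that chain over an in-memory dictionary of commit texts; the short hashes and both function names are made up for illustration.

```python
def parents_of(commit_text: str):
    """Extract parent hashes from the header section of a commit's text."""
    headers = commit_text.split("\n\n", 1)[0]
    return [line.split(" ", 1)[1] for line in headers.splitlines()
            if line.startswith("parent ")]


def first_parent_history(commits: dict, start: str):
    """commits maps hash -> commit text; follow first parents to the root."""
    chain, current = [], start
    while current is not None:
        chain.append(current)
        parents = parents_of(commits[current])
        current = parents[0] if parents else None
    return chain


# Hypothetical two-commit history with short fake hashes.
commits = {
    "c1": "tree t1\nauthor A 1 +0000\ncommitter A 1 +0000\n\nfirst\n",
    "c2": "tree t2\nparent c1\nauthor A 2 +0000\ncommitter A 2 +0000\n\nsecond\n",
}
print(first_parent_history(commits, "c2"))  # ['c2', 'c1']
```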

git branch

  • Only one step: write the current commit's hash to refs/heads/{branchName}
  • That hash is the one HEAD currently resolves to (by way of the current branch's file)
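
Since a branch is just a file holding a hash, creating one can be sketched in a few lines. `create_branch` is a hypothetical helper; the caller is assumed to have already looked up the hash that HEAD resolves to.

```python
import os
import tempfile


def create_branch(git_dir: str, name: str, commit_hash: str) -> str:
    """Write commit_hash to refs/heads/<name> and return the file's path."""
    path = os.path.join(git_dir, "refs", "heads", name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(commit_hash + "\n")
    return path


# Try it in a temporary directory standing in for .git/
with tempfile.TemporaryDirectory() as tmp:
    ref = create_branch(tmp, "feature", "774b54a193d6cfdd081e581a007d2e11f784b9fe")
    with open(ref) as f:
        print(f.read().strip())
```

Note that nothing is copied: the new branch and the current one simply point at the same commit.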

git checkout {branchName}

  • Checking out a branch is quite easy now!
    • Step 1: get the commit that the branch points to in the refs/heads directory
    • Step 2: write the contents of that commit's tree to your working copy
    • Step 3: write the tree's file entries to the index
    • Step 4: point HEAD at the new branch
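
The four steps can be sketched against a toy in-memory repository. Everything here (the `repo` dictionary layout, the short ids, `flatten`, `checkout`) is an assumption made for illustration, not how Git stores things.

```python
def flatten(trees: dict, tree_id: str, prefix: str = "") -> dict:
    """Walk a tree recursively and return {path: blob_id}."""
    files = {}
    for name, (kind, oid) in trees[tree_id].items():
        path = prefix + name
        if kind == "tree":
            files.update(flatten(trees, oid, path + "/"))
        else:
            files[path] = oid
    return files


def checkout(repo: dict, branch: str) -> None:
    commit_id = repo["refs"][branch]                    # step 1
    tree_id = repo["commits"][commit_id]["tree"]
    index = flatten(repo["trees"], tree_id)
    repo["working_copy"] = {p: repo["blobs"][o] for p, o in index.items()}  # step 2
    repo["index"] = index                               # step 3
    repo["HEAD"] = f"ref: refs/heads/{branch}"          # step 4


repo = {
    "refs": {"master": "c1", "feature": "c2"},
    "commits": {"c1": {"tree": "t1"}, "c2": {"tree": "t2"}},
    "trees": {
        "t1": {"foo.txt": ("blob", "b1")},
        "t2": {"foo.txt": ("blob", "b1"), "baz": ("tree", "t3")},
        "t3": {"bar.txt": ("blob", "b2")},
    },
    "blobs": {"b1": "foo\n", "b2": "bar\n"},
}
checkout(repo, "feature")
print(repo["HEAD"], repo["index"])
```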

git merge

  • Say we've checked out a new branch, committed some changes, and now we want to merge those into master (and there are no conflicts)
  • Say we're on master, and we git merge our new branch
    • Master, in this case, is the receiver
    • The other branch would be the giver
  • An eight-step process follows
  • Note: this is just one of four scenarios that occur when merging

git merge (steps 1-4)

  • Step 1: write the hash of the giver to a file called MERGE_HEAD
  • Step 2: find the base commit (most recent common ancestor) between the giver and receiver
  • Step 3: generate the indices for the base, receiver, and giver commits
  • Step 4: generate a diff between the giver and receiver commits
    • Git does this by comparing the three indices
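
Step 2 can be sketched as a graph search over parent links. This is a simplified stand-in, assuming a small history held in a dictionary of `{commit: [parents]}`; real histories can have more than one candidate base, which this does not handle. The function names are hypothetical.

```python
from collections import deque


def ancestors(parents: dict, start: str) -> set:
    """Every commit reachable from start, including start itself."""
    seen, queue = set(), deque([start])
    while queue:
        commit = queue.popleft()
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, []))
    return seen


def merge_base(parents: dict, receiver: str, giver: str):
    """Nearest commit (by hops from giver) that receiver can also reach."""
    reachable = ancestors(parents, receiver)
    seen, queue = set(), deque([giver])
    while queue:
        commit = queue.popleft()
        if commit in reachable:
            return commit
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, []))
    return None


# a <- b <- c (receiver), and b <- d (giver)
history = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"]}
print(merge_base(history, "c", "d"))  # b
```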

git merge (steps 5-8)

  • Step 5: the changes indicated by the diff are applied to the working copy
  • Step 6: the changes from the diff are applied to the index
  • Step 7: the updated index is committed
    • Note: this commit will have two parents, which is totally fine, of course!
  • Step 8: the current branch is pointed to the new commit
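
The comparison of the three indices can be pictured as a per-path rule: if only one side changed a path relative to the base, take that side. The sketch below models each index as `{path: hash}`; `three_way` is a hypothetical helper, and the conflict list it returns covers a case outside the no-conflict scenario described here.

```python
def three_way(base: dict, receiver: dict, giver: dict):
    """Merge three {path: hash} indices; return (merged_index, conflicts)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(receiver) | set(giver)):
        b, r, g = base.get(path), receiver.get(path), giver.get(path)
        if r == g:
            result = r          # both sides agree (or both deleted it)
        elif r == b:
            result = g          # only the giver changed this path
        elif g == b:
            result = r          # only the receiver changed this path
        else:
            conflicts.append(path)  # both changed it in different ways
            continue
        if result is not None:  # None means the path was deleted
            merged[path] = result
    return merged, conflicts


base = {"a": "1", "b": "2"}
receiver = {"a": "1", "b": "3"}
giver = {"a": "4", "b": "2", "c": "5"}
print(three_way(base, receiver, giver))
# ({'a': '4', 'b': '3', 'c': '5'}, [])
```

The merged index is what step 7 commits, with the receiver's and giver's commits as its two parents.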

Summary

  • For the most part, git's architecture is as simple as moving references around - just like with any implementation of a tree or a graph
  • Hashing allows your data to be content addressable
  • By simply changing references to different commits with different trees with different nodes (which are never destroyed or mutated), you can easily move about the history of a project

Workshop

Your task

  • You'll implement several of the basic commands in Git, and create Git-like objects in Fullstack's own Fullstack Version System (FVS)
  • Spec: 2-fullstack-version-system/fvs.spec.js
  • Work primarily from 2-fullstack-version-system/fvs.js
  • No need to change anything in helpers/
    • I've done some work for you there, which you'll use when you implement commit

Resources/further reading

Git Internals

By Tom Kelly
