What I learned from running git.kernel.org

by Konstantin Ryabitsev

Running git.kernel.org

  • 300+ forks of the same repo
  • Replicating to 3 geo-distributed servers
  • Tweaking git-daemon to play nicely
    • RAM, disk and processors
    • Repack flags
    • Repo and bundles

300 forks of

git/torvalds/linux.git

Each linux.git is about 1.8GB

Forks are efficient... sorta

  • git clone --local
    • uses hardlinks
    • safe to use everywhere
    • saves space until next repack
  • git clone --shared
    • sets up objects/info/alternates
    • can result in repo corruption
    • saves lots of space
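
A minimal sketch of the two fork styles above, run against a throwaway repository. The directory names (mother, fork-local.git, fork-shared.git) and the demo identity are made up for illustration; they are not taken from kernel.org.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical "mother" repository with a single commit.
git init -q mother
echo "hello" > mother/README
git -C mother add README
git -C mother -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# --local: object files are hardlinked from the source where possible.
git clone -q --bare --local mother fork-local.git

# --shared: no objects are copied; the clone records the source's
# object directory in objects/info/alternates instead.
git clone -q --bare --shared mother fork-shared.git

# Show where the shared fork borrows its objects from.
cat fork-shared.git/objects/info/alternates
```

The hardlinked fork is self-contained, which is why it is safe everywhere. The shared fork depends on the mother's object store for as long as the alternates file is in place.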

git.kernel.org is 30GB

without alternates, 400GB

Using alternates carefully

  • Avoid object cleanups in the "mother" repository
    • "git repack -Ad" keeps unreachable objects as loose objects instead of deleting them
    • Do not run "git gc", just "git repack"
  • Object cleanups in "daughter" repositories are OK
    • "git gc" is okay and encouraged
    • "git repack -adl" to save space
  • Grandchildren are a recipe for disaster
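
The maintenance rules above can be sketched on a throwaway mother/daughter pair. Repository names and the demo identity are assumptions for illustration only.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
ident="-c user.name=demo -c user.email=demo@example.com"

# Hypothetical mother repository and a daughter that borrows its objects.
git init -q mother
git -C mother $ident commit -q --allow-empty -m "mother commit"
git clone -q --shared mother daughter
git -C daughter $ident commit -q --allow-empty -m "daughter commit"

# Mother: repack only, never gc. -A keeps unreachable objects around
# as loose objects, so daughters that still need them keep working.
git -C mother repack -A -d -q

# Daughter: -l packs only local objects, so nothing borrowed through
# alternates is copied in. Running gc here is fine.
git -C daughter repack -a -d -l -q
git -C daughter gc --quiet

# The daughter should still resolve the commit that lives in the mother.
git -C daughter fsck
git -C daughter cat-file -t HEAD~1
```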

Grokmirror

replicating git repos sanely

(it can be done!)

Grokmirror highlights

  • Works via git hooks
  • Creates a manifest file that replicas can pull
  • Updates only repos that changed
    • Runs "git remote update" for changes
    • Runs "git clone" for new repos
    • Prunes junk
  • Parallelized and very efficient
  • Will even fsck and repack for you

github.com/mricon/grokmirror
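
To make the manifest idea concrete, here is a rough sketch of the clone-or-update decision a replica makes. This is not grokmirror's code or its manifest format: the one-line-per-repo manifest.txt, the fingerprint function and the directory layout are all simplifications invented for this example.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
mkdir origin replica
ident="-c user.name=demo -c user.email=demo@example.com"

# Illustrative fingerprint: a hash over the repository's full ref list.
fingerprint() {
    git -C "$1" show-ref | git hash-object --stdin
}

# Origin side: write one "name fingerprint" line per repository.
write_manifest() {
    for repo in origin/*.git; do
        echo "$(basename "$repo") $(fingerprint "$repo")"
    done > manifest.txt
}

# Replica side: clone new repos, update changed ones, skip the rest.
sync_replica() {
    while read -r name fp; do
        if [ ! -d "replica/$name" ]; then
            git clone -q --mirror "$work/origin/$name" "replica/$name"
            echo "cloned $name"
        elif [ "$(fingerprint "replica/$name")" != "$fp" ]; then
            git -C "replica/$name" remote update >/dev/null 2>&1
            echo "updated $name"
        fi
    done < manifest.txt
}

# Demo: one hypothetical repository published on the origin.
git init -q src
git -C src $ident commit -q --allow-empty -m "first"
git clone -q --bare src origin/a.git

write_manifest
sync_replica    # new repo, so it gets cloned

git -C src $ident commit -q --allow-empty -m "second"
git -C src push -q "$work/origin/a.git" HEAD

write_manifest
sync_replica    # fingerprint changed, so it gets updated
```

The point of the design is that a replica only touches repositories whose manifest entry changed, instead of polling every repository on every run.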

Still adding features

  • Automatically recognize nearly-identical repos
    • And set up alternates when warranted
  • Recognize when running "git gc" is safe
  • Add "--dissociate" clone support (new in git 2.3)
  • Support both python 2.7 and 3.5
  • Other features upon request

git-daemon tweaks

Because cloning linux.git eats up 1GB RAM

Hardware

  • RAM, lots and lots of RAM
    • Git-daemon will eat all of it, and then some
  • Use haproxy in the front to combat abuse
    • Don't use http caches, as they will likely break repos after gc/repack
  • Use fast disks with good seek times
    • Active repos will quickly create lots of loose objects all over the disk
  • Repacking (deltas and compression) eats processing power

Useful repack flags

  • -b --pack-kept-objects are your best friends
    • creates a bitmap index
    • cuts down on "counting objects" time
    • Git 2.0+ only
  • -f in limited cases (we don't use it)
  • "git repack" will not pack refs (tags, etc)
    • Use a separate "git pack-refs" command
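
A small sketch of the flags above on a throwaway repository. The -a and -d flags are added here as an assumption, since -b only makes sense when everything is packed into a single pack; the repository name and tag are made up.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
echo "hello" > README
git add README
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"
git tag v0.1

# -a -d: put everything into one pack and drop redundant packs.
# -b: write a bitmap index alongside the pack.
# --pack-kept-objects: also include objects from packs marked .keep.
git repack -a -d -b --pack-kept-objects -q

# Refs are not touched by repack; pack them separately.
git pack-refs --all

# The pack directory should now contain a .bitmap file.
ls .git/objects/pack
```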

Repo and bundles

Android-specific

Repo

  • A tool to keep track of hundreds of repos, using a manifest (that is a git repo itself)
  • Not for the same purposes as grokmirror (dev-oriented, not mirroring-oriented)
  • Parallelizes clones and updates
    • (resulting in abuse of your servers)

Repo bundles

  • Neat feature that can use lookaside "git bundle" packs to offload most traffic to an http cache
    • Can be placed on akamai or other accelerators
  • Only works with http:// clone URLs
  • Can save you tons of expensive bandwidth if your project uses repo at all
  • See "git-bundle(1)" and Google's "repo" tool
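
As a sketch of the server-side preparation, this builds a bundle from a throwaway repository and clones from it. The file name clone.bundle is, as far as I know, the name the repo tool looks for next to the clone URL; treat that and the project name as assumptions and check the repo documentation before relying on them.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical project repository with one commit.
git init -q project
echo "hello" > project/README
git -C project add README
git -C project -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# Pack all refs and their history into a single static file that an
# http cache or CDN can serve.
git -C project bundle create "$work/clone.bundle" --all

# Check that the bundle is valid, then clone from it like a remote.
git -C project bundle verify "$work/clone.bundle"
git clone -q "$work/clone.bundle" from-bundle
```

A client that starts from the bundle only needs to fetch the commits made since the bundle was generated from the real server.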

Thank you!

Questions?