Borg Backup

(a fork of Attic)

 

"The holy grail of backup software?"

 

 

Thomas Waldmann (07/2017)

Feature Set (1)

  • simple & fast
  • deduplication
  • compression
  • authenticated encryption

  • easy pruning of old backups

  • simple backend (k/v, fs, via ssh)

Feature Set (2)

  • FOSS (BSD license)

  • good docs

  • good platform / arch support
    Linux, *BSD, OS X, OpenIndiana, Cygwin, Win10 Linux Subsystem, HURD -
    native Windows port still unfinished / not merged yet into master.

  • xattr / acl support
  • FUSE support ("mount a backup")

Code

  • 95% Python 3.4+, Cython
    (high-level code, glue code)
  • 5% C
    (performance critical stuff)
  • 1.1: vendorized C: blake2, xxh64, crc32
  • only ~20000 LOC total
  • few dependencies
  • unit tests, CI

Security

  • Signatures / Authentication
    no undetected corruption/tampering
     
  • Encryption / Confidentiality
    only you have access to your data
     
  • FOSS in Python
    review possible, no buffer overflows

Safety

  • Robustness
    (by append-only design, transactions)
     
  • Checkpoints
    every 5 minutes (between files)
     
  • msgpack with "limited" Unpacker
    (no memory DoS)

Crypto Keys

  • client-side meta+data encryption
     
  • separate keys for sep. concerns
     
  • passphrase pbkdf2 100k rounds
     
  • Keys:
    • none
    • repokey (replaces: passphrase-only)
    • passphrase protected keyfile
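
A minimal sketch of the passphrase step, assuming PBKDF2 is run with HMAC-SHA256 and a random per-key salt. The slide only states "pbkdf2 100k rounds"; the hash choice, salt size and the `derive_key` helper are illustrative assumptions, not Borg's actual code.

```python
import hashlib
import os

ROUNDS = 100000  # "100k rounds", as stated on the slide


def derive_key(passphrase, salt, rounds=ROUNDS):
    """Derive a 32-byte key-encryption key from a passphrase (hypothetical helper)."""
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode("utf-8"), salt, rounds, dklen=32)


salt = os.urandom(32)  # assumed: a random salt stored next to the encrypted key material
kek = derive_key("correct horse battery staple", salt)
print(len(kek), "byte key derived")
```

The derived key would then be used to encrypt the real repository keys, so changing the passphrase does not require re-encrypting the data.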

Crypto Cipher/MAC

  • AEAD, Encrypt-then-MAC
    • AES256-CTR + HMAC-SHA256
    • Counter / IV deterministic, never repeats
    • we're working on adding AES256-GCM, maybe
      also others (AES-OCB? chacha20-poly1305?)

       
  • uses OpenSSL (libcrypto)

     
  • Intel/AMD: AES-NI, PCLMULQDQ
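
The Encrypt-then-MAC order can be sketched with the standard library alone. Loud assumption: the keystream below is a toy SHA-256 counter construction standing in for AES256-CTR (Python's stdlib has no AES), and the blob layout and the `seal` / `open_sealed` names are hypothetical. Only the HMAC-SHA256 part and the "verify first, decrypt second" order mirror the slide.

```python
import hashlib
import hmac


def keystream(key, iv, length):
    """Toy counter-mode keystream. NOT AES: a stand-in for illustration only."""
    out = bytearray()
    counter = iv
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(16, "big")).digest()
        counter += 1
    return bytes(out[:length])


def seal(enc_key, mac_key, iv, plaintext):
    """Encrypt first, then MAC the IV and ciphertext. Layout: tag | iv | ciphertext."""
    iv_bytes = iv.to_bytes(16, "big")
    ks = keystream(enc_key, iv, len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, ks))
    tag = hmac.new(mac_key, iv_bytes + ciphertext, hashlib.sha256).digest()
    return tag + iv_bytes + ciphertext


def open_sealed(enc_key, mac_key, blob):
    """Check the MAC before touching the ciphertext, then decrypt."""
    tag, iv_bytes, ciphertext = blob[:32], blob[32:48], blob[48:]
    expected = hmac.new(mac_key, iv_bytes + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed: data corrupted or tampered with")
    ks = keystream(enc_key, int.from_bytes(iv_bytes, "big"), len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, ks))


enc_key, mac_key = b"e" * 32, b"m" * 32
blob = seal(enc_key, mac_key, 0, b"hello borg")
print(open_sealed(enc_key, mac_key, blob))
```

Because the tag covers the ciphertext, a flipped bit is rejected before any decryption happens.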

Compression

  • none
    • no compression, 1:1 pass through, no cpu usage
  • lz4
    • low compression, super fast (500MB/s)
    • sometimes faster than w/o compression
  • zlib
    • medium compression, medium fast, level 0..9
  • lzma
    • high compression, slow, level 0..9
    • beware of higher levels of lzma: super slow and they do not compress better due to chunk size
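
A rough way to see the size trade-off on a single chunk, using only the compressors in the Python standard library. lz4 is left out because it needs a third-party module; the sample data and the `compare` helper are made up for illustration.

```python
import lzma
import zlib


def compare(chunk):
    """Return the compressed size of one chunk for a few algorithm/level pairs."""
    sizes = {"none": len(chunk)}
    for level in (1, 6, 9):
        sizes["zlib,%d" % level] = len(zlib.compress(chunk, level))
    for preset in (0, 6):
        sizes["lzma,%d" % preset] = len(lzma.compress(chunk, preset=preset))
    return sizes


chunk = b"backup data is often quite repetitive. " * 4096
sizes = compare(chunk)
for name in sorted(sizes):
    print("%-8s %8d bytes" % (name, sizes[name]))
```

Timing the calls (e.g. with `time.perf_counter`) on your own data shows the speed side of the trade-off.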

Deduplication (1)

  • No problem with:
    • VM images (sparse file support)
    • (physical) disk images
    • renamed huge directories/trees
    • inner deduplication of data set
    • historical deduplication
    • deduplication between different machines

 

Deduplication (2)

  • Content defined chunking:
    • "buzhash" rolling hash
    • cut data when hash has specific bit pattern,
      yields chunks with 2^n bytes target size
    • n + other chunker params configurable now
    • seeded, to avoid fingerprinting chunk lengths
       
  • Store chunks under id into store:
    • id = HASH(chunk)  [without encryption]
    • id = MAC(mac_key, chunk)  [with encryption]
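
The chunking and id scheme above can be sketched as follows. This is a simplified buzhash-style chunker, not Borg's C implementation: the window size, mask bits, minimum size, the way the seed builds the table, and the use of SHA-256 / HMAC-SHA256 for ids are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import random

WINDOW = 48      # rolling window size in bytes (illustrative value)
MASK_BITS = 12   # n: cut when the low n hash bits are zero -> ~2**n byte chunks
MIN_SIZE = 512   # never cut chunks smaller than this (illustrative value)


def rol32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def chunkify(data, seed=0):
    """Split data at content-defined boundaries using a rolling hash."""
    rng = random.Random(seed)  # the seed changes the table, hence the cut points
    table = [rng.getrandbits(32) for _ in range(256)]
    mask = (1 << MASK_BITS) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = rol32(h, 1) ^ table[byte]                    # add the incoming byte
        if i - start >= WINDOW:                          # window is full:
            h ^= rol32(table[data[i - WINDOW]], WINDOW)  # drop the outgoing byte
        if i - start + 1 >= MIN_SIZE and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_id(chunk, mac_key=None):
    """id = HASH(chunk) without encryption, id = MAC(mac_key, chunk) with it."""
    if mac_key is None:
        return hashlib.sha256(chunk).digest()
    return hmac.new(mac_key, chunk, hashlib.sha256).digest()


def store_chunks(store, data, mac_key=None):
    """Store each chunk under its id; return how many chunks were new."""
    new = 0
    for chunk in chunkify(data):
        cid = chunk_id(chunk, mac_key)
        if cid not in store:
            store[cid] = chunk
            new += 1
    return new


rng = random.Random(1)
original = bytes(rng.getrandbits(8) for _ in range(128 * 1024))
shifted = b"X" + original  # one byte inserted at the very front

store = {}
first = store_chunks(store, original)
second = store_chunks(store, shifted)
print("first backup: %d new chunks, second backup: %d new chunks" % (first, second))
```

Since cut points depend on content rather than on offsets, inserting a byte at the front typically changes only the first chunk; fixed-size blocks would all shift and none would deduplicate.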

Fork from Attic (May 2015)

  • attic has a good codebase
  • attracted quite some devs
  • lots of pull requests and activity
     
  • but:
  • low / slow PR acceptance
  • 1 main developer with little time
  • rather wanted it as his little pet project
  • rather coding on his own than reviewing code
  • "compatibility forever"

Borg - different goals

  • developed by "The Borg Collective"

  • more open development

  • new developers are welcome!

  • quicker development

  • redesign where needed

  • changes, new features

  • incompatible changes with good reason,
    minor things at minor, major at major releases

  • thus: less "sta(b)le"

Borg, 2 years after forking

  • attic repo:    ~600 changesets

  • borg repo: ~4400 changesets
  • developers, developers, developers!
  • active community:
    on github, irc channel, mailing list
  • bug and scalability fixes, #5
  • features!  testing.  platforms. docs.

Borg 1.0 (stable)

  • packaged for many Linux distributions

  • also in *BSD and Mac OS X dists

  • more or less works on Windows w/ Cygwin

  • Happy users on Twitter, Reddit and the Blogosphere.

Borg 1.1 (soon rc)

  • new features:
    • diff, recreate, with-lock, export-tar
    • borg mount: versions view
    • "auto" compression (heuristic)
    • blake2b id hash
    • JSON API, JSON logging
  • better speed: FUSE, traversal, HDDs
  • checksums for indexes & caches
  • some source reorg / cleanup

Borg 1.2: Multi-Threading

  • zeromq? actor model?
    • traverse, read, chunk
    • hash, dedup, compress, encrypt
    • store, sync
  • fully use CPU and IO capabilities
  • but: avoid races, crypto issues
  • GIL is no big issue:
    • heavy I/O, heavy C (library) code
    • lightweight Python based stuff

Borg 1.2: Crypto

  • AEAD-style internal crypto API
  • key and cipher agility
  • add new, faster ciphers
    • single-pass encrypt & authenticate
    • aes-gcm, aes-ocb
    • chacha20-poly1305
    • keccak?

Borg - you can be assimilated!

  • test scalability / reliability / security

  • be careful!

  • find, file and fix bugs

  • file feature requests

  • improve docs

  • contribute code

  • spread the word

  • create dist packages

  • care for misc. platforms

Borg Backup - Links

borgbackup.org
 

#borgbackup on chat.freenode.net

Questions / Feedback?

  • Just grab me at the conference!

  • Thomas J Waldmann @ twitter

Borg the Project ->

Borg Internals & Ideas  v

Error Correction?

  • borg does error (and even tampering) detection

  • but not (yet?) error correction

  • kinds of errors / threat model:

    • single/few bit errors

    • defect / unreadable blocks

    • media failure (defect disk, ssd)

  • see issue #225 for discussion

  • implement something in borg?

  • rely on other soft- or hardware solutions?

  • avoid futile attempts

Modernize Crypto

  • sha256, hmac-sha256, crc32 are slow

  • aes is also slow, if not hw accelerated

  • faster: poly1305, blake2, sha512-256, crc32c, chacha20

  • we will support OpenSSL 1.1 for better crypto:

    • aes-ocb / aes-gcm

    • chacha20-poly1305

  • also use blake2b (borg 1.1)

  • see PR #1034 crypto-aead branch

Key Gen. / Management

  • currently:

    • 1 AES key

    • 1 HMAC key

    • 1 chunker seed

    • stored highest IV value for AES CTR mode

    • encrypted using key passphrase

  • ideas:
    • session keys? always start from IV=0.
    • per archive? per thread? per chunk?
    • asymm. crypto: encrypt these keys for receiver
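
Why the highest IV has to be stored can be shown with a little arithmetic: CTR mode consumes one counter value per 16-byte AES block, so each message must start after the last counter the previous one used. The `IVAllocator` class below is a hypothetical helper for illustration; how Borg actually persists the value is not covered here.

```python
AES_BLOCK_SIZE = 16  # bytes; CTR mode consumes one counter value per block


def blocks_needed(nbytes):
    """Number of counter values used to encrypt nbytes of data."""
    return (nbytes + AES_BLOCK_SIZE - 1) // AES_BLOCK_SIZE


class IVAllocator:
    """Hypothetical helper that hands out non-overlapping counter ranges."""

    def __init__(self, highest_iv=0):
        self.highest_iv = highest_iv  # first counter value that is still unused

    def reserve(self, nbytes):
        """Return the starting IV for a message and advance past its blocks."""
        iv = self.highest_iv
        self.highest_iv += blocks_needed(nbytes)
        return iv


alloc = IVAllocator(highest_iv=0)
iv_a = alloc.reserve(100)  # 100 bytes -> 7 blocks, counters 0..6
iv_b = alloc.reserve(16)   # 16 bytes  -> 1 block, counter 7
print(iv_a, iv_b, alloc.highest_iv)
```

A session-key scheme, as floated in the ideas list, would make it safe to start from IV=0 again because a fresh key never reuses old (key, counter) pairs.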

RAM consumption

  • borg >= 1.0 now has lower RAM consumption

  • chunks, files and repo index kept in memory

  • uses bigger chunks (2MiB, was: 64kiB)

  • fewer chunks to manage -> smaller chunks index

  • be careful on small machines (NAS, rpi, ...)

  • or with huge amount of data / huge file count

  • in the docs, there is a formula to estimate RAM usage
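
The effect of the bigger target chunk size is simple arithmetic. The sketch below only counts chunks; `entry_bytes` is a caller-supplied placeholder, not Borg's real per-entry size, so use the formula in the docs for actual estimates.

```python
KiB = 1024
MiB = 1024 * KiB
TiB = 1024 * 1024 * MiB


def chunk_count(total_bytes, target_chunk_size):
    """Approximate number of chunks for a given amount of (unique) data."""
    return total_bytes // target_chunk_size


def index_bytes(chunks, entry_bytes):
    """Index size for a hypothetical fixed per-entry size."""
    return chunks * entry_bytes


old = chunk_count(1 * TiB, 64 * KiB)  # 64kiB target chunks
new = chunk_count(1 * TiB, 2 * MiB)   # 2MiB target chunks
print("64kiB chunks:", old)
print("2MiB chunks: ", new)
print("reduction factor:", old // new)
```

For 1 TiB of data this gives about 16.8 million chunks at 64kiB versus about 0.5 million at 2MiB, a factor of 32 fewer index entries.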

Hash Tables

  • own hash table implementation in C

  • compact block of memory, no pyobj overhead

  • e.g. used for the chunks index, repo index

  • uses closed hashing (bucket array, no linked lists)

  • uses linear probing for collision handling

  • sometimes slow (maybe when HT full of tombstones?)

  • use Robin Hood hashing?

  • HT performance difficult to measure
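
The tombstone suspicion can be illustrated with a toy table. This is a pure-Python sketch of closed hashing with linear probing, with a fixed capacity and no resizing; it is not Borg's C hash table, and the class and its probe counter are invented for the demonstration.

```python
EMPTY = object()
DELETED = object()  # tombstone left behind by delete()


class LinearProbingTable:
    """Toy closed-hashing table: one bucket array, linear probing, no resize."""

    def __init__(self, capacity=16):
        self.keys = [EMPTY] * capacity
        self.values = [None] * capacity
        self.probes = 0  # buckets inspected by get(), for the demo below

    def _slots(self, key):
        cap = len(self.keys)
        start = hash(key) % cap
        for step in range(cap):
            yield (start + step) % cap

    def get(self, key, default=None):
        for i in self._slots(key):
            self.probes += 1
            k = self.keys[i]
            if k is EMPTY:  # only a truly empty bucket ends the search
                return default
            if k is not DELETED and k == key:
                return self.values[i]
        return default

    def put(self, key, value):
        free = None
        for i in self._slots(key):
            k = self.keys[i]
            if k is EMPTY or k is DELETED:
                if free is None:
                    free = i  # remember the first reusable bucket
                if k is EMPTY:
                    break
            elif k == key:
                self.values[i] = value
                return
        if free is None:
            raise MemoryError("table full")
        self.keys[free], self.values[free] = key, value

    def delete(self, key):
        for i in self._slots(key):
            k = self.keys[i]
            if k is EMPTY:
                return False
            if k is not DELETED and k == key:
                self.keys[i], self.values[i] = DELETED, None
                return True
        return False


fresh = LinearProbingTable()
fresh.get(0)

worn = LinearProbingTable()
for k in range(12):
    worn.put(k, "chunk")
for k in range(12):
    worn.delete(k)
worn.get(0)

print("probes in fresh table:", fresh.probes)
print("probes in table full of tombstones:", worn.probes)
```

Both tables hold no live entries, yet the lookup in the second one has to walk past every tombstone before it reaches an empty bucket.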

Chunk Index Sync

  • problem: multiple clients updating same repo

  • then: chunk index needs to get re-synced

  • slow, esp. if remote, many and/or big archives

  • local collection of single-archive chunk indexes

  • needs lots of space, merging still expensive

  • idea: "borgception"

    • backup chunks index into a secondary borg repo

    • fetch it from there when out of sync

  • idea: "build chunks index from repo index" (in 1.1)

    • repo index knows all chunk IDs

    • but: no size/csize info in repo index
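
Merging the single-archive indexes can be sketched with plain dictionaries. Assumption: each index maps a chunk id to a `(refcount, size, csize)` tuple. The size/csize fields are named on the slide, while the refcount field and the `merge_indexes` helper are illustrative.

```python
def merge_indexes(per_archive_indexes):
    """Combine per-archive chunk indexes into one, summing reference counts."""
    merged = {}
    for index in per_archive_indexes:
        for chunk_id, (refcount, size, csize) in index.items():
            if chunk_id in merged:
                old_refcount = merged[chunk_id][0]
                merged[chunk_id] = (old_refcount + refcount, size, csize)
            else:
                merged[chunk_id] = (refcount, size, csize)
    return merged


monday = {b"id-1": (1, 2048, 900), b"id-2": (2, 4096, 1500)}
tuesday = {b"id-2": (1, 4096, 1500), b"id-3": (1, 1024, 700)}
chunks_index = merge_indexes([monday, tuesday])
for chunk_id in sorted(chunks_index):
    print(chunk_id, chunks_index[chunk_id])
```

Every entry of every archive index is touched once, which is why the merge stays expensive with many or big archives.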

GitHub  &  Git

  • "Organisation" with some Repositories
    • Main Repo "borg" with good README
    • Community Repo with links to related stuff
  • "Issues" (+ Tags)
    • bugs
    • todo / planning
    • ideas / feature requests / discussion
    • questions / docs enhancements
    • bounties $$ via bountysource.com
  • "Pull Requests" + Code Review
  • "Milestones" for Release Planning
  • "Releases" to publish sources and binaries

Sphinx for Docs  &  RTD

  • sphinx-based docs (reST markup)
  • README == elevator talk == docs intro
  • documentation == home page
  • borgbackup.readthedocs.io:
    • docs automatically built from github repo
    • docs for multiple releases available
    • nice theme, works for mobile devices
    • offers PDF (and other) docs downloads
    • problematic: sphinx search could be better
    • annoyance: rtd "edit on github" link is 404

asciinema demo "video"

  • asciinema.org
  • create cool demo "videos" of console tools
  • embed them into home page
  • static preview screen is selectable
  • copy & paste works
  • recording is readable/editable json (fix typos)
  • speed factor to adjust playback speed
  • small: commit it to your docs folder

Mailing List  &  IRC

  • borgbackup@python.org mailing list + archive
    • support
    • discussion
    • (release) announcements

 

  • #borgbackup @ chat.freenode.net IRC channel
    • support
    • discussion
    • faster-paced than ML

pytest  &  travis-ci.org

  • travis-ci.org
    • automatically runs our tests
    • for commits
    • for pull requests
    • both on Linux and MacOS

 

  • unit tests via tox + pytest (pytest.org):
    • pretty unit tests,
    • powerful framework,
    • have fun writing unit tests.
    • tox tests on all python versions.
    • flake8 checks the style.

vagrant + virtualbox

 

  • automate VMs (start / ssh into it / destroy)
  • provision VMs with the stuff needed
  • execute tests or build steps
  • automate once, have it reproducible
  • test on:
    • misc. Linux dists (old / new, 32 / 64bit, ...)
    • FreeBSD, OpenBSD, NetBSD
    • OS X
    • Windows (native, Cygwin, W10 Lx SubSys)
  • have fewer surprises "oh, it does not work on X?"
  • use a PowerPC64 qemu VM with debian to test on big-endian (x86/x64 and most other stuff is little-endian)

pyenv

 

  • pull and build any python version you want
  • easily switch between versions
  • tests: you want the minimum requirement
    • older point release == more bugs / weirdnesses
    • e.g. test on 3.4.0, 3.5.0, 3.6.0 to find all the issues
  • build / bundle: you want the latest / best release
    • build on 3.5.3 (or 3.6.1)
  • get unusual versions not provided by linux dist

pyinstaller

 

  • creates a single-file (or single directory) "binary" from:
    • your Python / Cython / C code
    • (C)Python Interpreter of your choice (latest?)
    • all required Python stdlib libraries
    • other libraries
    • but not the (g)libc
  • We use PyInstaller 3.2.1 with Python 3.5.3 on:
    • Debian 7 Wheezy 32/64bit, glibc 2.13
    • FreeBSD 10.3 64bit
    • OS X 10.10 "Yosemite"
    • Intentionally "old" stuff, so all as-old or newer systems usually work.

setuptools_scm

 

  • tired of editing your version numbers?
  • use setuptools_scm
  • use git tags for your releases
  • setuptools_scm does the rest:
    • considers latest tag
    • distance to that tag
    • workdir state (uncommitted changes?)
  • 1.2.3 (tagged release code)

  • 1.2.4.dev3+gdeadbee (3 commits later)

  • 1.2.4.dev3+gdeadbee.d20170709 (same + uncommitted changes)
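
How the three inputs turn into the version strings above can be mimicked in a few lines. This is an illustration of the scheme shown on the slide, not setuptools_scm's own code; `describe_version` and `next_patch` are hypothetical names, and the "tagged but unclean" case is an assumption.

```python
import datetime


def next_patch(tag):
    """Bump the last component of a dotted version: 1.2.3 -> 1.2.4."""
    parts = tag.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts)


def describe_version(tag, distance, node, dirty, today):
    """Build a version string from latest tag, distance to it and workdir state."""
    date_suffix = "d" + today.strftime("%Y%m%d")
    if distance == 0:
        # exactly on the tag; the suffix for an unclean workdir is an assumption
        return tag + ("+" + date_suffix if dirty else "")
    version = "%s.dev%d+g%s" % (next_patch(tag), distance, node)
    if dirty:
        version += "." + date_suffix
    return version


day = datetime.date(2017, 7, 9)
print(describe_version("1.2.3", 0, "deadbee", False, day))
print(describe_version("1.2.3", 3, "deadbee", False, day))
print(describe_version("1.2.3", 3, "deadbee", True, day))
```

In a real project none of this is written by hand: the version is derived from the git metadata at build time.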

GPG Basics

 

  • GnuPG (GPG) is public key cryptography software.
  • A GnuPG keypair has:
    • a secret key (usually passphrase protected)
    • a public key (often published via keyserver)
    • a public key fingerprint used to identify a PK
  • Signing requires access to the secret key.
  • If data has a valid GPG signature, this proves:
    • signature was made by secret key holder
    • data is authentic (unmodified, untampered) as it was at the time when the signature was made.

Secure Releasing

 

  • users execute downloaded code (as root) - give them a chance to verify it first and make sure it is authentic:
  • widely publish the release signing key fingerprint, upload public key to keyserver.
  • git: tag and sign the release
  • release files: sign them with GPG
  • everybody can now use GPG to verify the signatures

 

  • note: publishing hashes of files is often no protection against attacks (just against accidental corruption)

Python / Cython / C

  • Python:
    • most of the code, easy.
  • Cython:
    • write Python code, get C speed, easy "nogil"
    • access C data types, functions via Python
    • write high-level interface code for C libs
    • we use it e.g. for OpenSSL, lz4 and own C code
  • C:
    • only for the most performance critical parts
    • own C code, bundled C code
    • aka "the danger zone"