Borg Backup

(a fork of Attic)

"I found the Holy Grail of backups."

(Stavros K. about Attic-Backup, 8/2013)

Thomas Waldmann (PyCon DE, 2017-10-25)

ThomasWaldmann.doc

Doing Python since 2001,
Linux since it was on Floppies,
loves FOSS.
Projects:
- MoinMoin Wiki
- nsupdate.info
- bepasty
- vpngw
- BorgBackup
Contact: tw @ waldmann-edv . de
Python Developer (freelance & remote)

It's a backup tool

one you might actually enjoy using.

simple
efficient
secure
safe
free & open

simple

each backup is a full backup
FUSE = mount your backups
easy pruning of old backups
tooling: just borg, ssh, sh
good help, manpages, docs
single-file binary
good fs / OS / arch support

efficient

very fast for unchanged files
chunk deduplication
flexible compression
sparse file support
not flooding the fs cache
sped up by a bit of C and Cython
HW accelerated crypto

safe

borg uses:
- checksums
- transactions
- fs: syncing, atomic ops
backup repo = log-like KV store
checkpoints while backing up
off-site remote repositories
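
The "log-like KV store" with checksums and transactions can be pictured with a small sketch. It is not Borg's on-disk format: the entry layout, the tag values and the helper names `encode` and `replay` are assumptions made for illustration. Entries are appended to a log, each protected by a CRC32, and only entries followed by a COMMIT marker count when the log is replayed.

```python
import struct
import zlib

PUT, DELETE, COMMIT = 0, 1, 2          # illustrative tag values
HEADER = struct.Struct("<IBII")        # crc32, tag, key length, value length


def encode(tag, key=b"", value=b""):
    """Serialize one log entry, prefixed with a CRC32 of everything after it."""
    body = struct.pack("<BII", tag, len(key), len(value)) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body


def replay(log):
    """Rebuild the key/value state from a log, honoring only committed entries."""
    committed, pending = {}, {}
    pos = 0
    while pos + HEADER.size <= len(log):
        crc, tag, klen, vlen = HEADER.unpack_from(log, pos)
        end = pos + HEADER.size + klen + vlen
        if end > len(log) or zlib.crc32(log[pos + 4:end]) != crc:
            break  # truncated or corrupted entry: ignore it and everything after
        key = log[pos + HEADER.size:pos + HEADER.size + klen]
        value = log[pos + HEADER.size + klen:end]
        if tag == PUT:
            pending[key] = value
        elif tag == DELETE:
            pending[key] = None
        elif tag == COMMIT:
            # The transaction is complete: apply everything collected so far.
            for k, v in pending.items():
                if v is None:
                    committed.pop(k, None)
                else:
                    committed[k] = v
            pending = {}
        pos = end
    return committed  # entries still in `pending` were never committed


log = encode(PUT, b"a", b"1") + encode(COMMIT) + encode(PUT, b"b", b"2")
state = replay(log)  # the second PUT has no COMMIT after it and is dropped
```

An interrupted backup then looks like a log with an uncommitted tail, which a reader can simply ignore.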

secure

authenticated encryption
- nothing to see in the repo
- detect tampering / corruption
ssh transport for remote repos
append-only mode repos
FOSS, you can see the code

Crypto

client-side, metadata and data
authenticated encryption (AE, EtM)
- aes256-ctr
- hmac-sha256 or blake2b
- counter management, never repeat
encrypted key material:
- on client or in backup repo
- passphrase protection: pbkdf2 + AES
OpenSSL 1.0/1.1, only libcrypto
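
A minimal sketch of the "authenticate, then decrypt" half of encrypt-then-MAC, using only the Python standard library. The AES-CTR step is left out (the ciphertext is treated as opaque bytes), and the function names, the 32-byte tag layout and the iteration count are assumptions for illustration, not Borg's actual key or data format.

```python
import hashlib
import hmac


def derive_key(passphrase, salt, iterations=100_000):
    """Derive a 32-byte key from a passphrase with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, iterations)


def seal(mac_key, ciphertext, use_blake2b=False):
    """Encrypt-then-MAC, MAC half: prepend a tag computed over the ciphertext."""
    if use_blake2b:
        tag = hashlib.blake2b(ciphertext, key=mac_key, digest_size=32).digest()
    else:
        tag = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    return tag + ciphertext


def open_sealed(mac_key, blob, use_blake2b=False):
    """Verify the tag before touching the ciphertext; raise on any mismatch."""
    tag, ciphertext = blob[:32], blob[32:]
    expected = seal(mac_key, ciphertext, use_blake2b)[:32]
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        raise ValueError("authentication failed: corrupted or tampered data")
    return ciphertext  # only now would it be handed to decryption


key = derive_key("correct horse", b"illustrative-salt", iterations=1000)
blob = seal(key, b"opaque ciphertext bytes")
```

Checking the MAC first means corrupted or tampered data is rejected before any decryption or decompression code sees it.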

Compression

chunk-based (not full file)
algorithms: none, lz4, zlib, lzma
with lz4 often faster than with none!
"auto" mode:
- first use lz4 predictor
- if compressible: expensive compression
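
The "auto" decision can be sketched as follows. lz4 is not in the Python standard library, so zlib at level 1 stands in as the cheap predictor here, and lzma plays the expensive compressor. The 0.97 threshold and the name `compress_auto` are illustrative assumptions.

```python
import lzma
import os
import zlib


def compress_auto(chunk, threshold=0.97):
    """Return (algorithm_name, payload) for one chunk."""
    if not chunk:
        return "none", chunk
    probe = zlib.compress(chunk, 1)  # cheap stand-in for the lz4 predictor
    if len(probe) >= threshold * len(chunk):
        return "none", chunk  # looks incompressible: store it as is
    return "lzma", lzma.compress(chunk)  # worth spending CPU on


text_algo, text_payload = compress_auto(b"all work and no play " * 500)
noise_algo, noise_payload = compress_auto(os.urandom(10_000))
```

The point of the predictor is to avoid burning CPU on data that is already compressed or encrypted, such as media files or archives.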
"borg recreate" can recompress

Deduplication in SpaceTime

Deduplication "dimensions":
- inner deduplication of data set
  - copies of files, similar files
  - lots of zeros (sparse or not)
- historical deduplication in backup repo
  - many files don't change over time
  - they are all in each of your full backups
- deduplication between machines
  - just moved that big directory from m1 to m2?
  - same OS or data files everywhere?
  - will all dedup if machines share a backup repo.

Borg Deduplication

Content defined Chunk Deduplication
- cut a file into variable sized chunks,
  content defines where a cut happens
  (efficiently done using a rolling hash)
- MAC(chunk) is the key for the KV store
No problem with:
- inserted / deleted / shifted file contents
- renamed files / dirs
- VM disk images: only few chunks change
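
A toy version of content-defined chunking plus a MAC-keyed store. The rolling hash below is a simple gear-style hash chosen for brevity, not Borg's chunker. It has no minimum or maximum chunk size, and the table seed, the 10-bit cut mask and the class name `ChunkStore` are assumptions for illustration.

```python
import hashlib
import hmac
import random

_rng = random.Random(1)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one random value per byte value
CUT_MASK = (1 << 10) - 1  # cut when the low 10 bits are zero: ~1 KiB average chunks


def chunkify(data):
    """Split data at positions chosen by the content itself."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting pushes older bytes out of the low bits, so the cut decision
        # depends only on the last few bytes, not on the absolute position.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & CUT_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ChunkStore:
    """Key/value store where the key is MAC(chunk), so equal chunks are stored once."""

    def __init__(self, mac_key):
        self.mac_key = mac_key
        self.chunks = {}

    def add(self, data):
        """Store data; return (list of chunk ids, number of newly stored bytes)."""
        ids, new_bytes = [], 0
        for chunk in chunkify(data):
            chunk_id = hmac.new(self.mac_key, chunk, hashlib.sha256).digest()
            if chunk_id not in self.chunks:
                self.chunks[chunk_id] = chunk
                new_bytes += len(chunk)
            ids.append(chunk_id)
        return ids, new_bytes


data_rng = random.Random(42)
original = bytes(data_rng.getrandbits(8) for _ in range(100_000))
store = ChunkStore(mac_key=b"k" * 32)
ids_v1, new_v1 = store.add(original)
ids_v2, new_v2 = store.add(b"X" + original)  # one byte inserted at the front
```

With fixed-size blocks, the inserted byte would shift every block and nothing would deduplicate. With content-defined cuts, only the chunk or chunks near the insertion change, so `new_v2` is a small fraction of `new_v1`.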

Borg assimilated Data

Real stats from a real backup repository (shortened).

2 machines, 147 backup archives, 2.5 years.

 $ borg info ssh://borg@myserver/repos/myrepo

                Original size   Compressed size   Dedup size
 All archives:      22.76 TB          18.22 TB    486.20 GB

                      Unique chunks         Total chunks
 Chunk index:              6305006            272643223
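
A quick back-of-the-envelope reading of these figures, assuming the sizes use decimal units (1 TB = 1000 GB). The variable names are only for illustration.

```python
original_tb = 22.76
compressed_tb = 18.22
dedup_gb = 486.20
unique_chunks = 6_305_006
total_chunks = 272_643_223

compression_saving = 1 - compressed_tb / original_tb  # effect of compression alone
space_ratio = original_tb * 1000 / dedup_gb           # original size vs. stored size
chunk_ratio = total_chunks / unique_chunks            # average references per chunk

print(f"compression alone saves about {compression_saving:.0%}")
print(f"the repository stores about 1/{space_ratio:.0f} of the original size")
print(f"each unique chunk is referenced about {chunk_ratio:.0f} times")
```

Compression contributes roughly a fifth. The bulk of the saving comes from deduplication across the 147 archives.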

Borg 1.2: the future

multi-threading, actors, zeromq
fully use CPU and I/O
GIL is no big issue (I/O, C code)
crypto enhancements:
- AEAD API, faster AEAD ciphers
- key and cipher agility
- require OpenSSL 1.1?

Borg the Project ->

Borg Internals & Ideas v

Error Correction?

borg does error (and even tampering) detection
but not (yet?) error correction
kinds of errors / threat model:
- single/few bit errors
- defect / unreadable blocks
- media failure (defect disk, ssd)
see issue #225 for discussion
implement something in borg?
rely on other soft- or hardware solutions?
avoid futile attempts, borg is application level

Modernize Crypto

sha256, hmac-sha256 is slow
- solved: borg 1.1 added blake2b
zlib crc32 is slow
- solved: borg 1.1 added fast crc32 C code
AES-CTR + MAC 2-pass AE can be slow
- todo: borg 1.2 will use OpenSSL 1.1 for:
  - AES-OCB (very fast, if hw accelerated)
  - chacha20-poly1305 (quite fast w/o hw accel.)
key / cipher agility

Key Gen. / Management

currently:
- 1 AES key
- 1 MAC key
- 1 chunker seed
- stored highest IV value for AES CTR mode
- encrypted using key passphrase
ideas:
- session keys? always start from IV=0.
- per archive? per thread? per chunk?
- asymm. crypto: encrypt these keys for receiver

RAM consumption

borg >= 1.0 now has lower RAM consumption
uses bigger chunks (2MiB, was: 64KiB)
chunks, files and repo index kept in memory
fewer chunks to manage -> smaller chunks index.
be careful on small machines (NAS, raspi, ...)
or with huge amount of data / huge file count
in the docs, there is a formula to estimate RAM usage
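
Why bigger chunks help can be seen with a rough estimate. The function below is a sketch only: the per-chunk and per-file byte counts are illustrative placeholders, not figures from this talk, so take the real constants from the formula in the docs.

```python
def estimate_index_ram(total_bytes, file_count, chunk_size=2 * 1024**2,
                       bytes_per_chunk=164, bytes_per_file=240):
    """Rough RAM estimate: one index entry per chunk plus one entry per file.

    bytes_per_chunk and bytes_per_file are placeholder assumptions.
    """
    chunk_count = total_bytes // chunk_size
    return chunk_count * bytes_per_chunk + file_count * bytes_per_file


one_tib = 1024**4
big_chunks = estimate_index_ram(one_tib, file_count=0)                 # 2 MiB chunks
small_chunks = estimate_index_ram(one_tib, file_count=0, chunk_size=64 * 1024)
```

Whatever the exact constants are, going from 64 KiB to 2 MiB chunks cuts the number of chunk entries by a factor of 32 for the same amount of data.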

Hash Tables

own hash table implementation in C
compact block of memory, no pyobj overhead
e.g. used for the chunks index, repo index
uses closed hashing (bucket array, no linked lists)
uses linear probing for collision handling
HT performance difficult to measure
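
The idea of a compact, closed-hashing table with linear probing can be sketched in Python with a single `bytearray`. This is not Borg's C implementation: the entry layout (32-byte key plus three 32-bit values, named after the refcount/size/csize fields for illustration), the empty marker and the class name are assumptions, and resizing and deletion are left out.

```python
import hashlib
import struct

KEY_SIZE = 32                    # e.g. a 256-bit chunk id
VALUE = struct.Struct("<III")    # illustrative value: refcount, size, csize
ENTRY_SIZE = KEY_SIZE + VALUE.size
EMPTY = 0xFFFFFFFF               # marker in the first value field of a free bucket


class CompactIndex:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.used = 0
        # One contiguous buffer for all buckets, no per-entry Python objects.
        self.buckets = bytearray(capacity * ENTRY_SIZE)
        for i in range(capacity):
            struct.pack_into("<I", self.buckets, i * ENTRY_SIZE + KEY_SIZE, EMPTY)

    def _find(self, key):
        """Return (offset, found) for the key's bucket or the first free one."""
        idx = int.from_bytes(key[:4], "little") % self.capacity
        while True:  # terminates because at least one bucket is always free
            off = idx * ENTRY_SIZE
            (first,) = struct.unpack_from("<I", self.buckets, off + KEY_SIZE)
            if first == EMPTY:
                return off, False
            if self.buckets[off:off + KEY_SIZE] == key:
                return off, True
            idx = (idx + 1) % self.capacity  # linear probing: try the next bucket

    def set(self, key, refcount, size, csize):
        if len(key) != KEY_SIZE or refcount == EMPTY:
            raise ValueError("bad key size or reserved refcount value")
        off, found = self._find(key)
        if not found:
            if self.used + 1 >= self.capacity:
                raise MemoryError("table full; a real table would grow before this")
            self.buckets[off:off + KEY_SIZE] = key
            self.used += 1
        VALUE.pack_into(self.buckets, off + KEY_SIZE, refcount, size, csize)

    def get(self, key):
        off, found = self._find(key)
        return VALUE.unpack_from(self.buckets, off + KEY_SIZE) if found else None


index = CompactIndex(capacity=64)
chunk_id = hashlib.sha256(b"some chunk").digest()
index.set(chunk_id, 1, 4096, 1200)
```

Each entry costs exactly 44 bytes here, independent of how many entries exist, which is the property the "no pyobj overhead" bullet is after.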

Chunk Index Sync

problem: multiple clients updating same repo
then: chunk index needs to get re-synced
slow, esp. if remote, many and/or big archives
local collection of single-archive chunk indexes
needs lots of space, merging still expensive
idea: "borgception"
- backup chunks index into a secondary borg repo
- fetch it from there when out of sync
idea: "build chunks index from repo index" (in 1.1)
- repo index knows all chunk IDs
- but: no size/csize info in repo index
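
What "merging" single-archive chunk indexes means can be shown with a deliberately simplified sketch: each per-archive index is modeled as a mapping from chunk ID to reference count, and the merged index sums the counts. The data model and the name `merge_chunk_indexes` are assumptions, and size information is ignored.

```python
from collections import Counter


def merge_chunk_indexes(archive_indexes):
    """Sum per-archive reference counts into one repository-wide chunk index."""
    merged = Counter()
    for index in archive_indexes:
        merged.update(index)  # Counter.update adds counts per chunk id
    return merged


archive_a = {b"chunk-1": 1, b"chunk-2": 2}
archive_b = {b"chunk-2": 1, b"chunk-3": 1}
merged = merge_chunk_indexes([archive_a, archive_b])
```

Every archive has to be visited once per resync, which is why this gets expensive with many or big archives.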

Python / Cython / C

Python (90%):
- easy, high level logic
Cython (5%):
- write pythonic code, get C-ish speed
- access C data types, functions, easy "nogil"
- simple interface code for C libs,
  we use that for OpenSSL, lz4 and own C code.
C (5%):
- used for the most resource-usage critical parts
  (CPU as well as RAM usage)
- own C code, bundled C code
- hard to maintain, debug

travis-ci.org

hosted service
free for FOSS
automatically runs our tests:
- for branches
- for pull requests
- on Linux and macOS
- misc. Python versions

pytest & tox

pytest:
- pretty and simple tests,
  less boilerplate than stdlib "unittest"
- powerful framework
- have fun writing tests
- optionally remote and parallel tests
tox:
- automates testing on all python versions
- each in a freshly built virtual env
- plus flake8 checker, for pep8 and more
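
To illustrate the "less boilerplate" point about pytest: a test is just a function whose name starts with `test_` and that uses plain `assert`. The helper `roundtrip` and both tests are made-up examples, not taken from the Borg test suite.

```python
import zlib


def roundtrip(data):
    """Compress and decompress again; the result should equal the input."""
    return zlib.decompress(zlib.compress(data))


def test_roundtrip_preserves_data():
    data = b"borg" * 1000
    assert roundtrip(data) == data


def test_compression_shrinks_repetitive_data():
    data = b"\0" * 4096
    assert len(zlib.compress(data)) < len(data)
```

Saved as a file such as `test_compress.py`, running `pytest` in that directory would discover and run both functions. No test class or `self.assertEqual` is needed.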

pyenv

pull and build any python version you want
easily switch between versions
test on minimum requirement:
- older point release == more bugs
  (so tests there catch the most problems)
- 3.4.0, 3.5.0, 3.6.0 to find all the issues
build / bundle on latest / greatest release:
- newer point release == fewer bugs
- 3.5.4 (or 3.6.2) to get best build

vagrant, vbox, qemu

automate VMs:
create, start, provision, ..., shutdown, destroy
e.g. run tests / builds on:
- Linux (misc. dists, old / new, 32 / 64bit, ...)
- BSD (FreeBSD, OpenBSD, NetBSD)
- OS X
- OpenIndiana
- Windows (maybe)
PowerPC64 qemu VM with Debian to test on a non-x86/x64, big-endian arch (most archs are little-endian).
fewer surprises "oh, it does not work on X?"

pyinstaller

creates a single-file binary, bundling together:
- your Python / Cython / C code
- (C)Python Interpreter of your choice
- all required Python stdlib libraries
- other required libraries
- but not the (g)libc
We use it to build Linux, FreeBSD, OS X borg binaries.
Intentionally build on "old" OS:
- all as-old or newer deployments will usually work
- preferably not too old: security updates wanted

Secure Releasing with GPG

creepy: users execute downloaded blobs, as root.
give them a chance to make sure it is authentic:
- release signing key fingerprint widely published
- public key uploaded to keyserver
- document how to use GPG to verify the signatures
git repo: sign the release tags (or every commit)
release files: sign them, detached sig

note: just publishing hashes of files is no protection against attacks (only against accidental corruption)

setuptools_scm

tired of bumping your version numbers?
setuptools_scm makes versions from git tags:
- considers latest tag
- distance to that tag (commits)
- workdir state (uncommitted changes?)
1.2.3 (tagged release code)
1.2.4.dev3+gdeadbee (3 commits later)
1.2.4.dev3+gdeadbee.d20170709 (same, plus uncommitted changes)
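
How to read the three version strings above can be sketched with a small parser. The function `describe` and its regular expression are hypothetical helpers written for this illustration. They are not part of setuptools_scm and only cover the three forms shown on this slide.

```python
import re

VERSION_RE = re.compile(
    r"^(?P<base>\d+(?:\.\d+)*)"                      # e.g. 1.2.4
    r"(?:\.dev(?P<distance>\d+)\+g(?P<node>[0-9a-f]+)"  # .dev3+gdeadbee
    r"(?:\.d(?P<date>\d{8}))?)?$"                    # .d20170709 if workdir is unclean
)


def describe(version):
    """Split a version string into the pieces derived from the git state."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return {
        "base": match["base"],
        "commits_since_tag": int(match["distance"] or 0),
        "commit": match["node"],            # abbreviated commit id, None on a tag
        "dirty": match["date"] is not None,  # True if there were uncommitted changes
    }


release = describe("1.2.3")
dev_build = describe("1.2.4.dev3+gdeadbee")
dirty_build = describe("1.2.4.dev3+gdeadbee.d20170709")
```

The version string alone tells you whether a binary was built from a tagged release or from some intermediate, possibly modified, state.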

Sphinx / Docs

sphinx - generates html/pdf from reST
reuse your ArgParser help:
- build_usage: html cli usage docs
- build_man: man pages
- see archiver.py and setup.py
reuse your github README:
- include it in the docs intro
- it's your "elevator speech"
reuse your docs:
- nice hosted docs can be your "home page"
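
The "reuse your ArgParser help" idea can be sketched like this: build the argparse parsers once, then render each subcommand's help text into a reST section. The `demo` CLI, its options and the function names are hypothetical. Borg's own build_usage and build_man commands are more elaborate.

```python
import argparse
import textwrap


def build_parsers():
    """Create a hypothetical CLI and return (main parser, subcommand parsers)."""
    parser = argparse.ArgumentParser(prog="demo", description="Hypothetical backup CLI.")
    sub = parser.add_subparsers(dest="command")
    commands = {}
    create = sub.add_parser("create", help="create an archive")
    create.add_argument("archive", help="name of the new archive")
    commands["create"] = create
    prune = sub.add_parser("prune", help="delete old archives")
    prune.add_argument("--keep-daily", type=int, default=7, help="daily archives to keep")
    commands["prune"] = prune
    return parser, commands


def usage_rst(subparser):
    """Render one subcommand's --help output as a reST section with a literal block."""
    title = subparser.prog
    help_text = textwrap.indent(subparser.format_help(), "    ")
    return f"{title}\n{'-' * len(title)}\n\n::\n\n{help_text}"


parser, commands = build_parsers()
pages = {name: usage_rst(sub) for name, sub in commands.items()}
```

Since the docs are generated from the same parser objects the program uses, the usage pages cannot drift away from what `--help` prints.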

ReadTheDocs.org

builds and hosts your docs:
- url like borgbackup.readthedocs.io
- automatically built from github tags
- version selector, stable, latest
- nice theme (also for mobile devices)
- offers PDF (and other) downloads
- uses sphinx

asciinema.org

create cool demo "videos" of console tools
embed them on:
- home page
- docs intro
- github README
selectable static preview screen
adjustable playback speed
copy & paste works from the "video"
recording is json:
- fix typos by editing it
- commit it to your repo

GitHub

"Organisation" == a common ground
- Main Repo "borg" with good README
- Repo "community" with links to related stuff
"Issues" (+ Labels)
- bugs / todo / planning
- ideas / feature requests / discussion
- questions -> docs enhancements
- bounties $$ via bountysource.com
"Pull Requests" + Code Review
"Milestones" for Release Planning
"Releases" to publish changelog link, src, bin

Communication Channels

Mailing List and Archive:
- borgbackup @ python.org
- slow, async, permanent
IRC (and also matrix):
- #borgbackup @ chat.freenode.net
- quicker, sync/async, transient
Twitter:
- @borgbackup on Twitter
Usages:
- support
- discussion
- (release) announcements

Borg - you can be assimilated!

test scalability / reliability / security
find, file and fix bugs
file and implement feature requests
improve docs
contribute or review code
spread the word
create dist packages
care for misc. platforms (windows)
donate funds via bountysource

For more information:

borgbackup.org

Questions / Feedback?

Just grab me at the conference or at the sprints!
Thomas J Waldmann @ twitter