Borg Backup
(a fork of Attic)
"The holy grail of backup software?"
Thomas Waldmann (07/2017)
Feature Set (1)
- simple & fast
- deduplication
- compression
- authenticated encryption
- easy pruning of old backups
- simple backend (k/v, fs, via ssh)
Feature Set (2)
- FOSS (BSD license)
- good docs
- good platform / arch support:
  Linux, *BSD, OS X, OpenIndiana, Cygwin, Win10 Linux Subsystem, HURD
- native Windows port still unfinished / not merged yet into master
- xattr / acl support
- FUSE support ("mount a backup")
Code
- 95% Python 3.4+, Cython (high-level code, glue code)
- 5% C (performance critical stuff)
- 1.1: vendorized C: blake2, xxh64, crc32
- only ~20000 LOC total
- few dependencies
- unit tests, CI
Security
- Signatures / Authentication:
  no undetected corruption/tampering
- Encryption / Confidentiality:
  only you have access to your data
- FOSS in Python:
  review possible, no buffer overflows
Safety
- Robustness
  (by append-only design, transactions)
- Checkpoints
  every 5 minutes (between files)
- msgpack with "limited" Unpacker
  (no memory DoS)
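As an illustration of the "limited Unpacker" idea, msgpack-python's Unpacker accepts an explicit buffer limit so a corrupt or malicious stream cannot force huge allocations; the limit value below is made up for the example and is not Borg's actual setting:

```python
import msgpack

# Cap how much data the Unpacker will ever buffer (example value, not Borg's).
unpacker = msgpack.Unpacker(max_buffer_size=4 * 1024 * 1024)

unpacker.feed(msgpack.packb({"path": "/etc/hosts", "size": 123}))
for item in unpacker:
    print(item)  # the decoded metadata dict
```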
Crypto Keys
- client-side meta+data encryption
- separate keys for separate concerns
- passphrase: pbkdf2, 100k rounds (sketch after this list)
- Keys:
  - none
  - repokey (replaces: passphrase-only)
  - passphrase protected keyfile
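The 100k-round PBKDF2 mentioned above can be illustrated with the standard library alone; the salt and output length here are example values, not Borg's key-derivation code:

```python
import hashlib
import os

passphrase = b"correct horse battery staple"
salt = os.urandom(16)  # example salt; it must be stored with the key material

# PBKDF2-HMAC-SHA256, 100000 rounds, 32-byte output (example parameters)
derived_key = hashlib.pbkdf2_hmac("sha256", passphrase, salt, 100000, dklen=32)
print(derived_key.hex())
```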
Crypto Cipher/MAC
- AEAD, Encrypt-then-MAC
- AES256-CTR + HMAC-SHA256 (sketch below)
- Counter / IV deterministic, never repeats
- we're working on adding AES256-GCM, maybe
  also others (AES-OCB? chacha20-poly1305?)
- uses OpenSSL (libcrypto)
- Intel/AMD: AES-NI, PCLMULQDQ
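A minimal sketch of the AES256-CTR + HMAC-SHA256 Encrypt-then-MAC construction, using the third-party cryptography package; key handling, counter management and on-disk layout are simplified and do not mirror Borg's actual code:

```python
import hashlib
import hmac
import os

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_then_mac(enc_key, mac_key, iv, plaintext):
    # AES-256-CTR encryption (32-byte key, 16-byte counter block)
    encryptor = Cipher(algorithms.AES(enc_key), modes.CTR(iv)).encryptor()
    ciphertext = encryptor.update(plaintext) + encryptor.finalize()
    # MAC over IV + ciphertext, never over the plaintext (Encrypt-then-MAC)
    tag = hmac.new(mac_key, iv + ciphertext, hashlib.sha256).digest()
    return tag + iv + ciphertext

enc_key, mac_key = os.urandom(32), os.urandom(32)
iv = (0).to_bytes(16, "big")  # the real counter must never repeat for a given key
print(len(encrypt_then_mac(enc_key, mac_key, iv, b"some chunk data")))
```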
Compression
- none
  - no compression, 1:1 pass-through, no CPU usage
- lz4
  - low compression, super fast (500 MB/s)
  - sometimes faster than without compression
- zlib
  - medium compression, medium speed, level 0..9
- lzma
  - high compression, slow, level 0..9
  - beware of higher lzma levels: super slow and they do not compress better due to the chunk size
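To get a feel for the ratio/speed trade-off using only the standard library (lz4 needs a third-party binding and is omitted); results obviously depend on the data:

```python
import lzma
import time
import zlib

data = b"some fairly repetitive example data " * 100000

for name, compress in [("zlib,6", lambda d: zlib.compress(d, 6)),
                       ("lzma,6", lambda d: lzma.compress(d, preset=6))]:
    t0 = time.perf_counter()
    out = compress(data)
    dt = time.perf_counter() - t0
    print("{}: ratio {:.3f}, {:.2f}s".format(name, len(out) / len(data), dt))
```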
Deduplication (1)
- No problem with:
- VM images (sparse file support)
- (physical) disk images
- renamed huge directories/trees
- inner deduplication of data set
- historical deduplication
- deduplication between different machines
Deduplication (2)
- Content defined chunking:
  - "buzhash" rolling hash
  - cut data where the hash shows a specific bit pattern,
    yields chunks with a target size of 2^n bytes
  - n + other chunker params configurable now
  - seeded, to avoid fingerprinting chunk lengths
- Store chunks under an id into the store:
  - id = HASH(chunk) [without encryption]
  - id = MAC(mac_key, chunk) [with encryption]
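A toy sketch of both ideas on this slide: a simplistic rolling-sum chunker (deliberately not buzhash, just the cut-on-bit-pattern principle) and the keyed chunk id; parameters and helper names are made up for the illustration:

```python
import hashlib
import hmac
import os

def toy_chunker(data, window=16, mask_bits=12):
    """Cut where a rolling sum over the last `window` bytes has its low
    `mask_bits` bits all zero -> chunks of roughly 2**mask_bits bytes."""
    mask = (1 << mask_bits) - 1
    start = rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]        # slide the window forward
        if i - start + 1 >= window and (rolling & mask) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

chunks = list(toy_chunker(os.urandom(1 << 20)))

# Store each chunk under an id:
#   without encryption: id = HASH(chunk)
#   with encryption:    id = MAC(mac_key, chunk)  -> not computable by an attacker
mac_key = os.urandom(32)
ids = [hmac.new(mac_key, c, hashlib.sha256).digest() for c in chunks]
print(len(chunks), "chunks,", len(set(ids)), "unique ids")
```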
Fork from Attic (May 2015)
- Attic has a good codebase
- attracted quite a few devs
- lots of pull requests and activity
- but:
  - low / slow PR acceptance
  - 1 main developer with little time
  - rather wanted it as his little pet project
  - rather coding on his own than reviewing code
  - "compatibility forever"
Borg - different goals
- developed by "The Borg Collective"
- more open development
- new developers are welcome!
- quicker development
- redesign where needed
- changes, new features
- incompatible changes with good reason:
  minor things at minor, major at major releases
- thus: less "sta(b)le"
Borg, 2 years after forking
- attic repo: ~600 changesets
- borg repo: ~4400 changesets
- developers, developers, developers!
- active community:
  on github, irc channel, mailing list
- bug and scalability fixes, #5
- features! testing. platforms. docs.
Borg 1.0 (stable)
- packaged for many Linux distributions
- also in *BSD and Mac OS X dists
- more or less works on Windows w/ Cygwin
- Happy users on Twitter, Reddit and the blogosphere.
Borg 1.1 (soon rc)
- new features:
  - diff, recreate, with-lock, export-tar
  - borg mount: versions view
  - "auto" compression (heuristic)
  - blake2b id hash
  - JSON API, JSON logging
  - better speed: FUSE, traversal, HDDs
  - checksums for indexes & caches
- some source reorg / cleanup
Borg 1.2: Multi-Threading
- zeromq? actor model?
  - traverse, read, chunk
  - hash, dedup, compress, encrypt
  - store, sync
- fully use CPU and I/O capabilities
- but: avoid races, crypto issues
- GIL is no big issue:
  - heavy I/O, heavy C (library) code
  - lightweight Python based stuff
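One reason the GIL is not a big issue: CPython releases it inside heavy C code, e.g. hashlib releases the GIL while hashing buffers larger than about 2 KiB, so worker threads can hash chunks in parallel. A toy demonstration of that principle, not Borg 1.2 code:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

chunks = [os.urandom(2 * 1024 * 1024) for _ in range(32)]  # fake 2 MiB chunks

def chunk_id(chunk):
    # hashlib drops the GIL for large inputs, so this scales across threads
    return hashlib.sha256(chunk).hexdigest()

with ThreadPoolExecutor(max_workers=4) as pool:
    ids = list(pool.map(chunk_id, chunks))

print(len(set(ids)), "unique chunk ids")
```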
Borg 1.2: Crypto
- AEAD-style internal crypto API
- key and cipher agility
- add new, faster ciphers:
  - single-pass encrypt & authenticate
  - aes-gcm, aes-ocb
  - chacha20-poly1305
  - keccak?
Borg - you can be assimilated!
- test scalability / reliability / security
- be careful!
- find, file and fix bugs
- file feature requests
- improve docs
- contribute code
- spread the word
- create dist packages
- care for misc. platforms
Borg Backup - Links
borgbackup.org
#borgbackup on chat.freenode.net
Questions / Feedback?
- Just grab me at the conference!
- Thomas J Waldmann @ twitter
Borg the Project → Borg Internals & Ideas
Error Correction?
- borg does error (and even tampering) detection
- but not (yet?) error correction
- kinds of errors / threat model:
  - single/few bit errors
  - defect / unreadable blocks
  - media failure (defect disk, ssd)
- see issue #225 for discussion
- implement something in borg?
- rely on other soft- or hardware solutions?
- avoid futile attempts
Modernize Crypto
- sha256, hmac-sha256, crc32 are slow
- aes is also slow, if not hw accelerated
- faster: poly1305, blake2, sha512-256, crc32c, chacha20
- we will support OpenSSL 1.1 for better crypto:
  - aes-ocb / aes-gcm
  - chacha20-poly1305
- also use blake2b (borg 1.1, example below)
- see PR #1034, crypto-aead branch
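blake2b (used for the 1.1 id hash) is in hashlib from Python 3.6 on; with a key it doubles as a fast MAC without a separate HMAC construction. A minimal illustration, not Borg's exact parameters:

```python
import hashlib
import os

chunk = b"example chunk data"

# Plain blake2b id hash, truncated to 256 bits
plain_id = hashlib.blake2b(chunk, digest_size=32).digest()

# Keyed blake2b: serves as a MAC, no extra HMAC layer needed
key = os.urandom(32)
keyed_id = hashlib.blake2b(chunk, key=key, digest_size=32).digest()

print(plain_id.hex(), keyed_id.hex(), sep="\n")
```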
Key Gen. / Management
- currently:
  - 1 AES key
  - 1 HMAC key
  - 1 chunker seed
  - stored highest IV value for AES CTR mode
  - encrypted using key passphrase
- ideas:
  - session keys? always start from IV=0.
  - per archive? per thread? per chunk?
  - asymm. crypto: encrypt these keys for receiver
RAM consumption
- borg >= 1.0 now has lower RAM consumption
- chunks, files and repo index kept in memory
- uses bigger chunks (2 MiB, was: 64 kiB)
- fewer chunks to manage -> smaller chunks index
- be careful on small machines (NAS, rpi, ...)
- or with a huge amount of data / huge file count
- in the docs, there is a formula to estimate RAM usage (symbolic sketch below)
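Symbolically, that estimate has roughly this shape; the per-entry constants are implementation details spelled out in the docs, and the symbols below are placeholders, not the docs' exact formula:

```latex
\mathrm{RAM} \;\approx\;
  N_{\mathrm{chunks}} \cdot \bigl(c_{\mathrm{chunks\,index}} + c_{\mathrm{repo\,index}}\bigr)
  \;+\; N_{\mathrm{files}} \cdot c_{\mathrm{files\,cache}}
```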
Hash Tables
- own hash table implementation in C
- compact block of memory, no pyobj overhead
- e.g. used for the chunks index, repo index
- uses closed hashing (bucket array, no linked lists)
- uses linear probing for collision handling (toy sketch below)
- sometimes slow (maybe when the HT is full of tombstones?)
- use Robin Hood hashing?
- HT performance difficult to measure
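A pure-Python toy of the closed-hashing / linear-probing scheme above, including the tombstones that deletions leave behind (which is exactly what can slow lookups down); Borg's real index is a flat C array without Python objects:

```python
_EMPTY, _TOMBSTONE = object(), object()

class ToyIndex:
    """Closed hashing: one flat slot array, linear probing, tombstones."""

    def __init__(self, capacity=1024):
        self._slots = [_EMPTY] * capacity

    def _probe(self, key):
        i = hash(key) % len(self._slots)
        while True:
            yield i
            i = (i + 1) % len(self._slots)      # linear probing: try the next slot

    def put(self, key, value):
        # Simplification: reuses the first free/tombstone slot even if the key
        # already exists further along the probe chain.
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is _EMPTY or slot is _TOMBSTONE or slot[0] == key:
                self._slots[i] = (key, value)
                return

    def get(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is _EMPTY:                  # never-used slot: key not present
                raise KeyError(key)
            if slot is not _TOMBSTONE and slot[0] == key:
                return slot[1]

    def delete(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is _EMPTY:
                raise KeyError(key)
            if slot is not _TOMBSTONE and slot[0] == key:
                self._slots[i] = _TOMBSTONE     # keep the probe chain intact
                return

idx = ToyIndex()
idx.put(b"chunk-id", (3, 1000, 400))            # e.g. (refcount, size, csize)
print(idx.get(b"chunk-id"))
```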
Chunk Index Sync
- problem: multiple clients updating the same repo
- then: the chunk index needs to get re-synced
- slow, esp. if remote, many and/or big archives
- local collection of single-archive chunk indexes:
  needs lots of space, merging still expensive
- idea: "borgception"
  - backup the chunks index into a secondary borg repo
  - fetch it from there when out of sync
- idea: "build chunks index from repo index" (in 1.1)
  - repo index knows all chunk IDs
  - but: no size/csize info in repo index
GitHub & Git
- "Organisation" with some Repositories
- Main Repo "borg" with good README
- Community Repo with links to related stuff
- "Issues" (+ Tags)
- bugs
- todo / planning
- ideas / feature requests / discussion
- questions / docs enhancements
- bounties $$ via bountysource.com
- "Pull Requests" + Code Review
- "Milestones" for Release Planning
- "Releases" to publish sources and binaries
Sphinx for Docs & RTD
- sphinx-based docs (reST markup)
- README == elevator talk == docs intro
- documentation == home page
- borgbackup.readthedocs.io:
- docs automatically built from github repo
- docs for multiple releases available
- nice theme, works for mobile devices
- offers PDF (and other) docs downloads
- problematic: sphinx search could be better
- annoyance: rtd "edit on github" link is 404
asciinema demo "video"
- asciinema.org
- create cool demo "videos" of console tools
- embed them into home page
- static preview screen is selectable
- copy & paste works
- recording is readable/editable json (fix typos)
- speed factor to adjust playback speed
- small: commit it to your docs folder
Mailing List & IRC
- borgbackup@python.org mailing list + archive
  - support
  - discussion
  - (release) announcements
- #borgbackup @ chat.freenode.net IRC channel
  - support
  - discussion
  - faster-paced than ML
pytest & travis-ci.org
- travis-ci.org
  - automatically runs our tests
  - for commits
  - for pull requests
  - both on Linux and macOS
- unit tests via tox + pytest (pytest.org), minimal example below:
  - pretty unit tests
  - powerful framework
  - have fun writing unit tests
- tox tests on all python versions
- flake8 checks the style
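What "pretty unit tests" looks like in practice: plain functions and bare asserts. A trivial, made-up example (not from the Borg test suite):

```python
# test_example.py -- run with: python -m pytest test_example.py
import pytest

def chunk_sizes(total, chunk):
    """Split `total` bytes into pieces of at most `chunk` bytes."""
    full, rest = divmod(total, chunk)
    return [chunk] * full + ([rest] if rest else [])

def test_exact_multiple():
    assert chunk_sizes(10, 5) == [5, 5]

def test_remainder():
    assert chunk_sizes(11, 5) == [5, 5, 1]

def test_zero_chunk_size_raises():
    with pytest.raises(ZeroDivisionError):
        chunk_sizes(10, 0)
```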
vagrant + virtualbox
- automate VMs (start / ssh into it / destroy)
- provision VMs with the stuff needed
- execute tests or build steps
- automate once, have it reproducible
- test on:
- misc. Linux dists (old / new, 32 / 64bit, ...)
- FreeBSD, OpenBSD, NetBSD
- OS X
- Windows (native, Cygwin, W10 Lx SubSys)
- fewer surprises like "oh, it does not work on X?"
- use a PowerPC64 qemu VM with Debian to test on big-endian (BE) hardware (x86/x64 and most other platforms are LE)
pyenv
- pull and build any python version you want
- easily switch between versions
- tests: you want the minimum requirement
- older point release == more bugs / weirdnesses
- e.g. test on 3.4.0, 3.5.0, 3.6.0 to find all the issues
- build / bundle: you want the latest / best release
- build on 3.5.3 (or 3.6.1)
- get unusual versions not provided by linux dist
pyinstaller
- creates a single-file (or single directory) "binary" from:
- your Python / Cython / C code
- (C)Python Interpreter of your choice (latest?)
- all required Python stdlib libraries
- other libraries
- but not the (g)libc
- We use PyInstaller 3.2.1 with Python 3.5.3 on:
- Debian 7 Wheezy 32/64bit, glibc 2.13
- FreeBSD 10.3 64bit
- OS X 10.10 "Yosemite"
- Intentionally "old" systems, so anything equally old or newer usually works.
setuptools_scm
- tired of editing your version numbers?
- use setuptools_scm
- use git tags for your releases
- setuptools-scm does the rest:
- considers latest tag
- distance to that tag
- workdir state (uncommitted changes?)
- 1.2.3 (tagged release code)
- 1.2.4.dev3+gdeadbee (3 commits later)
- 1.2.4.dev3+gdeadbee.d20170709 (as above, plus unclean workdir)
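The wiring is just a few lines in setup.py; a generic setuptools_scm example, not Borg's actual setup.py:

```python
# setup.py -- the version comes from the latest git tag via setuptools_scm
from setuptools import setup

setup(
    name="example-project",        # hypothetical project name
    use_scm_version=True,          # yields e.g. 1.2.3 or 1.2.4.dev3+gdeadbee
    setup_requires=["setuptools_scm"],
)
```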
GPG Basics
- GnuPG (GPG) is public key cryptography software.
- A GnuPG keypair has:
- a secret key (usually passphrase protected)
- a public key (often published via keyserver)
- a public key fingerprint used to identify a PK
- Signing requires access to the secret key.
- If data has a valid GPG signature, this proves:
- signature was made by secret key holder
- data is authentic (unmodified, untampered) as it was at the time when the signature was made.
Secure Releasing
- users execute downloaded code (as root)
- give them a chance to verify it first and make sure it is authentic:
  - widely publish the release signing key fingerprint, upload the public key to a keyserver
  - git: tag and sign the release
  - release files: sign them with GPG
  - everybody can now use GPG to verify the signatures
- note: publishing hashes of files is often no protection against attacks (just against accidental corruption)
Python / Cython / C
- Python:
- most of the code, easy.
- Cython:
- write Python code, get C speed, easy "nogil"
- access C data types, functions via Python
- write high-level interface code for C libs
- we use it e.g. for OpenSSL, lz4 and our own C code (tiny sketch below)
- C:
- only for the most performance critical parts
- own C code, bundled C code
- aka "the danger zone"