Borg Backup
(a fork of Attic)
"I found the Holy Grail of backups."
(Stavros K. about Attic-Backup, 8/2013)
Thomas Waldmann (@home, 2021-07)
ThomasWaldmann.__doc__
- Doing Python since 2001,
Linux since it was on floppies,
loves FOSS.
- Projects:
- MoinMoin Wiki
- nsupdate.info
- bepasty
- vpngw
- BorgBackup
- Contact: tw @ waldmann-edv . de
- Python Developer (freelance & remote)
It's a backup tool,
one you might actually enjoy using.
- simple
- efficient
- secure
- safe
- free & open
simple
- each backup is a full backup
- FUSE = mount your backups
- easy pruning of old backups
- tooling: just borg, ssh, sh
- good help, manpages, docs
- single-file binary
- good fs / OS / arch support
efficient
- very fast for unchanged files
- chunk deduplication
- flexible compression
- sparse file support
- not flooding the fs cache
- sped up by a bit of C and Cython
- HW accelerated crypto
safe
- borg uses:
- checksums
- transactions
- fs: syncing, atomic ops
- backup repo = log-like KV store
- checkpoints while backing up
- off-site remote repositories
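A toy model of the "log-like KV store" plus transactions idea. This is a sketch under assumptions: the class name, the entry layout and the in-memory list are hypothetical and are not Borg's on-disk format. It only shows why uncommitted writes can be ignored after a crash.

```python
class LogStore:
    """Toy append-only log with PUT / DELETE / COMMIT entries."""

    def __init__(self):
        self.log = []  # stands in for the segment files on disk

    def put(self, key, value):
        self.log.append(("PUT", key, value))

    def delete(self, key):
        self.log.append(("DELETE", key, None))

    def commit(self):
        self.log.append(("COMMIT", None, None))

    def replay(self):
        """Rebuild the state, ignoring everything after the last COMMIT."""
        committed, working = {}, {}
        for op, key, value in self.log:
            if op == "PUT":
                working[key] = value
            elif op == "DELETE":
                working.pop(key, None)
            elif op == "COMMIT":
                committed = dict(working)  # transaction becomes visible
        return committed


store = LogStore()
store.put("chunk-a", b"1")
store.commit()
store.put("chunk-b", b"2")   # "crash" happens before the next commit
state = store.replay()       # only chunk-a survives
```

Because entries are only ever appended, an interrupted backup leaves the last committed state untouched.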
secure
- authenticated encryption
- nothing to see in the repo
- detect tampering / corruption
- ssh transport for remote repos
- append-only mode repos
- size obfuscation (borg 1.2)
- FOSS, you can see the code
Crypto
- client-side, metadata and data
- authenticated encryption (AE, EtM)
- aes256-ctr
- hmac-sha256 or blake2b
- counter management, never repeat
- encrypted key material:
- on client or in backup repo
- passphrase protection: pbkdf2 + AES
- OpenSSL 1.1, only libcrypto
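A minimal sketch of the encrypt-then-MAC (EtM) layout with a counter that never repeats. Assumptions: the function names, the blob layout (tag, nonce, ciphertext) and the 8-byte counter are hypothetical. The standard library has no AES, so a SHA-256 based toy keystream stands in for aes256-ctr. That stand-in is for illustration only and is not what Borg does.

```python
import hashlib
import hmac
import os

TAG_LEN = 32    # HMAC-SHA256 output size
NONCE_LEN = 8   # counter width in this sketch (an assumption)


def toy_keystream(enc_key, counter, length):
    """Stand-in for AES-CTR. Illustration only, NOT a vetted cipher."""
    out = bytearray()
    block = 0
    while len(out) < length:
        out += hashlib.sha256(
            enc_key + counter.to_bytes(NONCE_LEN, "big") + block.to_bytes(8, "big")
        ).digest()
        block += 1
    return bytes(out[:length])


def xor(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(enc_key, mac_key, counter, plaintext):
    """Encrypt first, then MAC nonce + ciphertext (encrypt-then-MAC)."""
    nonce = counter.to_bytes(NONCE_LEN, "big")
    ciphertext = xor(plaintext, toy_keystream(enc_key, counter, len(plaintext)))
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return tag + nonce + ciphertext


def open_sealed(enc_key, mac_key, blob):
    """Verify the MAC before touching the ciphertext, then decrypt."""
    tag = blob[:TAG_LEN]
    nonce = blob[TAG_LEN:TAG_LEN + NONCE_LEN]
    ciphertext = blob[TAG_LEN + NONCE_LEN:]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed: corrupted or tampered data")
    counter = int.from_bytes(nonce, "big")
    return xor(ciphertext, toy_keystream(enc_key, counter, len(ciphertext)))


enc_key, mac_key = os.urandom(32), os.urandom(32)
next_counter = 0                 # would be persisted so a value is never reused
blob = seal(enc_key, mac_key, next_counter, b"secret chunk")
next_counter += 1
tampered = blob[:-1] + bytes([blob[-1] ^ 1])   # flip one bit
```

Calling `open_sealed` on `blob` returns the plaintext, while `tampered` raises `ValueError` before any decryption happens.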
Compression
- chunk-based (not full file)
- algorithms: lz4, zstd, zlib, lzma, none
- with lz4 often faster than with none!
- "auto" mode:
- first use lz4 as predictor
- if compressible: expensive compression
- "borg recreate" can recompress
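The "auto" idea can be sketched as follows. Assumptions: lz4 is not in the Python standard library, so zlib at level 1 stands in as the cheap predictor and lzma as the expensive compressor. The function name and the 0.97 threshold are hypothetical choices for this sketch.

```python
import lzma
import os
import zlib


def compress_auto(chunk, threshold=0.97):
    """Return (algorithm_name, payload) for one chunk."""
    probe = zlib.compress(chunk, 1)          # cheap predictor pass (lz4 in Borg)
    if len(probe) < threshold * len(chunk):
        return "lzma", lzma.compress(chunk)  # worth the expensive pass
    return "none", chunk                     # looks incompressible: store as-is


text_kind, text_payload = compress_auto(b"borg backup " * 1000)
rand_kind, rand_payload = compress_auto(os.urandom(4096))
```

Repetitive text goes through the expensive compressor, while random (already compressed or encrypted) data skips it and saves CPU time.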
Deduplication in SpaceTime
- Deduplication "dimensions":
- inner deduplication of data set
- copies of files, similar files
- lots of zeros (sparse or not)
- historical deduplication in backup repo
- many files don't change over time
- they are all in each of your full backups
- deduplication between machines
- just moved that big directory from m1 to m2?
- same OS or data files everywhere?
- will all dedup if machines share a backup repo.
Borg Deduplication
- Content defined Chunk Deduplication
- cut a file into variable sized chunks,
content defines where a cut happens
(efficiently done using a rolling hash)
- MAC(chunk) is the key for the KV store
- No problem with:
- inserted / deleted / shifted file contents
- renamed files / dirs
- VM disk images: only few chunks change
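A small sketch of content-defined chunking plus a KV store keyed by MAC(chunk). Assumptions: this uses a simple polynomial rolling hash with tiny illustrative parameters, not Borg's actual chunker. The names `chunkify`, `store` and the id key are hypothetical.

```python
import hashlib
import hmac
import random

WINDOW = 16   # bytes covered by the rolling hash
MASK = 0xFF   # cut where the low 8 hash bits are zero (roughly 256-byte chunks)
BASE = 257
MOD = (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunkify(data):
    """Split data into variable-sized chunks at content-defined cut points."""
    out = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out


def store(repo, id_key, data):
    """Add data to the repo dict; return how many chunks were actually new."""
    new = 0
    for chunk in chunkify(data):
        chunk_id = hmac.new(id_key, chunk, hashlib.sha256).digest()
        if chunk_id not in repo:
            repo[chunk_id] = chunk
            new += 1
    return new


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"inserted!!" + original   # everything moves by 10 bytes

repo = {}
id_key = b"hypothetical id key"
new_first = store(repo, id_key, original)
new_second = store(repo, id_key, shifted)
print(new_first, new_second)
```

With fixed-size blocks, the 10 inserted bytes would shift every block boundary and nothing would deduplicate. Here the cut points follow the content, so only the chunks near the insertion are new.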
Borg assimilated Data
Real stats from a real backup repository (shortened).
2 machines, 147 backup archives, 2.5 years.
$ borg info ssh://borg@myserver/repos/myrepo
               Original size  Compressed size  Dedup size
All archives:       22.76 TB         18.22 TB   486.20 GB
               Unique chunks     Total chunks
Chunk index:         6305006        272643223
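The ratios behind these numbers, as a quick calculation. Assumption: TB and GB are treated as decimal units (10^12 and 10^9 bytes).

```python
TB = 10**12
GB = 10**9

original = 22.76 * TB        # sum of all archives as seen by the user
compressed = 18.22 * TB      # after compression only
deduplicated = 486.20 * GB   # what the repository actually stores

unique_chunks = 6305006
total_chunks = 272643223

compression_ratio = original / compressed     # about 1.25
overall_ratio = original / deduplicated       # about 47
chunk_reuse = total_chunks / unique_chunks    # about 43 references per chunk

print(f"compression alone: {compression_ratio:.2f}x")
print(f"compression + dedup: {overall_ratio:.1f}x")
print(f"average references per unique chunk: {chunk_reuse:.1f}")
```

Compression alone saves about a fifth. Nearly all of the reduction comes from deduplication across the 147 archives.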
Borg now
- 1.1: current stable release
- getting fixes primarily
- getting few new features,
if uncritical to stability
- branch 1.1-maint in the repo
- 1.0: oldstable
- not supported any more
- 1.2: in beta testing now
- master branch in repo
Borg 1.2
- lots of code cleanups / refactoring
- setup.py cleaned + pkgconfig now
- internal AEAD-style crypto API
- compat layer for msgpack
- FD (not: path) based operation,
fewer race conditions!
- separated recursion and processing
- using pyfuse3 (new) and also
llfuse (deprecated, fuse2, widespread)
- minimal native windows support
Borg 1.2
- optional chunk size obfuscation
- compact_segments improved:
- separate cli command now
- more stable segments
- faster. separated manifest/commits.
- create --paths-from-stdin
- ctrl-c: checkpoint, then abort
- incremental, time-limited repo check
- prune: show rule
- fixed blocksize chunker (disks!)
Borg Future
- 1.1 + 1.2 maintenance
- crypto improvements
- multithreading
- details: see github milestones
Borg the Project ->
Borg Internals & Ideas v
Error Correction?
- borg does error (and even tampering) detection
- but not (yet?) error correction
- kinds of errors / threat model:
- single/few bit errors
- defective / unreadable blocks
- media failure (defective disk, ssd)
- see issue #225 for discussion
- implement something in borg?
- rely on other soft- or hardware solutions?
- avoid futile attempts, borg is application level
Modernize Crypto
- sha256, hmac-sha256 is slow
- solved: borg 1.1 added blake2b
- zlib crc32 is slow
- solved: borg 1.1 added fast crc32 C code
- AES-CTR + MAC 2-pass AE can be slow
- todo: borg helium will use OpenSSL 1.1 for:
- AES-OCB (very fast, if hw accelerated)
- chacha20-poly1305 (quite fast w/o hw accel.)
- key / cipher agility (todo, borg helium)
Key Gen. / Management
- currently:
- 1 AES key
- 1 MAC key
- 1 chunker seed
- stored highest IV value for AES CTR mode
- encrypted using key passphrase
- ideas:
- session keys? always start from IV=0.
- per archive? per thread? per chunk?
- asymm. crypto: encrypt these keys for receiver
RAM consumption
- bigger chunks (e.g. 2MiB, default) == lower RAM needs
- smaller chunks (e.g. 64kiB) == higher RAM needs
- chunks, files and repo index kept in memory
- fewer chunks to manage -> smaller chunks index.
- be careful on small machines (NAS, raspi, ...)
- or with huge amount of data / huge file count
- in the docs, there is a formula to estimate RAM usage
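A rough illustration of why chunk size drives index size. This is not the formula from the docs. Assumptions: no deduplication, every chunk has exactly the average size, and `BYTES_PER_ENTRY` is a made-up per-entry cost used only to show the scaling.

```python
KiB, MiB, TiB = 2**10, 2**20, 2**40

BYTES_PER_ENTRY = 64  # hypothetical index cost per chunk, not a Borg constant


def chunk_count(data_bytes, avg_chunk_bytes):
    """Number of chunks needed for the data (ceiling division)."""
    return -(-data_bytes // avg_chunk_bytes)


def index_estimate(data_bytes, avg_chunk_bytes, bytes_per_entry=BYTES_PER_ENTRY):
    """Very rough index size: one fixed-size entry per chunk."""
    return chunk_count(data_bytes, avg_chunk_bytes) * bytes_per_entry


big = chunk_count(1 * TiB, 2 * MiB)      # 524288 chunks
small = chunk_count(1 * TiB, 64 * KiB)   # 16777216 chunks
print(big, small, small // big)
print(index_estimate(1 * TiB, 2 * MiB) // MiB, "MiB vs",
      index_estimate(1 * TiB, 64 * KiB) // MiB, "MiB")
```

Going from 2 MiB to 64 KiB chunks means 32 times as many index entries for the same data. For real planning, use the formula in the docs.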
Hash Tables
- own hash table implementation in C
- compact block of memory, no pyobj overhead
- e.g. used for the chunks index, repo index
- uses closed hashing (bucket array, no linked lists)
- uses linear probing for collision handling
- HT performance difficult to measure
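A Python sketch of closed hashing with linear probing in one compact block of memory. Assumptions: the class name, the 32-byte key plus uint32 value layout and the all-zero "empty" marker are hypothetical. Borg's real table is C code and also handles deletion and resizing, which this sketch leaves out.

```python
import hashlib
import struct

KEY_SIZE = 32
ENTRY = struct.Struct("<32sI")   # 32-byte key + uint32 value = 36 bytes
EMPTY = b"\0" * KEY_SIZE         # assumption: all-zero key marks a free bucket


class ProbingTable:
    def __init__(self, buckets=64):
        self.buckets = buckets
        self.used = 0
        self.mem = bytearray(buckets * ENTRY.size)   # one flat block, no objects

    def _slot(self, key):
        """Return (bucket index, found) using linear probing."""
        i = int.from_bytes(key[:4], "little") % self.buckets
        for _ in range(self.buckets):
            stored, _value = ENTRY.unpack_from(self.mem, i * ENTRY.size)
            if stored == key:
                return i, True
            if stored == EMPTY:
                return i, False
            i = (i + 1) % self.buckets   # collision: try the next bucket
        raise MemoryError("table full (no resizing in this sketch)")

    def set(self, key, value):
        i, found = self._slot(key)
        ENTRY.pack_into(self.mem, i * ENTRY.size, key, value)
        if not found:
            self.used += 1

    def get(self, key, default=None):
        i, found = self._slot(key)
        if not found:
            return default
        return ENTRY.unpack_from(self.mem, i * ENTRY.size)[1]


table = ProbingTable(buckets=8)
ids = [hashlib.sha256(bytes([n])).digest() for n in range(5)]
for n, chunk_id in enumerate(ids):
    table.set(chunk_id, n * 100)
```

Every entry costs exactly 36 bytes here, with no per-object overhead, and colliding keys simply move on to the next bucket instead of forming linked lists.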
Chunk Index Sync
- problem: multiple clients updating same repo
- then: chunk index needs to get re-synced
- slow, esp. if remote, many and/or big archives
- local collection of single-archive chunk indexes
- needs lots of space, merging still expensive
- idea: "build chunks index from repo index"
- repo index knows all chunk IDs
- but: no size/csize info in repo index
- XXX TODO do we have this in 1.1?
Python / Cython / C
- Python (90%):
- easy, high level logic
- Cython (5%):
- write pythonic code, get C-ish speed
- access C data types, functions, easy "nogil"
- simple interface code for C libs,
we use that for OpenSSL, lz4 and own C code.
- C (5%):
- used for the most resource-usage critical parts
(CPU as well as RAM usage)
- own C code, bundled C code
- hard to maintain, debug
pytest & tox
- pytest:
- pretty and simple tests,
less boilerplate than stdlib "unittest"
- powerful framework
- have fun writing tests
- optionally remote and parallel tests
- tox:
- automates testing on all python versions
- each in a freshly built virtual env
- plus flake8 checker, for pep8 and more
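To make the "less boilerplate" point concrete, here is the same check written both ways. Assumption: `parse_size` is a hypothetical helper invented for this example and is not Borg code.

```python
import unittest


def parse_size(text):
    """Hypothetical helper under test: '2MiB' -> 2097152."""
    units = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)


# pytest style: a plain function with plain asserts, collected by its name.
def test_parse_size():
    assert parse_size("64KiB") == 65536
    assert parse_size("2MiB") == 2097152
    assert parse_size("512") == 512


# unittest style: the same checks need a class and assert* methods.
class ParseSizeTest(unittest.TestCase):
    def test_parse_size(self):
        self.assertEqual(parse_size("64KiB"), 65536)
        self.assertEqual(parse_size("2MiB"), 2097152)
        self.assertEqual(parse_size("512"), 512)
```

pytest can run both styles, so existing unittest-based tests do not have to be rewritten at once.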
pyenv
- pull and build any python version you want
- easily switch between versions
- test on minimum requirement:
- older point release == more bugs
- py 3.[6789].0 to find all the issues
- build / bundle on latest / greatest release:
- newer point release == fewer bugs
- 3.7.latest to get best build
vagrant, vbox, qemu
- automate VMs:
create, start, provision, ..., shutdown, destroy
- e.g. run tests / builds on:
- Linux (misc. dists, old / new, 32 / 64bit, ...)
- BSD (FreeBSD, OpenBSD, NetBSD)
- macOS
- OpenIndiana
- Windows (maybe)
- PowerPC64 qemu VM with Debian to test on non-x86/x64 BE arch (most archs are LE).
- fewer surprises "oh, it does not work on X?"
pyinstaller
- creates a single-file binary, bundling together:
- your Python / Cython / C code
- (C)Python Interpreter of your choice
- all required Python stdlib libraries
- other required libraries
- but not the (g)libc
- additionally, create single-directory binary dist
(faster startup, no temp-unpacking needed)
- We use it to build Linux, FreeBSD, macOS borg binaries.
- Intentionally build on "old" OS:
- all as-old or newer deployments will usually work
- preferably not too old: security updates wanted
Secure Releasing with GPG
- creepy: users execute downloaded blobs, as root.
- give them a chance to make sure it is authentic:
- release signing key fingerprint widely published
- public key uploaded to keyserver
- document how to use GPG to verify the signatures
- git repo: sign the release tags (or every commit)
- release files: sign them, detached sig
- note: just publishing hashes of files is no protection against attacks (just against accidental corruption)
setuptools_scm
- tired of bumping your version numbers?
- setuptools_scm makes versions from git tags:
- considers latest tag
- distance to that tag (commits)
- workdir state (uncommitted changes?)
- 1.2.3 (tagged release code)
- 1.2.4.dev3+gdeadbee (3 commits later)
- 1.2.4.dev3+gdeadbee.d20170709 (same, plus uncommitted changes)
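How such version strings can be derived from the three inputs above. This is a re-implementation sketch, not setuptools_scm's API. The function name and parameters are hypothetical, and the "exact tag but dirty workdir" case is an assumption because it does not appear in the examples.

```python
def next_dev_version(tag, distance, node, dirty=False, date=""):
    """Build a version string from tag, commit distance and workdir state."""
    if distance == 0:
        version = tag
        # assumed behaviour for a dirty checkout of a tagged commit
        local = f"d{date}" if dirty else ""
    else:
        head, last = tag.rsplit(".", 1)
        # guess the next release and mark this as a development version
        version = f"{head}.{int(last) + 1}.dev{distance}"
        local = f"g{node}" + (f".d{date}" if dirty else "")
    return f"{version}+{local}" if local else version


print(next_dev_version("1.2.3", 0, "deadbee"))
print(next_dev_version("1.2.3", 3, "deadbee"))
print(next_dev_version("1.2.3", 3, "deadbee", dirty=True, date="20170709"))
```

The three calls reproduce the three example versions listed above.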
Sphinx / Docs
- sphinx - generates html/pdf from reST
- reuse your ArgParser help:
- build_usage: html cli usage docs
- build_man: man pages
- see archiver.py and setup.py
- reuse your github README:
- include it as docs intro
- it's your "elevator speech"
- reuse your docs:
- nice hosted docs can be your "home page"
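The "reuse your ArgParser help" idea in miniature. Assumptions: the `demo-backup` parser and the `usage_rst` function are hypothetical and much simpler than the build_usage code in Borg's setup.py. The sketch only shows that argparse help text can be turned into a reST page that Sphinx can include.

```python
import argparse


def build_parser():
    """A hypothetical CLI, standing in for the real archiver parser."""
    parser = argparse.ArgumentParser(
        prog="demo-backup", description="Hypothetical backup tool.")
    parser.add_argument("--compression", default="lz4",
                        help="compression algorithm to use")
    parser.add_argument("path", nargs="+", help="paths to back up")
    return parser


def usage_rst(parser):
    """Render the parser's help as a reST section with a literal block."""
    title = parser.prog
    lines = [title, "=" * len(title), "", "::", ""]
    # indent the help text so reST treats it as a literal block
    lines += ["    " + line if line else "" for line in parser.format_help().splitlines()]
    return "\n".join(lines) + "\n"


rst = usage_rst(build_parser())
print(rst)
```

Since the docs are generated from the same parser the program uses, the CLI help and the published usage pages cannot drift apart.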
ReadTheDocs.org
- builds and hosts your docs:
- url like borgbackup.readthedocs.io
- automatically built from github tags
- version selector, stable, latest
- nice theme (also for mobile devices)
- offers PDF (and other) downloads
- uses sphinx
asciinema.org
- create cool demo "videos" of console tools
- embed them on:
- home page
- docs intro
- github README
- selectable static preview screen
- adjustable playback speed
- copy & paste works from the "video"
- recording is json:
- fix typos by editing it
- commit it to your repo
GitHub
- "Organisation" == a common ground
- Main Repo "borg" with good README
- Repo "community" with links to related stuff
- "Issues" (+ Labels)
- bugs / todo / planning
- ideas / feature requests / discussion
- questions -> docs enhancements
- bounties $$ via bountysource.com
- "Pull Requests" + Code Review
- "Milestones" for Release Planning
- "Releases" to publish changelog link, src, bin
- "Actions": CI (was on travis-ci before)
Communication Channels
- Mailing List and Archive:
- borgbackup @ python.org
- slow, async, permanent
- IRC (and also matrix):
- #borgbackup @ irc.libera.chat
- quicker, sync/async, transient
- Twitter:
- @borgbackup on Twitter
- Usages:
- support
- discussion
- (release) announcements
Borg - you can be assimilated!
- test scalability / reliability / security
- find, file and fix bugs
- file and implement feature requests
- improve docs
- contribute or review code
- spread the word
- create dist packages
- care for misc. platforms (windows)
- donate funds via bountysource
For more information:
borgbackup.org
Questions / Feedback?
- tw @ waldmann-edv . de
- Thomas J Waldmann @ twitter