Borg Backup

(a fork of Attic)

"I found the Holy Grail of backups."

(Stavros K. about Attic-Backup, 8/2013)

Thomas Waldmann (@home, 2021-07)

ThomasWaldmann.__doc__

  • Doing Python since 2001,
    Linux since it was on Floppies,
    loves FOSS.
  • Projects:
    • MoinMoin Wiki
    • nsupdate.info
    • bepasty
    • vpngw
    • BorgBackup
  • Contact:  tw @ waldmann-edv . de
  • Python Developer  (freelance & remote)

It's a backup tool

-

one you might actually enjoy using.

  • simple
  • efficient
  • secure
  • safe
  • free & open

simple

  • each backup is a full backup
  • FUSE = mount your backups
  • easy pruning of old backups
  • tooling:  just borg, ssh, sh
  • good help, manpages, docs
  • single-file binary
  • good fs / OS / arch support

efficient

  • very fast for unchanged files
  • chunk deduplication
  • flexible compression
  • sparse file support
  • not flooding the fs cache
  • sped up by a bit of C and Cython
  • HW accelerated crypto

safe

  • borg uses:
    • checksums
    • transactions
    • fs: syncing, atomic ops
  • backup repo = log-like KV store
  • checkpoints while backing up
  • off-site remote repositories
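
The "log-like KV store" with checksums and transactions can be pictured with a small sketch. Everything in it is a simplification for illustration: the entry layout, the names `append` and `replay`, and the in-memory `bytearray` log are assumptions, not borg's actual on-disk format.

```python
import struct
import zlib

PUT, DELETE, COMMIT = 0, 1, 2
HEADER_SIZE = 9  # crc32 (4 bytes) + tag (1 byte) + payload length (4 bytes)


def append(log: bytearray, tag: int, key: bytes = b"", value: bytes = b"") -> None:
    """Append one checksummed entry to the end of the log."""
    payload = struct.pack("<H", len(key)) + key + value
    body = struct.pack("<BI", tag, len(payload)) + payload
    log += struct.pack("<I", zlib.crc32(body)) + body


def replay(log) -> dict:
    """Rebuild the key/value state, keeping only committed transactions."""
    committed, working = {}, {}
    pos = 0
    while pos + HEADER_SIZE <= len(log):
        crc, tag, length = struct.unpack_from("<IBI", log, pos)
        end = pos + HEADER_SIZE + length
        if end > len(log) or zlib.crc32(log[pos + 4:end]) != crc:
            break  # torn or corrupted entry: ignore it and everything after
        klen, = struct.unpack_from("<H", log, pos + HEADER_SIZE)
        key_start = pos + HEADER_SIZE + 2
        key = bytes(log[key_start:key_start + klen])
        value = bytes(log[key_start + klen:end])
        if tag == PUT:
            working[key] = value
        elif tag == DELETE:
            working.pop(key, None)
        elif tag == COMMIT:
            committed = dict(working)  # transaction boundary
        pos = end
    return committed
```

Writes that were never followed by a COMMIT, or that fail their checksum, simply do not show up after a replay. That is the property an interrupted backup relies on.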

secure

  • authenticated encryption
    • nothing to see in the repo
    • detect tampering / corruption
  • ssh transport for remote repos

  • append-only mode repos

  • size obfuscation (borg 1.2)

  • FOSS, you can see the code

Crypto

  • client-side, metadata and data
  • authenticated encryption (AE, EtM)
    • aes256-ctr
    • hmac-sha256 or blake2b
    • counter management, never repeat
  • encrypted key material:
    • on client or in backup repo
    • passphrase protection: pbkdf2 + AES
  • OpenSSL 1.1, only libcrypto
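
To make the "EtM" (encrypt-then-MAC) idea concrete, here is a minimal sketch of the MAC half only, using the Python stdlib. The AES-CTR step is deliberately left out and the ciphertext is treated as opaque bytes. The names `seal` and `open_sealed` and the tag-first layout are assumptions for illustration, not borg's format.

```python
import hashlib
import hmac

MAC_SIZE = 32  # HMAC-SHA256 tag length in bytes


def seal(mac_key: bytes, ciphertext: bytes) -> bytes:
    """Prepend a MAC computed over the ciphertext (encrypt first, then MAC)."""
    tag = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    return tag + ciphertext


def open_sealed(mac_key: bytes, blob: bytes) -> bytes:
    """Verify the MAC before anything else touches the ciphertext."""
    tag, ciphertext = blob[:MAC_SIZE], blob[MAC_SIZE:]
    expected = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        raise ValueError("MAC mismatch: data corrupted or tampered with")
    return ciphertext  # only now would this be handed to decryption
```

Because the check runs before decryption, a flipped bit anywhere in the stored blob is rejected rather than decrypted into garbage.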

Compression

  • chunk-based (not full file)
  • algorithms: lz4, zstd, zlib, lzma, none
  • with lz4 often faster than with none!
  • "auto" mode:
    • first use lz4 as predictor
    • if compressible: expensive compression
  • "borg recreate" can recompress
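
The "auto" heuristic can be sketched in a few lines. Since lz4 is not in the Python stdlib, this sketch swaps in zlib at level 1 as the cheap predictor and lzma as the expensive algorithm. The function name `compress_auto` and the 0.97 threshold are assumptions for illustration.

```python
import lzma
import zlib


def compress_auto(chunk: bytes, threshold: float = 0.97):
    """Return (algorithm_name, stored_bytes) for one chunk."""
    probe = zlib.compress(chunk, 1)  # cheap predictor, stand-in for lz4
    if len(probe) >= threshold * len(chunk):
        return "none", chunk  # looks incompressible: store as-is
    return "lzma", lzma.compress(chunk)  # compressible: spend the CPU time
```

Already compressed or encrypted input (media files, archives) fails the cheap probe and skips the expensive pass entirely, which is where the time savings come from.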

Deduplication in SpaceTime

  • Deduplication "dimensions":
    • inner deduplication of data set
      • copies of files, similar files
      • lots of zeros (sparse or not)
    • historical deduplication in backup repo
      • many files don't change over time
      • they are all in each of your full backups
    • deduplication between machines
      • just moved that big directory from m1 to m2?
      • same OS or data files everywhere?
      • will all dedup if machines share a backup repo.

Borg Deduplication

  • Content defined Chunk Deduplication
    • cut a file into variable sized chunks,
      content defines where a cut happens
      (efficiently done using a rolling hash)
    • MAC(chunk) is the key for the KV store
  • No problem with:
    • inserted / deleted / shifted file contents
    • renamed files / dirs
    • VM disk images: only few chunks change
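
A toy version of content-defined chunking shows why shifted content is no problem. This is a sketch under stated assumptions: it uses a simple polynomial rolling hash rather than borg's own chunker, and the window, mask and size limits are made-up small values so the effect is visible on little data. `chunkify` and `chunk_ids` are hypothetical names.

```python
import hashlib
import hmac

WINDOW = 48                       # rolling hash window in bytes
MASK = (1 << 11) - 1              # cut when the low 11 hash bits are zero
MIN_SIZE, MAX_SIZE = 256, 16384   # chunk size limits in bytes
BASE, MOD = 257, (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunkify(data: bytes):
    """Yield variable-sized chunks; cut points depend on nearby content only."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        size = i - start + 1
        if size > WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


def chunk_ids(data: bytes, mac_key: bytes) -> list:
    """MAC(chunk) serves as the key under which a chunk would be stored."""
    return [hmac.new(mac_key, c, hashlib.sha256).digest() for c in chunkify(data)]
```

If a few bytes are inserted at the start of a file, only the chunks around the insertion get new IDs. Later cut points fall on the same content as before, so those chunks are found in the store and not written again.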

Borg assimilated Data

Real stats from a real backup repository (shortened):
2 machines, 147 backup archives, 2.5 years.

 $ borg info ssh://borg@myserver/repos/myrepo

                Original size   Compressed size   Dedup size
 All archives:      22.76 TB          18.22 TB    486.20 GB

                      Unique chunks         Total chunks
 Chunk index:              6305006            272643223
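
The numbers in this `borg info` output can be turned into two ratios with a little arithmetic. This is a sketch and assumes the TB and GB figures are decimal units.

```python
TB, GB = 10**12, 10**9  # decimal units (assumed)

original_size = 22.76 * TB   # "All archives", original size
dedup_size = 486.20 * GB     # "All archives", deduplicated size
unique_chunks = 6_305_006
total_chunks = 272_643_223

# How many times larger all archives are than the space the repo needs.
space_factor = original_size / dedup_size
# How often an average stored chunk is referenced.
chunk_reuse = total_chunks / unique_chunks

print(f"space factor: {space_factor:.1f}x, chunk reuse: {chunk_reuse:.1f}x")
```

Both ratios land in the mid-forties: the archives reference roughly 47 times more data than the repository stores, and each unique chunk is used about 43 times on average.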

Borg now

  • 1.1: current stable release
    • getting fixes primarily
    • getting few new features,
      if uncritical to stability
    • branch 1.1-maint in the repo
  • 1.0: oldstable
    • not supported any more
  • 1.2: in beta testing now
    • master branch in repo

Borg 1.2

  • lots of code cleanups / refactoring
  • setup.py cleaned + pkgconfig now
  • internal AEAD-style crypto API
  • compat layer for msgpack
  • FD (not: path) based operation,
    fewer race conditions!
  • separated recursion and processing
  • using pyfuse3 (new) and also
    llfuse (deprecated, fuse2, widespread)
  • minimal native windows support

Borg 1.2

  • optional chunk size obfuscation
  • compact_segments improved:
    • separate cli command now
    • more stable segments
    • faster. separated manifest/commits.
  • create --paths-from-stdin
  • ctrl-c: checkpoint, then abort
  • incremental, time-limited repo check
  • prune: show rule
  • fixed blocksize chunker (disks!)

Borg Future

  • 1.1 + 1.2 maintenance
  • crypto improvements
  • multithreading
  • details: see github milestones

Borg the Project ->

Borg Internals & Ideas  v

Error Correction?

  • borg does error (and even tampering) detection

  • but not (yet?) error correction

  • kinds of errors / threat model:

    • single/few bit errors

    • defect / unreadable blocks

    • media failure (defect disk, ssd)

  • see issue #225 for discussion

  • implement something in borg?

  • rely on other soft- or hardware solutions?

  • avoid futile attempts, borg is application level

Modernize Crypto

  • sha256, hmac-sha256 are slow

    • solved: borg 1.1 added blake2b

  • zlib crc32 is slow

    • solved: borg 1.1 added fast crc32 C code

  • AES-CTR + MAC 2-pass AE can be slow

    • todo: borg helium will use OpenSSL 1.1 for:

      • AES-OCB (very fast, if hw accelerated)

      • chacha20-poly1305 (quite fast w/o hw accel.)

  • key / cipher agility (todo, borg helium)

Key Gen. / Management

  • currently:

    • 1 AES key

    • 1 MAC key

    • 1 chunker seed

    • stored highest IV value for AES CTR mode

    • encrypted using key passphrase

  • ideas:
    • session keys? always start from IV=0.
    • per archive? per thread? per chunk?
    • asymm. crypto: encrypt these keys for receiver

RAM consumption

  • bigger chunks (e.g. 2MiB, default) == lower RAM needs

  • smaller chunks (e.g. 64kiB) == higher RAM needs

  • chunks, files and repo index kept in memory

  • fewer chunks to manage -> smaller chunks index

  • be careful on small machines (NAS, raspi, ...)

  • or with huge amount of data / huge file count

  • in the docs, there is a formula to estimate RAM usage
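
The shape of such an estimate can be sketched as below. The per-chunk and per-file byte costs are placeholder assumptions and differ between borg versions, so take the real constants from the docs of the version you run. `estimate_ram_bytes` is a hypothetical helper, not part of borg.

```python
def estimate_ram_bytes(total_size, file_count, chunk_size,
                       bytes_per_chunk=164, bytes_per_file=240):
    """Rough RAM estimate: index entries per chunk plus cache entries per file.

    bytes_per_chunk and bytes_per_file are illustrative defaults (assumed).
    """
    chunk_count = total_size // chunk_size
    return chunk_count * bytes_per_chunk + file_count * bytes_per_file


TIB = 2**40
big = estimate_ram_bytes(1 * TIB, 1_000_000, chunk_size=2 * 2**20)  # 2 MiB chunks
small = estimate_ram_bytes(1 * TIB, 1_000_000, chunk_size=64 * 2**10)  # 64 KiB chunks
print(f"2 MiB chunks: {big / 2**20:.0f} MiB, 64 KiB chunks: {small / 2**20:.0f} MiB")
```

With these placeholder constants, going from 2 MiB to 64 KiB chunks multiplies the chunk count by 32, and the per-chunk part of the estimate grows with it.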

Hash Tables

  • own hash table implementation in C

  • compact block of memory, no pyobj overhead

  • e.g. used for the chunks index, repo index

  • uses closed hashing (bucket array, no linked lists)

  • uses linear probing for collision handling

  • HT performance difficult to measure
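
For readers who have not met closed hashing with linear probing, here is a minimal Python sketch of the idea: fixed-size buckets in one flat `bytearray`, no per-entry objects. The class name `FlatHashIndex`, the bucket layout and the choice of the first 4 key bytes as bucket hash are assumptions for illustration. The sketch omits deletion and resizing.

```python
import struct

KEY_SIZE = 32                    # e.g. a 256-bit chunk ID
BUCKET_SIZE = 1 + KEY_SIZE + 4   # used flag + key + uint32 value


class FlatHashIndex:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.buckets = bytearray(capacity * BUCKET_SIZE)  # one compact block
        self.count = 0

    def _probe(self, key):
        """Return (offset, used) of the key's bucket or the first free one."""
        idx = int.from_bytes(key[:4], "little") % self.capacity
        for _ in range(self.capacity):
            off = idx * BUCKET_SIZE
            used = self.buckets[off] == 1
            if not used or self.buckets[off + 1:off + 1 + KEY_SIZE] == key:
                return off, used
            idx = (idx + 1) % self.capacity  # linear probing: try next bucket
        raise RuntimeError("hash table full")

    def __setitem__(self, key, value):
        off, used = self._probe(key)
        if not used:
            self.buckets[off] = 1
            self.buckets[off + 1:off + 1 + KEY_SIZE] = key
            self.count += 1
        struct.pack_into("<I", self.buckets, off + 1 + KEY_SIZE, value)

    def get(self, key, default=None):
        off, used = self._probe(key)
        if not used:
            return default
        return struct.unpack_from("<I", self.buckets, off + 1 + KEY_SIZE)[0]
```

The keys are already uniformly distributed hashes, so a slice of the key can serve as the bucket hash without further mixing.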

Chunk Index Sync

  • problem: multiple clients updating same repo

  • then: chunk index needs to get re-synced

  • slow, esp. if remote, many and/or big archives

  • local collection of single-archive chunk indexes

  • needs lots of space, merging still expensive

  • idea: "build chunks index from repo index"
    • repo index knows all chunk IDs

    • but: no size/csize info in repo index

    • XXX TODO do we have this in 1.1?

Python / Cython / C

  • Python (90%):
    • easy, high level logic
  • Cython (5%):
    • write pythonic code, get C-ish speed
    • access C data types, functions, easy "nogil"
    • simple interface code for C libs,
      we use that for OpenSSL, lz4 and own C code.
  • C (5%):
    • used for the most resource-usage critical parts
      (CPU as well as RAM usage)
    • own C code, bundled C code
    • hard to maintain, debug

pytest  &  tox

  • pytest:
    • pretty and simple tests,
      less boilerplate than stdlib "unittest"
    • powerful framework
    • have fun writing tests
    • optionally remote and parallel tests
  • tox:
    • automates testing on all python versions
    • each in a freshly built virtual env
    • plus flake8 checker, for pep8 and more

pyenv

  • pull and build any python version you want
  • easily switch between versions
  • test on minimum requirement:
    • older point release == more bugs
    • py 3.[6789].0 to find all the issues
  • build / bundle on latest / greatest release:
    • newer point release == fewer bugs
    • 3.7.latest to get best build

vagrant, vbox, qemu

  • automate VMs:
    create, start, provision, ..., shutdown, destroy
  • e.g. run tests / builds on:
    • Linux (misc. dists, old / new, 32 / 64bit, ...)
    • BSD (FreeBSD, OpenBSD, NetBSD)
    • macOS
    • OpenIndiana
    • Windows (maybe)
  • PowerPC64 qemu VM with Debian to test on non-x86/x64 BE arch (most archs are LE).
  • fewer surprises "oh, it does not work on X?"

pyinstaller

  • creates a single-file binary, bundling together:
    • your Python / Cython / C code
    • (C)Python Interpreter of your choice
    • all required Python stdlib libraries
    • other required libraries
    • but not the (g)libc
  • additionally, create single-directory binary dist
    (faster startup, no temp-unpacking needed)
  • We use it to build Linux, FreeBSD, macOS borg binaries.
  • Intentionally build on "old" OS:
    • all as-old or newer deployments will usually work
    • preferably not too old: security updates wanted

Secure Releasing with GPG

  • creepy:  users execute downloaded blobs, as root.
  • give them a chance to make sure it is authentic:
    • release signing key fingerprint widely published
    • public key uploaded to keyserver
    • document how to use GPG to verify the signatures
  • git repo:  sign the release tags  (or every commit)
  • release files:  sign them, detached sig
  • note: just publishing hashes of files is no protection against attacks (only against accidental corruption)

setuptools_scm

  • tired of bumping your version numbers?
  • setuptools_scm makes versions from git tags:
    • considers latest tag
    • distance to that tag (commits)
    • workdir state (uncommitted changes?)
  • 1.2.3 (tagged release code)

  • 1.2.4.dev3+gdeadbee (3 commits later)

  • 1.2.4.dev3+gdeadbee.d20170709 (same, plus uncommitted changes)
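
The three example versions follow one simple rule, sketched below. This is not setuptools_scm's API: `describe_version` and its parameters are hypothetical, and only the three cases from the list above are modelled.

```python
def describe_version(tag, distance=0, node="", dirty=False, date=""):
    """Build a version string from the latest tag, commit distance and workdir state."""
    if distance == 0 and not dirty:
        return tag  # exactly on a tag, clean workdir: use the tag itself
    major, minor, patch = (int(part) for part in tag.split("."))
    # Past the tag: this is a dev version of the *next* patch release.
    version = f"{major}.{minor}.{patch + 1}.dev{distance}+g{node}"
    if dirty:
        version += f".d{date}"  # uncommitted changes: append the date
    return version


print(describe_version("1.2.3"))
print(describe_version("1.2.3", distance=3, node="deadbee"))
print(describe_version("1.2.3", distance=3, node="deadbee", dirty=True, date="20170709"))
```

The three calls reproduce the three version strings from the list.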

Sphinx / Docs

  • sphinx - generates html/pdf from reST
  • reuse your ArgParser help:
    • build_usage:  html cli usage docs
    • build_man:  man pages
    • see archiver.py and setup.py
  • reuse your github README:
    • include it as docs intro
    • it's your "elevator speech"
  • reuse your docs:
    • nice hosted docs can be your "home page"

ReadTheDocs.org

  • builds and hosts your docs:
    • url like borgbackup.readthedocs.io
    • automatically built from github tags
    • version selector, stable, latest
    • nice theme (also for mobile devices)
    • offers PDF (and other) downloads
    • uses sphinx

asciinema.org

  • create cool demo "videos" of console tools
  • embed them on:
    • home page
    • docs intro
    • github README
  • selectable static preview screen
  • adjustable playback speed
  • copy & paste works from the "video"
  • recording is json:
    • fix typos by editing it
    • commit it to your repo

GitHub

  • "Organisation" == a common ground
    • Main Repo "borg" with good README
    • Repo "community" with links to related stuff
  • "Issues" (+ Labels)
    • bugs / todo / planning
    • ideas / feature requests / discussion
    • questions -> docs enhancements
    • bounties $$ via bountysource.com
  • "Pull Requests" + Code Review
  • "Milestones" for Release Planning
  • "Releases" to publish changelog link, src, bin
  • "Actions": CI (was on travis-ci before)

Communication Channels

  • Mailing List and Archive:
    • borgbackup @ python.org
    • slow, async, permanent
  • IRC (and also matrix):
    • #borgbackup @ irc.libera.chat
    • quicker, sync/async, transient
  • Twitter:
    • @borgbackup on Twitter
  • Usages:
    • support
    • discussion
    • (release) announcements

Borg - you can be assimilated!

  • test scalability / reliability / security

  • find, file and fix bugs

  • file and implement feature requests

  • improve docs

  • contribute or review code

  • spread the word

  • create dist packages

  • care for misc. platforms (windows)

  • donate funds via bountysource

For more information:

borgbackup.org

Questions / Feedback?

  • tw @ waldmann-edv . de

  • Thomas J Waldmann @ twitter
