rdiff-backup

2013-08-01

rdiff-backup is about the only backup tool I’ve found acceptable. The key things for me:

  • The (most recent) backup is readable on disk, as is. That is to say that there is no binary format. If the worst came to the worst, I could restore with cp.

  • It keeps a rolling backup automatically. Each new backup automatically pushes all the previous backups down one. They are all available.

  • Its metadata is not scattered throughout the backup; it’s kept in an obvious and easily isolated directory, rdiff-backup-data.

  • It supports ssh as the transport mechanism.

  • It supports pull-backups. This is particularly important because I want to be able to back up a system that’s live on the internet. To maintain security, you should never keep any private keys on an internet-facing host. That includes ssh private keys. Therefore you should always do your backups with a secure (non-internet-facing) host logging into the internet-facing host using ssh public-key authentication. That means the only key stored on the internet-facing system is a public key, which gives access to nothing. Even if that system is compromised, it doesn’t give the attacker access to any more of your systems.

  • The backup is incremental, with increments stored as “reverse” diffs. That means that the latest backup is an exact mirror, and the increments are stored as incremental diffs backwards from that latest snapshot. It makes your increments mostly free in terms of disk space. You can afford then to keep many more of them.
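To make the pull-backup point a little more concrete, here is a minimal sketch of how the public key might be installed on the internet-facing host so that it can only start a read-only rdiff-backup server. The function name restricted_key_line is my own invention for illustration, and the --server and --restrict-read-only options are those of the classic rdiff-backup command line, so check that your version supports them.

```shell
#!/bin/sh
# Print an authorized_keys entry that limits a public key to running a
# read-only rdiff-backup server. restricted_key_line is a hypothetical helper.
# $1: the public key text, e.g. the contents of ~/.ssh/id_ed25519.pub on the
#     secure backup host.
restricted_key_line() {
    printf 'command="rdiff-backup --server --restrict-read-only /",'
    printf 'no-port-forwarding,no-X11-forwarding,no-pty %s\n' "$1"
}
```

The output is the line you would append to root’s authorized_keys file on the remote host, for example by running restricted_key_line "$(cat ~/.ssh/id_ed25519.pub)" on the backup host. The private half of the key never leaves the backup host.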

Enough advertising. I’d like to talk about using rdiff-backup to back up your, say, small server system. I’m assuming a webserver that serves multiple sites, has databases, and maybe even some source code repositories.

Install with

apt-get install rdiff-backup python-pylibacl python-pyxattr

I suggest you put your rdiff-backup command in a script file. Even though the script will probably run only one command, that command can be long and complicated, and it’s much easier to debug and understand if it’s in a file.

Let’s begin by making a null-backup.

#!/bin/sh
rdiff-backup \
    --exclude / \
    --print-statistics \
    root@remote::/ \
    /local/backup/directory

Note that the “host::path” notation makes rdiff-backup use ssh to connect. You’ll need rdiff-backup installed on the remote machine as well, because the local rdiff-backup starts a copy there and talks to it over that ssh connection.
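Before the first real run it can be worth checking that the ssh connection and the remote rdiff-backup both work. This is a small sketch, assuming the classic command line’s --test-server option; check_remote and the RDIFF_BACKUP override variable are names I’ve made up for illustration.

```shell
#!/bin/sh
# Ask rdiff-backup to open the ssh connection and check that a compatible
# rdiff-backup answers on the other end, without backing anything up.
# RDIFF_BACKUP is a hypothetical override for the path to the binary.
check_remote() {
    "${RDIFF_BACKUP:-rdiff-backup}" --test-server "$1"
}

# Example: check_remote root@remote::/
```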

--------------[ Session statistics ]--------------
StartTime 1375345801.00 (Thu Aug  1 09:30:01 2013)
EndTime 1375345804.04 (Thu Aug  1 09:30:04 2013)
ElapsedTime 3.04 (3.04 seconds)
SourceFiles 1
SourceFileSize 0 (0 bytes)
MirrorFiles 1
MirrorFileSize 0 (0 bytes)
NewFiles 0
NewFileSize 0 (0 bytes)
DeletedFiles 0
DeletedFileSize 0 (0 bytes)
ChangedFiles 1
ChangedSourceSize 0 (0 bytes)
ChangedMirrorSize 0 (0 bytes)
IncrementFiles 0
IncrementFileSize 0 (0 bytes)
TotalDestinationSizeChange 0 (0 bytes)
Errors 0
--------------------------------------------------

I think you should always be explicit about what you’re backing up when dealing with root-based backups. I find it better to exclude everything with --exclude / and put the inclusions I want ahead of it, rather than fill the command line with exclusions for all the non-filesystem directories (/sys, /dev, /tmp, /proc, etc). Running the above command gets us a directory tree like this in our backup directory:

$ tree -F
.
`-- rdiff-backup-data/
    |-- access_control_lists.2013-08-01T09:30:01+01:00.snapshot
    |-- backup.log
    |-- chars_to_quote
    |-- current_mirror.2013-08-01T09:30:01+01:00.data
    |-- error_log.2013-08-01T09:30:01+01:00.data
    |-- extended_attributes.2013-08-01T09:30:01+01:00.snapshot
    |-- file_statistics.2013-08-01T09:30:01+01:00.data.gz
    |-- increments/
    |-- mirror_metadata.2013-08-01T09:30:01+01:00.snapshot.gz
    `-- session_statistics.2013-08-01T09:30:01+01:00.data

2 directories, 9 files

There is nothing but rdiff-backup-data, pretty obviously because we’ve excluded the entire remote directory structure, so we’re backing up nothing. However, that lets us see what rdiff-backup keeps aside from our data. I won’t describe each file; they’re fairly self-explanatory, and it’s unlikely you’ll ever need to look at anything other than the logs (and even those, rarely).

We can get rdiff-backup to tell us about the backups so far.

$ rdiff-backup --list-increment-sizes /local/backup/directory
        Time                       Size        Cumulative size
-----------------------------------------------------------------------------
Thu Aug  1 09:30:01 2013         4.00 KB           4.00 KB   (current mirror)

Run the same null-backup line again, then look at the increment list again.

$ rdiff-backup --list-increment-sizes /local/backup/directory
        Time                       Size        Cumulative size
-----------------------------------------------------------------------------
Thu Aug  1 09:35:04 2013         4.00 KB           4.00 KB   (current mirror)

Note that rdiff-backup --list-increment-sizes is clever enough to note that nothing has changed, so it simply shows one increment. In truth, it has recorded both backups:

$ tree -F
.
`-- rdiff-backup-data/
    |-- access_control_lists.2013-08-01T09:30:01+01:00.snapshot
    |-- access_control_lists.2013-08-01T09:35:04+01:00.snapshot
    |-- backup.log
    |-- chars_to_quote
    |-- current_mirror.2013-08-01T09:35:04+01:00.data
    |-- error_log.2013-08-01T09:30:01+01:00.data
    |-- error_log.2013-08-01T09:35:04+01:00.data
    |-- extended_attributes.2013-08-01T09:30:01+01:00.snapshot
    |-- extended_attributes.2013-08-01T09:35:04+01:00.snapshot
    |-- file_statistics.2013-08-01T09:30:01+01:00.data.gz
    |-- file_statistics.2013-08-01T09:35:04+01:00.data.gz
    |-- increments/
    |-- mirror_metadata.2013-08-01T09:30:01+01:00.diff
    |-- mirror_metadata.2013-08-01T09:35:04+01:00.snapshot.gz
    |-- session_statistics.2013-08-01T09:30:01+01:00.data
    `-- session_statistics.2013-08-01T09:35:04+01:00.data

2 directories, 15 files

Note that there are two sets of rdiff-backup files; we will get a new set every time we run rdiff-backup. Run it a few times to observe if you wish.
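Since each run leaves its own session_statistics file behind, counting those files is a quick way to see how many backups have really been recorded, whatever the increment list shows. A minimal sketch; count_sessions is a hypothetical helper name, and it relies only on the file naming visible in the tree above.

```shell
#!/bin/sh
# Count the backup sessions recorded in a backup directory by counting its
# session_statistics.<timestamp>.data files.
# $1: the backup directory (the one that contains rdiff-backup-data).
count_sessions() {
    n=0
    for f in "$1"/rdiff-backup-data/session_statistics.*.data; do
        # The pattern stays unexpanded when nothing matches, so check first.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example: count_sessions /local/backup/directory
```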

Let’s now create something to back up. Change your backup script to this:

#!/bin/sh
rdiff-backup \
    --include /test-file \
    --exclude / \
    --print-statistics \
    root@remote::/ \
    /local/backup/directory

You’ll then need to put some content in /test-file on the remote host.

$ ssh root@remote "date > /test-file"

Rerun your backup script. This time you’ll see this output:

--------------[ Session statistics ]--------------
StartTime 1375346602.00 (Thu Aug  1 09:43:22 2013)
EndTime 1375346605.25 (Thu Aug  1 09:43:25 2013)
ElapsedTime 3.25 (3.25 seconds)
SourceFiles 2
SourceFileSize 29 (29 bytes)
MirrorFiles 1
MirrorFileSize 0 (0 bytes)
NewFiles 1
NewFileSize 29 (29 bytes)
DeletedFiles 0
DeletedFileSize 0 (0 bytes)
ChangedFiles 1
ChangedSourceSize 0 (0 bytes)
ChangedMirrorSize 0 (0 bytes)
IncrementFiles 2
IncrementFileSize 0 (0 bytes)
TotalDestinationSizeChange 29 (29 bytes)
Errors 0
--------------------------------------------------

More than zero bytes have been transferred. Our backup now looks something like this (yours will differ in the details):

.
|-- rdiff-backup-data/
|   |-- access_control_lists.2013-08-01T09:30:01+01:00.snapshot
|   |-- access_control_lists.2013-08-01T09:35:04+01:00.snapshot
|   |-- access_control_lists.2013-08-01T09:41:54+01:00.snapshot
|   |-- access_control_lists.2013-08-01T09:43:22+01:00.snapshot
|   |-- backup.log
|   |-- chars_to_quote
|   |-- current_mirror.2013-08-01T09:43:22+01:00.data
|   |-- error_log.2013-08-01T09:30:01+01:00.data
|   |-- error_log.2013-08-01T09:35:04+01:00.data
|   |-- error_log.2013-08-01T09:41:54+01:00.data
|   |-- error_log.2013-08-01T09:43:22+01:00.data
|   |-- extended_attributes.2013-08-01T09:30:01+01:00.snapshot
|   |-- extended_attributes.2013-08-01T09:35:04+01:00.snapshot
|   |-- extended_attributes.2013-08-01T09:41:54+01:00.snapshot
|   |-- extended_attributes.2013-08-01T09:43:22+01:00.snapshot
|   |-- file_statistics.2013-08-01T09:30:01+01:00.data.gz
|   |-- file_statistics.2013-08-01T09:35:04+01:00.data.gz
|   |-- file_statistics.2013-08-01T09:41:54+01:00.data.gz
|   |-- file_statistics.2013-08-01T09:43:22+01:00.data.gz
|   |-- increments/
|   |   `-- test-file.2013-08-01T09:41:54+01:00.missing
|   |-- increments.2013-08-01T09:35:04+01:00.dir*
|   |-- increments.2013-08-01T09:41:54+01:00.dir*
|   |-- mirror_metadata.2013-08-01T09:30:01+01:00.diff
|   |-- mirror_metadata.2013-08-01T09:35:04+01:00.diff.gz
|   |-- mirror_metadata.2013-08-01T09:41:54+01:00.diff.gz
|   |-- mirror_metadata.2013-08-01T09:43:22+01:00.snapshot.gz
|   |-- session_statistics.2013-08-01T09:30:01+01:00.data
|   |-- session_statistics.2013-08-01T09:35:04+01:00.data
|   |-- session_statistics.2013-08-01T09:41:54+01:00.data
|   `-- session_statistics.2013-08-01T09:43:22+01:00.data
`-- test-file

More metadata files as we’ve come to expect; but more importantly now:

.
|-- rdiff-backup-data/
|   |-- increments/
|   |   `-- test-file.2013-08-01T09:41:54+01:00.missing
`-- test-file

We have our test-file backed up as a standard copy, and an increment describing the change (we needn’t worry about how rdiff-backup keeps its increments).

$ rdiff-backup --list-increment-sizes /local/backup/directory
        Time                       Size        Cumulative size
-----------------------------------------------------------------------------
Thu Aug  1 09:43:22 2013         4.03 KB           4.03 KB   (current mirror)
Thu Aug  1 09:41:54 2013         0 bytes           4.03 KB
Thu Aug  1 09:35:04 2013         0 bytes           4.03 KB

Excellent. rdiff-backup is telling us all the snapshots it can take us back to, and we can see that the latest is the only one with any data in it. Obviously in a real backup, you’ll have much bigger numbers.
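Getting a file back follows from this layout. The current version is an ordinary file in the mirror, so cp is enough; older versions have to be rebuilt from the increments, which is what --restore-as-of (short form -r) does. This is a sketch, with restore_latest and restore_as_of as hypothetical helper names and RDIFF_BACKUP as a made-up override for the path to the binary.

```shell
#!/bin/sh
# Copy the most recent version of a file straight out of the mirror.
# $1: path inside the backup directory, $2: where to put the copy.
restore_latest() {
    cp -p "$1" "$2"
}

# Rebuild an older version from the reverse diffs.
# $1: a time such as "now", "3D" (three days ago) or "2013-08-01"
# $2: path inside the backup directory, $3: where to put the restored file.
restore_as_of() {
    "${RDIFF_BACKUP:-rdiff-backup}" --restore-as-of "$1" "$2" "$3"
}

# Example: restore_as_of 3D /local/backup/directory/test-file /tmp/test-file
```

Restoring to a scratch location first, rather than over the live file, leaves room to compare the two before replacing anything.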

I’d suggest deleting this backup directory now, and we’ll alter our script to be more realistic. It doesn’t matter if you don’t, but your next backup will make it look like your system went from blank to full, which isn’t really accurate.

I prefer not to back up anything that the distribution installs; I’m only interested in unrecreatable or unobtainable files. Here then is a reasonable backup script:

#!/bin/sh
rdiff-backup \
    --exclude '**/.git' \
    --include /etc \
    --include /home \
    --include /srv \
    --include /root \
    --include /usr/local \
    --include /var/backups \
    --include /var/lib/ldap \
    --include /var/lib/mysql \
    --include /var/lib/postgresql \
    --include /var/log \
    --include /var/mail \
    --include /var/www \
    --exclude / \
    --print-statistics \
    root@remote::/ \
    /local/backup/directory

Note particularly the inclusion of /var/log – one of the first things an attacker will do is delete all your logs. If you’ve got them backed up, at least you’ll have some indication of when the attack happened.

I’m slightly ambivalent about including /var/lib/postgresql and other database directories. The problem is that the databases are live when the backup happens, and it’s unlikely that these directories are in a self-consistent state. However, better to have them than want them – one day it might save your bacon. I’ll discuss real database backups another day.

I’ve also excluded .git/ directories. .git is where git stores a repository’s data inside a working directory; it’s so fluid that you’ll get a lot of noise in your backups if you include it. Instead, your working directories are being backed up, and you should be pushing your repositories to a central location (as with databases, I’ll come to an automated method for this another day), which you should include in a backup. This is entirely a judgement call; if you aren’t limited by disk space, include them.
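If the list of inclusions keeps growing, it can be moved out of the command line and into a file. This is a sketch assuming rdiff-backup’s --include-globbing-filelist option, where each line is treated like an --include argument unless it is prefixed with "- ", which makes it an exclusion; earlier lines take precedence, as on the command line. The function name write_include_list and the file path in the example are my own placeholders, and only a few of the directories from the script are shown.

```shell
#!/bin/sh
# Write a selection file equivalent to a shortened version of the backup
# script's --include/--exclude options. $1: path of the file to create.
write_include_list() {
    cat > "$1" <<'EOF'
- **/.git
/etc
/home
/srv
/var/log
- /
EOF
}

# Example (paths are placeholders):
#   write_include_list /root/backup-includes.txt
#   rdiff-backup --include-globbing-filelist /root/backup-includes.txt \
#       --print-statistics root@remote::/ /local/backup/directory
```

Keeping the patterns in a file also sidesteps shell quoting, since the shell never sees the ** glob.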

Once you have your backups, rdiff-backup can help you interrogate them. Rather than repeat information, have a look at rdiff-backup’s own examples page.
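As a taste of what’s there, here is a sketch of two such commands, wrapped in functions so the arguments are labelled. Both options come from the classic rdiff-backup command line, so check them against your version; the function names and the RDIFF_BACKUP override are hypothetical.

```shell
#!/bin/sh
# List the files that have changed in the backup since a given time.
# $1: a time such as "1D" (one day ago), $2: the backup directory.
changed_since() {
    "${RDIFF_BACKUP:-rdiff-backup}" --list-changed-since "$1" "$2"
}

# Delete increments older than a given age; the current mirror is kept.
# --force is needed when more than one increment would be removed.
# $1: an age such as "8W" (eight weeks), $2: the backup directory.
prune_older_than() {
    "${RDIFF_BACKUP:-rdiff-backup}" --remove-older-than "$1" --force "$2"
}

# Example: prune_older_than 8W /local/backup/directory
```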
