Edit: this script is deprecated in favour of a rewritten version 2.
I use Amazon S3 to host large media files that need cheap, scalable bandwidth, and for expandable offsite storage of important backups. I used to have some simple incremental tar scripts to do my offsite backups, but since I moved to Bacula, I’ve just established an alternative schedule and file set definition for my offsite backups: the critical subset of data I couldn’t possibly stand to lose (like company documents). Since I was refreshing all my procedures and tarring the Bacula volumes no longer made any sense, I rewrote my script for putting the resulting backup data on S3.
The prerequisite in all cases is s3cmd, which is pretty mature now and available on most distros (“apt-get install s3cmd” and you’re done on Ubuntu). s3cmd actually has a ‘sync’ command, but I had two problems with it. Firstly, it tries to sync in both directions, which I don’t want. I know in theory it should never overwrite any local version so long as I don’t update the remote copies from somewhere else, but I’m paranoid when it comes to my backups and prefer to be explicit. Secondly, it obviously has to connect to S3 to determine the sync status, whereas I always know whether I need to upload new files just from my local environment (and S3 charges per request - not much, but it’s not zero and it’s the principle of the thing). So, I decided not to use the ‘sync’ command, and just determine locally what new files I needed to ‘put’ on the server.
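To give an idea of what "determine locally" can look like, here is a minimal sketch, not the actual script. It assumes a marker file (the name .s3put_stamp is made up for illustration) whose modification time records the last upload, and lists everything newer than that.

```shell
# Hypothetical helper: list files under a folder that changed since the
# last upload. The marker file name (.s3put_stamp) is an assumption made
# for this sketch, not something taken from the real script.
list_new_files() {
    src="${1%/}"                  # strip one trailing slash, if present
    stamp="$src/.s3put_stamp"
    if [ -f "$stamp" ]; then
        # Only files modified more recently than the marker
        find "$src" -type f -newer "$stamp" ! -name '.s3put_stamp'
    else
        # No marker yet: treat every file as new
        find "$src" -type f ! -name '.s3put_stamp'
    fi
}
```

Each listed file can then be handed to "s3cmd put", and the marker touched once the uploads succeed. A careful version would record the start time before listing, so that files modified during the upload are picked up on the next run.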
Encryption is also a must, since some of the data is sensitive and I don’t want to trust anyone else with it. I used to manually GPG my tarballs before uploading them, but I noticed that s3cmd supports an encryption option too. It just uses GPG anyway, only in symmetric form rather than asymmetric like my version did (translation: you use the same passphrase for encryption and decryption; a little less secure than using generated public/private keys but still ok so long as you pick a good passphrase and look after it). The default symmetric algorithm in gpg is CAST5 which seems pretty good, although you can change it if you want by editing your s3cmd config file. So, I decided to give it a try. After you configure s3cmd to use encryption, it also decrypts automatically when you pull the data back (symmetric key, remember). Being distrustful, I pulled the data back from S3 in a different environment without the passphrase and examined it, and it was indeed complete gibberish, but decipherable with the passphrase. Good stuff.
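If you want to repeat that sanity check from the configured machine, something along these lines should do it. This is only a sketch: the function name and the object key "roundtrip-test" are placeholders I've made up, and it assumes s3cmd is already configured with a GPG passphrase so that "get" decrypts on the way back.

```shell
# Hypothetical round-trip check: upload a file encrypted, fetch it back,
# and confirm the decrypted copy matches the original byte for byte.
# The object key "roundtrip-test" is a placeholder for this sketch.
roundtrip_check() {
    file="$1"
    bucket="$2"
    tmp=$(mktemp) || return 1
    s3cmd put --encrypt "$file" "s3://$bucket/roundtrip-test" &&
    s3cmd get --force "s3://$bucket/roundtrip-test" "$tmp" &&
    cmp -s "$file" "$tmp"
    status=$?                     # result of the whole && chain above
    rm -f "$tmp"
    return $status
}
```

Running the same "get" on a machine whose s3cmd config has no passphrase is the other half of the test: there the downloaded file should be unreadable.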
So, here’s my little script which will upload the contents of a folder to S3, encrypted by default - just the files that have been added or updated since the last sync of that folder. I just run this on a cron schedule now and it seems to work fine. License is MIT, use at your own risk, no warranty is given that it won’t destroy every file on your machine or eat your children. Usage is like this:
s3putsecurefolder /my/source/folder my.s3.bucket
Edit: it was brought to my attention that Amazon have made it easier to create pseudo-folder structures in S3 buckets since I last tried to do it (I swear it used to throw out keys with forward slashes in them, I had to mangle my names last time I did this), so I’ve updated the script to allow nested folders too.
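For nested folders, the object key is simply the file's path relative to the source folder, forward slashes included. A small sketch of that mapping (the function name is hypothetical, and this is not lifted from the script itself):

```shell
# Hypothetical helper: turn a local file path into an S3 key by removing
# the source folder prefix, so subfolders become pseudo-folders in the bucket.
s3_key_for() {
    src="${1%/}"                  # tolerate a trailing slash on the folder
    file="$2"
    printf '%s\n' "${file#"$src"/}"
}
```

So a file at /my/source/folder/docs/2010/accounts.pdf would be uploaded as s3://my.s3.bucket/docs/2010/accounts.pdf.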