Amazon S3 as an online backup service

by Steve

I’ve been thinking about subscribing to an online backup service for a little while. After all, while a scheduled backup to an external drive / NAS is all very well, if something should go seriously wrong (heaven forbid), you really need an offsite backup of your most critical data. There’s only so long you can go on burning DVDs or removing hard drives and persuading friends / family to keep them at their houses before it starts to get unwieldy. As a small business, you often don’t have the option of fireproof safes and staff to run a proper offsite rotation system all the time, so with increasing broadband speeds online backup is starting to look more attractive.

There are plenty of dedicated options out there, like Mozy and Carbonite, and they’re fairly inexpensive, but most of them are aimed at home desktop PC users, and as such typically only have Windows and (sometimes) Mac GUIs. Automated Linux server backups tend to be limited to the business versions, if they’re available at all, and those come at a higher price. So I kept looking.

I happened across a number of services that piggy-back on Amazon’s Simple Storage Service (S3) - one of several web services that Amazon provide to developers who want to create online solutions without maintaining a scalable data centre of their own. It has the significant advantage of being rather cheap for small volumes, thanks to a very fine-grained pricing structure (the prices below are for US hosting; EU hosting is 20% higher):

Storage $0.15 per GB, per month
Transfers In $0.10 per GB
Transfers Out $0.18 per GB up to 10TB, savings after that
PUT Requests $0.01 per 1,000
GET Requests $0.01 per 10,000

This means you’re never paying for more than you need - if you store 10GB and transfer the same amount in every month (unlikely), with fewer than a thousand PUT requests, you’re going to be paying $2.51 per month ($1.50 for storage, $1.00 for transfers in and $0.01 for PUTs), which isn’t half bad. Some services like Carbonite offer unlimited storage (per PC) for a fixed price, so if you go beyond a certain level you might be better off with something like that, but again they’re aimed at desktop users. Personally, because I’m only using it to store my most critical documents, and I’ve already compressed them, I don’t expect to be paying more than about 50 cents per month.

Services like JungleDisk, ElephantDrive and ElasticDrive all use S3 as back-end storage, and provide the friendly front-ends which AWS/S3 doesn’t (it’s basically just an API). Of all of these, I liked JungleDisk the most, because it’s inexpensive ($20 for any number of PCs, over and above the S3 charges) and comes with a Linux version, which just requires you to install the FUSE module; you can then mount an S3-backed, locally cached drive on your server and rsync directly to it. It also supports encryption, so the files that are uploaded to S3 aren’t readable by anyone else.
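As a quick check on the pricing arithmetic earlier in this section, here is a minimal Python sketch that reproduces the $2.51 figure from the quoted US rates. The function name and parameters are made up for illustration, and billing each started block of requests in full is an assumption chosen to match the worked figure, not a statement of how Amazon actually rounds.

```python
import math

# US rates in USD, as quoted in the price list above.
STORAGE_PER_GB_MONTH = 0.15
TRANSFER_IN_PER_GB = 0.10
TRANSFER_OUT_PER_GB = 0.18   # rate for the first 10TB only
PUT_PER_1000 = 0.01
GET_PER_10000 = 0.01


def monthly_cost(stored_gb, in_gb=0.0, out_gb=0.0, puts=0, gets=0):
    """Estimate one month's S3 bill in USD (hypothetical helper).

    Assumption: each started block of 1,000 PUTs or 10,000 GETs is
    charged in full. Only valid below 10TB of outbound transfer.
    """
    total = (
        stored_gb * STORAGE_PER_GB_MONTH
        + in_gb * TRANSFER_IN_PER_GB
        + out_gb * TRANSFER_OUT_PER_GB
        + math.ceil(puts / 1000) * PUT_PER_1000
        + math.ceil(gets / 10000) * GET_PER_10000
    )
    return round(total, 2)


if __name__ == "__main__":
    # 10GB stored, 10GB uploaded, just under a thousand PUT requests.
    print(monthly_cost(10, in_gb=10, puts=999))
```

Plugging in your own storage and transfer volumes gives a rough idea of where a flat-rate service would start to work out cheaper.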

In the end though, one thing struck me about all these services - useful though they are, you’re dependent on a piece of software from a relatively new startup, and the file structures they create on S3 are typically encoded in some way. This is out of necessity, since S3 doesn’t support any notion of directories: it deals purely in terms of ‘buckets’ (whose names have to be globally unique) and ‘keys’, each of which identifies a stored block of data, i.e. a maximum of a 2-level system. JungleDisk have released some source code to show you how to get data back out of S3, should you want to do it independently of their software, which is a nice touch, but still, if the JungleDisk service was discontinued it would be a pain. In my travels I’d come across a few open source solutions, so while they required a bit more effort, I thought they were worth investigating.
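To illustrate the flat bucket/key model described above, here is a small Python sketch of the common convention of faking directories by putting slashes in key names and filtering on a prefix. It works on a plain list of key strings rather than talking to S3, the key names are invented, and it is not how JungleDisk or any other tool mentioned here actually encodes its data.

```python
def list_directory(keys, prefix=""):
    """Split the keys under `prefix` into immediate files and subdirectories.

    `keys` is a flat list of key names from a single bucket; a '/' in a
    key is only a naming convention, not a real directory.
    """
    files, subdirs = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if "/" in rest:
            # Something deeper down: report only the first path component.
            subdirs.add(rest.split("/", 1)[0] + "/")
        elif rest:
            files.append(rest)
    return sorted(files), sorted(subdirs)


# Hypothetical contents of one bucket.
keys = [
    "docs/invoices/2007-01.pdf",
    "docs/invoices/2007-02.pdf",
    "docs/contract.pdf",
    "photos/logo.png",
]

files, subdirs = list_directory(keys, "docs/")
print(files)    # ['contract.pdf']
print(subdirs)  # ['invoices/']
```

S3's own listing call accepts a prefix and a delimiter that give much the same effect on the server side, which is what lets front-end tools present a directory tree on top of a two-level store.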

In the open source arena there are a number of options, but the ones I took most seriously were:

  • Amanda now supports S3 as a target
  • duplicity looks like a nice automated backup system with GPG encryption, and has added support for S3 in recent versions
  • s3sync is a Ruby script package which provides you with an rsync-style S3 implementation as well as a command-line tool for manipulating your S3 store (very useful in its own right since Amazon don’t provide any tools)

Amanda would have been great, but it appears the S3 back-end is only available in the enterprise edition sold by Zmanda, not the open source version yet, which is disappointing. Duplicity also looked nice, but S3 support was only added in the recent versions, and it had a lot of dependencies and would have been a lot of trouble to set up manually on my server box, which is still chugging away on Debian Sarge and thus didn’t have recent enough versions in its stable list. An upgrade to Etch is definitely on the cards sometime soon, and I of course could have just grabbed the source packages, but I wanted to try out a simpler solution for now.

In the end, I went with s3sync. It has a few disadvantages compared with the other two, in that it doesn’t bundle files into archives or compress them (leading to more PUT/GET requests and larger transfers) and, most significantly, it doesn’t encrypt files - it can operate over SSL, but the files themselves are stored unencrypted, leaving them open to any Amazon employee or a server compromise once uploaded, which for my business documents was not acceptable. So I just scripted around that - I use a standard incremental TAR backup of my most important directories, encrypt the results with GPG, and then use s3sync to upload the resulting data to S3. Works like a charm - I get a small number of encrypted archives uploaded each time, and they’re small because they only contain the files that changed since the last run. The only problem I did have is that I didn’t check my Ruby version against the s3sync README - again Sarge was too old here; its Ruby 1.8.2 meant that all commands would work except uploading, which puzzled me for a while as I was still learning the nuances of the S3 system 😕 Luckily sarge-backports made it easy to grab a conforming version (and one of libopenssl-ruby), which solved the problem.
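For anyone wanting to try the same tar, GPG and s3sync approach, here is a rough Python sketch of how the three steps could be chained. It is not the script described above: every path, the GPG recipient, the bucket name and the prefix are placeholders, and the s3sync.rb options shown (`-r`, `--ssl`, and a `bucket:prefix` destination) are written from memory of its README, so check them against the version you install. s3sync is also assumed to pick up its AWS credentials from the environment.

```python
import os
import shlex
import subprocess


def build_backup_commands(source_dirs, work_dir, stamp,
                          gpg_recipient, bucket, prefix):
    """Return the tar, gpg and s3sync command lines for one backup run."""
    upload_dir = os.path.join(work_dir, "upload")
    # GNU tar records what it has already backed up in this snapshot
    # file, so later runs only archive files changed since the last one.
    snapshot = os.path.join(work_dir, "backup.snar")
    archive = os.path.join(work_dir, f"backup-{stamp}.tar.gz")
    encrypted = os.path.join(upload_dir, f"backup-{stamp}.tar.gz.gpg")

    tar_cmd = ["tar", "--create", "--gzip",
               f"--listed-incremental={snapshot}",
               f"--file={archive}", *source_dirs]
    # Encrypt to a public key so no passphrase is needed on the server.
    gpg_cmd = ["gpg", "--batch", "--encrypt",
               "--recipient", gpg_recipient,
               "--output", encrypted, archive]
    # Assumed s3sync usage: recursive sync of the upload directory over SSL.
    sync_cmd = ["ruby", "s3sync.rb", "-r", "--ssl",
                upload_dir + "/", f"{bucket}:{prefix}"]
    return [tar_cmd, gpg_cmd, sync_cmd]


def run_backup(commands, work_dir):
    """Run the commands in order, stopping at the first failure."""
    os.makedirs(os.path.join(work_dir, "upload"), exist_ok=True)
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run with placeholder values: print the commands, run nothing.
    cmds = build_backup_commands(
        ["/home/user/documents"], "/var/backups/s3", "2007-06-01",
        "backup@example.com", "example-backup-bucket", "server1")
    for cmd in cmds:
        print(shlex.join(cmd))
```

Only the encrypted files in the upload directory ever get synced, so the plain archives never leave the server. Keeping the snapshot file between runs is what makes each archive incremental; deleting it forces a fresh full backup.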

Overall, I like S3 as an online backup back end - you can’t complain about the prices, and it’s nice to know that your data can grow as large as you like without crossing any major billing boundaries, all while taking advantage of Amazon’s infrastructure. I can see why startups are using these services a lot - being able to start small and grow without the infrastructural pain is a serious bonus (although it’s not perfect). For end-users I’d probably still recommend JungleDisk - it benefits from S3’s cheap prices while having a nice user interface - but if you’re a bit more picky about feeding your data through closed tools, or want more control over the process, then you’ll want something else. If you have complex requirements duplicity looks worth your time, or if your requirements are fairly simple like mine, s3sync is a good low-level solution with very few prerequisites, so long as you remember to encrypt your own sensitive data.