Tag Archives: amazon


S3 encrypted upload script, v2 (Python)

OK, so I discovered a number of shortcomings in my recent attempt to sync a folder in one direction to Amazon S3 using encryption. The most important was that it wouldn’t resume a failed transfer efficiently, which in the case of large transfers wasn’t at all ideal (as I learned to my own cost; damn my 256k upload speed).

So, this is attempt number 2. I decided to completely rewrite the script in Python to give me some more flexibility, helped by the availability of Boto, a nice Python library for accessing all the Amazon Web Services. Rather than rely on just local information, or even date/time stamps, I decided to use hashes to track whether files were different. Amazon already stores the MD5 of the file you upload to them and makes that available without downloading the file, but that’s no use when you encrypt your files before uploading them, because the MD5 is of the encrypted contents rather than the original. Unless you keep the encrypted copies around too, or encrypt the local files again every time just to check the match (expensive if you’re dealing with large files), you won’t be able to compare them. I think this is the reason why ‘s3cmd sync’ currently doesn’t support encryption.

So, I decided to use S3’s ability to store custom metadata in keys, and stored the MD5 hash of the original file against the encrypted contents that I uploaded. That way, I can check the hashes against each other pretty quickly without having to re-encrypt the local files. If the hashes are different, I encrypt and upload. This approach trades a bit of preprocessing (hashing every local file) against avoiding uploads, so it’s likely to be more efficient on small groups of very large files than on lots of small files, which is how I use S3 for my backups of course. It also means I don’t have to worry about timestamp variations; it’s the content of the file that drives whether it’s uploaded or not.
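To make the comparison concrete, here is a minimal sketch of the hash check described above. It is not the script itself: the metadata key name `original-md5` and both function names are made up for illustration, and fetching the metadata from S3 (via Boto or otherwise) is left out, so the remote side is just a plain dict.

```python
import hashlib

# Hypothetical metadata key name; the real script may well use another one.
ORIGINAL_MD5_KEY = "original-md5"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks so that
    very large backup files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(local_path, remote_metadata):
    """Decide whether a local file must be encrypted and uploaded.

    remote_metadata is the custom metadata dict of the existing S3 key,
    or None when no key exists yet for this file.
    """
    if remote_metadata is None:
        return True
    # Compare the hash of the *unencrypted* local file with the hash that
    # was stored alongside the encrypted upload.
    return remote_metadata.get(ORIGINAL_MD5_KEY) != md5_of_file(local_path)
```

The point of the design is that only the local file is hashed; nothing is encrypted or transferred unless the stored hash is missing or different.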

So, here’s the new version. It’s a bit more powerful than the last one: I’m calling gpg myself now, so you have the choice between encrypting using public keys (more secure, and the default) or using symmetric encryption with a passphrase. You need to install Boto before you can run it, and it depends on Python 2.5 with hashlib installed. I’ve run it on both Linux and the Mac. It should work on Windows too provided you take the trouble to set up Python and GnuPG, but I haven’t tried; my Linux (apt) and OS X (macports) setups make these things quicker, so being short of time I just went with that. Here’s the usage from --help:

Usage: s3putsecurefolder.py [options] source_folder target_bucket gpg_recipient_or_phrase

Options:
  -h, --help            show this help message and exit
  -n, --dry-run         Do not upload any files, just list actions
  -a ACCESS_KEY, --accesskey=ACCESS_KEY
                        AWS access key to use instead of relying on
                        environment variable AWS_ACCESS_KEY
  -s SECRET_KEY, --secretkey=SECRET_KEY
                        AWS secret key to use instead of relying on
                        environment variable AWS_ACCESS_KEY
  -c, --create          Create bucket if it does not already exist
  -v, --verbose         Verbose output
  -S, --symmetric       Instead of encrypting with a public key, encrypts
                        files using a symmetric cypher and the passphrase
                        given on the command-line.
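As a rough illustration of the public-key versus symmetric choice (the -S option above), this sketch builds the gpg command line for either mode. The function name and argument layout are assumptions for the example and not necessarily how the script calls gpg; the gpg options themselves (`--encrypt`, `--recipient`, `--symmetric`, `--passphrase`, `--batch`, `--yes`, `--output`) are standard ones.

```python
def build_gpg_command(source_path, output_path, recipient_or_phrase,
                      symmetric=False):
    """Build (but do not run) a gpg command that encrypts source_path
    into output_path.

    By default the third argument is treated as a public-key recipient;
    with symmetric=True it is treated as a passphrase instead.
    """
    # --batch avoids interactive prompts; --yes overwrites an existing output.
    command = ["gpg", "--batch", "--yes", "--output", output_path]
    if symmetric:
        command += ["--symmetric", "--passphrase", recipient_or_phrase]
    else:
        command += ["--encrypt", "--recipient", recipient_or_phrase]
    command.append(source_path)
    return command
```

The resulting list can be handed to `subprocess.check_call`. Bear in mind that a passphrase given this way is visible in the process list, and that newer GnuPG versions may also need `--pinentry-mode loopback` before they accept `--passphrase` non-interactively.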

Once again, no warranty is given, MIT license. If you see that I’ve done anything dumb, let me know :)


s3putsecurefolder

Edit: this script is deprecated in favour of a rewritten version 2.

I use Amazon S3 to host large media files which I want cheap scalable bandwidth on, and for expandable offsite storage of important backups. I used to have some simple incremental tar scripts to do my offsite backups, but since I moved to Bacula, I’ve just established an alternative schedule and file set definition for my offsite backups, the critical subset of data I couldn’t possibly stand to lose (like company documents). Since I was refreshing all my procedures and tarring the Bacula volumes no longer made any sense, I rewrote my script for putting the resulting backup data on S3.

The prerequisite in all cases is s3cmd, which is pretty mature now and available on most distros (“apt-get install s3cmd” and you’re done on Ubuntu). s3cmd actually has a ‘sync’ command, but firstly that tries to sync in both directions, which I don’t want. I know in theory it should never overwrite any local version so long as I don’t update the remote copies from somewhere else, but I’m paranoid when it comes to my backups and prefer to be explicit. Secondly, it obviously has to connect to S3 to determine the sync status, whereas I always know whether I need to upload new files just from my local environment (and S3 charges per request; not much, but it’s not zero and it’s the principle of the thing). So, I decided not to use the ‘sync’ command, and just determine locally what new files I needed to ‘put’ on the server.
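One way to work out locally which files need a ‘put’ is to compare modification times against a marker file that is touched after each successful run. The sketch below is written in Python purely for illustration; the marker-file approach, the function name and the marker path are assumptions, not a description of the actual script.

```python
import os


def files_newer_than(folder, marker_path):
    """List files under folder that were modified after the marker file.

    If the marker does not exist yet (first run), every file is returned.
    The marker itself is skipped if it happens to live inside folder.
    """
    try:
        last_sync = os.path.getmtime(marker_path)
    except OSError:
        last_sync = None  # no previous sync recorded
    marker_abs = os.path.abspath(marker_path)

    changed = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            if os.path.abspath(path) == marker_abs:
                continue
            if last_sync is None or os.path.getmtime(path) > last_sync:
                changed.append(path)
    return sorted(changed)
```

After the listed files have been uploaded, touching the marker records the new sync point. No request to S3 is needed to make the decision.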

Secondly, encryption is a must, since some of the data is sensitive and I don’t want to trust anyone else with it. I used to manually GPG my tarballs before uploading them, but I noticed that s3cmd supports an encryption option too. It just uses GPG anyway, just in symmetric form rather than asymmetric like my version did (translation: you use the same passphrase for encryption and decryption; a little less secure than using generated public/private keys but still ok so long as you pick a good passphrase and look after it). The default symmetric algorithm in gpg is CAST5 which seems pretty good, although you can change it if you want by editing your s3cmd config file. So, I decided to give it a try – after you configure s3cmd to use encryption, it actually automatically decrypts too when you pull the data back (symmetric key, remember) – being distrustful, I pulled the data back from S3 in a different environment and examined it, and it was indeed complete gibberish, but decipherable with the passphrase. Good stuff.

So, here’s my little script which will upload the encrypted contents of a folder to S3 – just the contents that have been added or updated since the last sync of that folder, and will encrypt them by default. I just run this on a cron schedule now and it seems to work fine. License is MIT, use at your own risk, no warranty is given that it won’t destroy every file on your machine or eat your children. Usage is like this:

s3putsecurefolder /my/source/folder my.s3.bucket

Edit: it was brought to my attention that Amazon have made it easier to create pseudo-folder structures in S3 buckets since I last tried to do it (I swear it used to throw out keys with forward slashes in them, I had to mangle my names last time I did this), so I’ve updated the script to allow nested folders too.


Streaming media from Amazon S3

Thanks John for the reminder to investigate S3 as a business media hosting service, it works like a charm!

Now that I have far fewer bandwidth worries (max $0.17 per GB), the Torus Knot site includes a nifty dynamic selector so you can pick low, medium or high quality; the latter is at a higher resolution too, clocking in at about 100MB. I may well use S3 for public commercial downloads in the future too. It’s altogether more convenient than the block bandwidth allocations you get with regular hosting packages, since it scales dynamically at a very fine level of detail depending on demand. And don’t be fooled by ‘unlimited’ bandwidth offers: all hosting companies have to pay for bandwidth and there’s no such thing as ‘unlimited’ resources. You’ll actually find your bandwidth being throttled or cut off via a ‘reasonable use’ clause in the small print; ‘unlimited’ is simply a marketing lure. If you want truly scalable guaranteed bandwidth, you have to pay for it.

Getting S3 media hosting working wasn’t that hard, but did require a few discrete steps. Firstly, you need to create a bucket in your S3 account which is all in lower case, is globally unique and is DNS-compatible; so for example I created a bucket called ‘media.torusknot.com’.

Then to make it all look nice you need to create a DNS CNAME entry to map a sub-domain of your site to that S3 bucket; in my case I mapped ‘media.torusknot.com’ to ‘media.torusknot.com.s3.amazonaws.com’. That allows me to access any files I upload to that S3 bucket via ‘http://media.torusknot.com/somefile.jpg’. You do just need to set the ACLs on the files & the bucket to make sure public access is allowed.
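The bucket naming and CNAME steps can be sanity-checked with a few lines of Python. This is a simplified sketch: the function names are made up for the example, and the check covers only the basic DNS-style rules (3 to 63 characters; lower-case letters, digits, hyphens and dots; each dot-separated label starting and ending with a letter or digit), not every rule Amazon applies.

```python
import re

# One DNS label: lower-case letters, digits and hyphens, with no hyphen
# at either end.
_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def is_dns_compatible_bucket(name):
    """Simplified check that a bucket name can be used as a hostname."""
    if not 3 <= len(name) <= 63:
        return False
    return all(_LABEL.match(label) for label in name.split("."))


def cname_target(bucket):
    """Hostname the DNS CNAME record for the bucket should point at."""
    return bucket + ".s3.amazonaws.com"
```

For the bucket in this post, `cname_target("media.torusknot.com")` gives ‘media.torusknot.com.s3.amazonaws.com’, which is the value that goes in the CNAME record.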

Finally, if you want to stream video files via a Flash player from S3 to another domain, you also have to tell Flash that it’s ok for the content to be pulled in from a different domain. Create a file called ‘crossdomain.xml’ in the bucket, with these contents:

<cross-domain-policy>
<site-control permitted-cross-domain-policies="all"/>
<allow-access-from domain="*"/>
</cross-domain-policy>

That allows the media to be accessed from anywhere – you can be more specific if you want but this is the simplest approach.
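If you do want to be more specific, the policy file can list individual domains instead of allowing everyone. Here is a small sketch that generates such a file; the function name and the example domains are hypothetical, and it only emits the `allow-access-from` entries, which is the element Flash uses to decide who may load the content.

```python
from xml.sax.saxutils import quoteattr


def build_crossdomain_policy(domains):
    """Return the text of a crossdomain.xml that allows only the given
    domains (for example ["www.example.com", "*.example.org"])."""
    lines = ['<?xml version="1.0"?>', "<cross-domain-policy>"]
    for domain in domains:
        # quoteattr adds the surrounding quotes and escapes special characters.
        lines.append("  <allow-access-from domain=%s/>" % quoteattr(domain))
    lines.append("</cross-domain-policy>")
    return "\n".join(lines) + "\n"
```

The returned string is what you would upload to the bucket as ‘crossdomain.xml’, with public read access like the media files.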

Once again I’m using the excellent FlowPlayer; my only issue with it is that the ‘buffering’ animation seems to not work all the time (so be patient if you’re viewing the high quality version).

Gotta love this cloud computing business :)