OK, so I discovered a number of shortcomings in my recent attempt to sync a folder in one direction to Amazon S3 using encryption. The most important was that it wouldn’t resume a failed transfer efficiently, which wasn’t at all ideal for large transfers (as I learned to my own cost; damn my 256k upload speed).
So, this is attempt number 2. I decided to completely rewrite the script in Python to give me some more flexibility, helped by the availability of Boto, a nice Python library for accessing all the Amazon Web Services. Rather than rely on just local information, or even date/time stamps, I decided to use hashes to track whether files were different. Amazon already stores the MD5 of the file you upload and makes it available without downloading the file, but that’s no use when you encrypt your files before uploading them, because the MD5 is of the encrypted contents rather than the original. Unless you keep the encrypted copies around too, or encrypt the local files again every time just to check the match (expensive if you’re dealing with large files), you won’t be able to compare them. I think this is the reason why ‘s3cmd sync’ currently doesn’t support encryption.
So, I decided to use S3’s ability to store custom metadata against keys, and stored the MD5 hash of the original file alongside the encrypted contents that I uploaded. That way, I can check the hashes against each other pretty quickly without having to re-encrypt the local files. If the hashes are different, I encrypt and upload. This approach trades a bit of preprocessing (hashing each local file) for avoiding unnecessary uploads, so it’s likely to be more efficient on small groups of very large files than on lots of small files, which of course is how I use S3 for my backups. It also means I don’t have to worry about timestamp variations; it’s the content of the file that drives whether it’s uploaded or not.
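To make the hash comparison concrete, here is a minimal sketch of the idea in modern Python; it is not the script itself. The function names and the metadata name `original-md5` are made up for illustration, and a plain dict stands in for the metadata you would read back from the S3 key (with Boto, something like `key.get_metadata(...)`).

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading in chunks so large files
    never have to fit in memory all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(local_path, remote_metadata, meta_name="original-md5"):
    """Return True when the file should be encrypted and uploaded.

    remote_metadata is a dict standing in for the custom metadata stored
    on the S3 key; meta_name is a hypothetical metadata name. A missing
    entry (new file) or a different hash (changed file) both mean upload.
    """
    return remote_metadata.get(meta_name) != file_md5(local_path)
```

Only the unencrypted local file is ever hashed, so deciding whether to upload costs one read of the file and no encryption at all.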
So, here’s the new version. It’s a bit more powerful than the last one: I’m calling gpg myself now, so you have the choice between encrypting using public keys (more secure, and the default) or using symmetric encryption with a passphrase. You need to install Boto before you can run it, and it depends on Python 2.5 with hashlib installed. I’ve run it on both Linux and the Mac. It should work on Windows too, provided you take the trouble to set up Python and GnuPG, but I haven’t tried; my Linux (apt) and OS X (macports) setups make these things quicker, so being short of time I just went with that. Here’s the usage from --help:
Usage: s3putsecurefolder.py [options] source_folder target_bucket gpg_recipient_or_phrase

Options:
  -h, --help            show this help message and exit
  -n, --dry-run         Do not upload any files, just list actions
  -a ACCESS_KEY, --accesskey=ACCESS_KEY
                        AWS access key to use instead of relying on
                        environment variable AWS_ACCESS_KEY
  -s SECRET_KEY, --secretkey=SECRET_KEY
                        AWS secret key to use instead of relying on
                        environment variable AWS_ACCESS_KEY
  -c, --create          Create bucket if it does not already exist
  -v, --verbose         Verbose output
  -S, --symmetric       Instead of encrypting with a public key, encrypts
                        files using a symmetric cypher and the passphrase
                        given on the command-line.
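For the curious, the two encryption modes roughly correspond to two different gpg invocations. The sketch below is an illustration in modern Python rather than the script’s actual code: `build_gpg_command` and its parameters are hypothetical names, and it only builds the argument list so you can inspect it before running anything. One assumption to flag: newer GnuPG 2.x releases may also need `--pinentry-mode loopback` before they accept `--passphrase` in batch mode.

```python
def build_gpg_command(source_path, encrypted_path, recipient_or_phrase,
                      symmetric=False):
    """Build the gpg argument list for encrypting a single file.

    Hypothetical helper: the third argument is a key recipient by default,
    or a passphrase when symmetric=True (mirroring the -S option).
    """
    # --batch and --yes keep gpg from prompting, e.g. when overwriting output
    command = ["gpg", "--batch", "--yes", "--output", encrypted_path]
    if symmetric:
        # Passphrase-based encryption; note the passphrase is visible to
        # anything that can see the process arguments
        command += ["--symmetric", "--passphrase", recipient_or_phrase]
    else:
        # Public-key encryption to the named recipient
        command += ["--encrypt", "--recipient", recipient_or_phrase]
    command.append(source_path)
    return command
```

The resulting list can be handed to `subprocess.check_call(...)`, which raises an error if gpg exits with a non-zero status, so a failed encryption never gets uploaded.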
Once again, no warranty is given, and it’s under the MIT license. If you see that I’ve done anything dumb, let me know.