S3 encrypted upload script, v2 (Python)

OK, so I discovered a number of shortcomings in my recent attempt to sync a folder in one direction to Amazon S3 using encryption. The most important was that it wouldn’t resume a failed transfer efficiently, which in the case of large transfers wasn’t at all ideal (as I learned to my own cost – damn my 256k upload speed).

So, this is attempt number 2. I decided to completely rewrite the script in Python to give me some more flexibility, helped by the availability of Boto, a nice Python library for accessing all the Amazon Web Services. Rather than rely on just local information, or even date/time stamps, I decided to use hashes to track whether files were different. Amazon already stores the MD5 of the file you upload to them and makes that available without downloading the file, but that’s no use when you encrypt your files before uploading them, because the MD5 is of the encrypted contents rather than the original. Unless you keep the encrypted copies around too, or encrypt the local files again every time just to check the match (expensive if you’re dealing with large files), you won’t be able to compare them. I think this is the reason why ‘s3cmd sync’ currently doesn’t support encryption.

So, I decided to use S3’s ability to store custom metadata in keys, and stored the MD5 hash of the original file against the encrypted contents that I uploaded. That way, I can check the hashes against each other pretty quickly without having to re-encrypt the local files. If the hashes are different, I encrypt and upload. This approach spends a bit of preprocessing (hashing every local file) to avoid unnecessary uploads, so it’s likely to be more efficient on small groups of very large files than on lots of small files, which of course is how I use S3 for my backups. It also means I don’t have to worry about timestamp variations; it’s the content of the file that drives whether it’s uploaded or not.
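
A minimal sketch of the check described above, not the script itself. The names `file_md5`, `needs_upload` and the metadata name `original-md5` are made up for illustration, and the remote key's custom metadata is modelled as a plain dict so the example runs without an AWS account. With Boto, that metadata would presumably be read and written through the key object instead.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks so that
    large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(local_path, remote_metadata, metadata_name="original-md5"):
    """Decide whether a local file has to be encrypted and uploaded.

    remote_metadata is the custom metadata stored with the remote
    (encrypted) copy, or None if there is no remote copy yet.
    The metadata name is a hypothetical choice for this sketch.
    """
    if remote_metadata is None:
        return True  # nothing uploaded yet
    # Compare the hash of the *original* contents, stored at upload time,
    # with the hash of the current local file. No re-encryption needed.
    return remote_metadata.get(metadata_name) != file_md5(local_path)
```

When `needs_upload` returns True, the caller would encrypt the file, upload it, and store the freshly computed hash in the new key's metadata so that the next run can skip it.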

So, here’s the new version. It’s a bit more powerful than the last one: I’m calling gpg myself now, so you have the choice between encrypting using public keys (more secure, and the default) or using symmetric encryption with a passphrase. You need to install Boto before you can run it, and it depends on Python 2.5 with hashlib installed. I’ve run it on both Linux and the Mac. It should work on Windows too provided you take the trouble to set up Python and GnuPG, but I haven’t tried; my Linux (apt) and OS X (macports) setups make these things quicker, so being short of time I just went with that. Here’s the usage from --help:

Usage: s3putsecurefolder.py [options] source_folder target_bucket gpg_recipient_or_phrase

Options:
  -h, --help            show this help message and exit
  -n, --dry-run         Do not upload any files, just list actions
  -a ACCESS_KEY, --accesskey=ACCESS_KEY
                        AWS access key to use instead of relying on
                        environment variable AWS_ACCESS_KEY
  -s SECRET_KEY, --secretkey=SECRET_KEY
                        AWS secret key to use instead of relying on
                        environment variable AWS_ACCESS_KEY
  -c, --create          Create bucket if it does not already exist
  -v, --verbose         Verbose output
  -S, --symmetric       Instead of encrypting with a public key, encrypts
                        files using a symmetric cypher and the passphrase
                        given on the command-line.
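
The two encryption modes in the options above could be driven along these lines. This is only a sketch under assumptions, not how the script necessarily invokes gpg: the function names are hypothetical, and the flag selection is one plausible way of calling the `gpg` command-line tool non-interactively.

```python
import subprocess


def build_gpg_command(source_path, encrypted_path, recipient_or_phrase,
                      symmetric=False):
    """Build the gpg argument list for one file.

    By default the file is encrypted to a public key (recipient);
    with symmetric=True the third argument is used as a passphrase.
    """
    command = ["gpg", "--batch", "--yes", "--output", encrypted_path]
    if symmetric:
        # Note: newer GnuPG releases may additionally need
        # "--pinentry-mode loopback" for --passphrase to be honoured.
        command += ["--symmetric", "--passphrase", recipient_or_phrase]
    else:
        command += ["--encrypt", "--recipient", recipient_or_phrase]
    command.append(source_path)
    return command


def encrypt_file(source_path, encrypted_path, recipient_or_phrase,
                 symmetric=False):
    """Run gpg and raise CalledProcessError if it reports a failure."""
    subprocess.check_call(
        build_gpg_command(source_path, encrypted_path,
                          recipient_or_phrase, symmetric))
```

Keeping the command construction separate from the call that runs it makes it easy to print the command for a dry run instead of executing it. Bear in mind that a passphrase passed as an argument is visible to other users in the process list while gpg runs.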

Once again, no warranty is given, MIT license. If you see that I’ve done anything dumb, let me know :)

  • Dark Sylinc

    I don’t use S3, nor am I a top security expert, BUT…

    I’m concerned in case you store the MD5 of the original file in a public server (that’s what you’re doing, right?) and then compare it with the MD5 in your local copy.

    By exposing the MD5 of the original file you lower the security of the stored files. Especially since MD5 has a known theoretical vulnerability which could make it easier to reconstruct the file from the MD5 hash sum rather than trying to crack the encrypted file itself (or perhaps it could help cracking the encryption, depending on the algorithm used).

    A better, much more secure option is to store a hash map of “MD5 Original” -> “MD5 Encrypted” pairs on your LOCAL PC in a separate file and compare.

    1) Calculate MD5 from your local file (from now on MD5Local).
    2) Retrieve the encrypted MD5 from the server (MD5Server).
    3) Look up MD5Local in the table (MD5Encrypt) and compare it with MD5Server. If they match, stop. If they don’t match or the entry wasn’t found…
    4) Encrypt again
    5) Update the map’s MD5Encrypt and upload the file.

    Also, ideally, your script should handle files that were already encrypted. You should never encrypt a file twice. It makes it an awful lot less safe.
    It may not be your case, but since you’re making it available under the MIT license, it would be worth noting it in the Python script, as a comment at least.

    My suggestion works much the same as yours. Both trade precomputing for some additional disk space.
    The difference is that yours stores the original MD5 on the public server and mine stores the original MD5 and the encrypted MD5 on YOUR machine.
    If you move to another PC, you should copy the file where you store the map’s pairs (if you don’t, your script will start uploading again until the map gets filled with pairs again).
    Depending on your files, that table file might get very big, so you may want to remove the oldest entries when it gets too big.

    Cheers
    Dark Sylinc

  • Frenetic

    Dark Sylinc: You can’t feasibly reconstruct a file larger than a few kilobytes from a 128-bit hash. Any non-security-related information revealed by an insecure hash/checksum could just as easily be found by other means when attempting to crack the encryption, which at any rate would remain non-trivial (assuming the encryption algorithm used hasn’t been broken, you’d still need to use a brute force attack to get anything useful).

    Plus, I’m sure I’m not the only one thinking about this xkcd strip ;)

  • http://www.stevestreeting.com Steve

    Thanks for the feedback!

    However, I definitely don’t want to keep a redundant database of MD5 mappings, and I’m not really convinced by the suggested vulnerability – as far as I’m aware, the use of an MD5 hash of the original data to aid in the cracking of encrypted data is entirely theoretical.

    “You can’t feasibly reconstruct a file larger than a few kilobytes from a 128-bit hash.”

    You can’t even do that. You can’t regenerate even a 1k file from a 16-byte hash value; it’s not physically possible to pull new data out of the air like that.

    What you *can* do is generate some new data that will result in the same hash value, that’s where MD5 is now considered ‘vulnerable’ (for breaking password systems which match MD5s, or forging signatures) – but that’s not a vector I’m bothered by, since I’m not trying to verify authenticity. I haven’t seen any information that suggests you can get anything even remotely useful about the original data back out of an MD5.

    Maybe I’ve totally missed some technical article on how you can reconstruct data from an MD5 hash? If so I’d love to see it, please link…

  • Dark Sylinc

    Oh yeah, I love that xkcd strip!

    I won’t deny your thoughts, and I probably overreacted about what can be done with an MD5.

    But when I wrote that I had the following thoughts in my mind:
    1) Security is about looking to the future, and is meant to protect data long enough that by the time it is broken, the data itself is useless. But on the other hand we don’t know what new methods may arise in the future. Perhaps tomorrow or in 100 years (or never) a method will be discovered to speed up the cracking of encrypted data by knowing the original’s MD5 hash. A counter-argument would be that it is just as likely that the encryption algorithm becomes completely broken (i.e. we find out how to find prime numbers instantly).
    But not exposing the MD5 hash decreases such CHANCES.

    2) A thief specifically targeting Steve’s information could see the server’s files (encrypted), and by knowing which file he needs (i.e. by looking at filenames and timestamps) he can get the MD5, get access to Steve’s computer (even if that means physically stealing it, or breaking in while he’s home) and quickly find it, because he just needs to look for a file on the HDD with the matching MD5.
    A counter-argument is that if an IP thief is that determined, he will get it anyway. But again, you shouldn’t make his life easier.

    These arguments sound a bit crazy; probably they are, maybe not. But the point is about reducing the probability of theft, and more importantly, storing the MD5 hashes on your local PC isn’t that big a deal.

    Cheers
    Dark Sylinc

  • Jeff

    Dark Sylinc: If a thief knows the path and filename of what he wants, I’m not sure doing a mapping through the MD5 hash of the contents of the file makes it any easier to find that file on Steve’s computer. And if the thief is searching for the filename of a file he has the MD5 hash of, that means he already has the file. Also I’m not sure why the thief wouldn’t just steal/image Steve’s HDD and then browse it at his leisure.

    That said, I’m not so sure I’d want to give someone the MD5 hash of a file named my-credit-card-number.txt.

  • http://www.stevestreeting.com Steve

    @Dark: I definitely appreciate the feedback. However, I do still disagree :)

    1) I certainly agree that keeping an eye on the future is a good idea. However, hashes like MD5 have repeatedly resisted inversion even for small input data, never mind reasonable file sizes. I can’t even imagine a situation where knowing the (tiny) hash of some original data would assist you in breaking the (completely separate) encryption algorithm, to the extent that if it did, I suspect the encryption algorithm would be proven flawed anyway – as you say. I don’t think that exposing the MD5 changes the chances at all; I think they’re orthogonal issues. You could just as well say that knowing the name of an encrypted file, or the name of the person who encrypted it, would make the security weaker, because chances are that data might occur somewhere in one of the files – but that information is regularly known. A 16-byte hash value is so abstract compared to the original data that it, IMO, represents a *far* lower risk than far more commonly known information.

    2) As Jeff points out, if someone stole my physical machine, they wouldn’t give a rat’s arse about the MD5s! They’d have everything anyway. Why on earth would you break into my S3 account (files are private anyway), download the MD5s, then trawl through my stolen HDD calculating MD5s to match up with that, when all you really need to do is search the disk contents for keywords like ‘bank’? :D Quite the opposite of making his life easier, I think if he was crazy enough to try the MD5 matching route, he would be slowed down a fair bit! Maybe you’re overthinking this… ;)

    Again though, I appreciate the thoughts – such challenges are good to make me think about these cases, even if I don’t reach the same conclusions you do.

    “That said, I’m not so sure I’d want to give someone the MD5 hash of a file named my-credit-card-number.txt.”

    If you have an unencrypted file on your HDD called my-credit-card-number.txt then you deserve everything you get ;)

    For what it’s worth, I don’t even expose file names or individual files in the way I use this – my uploads are Bacula volumes which are chunks of compressed and aggregated files, and then obviously encrypted – as such it represents an even lower risk. But still, I think the approach I’m taking is secure enough even for regular files.

  • Jeff

    “I think the approach I’m taking is secure enough even for regular files.”

    Well, quite a few config/dot-files would probably be both simple and structured enough that all the contents except a secret password could be guessed. And a few might have a predictable update frequency, allowing someone who has the ability to view your backups over time to identify the individual file based on file size and the frequency of changes. In that case, I think an offline dictionary attack against the password would be possible, albeit with a pretty large salt (the other contents of the file) preventing a table-based attack. Unlikely, sure, and it would require pretty sizable resources to pull off such an attack (though if we’re talking about not trusting S3 themselves, they’d presumably use the EC2 cloud to do the dictionary attack :) ), but I still wouldn’t want to move away from the batching of files pre-hashing. I’d guess this sort of thing has probably been examined in more detail regarding the permissions on checksum databases for HIDSes.

  • http://www.stevestreeting.com Steve

    Fair point. I assume that would be the case with any encryption on separate, identifiable files where 90+% of the file content is guessable. Presumably encrypted filesystems like ZFS have that same possible vector, since the metadata is still unencrypted – I wonder if anyone has done analysis on that?