In this lesson, we look at cryptographic hashes, which are tools for guaranteeing information integrity.
Hashes are used in a variety of ways, including password authentication, which we'll talk a bit about below.
Hashes allow us to check the integrity of a message, i.e. check that the message has not been tampered with, even without a copy of the "original" message. In the setting of passwords, this means that we can check a password without ever storing a copy of the password, which is important for security.
Thus, even though what hashing provides directly is a way to check integrity, it can actually be used to help support authentication, as well. In fact, hashing is also used to help provide non-repudiation.
Example: a Prediction. Suppose you wanted to prove to someone — but someone you dislike — that you are psychic. You might, for example, predict the winner of next season's The Voice. If you give him your prediction now, he might place a bet and make a lot of money, which is bad because you dislike him. On the other hand, if you give him your prediction after the season, he won't believe you're psychic.
What you need is to give him something beforehand, something that:
gives no information before or during the season about your prediction
after the season provides indisputable evidence that you predicted the real winner (or, if you failed, that you didn't).
It's worth thinking a bit about this example. Hashing will
provide a guarantee before hand of the integrity of
the message you'll send afterwards. Integrity,
one of the Pillars of Cyber Security!
Essentially, what we need is something that will guarantee that you haven't modified the message since before the The Voice season started. There is a cryptographic technique called hashing that provides this "something", this guarantee of integrity.
How It Works. Here's a simple scheme would allow us to prove our psychic-ness. At the right you see a table with some letters and punctuation marks. Next to each of them is a sequence of two colored squares. We write down the name of the predicted winner, and we replace each character with the squares from its table entry. Then, we take an solved Rubik's Cube and go through the sequence of squares in order, interpreting each square as a quarter-turn clockwise twist of the cube-face with that color as its center square. When we're done, we give the messed up Rubik's cube to "the guy". The messed up Rubik's Cube is the hash of the predicted winner's name.
Proving it Later. Once the season is over, we give "the guy" the name of the winner we predicted, and he can take a second solved Rubik's cube, go through the moves for that input, and verify that what he gets is the same as the Rubik's cube you gave him.
Barring some incredibly unlikely coincidence, no two people would have names that hash to the same value, so "the guy" will be convinced that the message (just a name) you gave him after the fact is the same as the message you had hashed beforehand, and thus he will be convinced that you are really psychic. Hashing like this guarantees the integrity of your prediction. If you had predicted incorrectly, the hash value of your prediction would be a complete mismatch from the confirmation hash value later.
You might object that for some people (and certainly for computers) solving Rubik's Cube is easy. We can address this in two ways: we could use a Rubik's 4-cube, 5-cube or 6-cube to up the difficulty, or we could (complicate the algorithm for the) hash by running through the input forward then backward. Our Rubik's Cube Hash example is just an analogy, though. The key is defining a mathematical function that (among other things), is easy to compute in the forward direction (given input, produce output), but far more difficult to compute in reverse (given output, produce input). Real-world hash functions are far more difficult to compute in reverse (i.e., "solve") than our Rubik's Cube, but the analogy holds: it's easy to compute the cube configuration given an input (sequence of moves), but more difficult, given a cube configuration, to deduce exactly which moves produced that configuration.
How many moves to solve a Rubik's cube? A group of researchers, using computer time donated to them by Google, proved that the least number of moves that suffices to solve any scrambled cube is 20. Read the details.
Can you solve a Rubik's cube while juggling?
What about our goal of not giving away the name of our predicted winner prior to the end of the season? Well, finding a input that hashes to the mixed up cube you gave "the guy" means solving a Rubik's Cube, and that's hard! This is the fundamental insight behind hashing: it's easy to produce a hash from an input (a sequence of bytes), but very hard, given a hash, to find the input that produced it. Not impossible, just hard to do fast — like solving a Rubik's Cube, which is not impossible, just hard to do fast. In this case, finding "a" input that hashes to the same cube isn't enough, because you might find a different input (probably not a person's name) that just happens to hash to the same value.
As the The Voice season progresses, the field of contestants gets small enough that "the guy", if he has lots of time or has lots of friends (and Rubik's cubes) helping him, could figure out the name of the predicted winner from the hash without having to solve a Rubik's Cube. He could simply get the Rubik's Hash of every remaining contestant's name and find the one that matches the cube you gave him. Because there are too few possible messages (contestant names), you risk giving away your prediction to "the guy". This is called a dictionary attack, by the way. When the pool of possible inputs that could have produced the hash value is small enough, one can try them all and test for matches. This pool of candidate inputs is the "dictionary".
In situations like this, we can add complexity by salting the input before hashing. You could, for instance, choose several random letters and add them to the front and the back of the name you've predicted before you actually take the hash. These random letters are the "salt". Now, "the guy" can't try hashing each contestant's name, even when it's down to just a few contestants, because he'd have to try each name with each possible salt value, and there way too many possibilities. After the competition, you present him with an input like
and he can verify that you did indeed predict "John_Smith" to win. Adding "salt" multiplies the effective size of the "dictionary" of possible input values, making an exhaustive ("brute-force") trial of all possible input values less practical.
Properties of a Good Cryptographic Hash Function
There are several properties that a good cryptographic hash function should have.
It should be easy to compute the hash of an input.
If all you have is a hash value, it should be very hard to find an input that hashes to that value. Our Rubik's Cube Hash has this property (and the first property), at least if you accept Rubik's cube solving as being difficult.
It should be difficult to find two inputs with the same hash. Related to that, it should be hard to take a hash value and a input that hashes to it, and engineer from the first input another input that hashes to that same value. It might not be obvious that someone could cause trouble if they could do this, but what should be obvious is that if we can do this, our hashing scheme is not capable of guaranteeing the integrity of messages, and that's bad.
A very common application of hashing is password checking. Suppose there's a server you login to from time to time. It's undesirable to store your password on the server's hard drive. There are tricks that might allow people to see that file and, once they do, they'd have your password and be able to log in as you. So instead, we'd like the server to store something that it can use to verify your password, but which nobody can use to figure out your password. We can use a hash function!
Servers have a special file (the password file) that stores usernames and the hashes of passwords. When you log in, you give your password, they hash it and check the hash against the entry for your username in the password file stored on their hard drive.
Because people aren't very creative about choosing passwords, there is a relatively small pool of likely passwords, which means dictionary attacks are a problem; choosing unlikely passwords reduces this risk. Unfortunately, people have collected huge precomputed lists of passwords and their hash values. These precomputed lists, often stored in a special format called a called Rainbow Table, can be searched very quickly.
The effectiveness of Rainbow Table attacks can be reduced by using salt. Typically, a server will store a user's login name, salt input (different for each user, and randomly chosen when the account is created), and the hash of "password+salt". For a rainbow table to be effective, it has to store the hash value for every possible password+salt combination. So, using a 10-byte salt, for example, even if there are a million passwords in the pool, there would be 1,208,925,819,614,629,174,706,176,000,000 potential entries in the rainbow table. That's 1.2×1018 terabytes. That size of table is impractical.
Note that salt doesn't help against a brute-force attack of all possible passwords, assuming attacker has the password file, because the attacker therefore also has each user's individual salt value. But salt does slow down a precomputed, dictionary-based attack (such as those employing Rainbow Tables), especially in the common situation where an attacker isn't trying to break into a specific account, but is looking for any account to break into.
If you contact a company/site to tell them you've forgotten your password, and they can tell you what your password was ... they're storing your actual password and not the hash (how do we know that?). You should be indignant! Do you think it never happens? Check out this New York Times article about the RockYou.com break-in. Hackers broke into rockyou.com and stole files with the passwords of over 32 million RockYou.com users. This was only possible because RockYou.com stored actual passwords, not hashes-of-passwords. [Here's a brief follow-up article.]
450k Yahoo Passwords Leaked (7/2012)
It's been a few years since the RockYou.com incident. Surely no company would store passwords "in the clear", i.e. no cryptographic protections ... right? Wrong! In July 2012, attackers used an SQL injection attack to steal user credentials of about 450,000 yahoo.com accounts. Apparently, the database was storing the actual passwords for those accounts, not hashes! [CNET article] [net-security.org article] This blog post has a breakdown of the most common passwords in the bunch (notice that '123456' is on top). Similar occurrence at Billabong too!
Strength of passwords
There are two scenarios for cracking passwords, online and offline attacks.
Online Attacks. In this setting, the attacker does not possess a password-hash file, but actually tries to login with many different passwords. Most systems take a long time (in computer terms) to respond to a login — especially an incorrect login. This limits the effectiveness of this kind of attack. For instance, if it takes 2 seconds to get a "login failed" message, you can only try 30 passwords a minute. If you had an 8-character password, even if all characters were lowercase letters, if would take on average over 6,000 years to log in successfully in this scenario. Also, unsuccessful login attempts are usually logged, so that even a mildly vigilant system administrator would be aware that someone was attempting to hack into the account after a few minutes of trying.
Many systems employ throttling, in which the server starts introducing longer and longer delays as the same user tries more and more logins within a brief period. In fact, often systems lock an account if there are too many unsuccessful attempts in a short period of time. With this policy, an attacker couldn't possibly be successful unless he had narrowed down the password to one of a very small number of possibilities. So, in this scenario even a relatively short password will be relatively secure provided it is not easily guessable.
A Really Bad Password The worst thing you can do is choose a password that's really common. The RockYou.com breach allows us all to see what some of these are. The five most common passwords of the 32 million that were stolen were:
So if you use one of those, you're making it really easy for people who want to break into your account. Think it doesn't happen? Check out this Feb 2012 Huffington Post story about how a Syrian official's emails were stolen because his password was 12345. Check out this article, which has a fun discussion of popular PINs.
LinkedIn Hacked ... and they didn't use salt! In the summer of 2012, the popular social media site linkedin.com had account information stolen, including password hashes. Shockingly, this very high-profile site was not salting passwords! They changed that immediately, of course, but it was a bit late at that point. The ACM Queue article LinkedIn Password Leak: Salt Their Hide has a good discussion of why this is so disastrous!
Number of passwords of length n is kn, where k is the number of characters the password came from. So if you restrict yourself to lowercase letters, there are 26n passwords of length n. If you draw from both uppercase and lower case, there are 52n. If you draw from both uppercase, lower case and digits 0-9, there are 62n. And so on. However, length is far more important than the size of the character set you draw from. There are 138,000 times more passwords of length 12 using only lower case, than passwords of length 6 using all possible printable characters.
Offline Attacks. In this scenario, things are a lot more dire, because someone has stolen the password file. Because the attacker has his own copy of the password file, he doesn't have to try logging into a real system to check whether he's found the password. He simply hashes his guessed password and checks if it matches a hash in the file. This means he can check millions — potentially billions — of passwords per second. There are three basic attacks here: brute force (try every input of length 1, then 2, then 3 ...), a dictionary attack, and precomputed (rainbow table) attacks. Long passwords that include lower and upper case letters, numbers and punctuation characters are most secure. What's also important is that your password is not a real word or names and, most of all, don't choose things like iloveyou1, that about a million other people also use as a password.
As we've seen, salting passwords provides some security in the face of an offline attack, since precomputed tables can't be used. It's also possible to add a site-wide salt value, so that what's actually stored in the password file is the hash of password + user-salt + site-salt. This way an attacker has to steal the site-salt value as well as the password file. Finally, a site can employ password stretching. The idea behind password stretching is to make the computation of a single hash very time-consuming. That way, brute force attacks take a whole lot longer. One way to do this is to store not the hash of the password+salt, but the hash of the hash of the hash of the hash of ... of the hash of the password+salt. If you use the 10,000-times iterated hash, and attacker has to wait 10,000 times longer for a password to be cracked.
If you like the xkcd-style passwords (as described in the comic to
the right), you can use the sy110 xkcd-style password generator. You can check out this graphic for some tips — you'll have to click to expand it in order to make it readable.
Client and Server Responsibilities
In class, you went through an activity in which you steal the password file for the class message board, and read off usernames
and passwords. This prompts the instructor to turn on hashing of passwords for the message board. When you go back and steal the password file again, all you found are usernames and hashes-of-passwords. This prompts us to try to crack the hashes to recover the passwords. We have the RockYou.com dictionary of passwords, with precomputed MD5 hashes for each of them. This way, we crack many passwords quite quickly. This prompts the instructor to turn on salting of passwords, so the precomputed hashes can't be used. This slows down our attack quite a bit, but still some passwords can be cracked. Finally, we discuss what makes for secure (hard to crack) passwords. We end up with the following observation: both the server (website) and the client (user) have some responsibilities in protecting identities with passwords. Here is our summary:
Use hashing (Increases exploit effort in event of password file compromise)
Use salt (Protect against Rainbow Table Attack)
Control/restrict access to the password file (Reduce likelihood of password file compromise)
Uncommon password (Protect against Dictionary Attack)
Uses multiple character sets (A-Z, a-z, 0-9, special characters) (Protect against Dictionary Attack. Protect against (increase effort of) Brute Force Attack)
Long password (Protect against (increase effort of) Brute Force Attack)
Do not reuse passwords from other accounts (Decreases impact in event of password file compromise)
Two factor authentication
Dropbox hacked ... will now offer 2-factor authentication
In July 2012, hackers managed to break into the account of an
employee at dropbox.com, a well-known "cloud storage" site,
using a stolen password.
This gave them access they should not have had — access
to dropbox.com user information, for example.
Dropbox.com has decided to offer a two-factor authentication
option (using your cell-phone as the second factor) to add
another layer of defense against this kind of attack. Check
Why It's Needed. It's hard to remember a long password, especially if it's a good one with lowercase, uppercase, etc. Thus passwords may be broken, and of course they may be stolen in various ways. One approach to stronger authentication is two-factor authentication.
How It Works. The idea is that to authenticate (for instance to log into a system), you have to provide two things:
something you know (a password, usually)
something you have or something you are.
Examples. A smart card would be "something you have", while a fingerprint or iris scan would be an example of "something you are" (something inherent to you). An attacker would have to both have your password and the other factor. An interesting idea that Google uses for gmail (optionally) is that when you initiate login, a text-message is sent to your cell phone with a secret 1-time number. You enter that number along with your password to authenticate. Essentially, the cell phone (something you have) is acting as the second factor in this two-factor authentication system.
Password Cracking in the Navy
Navy commands routinely run password cracking software on their
own password files to ensure users are using strong passwords.
Specifically, the Navy uses L0phtCrack,
a program that can crack Windows and UNIX passwords. L0phtCrack
can be configured to use dictionary attacks (like the attacks we
performed during the class activities). Results from the routine
password cracks are reported to and monitored by upper levels of
the chain of command.
Security means never standing still
To keep their information systems secure, the Navy has to
constantly update and upgrade not only their hardware and
software, but even the cryptographic algorithms the hardware and
MD5 was the standard for a long time, but as researchers started
to discover flaws (or even potential flaws), the Navy
had to migrate to a newer, better hashing algorithm, which was
SHA1. Even that is now less secure than it was a few years ago.
covering the Department of the Navy's announcement that systems
will be migrating (gradually changing) to a still stronger
cryptographic hashing algorithm: SHA-256.