The previous lesson introduced you to symmetric (secret key) encryption. Symmetric encryption is, quite clearly, a tool for providing confidentiality (as in the IA pillar). In this lesson, we look at cryptographic hashes, which are tools for guaranteeing integrity. Hashes are used in a variety of ways, including password authentication, which we'll talk a bit about below. Hashes allow us to check the integrity of a message, i.e. check that the message is the same as an earlier message, without remembering the message itself — indeed without ever having seen the message before. In the setting of passwords, this means that we can check a password without ever storing a copy of the password, which is important for security. Thus, even though what hashing provides directly is a way to check integrity, it is used to help provide authentication. In fact, hashing is also used to help provide non-repudiation.

Rubik's Hash

## Hashes

Suppose you wanted to prove to someone — but someone you hate — that you are psychic. You might, predict the winner of next season's American Idol. But if you give him your prediction now, he might place a bet and make a lot of money, which is bad because you hate him. On the other hand, if you give him your prediction after the season, he won't believe you're psychic. (Nor would I!) What you need is give him him something before-hand, something a) that gives no information before or during the season about your prediction, and b) that after the season provides indisputable evidence that you predicted the real winner (or, if you failed, that you didn't).

It's worth thinking a bit about this example. Hashing will provide a guarantee before hand of the integrity of the message you'll send afterwards. Integrity, that's an IA pillar!
Essentially, what we need is for you to be able to send a message with your prediction after the season, but you must have sent something before the season that will provide a guarantee of the integrity of the message you send afterwards — i.e. something that will guarantee that you haven't modified the message since before the American Idol season started. There is a cryptographic technique called hashing that provides this "something", this guarantee of integrity.

## Rubik's Hash

Here's a simple scheme would allow us to prove our psychic-ness. At the right you see a table with some letters and punctuation marks. Next to each of them is a sequence of two colored squares. We write down the name of the predicted winner, and we replace each character with the squares from its table entry. Then, we take an solved Rubik's Cube and go through the sequence of squares in order, interpreting each square as a quarter-turn clockwise twist of the cube-face with that color as its center square. When we're done, we give the messed up Rubik's cube to "the guy". The messed up Rubik's Cube is the hash of the predicted winner's name.

Once the season is over, we give "the guy" the name of the winner we predicted, and he can take a second solved Rubik's cube, go through the moves for that string, and verify that what he gets is the same as the Rubik's cube you gave him.

Barring some incredibly unlikely coincidence, no two people would have names that hash to the same value, so "the guy" will be convinced that the message (just a name) you gave him after the fact is the same as the message you had hashed beforehand, and thus he will be convinced that you are really psychic. Hashing like this guarantees integrity.

You might object that for some people (and certainly for computers) solving Rubik's Cube is easy. We can address this in two ways: we could use a Rubik's 4 or 5 or 6-cube to up the difficulty or, more profoundly, we could hash by running through the string forward then backward. This second approach makes more work for us when we do the hashing, but it means that solving a Rubik's cube is not enough. You need to find a solution that looks like doing one sequence of moves, then doing the same moves in reverse (in fact it's a bit different than that) and ending up with a solved cube. As far as I know, nobody — not even computers — can do that fast! So for this discussion we treat solving a Rubik's cube as if it is a really hard problem, realizing that to make it truly hard we just need to make this small change to the rules for how the hash is computed. Let's call it Rubik's Cube Hash version 2.0.
How many moves to solve a Rubik's cube?
A group of researchers, using computer time donated to them by Google, proved that "God's Number", the least number of moves that suffices to solve any scrambled cube, is 20. Read the details.
Can you solve a Rubik's cube while juggling?
What about our goal of not giving away the name of our predicted winner prior to the end of the season? Well, finding a string that hashes to the mixed up cube you gave "the guy" means solving a Rubik's Cube, and that's hard! This is the fundamental insight behind hashing: it's easy to produce a hash from a string, but very hard to find a string that produces the given hash. Not impossible, just hard to do fast — like solving a Rubik's Cube, which is not impossible, just hard to do fast. In this case, finding "a" string that hashes to the same cube isn't enough, because you might find a different string (probably not a person's name) that just happens to hash to the same value.

## Salt

As the American Idol season progresses, the field of contestants gets small enough that "the guy", if he has lots of time or has lots of friends (and Rubik's cubes) helping him, could figure out the name of the predicted winner from the hash without having to solve a Rubik's Cube. He could simply get the Rubik's Hash of every remaining contestant's name and find the one that matches the cube you gave him. Because there are too few possible messages (contestant names), you risk giving away your prediction to "the guy". This is called a dictionary attack, by the way. When the pool of strings from which came the one that produced the hash value is small enough, one can try them all and test for matches. This pool of candidate strings is the "dictionary".

In situations like this, we can salt the string before hashing. You could, for instance, choose several random letters and add them to the front and the back of the name you've predicted before you actually take the hash. These random letters are the "salt". Now, "the guy" can't try hashing each contestant's name, even when it's down to just a few contestants, because he'd have to try each name with each possible salt value, and there way too many possibilities. After the race, you present him with a string like

`XOKZPJOHNSMITHXOKZP`
and he can verify that you did indeed predict "John_Smith" to win. The only fly in the ointment is this: "the guy" might object that you could have different salt combinations for every potential contestant's name that all hash to the same cube you gave him before the season.

Is that feasible? Actually I don't know. It's certainly much harder than solving a Rubik's Cube though! So as long as we all accept that solving a Rubik's cube is hard, "the guy" cannot reasonably imagine that you had messages on hand that hash to the cube you gave him for more than one person's name.

## Properties of a good cryptographic hash function

There are several properties that a good cryptographic hash function should have.
1. It should be easy to compute the hash of a string.
2. If all you have is a hash value, it should be very hard to find a string that hashes to that value. Our Rubik's Cube Hash has this property (and the first property), at least if you accept Rubik's cube solving as being difficult. The Rubik's Cube Hash version 2.0 certainly has this property.
3. It should be difficult to find two strings with the same hash. Related to that, it should be hard to take a hash value and a string that hashes to it, and engineer from the first string another string that hashes to that same value. It might not be obvious that someone could cause trouble if they could do this, but what should be obvious is that if we can do this, our hashing scheme is not capable of guaranteeing the integrity of messages, and that's bad.
Our Rubik's hash, even version 2.0, doesn't have the third property, because I could insert double commas anywhere I wanted to and, since two commas means four quarter-turns of the same face, it doesn't change the hash value. However, it would be very hard to engineer a meaningful messages — with different meanings — with the same hash. As discussed earlier, finding a message that means you predicted "John_Smith" and another message with the same hash value that means you predicted "Tina_Jones" is really hard. We'll look at a commonly used cryptographic hash function next lesson, but for now it's worth knowing that "real" cryptographic hashes are numbers, not scrambled Rubik's cubes.

If you contact a company/site to tell them you've forgotten your password, and they can tell you what your password was ... they're storing your actual password and not the hash (how do we know that?). You should be indignant! Do you think it never happens? Check out this New York Times article about the RockYou.com break-in. Hackers broke into rockyou.com and stole files with the passwords of over 32 million RockYou.com users. This was only possible because RockYou.com stored actual passwords, not hashes-of-passwords. [Here's a brief follow-up article.]

# 450k Yahoo Passwords Leaked (7/2012)

It's been a few years since the RockYou.com incident. Surely no company would store passwords "in the clear", i.e. unencrypted ... right? Wrong! In July 2012, attackers used an SQL injection attack to steal user credentials from about 450,000 yahoo.com accounts. Apparently, the database was storing the actual passwords for those accounts, not hashes! [CNET article] [net-security.org article]

This blog post has a breakdown of the most common passwords in the bunch. Notice that 123456 is still on top! Yeah!

Billabong too!

Because people aren't very creative about choosing passwords, there is a relatively small pool of likely passwords, which means dictionary attacks are a problem. Choosing unlikely passwords ameliorates this. Worse yet, people have collected huge precomputed lists of passwords and their hash values. These lists, called Rainbow Tables can be searched very quickly.

Rainbow tables can be defeated using salt. Typically, a server will store a user's login name, salt string (different for each user, and randomly chosen when the account is created), and the hash of salt+password. For a rainbow table to be effective, it has to store the hash value for every possible salt+password combination. So, using a 10-byte salt for example, even if there are a million passwords in the pool, there would be 1,208,925,819,614,629,174,706,176,000,000 entries in the rainbow table. That's 1.2×1018 terabytes, for those of you keeping track at home. That size of table is impractical. Note that this doesn't help against a dictionary attack, since someone who had the password file would also have each user's salt. But it does slow down a dictionary attack, especially in the common situation in which an attacker isn't trying to break into a specific account, but is looking for any account to break into.

There are two scenarios for cracking passwords: 1) someone tries to break into your account by attempting actual logins with many different passwords [online attack], and 2) someone steals the password file and tries to find a string with the same hash as is stored in your entry in the password file (if salt is used it's there in the password file) [offline attack]. These scenarios are different and require different strengths to be secure.
2. A Really Bad Password The worst thing you can do is choose a password that's really common. The the RockYou.com break-in allows us all to see what some of these are. The five most common passwords of the 32 million that were stolen were:
1. 123456
2. 12345
3. 123456789
5. iloveyou
So if you use one of those, you're making it really easy for people who want to break into your account. Think it doesn't happen? Check out this Feb. 2012 msnbc story about how a Syrian official's emails were stolen because his password was 12345.
Check out this article, which has a fun discussion of popular PINs.
LinkedIn Hacked ... and they didn't use salt!
In the summer of 2012, the popular social media site linkedin.com had account information stolen, including password hashes. Shockingly, this very high-profile site was not salting passwords! They changed that immediately, of course, but it was a bit late at that point. The ACM Queue article LinkedIn Password Leak: Salt Their Hide has a good discussion of why this is so disastrous!

Number of passwords of length n is kn, where k is the number of characters the password came from. So if you restrict yourself to lowercase letters, there are 26n passwords of length n. If you draw from both uppercase and lower case, there are 52n. If you draw from both uppercase, lower case and digits 0-9, there are 62n. And so on. However, length is far more important than the size of the character set you draw from. There are 138,000 times more passwords of length 12 using only lower case, than passwords of length 6 using all possible printable characters.
3. Offline Attacks In the second scenario, things are a lot more dire. It's good to remember, however, that we're only in the second scenario if something else has already gone wrong — i.e. if someone has stolen the password file. Because the attacker has his own copy of the password file, he doesn't have to try logging into a real system to check whether he's found the password. He simply hashes his guessed password and checks if it matches a hash in the file. This means he can check millions — potentially billions — of passwords per second. There are three basic attacks here: brute force (try every string of length 1, then 2, then 3 ...), a dictionary attack, and rainbow table attacks. Long passwords that include lower and upper case letters, numbers and punctuation characters are most secure. What's also important is that your password is not a real word or names and, most of all, don't choose things like `iloveyou1`, that about a million other people also use as a password.

As we've seen, salting passwords provides some security in the face of an offline attack, since precomputed tables can't be used. It's also possible to use a site-wide salt value, so that what's actually stored in the password file is the hash of password + user-salt + site-salt. This way an attacker has to steal the site-salt value as well as the password file. Finally, a site can employ password stretching. The idea behind password stretching is to make the computation of a single hash very time-consuming. That way, brute force attacks take a whole lot longer. One way to do this is to store not the hash of the password+salt, but the hash of the hash of the hash of the hash of ... of the hash of the password+salt. If you use the 10,000-times iterated hash, and attacker has to wait 10,000 times longer for a password to be cracked.

If you like the xkcd-style passwords (as described in the comic to the right), you can use the si110 xkcd-style password generator. You can check out this graphic for some tips — you'll have to click to expand it in order to make it readable. I'm not completely sold on everything it says, but it certainly gives you plenty of ideas of what not to do!

## Two factor authentication

Dropbox hacked ... 2-factor authentication coming
In July 2012, hackers managed to break into the account of an employee at dropbox.com, a well-known "cloud storage" site, using a stolen password. This gave them access they should not have had — access to dropbox.com user information, for example. Dropbox.com has decided to offer a two-factor authentication option (using your cell-phone as the second factor) to add another layer of defense against this kind of attack. Check out this arstechnica article.