The previous lesson introduced you to symmetric (secret key) encryption. Symmetric encryption is, quite clearly, a tool for providing confidentiality (as in the IA pillar). In this lesson, we look at cryptographic hashes, which are tools for guaranteeing integrity. Hashes are used in a variety of ways, including password authentication, which we'll talk a bit about below. Hashes allow us to check the integrity of a message, i.e. check that the message is the same as an earlier message, without remembering the message itself — indeed without ever having seen the message before. In the setting of passwords, this means that we can check a password without ever storing a copy of the password, which is important for security. Thus, even though what hashing provides directly is a way to check integrity, it is used to help provide authentication. In fact, hashing is also used to help provide non-repudiation.

Rubik's Hash

Hashes

Suppose you wanted to prove to someone — but someone you hate — that you are psychic. You might, predict the winner of next season's American Idol. But if you give him your prediction now, he might place a bet and make a lot of money, which is bad because you hate him. On the other hand, if you give him your prediction after the season, he won't believe you're psychic. (Nor would I!) What you need is give him him something before-hand, something a) that gives no information before or during the season about your prediction, and b) that after the season provides indisputable evidence that you predicted the real winner (or, if you failed, that you didn't).

It's worth thinking a bit about this example. Hashing will provide a guarantee before hand of the integrity of the message you'll send afterwards. Integrity, that's an IA pillar!

Essentially, what we need is for you to be able to send a message with your prediction after the season, but you must have sent something before the season that will provide a guarantee of the integrity of the message you send afterwards — i.e. something that will guarantee that you haven't modified the message since before the American Idol season started. There is a cryptographic technique called hashing that provides this "something", this guarantee of integrity.

Rubik's Hash

Here's a simple scheme would allow us to prove our psychic-ness. At the right you see a table with some letters and punctuation marks. Next to each of them is a sequence of two colored squares. We write down the name of the predicted winner, and we replace each character with the squares from its table entry. Then, we take an solved Rubik's Cube and go through the sequence of squares in order, interpreting each square as a quarter-turn clockwise twist of the cube-face with that color as its center square. When we're done, we give the messed up Rubik's cube to "the guy". The messed up Rubik's Cube is the hash of the predicted winner's name.

Once the season is over, we give "the guy" the name of the winner we predicted, and he can take a second solved Rubik's cube, go through the moves for that string, and verify that what he gets is the same as the Rubik's cube you gave him.

Barring some incredibly unlikely coincidence, no two people would have names that hash to the same value, so "the guy" will be convinced that the message (just a name) you gave him after the fact is the same as the message you had hashed beforehand, and thus he will be convinced that you are really psychic. Hashing like this guarantees integrity.

You might object that for some people (and certainly for computers) solving Rubik's Cube is easy. We can address this in two ways: we could use a Rubik's 4 or 5 or 6-cube to up the difficulty or, more profoundly, we could hash by running through the string forward then backward. This second approach makes more work for us when we do the hashing, but it means that solving a Rubik's cube is not enough. You need to find a solution that looks like doing one sequence of moves, then doing the same moves in reverse (in fact it's a bit different than that) and ending up with a solved cube. As far as I know, nobody — not even computers — can do that fast! So for this discussion we treat solving a Rubik's cube as if it is a really hard problem, realizing that to make it truly hard we just need to make this small change to the rules for how the hash is computed. Let's call it Rubik's Cube Hash version 2.0.

How many moves to solve a Rubik's cube?
A group of researchers, using computer time donated to them by Google, proved that "God's Number", the least number of moves that suffices to solve any scrambled cube, is 20. Read the details.

Can you solve a Rubik's cube while juggling?

What about our goal of not giving away the name of our predicted winner prior to the end of the season? Well, finding a string that hashes to the mixed up cube you gave "the guy" means solving a Rubik's Cube, and that's hard! This is the fundamental insight behind hashing: it's easy to produce a hash from a string, but very hard to find a string that produces the given hash. Not impossible, just hard to do fast — like solving a Rubik's Cube, which is not impossible, just hard to do fast. In this case, finding "a" string that hashes to the same cube isn't enough, because you might find a different string (probably not a person's name) that just happens to hash to the same value.

Salt

As the American Idol season progresses, the field of contestants gets small enough that "the guy", if he has lots of time or has lots of friends (and Rubik's cubes) helping him, could figure out the name of the predicted winner from the hash without having to solve a Rubik's Cube. He could simply get the Rubik's Hash of every remaining contestant's name and find the one that matches the cube you gave him. Because there are too few possible messages (contestant names), you risk giving away your prediction to "the guy". This is called a dictionary attack, by the way. When the pool of strings from which came the one that produced the hash value is small enough, one can try them all and test for matches. This pool of candidate strings is the "dictionary".

In situations like this, we can salt the string before hashing. You could, for instance, choose several random letters and add them to the front and the back of the name you've predicted before you actually take the hash. These random letters are the "salt". Now, "the guy" can't try hashing each contestant's name, even when it's down to just a few contestants, because he'd have to try each name with each possible salt value, and there way too many possibilities. After the race, you present him with a string like

XOKZPJOHNSMITHXOKZP

and he can verify that you did indeed predict "John_Smith" to win. The only fly in the ointment is this: "the guy" might object that you could have different salt combinations for every potential contestant's name that all hash to the same cube you gave him before the season.

Is that feasible? Actually I don't know. It's certainly much harder than solving a Rubik's Cube though! So as long as we all accept that solving a Rubik's cube is hard, "the guy" cannot reasonably imagine that you had messages on hand that hash to the cube you gave him for more than one person's name.

Properties of a good cryptographic hash function

There are several properties that a good cryptographic hash function should have.

It should be easy to compute the hash of a string.
If all you have is a hash value, it should be very hard to find a string that hashes to that value. Our Rubik's Cube Hash has this property (and the first property), at least if you accept Rubik's cube solving as being difficult. The Rubik's Cube Hash version 2.0 certainly has this property.
It should be difficult to find two strings with the same hash. Related to that, it should be hard to take a hash value and a string that hashes to it, and engineer from the first string another string that hashes to that same value. It might not be obvious that someone could cause trouble if they could do this, but what should be obvious is that if we can do this, our hashing scheme is not capable of guaranteeing the integrity of messages, and that's bad.

Our Rubik's hash, even version 2.0, doesn't have the third property, because I could insert double commas anywhere I wanted to and, since two commas means four quarter-turns of the same face, it doesn't change the hash value. However, it would be very hard to engineer a meaningful messages — with different meanings — with the same hash. As discussed earlier, finding a message that means you predicted "John_Smith" and another message with the same hash value that means you predicted "Tina_Jones" is really hard. We'll look at a commonly used cryptographic hash function next lesson, but for now it's worth knowing that "real" cryptographic hashes are numbers, not scrambled Rubik's cubes.

Password Authentication

If you contact a company/site to tell them you've forgotten your password, and they can tell you what your password was ... they're storing your actual password and not the hash (how do we know that?). You should be indignant! Do you think it never happens? Check out this New York Times article about the RockYou.com break-in. Hackers broke into rockyou.com and stole files with the passwords of over 32 million RockYou.com users. This was only possible because RockYou.com stored actual passwords, not hashes-of-passwords. [Here's a brief follow-up article.]

450k Yahoo Passwords Leaked (7/2012)

It's been a few years since the RockYou.com incident. Surely no company would store passwords "in the clear", i.e. unencrypted ... right? Wrong! In July 2012, attackers used an SQL injection attack to steal user credentials from about 450,000 yahoo.com accounts. Apparently, the database was storing the actual passwords for those accounts, not hashes! [CNET article] [net-security.org article]

This blog post has a breakdown of the most common passwords in the bunch. Notice that 123456 is still on top! Yeah!

Billabong too!

A very common application of hashing is password checking. Suppose there's a server you login to from time to time. It's undesirable to store your password on the server's harddrive. There are tricks that might allow people to see that file and, once they do, they'd have your password and be able to login as you. So instead, we'd like the server to store something that it can use to verify your password, but which nobody can use to figure out your password. Of course we use a hash. Servers have a special file (the password file) that stores usernames and the hashes of passwords When you login, you give your password, they hash it and check against the entry for your username in the password file stored on their harddrive. An important point to note with this system: to login as user foo, I don't need to have foo's password — I just need to know a string with the same hash as foo's password.

Because people aren't very creative about choosing passwords, there is a relatively small pool of likely passwords, which means dictionary attacks are a problem. Choosing unlikely passwords ameliorates this. Worse yet, people have collected huge precomputed lists of passwords and their hash values. These lists, called Rainbow Tables can be searched very quickly.

Rainbow tables can be defeated using salt. Typically, a server will store a user's login name, salt string (different for each user, and randomly chosen when the account is created), and the hash of salt+password. For a rainbow table to be effective, it has to store the hash value for every possible salt+password combination. So, using a 10-byte salt for example, even if there are a million passwords in the pool, there would be 1,208,925,819,614,629,174,706,176,000,000 entries in the rainbow table. That's 1.2×10¹⁸ terabytes, for those of you keeping track at home. That size of table is impractical. Note that this doesn't help against a dictionary attack, since someone who had the password file would also have each user's salt. But it does slow down a dictionary attack, especially in the common situation in which an attacker isn't trying to break into a specific account, but is looking for any account to break into.

Strength of passwords

There are two scenarios for cracking passwords: 1) someone tries to break into your account by attempting actual logins with many different passwords [online attack], and 2) someone steals the password file and tries to find a string with the same hash as is stored in your entry in the password file (if salt is used it's there in the password file) [offline attack]. These scenarios are different and require different strengths to be secure.

Online Attacks In the first setting, someone actually tries to login with many different passwords. Most systems take a long time (in computer terms) to respond to a login — especially an incorrect login. This limits the effectiveness of this kind of attack. For instance, if it takes 2 seconds to get an "login failed" message, you can only try 30 passwords a minute. If you had an 8-character password, even if all characters were lowercase letters, if would take on average over 6,000 years to login successfully in this scenario. Moreover, unsuccessful login attempts are usually logged, so that even a mildly vigilant system administrator would be aware that someone was attempting to hack into the account after a few minutes of trying. Finally, many systems employ throttling, in which the server starts introducing longer and longer delays as the same user tries more and more logins within a brief period. In fact, often systems lock an account if there are too many unsuccessful attempts in a short period of time. With this policy, an attacker couldn't possibly be successful unless he had narrowed down the password to one of a very small number of possibilities. (USNA WiFi simply locks you out after just three unsuccessful attempts!) So, in this scenario even a relatively short password will be secure provided it not guessable, which usually means not a word or number or name, and certainly not anything connected with you, like your old high school's name.

A Really Bad Password The worst thing you can do is choose a password that's really common. The the RockYou.com break-in allows us all to see what some of these are. The five most common passwords of the 32 million that were stolen were:

123456
12345
123456789
password
iloveyou

So if you use one of those, you're making it really easy for people who want to break into your account. Think it doesn't happen? Check out this Feb. 2012 msnbc story about how a Syrian official's emails were stolen because his password was 12345.
Check out this article, which has a fun discussion of popular PINs.

LinkedIn Hacked ... and they didn't use salt!
In the summer of 2012, the popular social media site linkedin.com had account information stolen, including password hashes. Shockingly, this very high-profile site was not salting passwords! They changed that immediately, of course, but it was a bit late at that point. The ACM Queue article LinkedIn Password Leak: Salt Their Hide has a good discussion of why this is so disastrous!

Number of passwords of length n is kⁿ, where k is the number of characters the password came from. So if you restrict yourself to lowercase letters, there are 26ⁿ passwords of length n. If you draw from both uppercase and lower case, there are 52ⁿ. If you draw from both uppercase, lower case and digits 0-9, there are 62ⁿ. And so on. However, length is far more important than the size of the character set you draw from. There are 138,000 times more passwords of length 12 using only lower case, than passwords of length 6 using all possible printable characters.

Offline Attacks In the second scenario, things are a lot more dire. It's good to remember, however, that we're only in the second scenario if something else has already gone wrong — i.e. if someone has stolen the password file. Because the attacker has his own copy of the password file, he doesn't have to try logging into a real system to check whether he's found the password. He simply hashes his guessed password and checks if it matches a hash in the file. This means he can check millions — potentially billions — of passwords per second. There are three basic attacks here: brute force (try every string of length 1, then 2, then 3 ...), a dictionary attack, and rainbow table attacks. Long passwords that include lower and upper case letters, numbers and punctuation characters are most secure. What's also important is that your password is not a real word or names and, most of all, don't choose things like iloveyou1, that about a million other people also use as a password.
As we've seen, salting passwords provides some security in the face of an offline attack, since precomputed tables can't be used. It's also possible to use a site-wide salt value, so that what's actually stored in the password file is the hash of password + user-salt + site-salt. This way an attacker has to steal the site-salt value as well as the password file. Finally, a site can employ password stretching. The idea behind password stretching is to make the computation of a single hash very time-consuming. That way, brute force attacks take a whole lot longer. One way to do this is to store not the hash of the password+salt, but the hash of the hash of the hash of the hash of ... of the hash of the password+salt. If you use the 10,000-times iterated hash, and attacker has to wait 10,000 times longer for a password to be cracked.

If you like the xkcd-style passwords (as described in the comic to the right), you can use the si110 xkcd-style password generator. You can check out this graphic for some tips — you'll have to click to expand it in order to make it readable. I'm not completely sold on everything it says, but it certainly gives you plenty of ideas of what not to do!

The SI110 Password Activity

In class, you went through an activity in which you stole the password file for the class message board, and read off usernames and passwords. This prompted the instructor to turn on hashing of passwords for the message board. When you went back and stole the password file again, all you found were usernames and hashes-of-passwords. This prompted us to try to crack the hashes to recover the passswords. We had the RockYou.com dictionary of passwords, with precomputed MD5 hashes for each of them. This way, we could crack many passwords quite quickly. This prompted the instructor to turn on salting of passwords. That way the precomputed hashes couldn't be used. This slowed down our attack quite a bit, but still some passwords could be cracked. Finally, we talked about what made for secure (hard to crack passwords). We ended up with the following observation: both the website and the user have responsibilities in protecting identities with passwords. Here is our summary:

Two factor authentication

Dropbox hacked ... 2-factor authentication coming
In July 2012, hackers managed to break into the account of an employee at dropbox.com, a well-known "cloud storage" site, using a stolen password. This gave them access they should not have had — access to dropbox.com user information, for example. Dropbox.com has decided to offer a two-factor authentication option (using your cell-phone as the second factor) to add another layer of defense against this kind of attack. Check out this arstechnica article.

It's hard to remember a long password, especially if it's a good one with lowercase, uppercase, etc. Thus passwords may be broken, and of course they may be stolen in various ways. One approach to stronger authentication is two-factor authentication. The idea is that to authenticate (for instance to login to a system) you have to provide two pieces of information: something you know (a password, usually), and something you have/are. A smart card would be a "something you have", a fingerprint would be an example of a "something you are". The security here is that an attacker would have to both have your password and the other factor — your smart card, or finger print or whatever it might be. An interesting idea that google uses for gmail (optionally) is that when you initiate login, a text-message is sent to your cell phone with a secret 1-time number. You enter that number along with your password to authenticate. Essentially the cell phone (something you have) is acting as the second factor in this two-factor authentication system.

How Apple and Amazon Security Flaws Led to My Epic Hacking
Mat Honan wrote an interesting article in Wired about his digital life getting hacked. This story highlights a number of really important lessons:

security is holistic — whether Mr. Honan chose a strong password, or whether apple used salt & hashing didn't matter ... because the protocol apple uses for authenticating a person asking for a password to be reset is incredibly weak.
security is ... well, it's holistic — Mr. Honan's Twitter account (the goal of the attack) was secure enough in isolation, but nothing's isolated. There were connections to other accounts of his, that were connected to other accounts, and in the context of all these connections, his Twitter account was not secure.
you don't have to outrun the bear — Mr. Honan states that, if he'd used two-factor authentication for his gmail account, this probably wouldn't have happened. Why? Because the attacker would have simply turned to an easier target.
back stuff up

Update: Both Apple and Amazon have changed policies in response to this article/incident.

Password Cracking in the Navy
Navy commands routinely run password cracking software on their own password files to ensure users are using strong passwords. Specifically, the Navy uses L0phtCrack, a program that can crack Windows and UNIX passwords. L0phtCrack can be configured to use dictionary attacks (like the attacks we performed during the class activities). Results from the routine password cracks are reported to and monitored by upper levels of the chain of command.

Security means never standing still
To keep their information systems secure, the Navy has to constantly update and upgrade not only their hardware and software, but even the cryptographic algorithms the hardware and software use. MD5 was the standard for a long time, but as researchers started to discover flaws (or even potential flaws), the Navy had to migrate to a newer, better hashing algorithm, which was SHA1. Even that is now less secure than it was a few years ago. Consider this brief article covering the Department of the Navy's announcement that systems will be migrating (gradually changing) to a still stronger cryptographic hashing algorithm: SHA-256.