The previous lesson introduced you to symmetric (secret key)
encryption. Symmetric encryption is, quite clearly, a tool for
providing confidentiality (as in the IA pillar).
In this lesson, we look at cryptographic hashes, which
are tools for guaranteeing integrity.
Hashes are used in a variety of ways, including password
authentication, which we'll talk a bit about below.
Hashes allow us to check the integrity of a message, i.e. check
that the message is the same as an earlier message, without
remembering the message itself — indeed without ever
having seen the message before. In the setting of passwords,
this means that we can check a password without ever storing a
copy of the password, which is important for security.
Thus, even though what hashing provides directly is a way to
check integrity, it is used to help provide authentication.
In fact, hashing is also used to help provide non-repudiation.
Rubik's Hash
Hashes
Suppose you wanted to prove to someone — but someone you
hate —
that you are psychic.
You might, predict the winner of next season's American Idol.
But if you
give him
your prediction now, he might
place a bet and make a lot of money, which is bad
because you hate him. On the other hand, if you
give him
your prediction after the season, he won't believe you're psychic. (Nor would I!)
What you need is
give him
him something before-hand, something
a) that gives no information before or during the season about your
prediction, and b) that after the season provides indisputable evidence
that you predicted the real winner (or, if you failed, that you didn't).
It's worth thinking a bit about this example. Hashing will
provide a guarantee before hand of the integrity of
the message you'll send afterwards. Integrity,
that's an IA pillar!
Essentially, what we need is for you to be able to
send a message with
your prediction after the season, but you
must have sent something before the season that will
provide a guarantee of the integrity of the message
you send afterwards — i.e. something that will guarantee
that you haven't modified the message since before the
American Idol season started.
There is a cryptographic technique called hashing
that provides this "something", this guarantee of integrity.
Rubik's Hash
Here's a simple scheme would allow us to prove our psychic-ness.
At the right you see a table with some letters and punctuation
marks. Next to each of them is a sequence of two colored
squares.
We write down the name of the predicted winner, and we
replace each character with the squares from its table
entry. The, we take an solved Rubik's Cube and go through the
sequence of squares in order, interpreting each square as a
quarter-turn clockwise twist of the cube-face with that color as
its center square. When we're done, we give the messed up
Rubik's cube to "the guy". The messed up Rubik's Cube is
the hash of the predicted winner's name.
Once the season is over, we give "the guy" the name of the winner we
predicted, and he can take a second solved Rubik's cube, go
through the moves for that string, and verify that what he gets
is the same as the Rubik's cube you gave him.
Barring some incredibly unlikely coincidence, no two people
would have names that hash to the same value, so "the guy" will
be convinced that the message (just a name) you gave him after
the fact is the same as the message you had hashed beforehand,
and thus he will be convinced that you
are really psychic. Hashing like this guarantees integrity.
You might object that for some people (and
certainly for computers) solving Rubik's Cube is easy.
We can address this in two ways: we could use a Rubik's 4
or 5 or 6-cube to up the difficulty or, more profoundly, we
could hash by running through the string forward then backward.
This second approach makes more work for us when we do the
hashing, but it means that solving a Rubik's cube is not
enough. You need to find a solution that looks like doing
one sequence of moves, then doing the same moves in reverse
(in fact it's a bit different than that)
and ending up with a solved cube. As far as I know, nobody
— not even computers — can do that fast!
So for this discussion we treat solving a Rubik's cube as if
it is a really hard problem, realizing that to make it truly
hard we just need to make this small change to the rules for
how the hash is computed. Let's call it Rubik's Cube Hash
version 2.0.
What about our goal of not giving away the name of our
predicted winner prior to the end of the season?
Well, finding a string that hashes to the mixed up cube you
gave "the guy" means solving a Rubik's Cube, and that's hard!
This is the fundamental insight behind hashing: it's easy to
produce a hash from a string, but very hard to find a string that
produces the given hash. Not impossible, just hard to do fast
— like solving a Rubik's Cube, which is not impossible,
just hard to do fast. In this case, finding "a" string that
hashes to the same cube isn't enough, because you might find a
different string (probably not a person's name) that just
happens to hash to the same value.
Salt
As the American Idol season progresses,
the field of contestants gets small enough that "the guy", if he
has lots of time or has lots of friends (and Rubik's cubes)
helping him, could
figure out the name of the predicted winner from the hash without having to
solve a Rubik's Cube. He could simply get the Rubik's Hash of
every remaining contestant's name and find the one that matches the
cube you gave him. Because there are too few
possible messages (contestant names), you risk giving away your
prediction to "the guy".
This is called a dictionary attack, by the way. When
the pool of strings from which came the one that produced the hash
value is small enough, one can try them all and test for
matches. This pool of candidate strings is the "dictionary".
In situations like this, we can salt the string before
hashing. You could, for instance, choose several random letters
and add them to the front and the back of the name you've
predicted before you actually take the hash.
These random letters are the "salt".
Now, "the guy" can't try hashing each contestant's name, even
when it's down to just a few contestants,
because he'd have to try each name with each possible salt
value, and there way too many possibilities. After the race,
you present him
with a string like
XOKZPJOHNSMITHXOKZP
and he can verify that you did indeed predict "John_Smith" to
win. The only fly in the ointment is this: "the guy" might
object that you could have diffent salt combinations for every
potential contestant's
name that all hash to the same cube you gave him before
the season.
Is that feasible? Actually I don't know. It's
certainly much harder than solving a Rubik's Cube though!
So as long as we all accept that solving a Rubik's cube is
hard, "the guy" cannot reasonably imagine that you had messages
on hand that hash to the cube you gave him for more than one
person's name.
Properties of a good cryptographic hash function
There are several properties that a good cryptographic hash
function should have.
- It should be easy to compute the hash of a string.
- If all you have is a hash value, it should be very hard to
find a string that hashes to that value. Our Rubik's Cube Hash
has this property (and the first property), at least if you accept Rubik's
cube solving as being difficult. The Rubik's Cube Hash version
2.0 is certainly difficult certainly has this property.
-
It should be difficult to find two strings
with the same hash. Related to that, it should be hard to take
a hash value and a string that hashes to it, and engineer from
the first string another string that hashes to that same
value. It might not be obvious that someone could cause trouble if they
could do this, but what should be obvious is that if we can do
this, our hashing scheme is not capable of guaranteeing the
integrity of messages, and that's bad.
Our Rubik's hash, even version 2.0, doesn't have the third
property, because I could insert double commas anywhere I wanted
to and, since two commas means four quarter-turns of the same
face, it doesn't change the hash value. However, it would be
very hard to engineer a meaningful messages — with
different meanings — with the same hash. As discussed
earlier, finding a message that means you predicted "John_Smith"
and another message with the same hash value that means you
predicted "Tina_Jones" is really hard.
We'll look at a commonly used cryptographic hash function next
lesson, but for now it's worth knowing that "real" cryptographic
hashes are numbers, not scrambled Rubik's cubes.
Password Authentication
RockYou.com story.
If you contact a company/site to tell them you've forgotten
your password, and they can tell you what your password was
... they're storing your actual password and not the hash
(how do we know that?).
You should be indignant!
A very common application of hashing is password checking.
Suppose there's a server you login to from time to time.
It's undesirable to store your password on the server's
harddrive. There are tricks that might allow people to see that
file and, once they do, they'd have your password and be able to
login as you. So instead, we'd like the server to store
something that it can use to verify your password,
but which nobody can use to figure out your password.
Of course we use a hash. Servers have a special file (the
password file) that stores usernames and the hashes of passwords
When you login, you give your password, they hash it and check
against the entry for your username in the password file stored
on their harddrive. An important point to note with this system:
to login as user foo, I don't need to
have foo's password — I just need to know a string
with the same hash as foo's password.
Because people aren't very creative about choosing passwords,
there is a relatively small pool of likely passwords, which
means dictionary attacks are a problem. Choosing unlikely
passwords ameliorates this. Worse yet, people have collected
huge precomputed lists of passwords and their hash values.
These lists, called Rainbow Tables can be searched
very quickly.
Rainbow tables can be defeated using salt. Typically, a
server will store a user's login name, salt string (different
for each user, and randomly chosen when the account is
created), and the hash of salt+password. For a rainbow table
to be effective, it has to store the hash value for every
possible salt+password combination. So, using a 10-byte salt
for example, even if there are
a million passwords in the pool, there would be
1,208,925,819,614,629,174,706,176,000,000
entries in the rainbow table.
That's 1.2×1018 terabytes, for those of you
keeping track at home. That size of table is impractical.
Note that this doesn't help against a dictionary attack, since
someone who had the password file would also have each user's
salt. But it does slow down a dictionary attack, especially
in the common situation in which an attacker isn't trying to
break into a specific account, but is looking for any account
to break into.
Strength of passwords
There are two scenarios for breaking passwords:
1) someone tries to break in to your account by
attempting actual logins with many different passwords, and
2) someone steals the password file and tries to find a string
with the same hash as your password (if salt is used it's there
in the password file).
These scenarios are different and require different strengths to
be secure.
In the first setting, someone actually tries to login with many
different passwords. Most systems take a long time (in
computer terms) to respond to a login — especially an
incorrect login. This limits the effectiveness of this kind of
attack. For instance, if it takes 2 seconds to get an
"login failed" message, you can only try 30 passwords a minute.
If you had an 8-character password, even if all characters were
lowercase letters, if would take on average over 6,000 years to
login successfully in this scenario.
Moreover, unsuccessful login attempts are usually logged, so
that even a mildly vigilant system administrator would be aware
that someone was attempting to hack into the account after a
few minutes of trying.
Finally, many systems lock an account if there are too many
unsuccessful attempts in a short period of time. With this
policy, an attacker couldn't possibly be successful unless he
had narrowed down the password to one of a very short number of
possibilities. So, in this scenario a relatively short
password will be secure, provided it not guessable,
which usually means not a word or number or name, and certainly
not anything connected with you, like your old high school's name.
Number of passwords of length n is
kn, where k is the number of characters
the password came from. So if you restrict yourself to
lowercase letters, there are 26n passwords of length n.
If you draw from both uppercase and lower case, there are
52n.
If you draw from both uppercase, lower case and digits 0-9, there are
62n. And so on.
However, length is far more important than the size of the
character set you draw from. There are 138,000 times more
passwords of length 12 using only lower case, than passwords
of length 6 using all possible printable characters.
In the second scenario, things are a lot more dire. It's good
to remember, however, that we're only in the second scenario if
something else has already gone wrong — i.e. if someone
has stolen the password file.
Because the attacker has his own copy of the password file, he
doesn't have to try logging into a real system to check whether
he's found the password. He simply hashes his guessed password
and see if it matches a hash in the file. This means he can
check millions — potentially billions — of
passwords per second.
There are three basic attacks here: brute force (try every
string of length 1, then 2, then 3 ...), a dictionary attack,
and rainbow table attacks. Long passwords that include lower and upper
case letters, numbers and punctuation characters are most
secure. What's also important is don't choose real words or
names, and don't choose things like iloveyou1,
that about a million other people also use as a password.
Two factor authentication
It's hard to remember a long password, especially if it's a good
one with lowercase, uppercase, etc. Thus passwords may be
broken, and of course they may be stolen in various ways.
One approach to stronger authentication is two-factor
authentication. The idea is that to authenticate (for
instance to login to a system) you have to provide two pieces of
information: something you know (a password, usually), and
something you have/are. A smart card would be a
"something you have", a fingerprint would be an example of a
"something you are".
The security here is that an attacker would have to both
have your password and the other factor — your smart card,
or finger print or whatever it might be.
An interesting idea that google uses for
gmail (optionally) is that when you initiate login, a
text-message is sent to your cell phone with a secret 1-time
number. You enter that number along with your password to
authenticate. Essentially the cell phone (something you have)
is acting as the second factor in this two-factor authentication
system.