Cryptographic Hash Functions
Compression
A (cryptographic) hash function works as follows:
- Input: any arbitrary length data
- Output: short, fixed-length digest.
Most outstanding examples include MD5, SHA-1, SHA-2, SHA-3:
| the output length in bits
| | MD5 | 128 (16 bytes)
| | SHA-1 | 160 (20 bytes)
| | SHA-2 (SHA-256, SHA-384, SHA-512) | 256 (32 bytes), 384, 512
| | SHA-3 | 224, 256, 384, 512
|
Nowadays, SHA-256 is the most popular.
|
|
|
Properties of Cryptographic Hash functions
- Collision resistance:
It is computationally infeasible to find a collision in \(H\). In other words,
it is computationally infesiable to find any two different \(x_1\) and \(x_2\)
such that \( H(x_1) = H(x_2) \).
- Preimage resistance:
Suppose \(y\) is the hash of some random data \(x\) chosen from
exponetially large possibilities. Then, it is computationally
infeasible to find \( x \) such that \( H(x) = y \).
Why is it important to choose random data from expoenitally large
possibilities?
It is because otherwise, a simple brute-force attack can easily find the
preimage. For example, suppose \(x\) is chosen from 1000 possibilities. Then,
computing 1000 hashes for all possible candidates will definitely give you
\(x\)!
- Randomly looking: The output of hash looks random. In particular,
even a slight change in the input should result in a totally different hash
output.
For example, as you see below, H("hello") and H("zello") look totally
different.
>>> import hashlib
>>> H1 = hashlib.md5()
>>> H1.update(b"hello")
>>> H1.digest()
b']A@*\xbcK*v\xb9q\x9d\x91\x10\x17\xc5\x92'
>>> H1.hexdigest()
'5d41402abc4b2a76b9719d911017c592'
>> H2 = hashlib.md5()
>>> H2.update(b"zello")
>>> H2.digest()
b'F\t\xa5\x95\xb4\xaf\x97\xeaX\xa3>\x15\xe8O\x82\xe6'
>>> H2.hexdigest()
'4609a595b4af97ea58a33e15e84f82e6'
Refer to the
documentation for how to use the cryptographic hash functions
in Python.
Collision Attacks on (Broken) Cryptographic Hash Functions
Birthday attack
Recall the birthday
paradox:
If there are \( 20 \approx \sqrt{365} \) people,
then with probability about half, there are two people with the same birthday.
Due to this paradox, the best guarantee against collision attacks that any
cryptographic hash function with \( n \) bit output is \( \sqrt {2^n} = 2^{n/2}
\), which means the attacker has to compute \( 2^{n/2} \) hashes.
In practice, this is still a huge number for a large \(n\). For example, with
SHA-256, we have \(n=256\), and one needs to compute \(2^{128}\) hashes, which
will take at least \(10^{20}\) trillion years.
Security of MD5
The security of the MD5 hash function is severely compromised. Don't use
this. A collision attack exists that can find collisions within one minute. In
2012, according to Microsoft, the authors of the Flame malware used an MD5
collision to forge a Windows code-signing certificate. See
this for more detail.
Security of SHA-1
Until recently, SHA-1 was the most commonly used hash algorithm all over the
world. However, researchers from Google published a practical attack on SHA-1
called Shattered.
Consequently, web browsers and other software that relies on hashing for
security moved to SHA-2, which is still believed to be secure. This is
the most commonly used hash algorithm all over the world now, and there is no
known practical attack so far, thereby with the practical security.
Password-based Authentication
Authentication is the act of proving an assertion, such as the identity of a
computer system user.
- In contrast with identification, the act of indicating a
person or thing's identity, authentication is the process of verifying that
identity.
- Authentication might involve validating personal identity documents,
verifying the authenticity of a website with a digital certificate,
determining the age of an artifact by carbon dating, or ensuring that a
product or document is not counterfeit.
- The most popular user-authentication mechanism is password-based
authentication. Assuming only the account owner knows the password, if
a user gives the correct password, then the user must be the account owner.
Password-based authentication: storing hash of a password
Instead of storing the actual passwords, the server can store hash of them.
-
Due to preimage resistance, even if this password file is stolen, it will be
infeasible to find out the actual password from the file, as long as the
passwords are randomly chosen among exponentially large possibilities.
Unfortuately, sometimes the system may have a user who has chosen a easily
predictable password. In this case, this protection mechanism using
a cryptographic hash does give much security.

-
Due to collision resistance, it is safe to assume that only the actual
password will match the given hash. So, the password authentication will
still work correctly.
Other Applications
Benefiting from collision resistance and compression
If data \(x\) is really large, one can use its hash \(y = H(x)\) instead of
using the actual x as its representation. Due to collision resistance, it is
safe to assume that there is no other data \(z\) whose hash is also \(y\). So,
it is safe to regard \(y\) as the short representation of \(x\). In this sense,
we often call this \(y\) a fingerprint of \(x\).
This concept of using the hash as a short representation is quite useful:
Shell Commands
There are shell commands that compute the hash checksum of a file.
~$ md5sum two.pcap
6e76ea38b1791107ebef45ec4b99ec50 two.pcap
~$ sha1sum two.pcap
b96f81be3ecff92338bf4305f4edb157e680f162 two.pcap
~$ sha256sum two.pcap
d99ee086e193fa67f09b0e21836447f9d0741995f86608260460feffc7babf84 two.pcap