Cryptographic Hash Functions

Compression

A (cryptographic) hash function works as follows:

Input: any arbitrary length data
Output: short, fixed-length digest.

Most outstanding examples include MD5, SHA-1, SHA-2, SHA-3:

	the output length in bits
MD5	128 (16 bytes)
SHA-1	160 (20 bytes)
SHA-2 (SHA-256, SHA-384, SHA-512)	256 (32 bytes), 384, 512
SHA-3	224, 256, 384, 512

Nowadays, SHA-256 is the most popular.

Properties of Cryptographic Hash functions

Collision resistance: It is computationally infeasible to find a collision in \(H\). In other words, it is computationally infesiable to find any two different \(x_1\) and \(x_2\) such that \( H(x_1) = H(x_2) \).
Preimage resistance: Suppose \(y\) is the hash of some random data \(x\) chosen from exponetially large possibilities. Then, it is computationally infeasible to find \( x \) such that \( H(x) = y \).
Why is it important to choose random data from expoenitally large possibilities?
It is because otherwise, a simple brute-force attack can easily find the preimage. For example, suppose \(x\) is chosen from 1000 possibilities. Then, computing 1000 hashes for all possible candidates will definitely give you \(x\)!

Randomly looking: The output of hash looks random. In particular, even a slight change in the input should result in a totally different hash output.

For example, as you see below, H("hello") and H("zello") look totally different.


>>> import hashlib

>>> H1 = hashlib.md5()
>>> H1.update(b"hello")
>>> H1.digest()
b']A@*\xbcK*v\xb9q\x9d\x91\x10\x17\xc5\x92'
>>> H1.hexdigest()
'5d41402abc4b2a76b9719d911017c592'

>> H2 = hashlib.md5()
>>> H2.update(b"zello")
>>> H2.digest()
b'F\t\xa5\x95\xb4\xaf\x97\xeaX\xa3>\x15\xe8O\x82\xe6'
>>> H2.hexdigest()
'4609a595b4af97ea58a33e15e84f82e6'

Refer to the documentation for how to use the cryptographic hash functions in Python.

Collision Attacks on (Broken) Cryptographic Hash Functions

Birthday attack

Recall the birthday paradox:

If there are \( 20 \approx \sqrt{365} \) people, then with probability about half, there are two people with the same birthday.

Due to this paradox, the best guarantee against collision attacks that any cryptographic hash function with \( n \) bit output is \( \sqrt {2^n} = 2^{n/2} \), which means the attacker has to compute \( 2^{n/2} \) hashes.

In practice, this is still a huge number for a large \(n\). For example, with SHA-256, we have \(n=256\), and one needs to compute \(2^{128}\) hashes, which will take at least \(10^{20}\) trillion years.

Security of MD5

The security of the MD5 hash function is severely compromised. Don't use this. A collision attack exists that can find collisions within one minute. In 2012, according to Microsoft, the authors of the Flame malware used an MD5 collision to forge a Windows code-signing certificate. See this for more detail.

Security of SHA-1

Until recently, SHA-1 was the most commonly used hash algorithm all over the world. However, researchers from Google published a practical attack on SHA-1 called Shattered.

Consequently, web browsers and other software that relies on hashing for security moved to SHA-2, which is still believed to be secure. This is the most commonly used hash algorithm all over the world now, and there is no known practical attack so far, thereby with the practical security.

Password-based Authentication

Authentication is the act of proving an assertion, such as the identity of a computer system user.

In contrast with identification, the act of indicating a person or thing's identity, authentication is the process of verifying that identity.
Authentication might involve validating personal identity documents, verifying the authenticity of a website with a digital certificate, determining the age of an artifact by carbon dating, or ensuring that a product or document is not counterfeit.
The most popular user-authentication mechanism is password-based authentication. Assuming only the account owner knows the password, if a user gives the correct password, then the user must be the account owner.

Password-based authentication: storing hash of a password

Instead of storing the actual passwords, the server can store hash of them.

Due to preimage resistance, even if this password file is stolen, it will be infeasible to find out the actual password from the file, as long as the passwords are randomly chosen among exponentially large possibilities. Unfortuately, sometimes the system may have a user who has chosen a easily predictable password. In this case, this protection mechanism using a cryptographic hash does give much security.
Due to collision resistance, it is safe to assume that only the actual password will match the given hash. So, the password authentication will still work correctly.

Other Applications

Benefiting from collision resistance and compression

If data \(x\) is really large, one can use its hash \(y = H(x)\) instead of using the actual x as its representation. Due to collision resistance, it is safe to assume that there is no other data \(z\) whose hash is also \(y\). So, it is safe to regard \(y\) as the short representation of \(x\). In this sense, we often call this \(y\) a fingerprint of \(x\).

This concept of using the hash as a short representation is quite useful:

Integrity check: Check this site.

...(omitted)...
5433b9fae57b6dc2a07cd0b8c38a1713bd1fb6f4b05633b7ac963e7bda994898 *virtualbox-6.1_6.1.32-149290~Ubuntu~bionic_amd64.deb
74f1bc3196b8415aaac3b7a76df0d123145d4b0e242ee58e09424d0a7cd430e3 *virtualbox-6.1_6.1.32-149290~Ubuntu~eoan_amd64.deb
da26093b1e9faf7a3b0c4557026342be22400ebaa0d4a619b74291628659cf27 *virtualbox-6.1_6.1.32-149290~Ubuntu~xenial_amd64.deb

Using these hash values, you can verify the integrity of the file that you downloaded.

Searching
Searching can be much more efficient by using the fingerprints instead of the actual data items.

Shell Commands

There are shell commands that compute the hash checksum of a file.

~$ md5sum two.pcap
6e76ea38b1791107ebef45ec4b99ec50  two.pcap
~$ sha1sum two.pcap
b96f81be3ecff92338bf4305f4edb157e680f162  two.pcap
~$ sha256sum two.pcap
d99ee086e193fa67f09b0e21836447f9d0741995f86608260460feffc7babf84  two.pcap