SY110: Hashing, Passwords, and Authentication

Hashing, Passwords, & Authentication

Learning Outcomes

After completing these activities you should be able to:

Introduction

In this lesson and the accompanying lab, we will explore the topics of password security, cryptographic hash functions, and general considerations regarding authentication.

Password Storage and Security

When you create an account, the system you're logging into needs a way to verify your identity. For password-based authentication, this means your password must be stored somewhere-typically a password file. On a Linux system, this is often the /etc/shadow file, while a website would store this on its web server. This password file is a highly valuable target for attackers. For this reason, system administrators use strict file permissions to ensure it's only accessible to those who absolutely need it. However, proper permissions can sometimes fail, and password files are leaked.

A classic example of this is the 2009 RockYou data breach, where a developer of social media widgets suffered a leak of over 32 million user accounts. One of the main security failures was that the passwords were stored " in the clear"-meaning they were completely readable once the file was compromised. From the data leaked in the RockYou breach, we can make a few important observations:

Number of passwords of length n is kn, where k is the number of characters the password came from. So if you restrict yourself to lowercase letters, there are 26n passwords of length n. If you draw from both uppercase and lowercase, there are 52n. If you draw from both uppercase, lowercase and digits 0-9, there are 62n. And so on. However, length is far more important than the size of the character set you draw from. There are 138,000 times more passwords of length 12 using only lower case, than passwords of length 6 using all possible printable characters.

This demonstrates that even if a password file is not stolen, attackers can easily guess common passwords or those from known breaches. This is why strong passwords-using a mix of upper and lowercase letters, numbers, and special characters-are often required.

Ultimately, on password-based systems, both the client and server have important responsibilities:

Cryptographic Hash Functions

Let's return to the problem of password storage. If storing passwords in the clear is too dangerous, what's a better alternative? The answer is a (cryptographic) hash function.

A hash function takes an input (like a password or a file) and performs a one-way mathematical operation on it. This creates a unique, fixed-length output called a hash (also known as a digest or message digest. This is different from encryption, which is a two-way process designed to be reversed (e.g., decryption).

There are many different hash algorithms, each producing a different "flavor" of hash; some common examples include MD5, SHA-1, SHA-256, and SHA-512. Below are some example hashes (you can use various online tools, like this site, to calculate these as well):

What makes a good hash function?

For a hash function to be useful in cybersecurity, it must have following properties:

  1. Simplicity: It should be fast and simple to compute the hash of an input.
  2. Pre-image resistance: If all you have is a hash value, it should be very hard to find an input that hashes to that value. This is why a small change in the input, such as changing a single character, results in a massive, unpredictable change in the output, known as the avalanche effect. (For example, the SHA-256 hash of midshipman is c6af71bcd8ce2ad7f345b9249fbf15ec29d0b98b1d4df29bf1038bb208c172b0, but the hash of Midshipman is b1e13d852e1525f3522a194d061375c61abaaee30bb56f28b38bd5d27c17a45c - they aren't close at all!) This is also why hashes typically produce fixed length values - if a long input produced a longer hash value, it would leak information about the input.
  3. Deterministic: A given input will always generate the same output.
  4. Collision resistance: It should be difficult to find two different inputs that produce the same hash, something known as a collision. We will discuss collisions later in the lecture.

How is a hash function one-way?

The idea of a function that's easy to calculate but impossible to reverse can be confusing. Hash functions achieve this "one-way" property by chaining together a series of mathematical operations that are designed to destroy information about the original input. We won't fully explore this concept, but we will demonstrate the idea with two simple operations that are part of many modern hash functions: the exclusive OR (XOR) operation and modular arithmetic.

XOR (Exclusive OR) Operation

The XOR operation takes two binary inputs and outputs a 1 if the inputs are different, and a 0 if they are the same. An XOR truth table is shown below to illustrate this concept.



When this operation is used in a hash function, we are left with only the output, and we can't be certain of the original input value. If the output was 1, we don't know if A was 1 and B was 0, or A was 0 and B was 1. A cryptographic hash uses many of these non-reversible operations in sequence, making it computationally infeasible to work backward to find the original input.

Modular Arithmetic

Another concept that destroys information is modular arithmetic (sometimes called "clock arithmetic"), where we only keep the remainder after division. For example, consider the expression:
(C + D) mod 7

If C is 5 D is 4, then:

(5 + 4) mod 7 = 9 mod 7 = 2

But if all you saw was the output (2), you wouldn't know the original inputs. The inputs could have been C = 5, D = 4 - but also C = 1, D = 8, or C = 30, D = 35, and many others.

In other words, different inputs can produce the same result, and the original values are not recoverable just from the output. Chaining together operations like modular arithmetic and XOR is computationally simple to do in one direction - but effectively impossible to reverse. The information is, by design, lost in the process of hashing - and that's a good thing!

Hashing, Password Storage, and Password Transmission

So how does hashing help with passwords? Well, an improvement over storing plaintext passwords is storing their hashes. Then, when a user enters a password, the authentication system hashes the password, and compares the hashes. When the user later tries to log in, the system takes their submitted password, calculates a new hash, and compares it to the stored hash. If the hashes match, the user is authenticated. This process ensures that even if an attacker steals the password file, they only have the hash values - not the original passwords.

You saw this in the lab, similar to the screenshots below; your instructor enabled hashing on the message board, and then the password file contained the MD5 hash of your password, instead of the plaintext value.





You might wonder why you've never had to enter a hash when you log in. This is because, typically, the password is sent from your computer (the client) to the server, and the hashing is done on the server after it receives the password. That may sound insecure at first, as the password is submitted unhashed, but this is where encryption comes in - protecting the password in transit. (Remember, hashing is NOT the same as encryption! Hashing is a one-way function, whereas encryption/decryption goes both ways.)

Rainbow Table Attacks

Unfortunately, hashing passwords isn't enough on its own. Again, suppose an attacker steals the (hashed) password file. If multiple users chose the same password, their hashes will also be identical - because a good hash function is deterministic: the same input always produces the same output. From this, an attacker would likely conclude those users have the same password.

Attackers can take this even further by leveraging data from past password leaks - like the well-known RockYou breach. They can take millions of leaked passwords, hash them, and then compare those hashes to the ones in a stolen password file. If they find a match, they've likely discovered the password. To be clear, attackers aren't reversing the math of a hash function. Instead, they're making educated guesses about which input produced the given hash output.

This process becomes even more efficient with the use of rainbow tables: precomputed lists of common passwords and their corresponding hash values. (A sample rainbow table for top 10 passwords from the RockYou breach is shown below.) So while hashing the password file adds a layer of protection, it can be defeated by rainbow table attacks - especially if users choose weak or common passwords.



Salting Passwords

The solution to rainbow table attacks is a concept known as salting. Salting involves adding a bit of random input (the salt) to your password before it's hashed. The exact implementation can vary (in the lab, the salt was appended to the end of your password), but the key is that the salt is unique for each user.

When the system hashes your password, it's actually hashing the combination of your password and your unique salt (e.g., hashing passwordQeW98%ny, where QeW98%ny is the salt). This is a critical distinction, because even if two users choose the exact same password, the final hash stored in the password file will be different for each of them.

This process renders a rainbow table useless. An attacker's pre-computed table may have an entry for the hash of password, but it is unlikely to have an entry for the hash of passwordQeW98%ny, since the salt is a randomly generated value. Some websites go even further by adding an additional site-wide salt that an attacker would also need to acquire.



Dictionary Attacks

Sadly, salting won’t completely stop an attacker. Since the salt is typically stored alongside the hash in the password file, an attacker who steals the file gets both the hash and the salt for each user. This means they can still try hashing common password guesses combined with the known salt - a method called a dictionary attack. (The name comes from attackers historically trying words from the dictionary as passwords.)

This is similar a rainbow table attack but differs in one key way: speed. Rainbow tables are precomputed, so attacking a password is as fast as looking up a hash in a giant list, an almost instant operation for a computer. A dictionary attack, however, must recalculate the hash for every guess using the correct salt, which takes much more time. The extra computation slows the attacker down significantly - and that’s exactly what we want.

You should have seen this in the lab - running dcrack on a password and salt took much longer than running dcrackT on a password hash.

Brute Force Attacks

By now, it should be clear that choosing common, easily guessed passwords is a bad idea. But does picking a unique, never-before-used password guarantee safety? Unfortunately, no. An attacker with enough time and computational power could, in theory, try every possible string - "A", "AA", "AAA", and so on - until they find a match. This exhaustive process is called a brute force attack.

Even if passwords are salted and hashed, trying every possible combination will eventually uncover the correct password - given enough time. This is why strong, long passwords are so important: they exponentially increase the time it takes to brute force a password. For example, a 14-character password using a mix of upper and lower case letters, numbers, and special characters could take millions of years to crack with today's technology.

To make dictionary and brute force attacks even more difficult, websites can also use a technique called password stretching. The idea is to intentionally make the hashing process slower: instead of hashing your password once, the system might hash it thousands or even millions of times. This doesn't affect the user much, as they’re entering a single password, but it dramatically slows down an attacker who is trying to test billions of passwords per second. Additionally, some hashing algorithms - like PBKDF2, bcrypt, and scrypt - are even designed to run slowly to slow down password attacks.

Another layer of defense is password expiration. Many systems require users to change their passwords periodically. This means that even if an attacker eventually cracks your password, the stolen credentials may no longer be valid by the time they try to use them.

Password Security - A Recap

Authenticator Threats: NIST SP 800-63-3, Digital Identity Guidelines, provides a comprehensive list of authenticator threats, which include:

Attacks and Authentication Threats

Much of what we’ve discussed so far involves offline attacks - so named because attackers work with stolen password hashes without interacting with the live system. For example, once a password file is compromised, attackers can try millions of password guesses on their own hardware until they find a match. If they succeed, the login attempt appears completely normal to the system-it looks like the legitimate user simply entered the correct password.

But not all threats are offline. Online attacks involve the attacker actively attempting to log in to a system, often by trying common or brute-forced passwords. These attacks are more detectable because they interact directly with the server, generating suspicious login traffic. System administrators can defend against them using mechanisms like login throttling (time delays and/or limits on the number of login attempts), account lockouts after too many failed attempts, or CAPTCHAs to slow automated guesses. For a more comprehensive list of authenticator threats (courtesy of NIST), see the call-out box.

Client and Server Responsibilities

At the end of the day, password security is a shared responsibility between users and systems. Each side has a role to play in minimizing risk.

The server’s job is to:

The user’s job is to:

If both parties do their part, the system is far more resilient to both offline and online attacks.

Multi-Factor Authentication

Even with strong passwords and proper hashing, there are limits to how much security passwords alone can provide. Fatigue from complex password requirements can lead users to cut corners-and even strong passwords can be phished, guessed, or stolen.

The next step is to utilize multi-factor authentication (MFA), also referred to as two-factor authentication (2FA). With MFA schemes, authentication requires two (or more) factors of authentication, which are broadly categorized as something you know, something you have, and something you are (or inherence). Something you know might be a password or the pin on your Common Access Card (CAC); something you have might be the CAC itself; while something you are includes biometrics, such as your photo, fingerprint, or voice signature. The idea is simple: while one of these factors - particularly your password - might be stolen, guessed, or cracked, it should be harder for an attacker to obtain multiple factors. Take a look at the call-out box for a more in-depth run-down of various authenticator types, as determined by NIST.

Authenticator Types: These are the available authenticator types required for Authenticator Assurance Level 1 controls by the NIST SP 800-63-3.

Passwordless Authentication

If you’re still concerned about password security...good! You’re not alone. Some organizations are moving away from passwords altogether. A common implementation of this passwordless authentication is something known as passkeys.

Passkeys rely upon public-key infrastructure (PKI) - the same general concept underpinning asymmetric encryption. When you register on a site using passkeys, your device creates a key pair: a public key, stored on the server, and a private key, securely stored on your device. When you log in, you authenticate locally on a device - such as using a fingerprint or PIN to unlock your phone - and the device performs a cryptographic operation that proves it holds the private key, without ever sending it over the internet.

Passkeys offer several improvements over traditional passwords. They’re more resilient to phishing (you're unlikely to enter your private key into a phishing site), they don’t rely on human memory, and they can’t be easily extracted through malware like keyloggers - malicious software that records what you type. Even if a website follows all best practices, a keylogger on your device could still steal your password. With passkeys, there's nothing typed to steal. In October 2023, Google made passkeys the default login option for personal Google Accounts. For a visual representation of of passkeys, take a look at the video below.



Hashing Collisions Explored

Let’s return to the concept of a collision, where two different inputs (passwords, files, etc.) produce the exact same hash output. Why does this occur? Consider that the set of all possible inputs (passwords, files, etc.) is effectively infinite. However, the set of all possible hash outputs is finite. For example, the SHA-256 algorithm produces a 256-bit hash, which means there are 2256 possible outputs. While this is an astronomically large number, it is not infinite. Thus, collisions are unavoidable: an infinite set of inputs cannot be mapped to a finite set of outputs without repetitions.

Newer, more secure hashing algorithms make it harder to generate a collision, and all things equal, an algorithm with a longer hash value reduces the likelihood that two inputs result in a collision. But make no mistake - modern hashing algorithms do not completely solve the collision issue.

To see a collision in action, download the two pictures below, then use this website to take their MD5 hashes. Despite clearly being two different pictures, the hashes matched - a collision!





So why should you care? Generally speaking, collisions are less of a concern for password security because the input space for passwords is relatively small (most passwords are short, even with systems requiring longer ones). Where collisions are more of a concern is file integrity, which we’ll discuss below.

File Integrity and Hashing

That was a lot about hashing, password security, and authentication - and believe it or not, that’s not where the magic of hashing ends! Because a hash function is deterministic - the same input always produces the same output - we can use it as a "digital fingerprint" for a file. This is the concept of file integrity.

If you have a file and its known, trusted hash, you can re-calculate the hash at any point in the future. If the new hash matches the original, you can be confident that the file has not been tampered with. If the hashes do not match, you know the file's integrity has been compromised.

This is where collisions become a major concern. An attacker could try to create a malicious file that has the same hash as a legitimate one. If they succeeded, you would not be able to tell the difference just by checking the hash values. This is incredibly difficult, but not impossible, and is a concept that has been exploited in the real world. For example, the Flame malware, discovered on Iranian networks in 2012, used this same idea as part of its sophisticated attack. We will explore this concept in more detail in our next lecture on Digital Signatures and Certificates.

+ Analogy - A Rubik's Cube Hashing Algorithm