Normalization

Database Normalization is the processes of minimizing data redundancy within the columns and tables of our relational database. We will work to solve this problem by decomposing our tables into smaller tables, removing redundant information, while ensuring that we do not experience any information loss. Additionally we will work to ensure that all dependencies make logical sense. These efforts result in more consistent data, relationships between data become more clear, and data can be added in an easier fashion.

Modification Anomalies - Our Motivation

Why do we want to normalize our data? Decisions made when designing the tables of a database, either from scratch by starting with the entity-relational model, or from existing tables in an older/different storage system, may introduce inefficiencies. When working with small data sets, it may not be obvious that redundant data is costly. Imagine tables with a million rows, a ten character column of redundant information suddenly becomes noticeably wasteful. Now imagine, like good database maintainers, we also create a daily backup of our database, along with weekly backups. Thus, we keep multiple copies of our database to allow us to revert our database to a specific day within the past year if necessary. What a waste! Redundant information in our databases is typically unnecessary and ALWAYS costly! As we begin pursuing efficient databases, let's start with a few assumptions:

The question that we have to pose at each step, should we keep this structure or can it be further optimized?

Consider the following table: is there redundant data?

AlphaAcYrSemCourseTitleCourseGrade
2510012022SPRINGNL110The New Leadership CourseF
2510022022SPRINGNL110The New Leadership CourseA
2512342022SPRINGHE111Rhetoric and Intro to LiteratureB
2512342022SPRINGHH104American Naval HistoryA
2512342022SPRINGNL110The New Leadership CourseA
2512342022SPRINGNS101Fundamentals of SeamanshipB
2512342022SPRINGSC111Foundations of Chemistry IC

Do we really need to have columns repeating 2022 and SPRING for every course? Do we need both Course (NL110) and Title (The New Leadership Course) repeated for each grade entry? Isn't this repeated information better listed once in a separate smaller table describing only the courses? Are there other unintended consequences, or anomalies, that get introduced by trying to store too much information in a single table? Yes, yes... YES!!! Let's explore some of these anomalies and then tackle how to correct them.

Deletion Anomaly

A deletion anomaly occurs if you lose more information then intended when you delete some row(s) from a table. In our Grades table above, if we deleted all of the rows for NL110 we would also lose the title of the course as this information is not stored anywhere else in our database. What if a professor just decided to curve his course and re-enter grades? We would still want the course and title information to exist somewhere.

Insertion Anomaly

An insertion anomaly occurs if you would have to add more information than desired to add a specific piece of new data. If, after deleting the instances of NL110 from our table, we then renamed NL110 "Introduction to Leadership" we would have to add a grade to the database to simply record this course change somewhere.

Update Anomaly

Update anomalies result from an update to a row (or rows) that cause the information to be inconsistent across the database. Let's say we changed the name of NL110 and changed this for student 251001 only... we would have introduced an inconsistency in our table as depicted below. It is frustrating and inefficient to have to update this type of information multiple times over a table.

AlphaAcYrSemCourseTitleCourseGrade
2510012022SPRINGNL110Introduction to LeadershipF
2510022022SPRINGNL110The New Leadership CourseA
2512342022SPRINGHE111Rhetoric and Intro to LiteratureB
2512342022SPRINGHH104American Naval HistoryA
2512342022SPRINGNL110The New Leadership CourseA
2512342022SPRINGNS101Fundamentals of SeamanshipB
2512342022SPRINGSC111Foundations of Chemistry IC
After this update, what is the name of the course with course number NL110? Is it "Introduction to Leadership" or "The New Leadership Course"?

Initial Table Decomposition

If we were to pull the course titles out of the Grades table and place the information in its own table (like below) would that correct the anomalies mentioned above?

Grades
AlphaAcYrSemCourseCourseGrade
2510012022SPRINGNL110F
2510022022SPRINGNL110A
2512342022SPRINGHE111B
2512342022SPRINGHH104A
2512342022SPRINGNL110A
2512342022SPRINGNS101B
2512342022SPRINGSC111C
Courses
CourseTitle
NL110The New Leadership Course
HE111Rhetoric and Intro to Literature
HH104American Naval History
NS101Fundamentals of Seamanship
SC111Foundations of Chemistry I

This seems like an obvious improvement. But as our databases get larger and more complex, we need a repeatable and proven way to determine whether or not we should change how we are storing information in our database. We have two options for improvements: decompose (break apart into multiple tables) or merge (combine into a single table) data.

Functional Dependencies

A functional dependency (FD) is a type of integrity constraint. If X and Y are sets of attributes (columns) of a relation (table), we say that: $$X \rightarrow Y \text{(X determines Y), if and only if, for each value of the column(s) in X there exists only one possible value for column(s) in Y}$$ X will determine Y if a specific value for X always results in the same value for Y. This means that X determines Y, or Y functionally depends on X. Examples: $$\text{Any primary key determines all other columns} $$ $$Alpha \rightarrow (Name, Class, DateOfBirth, \text{etc})$$ $$(Rank, YearsOfService) \rightarrow Wage $$ $$(NbHours, HourlyPrice) \rightarrow Charge $$ As seen the the examples above, functional dependencies may be based upon equations, but they are not equations. A composite determinant consists of more than one column. For example, NbHours and HourlyPrice together determine the charge.

Functional Dependency Rules

The following three properties of functional dependencies are called Armstrong's Axioms: $$\text{Let A, B, C be sets of columns} $$ $$A \rightarrow A, \ \text{always, (reflexivity)}$$ $$If \ A \rightarrow B, \ \text{and} \ B \rightarrow C, \ \text{then} \ A \rightarrow C \ (\text{transitivity})$$ $$If \ A \rightarrow B, \text{and} \ C \ \text{is a set of columns}, \ \text{then} \ (A,C) \rightarrow (B,C) \ (\text{augmentation}) $$ The following two rules are also true: $$If \ A \rightarrow B \ \text{and} \ A \rightarrow C, \ \text{then} \ A \rightarrow (B,C) \ (\text{union})$$ $$If A \rightarrow (B,C) \ \text{then} A \rightarrow B \ \text{and}\ A \rightarrow C \ (\text{decomposition})$$ Functional dependency is a statement about all allowable instances of a table. You cannot find the functional dependencies simply by looking at a portion of the data due to:

Given some data in a table R, we can check to see if it violates some functional dependency, but we cannot tell if the functional dependency holds over any possible instance of R.

AlphaLastNameFirstNameMajorAdvisor
241342ThomasSarahSCSLewis
242368SmithJohnSITJones
246644MikalsonMichaelECESkapanski
247862DoeJaneECESkapanski
253116DoeBobSDSLefferton
250908JohnsonJohnSDSBrown
251198ThomasThomasSCSLewis
259722JeffersonJanetSDSLefferton
259832ThomasSarahSDSLefferton
As an example, determine the possible functional dependencies (FDs) in the table to the left. Click the solution icon at the top to see the recommended answers. Note that we CANNOT determine FDs from data alone, as existing data might not be representative for all possible data.

A determinate in a table is any column that you can use to determine the values assigned to the other columns in the same row. A determinate is unique in a table, if and only if, it determines every other column in the table. Unique determinates are superkeys. A superkey exists if there are no two rows that have the same values for the columns in the superkey. A key is a superkey such that no proper subset of it is a superkey (so a key is a minimal set of columns that are unique).

A set of columns is a superkey for a table if:

  1. No two distinct rows can have the same values in all key columns, or equivalently
  2. It determines all of the other columns in a table

Normal Forms

Relations (tables) are categorized as belonging to a normal form based on what modification anomalies they are subject to.

1NF - First Normal Form

A relation (table) is in First Normal Form (1NF) if every column contains only single values (so no lists). This requirement was already included in our definition of a relation (in the relational model sense), so all relations are in 1NF. Below are all the rules a table in 1NF needs to satisfy:

2NF - Second Normal Form

A table is in Second Normal Form (2NF) if:

3NF - Third Normal Form

A table is in Third Normal Form (3NF) if:

$$A \rightarrow B \rightarrow C$$

Here is another way to define if a table R is in 3NF: A table R is in 3NF if, for every functional dependency X → A that holds true for R, then one of the following statements is true:

If a table is in 2NF but not 3NF, meaning it has some transitive dependencies, then split them out into a new table, where the determinant becomes a primary key in the new table and a foreign key in the old table. For example: $$ R(\underline{A},\underline{B},C,D) $$ $$ (A,B) \rightarrow (C,D) $$ $$ C \rightarrow D $$ Becomes: $$R(\underline{A},\underline{B},C)$$ $$R2(\underline{C},D)$$

BCNF - Boyce-Codd Normal Form

A table is in Boyce-Codd Normal Form (BCNF) if: