Database Normalization is the processes of minimizing data redundancy within the columns and tables of our relational database. We will work to solve this problem by decomposing our tables into smaller tables, removing redundant information, while ensuring that we do not experience any information loss. Additionally we will work to ensure that all dependencies make logical sense. These efforts result in more consistent data, relationships between data become more clear, and data can be added in an easier fashion.
Why do we want to normalize our data? Decisions made when designing the tables of a database, either from scratch by starting with the entity-relational model, or from existing tables in an older/different storage system, may introduce inefficiencies. When working with small data sets, it may not be obvious that redundant data is costly. Imagine tables with a million rows, a ten character column of redundant information suddenly becomes noticeably wasteful. Now imagine, like good database maintainers, we also create a daily backup of our database, along with weekly backups. Thus, we keep multiple copies of our database to allow us to revert our database to a specific day within the past year if necessary. What a waste! Redundant information in our databases is typically unnecessary and ALWAYS costly! As we begin pursuing efficient databases, let's start with a few assumptions:
Consider the following table: is there redundant data?
| Alpha | AcYr | Sem | Course | Title | CourseGrade |
|---|---|---|---|---|---|
| 251001 | 2022 | SPRING | NL110 | The New Leadership Course | F |
| 251002 | 2022 | SPRING | NL110 | The New Leadership Course | A |
| 251234 | 2022 | SPRING | HE111 | Rhetoric and Intro to Literature | B |
| 251234 | 2022 | SPRING | HH104 | American Naval History | A |
| 251234 | 2022 | SPRING | NL110 | The New Leadership Course | A |
| 251234 | 2022 | SPRING | NS101 | Fundamentals of Seamanship | B |
| 251234 | 2022 | SPRING | SC111 | Foundations of Chemistry I | C |
Do we really need to have columns repeating 2022 and SPRING for every course? Do we need both Course (NL110) and Title (The New Leadership Course) repeated for each grade entry? Isn't this repeated information better listed once in a separate smaller table describing only the courses? Are there other unintended consequences, or anomalies, that get introduced by trying to store too much information in a single table? Yes, yes... YES!!! Let's explore some of these anomalies and then tackle how to correct them.
A deletion anomaly occurs if you lose more information then intended when you delete some row(s) from a table. In our Grades table above, if we deleted all of the rows for NL110 we would also lose the title of the course as this information is not stored anywhere else in our database. What if a professor just decided to curve his course and re-enter grades? We would still want the course and title information to exist somewhere.
An insertion anomaly occurs if you would have to add more information than desired to add a specific piece of new data. If, after deleting the instances of NL110 from our table, we then renamed NL110 "Introduction to Leadership" we would have to add a grade to the database to simply record this course change somewhere.
Update anomalies result from an update to a row (or rows) that cause the information to be inconsistent across the database. Let's say we changed the name of NL110 and changed this for student 251001 only... we would have introduced an inconsistency in our table as depicted below. It is frustrating and inefficient to have to update this type of information multiple times over a table.
| Alpha | AcYr | Sem | Course | Title | CourseGrade |
|---|---|---|---|---|---|
| 251001 | 2022 | SPRING | NL110 | Introduction to Leadership | F |
| 251002 | 2022 | SPRING | NL110 | The New Leadership Course | A |
| 251234 | 2022 | SPRING | HE111 | Rhetoric and Intro to Literature | B |
| 251234 | 2022 | SPRING | HH104 | American Naval History | A |
| 251234 | 2022 | SPRING | NL110 | The New Leadership Course | A |
| 251234 | 2022 | SPRING | NS101 | Fundamentals of Seamanship | B |
| 251234 | 2022 | SPRING | SC111 | Foundations of Chemistry I | C |
If we were to pull the course titles out of the Grades table and place the information in its own table (like below) would that correct the anomalies mentioned above?
| Alpha | AcYr | Sem | Course | CourseGrade |
|---|---|---|---|---|
| 251001 | 2022 | SPRING | NL110 | F |
| 251002 | 2022 | SPRING | NL110 | A |
| 251234 | 2022 | SPRING | HE111 | B |
| 251234 | 2022 | SPRING | HH104 | A |
| 251234 | 2022 | SPRING | NL110 | A |
| 251234 | 2022 | SPRING | NS101 | B |
| 251234 | 2022 | SPRING | SC111 | C |
| Course | Title |
|---|---|
| NL110 | The New Leadership Course |
| HE111 | Rhetoric and Intro to Literature |
| HH104 | American Naval History |
| NS101 | Fundamentals of Seamanship |
| SC111 | Foundations of Chemistry I |
A functional dependency (FD) is a type of integrity constraint. If X and Y are sets of attributes (columns) of a relation (table), we say that: $$X \rightarrow Y \text{(X determines Y), if and only if, for each value of the column(s) in X there exists only one possible value for column(s) in Y}$$ X will determine Y if a specific value for X always results in the same value for Y. This means that X determines Y, or Y functionally depends on X. Examples: $$\text{Any primary key determines all other columns} $$ $$Alpha \rightarrow (Name, Class, DateOfBirth, \text{etc})$$ $$(Rank, YearsOfService) \rightarrow Wage $$ $$(NbHours, HourlyPrice) \rightarrow Charge $$ As seen the the examples above, functional dependencies may be based upon equations, but they are not equations. A composite determinant consists of more than one column. For example, NbHours and HourlyPrice together determine the charge.
The following three properties of functional dependencies are called Armstrong's Axioms: $$\text{Let A, B, C be sets of columns} $$ $$A \rightarrow A, \ \text{always, (reflexivity)}$$ $$If \ A \rightarrow B, \ \text{and} \ B \rightarrow C, \ \text{then} \ A \rightarrow C \ (\text{transitivity})$$ $$If \ A \rightarrow B, \text{and} \ C \ \text{is a set of columns}, \ \text{then} \ (A,C) \rightarrow (B,C) \ (\text{augmentation}) $$ The following two rules are also true: $$If \ A \rightarrow B \ \text{and} \ A \rightarrow C, \ \text{then} \ A \rightarrow (B,C) \ (\text{union})$$ $$If A \rightarrow (B,C) \ \text{then} A \rightarrow B \ \text{and}\ A \rightarrow C \ (\text{decomposition})$$ Functional dependency is a statement about all allowable instances of a table. You cannot find the functional dependencies simply by looking at a portion of the data due to:
| Alpha | LastName | FirstName | Major | Advisor |
|---|---|---|---|---|
| 241342 | Thomas | Sarah | SCS | Lewis |
| 242368 | Smith | John | SIT | Jones |
| 246644 | Mikalson | Michael | ECE | Skapanski |
| 247862 | Doe | Jane | ECE | Skapanski |
| 253116 | Doe | Bob | SDS | Lefferton |
| 250908 | Johnson | John | SDS | Brown |
| 251198 | Thomas | Thomas | SCS | Lewis |
| 259722 | Jefferson | Janet | SDS | Lefferton |
| 259832 | Thomas | Sarah | SDS | Lefferton |
A determinate in a table is any column that you can use to determine the values assigned to the other columns in the same row. A determinate is unique in a table, if and only if, it determines every other column in the table. Unique determinates are superkeys. A superkey exists if there are no two rows that have the same values for the columns in the superkey. A key is a superkey such that no proper subset of it is a superkey (so a key is a minimal set of columns that are unique).
A set of columns is a superkey for a table if:
Relations (tables) are categorized as belonging to a normal form based on what modification anomalies they are subject to.
A relation (table) is in First Normal Form (1NF) if every column contains only single values (so no lists). This requirement was already included in our definition of a relation (in the relational model sense), so all relations are in 1NF. Below are all the rules a table in 1NF needs to satisfy:
A table is in Second Normal Form (2NF) if:
A table is in Third Normal Form (3NF) if:
Here is another way to define if a table R is in 3NF: A table R is in 3NF if, for every functional dependency X → A that holds true for R, then one of the following statements is true:
If a table is in 2NF but not 3NF, meaning it has some transitive dependencies, then split them out into a new table, where the determinant becomes a primary key in the new table and a foreign key in the old table. For example: $$ R(\underline{A},\underline{B},C,D) $$ $$ (A,B) \rightarrow (C,D) $$ $$ C \rightarrow D $$ Becomes: $$R(\underline{A},\underline{B},C)$$ $$R2(\underline{C},D)$$
A table is in Boyce-Codd Normal Form (BCNF) if:
| Alpha | Assignment | Points | MaxPoints |
|---|---|---|---|
| 259722 | Quiz1 | 10 | 10 |
| 259722 | Quiz2 | 2.5 | 10 |
| 259722 | Quiz3 | 2 | 20 |
| 259422 | Quiz1 | 6 | 10 |
| 259422 | Quiz2 | 7 | 10 |
| 259422 | Quiz3 | 18 | 20 |
| 259422 | Quiz3 | 18 | 20 |
| 259936 | Quiz1 | 6 | 10 |
| 259936 | Quiz2 | 8 | 10 |
| 259936 | Quiz3 | 20 | 20 |
| Alpha | Assignment | Points |
|---|---|---|
| 259722 | Quiz1 | 10 |
| 259722 | Quiz2 | 2.5 |
| 259722 | Quiz3 | 2 |
| 259422 | Quiz1 | 6 |
| 259422 | Quiz2 | 7 |
| 259422 | Quiz3 | 18 |
| 259422 | Quiz3 | 18 |
| 259936 | Quiz1 | 6 |
| 259936 | Quiz2 | 8 |
| 259936 | Quiz3 | 20 |
| Assignment | MAxPoints |
|---|---|
| Quiz1 | 10 |
| Quiz2 | 10 |
| Quiz3 | 20 |
The following steps are used to normalize tables:
Assume the data currently in the table is representative for all
possible data.
Assume the data currently in the table is representative for all possible data. Use the process above to place EQUIPMENT_REPAIR (ItemNumber, EquipmentType, AcquisitionCost, RepairNumber, RepairDate, RepairCost) into BCNF.