SD321 (Fall 2024)

Normalization

Database Normalization is the processes of minimizing data redundancy within the columns and tables of our relational database. We will work to solve this problem by decomposing our tables into smaller tables, removing redundant information, while ensuring that we do not experience any information loss. Additionally we will work to ensure that all dependencies make logical sense. These efforts result in more consistent data, relationships between data become more clear, and data can be added in an easier fashion.

Modification Anomalies - Our Motivation

Why do we want to normalize our data? Decisions made when designing the tables of a database, either from scratch by starting with the entity-relational model, or from existing tables in an older/different storage system, may introduce inefficiencies. When working with small data sets, it may not be obvious that redundant data is costly. Imagine tables with a million rows, a ten character column of redundant information suddenly becomes noticeably wasteful. Now imagine, like good database maintainers, we also create a daily backup of our database, along with weekly backups. Thus, we keep multiple copies of our database to allow us to revert our database to a specific day within the past year if necessary. What a waste! Redundant information in our databases is typically unnecessary and ALWAYS costly! As we begin pursuing efficient databases, let's start with a few assumptions:

We have one or more tables with data
The data are stored within a database

The question that we have to pose at each step, should we keep this structure or can it be further optimized?

Consider the following table: is there redundant data?

Alpha	AcYr	Sem	Course	Title	CourseGrade
251001	2022	SPRING	NL110	The New Leadership Course	F
251002	2022	SPRING	NL110	The New Leadership Course	A
251234	2022	SPRING	HE111	Rhetoric and Intro to Literature	B
251234	2022	SPRING	HH104	American Naval History	A
251234	2022	SPRING	NL110	The New Leadership Course	A
251234	2022	SPRING	NS101	Fundamentals of Seamanship	B
251234	2022	SPRING	SC111	Foundations of Chemistry I	C

Do we really need to have columns repeating 2022 and SPRING for every course? Do we need both Course (NL110) and Title (The New Leadership Course) repeated for each grade entry? Isn't this repeated information better listed once in a separate smaller table describing only the courses? Are there other unintended consequences, or anomalies, that get introduced by trying to store too much information in a single table? Yes, yes... YES!!! Let's explore some of these anomalies and then tackle how to correct them.

Deletion Anomaly

A deletion anomaly occurs if you lose more information then intended when you delete some row(s) from a table. In our Grades table above, if we deleted all of the rows for NL110 we would also lose the title of the course as this information is not stored anywhere else in our database. What if a professor just decided to curve his course and re-enter grades? We would still want the course and title information to exist somewhere.

Insertion Anomaly

An insertion anomaly occurs if you would have to add more information than desired to add a specific piece of new data. If, after deleting the instances of NL110 from our table, we then renamed NL110 "Introduction to Leadership" we would have to add a grade to the database to simply record this course change somewhere.

Update Anomaly

Update anomalies result from an update to a row (or rows) that cause the information to be inconsistent across the database. Let's say we changed the name of NL110 and changed this for student 251001 only... we would have introduced an inconsistency in our table as depicted below. It is frustrating and inefficient to have to update this type of information multiple times over a table.

Alpha	AcYr	Sem	Course	Title	CourseGrade
251001	2022	SPRING	NL110	Introduction to Leadership	F
251002	2022	SPRING	NL110	The New Leadership Course	A
251234	2022	SPRING	HE111	Rhetoric and Intro to Literature	B
251234	2022	SPRING	HH104	American Naval History	A
251234	2022	SPRING	NL110	The New Leadership Course	A
251234	2022	SPRING	NS101	Fundamentals of Seamanship	B
251234	2022	SPRING	SC111	Foundations of Chemistry I	C

After this update, what is the name of the course with course number NL110? Is it "Introduction to Leadership" or "The New Leadership Course"?

Initial Table Decomposition

If we were to pull the course titles out of the Grades table and place the information in its own table (like below) would that correct the anomalies mentioned above?

Grades

Alpha	AcYr	Sem	Course	CourseGrade
251001	2022	SPRING	NL110	F
251002	2022	SPRING	NL110	A
251234	2022	SPRING	HE111	B
251234	2022	SPRING	HH104	A
251234	2022	SPRING	NL110	A
251234	2022	SPRING	NS101	B
251234	2022	SPRING	SC111	C

Courses

Course	Title
NL110	The New Leadership Course
HE111	Rhetoric and Intro to Literature
HH104	American Naval History
NS101	Fundamentals of Seamanship
SC111	Foundations of Chemistry I

This seems like an obvious improvement. But as our databases get larger and more complex, we need a repeatable and proven way to determine whether or not we should change how we are storing information in our database. We have two options for improvements: decompose (break apart into multiple tables) or merge (combine into a single table) data.

Functional Dependencies

A functional dependency (FD) is a type of integrity constraint. If X and Y are sets of attributes (columns) of a relation (table), we say that: $$X \rightarrow Y \text{(X determines Y), if and only if, for each value of the column(s) in X there exists only one possible value for column(s) in Y}$$ X will determine Y if a specific value for X always results in the same value for Y. This means that X determines Y, or Y functionally depends on X. Examples: $$\text{Any primary key determines all other columns} $$ $$Alpha \rightarrow (Name, Class, DateOfBirth, \text{etc})$$ $$(Rank, YearsOfService) \rightarrow Wage $$ $$(NbHours, HourlyPrice) \rightarrow Charge $$ As seen the the examples above, functional dependencies may be based upon equations, but they are not equations. A composite determinant consists of more than one column. For example, NbHours and HourlyPrice together determine the charge.

Functional Dependency Rules

The following three properties of functional dependencies are called Armstrong's Axioms: $$\text{Let A, B, C be sets of columns} $$ $$A \rightarrow A, \ \text{always, (reflexivity)}$$ $$If \ A \rightarrow B, \ \text{and} \ B \rightarrow C, \ \text{then} \ A \rightarrow C \ (\text{transitivity})$$ $$If \ A \rightarrow B, \text{and} \ C \ \text{is a set of columns}, \ \text{then} \ (A,C) \rightarrow (B,C) \ (\text{augmentation}) $$ The following two rules are also true: $$If \ A \rightarrow B \ \text{and} \ A \rightarrow C, \ \text{then} \ A \rightarrow (B,C) \ (\text{union})$$ $$If A \rightarrow (B,C) \ \text{then} A \rightarrow B \ \text{and}\ A \rightarrow C \ (\text{decomposition})$$ Functional dependency is a statement about all allowable instances of a table. You cannot find the functional dependencies simply by looking at a portion of the data due to:

Data set limitations
Must logically be a determinant

Given some data in a table R, we can check to see if it violates some functional dependency, but we cannot tell if the functional dependency holds over any possible instance of R.

Alpha	LastName	FirstName	Major	Advisor
241342	Thomas	Sarah	SCS	Lewis
242368	Smith	John	SIT	Jones
246644	Mikalson	Michael	ECE	Skapanski
247862	Doe	Jane	ECE	Skapanski
253116	Doe	Bob	SDS	Lefferton
250908	Johnson	John	SDS	Brown
251198	Thomas	Thomas	SCS	Lewis
259722	Jefferson	Janet	SDS	Lefferton
259832	Thomas	Sarah	SDS	Lefferton

As an example, determine the possible functional dependencies (FDs) in the table to the left. Click the solution icon at the top to see the recommended answers. Note that we CANNOT determine FDs from data alone, as existing data might not be representative for all possible data.

A determinate in a table is any column that you can use to determine the values assigned to the other columns in the same row. A determinate is unique in a table, if and only if, it determines every other column in the table. Unique determinates are superkeys. A superkey exists if there are no two rows that have the same values for the columns in the superkey. A key is a superkey such that no proper subset of it is a superkey (so a key is a minimal set of columns that are unique).

A set of columns is a superkey for a table if:

No two distinct rows can have the same values in all key columns, or equivalently
It determines all of the other columns in a table

Normal Forms

Relations (tables) are categorized as belonging to a normal form based on what modification anomalies they are subject to.

1NF - First Normal Form

A relation (table) is in First Normal Form (1NF) if every column contains only single values (so no lists). This requirement was already included in our definition of a relation (in the relational model sense), so all relations are in 1NF. Below are all the rules a table in 1NF needs to satisfy:

Each column contains data about one attribute
All entries in a column are of the same kind (ex. Alpha (INT), Name (VARCHAR), Date (DATE))
Each column has a unique name
Cells contain only one value
The order of the columns is unimportant
The order of the rows in unimportant
Each row is unique

2NF - Second Normal Form

A table is in Second Normal Form (2NF) if:

The table is in 1NF, and
All non-key columns are determined only by the ENTIRE key
There are no partial (functional) dependencies

3NF - Third Normal Form

A table is in Third Normal Form (3NF) if:

The table is in 2NF, and
There are no transitive dependencies

$$A \rightarrow B \rightarrow C$$

Here is another way to define if a table R is in 3NF: A table R is in 3NF if, for every functional dependency X → A that holds true for R, then one of the following statements is true:

A ∈ X (it is a trivial FD)
X is a superkey or
A is part of some key for R

If a table is in 2NF but not 3NF, meaning it has some transitive dependencies, then split them out into a new table, where the determinant becomes a primary key in the new table and a foreign key in the old table. For example: $$ R(\underline{A},\underline{B},C,D) $$ $$ (A,B) \rightarrow (C,D) $$ $$ C \rightarrow D $$ Becomes: $$R(\underline{A},\underline{B},C)$$ $$R2(\underline{C},D)$$

BCNF - Boyce-Codd Normal Form

A table is in Boyce-Codd Normal Form (BCNF) if:

The table is in 3NF, and
Every determinant is a superkey

Example

Assume we have the table AssignmentGrades as show to the right. This does not meet the 2NF criterial where every column depends on the entire key, MaxPoints depends only on the assignment.

Alpha	Assignment	Points	MaxPoints
259722	Quiz1	10	10
259722	Quiz2	2.5	10
259722	Quiz3	2	20
259422	Quiz1	6	10
259422	Quiz2	7	10
259422	Quiz3	18	20
259422	Quiz3	18	20
259936	Quiz1	6	10
259936	Quiz2	8	10
259936	Quiz3	20	20

Alpha	Assignment	Points
259722	Quiz1	10
259722	Quiz2	2.5
259722	Quiz3	2
259422	Quiz1	6
259422	Quiz2	7
259422	Quiz3	18
259422	Quiz3	18
259936	Quiz1	6
259936	Quiz2	8
259936	Quiz3	20

Assignment	MAxPoints
Quiz1	10
Quiz2	10
Quiz3	20

Process for Putting a Table into BCNF

The following steps are used to normalize tables:

Identify every functional dependency
Identify every candidate key
If there is a functional dependency that has a determinant that is not a candidate key:

Move the columns of that functional dependency into a new table
Make the determinant of that functional dependency the primary key of the new table
Leave a copy of the determinant as a foreign key in the original table
Create a referential integrity constraint between the original table and the new table

Repeat step 3 until every determinant of every table is a candidate key

In-Class Problems

Problem 1

Assume the data currently in the table is representative for all possible data.

What are the functional dependencies in the table?
Give an example of update anomaly in the table.
Give an example of insert anomaly in the table.
Give an example of delete anomaly in the table.
Use the process above to place SKU_DATA(SKU, SKU_Description, Department, Buyer) into BCNF.

Problem 2

Assume the data currently in the table is representative for all possible data. Use the process above to place EQUIPMENT_REPAIR (ItemNumber, EquipmentType, AcquisitionCost, RepairNumber, RepairDate, RepairCost) into BCNF.