Data and Data Storage

Data scientists need to deal with a lot of data. This course is an introduction to data astorage methods and systems. In particular, we will discuss the use of relational databases, including data modeling, database interaction, and interfacing with a database from a Python or R script. You will quickly realize that a good design will help your projects succeed, and this course aims to help you determine what good database design means.

The course will introduce working with databases via MySQL and quickly move into data modeling so we can build our own database designs and schemas.

Why are databases important?

If you consider the amount of information available, from everything that is available on the Internet down to what is involved just to understand your grades here at the Academy, you can see that we will need effective ways of storing and retrieving information. We must figure out how to overcome many problems, examples include:

We will attempt to solve these problems with the use of Relational Databases.

What is a database?

We will consider a database to be a possibly very large, integrated, collection of data. The data tends to be maintained by a Database Management System (DBMS) which is a software package designed to store and manage databases. The DBMS handles many problems for you, such as Data independence and efficient access, reduced application development time, Data integrity and security, performance and scalability, and many others.

Databases are used everywhere, by corporations, universities, and the military. The size of the datasets are continuing to increase in both diversity and volume.

The Relational Model

The Relational Model is the most widely used model for storing data, with vendors including IBM, Microsoft, Oracle, Sybase, and others. While there are a few competing models such as the Object-Oriented Model, Object-Relational Model, XML, and Key-value stores, we will focus this course on the traditional database model with discussions of other options in other classes in this majot.

What is a Relation?

A relation is a two-dimensional table that can be described by a schema, where a schema describes the structure (a relation name, column names, and column types). With this in mind, you can consider a relational database to be a set of relations (or tables).

EmployeeNumberFirstNameLastNameDepartmentEmailPhone
125JaniceJohansonAccountingjj@happy.com310-1234
155MarySmithFinancems@happy.com444-1234
178ShannonMcHansonFinancesm@happy.com444-5678
189BobJordanAccountingjb@happy.com310-5555
193BobHoyleSalesbh@happy.com310-6666
197JanetteYoungLegaljy@happy.com410-1234
204ToddMcquireLegaltm@happy.com410-4444
205ToddSmithLegalts@happy.com410-4444

Knowledge Check: Let's try to describe this table as simply as possible, say the table name is EMPLOYEE, what other information would we want to include in our schema to allow someone else to understand the structure of the table?

Requirements for a Relation

A table has a few requirements for it to be considered a relation.

Invalid Relations

Knowledge Check: With these requirements in mind, why are the following tables Not relations?

EmployeeNumberFirstNameLastNameDepartmentEmailPhone
125JaniceJohansonAccountingjj@happy.com310-1234
189BobJordanAccountingjb@happy.com310-5555,
310-0000,
310-0001
204ToddMcquireLegaltm@happy.com410-4444,
410-0010
EmployeeNumberFirstNameLastNameDepartmentEmailPhone
125JaniceJohansonAccountingjj@happy.com310-1234
189BobJordanAccountingjb@happy.com310-5555
Fax:310-0000
Home:310-0001
204ToddMcquireLegaltm@happy.com410-4444

Basic Terminology

Although not all tables are relations, the terms table and relation are normally used interchangeably:

TableColumnRow
RelationAttributeTuple
FileFieldRecord

MySQL

In this course we will be using the community edition of the popular open source relational database system MySQL. MySQL is the most commonly used open source database system, and the second most used of all database systems. It scales well and is suitable for nearly all database projects.

Connecting to MySQL

For the purposes of this class you will be using the database server db.cs.usna.edu for all of your homework, assignments, projects, and demos. To connect to the database we can use the mysql client in the terminal or the MySQL Shell extension in VSCode. Below are the details:

Connecting to MySQL from the command line

Bring up a terminal on your linux workstation and type:

mysql -h db.cs.usna.edu -u m26xxxx -p m26xxxx -vvv
where m26xxxx is your standard logon. The server will ask for a password, your password is just your logon, m26xxxx and not your academy password. If you change your password, Store/remember your PW somewhere so you do not forget it!

The generic version of the command is

mysql -h <server_name> -u <user_name> -p <database_schema> -vvv
where the values between < > are replaced by the actual values. The vvv option is to show more debug information.

You should see something similar to the following when you successfully connect:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1001
Server version: 8.0.34-0ubuntu0.22.04.1 (Ubuntu)

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Changing your password

You can change your database password from the default via the command:

SET PASSWORD = 'cleartext password';
Change cleartext password to a new password, please do not use your normal academy credentials. Whatever you type between the two apostrophe marks will be your new password. I recommend you stick to letters and numbers, and don't use special characters in for this password. While it ends up being less secure, it might help you out in the future if you're including your password in scripts (Also not best practice, but let's keep it simple while we're starting out!)

Exiting the MySQL client

Now to test your password, exit the client with either the exit or quit commands. This should return to the Linux command prompt. You can now try to connect back with your new password.

Connecting to MySQL using MySQL Shell extension in VS Code

Install MySQL Shell for VS Code

Create a connection

Open connection

Structured Query Language - SQL

The Structured Query Language (SQL) is the standard language used to interact with Relational Databases. The versions of the language used by various RDBMS vendors are nearly identical, with only a few modifications to support advanced features that may not appear in every DBMS. SQL was developed by IBM in the 1970's and later endorsed as a national standard by the American National Standards Institute (ANSI) in 1992 under SQL-92. This is the language that we will focus on in this course as we work with MySQL.

The Relational Model supports simple, yet powerful, data queries. SQL is a high level declarative language, where the queries can be written rather intuitively and the DBMS is responsible for the efficient evaluation of the language. There are precise semantics for relational queries that allow the DBMS optimizer to extensively re-order operations, and still ensure that the answer does not change.

Categories of SQL Statements

Data Definition Language

Data definition language (DDL) statements are used to create and modify tables, views, and other structures. We will focus on the DDL statements CREATE, DROP, and ALTER.

Data Manipulation Language

Data manipulation language (DML) statements are used for queries and data modification. Examples include SELECT, INSERT, UPDATE, and DELETE.

SQL - DDL

Before we can work with any data, we must have a place to put it. A database (schema) has been created for each of you on the department's database management server, but you do not have any tables (relations) yet in which to place the information.

Creating a New Table - CREATE TABLE

The first SQL command we will focus on is the CREATE TABLE statement. When creating a table, we need to specify a name for the table and then specify the columns in that table. When we define a column in a table we set the column name, data type, and any optional constraints.

Data Types

When we decide on a column data type, it is best to consider the most specific type possible. These are a few data types that we will start out with, but there are many more.

Let's consider the example table below:

EmployeeNumberFirstNameLastNameDepartmentEmailPhoneReview
125JaniceJohansonAccountingjj@happy.com310-123401 Jan 2016
155MarySmithFinancems@happy.com444-123401 Jan 2016

We have 7 columns that we need to work with and decide on a name and data type. Let's assume we decide on the following data types:

EmployeeNumberFirstNameLastNameDepartmentEmailPhoneReview
INTCHAR(30)CHAR(30)CHAR(20)CHAR(50)CHAR(15)TIMESTAMP

Assuming the above type decisions, then the syntax for the CREATE TABLE statement would be:

CREATE TABLE Employee (EmployeeNumber INT, FirstName CHAR(30), LastName CHAR(30), Department CHAR(20), Email CHAR(50), Phone CHAR(15), Review TIMESTAMP);

This is hard to read when shown this way, the format that we will use for nearly all examples is show below and is considerably easier to read.

CREATE TABLE Employee (EmployeeNumber INT,
                       FirstName CHAR(30),
                       LastName CHAR(30),
                       Department CHAR(20),
                       Email CHAR(50),
                       Phone CHAR(15),
                       Review TIMESTAMP);

Selecting Data Types - Case Study: Alpha Numbers

You will be faced with an important decision many times in your life: should a certain piece of data be stored as a numeric value (such as an INTEGER or FLOAT) or a string (CHAR or VARCHAR)? For example, Alpha numbers look numeric. You think you want to store them as an INTEGER!

As a rule of thumb, though, you should only use numeric types for data you intend to perform mathematical operations on. It probably makes sense to store age as an INTEGER in order to compare if someone is older or younger. Alpha numbers, on the other hand, are really never compared in the sense of "greater than" or "less than". And there's really no reason to add or subtract from them.

So what's the harm of storing Alpha numbers as INTEGERs? First, some Alpha numbers have leading zeros (for example 091890). Integers are not stored with leading zeros. Second, Alpha numbers have been a string of numbers for probably a hundred years, but what if they are changed to include letters? If that seems contrived, think about company numbers. It was a poor soul that stored them as INTEGERs and had to migrate their database once plebe summer arrived and companies were identified by letter.

Long story short: if it doesn't make sense to do math on a value, don't store it as a numeric type.

Should specific columns be required?

We can specify whether values for a column are optional or required by using NULL or NOT NULL keywords. In our current example, let us assume that an employee receives a Department, Email, or Phone Number only after initial training and assignment. The table should then be created as follows:

CREATE TABLE Employee (EmployeeNumber INT NOT NULL,
                       FirstName CHAR(30) NOT NULL,
                       LastName CHAR(30) NOT NULL,
                       Department CHAR(20) NULL,
                       Email CHAR(50) NULL,
                       Phone CHAR(15) NULL,
                       Review TIMESTAMP NULL);
As a note, the default setting in MySQL is NULL.

Default values in columns

MySQL will automatically set certain values, if not explicitly set when inserted into the database, if those options are set during table creation. Two very common examples are AUTO_INCREMENT, which we will discuss later along with keys, and setting a value like the date based on the current date. Continuing our example:

CREATE TABLE Employee (EmployeeNumber INT NOT NULL,
                       FirstName CHAR(30) NOT NULL,
                       LastName CHAR(30) NOT NULL,
                       Department CHAR(20) NULL,
                       Email CHAR(50) NULL,
                       Phone CHAR(15) NULL,
                       Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP);

Data Constraints

Data constraints are limitations on data values that must be true for any possible instance of the databases. There are a few different types of data constraints, also referred to as Integrity Constraints:

  1. Domain Constraints - An item must be from a specific domain or list (ex. {Mon, Tue, Wed})
  2. Range Constraints - An item must exist in some range (ex. 4-50)
  3. Intra-relation Constraints - Constraints placed on data within the same table
  4. Inter-relation Constraints - Constraints where data relies on data in a different table
  5. Key Constraints - A specific column, or set of columns, uniquely identifies a row
  6. Foreign Key Constraints - A specific column, or set of columns, exists in another table
Constraints are specified at table creation, after the last column is defined, as follows: CONSTRAINT <Name> <constraint-type> <specific-constraint> , and are checked when the associated tables are modified. Example:

CREATE TABLE Employee (EmployeeNumber INT NOT NULL,
                       FirstName CHAR(30) NOT NULL,
                       LastName CHAR(30) NOT NULL,
                       Department CHAR(20) NULL,
                       Email CHAR(50) NULL,
                       Phone CHAR(15) NULL,
                       Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
                       CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'));
                       
A legal instance of a table is one that satisfies all specified constraints. A DBMS should not allow illegal instances!

Keys

Keys are combinations of one or more columns that are used to identify rows in a table. When the keys consists of two or more columns it is referred to as a composite key. A set of columns is a key for a table when:

  1. No two distinct rows have the same values in key columns
  2. No subset of the key satisfies the first condition

A candidate key is a key that satisfies the conditions above and the primary key is a candidate key that is selected as the primary means of identification for rows in the table.

A surrogate key is an artificial column which is added to a table to serve as the primary key. This is DBMS supplied, short, numeric and never changes.

Specify the Primary Key

Primary keys are created using the same constraint method specified earlier.

CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
                       FirstName CHAR(30) NOT NULL,
                       LastName CHAR(30) NOT NULL,
                       Department CHAR(20) NULL,
                       Email CHAR(50) NULL,
                       Phone CHAR(15) NULL,
                       Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
                       CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
                       CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber));

In the example above EnployeeNumber is a primary key that is a surrogate key, as it is defined using the AUTO_INCREMENT keywork, which makes the DBMS provide a unique value for that column if one is not specified.

Alternate keys are alternate identifiers of unique rows in a table.

CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
                       FirstName CHAR(30) NOT NULL,
                       LastName CHAR(30) NOT NULL,
                       Department CHAR(20) NULL,
                       Email CHAR(50) NULL,
                       Phone CHAR(15) NULL,
                       Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
                       CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
                       CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber),
                       CONSTRAINT AK_Employee_Email UNIQUE (Email));

Foreign Keys

Foreign keys are used as links between two tables. A foreign key is the primary key of one table that is placed in another table to form a link. This is a referential integrity constraint: the values of the foreign key must exist as primary key values in the original, "foreign", table. There can be no "dangling references!"

CREATE TABLE Department (Name CHAR(20) NOT NULL,
                         Location CHAR(30) NOT NULL,
                         CONSTRAINT PK_Department PRIMARY KEY (Name));
CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
                       FirstName CHAR(30) NOT NULL,
                       LastName CHAR(30) NOT NULL,
                       Department CHAR(20) NULL,
                       Email CHAR(50) NULL,
                       Phone CHAR(15) NULL,
                       Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
                       CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
                       CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber),
                       CONSTRAINT FK_Employee_Department FOREIGN KEY (Department)
                                  REFERENCES Department (Name),
                       CONSTRAINT AK_Employee_Email UNIQUE (Email));

Referential Integrity

What should happen if you now attempt to add an employee with a Department name that does exist? Reject the add! If there is a foreign key constraint that relies on another table's information, and that information does not exist, the database will take no action when you try to add a new row.

But what would happen if we deleted the Finance department from the Department table? In the example above we have two rows in our Employee table that would no longer be able to properly satisfy the foreign key constraint. We have a few options:

  1. Delete the referring member rows in Employee - Although we would lose all record of those employees!
  2. Take no action and prevent the row in Departments from being deleted
  3. Set Department in Employee to be NULL
  4. Set Department in Employee to a default value

The default is NO ACTION where the delete/update is rejected. CASCADE will delete/update all rows that refer to the deleted/updated row, and SET NULL / SET DEFAULT will have those respective actions. As an example:

CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
                       FirstName CHAR(30) NOT NULL,
                       LastName CHAR(30) NOT NULL,
                       Department CHAR(20) NULL,
                       Email CHAR(50) NULL,
                       Phone CHAR(15) NULL,
                       Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
                       CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
                       CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber),
                       CONSTRAINT FK_Employee_Department FOREIGN KEY (Department)
                                  REFERENCES Department (Name)
                                  ON DELETE SET NULL ON UPDATE CASCADE,
                       CONSTRAINT AK_Employee_Email UNIQUE (Email));

Removing a Table - DROP TABLE

There are a few additional commands and statements that you may find useful when using MySQL. In this lesson we have created a few tables, but if you are attempting to create a table that already exists, you will encounter an error. The command to remove an existing table is DROP TABLE, example:

DROP TABLE Employee;
If you run that command twice in a row, you will get an error the second time, as the table you are trying to drop no longer exists.

mysql> DROP TABLE Employee;
ERROR 1051 (42S02): Unknown table 'mXXXXXX.Employee'
Another way to remove a table is to check if it exists prior to removing it:

DROP TABLE IF EXISTS Employee;
In fact, you will often find this statement before a CREATE TABLE statement in most code.

DROP TABLE IF EXISTS Employee;
CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
                       FirstName CHAR(30) NOT NULL,
                       LastName CHAR(30) NOT NULL,
                       Department CHAR(20) NULL,
                       Email CHAR(50) NULL,
                       Phone CHAR(15) NULL,
                       Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
                       CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
                       CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber),
                       CONSTRAINT FK_Employee_Department FOREIGN KEY (Department)
                                  REFERENCES Department (Name)
                                  ON DELETE SET NULL ON UPDATE CASCADE,
                       CONSTRAINT AK_Employee_Email UNIQUE (Email));
This has tradeoffs! It's less likely to throw an error, but you're more likely to accidentally drop a table that you didn't mean to. But you'll see this a lot in this course since we're not dealing with production deployments.

If you would like to see what tables you have created, see the next section!

What tables do I have?

SHOW TABLES;
To find the column names within a specific table:

SHOW FIELDS IN Employee;

Practice Problems

Complete these practice problems in a terminal using the mysql commands you learned or using the MySQL Shell for VS Code extension and opening a script. Create a single file or a series of well-labeled files with your commands written out. For example, class2_answers.txt or q1.txt, q2.txt, etc.

  1. Connect to your database on the MySQL server on db and remove any tables that already exist.
  2. Create a new table called Midshipman with at least the following columns:
    • LastName
    • FirstName
    • Alpha
    • Company
    • Room
    • Phone
    • Email
    • Major
  3. What should the primary key be?
  4. Create a table called Brigade, with at least a battalion and battalion officer attribute. Make the battalion number the primary key for the table.
  5. Create a table called Company, with at least a company (number), battalion (number), company officer and SEL attribute. Pick and assign the best PRIMARY KEY.
  6. Now set it so that the Midshipman table has a FOREIGN KEY reference to Company
  7. Now set it so that the Company table has a FOREIGN KEY reference to Brigade
  8. Hint: This is trickier than it seems at first! TABLE Midshipman is now dependent on TABLE Company due to the FOREIGN KEY we added in the last problem. So we have two options, drop Midshipman first so we can rebuild Company with our new FOREIGN KEY constraint, then rebuild Midshipman. Or you can figure out how to use the ALTER command (covered later) instead of dropping and recreating Company.
  9. Would this have been easier in the other direction? Creating Brigade then Company then Midshipman
  10. What are some common data types?
  11. What are the types of Data Constraints
  12. What happens when I delete data from a table which is the parent (based on Foreign keys) to another table?