Data scientists need to deal with a lot of data. This course is an introduction to data astorage methods and systems. In
particular, we will discuss the use of relational databases, including data modeling,
database interaction, and interfacing with a database from a Python
or R script.
You will quickly realize that a good design will help your projects succeed, and this
course aims to help you determine what good database design means.
The course will introduce working with databases via MySQL and quickly move into data
modeling so we can build our own database designs and schemas.
If you consider the amount of information available, from everything that is available on the Internet down to what is involved just to understand your grades here at the Academy, you can see that we will need effective ways of storing and retrieving information. We must figure out how to overcome many problems, examples include:
We will consider a database to be a possibly very large, integrated, collection of data. The data tends to
be maintained by a Database Management System (DBMS) which is a software package designed to store and manage
databases. The DBMS handles many problems for you, such as Data independence and efficient access,
reduced application development time, Data integrity and security, performance and scalability,
and many others.
Databases are used everywhere, by corporations, universities, and the military. The size of the datasets
are continuing to increase in both diversity and volume.
The Relational Model is the most widely used model for storing data, with vendors including IBM, Microsoft, Oracle, Sybase, and others. While there are a few competing models such as the Object-Oriented Model, Object-Relational Model, XML, and Key-value stores, we will focus this course on the traditional database model with discussions of other options in other classes in this majot.
A relation is a two-dimensional table that can be described by a schema, where a schema describes the structure (a relation name, column names, and column types). With this in mind, you can consider a relational database to be a set of relations (or tables).
| EmployeeNumber | FirstName | LastName | Department | Phone | |
|---|---|---|---|---|---|
| 125 | Janice | Johanson | Accounting | jj@happy.com | 310-1234 |
| 155 | Mary | Smith | Finance | ms@happy.com | 444-1234 |
| 178 | Shannon | McHanson | Finance | sm@happy.com | 444-5678 |
| 189 | Bob | Jordan | Accounting | jb@happy.com | 310-5555 |
| 193 | Bob | Hoyle | Sales | bh@happy.com | 310-6666 |
| 197 | Janette | Young | Legal | jy@happy.com | 410-1234 |
| 204 | Todd | Mcquire | Legal | tm@happy.com | 410-4444 |
| 205 | Todd | Smith | Legal | ts@happy.com | 410-4444 |
Knowledge Check: Let's try to describe this table as simply as possible, say the table name is EMPLOYEE, what other information would we want to include in our schema to allow someone else to understand the structure of the table?
A table has a few requirements for it to be considered a relation.
Knowledge Check: With these requirements in mind, why are the following tables Not relations?
| EmployeeNumber | FirstName | LastName | Department | Phone | |
|---|---|---|---|---|---|
| 125 | Janice | Johanson | Accounting | jj@happy.com | 310-1234 |
| 189 | Bob | Jordan | Accounting | jb@happy.com | 310-5555, 310-0000, 310-0001 |
| 204 | Todd | Mcquire | Legal | tm@happy.com | 410-4444, 410-0010 |
| EmployeeNumber | FirstName | LastName | Department | Phone | |
|---|---|---|---|---|---|
| 125 | Janice | Johanson | Accounting | jj@happy.com | 310-1234 |
| 189 | Bob | Jordan | Accounting | jb@happy.com | 310-5555 |
| Fax: | 310-0000 | ||||
| Home: | 310-0001 | ||||
| 204 | Todd | Mcquire | Legal | tm@happy.com | 410-4444 |
Although not all tables are relations, the terms table and relation are normally used interchangeably:
| Table | Column | Row |
| Relation | Attribute | Tuple |
| File | Field | Record |
In this course we will be using the community edition of the popular open source relational database system MySQL. MySQL is the most commonly used open source database system, and the second most used of all database systems. It scales well and is suitable for nearly all database projects.
For the purposes of this class you will be using the database server db.cs.usna.edu for all of your homework, assignments, projects, and demos. To connect to the database we can use the mysql client in the terminal or the MySQL Shell extension in VSCode. Below are the details:
Bring up a terminal on your linux workstation and type:
mysql -h db.cs.usna.edu -u m26xxxx -p m26xxxx -vvv
where m26xxxx is your standard logon.
The server will ask for a password, your password is just your logon, m26xxxx and not your academy password.
If you change your password, Store/remember your PW somewhere so you do not forget it!
The generic version of the command is
mysql -h <server_name> -u <user_name> -p <database_schema> -vvv
where the values between < > are replaced by the actual values.
The vvv option is to show more debug information.
You should see something similar to the following when you successfully connect:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1001
Server version: 8.0.34-0ubuntu0.22.04.1 (Ubuntu)
Copyright (c) 2000, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
You can change your database password from the default via the command:
SET PASSWORD = 'cleartext password';
Change cleartext password to a new
password,
please do not use your normal academy credentials.
Whatever you type between the two apostrophe marks
will be your new password. I recommend you stick to letters and numbers, and don't use special characters in for this password. While it ends up being less secure, it might help you out in the future if you're including your password in scripts (Also not best practice, but let's keep it simple while we're starting out!)
Now to test your password, exit the client with either the exit or quit commands. This should return to the Linux command prompt. You can now try to connect back with your new password.
Install MySQL Shell for VS Code
Create a connection
Open connection
Welcome to the MySQL Schell - DB Notebook.
Press Ctrl+Enter to execute the current statement.
Execute \sql to switch to SQL, \js to JavaScript and \ts to TypeScript mode.
Execute \help or \? for help;
The Structured Query Language (SQL) is the standard language used to interact with Relational Databases. The versions of the language used by various RDBMS vendors are nearly identical, with only a few modifications to support advanced features that may not appear in every DBMS. SQL was developed by IBM in the 1970's and later endorsed as a national standard by the American National Standards Institute (ANSI) in 1992 under SQL-92. This is the language that we will focus on in this course as we work with MySQL.
The Relational Model supports simple, yet powerful, data queries. SQL is a high level declarative language, where the queries can be written rather intuitively and the DBMS is responsible for the efficient evaluation of the language. There are precise semantics for relational queries that allow the DBMS optimizer to extensively re-order operations, and still ensure that the answer does not change.
Data definition language (DDL) statements are used to create and modify tables, views, and other structures. We will focus on the DDL statements CREATE, DROP, and ALTER.
Data manipulation language (DML) statements are used for queries and data modification. Examples include SELECT, INSERT, UPDATE, and DELETE.
Before we can work with any data, we must have a place to put it. A database (schema) has been created for each of you on the department's database management server, but you do not have any tables (relations) yet in which to place the information.
The first SQL command we will focus on is the CREATE TABLE
statement. When creating a table, we need to specify a name for the table and
then specify the columns in that table. When we define a column in a table we set the column name, data type, and any optional constraints.
When we decide on a column data type, it is best to consider the most specific type possible. These are a few data types that we will start out with, but there are many more.
Let's consider the example table below:
| EmployeeNumber | FirstName | LastName | Department | Phone | Review | |
|---|---|---|---|---|---|---|
| 125 | Janice | Johanson | Accounting | jj@happy.com | 310-1234 | 01 Jan 2016 |
| 155 | Mary | Smith | Finance | ms@happy.com | 444-1234 | 01 Jan 2016 |
We have 7 columns that we need to work with and decide on a name and data type. Let's assume we decide on the following data types:
| EmployeeNumber | FirstName | LastName | Department | Phone | Review | |
|---|---|---|---|---|---|---|
| INT | CHAR(30) | CHAR(30) | CHAR(20) | CHAR(50) | CHAR(15) | TIMESTAMP |
Assuming the above type decisions, then the syntax for the CREATE TABLE statement would be:
CREATE TABLE Employee (EmployeeNumber INT, FirstName CHAR(30), LastName CHAR(30), Department CHAR(20), Email CHAR(50), Phone CHAR(15), Review TIMESTAMP);
This is hard to read when shown this way, the format that we will use for nearly all examples is show below and is considerably easier to read.
CREATE TABLE Employee (EmployeeNumber INT,
FirstName CHAR(30),
LastName CHAR(30),
Department CHAR(20),
Email CHAR(50),
Phone CHAR(15),
Review TIMESTAMP);
You will be faced with an important decision many times in your life: should a certain piece of data be stored as a numeric value (such as an INTEGER or FLOAT) or a string (CHAR or VARCHAR)? For example, Alpha numbers look numeric. You think you want to store them as an INTEGER!
As a rule of thumb, though, you should only use numeric types for data you intend to perform mathematical operations on. It probably makes sense to store age as an INTEGER in order to compare if someone is older or younger. Alpha numbers, on the other hand, are really never compared in the sense of "greater than" or "less than". And there's really no reason to add or subtract from them.
So what's the harm of storing Alpha numbers as INTEGERs? First, some Alpha numbers have leading zeros (for example 091890). Integers are not stored with leading zeros. Second, Alpha numbers have been a string of numbers for probably a hundred years, but what if they are changed to include letters? If that seems contrived, think about company numbers. It was a poor soul that stored them as INTEGERs and had to migrate their database once plebe summer arrived and companies were identified by letter.
Long story short: if it doesn't make sense to do math on a value, don't store it as a numeric type.
We can specify whether values for a column are optional or required by using NULL or NOT NULL keywords. In our current example, let us assume that an employee receives a Department, Email, or Phone Number only after initial training and assignment. The table should then be created as follows:
CREATE TABLE Employee (EmployeeNumber INT NOT NULL,
FirstName CHAR(30) NOT NULL,
LastName CHAR(30) NOT NULL,
Department CHAR(20) NULL,
Email CHAR(50) NULL,
Phone CHAR(15) NULL,
Review TIMESTAMP NULL);
As a note, the default setting in MySQL is NULL.
MySQL will automatically set certain values, if not explicitly set when inserted into the database, if those options are set during table creation. Two very common examples are AUTO_INCREMENT, which we will discuss later along with keys, and setting a value like the date based on the current date. Continuing our example:
CREATE TABLE Employee (EmployeeNumber INT NOT NULL,
FirstName CHAR(30) NOT NULL,
LastName CHAR(30) NOT NULL,
Department CHAR(20) NULL,
Email CHAR(50) NULL,
Phone CHAR(15) NULL,
Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP);
Data constraints are limitations on data values that must be true for any possible instance of the databases. There are a few different types of data constraints, also referred to as Integrity Constraints:
CREATE TABLE Employee (EmployeeNumber INT NOT NULL,
FirstName CHAR(30) NOT NULL,
LastName CHAR(30) NOT NULL,
Department CHAR(20) NULL,
Email CHAR(50) NULL,
Phone CHAR(15) NULL,
Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'));
A legal instance of a table is one that satisfies all specified constraints.
A DBMS should not allow illegal instances!
Keys are combinations of one or more columns that are used to identify rows in a table. When the keys consists of two or more columns it is referred to as a composite key. A set of columns is a key for a table when:
A candidate key is a key that satisfies the conditions above and the primary key is a candidate key that is selected as the primary means of identification for rows in the table.
Primary keys are created using the same constraint method specified earlier.
CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
FirstName CHAR(30) NOT NULL,
LastName CHAR(30) NOT NULL,
Department CHAR(20) NULL,
Email CHAR(50) NULL,
Phone CHAR(15) NULL,
Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber));
In the example above EnployeeNumber is a primary key that is a surrogate key, as it is defined using the AUTO_INCREMENT keywork, which makes the DBMS provide a unique value for that column if one is not specified.
Alternate keys are alternate identifiers of unique rows in a table.
CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
FirstName CHAR(30) NOT NULL,
LastName CHAR(30) NOT NULL,
Department CHAR(20) NULL,
Email CHAR(50) NULL,
Phone CHAR(15) NULL,
Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber),
CONSTRAINT AK_Employee_Email UNIQUE (Email));
Foreign keys are used as links between two tables. A foreign key is the primary key of one table that is placed in another table to form a link. This is a referential integrity constraint: the values of the foreign key must exist as primary key values in the original, "foreign", table. There can be no "dangling references!"
CREATE TABLE Department (Name CHAR(20) NOT NULL,
Location CHAR(30) NOT NULL,
CONSTRAINT PK_Department PRIMARY KEY (Name));
CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
FirstName CHAR(30) NOT NULL,
LastName CHAR(30) NOT NULL,
Department CHAR(20) NULL,
Email CHAR(50) NULL,
Phone CHAR(15) NULL,
Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber),
CONSTRAINT FK_Employee_Department FOREIGN KEY (Department)
REFERENCES Department (Name),
CONSTRAINT AK_Employee_Email UNIQUE (Email));
What should happen if you now attempt to add an employee with a Department name that does exist? Reject the add! If there is a foreign key constraint that relies on another table's information, and that information does not exist, the database will take no action when you try to add a new row.
But what would happen if we deleted the Finance department from the Department table? In the example above we have two rows in our Employee table that would no longer be able to properly satisfy the foreign key constraint. We have a few options:
CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
FirstName CHAR(30) NOT NULL,
LastName CHAR(30) NOT NULL,
Department CHAR(20) NULL,
Email CHAR(50) NULL,
Phone CHAR(15) NULL,
Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber),
CONSTRAINT FK_Employee_Department FOREIGN KEY (Department)
REFERENCES Department (Name)
ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT AK_Employee_Email UNIQUE (Email));
There are a few additional commands and statements that you may find useful when using MySQL. In this lesson we have created a few tables, but if you are attempting to create a table that already exists, you will encounter an error. The command to remove an existing table is DROP TABLE, example:
DROP TABLE Employee;
If you run that command twice in a row, you will get an error the second time, as the table you are trying
to drop no longer exists.
mysql> DROP TABLE Employee;
ERROR 1051 (42S02): Unknown table 'mXXXXXX.Employee'
Another way to remove a table is to check if it exists prior to removing it:
DROP TABLE IF EXISTS Employee;
In fact, you will often find this statement before a CREATE TABLE statement in most code.
DROP TABLE IF EXISTS Employee;
CREATE TABLE Employee (EmployeeNumber INT NOT NULL AUTO_INCREMENT,
FirstName CHAR(30) NOT NULL,
LastName CHAR(30) NOT NULL,
Department CHAR(20) NULL,
Email CHAR(50) NULL,
Phone CHAR(15) NULL,
Review TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT Check_Email CHECK (Email LIKE '%@happy.com'),
CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber),
CONSTRAINT FK_Employee_Department FOREIGN KEY (Department)
REFERENCES Department (Name)
ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT AK_Employee_Email UNIQUE (Email));
This has tradeoffs! It's less likely to throw an error, but you're more likely to accidentally drop a table that you didn't mean to. But you'll see this a lot in this course since we're not dealing with production deployments.
If you would like to see what tables you have created, see the next section!
SHOW TABLES;
To find the column names within a specific table:
SHOW FIELDS IN Employee;
Complete these practice problems in a terminal using the mysql
commands you learned or using the MySQL Shell for VS Code extension and opening
a script. Create a single file or a series of well-labeled files with your
commands written out. For example, class2_answers.txt or q1.txt, q2.txt, etc.