SD321 (Fall 2024)

SD321 Class Overview

Course Logistics

Student and instructor introductions, the role of the website, course policies and expectations, and a brief introduction to data storage and databases.

Readings

Suggested readings will be posted to the calendar. Readings are primarily from the 13th edition of Database Processing by Kroenke & Auer (Prentice Hall, 2014) and SQL for Data Science by Antonio Badia (Springer, 2020). Although these books are marked as recommended, you are encouraged to follow the readings throughout the course. The online notes serve as a distilled source of information, but the texts go much deeper into each subject and may help if you get lost or fall behind.

What are the goals of the class?

Data scientists need to deal with a lot of data. This course is an introduction to data storage methods and systems. In particular, we will discuss the use of relational databases, including data modeling, database interaction, and interfacing with a database from a Python or R script. You will quickly realize that a good design will help your projects succeed, and this course aims to help you determine what good database design means.

The course will introduce working with databases via MySQL and quickly move into data modeling so we can build our own database designs and schemas.

Problems

At the end of the notes for almost every lecture there will be a couple of example problems (like those you have seen in previous courses). If time permits, we will work on these as in-class exercises.

Course Assignments

The calendar is the primary source of information and guidance that we will use throughout the semester. You are expected to check it often so that you do not miss newly posted readings, notes, or assignments.

Homeworks: There will be very few collected homework assignments. You are encouraged to complete the practice problems at the end of each day's lecture notes. Completing them will prepare you for the graded lab assignments!

Quizzes: Be prepared for surprise/random quizzes to ensure that you are following along with the class.

Labs: The weekly labs and a final project will count for most of the graded work in this class.

Submissions: All submissions will be done via the online submission site (submit.usna.edu), which will track time of submission. Review the course policy regarding late assignments.

Course Project

There will be a single group project, with teams of 2-3 people, that will build a data science application with a database backend of your choosing (with our approval). At the end of the semester you will present your work to the class including a full demo. The first milestone will be towards the end of October, at which point you will have:

Created a team of 2-3 students with a designated team leader.
Created a task list broken out by team member where:
1. Each member will have some responsibility for DB design and Python/R interface portions.
2. Each member should be able to understand the entire project and the others' efforts.

More details will be provided as the semester progresses. Start thinking about this now!

Python Review and an Introduction to R

Python is a dynamic, interpreted (bytecode-compiled) language. There are no type declarations of variables, parameters, functions, or methods in source code. This makes the code short and flexible, but you lose the compile-time type checking of the source code. Python tracks the types of all values at runtime and flags code that does not make sense as it runs.
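To make the runtime type checking concrete, here is a minimal sketch. The function name add_one is only illustrative. Nothing is flagged when the function is defined; the error surfaces only when the offending call actually runs.

```python
def add_one(value):
    # No type declaration: any value is accepted when the function is defined.
    return value + 1

print(add_one(41))  # works: int + int

try:
    add_one("41")  # fails only now, at runtime: str + int
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```

A compiled, statically typed language would typically reject the second call before the program ever ran.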

R is a language and an environment for statistical computing and graphics and is considered to be one of the most comprehensive statistical programming languages available. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.

The best way to learn is to just start working with the interpreter! Bring up a terminal on your machine (ctrl-alt-T on Ubuntu) and type in python3 to start Python or R to start R.

$ python3

Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

$ R

R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

You will be typing in commands at the >>> or the > prompt.

Language Overview and Comparison

Python and R both run on a wide range of hardware and operating systems, and both can be used directly through the interpreter or by running a script. In these notes, each example shows the Python code first, followed by the equivalent R code. The output of each snippet is shown directly beneath it.

print("This is my first Python code!")

This is my first Python code!

print("This is my first R code!")

[1] "This is my first R code!"

Python and R are traditionally written with one statement per line. It is possible to put more than one statement on a line by separating them with a semicolon (;), but this is not recommended. Strings can be written in two ways: with 'single quotes' or with "double quotes". Note that at the interactive prompt we could omit the print() calls below and the interpreter would still display each value; this does not work when the code is run as a script.

a = 'cool eh?'
b = "oohrah!"
print(a+" "+b)

cool eh? oohrah!

a = 'cool eh?'
b = "oohrah!"
print(paste(a,b))

[1] "cool eh? oohrah!"

The type of a variable is determined by its content; variables are not declared prior to use, and variable names are case sensitive. In R you can use either = or <- to assign a variable. The type of a variable is given by the type of the data assigned to it and can change during the execution of the script. In Python, type() reports the type of a variable; in R you can use either typeof() or class().

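Variable scope: R uses lexical scoping, so a variable is "visible" within the function in which it was defined (and in any functions defined inside that function). Note that braces {} on their own, as in an if or for block, do not create a new scope. Variables defined in a script outside of any function are visible everywhere in that script.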

new_boolean = True
new_int = 42
new_force_float = float(42)
new_float = 3.14159
new_string = "Automatically typed languages"
my_type = type(new_float)

print(my_type)

<class 'float'>

new_boolean = TRUE
new_int = 42

new_float = 3.14159
new_string = "Automatically typed languages"
my_type = typeof(new_float)

print(my_type)

[1] "double"
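As a small illustration of a variable's type changing during execution, the sketch below rebinds one name to values of different types. The variable names are made up for this example.

```python
value = 42
first_type = type(value).__name__   # 'int'

value = "forty-two"                 # same name, now bound to a string
second_type = type(value).__name__  # 'str'

print(first_type, second_type)      # int str

# isinstance() is the usual way to test a type inside a condition.
print(isinstance(value, str))       # True
```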

We also have a nice way of combining strings via concatenation, or via paste in R.

long_string = "line 1" + "line 2" + 'line 3'
long_string = long_string + new_string

long_string = paste("line 1", "line 2", 'line 3')
long_string = paste(long_string, "line 4")
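One difference worth seeing for yourself: Python's + joins strings with nothing in between, while R's paste() inserts a space by default. A rough Python counterpart to paste() is str.join, sketched here with illustrative variable names.

```python
words = ["line 1", "line 2", "line 3"]

plus_joined = words[0] + words[1] + words[2]  # no separator is added
space_joined = " ".join(words)                # separator of our choosing

print(plus_joined)   # line 1line 2line 3
print(space_joined)  # line 1 line 2 line 3

# Numbers must be converted to strings before using +; an f-string does it for you.
count = len(words)
summary = f"{space_joined} ({count} lines)"
print(summary)       # line 1 line 2 line 3 (3 lines)
```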

Arrays (lists in Python, vectors in R) should be created before use:

new_array_1 = []
new_array_1.append('item1')
new_array_1.append('item2')
print(new_array_1)
new_array_2 = [1,2,4,8,16,32]
print(new_array_2)
print(new_array_2[1:-1])

new_array_1.reverse()
for item in new_array_1:
  print(str(item))

['item1', 'item2']
[1, 2, 4, 8, 16, 32]
[2, 4, 8, 16]
item2
item1

new_array_1 = c()
new_array_1 = append(new_array_1, 'item1')
new_array_1 = append(new_array_1, 'item2')
print(new_array_1)
new_array_2 = c(1,2,4,8,16,32)
print(new_array_2)
print(new_array_2[(2:(length(new_array_2)-1))])

new_array_1 = rev(new_array_1)
for (item in new_array_1) {
  print(item)
}

[1] "item1" "item2"
[1]  1  2  4  8 16 32
[1]  2  4  8 16
[1] "item2"
[1] "item1"
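The slice [1:-1] used in the Python example is one case of the general list[start:stop:step] form, where start is included, stop is excluded, and negative numbers count from the end. A few more cases, using an illustrative list:

```python
powers = [1, 2, 4, 8, 16, 32]

first_three = powers[:3]   # from the beginning up to (not including) index 3
last_two = powers[-2:]     # from the second-to-last element to the end
every_other = powers[::2]  # every second element
middle = powers[1:-1]      # drop the first and last elements

print(first_three)   # [1, 2, 4]
print(last_two)      # [16, 32]
print(every_other)   # [1, 4, 16]
print(middle)        # [2, 4, 8, 16]
```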

Python also has key-based dictionaries, which are really nice to work with. In R we can get similar behavior with a named list, which lets us look up elements by name. Note that in R the names must be strings, so the zero is quoted in the example below.

new_dict_1 = {}
new_dict_1[0] = 'test'
new_dict_1['bob'] = 'cat'

print(new_dict_1)
print(new_dict_1['bob'])


if 'bob' in new_dict_1:
  print('Bob was there')


for item in new_dict_1:
  print(item, new_dict_1[item])

{0: 'test', 'bob': 'cat'}
cat
Bob was there
0 test
bob cat

new_dict_1 <- list(
'0' = 'test',
bob = 'cat'
)
print(new_dict_1)
print(new_dict_1$bob)
print(new_dict_1[['bob']])

if ('bob' %in% names(new_dict_1)) {
  print('Bob was there')
}

for (item in names(new_dict_1)) {
  print(paste(item, new_dict_1[[item]]))
}

$`0`
[1] "test"
$bob
[1] "cat"
[1] "cat"
[1] "cat"
[1] "Bob was there"
[1] "0 test"
[1] "bob cat"
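Two other dictionary idioms you are likely to meet in Python are get(), which returns a fallback instead of raising an error when a key is missing, and items(), which yields key/value pairs together. A short sketch, with an illustrative dictionary and fallback value:

```python
pets = {0: 'test', 'bob': 'cat'}

# get() avoids a KeyError when the key is absent.
known = pets.get('bob', 'unknown')
missing = pets.get('alice', 'unknown')
print(known, missing)  # cat unknown

# items() gives each key together with its value.
pairs = []
for key, value in pets.items():
    pairs.append((key, value))
print(pairs)           # [(0, 'test'), ('bob', 'cat')]
```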

Basic Programming Constructs

Since you are experienced programmers, it is often most effective to review the common constructs (if-statements, while-loops, etc.) and learn a new language via examples and practice.

if (new_int == 42):
  print("new_int = "+ str(new_int) + " and it is 42!")
elif (new_int == 43):
  print("Why does new_int = 43?")
else:
  print("why does new_int not equal 42!")

new_int = 42 and it is 42!

if (new_int == 42) {
  print(paste("new_int =", new_int, "it is 42!"))
} else if (new_int == 43) {
  print("Why does new_int = 43?")
} else {
  print("why does new_int not equal 42!")
}

[1] "new_int = 42 it is 42!"

Note: Separate multiple conditions with and or or in Python and with && or || in R. For example:

if (new_int == 43 or (new_int == 42 and new_float == 3.14159)):

if (new_int == 43 || (new_int == 42 && new_float == 3.14159)) {
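For reference, here is the compound condition as a complete, runnable Python snippet. The variable values are repeated from earlier in the notes so the block stands on its own, and the outcome variable is just for illustration.

```python
new_int = 42
new_float = 3.14159

# 'or' is true if either side is true; 'and' requires both sides.
if new_int == 43 or (new_int == 42 and new_float == 3.14159):
    outcome = "condition met"
else:
    outcome = "condition not met"

print(outcome)  # condition met
```

The parentheses around the and clause are not strictly required, because and binds more tightly than or, but they make the intent easier to read.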

Let's take a few minutes to explore loops in more depth and to create ranges of numbers.

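# range() is lazy in Python 3, so wrap it in list() to see the numbers
print(list(range(5)))
for i in range(5):
  print(i, end=' ')  # end=' ' keeps the output on one line

print()

for j in range(len(new_array_2)):
  print(new_array_2[j], end=' ')

print()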

print((0:4))
for (i in (0:4)) {
  cat(i, ' ')
}
cat('\n')

# R vectors are indexed from 1, not 0
for (j in 1:length(new_array_2)) {
  cat(new_array_2[j], ' ')
}
cat('\n')

It's worth taking some time to understand how range works in Python, as it creates the series of numbers that you iterate through. You can get help at any time by using Python's built-in help() function; try help(range). In R, a similar capability is available by specifying the range as (start:stop). Note that in R the stop value is included, while in Python's range it is not.
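The different forms of range can be sketched as follows. The stop value is never included, and an optional third argument sets the step. The variable names are illustrative.

```python
zero_to_four = list(range(5))        # a single argument is the stop value
two_to_five = list(range(2, 6))      # start and stop
by_threes = list(range(0, 10, 3))    # start, stop, and step
countdown = list(range(5, 0, -1))    # a negative step counts down

print(zero_to_four)  # [0, 1, 2, 3, 4]
print(two_to_five)   # [2, 3, 4, 5]
print(by_threes)     # [0, 3, 6, 9]
print(countdown)     # [5, 4, 3, 2, 1]
```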

Now let's consider a simple while loop and compare the two languages.

while (new_int < 45):
  new_int += 1
  print(new_int)

while (new_int < 45) {
  new_int = new_int + 1
  print(new_int)
}

Neither Python nor R has a do-while loop, but as you may remember, it is easy to rewrite a do-while loop as a while loop.
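One common way to get do-while behavior in Python is an infinite loop with the test at the bottom. This is a sketch with made-up variable names; the key point is that the body runs once even though the condition is false from the start.

```python
count = 10
visited = []

while True:
    visited.append(count)  # the body always runs at least once
    count += 1
    if not (count < 5):    # the "do-while" condition, tested at the end
        break

print(visited)  # [10]
print(count)    # 11
```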

Writing functions

Functions in Python can return multiple results (which differs greatly from most languages you may have seen before). You may also pass in objects and other functions as arguments.

def my_power(base, exponent):
  results = 1.0
  for i in range(exponent):
    results = results * base
  return results



new_results = my_power(2, 8)
print("2^8 =", new_results)

my_power = function(base, exponent) {
  results = 1.0
  for (i in seq_len(exponent)) {  # seq_len(0) is empty, whereas 1:0 counts down
    results = results * base
  }
  return(results)
}

new_results = my_power(2, 8)
print(paste("2^8 =", new_results))

We can also have default values for arguments to functions, just remember that those with default values must come after required arguments.

def chart(data, label="demo"):
  ...
  print(label)

chart = function(data, label="demo") {
  # ... (work with data here)
  print(label)
}
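To see a default argument in action, here is a minimal runnable variant of chart. The function body is a placeholder invented for this sketch, not the real charting code.

```python
def chart(data, label="demo"):
    # Placeholder body: just describe what would be charted.
    return f"{label}: {len(data)} points"

with_default = chart([1, 2, 3])             # label falls back to "demo"
with_keyword = chart([1, 2], label="test")  # override by keyword
with_position = chart([1], "lab")           # or by position

print(with_default)   # demo: 3 points
print(with_keyword)   # test: 2 points
print(with_position)  # lab: 1 points
```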

def multireturn():
    return 1, 2, 3

test1, test2, test3 = multireturn()

R does not allow returning multiple values directly; to get the same effect, return them together in a vector or a list.
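Under the hood, a Python function that "returns multiple values" returns a single tuple, which the caller can unpack. A short sketch that reuses multireturn from above:

```python
def multireturn():
    return 1, 2, 3  # the commas build a tuple

result = multireturn()
print(type(result).__name__)  # tuple
print(result)                 # (1, 2, 3)

# Unpack every value, or capture the remainder with a starred name.
test1, test2, test3 = multireturn()
first, *rest = multireturn()
print(first, rest)            # 1 [2, 3]
```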

Take some time to try the above examples in both the Python3 and R interpreters on the lab machines or on your laptops. It's worth understanding the basics of the languages you will be using this semester. We will practice the basics of Python and R for the next week to set up a foundation for the rest of the course.

Practice Problems

At the bottom of most class notes are practice problems that you should try. See if you can complete the following:

In both Python and R, create a function that will convert Celsius into Fahrenheit. Remember the formula is \( Fahrenheit = \frac{(Celsius \times 9)}{5} + 32 \)
In both Python and R, using a loop, calculate \( 10! \), which would be \( 10 \times 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 \)