Student and instructor introductions, the role of the website, course policies and expectations, and a brief introduction to data storage and databases.
Suggested readings will be posted to the calendar. Readings are primarily from the 13th edition of Database Processing by Kroenke & Auer (Prentice Hall, 2014), as well as SQL for Data Science by Antonio Badia (Springer, 2020). While these books are marked as recommended, you are encouraged to follow the readings throughout this course. The online notes serve as a distilled source of information, but the texts dive much deeper into each subject and may help if you get lost or find yourself behind.
Data scientists need to deal with a lot of data. This course is an introduction to data storage methods and systems. In
particular, we will discuss the use of relational databases, including data modeling,
database interaction, and interfacing with a database from a Python
or R script.
You will quickly realize that a good design will help your projects succeed, and this
course aims to help you determine what good database design means.
The course will introduce working with databases via MySQL and quickly move into data
modeling so we can build our own database designs and schemas.
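As a rough preview of what interfacing with a database from a Python script can look like, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the MySQL server used in this course, so the connection step is an assumption and will differ in practice. The student table and its rows are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the course's MySQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical table, used only for this illustration.
conn.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, year INTEGER)"
)

# Parameterized inserts: the ? placeholders keep data separate from the SQL text.
conn.executemany(
    "INSERT INTO student (name, year) VALUES (?, ?)",
    [("Ada", 2026), ("Grace", 2027)],
)
conn.commit()

# Query the table and pull the results back into ordinary Python objects.
rows = conn.execute(
    "SELECT name, year FROM student WHERE year > ? ORDER BY name", (2026,)
).fetchall()
conn.close()

print(rows)
```

The pattern of connecting, executing SQL, and fetching rows as Python tuples is the general shape to expect, whichever database sits behind it.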
At the end of the notes for almost every lecture there will be a couple of example problems (like those you have seen in previous courses). If time permits, we will work on these as in-class exercises.
The calendar is the primary source of information and guidance that we will use
throughout the semester. You are expected to check it often to see whether any additional readings, notes,
or assignments have been posted.
Homeworks: There will be very few collected homework assignments.
You are encouraged to complete the practice problems at the end of each day's lecture notes. Completing them will prepare you to complete the graded lab assignments!
Quizzes: Be prepared for surprise/random quizzes to ensure that you
are following along with the class.
Labs: The weekly labs and a final project will count for most of the graded work in this class.
Submissions: All submissions will be done via the online submission site (submit.usna.edu),
which will track time of submission.
Review the course policy regarding late assignments.
There will be a single group project, with teams of 2-3 people, that will build a data science application with a database backend of your choosing (with our approval). At the end of the semester you will present your work to the class including a full demo. The first milestone will be towards the end of October, at which point you will have:
Python is a dynamic, interpreted (bytecode-compiled) language. There are no type declarations of variables, parameters, functions, or methods in source code. This makes the code short and flexible, but you lose the compile-time type checking of the source code. Python tracks the types of all values at runtime and flags code that does not make sense as it runs.
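To make the point about runtime type tracking concrete, here is a small sketch. The helper name describe is made up for illustration. It shows one variable name being rebound to values of different types, and a type error that is only reported when the offending line actually runs.

```python
def describe(value):
    """Return the name of the value's runtime type."""
    return type(value).__name__

x = 42
first = describe(x)     # x currently refers to an int

x = "forty-two"         # the same name can be rebound to a str
second = describe(x)

try:
    bad = 1 + "1"       # nothing complains until this line executes
except TypeError as err:
    error_message = str(err)

print(first, second)
print("TypeError:", error_message)
```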
R is a language and an environment for statistical computing and graphics and is considered to be one of the most comprehensive statistical programming languages available. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
The best way to learn is to just start working with the interpreter! Bring up a terminal on your machine (ctrl-alt-T on Ubuntu) and type in python3 to start Python or R to start R.
$ python3
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
$ R
R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
You will be typing in commands at the >>> or the > prompt.
Python and R both run on a wide range of hardware and operating systems, and can be used interactively in the interpreter or run from a script. For our purposes we will be showing Python code on the left and the equivalent R code on the right. Output will be shown as comments.
print("This is my first Python code!")
This is my first Python code!
print("This is my first R code!")
[1] "This is my first R code!"
Note that you could omit the print() statements below and the output would still be shown on the interactive
command lines of the respective interpreters; this would not work when running the code as a program.
a = 'cool eh?'
b = "oohrah!"
print(a+" "+b)
cool eh? oohrah!
a = 'cool eh?'
b = "oohrah!"
print(paste(a,b))
[1] "cool eh? oohrah!"
In Python, use type() to get information on the type of a variable; in R, use typeof() or class().
Variable scope: R uses lexical scoping. A variable is visible inside the function in which it was defined; braces {} on their own (for example around an if or for body) do not create a new scope. Variables defined in a script outside of any function are visible everywhere in that script.
new_boolean = True
new_int = 42
new_force_float = float(42)
new_float = 3.14159
new_string = "Automatically typed languages"
my_type = type(new_float)
print(my_type)
<class 'float'>
new_boolean = TRUE
new_int = 42
new_float = 3.14159
new_string = "Automatically typed languages"
my_type = typeof(new_float)
print(my_type)
[1] "double"
Strings can be combined by concatenation with + in Python, or with paste() in R. Note that paste() inserts a space between its arguments by default, while + adds nothing.
long_string = "line 1" + "line 2" + 'line 3'
long_string = long_string + new_string
long_string = paste("line 1", "line 2", 'line 3')
long_string = paste(long_string, "line 4")
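The difference between + and a separator-aware join is easy to miss, so here is a short Python sketch. The variable names are made up for illustration. It contrasts plain concatenation with str.join(), which behaves more like R's paste(), and shows an f-string for mixing text and values.

```python
parts = ["line 1", "line 2", "line 3"]

# + glues the strings together with nothing in between.
plus_result = parts[0] + parts[1] + parts[2]

# join() places the separator between items, similar to R's paste().
join_result = " ".join(parts)

# f-strings embed values directly inside a string.
label = f"{join_result} ({len(parts)} parts)"

print(plus_result)
print(join_result)
print(label)
```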
Arrays (lists in Python, vectors in R) should be declared before use.
new_array_1 = []
new_array_1.append('item1')
new_array_1.append('item2')
print(new_array_1)
new_array_2 = [1,2,4,8,16,32]
print(new_array_2)
print(new_array_2[1:-1])
new_array_1.reverse()
for item in new_array_1:
    print(str(item))
['item1', 'item2']
[1, 2, 4, 8, 16, 32]
[2, 4, 8, 16]
item2
item1
new_array_1 = c()
new_array_1 = append(new_array_1, 'item1')
new_array_1 = append(new_array_1, 'item2')
print(new_array_1)
new_array_2 = c(1,2,4,8,16,32)
print(new_array_2)
print(new_array_2[(2:(length(new_array_2)-1))])
new_array_1 = rev(new_array_1)
for (item in new_array_1) {
  print(item)
}
[1] "item1" "item2"
[1] 1 2 4 8 16 32
[1] 2 4 8 16
[1] "item2"
[1] "item1"
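The slice new_array_2[1:-1] used above is one case of Python's general slicing syntax. The sketch below uses a made-up list name to show a few more common forms. It illustrates the notation and is not a complete reference.

```python
powers = [1, 2, 4, 8, 16, 32]

middle = powers[1:-1]      # drop the first and last items
first_two = powers[:2]     # items at index 0 and 1
last_item = powers[-1]     # negative indexes count from the end
backwards = powers[::-1]   # reversed copy; powers itself is unchanged

print(middle)
print(first_two, last_item)
print(backwards)
```

Unlike the reverse() method, which changes the list in place, the [::-1] slice returns a new list.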
new_dict_1 = {}
new_dict_1[0] = 'test'
new_dict_1['bob'] = 'cat'
print(new_dict_1)
print(new_dict_1['bob'])
if 'bob' in new_dict_1:
    print('Bob was there')
for item in new_dict_1:
    print(item, new_dict_1[item])
{0: 'test', 'bob': 'cat'}
cat
Bob was there
0 test
bob cat
new_dict_1 <-list(
'0' = 'test',
bob = 'cat'
)
print(new_dict_1)
print(new_dict_1$bob)
print(new_dict_1[['bob']])
if ('bob' %in% names(new_dict_1)) {
  print('Bob was there')
}
for (item in names(new_dict_1)) {
  print(paste(item, new_dict_1[item]))
}
$`0`
[1] "test"
$bob
[1] "cat"
[1] "cat"
[1] "cat"
[1] "Bob was there"
[1] "0 test"
[1] "bob cat"
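Two other dictionary idioms come up often in Python and are sketched here, using a made-up dictionary name with the same contents as above: get() looks up a key with a fallback value instead of raising an error, and items() iterates over keys and values together.

```python
pets = {0: 'test', 'bob': 'cat'}

# get() returns the fallback when the key is absent instead of raising KeyError.
missing = pets.get('alice', 'unknown')

# items() yields (key, value) pairs in insertion order.
pairs = list(pets.items())
for key, value in pets.items():
    print(key, value)

print(missing)
```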
Since you are experienced programmers, it is often most effective to review the familiar constructs (if-statements, while-loops, etc.) and learn the new languages through examples and practice.
if (new_int == 42):
    print("new_int = "+ str(new_int) + " and it is 42!")
elif (new_int == 43):
    print("Why does new_int = 43?")
else:
    print("why does new_int not equal 42!")
new_int = 42 and it is 42!
if (new_int == 42) {
  print(paste("new_int =", new_int, "it is 42!"))
} else if (new_int == 43) {
  print("Why does new_int = 43?")
} else {
  print("why does new_int not equal 42!")
}
[1] "new_int = 42 it is 42!"
if (new_int == 43 or (new_int == 42 and new_float == 3.14159)):
if (new_int == 43 || (new_int == 42 && new_float == 3.14159)) {
print(list(range(5)))
for i in range(5):
    print(i, end=' ')
print()
for j in range(len(new_array_2)):
    print(new_array_2[j], end=' ')
print()
print((0:4))
for (i in (0:4)) {
  cat(i, ' ')
}
cat('\n')
for (j in 1:length(new_array_2)) {
  cat(new_array_2[j], ' ')
}
cat('\n')
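For more information on Python's range(), see help(range). In R, a similar
capability is available by specifying the range as (start:stop); note that R includes the stop value, while Python's range() excludes it.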
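Python's range() also accepts an optional start and step. The sketch below uses made-up variable names to show the three calling forms, plus enumerate() as a common alternative to looping over range(len(...)).

```python
zero_to_four = list(range(5))        # stop only: starts at 0, excludes 5
evens = list(range(2, 10, 2))        # start, stop, step
countdown = list(range(5, 0, -1))    # a negative step counts down

# enumerate() yields (index, value) pairs without indexing by hand.
pairs = list(enumerate(['a', 'b', 'c']))

print(zero_to_four)
print(evens)
print(countdown)
print(pairs)
```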
Now let's consider a simple while loop and compare between the two languages.
while (new_int < 45):
    new_int += 1
    print(new_int)
while (new_int < 45) {
  new_int = new_int + 1
  print(new_int)
}
Functions in Python can return multiple results (which differs from most languages you may have seen before). You may also pass in objects and other functions as arguments.
def my_power(base, exponent):
    results = 1.0
    for i in range(exponent):
        results = results * base
    return results
new_results = my_power(2, 8)
print("2^8 =", new_results)
my_power = function(base, exponent) {
  results = 1.0
  for (i in 1:exponent) {
    results = results * base
  }
  return(results)
}
new_results = my_power(2, 8)
print(paste("2^8 =", new_results))
We can also give function arguments default values; just remember that arguments with default values must come after the required arguments.
def chart(data, label="demo"):
    ...
    print(label)
chart = function(data, label="demo") {
  ...
  print(label)
}
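Since the chart example leaves its body out, here is a separate runnable Python sketch of default arguments. The function summarize and its parameters are made up for illustration. It shows a call that relies on the defaults and a call that overrides them by keyword.

```python
def summarize(data, label="demo", precision=2):
    """Return a one-line summary; label and precision are optional."""
    mean = sum(data) / len(data)
    return f"{label}: mean={mean:.{precision}f}"

# Only the required argument is given, so both defaults apply.
default_call = summarize([1, 2, 3, 4])

# Keyword arguments override defaults by name.
custom_call = summarize([1, 2, 3, 4], label="scores", precision=1)

print(default_call)
print(custom_call)
```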
def multireturn():
    return 1, 2, 3
test1, test2, test3 = multireturn()
R does not support multiple return values directly; to return several values from an R function, bundle them into a list or vector.
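Under the hood, a Python function that returns several values returns a single tuple, which the caller can then unpack. The sketch below uses a made-up function name to show both ways of receiving the result.

```python
def min_max_mean(values):
    """Return three results at once; Python packs them into one tuple."""
    return min(values), max(values), sum(values) / len(values)

# Keep the whole result as a single tuple...
result = min_max_mean([3, 1, 4, 1, 5])

# ...or unpack it into separate names in one assignment.
lowest, highest, mean = min_max_mean([3, 1, 4, 1, 5])

print(type(result).__name__, result)
print(lowest, highest, mean)
```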
Take some time to try the above examples in both the Python3 and R interpreters on the lab machines or on your laptops. It's worth understanding the basics of the languages you will be using this semester. We will practice the basics of Python and R for the next week to set up a foundation for the rest of the course.
At the bottom of most class notes are practice problems that you should try. See if you can complete the following: