What is Data Science?

You can find lots of definitions of data science out there. I tend to like this one:

Data science combines the scientific method, math and statistics, specialized programming, advanced analytics, machine learning, and even storytelling to uncover and explain insights buried in data ... in order to make decisions based on the data.

The text before the ellipsis is adapted from IBM's definition of data science.

The text after the ellipsis comes from your USNA CS Department. We created this major so that you will graduate as a data-driven, decision-making midshipman. That's one of the foundational goals of DS at USNA: not just to find insights, but to make decisions too.

If it sounds like DS has a lot of different components, that's because it does! You can study math and statistics in isolation, and you can study computer programming in isolation, but that doesn't mean you'll be ready to "do data science". Many aspects of programming are irrelevant to DS, just as there are aspects of mathematics that we don't need in DS. To complicate matters further, DS requires specific skills in stats and programming that you won't naturally pick up in the standard classes. Even the best mathematical programmer may have no clue how to actually STORE the data they need for DS. We quickly find ourselves needing specialized technology like cloud computing and databases, things that a programmer might not normally learn.

Data Science is thus an interdisciplinary field of study with a technical bent. The fact of the matter is that you won't get too far without computer programming. Your DS major and this class both begin with that programming foundation.

The Data Science Pipeline

A helpful way to think about DS is to characterize it with what people call the DS life cycle or the DS pipeline. This refers to the major steps required to move from "a bunch of messy data" to "insights and prediction". One way of defining the pipeline is with these major steps:

  1. Data Acquisition
  2. Data Storage
  3. Data Processing (or Cleaning)
  4. Data Analysis
  5. Data Visualization, Interface, and Communication

We'll refer back to these steps in the pipeline as the class progresses. Some labs will have a direct mapping to one step, while others might be a combination of several. You'll of course need the entire DS major to become adept at all of them!
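
To make the five steps a little more concrete, here is a toy sketch of a pipeline in Python. Everything in it is an assumption made up for illustration: the readings are invented, the function names (acquire, store, clean, analyze, report) are hypothetical, and a real project would pull data from an actual source and store it in a file or database rather than a list. Don't worry about understanding every line yet; just notice how each function lines up with one step.

```python
# A toy walk through the five pipeline steps.
# All data and function names here are made up for illustration.
from statistics import mean


def acquire():
    # 1. Data Acquisition: hard-coded raw readings stand in for a real source.
    return [" 72", "68 ", "n/a", "75", ""]


def store(raw):
    # 2. Data Storage: keep a copy in a list
    #    (a real project might use a file or a database).
    return list(raw)


def clean(stored):
    # 3. Data Processing: strip whitespace and drop entries
    #    that are not whole numbers.
    return [int(s.strip()) for s in stored if s.strip().isdigit()]


def analyze(values):
    # 4. Data Analysis: compute a simple average.
    return mean(values)


def report(average, count):
    # 5. Communication: turn the result into a sentence a person can read.
    return f"Average of {count} valid readings: {average:.1f}"


readings = clean(store(acquire()))
summary = report(analyze(readings), len(readings))
print(summary)
```

Notice that two of the five made-up readings ("n/a" and the empty one) get thrown away in the cleaning step. Messy data like that is the norm, which is why cleaning gets its own step in the pipeline.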

What is Python?

READING: Chapter 1 of your textbook down to and including "Writing a Program".

We will discuss from the reading:

Programming with Python

If we have time, today also introduces you to your first Python program! What is Python? What is a program? We'll introduce you to the "programming environment" and show you a basic input/output program. We'll illustrate how to print to the screen, read from the user, and store the user's input in a variable. These are basic input/output techniques that you'll use in most of your future programs.

We will specifically discuss and learn about:

You should be able to understand this program.

A simple output and input program.

# Output and input from the user.
print("Hello friend!")
name = input("What is your name? ")
print("Hello " + name)
print("My name is Data Devourerer.")