Introductions

Course Logistics

Who is your instructor? Pick a section leader. Role of the website. Course Policy and CS Department Honor Policy. Computing resources and Python tools.

What is this course about?

This course is an introduction to data science (DS) with most of its emphasis on an introduction to computer programming for the DS ecosystem. It is the first course in the Data Science major. It's very difficult to actually "do DS" without computer programming skills, so we begin by teaching that foundation here. Later courses will focus more on analysis, statistics, and modeling.

The course assumes students have no knowledge of computer programming, and teaches the Python language. Students who want to learn programming but are not DS majors should probably seek other avenues. SI286 is great for general interest in Python, and IC210/SI204 are great for those wishing to learn C/C++.

It's about building the computer programming foundation for Data Science!

Word of Advice

Programming is a creative process, in which you construct a model world inside the computer that interacts with the real world via keyboards, mice, monitors, speakers, network connections, and so forth. It's very rewarding and exciting; you will never be bored!

However, it can also be frustrating. Most of the time, your code won't work on your first attempt; you will need to debug it, figuring out where you went wrong. Unfortunately, the debugging probably won't be finished on your first try either. Your program may finally work only after many attempts.
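To make that debugging cycle concrete, here is a small sketch of what a first attempt and its fix might look like. The function names and scores are hypothetical, made up purely for illustration; they are not taken from any course assignment.

```python
# Hypothetical first attempt at averaging three quiz scores.
def average_first_try(a, b, c):
    # Bug: division binds tighter than addition, so only c is divided by 3.
    return a + b + c / 3


# After debugging: parentheses force the sum to be computed first.
def average_fixed(a, b, c):
    return (a + b + c) / 3


print(average_first_try(80, 90, 100))  # about 203.33, clearly not an average
print(average_fixed(80, 90, 100))      # 90.0
```

The first version runs without any error message, which is part of what makes debugging hard: you often find the mistake only by checking the output against what you expected.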

It's just hard to write a correct program. Some words of advice:

Of course, we will try our best so that the level of frustration stays manageable. We are going to start slow.

IMPORTANT: If you think you're too frustrated, you must seek help. It's your responsibility to seek help. It's our responsibility to help you with all our might when you seek help. Your instructor won't usually know whether you're desperate or not.

Why Python?


[Image: comic from xkcd]

This is the million-dollar question. Why Python? Or, asked another way, why not Java or C++? You will run into many people who try to convince you that one language is better than another. If someone makes a general claim like "The X language is better than Y at all things", you should walk away. No language reigns supreme at all things.

Programming languages are sometimes like tools: different languages suit different tasks. Python is great for short-ish data processing programs. Java is great for large collaborative projects. C++ is great for speed.
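As a rough illustration of what a "short-ish" data processing program can look like in Python, here is a minimal sketch that tallies survey answers. The survey data is invented for this example; only `collections.Counter` from Python's standard library is used.

```python
from collections import Counter

# Hypothetical data: the favorite language reported by each student in a small survey.
responses = ["Python", "Java", "Python", "C++", "Python", "Java"]

# Counter tallies how many times each distinct answer appears.
counts = Counter(responses)

# most_common() lists the answers from most to least frequent.
for language, count in counts.most_common():
    print(language, count)
```

Don't worry if the details are unfamiliar; the point is only that a handful of lines is enough to summarize a small data set.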

But languages are also sometimes like culture. People speak Portuguese in Brazil because that's what they grew up with. That doesn't mean Portuguese is better than Japanese for communication; it's just what they prefer culturally. Don't get caught up in language wars. This class uses Python because it's just plain popular in the data science community, as well as in the world in general.

Python does have nice properties, and we did of course intentionally pick it for Data Science.

What is Data Science?

We probably won't have much time today. Go see the next lecture!