Installing Mamba

The most common languages for processing data are Python and R. We’re going to be using Python. Python has a lot of problems. It’s slow and it’s undisciplined, making it poorly suited for large software projects. However, it’s fast for the programmer, and we can sidestep the “undisciplined” problem by noting that most data science projects don’t require that much programming - it’s just a little bit of programming applied to large data sets.

The fact that it’s slow is still a problem when processing those large data sets. The scientific computing community has gotten around this by writing a large number of very impressive libraries for data science that are written in C, but which provide an API for Python. This lets the programmer direct what will happen in an easy-to-code language, while having all the computationally expensive parts happen in C.
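To make the speed gap concrete, here is a minimal sketch that uses only the standard library. It assumes the standard CPython interpreter, where the built-in sum is implemented in C; the function name python_loop_sum and the list size are made up for illustration. The same idea, scaled up, is what the data science libraries do.

```python
import timeit


def python_loop_sum(values):
    """Add up the values one at a time in pure Python."""
    total = 0
    for v in values:
        total += v
    return total


# One million integers: large enough that the difference is easy to measure.
data = list(range(1_000_000))

# Time five runs of each approach. Exact numbers depend on the machine.
loop_seconds = timeit.timeit(lambda: python_loop_sum(data), number=5)
builtin_seconds = timeit.timeit(lambda: sum(data), number=5)

print(f"pure-Python loop: {loop_seconds:.3f} s")
print(f"built-in sum (C): {builtin_seconds:.3f} s")
```

Both approaches return the same answer; the only difference is where the loop runs.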

This means that the “right” way to program for data science in Python is by using the libraries as much as possible, and doing as little as possible in pure Python - so we have to know how these libraries work (or at least, know how to look up how these libraries work - Google and Stack Overflow will be key throughout this course!). The libraries we’ll be using include:

pandas: useful for loading and manipulating datasets
numpy: fast n-dimensional arrays and linear algebra
scipy: data structures and algorithms for scientific computing
matplotlib: MATLAB-like plotting
sklearn: machine learning algorithms (the package is called scikit-learn, but you import it as sklearn)
jupyter: useful for making reports that include text, images, code, and graphs
plotly: useful for making plots (and interactive plots!) in a web environment (like a jupyter notebook)

Probably, we’ll come up with some others, too.

The first step…

…is to install all this business. Because all the libraries need to be compatible, it’s common to use a package manager, which keeps track of all those compatibilities for you and installs the right versions of the right libraries. The most common are Anaconda (built around the conda package manager) and Mamba. Anaconda is more common; Mamba is a more recent, faster reimplementation of conda that is meant to behave exactly like it. I recommend Mamba.

Go here and install Mambaforge. This will involve downloading and running an install script, appropriate for the architecture of the machine you are installing on. Choose “yes” for all the questions.

After you’ve done that (on Linux), run source ~/.bashrc to execute the initialization code the installer added to your .bashrc, so that your shell uses Mamba’s Python. You should see (base) appear on the left side of your Linux prompt.
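If you want a second check beyond the (base) prompt, a short script can report which Python is actually running. This is a sketch, not part of the official install steps: describe_python is a made-up helper name, and it assumes that activation sets the CONDA_DEFAULT_ENV environment variable, which is how conda-style activation normally behaves.

```python
import os
import sys


def describe_python(environ=os.environ):
    """Report which interpreter is running and which environment is active."""
    return {
        # Full path of the python binary; it should sit inside your Mamba install.
        "executable": sys.executable,
        # Root directory of the running Python installation.
        "prefix": sys.prefix,
        # Name of the active environment, or None if nothing is activated.
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_python().items():
    print(f"{key}: {value}")
```

If conda_env prints None, or the executable is something like /usr/bin/python3, the initialization code has not run in this shell yet.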

Now, install libraries: mamba install pandas numpy scipy scikit-learn plotly matplotlib jupyter
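Once the install finishes, you can confirm that Python can find everything. The sketch below uses only the standard library; the helper name missing_modules is made up, and the import names in COURSE_MODULES are assumptions (in particular, scikit-learn is imported as sklearn, and jupyter_core is used here as a stand-in for "Jupyter is installed").

```python
import importlib.util

# Import names, which are not always the same as the package names you install.
COURSE_MODULES = [
    "pandas",
    "numpy",
    "scipy",
    "matplotlib",
    "sklearn",       # installed as scikit-learn
    "jupyter_core",  # assumed to come along with jupyter
    "plotly",
]


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(COURSE_MODULES)
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All course libraries were found.")
```

Anything reported as not found can be installed with another mamba install.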

You’ll want to do this on all the machines you intend to work on. At minimum, this will include:

A laptop
ssh.cs.usna.edu

Install on the ssh box by first ssh-ing in. You then need to download the install script, without a browser! How to do it?! For this, you can use wget or curl, both tools that a CS major should be (or get) comfortable with. wget makes a GET request and saves the result to disk. curl (with no extra flags) gets the contents and spits them out onto stdout. So, after clicking through the browser on our local machine to find the web address of the installer, we can do one of these two:

wget --no-check-certificate https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh, then make it executable (chmod +x Mambaforge-Linux-x86_64.sh), then run it (./Mambaforge-Linux-x86_64.sh). The --no-check-certificate flag is necessary if running on the mission network because of the man-in-the-middle attack ITSD runs.
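curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh, then run it with bash Mambaforge-Linux-x86_64.sh. The -L tells curl to follow GitHub’s redirect to the actual file, and -O saves it to disk under its remote name rather than printing it to stdout. Piping the download straight into bash is not a good fit here, because the installer asks questions interactively and expects to be run from a file. (On the mission network, curl’s equivalent of --no-check-certificate is -k.)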
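For comparison, the same download-and-make-executable step can be sketched in Python with only the standard library. The URL is the one given above; the function name download_installer is made up for illustration. urllib follows redirects on its own, and this sketch does not disable certificate checking, so on the mission network the wget route is likely still the easier one.

```python
import os
import shutil
import stat
import urllib.request

# URL taken from the instructions above.
INSTALLER_URL = (
    "https://github.com/conda-forge/miniforge/releases/latest/download/"
    "Mambaforge-Linux-x86_64.sh"
)


def download_installer(url, dest):
    """Save the contents of url to dest and mark the file executable."""
    # urlopen follows HTTP redirects; copyfileobj streams the body to disk.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    # Equivalent of chmod u+x: add the owner-execute bit to the current mode.
    mode = os.stat(dest).st_mode
    os.chmod(dest, mode | stat.S_IXUSR)
    return dest
```

Calling download_installer(INSTALLER_URL, "Mambaforge-Linux-x86_64.sh") should leave an executable script in the current directory, which you then run from the shell as before.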

To start Jupyter…

…run jupyter notebook. You’ll notice your browser gets all excited and pops up a new window. A Jupyter notebook allows for text in Markdown format, Python code, images, etc. It’s a great format for telling a story that involves code, graphs, and so on. There are a whole bunch of Jupyter tutorials out there that are worth doing. I like this one, which also introduces pandas and some matplotlib.
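It can help to know that a notebook is just a JSON file holding a list of cells. The sketch below builds a two-cell notebook by hand with the standard library, assuming version 4 of the notebook format; make_notebook and save_notebook are made-up helper names, and in practice you would create notebooks from the Jupyter interface rather than like this.

```python
import json


def make_notebook(markdown_text, code_text):
    """Build a minimal notebook with one Markdown cell and one code cell."""
    return {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": markdown_text},
            {
                "cell_type": "code",
                "execution_count": None,  # the cell has not been run yet
                "metadata": {},
                "outputs": [],
                "source": code_text,
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def save_notebook(notebook, path):
    """Write the notebook dictionary to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)


notebook = make_notebook("# My first report\nSome *Markdown* text.", "print(1 + 1)")
print(len(notebook["cells"]), "cells")
```

Saving this with save_notebook(notebook, "first_report.ipynb") and opening the file from the Jupyter window should show the text cell followed by a runnable code cell.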