Installing Mamba
The most common languages for processing data are Python and R. We’re going to be using Python. Python has a lot of problems. It’s slow and it’s undisciplined, making it poorly suited for large software projects. However, it’s fast for the programmer, and we can sidestep the “undisciplined” problem by noting that most data science projects don’t require that much programming - it’s just a little bit of programming applied to large data sets.
The fact that it’s slow is still a problem when processing those large data sets. The scientific computing community has gotten around this by writing a large number of very impressive libraries for data science that are written in C, but which provide an API for Python. This lets the programmer direct what will happen in an easy-to-code language, while having all the computationally expensive parts happen in C.
This means that the “right” way to program for data science in Python is by using the libraries as much as possible, and doing as little as possible in pure Python - so we have to know how these libraries work (or at least, know how to look up how these libraries work - Google and Stack Overflow will be key throughout this course!). The libraries we’ll be using include:
- pandas: useful for loading and manipulating datasets
- numpy: used for linear algebra
- scipy: data structures and algorithms for scientific computing
- matplotlib: matlab-like plotting
- sklearn: machine learning algorithms
- jupyter: useful for making reports that include text, images, code, and graphs.
- plotly: useful for making plots (and interactive plots!) in a web environment (like a jupyter notebook)
Probably, we’ll come up with some others, too.
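To make the “let the library do the work” point concrete, here is a minimal sketch that computes the same sum of squares twice: once with a pure Python loop, and once with a single numpy call. The function names are made up for this illustration, and the timings will vary from machine to machine, so treat it as a demonstration rather than a benchmark.

```python
import time

import numpy as np


def sum_of_squares_pure(values):
    """Pure Python: the interpreter executes every multiply and add itself."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(array):
    """numpy: one call, and the loop over the data runs in compiled code."""
    return float(np.dot(array, array))


n = 1_000_000
data = np.arange(n, dtype=np.float64)
as_list = data.tolist()  # the same numbers as a plain Python list

start = time.perf_counter()
slow = sum_of_squares_pure(as_list)
pure_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = sum_of_squares_numpy(data)
numpy_seconds = time.perf_counter() - start

# The two answers agree up to floating-point rounding.
print(f"pure Python: {pure_seconds:.4f} s, numpy: {numpy_seconds:.4f} s")
```

The two functions do the same arithmetic; the only difference is where the loop runs.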
The first step…
…is to install all this business. Because all the libraries need to be compatible, it’s common to use a package manager, which keeps track of all those compatibilities for you and installs the right versions of the right libraries. The two most common are Anaconda and Mamba. Anaconda is more widely used; Mamba is a more recent, faster alternative that is meant to behave exactly like Anaconda. I recommend Mamba.
Go here and install Mambaforge. This will involve downloading and running an install script, appropriate for the architecture of the machine you are installing on. Choose “yes” for all the questions.
After you’ve done that (on Linux), run source ~/.bashrc
so that the initialization code the installer added to your .bashrc takes
effect and your shell uses Mamba’s Python. You should see (base) appear on
the left side of your Linux prompt.
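If you want a second opinion beyond the (base) in the prompt, a short sketch like the following asks Python itself where it lives. The helper name describe_python is made up for this example. The working assumption is that an activated Mamba environment sets the CONDA_DEFAULT_ENV environment variable, and that the executable path points somewhere inside the directory you chose during installation.

```python
import os
import sys


def describe_python():
    """Collect a few facts about the interpreter that is currently running."""
    return {
        # Full path to the python binary; should sit inside the Mamba install.
        "executable": sys.executable,
        # Root directory of the environment this interpreter belongs to.
        "prefix": sys.prefix,
        # Name of the active environment, or None if no environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_python()
for key, value in info.items():
    print(f"{key}: {value}")
```

If conda_env prints None, or the executable is something like /usr/bin/python3, the initialization code probably has not run in this shell yet.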
Now, install libraries:
mamba install numpy scipy pandas scikit-learn plotly matplotlib jupyter
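Once the install finishes, a quick way to confirm that everything landed is to try importing each library. The sketch below does that and reports a version for each one; check_imports is a made-up helper name, and the list of import names is my assumption about how each package is imported (note that scikit-learn is imported as sklearn, and jupyter is checked here through its jupyter_core module).

```python
import importlib


def check_imports(import_names):
    """Try to import each module; map its name to a version string or None."""
    report = {}
    for name in import_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None  # not installed, or not importable
        else:
            # Most of these libraries expose __version__, but not all modules do.
            report[name] = getattr(module, "__version__", "unknown")
    return report


# Import names, which are not always the same as the install names.
LIBRARIES = ["numpy", "scipy", "pandas", "sklearn", "plotly",
             "matplotlib", "jupyter_core"]

for name, version in check_imports(LIBRARIES).items():
    status = version if version is not None else "MISSING"
    print(f"{name:>12}: {status}")
```

Anything reported as MISSING can be installed with another mamba install.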
You’ll want to do this on all the machines you intend to work on. At minimum, this will include:
- A laptop
- ssh.cs.usna.edu
Install on the ssh box by first ssh-ing in. You then need to download
the install script, without a browser! How to do it?! For this, you can
use wget, or curl, both tools that a CS major
should be (or get) comfortable with. wget makes a GET
request and saves the result to disk. curl (with no extra
flags) gets the contents, and spits them out onto stdout. So, after
clicking through the browser on our local machine to find the web
address of the installer, we can do one of these two:
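- wget --no-check-certificate https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh, then make it executable (chmod +x Mambaforge-Linux-x86_64.sh), then run it (./Mambaforge-Linux-x86_64.sh). The --no-check-certificate flag is necessary if running on the mission network because of the man-in-the-middle attack ITSD runs.
- curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh, then run it with bash Mambaforge-Linux-x86_64.sh. The -L flag tells curl to follow GitHub’s redirect to the actual file, and -O saves the result to disk instead of printing it to stdout. Piping the download straight into bash does not work here, because the installer expects to be run from a .sh file on disk.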
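For the curious, the difference between the two tools can be sketched in Python using only the standard library’s urllib. This is an illustration of the idea, not a replacement for wget or curl: fetch_bytes and fetch_to_file are made-up helper names, and the sketch does nothing about the certificate issue on the mission network.

```python
import shutil
import urllib.request


def fetch_bytes(url):
    """curl-like: make a GET request and hand back the body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_to_file(url, path):
    """wget-like: make a GET request and stream the body into a file."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        # Copy in chunks so a large installer never has to fit in memory.
        shutil.copyfileobj(response, out)
    return path


# Example use (hypothetical local file name; requires network access):
# fetch_to_file(
#     "https://github.com/conda-forge/miniforge/releases/latest/download/"
#     "Mambaforge-Linux-x86_64.sh",
#     "Mambaforge-Linux-x86_64.sh",
# )
```

Unlike curl with no flags, urlopen follows HTTP redirects on its own, which is why there is no equivalent of -L in the sketch.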
To start Jupyter…
…run jupyter notebook. You’ll notice your browser gets
all excited and pops up a new window. A Jupyter notebook allows for text
in Markdown format, Python code, images, etc. It’s a great format for
telling a story that involves code/graphs, etc. There are a whole bunch
of Jupyter tutorials out there that are worth doing. I like
this one, which also introduces pandas and some matplotlib.
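If you want something to type into your first notebook cell before working through a tutorial, here is a small sketch using pandas. The numbers are made up purely for illustration, and in a notebook you can end the cell with just df instead of print(df) to get a nicely formatted table.

```python
import pandas as pd

# A tiny made-up dataset, just to have something to look at.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})

# Column arithmetic happens inside pandas, with no explicit Python loop.
df["ratio"] = df["y"] / df["x"]

print(df)
```

Run the cell with Shift+Enter; the output appears directly underneath it.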