Pandas Library

Pandas is a very popular library that manages column-indexed data, like what you might think of as spreadhseets. You will often come across data in comma-separated value (CSV) or tab-separated value (TSV) formats. When Prof Chambers spent a summer at Google, many of the applications he worked with were ingesting TSV-formatted data. We've already had a couple labs in this class using this format. The Pandas library make reading and processing CSV/TSV data very very easy. In fact, you'll wish I allowed you to use this earlier in the semester :)

Pandas converts your data into a DataFrame object. The DataFrame class contains many useful functions that make common operations simple. Our goal is to show you how to do a few popular operations ... but this is by no means a full overview.

Reading

Not required, but this is a gentle introduction to Pandas, and this is a decent overview of column selection. And of course, the official Pandas homepage.

Overview Notes

Reading a CSV file from disk into a DataFrame object:

import pandas as pd
csv = pd.read_csv('mydata.csv')      

Print a friendly summary of the DataFrame data:

print(csv)

Select a subset of rows:

subset = csv[5:10]
print(subset)

Select just one column from many:

column = csv['COUNT']
print(column)
# Convert this column of counts to a normal List
L = column.tolist()
print(L)

One common operation is to read in a big CSV file, and then select just the rows you care about. Pandas makes this VERY easy. Let's say you only want the rows where the count is greater than 10 in a column called 'COUNT'. Here it is:

bigs = csv[ csv['COUNT'] > 10 ]
print(bigs)

Easy right? Why does this work? The operation csv['COUNT'] returns a single column as a Series object. It's a one-dimensional array of all values in that column. This Series object then allows you to make an arbitrary comparison to all of its values, and it will return a new Series object of True/False values for whether each value satisfied the comparison or not. In this example...

bools = csv['COUNT'] > 10
print(bools)
...we check if each row's "COUNT" column is greater than 10. This prints out:
True
True
False
True
False
...
Thus when you combine this Series of bools with a DataFrame, it returns to you all the rows that are True.

Let's put it all together now. One of the most common operations you will do is to read in a CSV file, and filter it down to the rows and columns that you care about. Let's do it all at once with an example:

# Read a CSV file from disk
csv = pd.read_csv('mydata.csv')      

# I only want the first 200 rows.
csv = csv[0:200]

# I also only want the rows where the CATEGORY column is 'retail'	
retail = csv[ csv['CATEGORY'] == 'retail' ]

# Now I just want 2 columns for these retail businesses
retail = retail[ ['NAME','NET INCOME'] ]	

# Print the results!	
print(retail)
	
# Finally, convert those two columns to two lists if you need to use them elsewhere.
names = retail['NAME'].tolist()
incomes = retail['INCOME'].tolist()	

Below is part of the Lab 5 Lab with Pandas

You should be impressed with how easy the lab is now. If you want to run this yourself, save the life expectancy data file into your directory. Then run this to see some of it:

import plotly.graph_objects as go
import pandas as pd

# Make one plot for these 3 countries.
countries = [ 'India', 'China', 'Philippines' ]

# PANDAS
csv = pd.read_csv('life-expectancies.csv')
YEARS = csv.columns.tolist()[3:]
# The countries we want.
result = csv[ csv['Country Name'].isin(countries) ]
# The columns we want
result = result[YEARS]

# Key variables for computing the average line later.
sums = [0]*len(YEARS)
N = 0

# Loop over our country rows!
fig = go.Figure()
for index,row in result.iterrows():
  sums += row   # row is a Series -- allows arithmetic with sequences of numbers
        
  fig.add_traces(go.Scatter(x=YEARS,y=row,mode='lines+markers',name=countries[N]))
  N += 1

# Compute average line.
average = sums / N
fig.add_traces(go.Scatter(x=YEARS,y=average,mode='lines+markers',name='AVERAGE'))

# Draw it!
fig.update_layout(title='Life Expectancies at Birth',
                  xaxis_title="Year of Birth",
                  yaxis_title="Life Expectancy",
                  showlegend=True,
                  plot_bgcolor='white')
fig.show()