Pandas Library

Today we continue with an overview of Pandas, the popular library that manages column-indexed data, like what you might think of as spreadhseets.

Reading

As posted last time, these are not required, but this is a gentle introduction to Pandas, and this is a decent overview of column selection. And of course, the official Pandas homepage.

Review Notes

Reading a CSV file from disk into a DataFrame object:

import pandas as pd
csv = pd.read_csv('mydata.csv')

Select one column:

column = csv['COUNT']

Select multiple columns:

columns = csv[ ['COUNT','CATEGORY'] ]

Select a subset of rows:

subset = csv[5:10]

Select just one row's values:

subset = csv.iloc[5].values

Select rows that match a condition based on a column's value:

bigs   = csv[ csv['COUNT'] > 10 ]
retail = csv[ csv['CATEGORY'] == 'retail' ]
retail = csv[ csv['CATEGORY'].isin(['retail','service','auto']) ]

Iterate over rows:

for i,row in csv.iterrows():
  print(row) # row is a Series

A Simple Task -- Makes Your Life Easy!

Here is a really basic example. Pandas isn't just useful for complicated things. Imagine you have a simple CSV file of character names and their races (Lord of the Rings):

"name","race"
"Aragorn II","Man"
"Arwen","Elf"
"Elrond","Elf"
"Celebrían","Elf"
...

You want to pull out all the elf names for later use. How would you do that? Just like this:

import pandas as pd

csv = pd.read_csv('lotr.csv')
elves = csv[ csv['race'] == 'Elf' ]
names = elves['name']

for n in names:
  print(n)

The above even works when the CSV format changes, like swapping column order or adding more columns! The old way looked like the following, inflexible with CSV changes:

fh = open('lotr.csv')
names = []
for line in fh:
    parts = line.split(',')
    if parts[1] == 'Elf':
        names.append(parts[0])

for n in names:
  print(n)

A Slightly More Complicated Task

Let's use the Marvel character data from last HW. We can use Pandas to filter by multiple conditions. For instance, let's retrieve all characters that were introduced in the 1960's and who have blue eyes:

page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
1678,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043,Aug-62,1962
7139,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360,Mar-41,1941
64786,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061,Oct-74,1974

The Pandas program isn't too complicated:

import pandas as pd

csv = pd.read_csv('marvel-wikia-data.csv')
csv = csv[ (1970<=csv['Year']) & (csv['Year']<=1979) ]   # two conditions at once
blues = csv[ csv['EYE'] == 'Blue Eyes' ]
names = blues['name']

for n in names:
  print(n)

Select and Sum -- USA Population by State

Quick selection and math with Pandas. We have a CSV of our 50 USA states and their populations:

STATE,POPESTIMATE2019,lat,long
Alabama,4903185,32.377716,-86.300568
Alaska,731545,58.301598,-134.420212
Arizona,7278717,33.448143,-112.096962
Arkansas,3017804,34.746613,-92.288986

Let's find the population of all southern states, defined by anything below 40 latitude.

import pandas as pd

csv = pd.read_csv('statepops.csv')
southern = csv[ csv['lat'] < 35 ]
pops = southern['POPESTIMATE2019']
total = sum(pops)   # sum works with Pandas Series objects!
			     
print('Total southern population:', total)

Check out this shortened version!

csv = pd.read_csv('statepops.csv')
pops = csv[ csv['lat'] < 35 ]['POPESTIMATE2019']
print('Total southern population:', sum(pops))