Today we continue with an overview of Pandas, the popular library that manages column-indexed data, like what you might think of as spreadhseets.
As posted last time, these are not required, but this is a gentle introduction to Pandas, and this is a decent overview of column selection. And of course, the official Pandas homepage.
Reading a CSV file from disk into a DataFrame object:
import pandas as pd
csv = pd.read_csv('mydata.csv')
Select one column:
column = csv['COUNT']
Select multiple columns:
columns = csv[ ['COUNT','CATEGORY'] ]
Select a subset of rows:
subset = csv[5:10]
Select just one row's values:
subset = csv.iloc[5].values
Select rows that match a condition based on a column's value:
bigs = csv[ csv['COUNT'] > 10 ]
retail = csv[ csv['CATEGORY'] == 'retail' ]
retail = csv[ csv['CATEGORY'].isin(['retail','service','auto']) ]
Iterate over rows:
for i,row in csv.iterrows():
print(row) # row is a Series
Here is a really basic example. Pandas isn't just useful for complicated things. Imagine you have a simple CSV file of character names and their races (Lord of the Rings):
"name","race" "Aragorn II","Man" "Arwen","Elf" "Elrond","Elf" "CelebrÃan","Elf" ...
You want to pull out all the elf names for later use. How would you do that? Just like this:
import pandas as pd
csv = pd.read_csv('lotr.csv')
elves = csv[ csv['race'] == 'Elf' ]
names = elves['name']
for n in names:
print(n)
The above even works when the CSV format changes, like swapping column order or adding more columns! The old way looked like the following, inflexible with CSV changes:
fh = open('lotr.csv')
names = []
for line in fh:
parts = line.split(',')
if parts[1] == 'Elf':
names.append(parts[0])
for n in names:
print(n)
Let's use the Marvel character data from last HW. We can use Pandas to filter by multiple conditions. For instance, let's retrieve all characters that were introduced in the 1960's and who have blue eyes:
page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year 1678,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043,Aug-62,1962 7139,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360,Mar-41,1941 64786,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061,Oct-74,1974
The Pandas program isn't too complicated:
import pandas as pd
csv = pd.read_csv('marvel-wikia-data.csv')
csv = csv[ (1970<=csv['Year']) & (csv['Year']<=1979) ] # two conditions at once
blues = csv[ csv['EYE'] == 'Blue Eyes' ]
names = blues['name']
for n in names:
print(n)
Quick selection and math with Pandas. We have a CSV of our 50 USA states and their populations:
STATE,POPESTIMATE2019,lat,long Alabama,4903185,32.377716,-86.300568 Alaska,731545,58.301598,-134.420212 Arizona,7278717,33.448143,-112.096962 Arkansas,3017804,34.746613,-92.288986
Let's find the population of all southern states, defined by anything below 40 latitude.
import pandas as pd
csv = pd.read_csv('statepops.csv')
southern = csv[ csv['lat'] < 35 ]
pops = southern['POPESTIMATE2019']
total = sum(pops) # sum works with Pandas Series objects!
print('Total southern population:', total)
Check out this shortened version!
csv = pd.read_csv('statepops.csv')
pops = csv[ csv['lat'] < 35 ]['POPESTIMATE2019']
print('Total southern population:', sum(pops))