We've shown the basics of using Pandas. Today we show more practical hands-on examples and give you a chance to experiment during class. Go back and see the prior lectures for helpful lists of its core functionality.
Sometimes you need to calculate things for your rows, for instance, combining two rows into one. You can create a new DataFrame column very easily.
Here is a scenario. Take your McDonalds HW. Each item on their menu has Calories and Calories.from.Fat. What if we want to find the menu item with the highest fat percentage? We want to divide those two numbers and save the result in each row. This is a breeze:
csv = pd.read_csv('mcdonalds.csv')
csv['fat%'] = csv['Calories.from.Fat']/csv['Calories']
Once you have your new column, you can just lookup the max and print it:
# Find the max
i = csv['fat%'].idxmax()
maxrow = csv.iloc[i]
# Print the max!
print(maxrow['Item'], 'has the most fat calories with', maxrow['Calories'], 'total calories and', maxrow['Calories.from.Fat'], 'calories from fat.')
We can extend the above example to sorting by fat % instead of just finding the one max item. This is also very easy with the Pandas sort_values() function:
result = csv.sort_values(by='fat%', ascending=False)
# Print Top 5
for i in range(5):
row = result.iloc[i]
print(row['Item'], 'at', row['fat%'])
Here is a really basic example. Pandas isn't just useful for complicated things. Imagine you have a simple CSV file of character names and their races (Lord of the Rings):
"name","race" "Aragorn II","Man" "Arwen","Elf" "Elrond","Elf" "CelebrÃan","Elf" ...
You want to pull out all the elf names for later use. How would you do that? Just like this:
import pandas as pd
csv = pd.read_csv('lotr.csv')
elves = csv[ csv['race'] == 'Elf' ]
names = elves['name']
for n in names:
print(n)
The above even works when the CSV format changes, like swapping column order or adding more columns! The old way looked like the following, inflexible with CSV changes:
fh = open('lotr.csv')
names = []
for line in fh:
parts = line.split(',')
if parts[1] == 'Elf':
names.append(parts[0])
for n in names:
print(n)
Let's use the Marvel character data from last HW. We can use Pandas to filter by multiple conditions. For instance, let's retrieve all characters that were introduced in the 1960's and who have blue eyes:
page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year 1678,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043,Aug-62,1962 7139,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360,Mar-41,1941 64786,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061,Oct-74,1974
The Pandas program isn't too complicated:
import pandas as pd
csv = pd.read_csv('marvel-wikia-data.csv')
csv = csv[ (1970<=csv['Year']) & (csv['Year']<=1979) ] # two conditions at once
blues = csv[ csv['EYE'] == 'Blue Eyes' ]
names = blues['name']
for n in names:
print(n)
Quick selection and math with Pandas. We have a CSV of our 50 USA states and their populations:
STATE,POPESTIMATE2019,lat,long Alabama,4903185,32.377716,-86.300568 Alaska,731545,58.301598,-134.420212 Arizona,7278717,33.448143,-112.096962 Arkansas,3017804,34.746613,-92.288986
Now find the population of all southern states, defined by anything below 40 latitude.
import pandas as pd
csv = pd.read_csv('statepops.csv')
southern = csv[ csv['lat'] < 35 ]
pops = southern['POPESTIMATE2019']
total = sum(pops) # sum works with Pandas Series objects!
print('Total southern population:', total)
Check out this shortened version!
csv = pd.read_csv('statepops.csv')
pops = csv[ csv['lat'] < 35 ]['POPESTIMATE2019']
print('Total southern population:', sum(pops))
Building on the above solution. Can we find the state with the maximum population?
csv = pd.read_csv('statepops.csv')
As = csv[ csv['STATE'].str.startswith('A') ]
pops = As['POPESTIMATE2019']
# Two options
themax = max(pops)
themax = pops.max() # uses Panda's own function
print('Max As population:', themax)
This is fine, but we don't know which state had the max population. You can just go back and find the row with that max:
# Option 1 (DataFrame with one row)
maxrow = csv[ csv['POPESTIMATE2019']==themax ]
state = maxrow['STATE'].iloc[0]
print(state, 'has the biggest population at', themax)
# Option 2 (Series of the entire column)
i = As['POPESTIMATE2019'].idxmax()
maxrow = csv.iloc[i]
state = maxrow['STATE']
print(state, 'has the biggest population at', themax)