Pandas Exercises

We've shown the basics of using Pandas. Today we show more practical hands-on examples and give you a chance to experiment during class. Go back and see the prior lectures for helpful lists of its core functionality.

Today in Class

  1. Creating a new column in a DataFrame
  2. Sorting rows by column
  3. & and | operators
  4. Several exercises

Creating a new column

Sometimes you need to calculate things for your rows, for instance, combining two rows into one. You can create a new DataFrame column very easily.

Here is a scenario. Take your McDonalds HW. Each item on their menu has Calories and Calories.from.Fat. What if we want to find the menu item with the highest fat percentage? We want to divide those two numbers and save the result in each row. This is a breeze:

csv = pd.read_csv('mcdonalds.csv')
csv['fat%'] = csv['Calories.from.Fat']/csv['Calories']

Once you have your new column, you can just lookup the max and print it:

# Find the max
i = csv['fat%'].idxmax()
maxrow = csv.iloc[i]

# Print the max!
print(maxrow['Item'], 'has the most fat calories with', maxrow['Calories'], 'total calories and', maxrow['Calories.from.Fat'], 'calories from fat.')

Sorting the Rows

We can extend the above example to sorting by fat % instead of just finding the one max item. This is also very easy with the Pandas sort_values() function:

result = csv.sort_values(by='fat%', ascending=False)

# Print Top 5
for i in range(5):
    row = result.iloc[i]
    print(row['Item'], 'at', row['fat%'])

A Simple Task -- Makes Your Life Easy!

Here is a really basic example. Pandas isn't just useful for complicated things. Imagine you have a simple CSV file of character names and their races (Lord of the Rings):

"name","race"
"Aragorn II","Man"
"Arwen","Elf"
"Elrond","Elf"
"Celebrían","Elf"
...

You want to pull out all the elf names for later use. How would you do that? Just like this:

import pandas as pd

csv = pd.read_csv('lotr.csv')
elves = csv[ csv['race'] == 'Elf' ]
names = elves['name']

for n in names:
  print(n)

The above even works when the CSV format changes, like swapping column order or adding more columns! The old way looked like the following, inflexible with CSV changes:

fh = open('lotr.csv')
names = []
for line in fh:
    parts = line.split(',')
    if parts[1] == 'Elf':
        names.append(parts[0])

for n in names:
  print(n)

A Slightly More Complicated Task

Let's use the Marvel character data from last HW. We can use Pandas to filter by multiple conditions. For instance, let's retrieve all characters that were introduced in the 1960's and who have blue eyes:

page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
1678,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043,Aug-62,1962
7139,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360,Mar-41,1941
64786,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061,Oct-74,1974

The Pandas program isn't too complicated:

import pandas as pd

csv = pd.read_csv('marvel-wikia-data.csv')
csv = csv[ (1970<=csv['Year']) & (csv['Year']<=1979) ]   # two conditions at once
blues = csv[ csv['EYE'] == 'Blue Eyes' ]
names = blues['name']

for n in names:
  print(n)

Select and Sum -- USA Population by State

Quick selection and math with Pandas. We have a CSV of our 50 USA states and their populations:

STATE,POPESTIMATE2019,lat,long
Alabama,4903185,32.377716,-86.300568
Alaska,731545,58.301598,-134.420212
Arizona,7278717,33.448143,-112.096962
Arkansas,3017804,34.746613,-92.288986

Now find the population of all southern states, defined by anything below 40 latitude.

import pandas as pd

csv = pd.read_csv('statepops.csv')
southern = csv[ csv['lat'] < 35 ]
pops = southern['POPESTIMATE2019']
total = sum(pops)   # sum works with Pandas Series objects!
			     
print('Total southern population:', total)

Check out this shortened version!

csv = pd.read_csv('statepops.csv')
pops = csv[ csv['lat'] < 35 ]['POPESTIMATE2019']
print('Total southern population:', sum(pops))

Find the Max Population of States starting with an A

Building on the above solution. Can we find the state with the maximum population?

csv = pd.read_csv('statepops.csv')
As = csv[ csv['STATE'].str.startswith('A') ]
pops = As['POPESTIMATE2019']

# Two options
themax = max(pops)
themax = pops.max()  # uses Panda's own function

print('Max As population:', themax)

This is fine, but we don't know which state had the max population. You can just go back and find the row with that max:

# Option 1 (DataFrame with one row)
maxrow = csv[ csv['POPESTIMATE2019']==themax ]
state = maxrow['STATE'].iloc[0]
print(state, 'has the biggest population at', themax)

# Option 2 (Series of the entire column)
i = As['POPESTIMATE2019'].idxmax()
maxrow = csv.iloc[i]
state = maxrow['STATE']
print(state, 'has the biggest population at', themax)