Preparing and Cleaning our Data

Imagine that you have a large amount of data, it could have come from a sensor that has been in the field for years, a piece of equipment in the engine room, or from a site like Kaggle. Our goal has always been to be able to understand that data and convert it to some type of knowledge.

By this point you should be starting to realize that the data you receive isn't always in the format you expect, or has some obvious erroneous entries, or other problems that you will have to deal with.

The purpose of this discussion is to consider possible issues you may encounter and ways to overcome them. You, as the Data Scientist, need to make some decisions, do you throw out a row of data that didn't meet your expectations, do you change a value, or do you set some default values. Each of these actions will impact the final results generated from any of your follow on work and you should be able to justify and document any of these transformations that you decide to perform.

A Simple Walkthrough with UFO data

We have been working with the UFO data for the past few weeks, its a fun and interesting topic. You were able to import the data into your MySQL database using the files provided. But what happens when you work with the raw data directly. lets look at the first 10 rows and the header, in the file 10_raw.csv and some quick outputs from loading the data in R.

datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,"This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit",4/27/2004,29.8830556,-97.9411111
10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,"1949 Lackland AFB&#44 TX.  Lights racing across the sky & making 90 degree turns on a dime.",12/16/2005,29.38421,-98.581082
10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,"Green/Orange circular disc over Chester&#44 England",1/21/2008,53.2,-2.916667
10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,"My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I took a different route home",1/17/2004,28.9783333,-96.6458333
10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,"AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise&#44 I was at 50&#44000&#39 in a "clean" aircraft (no ordinan",1/22/2004,21.4180556,-157.8036111
10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,"My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my father if he&#39s still livi",4/27/2007,36.5950000,-82.1888889
10/10/1965 21:00,penarth (uk/wales),,gb,circle,180,about 3 mins,"penarth uk  circle  3mins  stayed 30ft above me for 3 mins slowly moved of and then with the blink of the eye the speed was unreal",2/14/2006,51.434722,-3.18
10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,"A bright orange color changing to reddish color disk/saucer was observed hovering above power transmission lines.",10/2/1999,41.1175000,-73.4083333
10/10/1966 20:00,pell city,al,us,disk,180,3  minutes,"Strobe Lighted disk shape object observed close&#44 at low speeds&#44 and low altitude in Oct 1966 in Pell City Alabama",3/19/2009,33.5861111,-86.2861111
10/10/1966 21:00,live oak,fl,us,disk,120,several minutes,"Saucer zaps energy from powerline as my pregnant mother receives mental signals not to pass info",5/11/2005,30.2947222,-82.9841667
ufo10 = read.table("10_raw.csv", header=TRUE, sep=",")
options(width=1000)
summary(ufo10)
   datetime             city              state             country             shape           duration..seconds. duration..hours.min.   comments         date.posted           latitude       longitude       
 Length:10          Length:10          Length:10          Length:10          Length:10          Min.   :  20       Length:10            Length:10          Length:10          Min.   :21.42   Min.   :-157.804  
 Class :character   Class :character   Class :character   Class :character   Class :character   1st Qu.: 135       Class :character     Class :character   Class :character   1st Qu.:29.51   1st Qu.: -97.617  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 240       Mode  :character     Mode  :character   Mode  :character   Median :31.94   Median : -84.635  
                                                                                                Mean   :1282                                                                  Mean   :35.59   Mean   : -78.194  
                                                                                                3rd Qu.:1125                                                                  3rd Qu.:39.99   3rd Qu.: -75.603  
                                                                                                Max.   :7200                                                                  Max.   :53.20   Max.   :  -2.917  

Looking at the raw file above, do you see anything of concern? Anything that may cause you troubles later? Anything that you may want to change the formating or transform?

Lets expand our discussion to the first 50 rows of the raw UFO data. Take a quick look at 50_raw.csv and some quick outputs from loading the data in R.

datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,"This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit",4/27/2004,29.8830556,-97.9411111
10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,"1949 Lackland AFB&#44 TX.  Lights racing across the sky & making 90 degree turns on a dime.",12/16/2005,29.38421,-98.581082
10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,"Green/Orange circular disc over Chester&#44 England",1/21/2008,53.2,-2.916667
10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,"My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I took a different route home",1/17/2004,28.9783333,-96.6458333
10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,"AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise&#44 I was at 50&#44000&#39 in a "clean" aircraft (no ordinan",1/22/2004,21.4180556,-157.8036111
10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,"My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my father if he&#39s still livi",4/27/2007,36.5950000,-82.1888889
10/10/1965 21:00,penarth (uk/wales),,gb,circle,180,about 3 mins,"penarth uk  circle  3mins  stayed 30ft above me for 3 mins slowly moved of and then with the blink of the eye the speed was unreal",2/14/2006,51.434722,-3.18
10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,"A bright orange color changing to reddish color disk/saucer was observed hovering above power transmission lines.",10/2/1999,41.1175000,-73.4083333
10/10/1966 20:00,pell city,al,us,disk,180,3  minutes,"Strobe Lighted disk shape object observed close&#44 at low speeds&#44 and low altitude in Oct 1966 in Pell City Alabama",3/19/2009,33.5861111,-86.2861111
10/10/1966 21:00,live oak,fl,us,disk,120,several minutes,"Saucer zaps energy from powerline as my pregnant mother receives mental signals not to pass info",5/11/2005,30.2947222,-82.9841667
10/10/1968 13:00,hawthorne,ca,us,circle,300,5 min.,"ROUND &#44 ORANGE &#44 WITH WHAT I WOULD SAY WAS POLISHED METAL OF SOME KIND AROUND THE EDGES .",10/31/2003,33.9163889,-118.3516667
10/10/1968 19:00,brevard,nc,us,fireball,180,3 minutes,"silent red /orange mass of energy floated by three of us in western North Carolina in the 60s",6/12/2008,35.2333333,-82.7344444
10/10/1970 16:00,bellmore,ny,us,disk,1800,30 min.,"silver disc seen by family and neighbors",5/11/2000,40.6686111,-73.5275000
10/10/1970 19:00,manchester,ky,us,unknown,180,3 minutes,"Slow moving&#44 silent craft accelerated at an unbelievable angle and speed.",2/14/2008,37.1536111,-83.7619444
10/10/1971 21:00,lexington,nc,us,oval,30,30 seconds,"green oval shaped light over my local church&#44power lines down..",2/14/2010,35.8238889,-80.2536111
10/10/1972 19:00,harlan county,ky,us,circle,1200,20minutes,"On october 10&#44 1972 myself&#44my 5yrs.daughter&#442 neices and 2 nephews were playing tag in the back yard .When we looked over on the ridge",9/15/2005,36.8430556,-83.3219444
10/10/1972 22:30,west bloomfield,mi,us,disk,120,2 minutes,"The UFO was so close&#44 my battery in the car went to zero amps&#44 stalling the engine&#44 turning off my lights and radio.",8/14/2007,42.5377778,-83.2330556
10/10/1973 19:00,niantic,ct,us,disk,1800,20-30 min,"Oh&#44 what a night &#33  Two (2) saucer-shaped&#44 glowing green objects and one (1) brilliantly glowing sphere gliding over the lake.",9/24/2003,41.3252778,-72.1936111
10/10/1973 23:00,bermuda nas,,,light,20,20 sec.,"saw fast moving blip on the radar scope thin went outside and saw it again.",1/11/2002,32.364167,-64.678611
10/10/1974 17:00,willow beach,az,,light,120,2 min,"The object was a brillant white light standing out against the bright arizona sky",2/18/2001,0,0
10/10/1974 19:30,hudson,ma,us,other,2700,45 minutes,"Not sure of the eact month or year of this sighting but it was in the fall of 74 or 75. Was walking home around dusk and saw a bright l",8/10/1999,42.3916667,-71.5666667
10/10/1974 21:30,cardiff (uk/wales),,gb,disk,1200,20 minutes,"back in 1974 I was 19 at the time and  lived in a suburb of Cardiff Wales UK called Ely&#44 and in the distance there was a wood called Ca",2/1/2007,51.5,-3.2
10/10/1974 23:00,hudson,ks,us,light,1200,one hour?,"The light chased us.",7/25/2004,38.1055556,-98.6597222
10/10/1975 17:00,north charleston,sc,us,light,360,5-6 minutes,"Several Flashing UFO lights over Charleston Naval Base in S.C.",2/14/2008,32.8544444,-79.9750000
10/10/1976 20:30,washougal,wa,us,oval,60,1 minute,"Three extremely large lights hanging above nearby trees.",2/7/2014,45.5827778,-122.3522222
10/10/1976 22:00,stoke mandeville (uk/england),,gb,cigar,3,3 seconds,"White object over Buckinghamshire UK.",12/12/2009,51.783333,-.783333
10/10/1977 12:00,san antonio,tx,us,other,30,30 seconds,"i was about six or seven and my family and me were sitting next to a window at home eating when a type of dark ball hit the screen wind",2/24/2005,29.4238889,-98.4933333
10/10/1977 22:00,louisville,ky,us,light,30,approx: 30 seconds,"HBCCUFO CANADIAN REPORT:  Pilot Sighting Of Unusual Light.",3/17/2004,38.2541667,-85.7594444
10/10/1978 02:00,elmont,ny,us,rectangle,300,5min,"A memory I will never forget that happened meny years ago.",2/1/2007,40.7008333,-73.7133333
10/10/1979 00:00,poughkeepsie,ny,us,chevron,900,15 minutes,"1/4 moon-like&#44  its &#39chord&#39 or flat side parallel to horizon&#44  bright orange-red glow&#44  completely silent&#44 no features.",4/16/2005,41.7002778,-73.9213889
10/10/1979 22:00,saddle lake (canada),ab,,triangle,270,4.5 or more min.,"Lights far above&#44  that glance; then flee from the celestrialhavens&#44 only to appear again.",1/19/2005,53.970571,-111.689885
10/10/1979 22:00,san diego,ca,us,oval,180,3 minutes,"My 2nd UFO sighting&#44 October 1979",8/5/2001,32.7152778,-117.1563889
10/10/1979 22:00,security,co,us,unknown,1800,30mins.,"very low clouds all different colors&#44search lights were seen shining down out of the clouds on the houses and streets. this event was s",1/28/1999,38.7583333,-104.7425000
10/10/1980 19:00,houston,tx,us,sphere,180,3 min,"Sphere&#44 No lights&#44  moving through neighborhoods above tree tops&#44 over busy streets in Houston in 1980.",4/16/2005,29.7630556,-95.3630556
10/10/1980 22:00,dallas,tx,us,unknown,300,5 minutes,"Strange shape shifting craft of pure light energy.",10/28/2002,32.7833333,-96.8000000
10/10/1980 23:30,manchester,nh,us,light,300,3 to 5 min,"A red glowing sphere stopped and watched me.",11/21/2010,42.9955556,-71.4552778
10/10/1982 07:00,gisborne (new zealand),,,disk,120,2min,"gisborne nz 1982 wainui beach to sponge bay",1/11/2002,-38.662334,178.017649
10/10/1984 05:00,houston,tx,us,circle,60,1 minute,"2 experience with unkown",4/18/2012,29.7630556,-95.3630556
10/10/1984 12:00,traverse city,mi,us,other,120,couple minutes,"translucent football seen over city airport",10/7/2003,44.7630556,-85.6205556
10/10/1984 22:00,white plains,ny,us,formation,20,15-20 seconds,"Saw a hugh object in sky with lights intermittently placed not making any noise traveling north in the night sky.It had no real identif",8/10/1999,41.0338889,-73.7633333
10/10/1985 20:25,leeds (uk/england),,gb,triangle,600,10min,"three light in the sky that led to a big black silent triangle shaped craft.",3/4/2003,53.8,-1.583333
10/10/1986 20:00,holmes/pawling,ny,,chevron,180,3 minutes,"Football Field Sized Chevron with bright white lights on front&#44 moving slowly with absolutely no sound",10/8/2007,41.523427,-73.646795
10/10/1987 20:00,manilla (rr#1) (canada),on,,disk,600,10/mins,"We could clearly see the craft and observed in detail it&#39s structure&#44no sound was made by it and the lights that were flashing were not",8/5/2001,0,0
10/10/1988 18:00,milwaukee,wi,us,triangle,600,10 minutes,"A silent black triangular object with no lights moved over us as we walked our dog across a school playground.",8/7/2007,43.0388889,-87.9063889
10/10/1988 21:00,new gloucester,me,us,unknown,120,few minutes,"I&#39m still afraid to look at the sky at night.",1/10/2009,43.9627778,-70.2830556
10/10/1988 22:00,boulder,co,us,light,60,1 minute,"Three Stars Begin to Move Randomly&#44 Stop&#44 Change Direction&#44 Move Again&#44 And Then Vanish",7/16/2006,40.0150000,-105.2700000
10/10/1988 22:45,seattle (ballard area),wa,us,unknown,2,2 sec.,"Two adult witnesses are stunned by the sound of an object that streaks above the home they are in.",4/27/2004,47.6063889,-122.3308333
10/10/1989 00:00,calabasas,ca,us,disk,300,approx 5 min,"Unidentified object on Mulholland Highway.",12/14/2004,34.1577778,-118.6375000
10/10/1989 21:00,centralia,wa,us,triangle,60,1 minute,"A huge triangularly shaped silent object that blotted out 25 degrees of the sky&#44 lighted by four glowing points.",4/27/2004,46.7163889,-122.9530556
10/10/1989 21:00,kranklin,ky,,triangle,180,3min,"Triangle seen in franklin Ky - october 1989",5/11/2005,36.722263,-86.577218
ufo50 = read.table("50_raw.csv", header=TRUE, sep=",")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 43 did not have 11 elements
Calls: read.table -> scan
Execution halted

Lets pay attention to that specific line

datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
10/10/1987 20:00,manilla (rr#1) (canada),on,,disk,600,10/mins,"We could clearly see the craft and observed in detail it&#39s structure&#44no sound was made by it and the lights that were flashing were not",8/5/2001,0,0

Since it is complaining about not having enough elements in the line, lets re-read that line telling R to read the file and file in missing fields.

ufo42 = read.table("50_line_42_raw.csv", header=TRUE, sep=",", fill=TRUE)
options(width=1000)
print(ufo42)
          datetime        city state country shape duration..seconds. duration..hours.min. comments date.posted latitude longitude
1 10/10/1987 20:00 manilla (rr    NA      NA    NA                 NA                   NA       NA          NA       NA        NA

Well, that result was probably unexpected... First of all using fill is rather dangerous because it will just fill in the missing values at the end with NAs, but what if an earlier column was missing. In our case we were actually being impacted by the way read.table() handles comments #, we could get around that by telling R what the comment character is.

ufo42 = read.table("50_line_42_raw.csv", header=TRUE, sep=",", comment.char="")
options(width=1000)
print(ufo42)
          datetime                    city state country shape duration..seconds. duration..hours.min.                                                                                                                                      comments date.posted latitude longitude
1 10/10/1987 20:00 manilla (rr#1) (canada)    on      NA  disk                600              10/mins We could clearly see the craft and observed in detail it&#39s structure&#44no sound was made by it and the lights that were flashing were not    8/5/2001        0         0

Ok, we can load the data now, but...

Lets look at the data again and consider what parts of the data we don't like. Do you see anything in particular that is concerning?

You, as a Data Scientist, have to make the decisions on how you are going to load your raw data. Your steps should always be reproducable and you should be able to provide these transforms to others.

Understanding our data

To use our data, you must be able to comprehend the data itself. While not comprehensive, the list below will help guide your thoughts in processing the RAW data that you have.

Practical Steps to load the data into MySQL

In general you will find that it is more common for Python to be used for loading the data into a database. Each row of the data can be processed and then inserted into the database. Before we do this we should think of a general process.

After determining the schema for our tables, we need to begin inserting the rows into the database. We will do that with Python.

Load and review the data - Looking at the CSV

Lets use the Pandas library to make our lives a bit easier.

import pandas as pd             # Load in the Pandas Library

df = pd.read_csv('100.csv')     # Load in our 100 rows of UFO data
print(df.isnull().sum())        # Lets see what nulls we have
datetime       0
city           0
state         10
country       17
shape          2
seconds        0
hours          1
comments       0
dateposted     0
latitude       0
longitude      0
Just by taking this step, we can see that there is missing data in our table, for each of these columns we need to make decisions.

If we decide delete rows or use default values for missing fields we can use some of Pandas built in methods.

df_clean = df.dropna()           # Lets drop all rows with NAs
df_filled = df.fillna(0)         # Lets replace all NA values with another value
df_unique = df.drop_duplicates() # Lets drop rows that are duplicative

Convert each field (as needed)

Currently every column is being treated as either a string or an object by Pandas, if we wanted to perform work before the data was inserted into the database now would be a great time. As an example, the seconds column is a perfect candidate to change to an int

df['seconds'] = df['seconds'].astype(int)

Assuming that you have created your MySQL tables with the correct types, in this example as an int, performing this step in Python is probably not necessary.

Walkthrough each line one at a time

Now that we have loaded in all of the data and made some global changes to the data. Lets walkthrough each row in the dataframe.

for i in range(len(df)):
  row = df.iloc[i,:]
  print(row)
We will see something like the below for each iteration of the loop.

datetime                                       10/10/2001 22:00
city                                                    phoenix
state                                                        az
country                                                      us
shape                                                  triangle
seconds                                                      60
hours                                        less then a minute
comments      Triangle shaped craft spotted flying west to e...
dateposted                                           11/20/2001
latitude                                              33.448333
longitude                                           -112.073333

And now we can make changes to the individual parts.

city  = row['city'].capitalize()
state = row['state'].upper()
Hopefully you can see that this is a great time to make changes as needed to each field to meet what is needed to insert the data into MySQL. If our table looked like:

+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| dtg              | DATETIME     | YES  |     | NULL    |       |
| city             | varchar(128) | YES  |     | NULL    |       |
| state            | varchar(32)  | YES  |     | NULL    |       |
| country          | varchar(32)  | YES  |     | NULL    |       |
| shape            | varchar(64)  | YES  |     | NULL    |       |
| duration_seconds | INT          | YES  |     | NULL    |       |
| duration_minutes | INT          | YES  |     | NULL    |       |
| comments         | varchar(256) | YES  |     | NULL    |       |
| dateposted       | DATETIME     | YES  |     | NULL    |       |
| latitude         | double(9,6)  | YES  |     | NULL    |       |
| longitude        | double(9,6)  | YES  |     | NULL    |       |
+------------------+--------------+------+-----+---------+-------+
Then we would need to write a prepared INSERT statement and after cleaning each row and validating it, we would insert it into our database

query = "INSERT INTO ufo VALUES (?,?,?,?,?,?,?,?,?,?,?)"

Thoughts

It is extremely important that you become very comfortable with string manipulation in Python. Methods like .strip(), .lower(), .split(), and slicing will need to become second nature.

Commands like the following are also very important in that they can help you understand the data that you are working with.

print(df.head())
print(df.info())
print(df.describe())
print(df.isnull().sum())
In fact you should work to obtain this type of information each time you begin to work with a dataset.

Problems

By this point in the class, your team should have already found the sources of the data you want to work with and have the CSVs (or other raw data), ready for processing. With that in mind, for today's problems:

  1. Load each of your CSV's into a Pandas dataframe using the methods above.
  2. Run the commands above from the Thoughts section for each of the dataframes, consider what is important, and what MySQL types your data should be.
  3. Step through each row in your dataframe, and extract each column into its own variable.
  4. For each variable, manipulate it as necessary (conversions, formats, etc).
  5. Decide if that row is valid.