Imagine that you have a large amount of data, it could have come from a sensor that has been in the field for years, a piece of equipment in the engine room, or from a site like Kaggle. Our goal has always been to be able to understand that data and convert it to some type of knowledge.
By this point you should be starting to realize that the data you receive isn't always in the format you expect, or has some obvious erroneous entries, or other problems that you will have to deal with.
The purpose of this discussion is to consider possible issues you may encounter and ways to overcome them. You, as the Data Scientist, need to make some decisions, do you throw out a row of data that didn't meet your expectations, do you change a value, or do you set some default values. Each of these actions will impact the final results generated from any of your follow on work and you should be able to justify and document any of these transformations that you decide to perform.
We have been working with the UFO data for the past few weeks, its a fun and interesting topic. You were able to import the data into your MySQL database using the files provided. But what happens when you work with the raw data directly. lets look at the first 10 rows and the header, in the file 10_raw.csv and some quick outputs from loading the data in R.
datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,"This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit",4/27/2004,29.8830556,-97.9411111
10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,"1949 Lackland AFB, TX. Lights racing across the sky & making 90 degree turns on a dime.",12/16/2005,29.38421,-98.581082
10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,"Green/Orange circular disc over Chester, England",1/21/2008,53.2,-2.916667
10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,"My older brother and twin sister were leaving the only Edna theater at about 9 PM,...we had our bikes and I took a different route home",1/17/2004,28.9783333,-96.6458333
10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,"AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise, I was at 50ꯠ' in a "clean" aircraft (no ordinan",1/22/2004,21.4180556,-157.8036111
10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,"My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my father if he's still livi",4/27/2007,36.5950000,-82.1888889
10/10/1965 21:00,penarth (uk/wales),,gb,circle,180,about 3 mins,"penarth uk circle 3mins stayed 30ft above me for 3 mins slowly moved of and then with the blink of the eye the speed was unreal",2/14/2006,51.434722,-3.18
10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,"A bright orange color changing to reddish color disk/saucer was observed hovering above power transmission lines.",10/2/1999,41.1175000,-73.4083333
10/10/1966 20:00,pell city,al,us,disk,180,3 minutes,"Strobe Lighted disk shape object observed close, at low speeds, and low altitude in Oct 1966 in Pell City Alabama",3/19/2009,33.5861111,-86.2861111
10/10/1966 21:00,live oak,fl,us,disk,120,several minutes,"Saucer zaps energy from powerline as my pregnant mother receives mental signals not to pass info",5/11/2005,30.2947222,-82.9841667
ufo10 = read.table("10_raw.csv", header=TRUE, sep=",")
options(width=1000)
summary(ufo10)
datetime city state country shape duration..seconds. duration..hours.min. comments date.posted latitude longitude
Length:10 Length:10 Length:10 Length:10 Length:10 Min. : 20 Length:10 Length:10 Length:10 Min. :21.42 Min. :-157.804
Class :character Class :character Class :character Class :character Class :character 1st Qu.: 135 Class :character Class :character Class :character 1st Qu.:29.51 1st Qu.: -97.617
Mode :character Mode :character Mode :character Mode :character Mode :character Median : 240 Mode :character Mode :character Mode :character Median :31.94 Median : -84.635
Mean :1282 Mean :35.59 Mean : -78.194
3rd Qu.:1125 3rd Qu.:39.99 3rd Qu.: -75.603
Max. :7200 Max. :53.20 Max. : -2.917
Looking at the raw file above, do you see anything of concern? Anything that may cause you troubles later? Anything that you may want to change the formating or transform?
Lets expand our discussion to the first 50 rows of the raw UFO data. Take a quick look at 50_raw.csv and some quick outputs from loading the data in R.
datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,"This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit",4/27/2004,29.8830556,-97.9411111
10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,"1949 Lackland AFB, TX. Lights racing across the sky & making 90 degree turns on a dime.",12/16/2005,29.38421,-98.581082
10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,"Green/Orange circular disc over Chester, England",1/21/2008,53.2,-2.916667
10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,"My older brother and twin sister were leaving the only Edna theater at about 9 PM,...we had our bikes and I took a different route home",1/17/2004,28.9783333,-96.6458333
10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,"AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise, I was at 50ꯠ' in a "clean" aircraft (no ordinan",1/22/2004,21.4180556,-157.8036111
10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,"My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my father if he's still livi",4/27/2007,36.5950000,-82.1888889
10/10/1965 21:00,penarth (uk/wales),,gb,circle,180,about 3 mins,"penarth uk circle 3mins stayed 30ft above me for 3 mins slowly moved of and then with the blink of the eye the speed was unreal",2/14/2006,51.434722,-3.18
10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,"A bright orange color changing to reddish color disk/saucer was observed hovering above power transmission lines.",10/2/1999,41.1175000,-73.4083333
10/10/1966 20:00,pell city,al,us,disk,180,3 minutes,"Strobe Lighted disk shape object observed close, at low speeds, and low altitude in Oct 1966 in Pell City Alabama",3/19/2009,33.5861111,-86.2861111
10/10/1966 21:00,live oak,fl,us,disk,120,several minutes,"Saucer zaps energy from powerline as my pregnant mother receives mental signals not to pass info",5/11/2005,30.2947222,-82.9841667
10/10/1968 13:00,hawthorne,ca,us,circle,300,5 min.,"ROUND , ORANGE , WITH WHAT I WOULD SAY WAS POLISHED METAL OF SOME KIND AROUND THE EDGES .",10/31/2003,33.9163889,-118.3516667
10/10/1968 19:00,brevard,nc,us,fireball,180,3 minutes,"silent red /orange mass of energy floated by three of us in western North Carolina in the 60s",6/12/2008,35.2333333,-82.7344444
10/10/1970 16:00,bellmore,ny,us,disk,1800,30 min.,"silver disc seen by family and neighbors",5/11/2000,40.6686111,-73.5275000
10/10/1970 19:00,manchester,ky,us,unknown,180,3 minutes,"Slow moving, silent craft accelerated at an unbelievable angle and speed.",2/14/2008,37.1536111,-83.7619444
10/10/1971 21:00,lexington,nc,us,oval,30,30 seconds,"green oval shaped light over my local church,power lines down..",2/14/2010,35.8238889,-80.2536111
10/10/1972 19:00,harlan county,ky,us,circle,1200,20minutes,"On october 10, 1972 myself,my 5yrs.daughterƺ neices and 2 nephews were playing tag in the back yard .When we looked over on the ridge",9/15/2005,36.8430556,-83.3219444
10/10/1972 22:30,west bloomfield,mi,us,disk,120,2 minutes,"The UFO was so close, my battery in the car went to zero amps, stalling the engine, turning off my lights and radio.",8/14/2007,42.5377778,-83.2330556
10/10/1973 19:00,niantic,ct,us,disk,1800,20-30 min,"Oh, what a night ! Two (2) saucer-shaped, glowing green objects and one (1) brilliantly glowing sphere gliding over the lake.",9/24/2003,41.3252778,-72.1936111
10/10/1973 23:00,bermuda nas,,,light,20,20 sec.,"saw fast moving blip on the radar scope thin went outside and saw it again.",1/11/2002,32.364167,-64.678611
10/10/1974 17:00,willow beach,az,,light,120,2 min,"The object was a brillant white light standing out against the bright arizona sky",2/18/2001,0,0
10/10/1974 19:30,hudson,ma,us,other,2700,45 minutes,"Not sure of the eact month or year of this sighting but it was in the fall of 74 or 75. Was walking home around dusk and saw a bright l",8/10/1999,42.3916667,-71.5666667
10/10/1974 21:30,cardiff (uk/wales),,gb,disk,1200,20 minutes,"back in 1974 I was 19 at the time and lived in a suburb of Cardiff Wales UK called Ely, and in the distance there was a wood called Ca",2/1/2007,51.5,-3.2
10/10/1974 23:00,hudson,ks,us,light,1200,one hour?,"The light chased us.",7/25/2004,38.1055556,-98.6597222
10/10/1975 17:00,north charleston,sc,us,light,360,5-6 minutes,"Several Flashing UFO lights over Charleston Naval Base in S.C.",2/14/2008,32.8544444,-79.9750000
10/10/1976 20:30,washougal,wa,us,oval,60,1 minute,"Three extremely large lights hanging above nearby trees.",2/7/2014,45.5827778,-122.3522222
10/10/1976 22:00,stoke mandeville (uk/england),,gb,cigar,3,3 seconds,"White object over Buckinghamshire UK.",12/12/2009,51.783333,-.783333
10/10/1977 12:00,san antonio,tx,us,other,30,30 seconds,"i was about six or seven and my family and me were sitting next to a window at home eating when a type of dark ball hit the screen wind",2/24/2005,29.4238889,-98.4933333
10/10/1977 22:00,louisville,ky,us,light,30,approx: 30 seconds,"HBCCUFO CANADIAN REPORT: Pilot Sighting Of Unusual Light.",3/17/2004,38.2541667,-85.7594444
10/10/1978 02:00,elmont,ny,us,rectangle,300,5min,"A memory I will never forget that happened meny years ago.",2/1/2007,40.7008333,-73.7133333
10/10/1979 00:00,poughkeepsie,ny,us,chevron,900,15 minutes,"1/4 moon-like, its 'chord' or flat side parallel to horizon, bright orange-red glow, completely silent, no features.",4/16/2005,41.7002778,-73.9213889
10/10/1979 22:00,saddle lake (canada),ab,,triangle,270,4.5 or more min.,"Lights far above, that glance; then flee from the celestrialhavens, only to appear again.",1/19/2005,53.970571,-111.689885
10/10/1979 22:00,san diego,ca,us,oval,180,3 minutes,"My 2nd UFO sighting, October 1979",8/5/2001,32.7152778,-117.1563889
10/10/1979 22:00,security,co,us,unknown,1800,30mins.,"very low clouds all different colors,search lights were seen shining down out of the clouds on the houses and streets. this event was s",1/28/1999,38.7583333,-104.7425000
10/10/1980 19:00,houston,tx,us,sphere,180,3 min,"Sphere, No lights, moving through neighborhoods above tree tops, over busy streets in Houston in 1980.",4/16/2005,29.7630556,-95.3630556
10/10/1980 22:00,dallas,tx,us,unknown,300,5 minutes,"Strange shape shifting craft of pure light energy.",10/28/2002,32.7833333,-96.8000000
10/10/1980 23:30,manchester,nh,us,light,300,3 to 5 min,"A red glowing sphere stopped and watched me.",11/21/2010,42.9955556,-71.4552778
10/10/1982 07:00,gisborne (new zealand),,,disk,120,2min,"gisborne nz 1982 wainui beach to sponge bay",1/11/2002,-38.662334,178.017649
10/10/1984 05:00,houston,tx,us,circle,60,1 minute,"2 experience with unkown",4/18/2012,29.7630556,-95.3630556
10/10/1984 12:00,traverse city,mi,us,other,120,couple minutes,"translucent football seen over city airport",10/7/2003,44.7630556,-85.6205556
10/10/1984 22:00,white plains,ny,us,formation,20,15-20 seconds,"Saw a hugh object in sky with lights intermittently placed not making any noise traveling north in the night sky.It had no real identif",8/10/1999,41.0338889,-73.7633333
10/10/1985 20:25,leeds (uk/england),,gb,triangle,600,10min,"three light in the sky that led to a big black silent triangle shaped craft.",3/4/2003,53.8,-1.583333
10/10/1986 20:00,holmes/pawling,ny,,chevron,180,3 minutes,"Football Field Sized Chevron with bright white lights on front, moving slowly with absolutely no sound",10/8/2007,41.523427,-73.646795
10/10/1987 20:00,manilla (rr#1) (canada),on,,disk,600,10/mins,"We could clearly see the craft and observed in detail it's structure,no sound was made by it and the lights that were flashing were not",8/5/2001,0,0
10/10/1988 18:00,milwaukee,wi,us,triangle,600,10 minutes,"A silent black triangular object with no lights moved over us as we walked our dog across a school playground.",8/7/2007,43.0388889,-87.9063889
10/10/1988 21:00,new gloucester,me,us,unknown,120,few minutes,"I'm still afraid to look at the sky at night.",1/10/2009,43.9627778,-70.2830556
10/10/1988 22:00,boulder,co,us,light,60,1 minute,"Three Stars Begin to Move Randomly, Stop, Change Direction, Move Again, And Then Vanish",7/16/2006,40.0150000,-105.2700000
10/10/1988 22:45,seattle (ballard area),wa,us,unknown,2,2 sec.,"Two adult witnesses are stunned by the sound of an object that streaks above the home they are in.",4/27/2004,47.6063889,-122.3308333
10/10/1989 00:00,calabasas,ca,us,disk,300,approx 5 min,"Unidentified object on Mulholland Highway.",12/14/2004,34.1577778,-118.6375000
10/10/1989 21:00,centralia,wa,us,triangle,60,1 minute,"A huge triangularly shaped silent object that blotted out 25 degrees of the sky, lighted by four glowing points.",4/27/2004,46.7163889,-122.9530556
10/10/1989 21:00,kranklin,ky,,triangle,180,3min,"Triangle seen in franklin Ky - october 1989",5/11/2005,36.722263,-86.577218
ufo50 = read.table("50_raw.csv", header=TRUE, sep=",")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 43 did not have 11 elements
Calls: read.table -> scan
Execution halted
Lets pay attention to that specific line
datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
10/10/1987 20:00,manilla (rr#1) (canada),on,,disk,600,10/mins,"We could clearly see the craft and observed in detail it's structure,no sound was made by it and the lights that were flashing were not",8/5/2001,0,0
Since it is complaining about not having enough elements in the line, lets re-read that line telling R to read the file and file in missing fields.
ufo42 = read.table("50_line_42_raw.csv", header=TRUE, sep=",", fill=TRUE)
options(width=1000)
print(ufo42)
datetime city state country shape duration..seconds. duration..hours.min. comments date.posted latitude longitude
1 10/10/1987 20:00 manilla (rr NA NA NA NA NA NA NA NA NA
Well, that result was probably unexpected...
First of all using fill is rather dangerous because it will just fill in the missing
values at the end with NAs, but what if an earlier column was missing. In our case
we were actually being impacted by the way read.table() handles comments #,
we could get around that by telling R what the comment character is.
ufo42 = read.table("50_line_42_raw.csv", header=TRUE, sep=",", comment.char="")
options(width=1000)
print(ufo42)
datetime city state country shape duration..seconds. duration..hours.min. comments date.posted latitude longitude
1 10/10/1987 20:00 manilla (rr#1) (canada) on NA disk 600 10/mins We could clearly see the craft and observed in detail it's structure,no sound was made by it and the lights that were flashing were not 8/5/2001 0 0
Lets look at the data again and consider what parts of the data we don't like. Do you see anything in particular that is concerning?
duration..seconds., duration..hours.min, etc.duration..hours.min column you get results like "10 minutes", "< 1 min", and "two seperate times, 10 se". Also consider extra spaces, capitolization, etc.You, as a Data Scientist, have to make the decisions on how you are going to load your raw data. Your steps should always be reproducable and you should be able to provide these transforms to others.
To use our data, you must be able to comprehend the data itself. While not comprehensive, the list below will help guide your thoughts in processing the RAW data that you have.
In general you will find that it is more common for Python to be used for loading the data into a database. Each row of the data can be processed and then inserted into the database. Before we do this we should think of a general process.
After determining the schema for our tables, we need to begin inserting the rows into the database. We will do that with Python.
Lets use the Pandas library to make our lives a bit easier.
import pandas as pd # Load in the Pandas Library
df = pd.read_csv('100.csv') # Load in our 100 rows of UFO data
print(df.isnull().sum()) # Lets see what nulls we have
datetime 0
city 0
state 10
country 17
shape 2
seconds 0
hours 1
comments 0
dateposted 0
latitude 0
longitude 0
Just by taking this step, we can see that there is missing data in our table, for
each of these columns we need to make decisions.
If we decide delete rows or use default values for missing fields we can use some of Pandas built in methods.
df_clean = df.dropna() # Lets drop all rows with NAs
df_filled = df.fillna(0) # Lets replace all NA values with another value
df_unique = df.drop_duplicates() # Lets drop rows that are duplicative
Currently every column is being treated as either a string or an object by
Pandas, if we wanted to perform work before the data was inserted into the
database now would be a great time. As an example, the seconds column
is a perfect candidate to change to an int
df['seconds'] = df['seconds'].astype(int)
Assuming that you have created your MySQL tables with the correct types, in this example as an int, performing this step in Python is probably not necessary.
Now that we have loaded in all of the data and made some global changes to the data. Lets walkthrough each row in the dataframe.
for i in range(len(df)):
row = df.iloc[i,:]
print(row)
We will see something like the below for each iteration of the loop.
datetime 10/10/2001 22:00
city phoenix
state az
country us
shape triangle
seconds 60
hours less then a minute
comments Triangle shaped craft spotted flying west to e...
dateposted 11/20/2001
latitude 33.448333
longitude -112.073333
And now we can make changes to the individual parts.
city = row['city'].capitalize()
state = row['state'].upper()
Hopefully you can see that this is a great time to make changes
as needed to each field to meet what is needed to insert the data into MySQL.
If our table looked like:
+------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| dtg | DATETIME | YES | | NULL | |
| city | varchar(128) | YES | | NULL | |
| state | varchar(32) | YES | | NULL | |
| country | varchar(32) | YES | | NULL | |
| shape | varchar(64) | YES | | NULL | |
| duration_seconds | INT | YES | | NULL | |
| duration_minutes | INT | YES | | NULL | |
| comments | varchar(256) | YES | | NULL | |
| dateposted | DATETIME | YES | | NULL | |
| latitude | double(9,6) | YES | | NULL | |
| longitude | double(9,6) | YES | | NULL | |
+------------------+--------------+------+-----+---------+-------+
Then we would need to write a prepared INSERT statement
and after cleaning each row and validating it, we would insert it into our database
query = "INSERT INTO ufo VALUES (?,?,?,?,?,?,?,?,?,?,?)"
It is extremely important that you become very comfortable with string
manipulation in Python. Methods like .strip(), .lower(), .split(),
and slicing will need to become second nature.
Commands like the following are also very important in that they can help you understand the data that you are working with.
print(df.head())
print(df.info())
print(df.describe())
print(df.isnull().sum())
In fact you should work to obtain this type of information each time you
begin to work with a dataset.
By this point in the class, your team should have already found the sources of the data you want to work with and have the CSVs (or other raw data), ready for processing. With that in mind, for today's problems: