Web Scraping

We've covered reading data from text files line-by-line, CSV/TSV formats, and JSON. Today we shift to reading HTML from web pages that don't offer their data in a structured format; all you get is the visual, human-oriented page itself. This will enable you to write programs that can query any arbitrary page on the Web and scrape off whatever information you need.

Making Web Queries

We introduced you to basic web queries yesterday and in the previous lecture on JSON. Python has a library called requests which lets you make GET and POST requests to web servers. You should remember these from plebe SY110, but if not, just know this is the mechanism a web browser uses to request a web page from a remote server. Here is the basic mechanism:

import requests

page = requests.get('https://www.usna.edu/Users/cs/nchamber/')
print(page.text)

The one parameter is the URL you'd normally type into a web browser. This pulls Dr. Chambers's faculty webpage. Try it out on any page you want! The returned object is a Response object which has several data fields including .text that you'll see printed out if you run this:

<?xml version = "1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns = "http://www.w3.org/1999/xhtml">
<head>
  <title>Nate Chambers : US Naval Academy</title>
  ...
  ...

The text is the HTML response. If you queried a different URL for a different type of file, you'll get that text back instead. During the JSON lab, we used this approach to query a .json file, so it returned JSON text. This let us use the json() function which every Response object has. When you're web scraping, usually you will just be looking at HTML files like this.

Parsing the HTML-formatted Text

It's simple enough to retrieve a webpage; the challenge is pulling out the pieces of information you want. An HTML file can be huge, so the one piece of data you want is a needle in the HTML haystack. How do we find it? As with most things, Python has a library for this. It's called BeautifulSoup.

Click that link and spend a couple minutes looking at the Quickstart program example they have.

The Beautiful Soup library is called bs4, but we mainly just want the BeautifulSoup class inside it, so let's import just that:

from bs4 import BeautifulSoup

Call your desired website (we'll do Dr. Chambers's faculty page again) and give it to a new BeautifulSoup object:

page = requests.get('https://www.usna.edu/Users/cs/nchamber/')
soup = BeautifulSoup(page.text, 'html.parser')

Now you have a soup object with all the searching and parsing functions you could hope for.
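If you'd like to experiment without hitting a live server, you can feed BeautifulSoup a small HTML string directly. The snippet below is invented just for illustration; it behaves exactly like a page you retrieved with requests:

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML snippet, just for illustration
html = "<html><head><title>Hello</title></head><body><p>Hi there</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.text)   # Hello
print(soup.p.text)       # Hi there
```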

Accessing HTML Elements Quickly

Want to find a DIV, or a TABLE, or a link A? You can access HTML tags by type, quickly getting the first such element on the page by just using variables in the object:

print(soup.a)
print(soup.div)
print(soup.h1)

These print the first of each type that is on the page. Typically, though, these are just quick access things, and chances are you want ALL of the anchor links on the page, right?

alist = soup.find_all('a')
print('Number of links =', len(alist))
print('This is the 5th link:', alist[4])
# Loop and print ALL the links	
for a in alist:
   print(a.get('href'))

Easy right? The find_all(str) method returns a ResultSet object, which is basically a list of all the HTML tags you searched for. Each object in a ResultSet is a Tag object. This gives you access to all the HTML attributes and internal text you might want. The code above finds all links on the page, loops over them, and then prints out their addresses using the 'href' attribute from HTML anchor tags.
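You can try find_all() the same way on a made-up snippet (the URLs below are invented for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical links, just to exercise find_all()
html = """
<body>
  <a href="https://example.com/one">one</a>
  <a href="https://example.com/two">two</a>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')

alist = soup.find_all('a')
print('Number of links =', len(alist))   # 2
for a in alist:
    print(a.get('href'))
```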

Let's drive this home a little more. Perhaps you want all the <p> tags? It's a similar loop, but the text is inside of the P tag, not as an attribute, so you instead use the Tag object's text variable:

ps = soup.find_all('p')
for p in ps:
   print(p.text)

Tags

The basic building block in BeautifulSoup when parsing HTML pages is the Tag class. When you search and filter, you are getting back Tag objects. What can you do with a Tag object?

tag.text : get a string representation of the text inside the tag
tag.get(str) : get the value of one of the Tag's attributes
tag.parent : get the parent of a Tag in the HTML tree, returned as another Tag
tag.nextSibling : get the next Tag adjacent to this one in the HTML tree
tag.children : get a list-type object holding all Tags directly inside this one
tag.find_all(str) : get a list of all Tags under this current tag that match a search criteria
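A small made-up HTML tree is enough to exercise all of these attributes at once:

```python
from bs4 import BeautifulSoup

# An invented snippet: a DIV holding two paragraphs
html = "<div id='outer'><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html, 'html.parser')

p = soup.p                           # the first <p> on the page
print(p.text)                        # first
print(p.parent.get('id'))            # outer
print(p.nextSibling.text)            # second
print(len(p.parent.find_all('p')))   # 2
```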

This is why our short code examples in the previous section work. We used p.text to print the paragraph's text, and we grabbed a link's address using the 'href' attribute: a.get('href'). These basic attributes of the Tag class will let you perform most operations that you need. Let's see how we might use them to find information on the page.

Finding What You Need

OK you can now retrieve any webpage you want and find lists of particular HTML elements. This is very powerful, but there are a couple more tricks to help you navigate the HTML.

Parents, Children, Siblings: You'll sometimes be able to grab an element close to what you want, but it's difficult to get to the exact element with one search. Let's imagine that the piece of data you want is just below a particular link on the webpage. In that case, you can first get to the link, and then you will navigate up and down the HTML tree:

These do what they sound like ... if the HTML has a series of tags and you have one of them in a variable, you can ask for its nextSibling to get the next one. Or if it's inside a bigger element like a DIV, you can ask for its parent to move up one level.

Text Search: You'll often look at a webpage and see exactly the content you want in text form. Just like before when we searched for tags of one type, BeautifulSoup lets you search for tags with a text span using the text attribute in the normal find_all() function:

matches = soup.body.find_all(text='Research Interests')	  

When you search for text, you don't get a Tag back but instead a string-type object. This search finds the "Research Interests" heading text on Dr. Chambers's webpage.

OK, we found the "Research Interests" text, so let's access its parent to get the actual H2 tag:

h2 = matches[0].parent

Great, and you'll see the paragraph we want comes after the H2 on the webpage. We can get that by grabbing the nextSibling (actually, two of them, because there is a whitespace string between the two elements on the page). Here is the entire code block:

page = requests.get('https://www.usna.edu/Users/cs/nchamber/')
soup = BeautifulSoup(page.text, 'html.parser')

matches = soup.body.find_all(text='Research Interests')
h2 = matches[0].parent
p = h2.nextSibling.nextSibling
print(p.text)
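The same pattern works offline. Here's a sketch using invented HTML that mimics the heading-then-paragraph layout above, so you can see the whitespace sibling for yourself:

```python
from bs4 import BeautifulSoup

# Invented HTML mimicking a heading followed by a paragraph
html = """
<body>
<h2>Research Interests</h2>
<p>Natural language processing and more.</p>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')

matches = soup.body.find_all(text='Research Interests')
h2 = matches[0].parent           # the string's parent is the H2 tag
p = h2.nextSibling.nextSibling   # skip the whitespace string between the tags
print(p.text)
```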

The Conclusion of the Matter

You may be asking how you know to get a parent, or two siblings, or maybe children. A lot of this is trial and error. Write a short bit of code and print out the tag you are on. Print out its siblings and parents. Open the actual HTML of the webpage and align what you're seeing printed to the page. This will tell you where and how to navigate the HTML tree. It might seem daunting at first, but once you write your program to grab the info you need, it will always work until the webpage changes its internal structure ... this happens less often than you might imagine.

The key to a good web scraping program is to write it based on the HTML's unique characteristics. If you write it based on finding, say, the 3rd child of the 19th DIV after the 4th table ... it's probably going to break pretty easily. However, if you write it based on finding the table that has an H2 before it called 'data table', well now it's more robust if the page changes.
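To make that concrete, here's a sketch on a hypothetical page with two tables, where we anchor on the unique heading text rather than on the table's position. It uses find_next(), which walks forward through the document from a tag:

```python
from bs4 import BeautifulSoup

# Hypothetical page: two tables, but only one follows the 'data table' heading
html = """
<h2>other stuff</h2>
<table><tr><td>wrong</td></tr></table>
<h2>data table</h2>
<table><tr><td>right</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

h2 = soup.find('h2', string='data table')   # anchor on the unique heading text
table = h2.find_next('table')               # first table after that heading
print(table.td.text)                        # right
```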

Ultimately, this just brushes the surface of what you can do. Look at the example below for another full program, but you now have the basics to be off and running if you want to scrape the Web!

Real Example

Imagine we have a website that contains a list of organizations that have been hit with DDoS attacks -- the page lives at https://www.usna.edu/Users/cs/nchamber/data/ddos/. There is a list of domains and dates on this webpage, and we want a program that will retrieve it and write all the organization names and dates out to a text file.

How do we approach this? The first step is to look at the HTML code! Right-click on the page in your browser and choose "View Page Source". Now look for the names and dates that you saw, and you should find they are in an HTML TABLE. What makes this a little easier is that the organizations are in TD tags and none others exist on the page, so maybe we can use those. Let's retrieve the page and grab the TDs:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.usna.edu/Users/cs/nchamber/data/ddos/')
soup = BeautifulSoup(page.text, 'html.parser')
tdlist = soup.find_all('td')

out = open('scraped.txt', 'w')
for td in tdlist:
    sib = td.nextSibling    # Get the next HTML element following this TD

    # A domain's TD is followed by another TD holding its date, so its sibling is not None.
    # Some TDs hold no real text, so skip those too.
    if sib is not None and td.text.strip() != '':
        company = td.text
        datestr = sib.text

        # Write one tab-separated line: organization, then date
        out.write(company + '\t' + datestr + '\n')

out.close()