JSON and Scraping

We covered two types of Web queries in our prior two lectures: using an API to retrieve JSON, and scraping HTML web pages. Today puts these two side-by-side for a hands-on comparison with in-class exercises.

Days Until Christmas

Let's start with a JSON review to illustrate what is available. Do a Web search for "current time json" and you'll see a few hits! Right away we can see that APIs are available for retrieving accurate times. Let's write a program that will always tell you how many days and hours we have until Christmas.

The first thing to do is to understand the JSON format. This WorldTimeAPI interface has a very nice webpage that shows examples! Sometimes an API's webpage will have sufficient information, but sometimes it's hard to find what you need. The direct approach is to simply type an example JSON call into your Web browser, in this case: http://worldtimeapi.org/api/timezone/America/New_York

{"abbreviation":"EST","client_ip":"136.160.90.6","datetime":"2022-11-22T11:15:51.994797-05:00","day_of_week":2,"day_of_year":326,"dst":false,"dst_from":null,"dst_offset":0,"dst_until":null,"raw_offset":-18000,"timezone":"America/New_York","unixtime":1669133751,"utc_datetime":"2022-11-22T16:15:51.994797+00:00","utc_offset":"-05:00","week_number":47}

Looks good! I see a nice 'datetime' field which sure sounds familiar (Python datetime module?). We can use that to do our datetime math. Here is a starter program:

import requests
from datetime import datetime

tz = input('What timezone ("local" also an option)? ')

if tz.lower() == 'local':
    tz = "America/New_York"

# Make the JSON query
data = requests.get("http://worldtimeapi.org/api/timezone/" + tz)
json = data.json()

# Create the NOW datetime object.
...

# Create a Christmas datetime, subtract, and print!
...

We'll show this in class. A full solution is here.
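
As a rough sketch of the datetime math (not the in-class solution), the snippet below works offline on the 'datetime' string copied from the sample response above. The helper name time_until_christmas is made up for illustration, and the code assumes Python 3.7 or later so that datetime.fromisoformat can parse the UTC offset.

```python
from datetime import datetime

def time_until_christmas(now):
    """Return (days, hours) from `now` until the next December 25th."""
    # Reuse the timezone of `now` so both datetimes are comparable.
    christmas = datetime(now.year, 12, 25, tzinfo=now.tzinfo)
    if now > christmas:
        # Christmas has already started or passed this year; use next year's.
        christmas = datetime(now.year + 1, 12, 25, tzinfo=now.tzinfo)
    delta = christmas - now
    # delta.seconds holds the leftover time beyond the whole days.
    return delta.days, delta.seconds // 3600

# 'datetime' value copied from the sample JSON response shown earlier.
now = datetime.fromisoformat("2022-11-22T11:15:51.994797-05:00")
days, hours = time_until_christmas(now)
print(f"{days} days and {hours} hours until Christmas")
```

In the starter program, the same string would come from json['datetime'] instead of being typed in by hand.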

Live Stock Ticker: HTML

Let's compare both approaches (JSON and HTML) on the same task: stock prices.

There are many websites that show stock prices. We found MarketWatch to be pretty reliable, and it shows a static HTML webpage for any ticker that you desire. Click here to see the GMS webpage. Notice that the URL (marketwatch.com/investing/stock/gms) has the ticker at the very end of the address, so we can easily build that string in our program to ask for different stocks. Starter code is below:

import requests
from bs4 import BeautifulSoup

stock = input('Stock Ticker? ').lower()

while stock != 'quit' and stock != 'exit':
    # Query for the HTML
    data = requests.get('http://www.marketwatch.com/investing/stock/' + stock)
    soup = BeautifulSoup(data.text,'html.parser')

    # Find the stock price in the HTML! Print it out!



      
      

    stock = input('Stock Ticker? ').lower()

Your task is to read the HTML and find the price, then fill in this program to get it! A full solution is shown here.
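
The general pattern is to find a tag and attribute that uniquely mark the price, then pull out its text. The sketch below runs on a tiny hand-written snippet: the span with class "price" is a made-up stand-in, not MarketWatch's actual HTML, so you still need to inspect the real page source yourself. It uses Python's built-in html.parser so that it runs with no extra installs; with BeautifulSoup, the equivalent lookup for this made-up markup would be something like soup.find('span', class_='price').

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; the real page will differ.
SAMPLE_HTML = """
<div class="quote">
  <h1>Example Corp</h1>
  <span class="price">123.45</span>
  <span class="change">+0.67</span>
</div>
"""

class PriceFinder(HTMLParser):
    """Collect the text of the first <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False  # True while inside the target span
        self.price = None      # Text of the first match, once found

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "span" and dict(attrs).get("class") == "price" and self.price is None:
            self.in_price = True

    def handle_data(self, data):
        # Only the first chunk of text inside the target span is kept.
        if self.in_price:
            self.price = data.strip()
            self.in_price = False

finder = PriceFinder()
finder.feed(SAMPLE_HTML)
print("Price:", finder.price)
```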

Live Stock Ticker: JSON

The above works, but what if there is an API available for the same thing? Does that make it easier? You be the judge. Yahoo Finance has a JSON API that lets you query for various stock ticker information. Unfortunately it is in "unofficial" mode so documentation on the JSON format is not readily available. But that's ok! If we have the URL to query, we can inspect the returned JSON ourselves and find what we need.

Below is starter code to connect to Yahoo! Finance -- it's a little more complicated to connect than before because the interface requires a session connection, not just a one-time GET query. However, you can see in this code that the end result is the same: we have a call to json().

import requests

stock = input('Stock Ticker? ').lower()

while stock != 'quit' and stock != 'exit':

    # Build the JSON query
    query = 'http://query1.finance.yahoo.com/v11/finance/quoteSummary/' + stock + '?modules=financialData'

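    # Open a session and send the request through it (not through requests.get)
    with requests.Session() as session:
        header = {'Connection': 'keep-alive',
                   'Expires': '-1',
                   'Upgrade-Insecure-Requests': '1',
                   # Adjacent string literals join without stray whitespace
                   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
                   }
        website = session.get(query, headers=header)
        json = website.json()

    # Your code here! Use the json variable! Extract and print the price.

    # Ask for the next ticker so the loop can end
    stock = input('Stock Ticker? ').lower()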

As with the HTML scraper, your task is to read the JSON and find the price, then fill in this program to print it! A full solution is shown here.
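
When an API is undocumented, one way to explore a response is to pretty-print it and search for likely key names. The sketch below does this on a small made-up dictionary; its keys and nesting are hypothetical and are not claimed to match Yahoo's real response. The helper find_key_paths is likewise invented for illustration.

```python
import json as jsonlib  # aliased so it does not clash with a variable named json

def find_key_paths(obj, needle, path=()):
    """Yield (path, value) for every dict key whose name contains `needle`."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if needle in key.lower():
                yield path + (key,), value
            # Keep descending, since matches can be nested at any depth.
            yield from find_key_paths(value, needle, path + (key,))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_key_paths(item, needle, path + (index,))

# Hypothetical response shape, for illustration only.
sample = {
    "result": [
        {"symbol": "XYZ",
         "quote": {"lastPrice": 12.34, "currency": "USD"}}
    ]
}

# Pretty-print the whole structure, then list every key mentioning "price".
print(jsonlib.dumps(sample, indent=2))
for path, value in find_key_paths(sample, "price"):
    print(path, "->", value)
```

Once you know the path, you can index into the parsed JSON directly; for this made-up sample that would be sample["result"][0]["quote"]["lastPrice"].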

Discussion: JSON vs HTML

Which do you find easier to program? Which is more efficient to execute? Which is more reliable? These are different questions, and their answers can vary depending on the task. More often than not, JSON is more reliable and efficient to use in your programs because the authors of a JSON API rarely change its format on a whim. HTML, however, can change often to satisfy human viewing demands, advertisement insertion, or anything else. The location of a price on a webpage can change drastically, and when it does, it will break your HTML scraper.

That said, an HTML scraper can be very quick and efficient if your webpage source is straightforward and simple. A JSON API is often not available, in which case you must scrape the HTML or go without. In the end, it might come down to what is available to you.

Bonus Task: HTML scraping S&P stock tickers!

The above retrieves prices, but what if you don't even know what tickers are available? You can find all the S&P stocks on this Wikipedia page, and you might want to write a scraper to grab them! Whenever the list changes, your program can pick up the update automatically.

Take a look at the webpage and write a program to print all stock symbols. When finished, you can come back and see our full solution here.