
Data scraping

Data scraping, or web scraping, is the process of importing information from a website or multiple websites into a spreadsheet or a file. It's a common practice for people doing research, finding the best deals (think of travel websites like Kayak), and so on.

Concrete example: say you want to know how certain politicians have voted throughout the years without having to click through a website by hand: https://clerk.house.gov/Votes

Doing it by hand: click "View details" for each bill -> copy and paste.

But we can do much better through coding and web scraping.

Steps:

  1. Manually inspect the data source (the website). The better you know the website, the easier it will be to scrape.
  2. Scrape the HTML content from the page. HTML is the markup language used to build websites.
  3. Parse the HTML code from the page (see the small example after this list).
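
As a quick illustration of steps 2 and 3 before we touch a real site, the cell below parses a tiny HTML string I made up for this example; BeautifulSoup works the same way whether the HTML comes from a string or from a website. (Run the install cell further down first if you don't have beautifulsoup4 yet.)

In [ ]:
from bs4 import BeautifulSoup

# A made-up HTML fragment, just to show what "parsing" means
html = """
<div class="job">
  <h2 class="title">Data Analyst</h2>
  <p class="location">Madison, WI</p>
</div>
"""

snippet = BeautifulSoup(html, "html.parser")
print(snippet.find("h2", class_="title").text)         # Data Analyst
print(snippet.find("p", class_="location").text)       # Madison, WI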

Let's try this on a simple website first: https://realpython.github.io/fake-jobs/

In [ ]:
import sys

!{sys.executable} -m pip install requests             # requests: python library that allows you to access websites + resources
!{sys.executable} -m pip install beautifulsoup4       # bs4: python library for parsing structured data
In [ ]:
# Store the website's content so we can use it in Python
import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)                           # gets the website page, returns a requests.Response object

#print(page.status_code)                           # 200 means okay, 404 means page not found
#print(page.text)                                  # could also print page.url, page.links, ...

soup = BeautifulSoup(page.content, "html.parser")           # creates a BeautifulSoup object, which stores the page's
                                                            #      information in a format that's easy to work with
                                                            # passing page.content (raw bytes) helps with character encoding
                                                            # "html.parser" tells BeautifulSoup which parser to use

results = soup.find(id="ResultsContainer")

# print all job titles on the page
job_elements = results.find_all("div", class_="card-content")
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    print(title_element.text.strip())
In [ ]:
# But let's say we only care about the jobs that explicitly mention python
# We can use a lambda function

python_jobs = results.find_all("h2", string=lambda text: "python" in text.lower())

for job in python_jobs:
    print(job.text.strip())

For more information on any of the above code, visit: https://realpython.com/beautiful-soup-web-scraper-python/

If you're interested in doing a project involving web scraping, try the following exercises (a starter sketch for exercise 1 follows the list):

  1. From the same page we have been working on above, print out all roles that mention "Engineer", along with the company name and location
  2. Write a function that takes in a string and outputs all jobs (along with the company name and location) that mention the string in their job title
  3. Write a function that takes in a string and outputs all jobs (along with the company name and location) that are in the state corresponding to the input string
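
To get you started on exercise 1, here is a minimal sketch that pulls the company name and location for each job card. It assumes the fake-jobs page tags companies with a "company" class and locations with a "location" class; verify the exact tag and class names with your browser's inspector before relying on them.

In [ ]:
# Starter sketch for the exercises: company name and location for each card
# (assumes class names "company" and "location"; check them in your browser's inspector)
for job_element in job_elements:
    title = job_element.find("h2", class_="title")
    company = job_element.find("h3", class_="company")
    location = job_element.find("p", class_="location")
    if title and company and location:
        print(title.text.strip(), "|", company.text.strip(), "|", location.text.strip())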

Scraping a social media site

Most social media sites and other companies make their data easily accessible through an API (application programming interface). I won't go over APIs today, but if a few people think it would be useful for their project then I can cover this topic. In most cases, though, you can use code that someone else wrote, or the dataset already exists.
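
As a quick taste of what working with an API looks like, the cell below requests JSON from GitHub's public user endpoint; it's only an example, but most APIs follow this same request-then-parse-JSON pattern (many also require an API key).

In [ ]:
import requests

# Example of calling a JSON API: GitHub's public user endpoint
# (no authentication needed for light usage; other APIs often require an API key)
response = requests.get("https://api.github.com/users/octocat")
data = response.json()                       # parse the JSON body into a Python dict

print(response.status_code)                  # 200 means okay
print(data["login"], data["public_repos"])   # a couple of the fields this API returns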

Note: If you cannot get this to work on your computer, let Dominic know and he will help you!

In [ ]:
import sys
!{sys.executable} -m pip install twint                   # twint: Twitter Intelligence Tool, a Twitter scraping tool

!{sys.executable} -m pip install nest_asyncio            # allows twint's event loop to run inside Jupyter notebooks
In [ ]:
import nest_asyncio
nest_asyncio.apply()
In [ ]:
import twint

c = twint.Config()                            # configure twint object
c.Username = "kanyewest"                      # Choose tweets from a specific twitter user
#c.Search = "Stranger Things"                 # Choose tweets with specific key words
c.Limit = 50                                  # cap the number of tweets to scrape
c.Store_csv = True                            # save the results as a CSV file
c.Output = 'Kanye_tweet_data.csv'             # name of the output file
twint.run.Search(c)                           # run the scrape with this configuration
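
Once the scrape finishes, the tweets are sitting in the CSV file named above. Here is a minimal sketch for loading them back into Python, assuming you have pandas installed (install it the same way as the other packages if not); the exact columns depend on the twint version, so print them out rather than assuming their names.

In [ ]:
import pandas as pd

# Load the scraped tweets back in and take a look at what twint saved
df = pd.read_csv('Kanye_tweet_data.csv')
print(df.columns)
print(df.head())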