Data scraping, or web scraping, is the process of importing information from a website or multiple websites into a spreadsheet or a file. It's a common practice for people doing research, finding the best deals (think of travel websites like Kayak), etc.
Concrete example: Say you want to know how a certain politician has voted throughout the years without having to click through a website: https://clerk.house.gov/Votes
Doing it by hand: click "View Details" for each bill -> copy and paste.
But we can do much better through coding and web scraping.
Steps:
1. Install and import the libraries we need (requests and beautifulsoup4).
2. Download the page's HTML with requests.
3. Parse the HTML into a BeautifulSoup object.
4. Find the elements we care about and extract their text.
Let's try this on a simple website first: https://realpython.github.io/fake-jobs/
import sys
!{sys.executable} -m pip install requests # requests: python library that allows you to access websites + resources
!{sys.executable} -m pip install beautifulsoup4 # bs4: python library for parsing structured data
# Store the website's content so we can work with it in Python
import requests
from bs4 import BeautifulSoup
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL) # gets the website's page, returns a requests.Response object
#print(page.status_code) # 200 means okay, 404 means page not found
#print(page.text) # could also print page.url, page.links, ...
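#page.raise_for_status() # (optional extra, not in the original tutorial) raises an exception if the request failed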
soup = BeautifulSoup(page.content, "html.parser") # creates a BeautifulSoup object which stores the page's
# information in a format that's easy to work with
# passing page.content (raw bytes) instead of page.text helps avoid character-encoding issues
# "html.parser" tells BeautifulSoup to use Python's built-in HTML parser
results = soup.find(id="ResultsContainer")
#print all job titles on the page
job_elements = results.find_all("div", class_="card-content")
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    print(title_element.text.strip())
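The same cards also hold the company name and location, so we can grab several fields at once and save them to a file, tying back to the "spreadsheet or a file" idea from the start. This extra snippet is not part of the original walkthrough; it assumes the fake-jobs cards use an h3 with class "company" and a p with class "location", which is what the page currently serves.
import csv
rows = []
for job_element in job_elements:
    title = job_element.find("h2", class_="title").text.strip()
    company = job_element.find("h3", class_="company").text.strip()
    location = job_element.find("p", class_="location").text.strip()
    rows.append([title, company, location])
with open("fake_jobs.csv", "w", newline="", encoding="utf-8") as f: # write a CSV you can open in Excel/Sheets
    writer = csv.writer(f)
    writer.writerow(["title", "company", "location"])
    writer.writerows(rows)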
# But let's say we only care about the jobs that explicitly mention python
# We can use a lambda function
python_jobs = results.find_all("h2", string=lambda text: "python" in text.lower())
for job in python_jobs:
    print(job.text.strip())
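If you also want the company and apply link for each Python match, you can climb from the h2 back up to its surrounding card. This is a minimal sketch, assuming the nesting used on the fake-jobs page (three .parent hops from the title to the card, with the second link in each card being "Apply"); adjust it for other sites.
for h2_element in python_jobs:
    card = h2_element.parent.parent.parent # h2 -> media-content -> media -> card-content (assumed nesting)
    company = card.find("h3", class_="company").text.strip()
    links = card.find_all("a")
    apply_url = links[1]["href"] if len(links) > 1 else None # second link is the "Apply" link on this page
    print(h2_element.text.strip(), "|", company, "|", apply_url)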
For more information on any of the above code, visit: https://realpython.com/beautiful-soup-web-scraper-python/
If you're interested in doing a project involving web scraping, try the following exercises:
Most social media sites and other companies make their data easily accessible through an API (application programming interface). I won't go over APIs today, but if a few people think it would be useful for their project, I can cover the topic. In most cases, though, you can reuse code that someone else has already written, or the dataset already exists.
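To give a feel for what an API call looks like, here is a minimal sketch using GitHub's public REST API as a stand-in example (it has nothing to do with the vote or job data above): the same requests library fetches JSON instead of HTML, so there is no HTML parsing to do.
import requests
response = requests.get("https://api.github.com/users/octocat") # public endpoint describing a GitHub user
data = response.json() # parse the JSON body into a Python dictionary
print(data["login"], "has", data["public_repos"], "public repositories")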
Note: If you cannot get this to work on your computer, let Dominic know and he will help you!
import sys
!{sys.executable} -m pip install twint # Twitter intelligence tool, Twitter scraping tool
!{sys.executable} -m pip install nest_asyncio # Allows Twint's asyncio event loop to work inside Jupyter notebooks
import nest_asyncio
nest_asyncio.apply()
import twint
c = twint.Config() # configure twint object
c.Username = "kanyewest" # Choose tweets from a specific twitter user
#c.Search = "Stranger Things" # Choose tweets with specific key words
c.Limit = 50 # maximum number of tweets to collect
c.Store_csv = True # save the results as a CSV file
c.Output = 'Kanye_tweet_data.csv' # name of the output file
twint.run.Search(c)
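Once Twint finishes, the tweets live in the CSV named above, so you can load them back into Python for analysis. This is a minimal sketch using pandas, assuming Twint created Kanye_tweet_data.csv and that it includes a 'tweet' column (exact column names can vary between Twint versions).
import pandas as pd
tweets = pd.read_csv('Kanye_tweet_data.csv')
print(tweets.shape) # how many tweets (rows) and fields (columns) were scraped
print(tweets['tweet'].head()) # preview the first few tweet texts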