
Python for Harvesting Data on the Web

This is an intermediate-to-advanced Python workshop on approaches to common web data-wrangling tasks for research. We first focus on obtaining open data through HTTP requests, then demonstrate how to access larger data sources through APIs. Finally, we show how to turn the retrieved data into more useful objects, such as data frames, for basic manipulation and analysis. This workshop is recommended only for Python users who are familiar with Pandas, NumPy, core Python objects (lists, dictionaries, strings, numbers), and file types like JSON and CSV, and who are comfortable using Jupyter Notebooks.
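The request, parse, and tabulate workflow described above can be sketched with the standard library alone. This is only an illustrative outline, not the workshop's own code: the helper names `fetch_text` and `parse_records` and the sample payload are made up for this example, and the demonstration runs offline on a hard-coded JSON string rather than a real data source.

```python
import json
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Send an HTTP GET request and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def parse_records(json_text):
    """Decode a JSON array of objects into a list of dictionaries."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


# Offline demonstration with a small, made-up payload (not real data).
sample = '[{"city": "Pittsburgh", "count": 3}, {"city": "Erie", "count": 1}]'
records = parse_records(sample)
total = sum(row["count"] for row in records)
print(total)
```

For a live source, `parse_records(fetch_text(url))` would combine the two steps, with `url` pointing at an endpoint that returns a JSON array. With Pandas installed, passing `records` to `pandas.DataFrame` turns the list of dictionaries into a data frame.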

Presenters

Chasz Griego
Science and Engineering Librarian
Office: 4410, Sorrells Library
cgriego@andrew.cmu.edu

Lencia Beltran
Open Science Program Coordinator
Office: 4416, Sorrells Library
lbeltran@andrew.cmu.edu

Goals of this Workshop

This workshop introduces learners to basic approaches for extracting information from the web using Python. The topics covered will give learners enough foundational knowledge to harvest information from several kinds of web sources.

Setup

To be best prepared for this workshop, please make sure you have Python installed, preferably with the Anaconda distribution, before attending. You may also install some of the libraries we will be using ahead of time:

  • urllib (included in Python's standard library; no separate installation needed)
  • beautifulsoup4
  • MechanicalSoup
  • sodapy
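
One way to confirm ahead of time that these libraries are available is to ask Python whether each one can be imported. The sketch below is a minimal check and not part of the workshop materials. It assumes the usual import names (`bs4` for beautifulsoup4, `mechanicalsoup` for MechanicalSoup), and the helper name `check_libraries` is made up for illustration.

```python
from importlib.util import find_spec

# Map each package name (as installed) to the module name used in `import`.
# The import names are assumptions based on each project's usual convention.
WORKSHOP_LIBRARIES = {
    "urllib": "urllib",  # part of the Python standard library
    "beautifulsoup4": "bs4",
    "MechanicalSoup": "mechanicalsoup",
    "sodapy": "sodapy",
}


def check_libraries(libraries=WORKSHOP_LIBRARIES):
    """Return {package name: True/False} for whether each module can be found."""
    return {
        package: find_spec(module) is not None
        for package, module in libraries.items()
    }


if __name__ == "__main__":
    for package, installed in check_libraries().items():
        status = "installed" if installed else "missing"
        print(f"{package}: {status}")
```

Anything reported as missing can usually be installed with `pip install` followed by the package name, or through your Anaconda environment.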

Schedule

Section                                   Time
Setup
Scrape and Parse Text from the Web        00:00
Using an HTML Parser                      00:20
Interacting with HTML Forms               00:40
Interacting with Websites in Real Time    01:00
Web Requests                              01:20
Simple Web API Requests                   01:40
Finish                                    02:00

Accessing the Curriculum

See the concepts covered in this workshop as a Jupyter Notebook on nbviewer:

nbviewer

Or work through the course content in JupyterLab on Binder:

Binder

Post-Workshop Survey

Please complete this survey after attending the workshop. Thank you in advance!!!

