Python for Harvesting Data on the Web

This intermediate-to-advanced Python workshop covers common approaches to gathering and wrangling data from the web for research. We focus on obtaining open data through HTTP requests, then demonstrate how to access larger data sources via APIs. Finally, we show how to turn the retrieved data into more useful objects, such as data frames, for basic manipulation and analysis. This workshop is recommended only for Python users who are familiar with pandas, NumPy, core Python objects (lists, dictionaries, strings, numbers), and file types like JSON and CSV, and who are comfortable using Jupyter Notebooks.
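
To give a feel for the request-then-reshape workflow described above, here is a minimal sketch that uses only the standard library. The helper names `fetch_json` and `to_records` and the sample payload are made up for illustration and are not taken from the workshop notebooks; the sample stands in for an API response so the snippet runs without a network connection.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode its body as JSON (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_records(payload, fields):
    """Keep only the chosen fields from each item in a JSON list.

    Fields missing from an item are filled with None.
    """
    return [{field: item.get(field) for field in fields} for item in payload]


# Made-up sample standing in for an API response, so this runs offline.
sample = json.loads(
    '[{"id": 1, "name": "A", "extra": true}, {"id": 2, "name": "B"}]'
)
records = to_records(sample, ["id", "name"])
print(records)
```

With pandas installed, a list of dictionaries like `records` can be passed directly to `pandas.DataFrame(records)` to get a data frame.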

Presenters

Chasz Griego
Science and Engineering Librarian
Office: 4410, Sorrells Library
cgriego@andrew.cmu.edu

Lencia Beltran
Open Science Program Coordinator
Office: 4416, Sorrells Library
lbeltran@andrew.cmu.edu

Goals of this Workshop

This workshop introduces basic approaches to extracting information from the web using Python. The topics covered will give learners enough foundational knowledge to harvest information from several web sources.

Setup

To be best prepared for this workshop, please make sure you have Python installed, preferably with the Anaconda distribution, before attending. You may also install some of the libraries we will be using ahead of time:

urllib (included in the Python standard library; no installation needed)
beautifulsoup4
MechanicalSoup
sodapy
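
One way to confirm your environment is ready is a quick import check like the sketch below. The function name `check_libraries` is made up for this example. The import names are the ones these packages are normally imported under (`bs4` for beautifulsoup4, `mechanicalsoup` for MechanicalSoup).

```python
import importlib.util

# Maps the name used when installing to the name used in an import statement.
WORKSHOP_LIBRARIES = {
    "urllib": "urllib",
    "beautifulsoup4": "bs4",
    "MechanicalSoup": "mechanicalsoup",
    "sodapy": "sodapy",
}


def check_libraries(libraries=WORKSHOP_LIBRARIES):
    """Return {install_name: bool}, True when the import name can be found."""
    return {
        install_name: importlib.util.find_spec(import_name) is not None
        for install_name, import_name in libraries.items()
    }


# Print a short report without actually importing the packages.
for name, available in check_libraries().items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

Anything reported as missing can be installed with your usual package manager (for example `pip` or `conda`) before the workshop.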

Schedule

Section	Time
Setup
Scrape and Parse Text from the Web	00:00
Using an HTML Parser	00:20
Interacting with HTML Forms	00:40
Interacting with Websites in Real Time	01:00
Web Requests	01:20
Simple Web API Requests	01:40
Finish	02:00

Accessing the Curriculum

See the concepts covered in this workshop as a Jupyter Notebook on nbviewer:

Or work through the course content in JupyterLab on Binder:

Post-Workshop Survey

Please complete this survey after attending the workshop. Thank you in advance!