Python for Harvesting Data on the Web
This is an intermediate-to-advanced Python workshop on approaches to common web data-wrangling tasks for research. We will focus on obtaining open data sources through HTTP requests and then demonstrate how to access larger sources of data via APIs. Then we will show how to turn the retrieved data into more useful objects, such as data frames, for basic manipulation and analysis. This workshop is recommended only for Python users who are familiar with pandas, NumPy, core Python objects (lists, dictionaries, strings, numbers), and file types like JSON and CSV, and who are comfortable using Jupyter Notebooks.
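As a rough preview of that workflow, the sketch below fetches JSON over HTTP with the standard library and hands the parsed records to pandas when it is available. It is an illustration under stated assumptions, not the workshop's own code: the function name `fetch_json`, the sample records, and their field names are all made up, and the fetch function is defined but never called, so nothing here touches the network.

```python
import json
from urllib.request import urlopen

try:
    import pandas as pd  # optional here; the workshop itself assumes pandas
except ImportError:
    pd = None


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON.

    Hypothetical helper: pass it the URL of any endpoint that returns JSON.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Made-up sample standing in for an API response, so the sketch runs offline.
sample_body = '[{"name": "alpha", "count": 3}, {"name": "beta", "count": 1}]'
records = json.loads(sample_body)

# A list of dictionaries is a shape that pandas.DataFrame accepts directly.
table = pd.DataFrame(records) if pd is not None else records
print(table)
```

Keeping the download step separate from the parsing step makes it easier to test the parsing logic against saved responses before sending real requests.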
Presenters
Chasz Griego
Science and Engineering Librarian
Office: 4410, Sorrells Library
cgriego@andrew.cmu.edu
Lencia Beltran
Open Science Program Coordinator
Office: 4416, Sorrells Library
lbeltran@andrew.cmu.edu
Goals of this Workshop
This workshop will introduce learners to basic approaches for extracting information from the web using Python. The topics covered will give learners enough foundational knowledge to harvest information from several web sources.
Setup
To be best prepared for this workshop, please make sure you have Python installed, preferably with the Anaconda distribution, prior to attending. You may also install some of the libraries we will be using ahead of time:
- urllib (part of the Python standard library, so no separate installation is needed)
- beautifulsoup4
- MechanicalSoup
- sodapy
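One way to confirm the setup before the session is to check that each library can be found by the interpreter. The snippet below is a minimal sketch of that check; the helper name `check_installed` is hypothetical. It relies on the import names of the packages, which differ from the install names in two cases: `beautifulsoup4` is imported as `bs4`, and `MechanicalSoup` as `mechanicalsoup`.

```python
from importlib.util import find_spec

# Install name -> import name for the libraries listed above.
REQUIRED = {
    "urllib": "urllib",
    "beautifulsoup4": "bs4",
    "MechanicalSoup": "mechanicalsoup",
    "sodapy": "sodapy",
}


def check_installed(modules=REQUIRED):
    """Return a dict mapping each install name to True if it can be imported."""
    return {pkg: find_spec(mod) is not None for pkg, mod in modules.items()}


status = check_installed()
for pkg, found in status.items():
    print(f"{pkg}: {'found' if found else 'missing'}")
```

Any package reported as missing can be installed with your usual package manager before the workshop.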
Schedule
Section | Time |
---|---|
Setup | |
Scrape and Parse Text from the Web | 00:00 |
Using an HTML Parser | 00:20 |
Interacting with HTML Forms | 00:40 |
Interacting with Websites in Real Time | 01:00 |
Web Requests | 01:20 |
Simple Web API Requests | 01:40 |
Finish | 02:00 |
Accessing the Curriculum
See the concepts covered in this workshop as a Jupyter Notebook on nbviewer:
Or work through the course content in JupyterLab on Binder:
Post-Workshop Survey
Please complete this survey after attending the workshop. Thank you in advance!!!