Metadata Heatmaps for Distant Reading

This video explores how metadata can be used to undertake a distant reading of a textual corpus, in this case 1,8000 dissertations, for the purposes of rhetorical analysis. It shows how heatmaps of metadata can provide visualizations that point the way to further analysis in R. The video also shows how to use the data cleaning tool OpenRefine to convert MARC codes into a more readable format such as CSV or TSV.
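As a rough illustration of the idea, a metadata heatmap starts from a table of counts: how many records share each pair of values across two metadata fields. The sketch below is not the workflow from the video (which uses R); it is a minimal, dependency-free Python version, and the column names (`year`, `method`), the sample rows, and the helper functions are all hypothetical stand-ins for whatever a TSV export from OpenRefine might contain.

```python
import csv
import io
from collections import Counter

# Hypothetical TSV export; the field names and rows are invented for illustration.
SAMPLE_TSV = (
    "year\tmethod\n"
    "2001\tethnographic\n"
    "2001\tarchival\n"
    "2001\tethnographic\n"
    "2002\tarchival\n"
    "2002\tarchival\n"
)


def count_matrix(tsv_text, row_field, col_field):
    """Cross-tabulate two metadata fields into a matrix of record counts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter((rec[row_field], rec[col_field]) for rec in reader)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    # Counter returns 0 for pairs that never occur, so empty cells stay empty.
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix


def render_text_heatmap(rows, matrix, shades=" .:*#"):
    """Map each count to a character, darker meaning more records.

    Assumes the matrix has at least one row.
    """
    peak = max(max(line) for line in matrix) or 1
    out = []
    for label, line in zip(rows, matrix):
        cells = "".join(shades[(v * (len(shades) - 1)) // peak] for v in line)
        out.append(f"{label} |{cells}|")
    return "\n".join(out)


rows, cols, matrix = count_matrix(SAMPLE_TSV, "year", "method")
print(cols)
print(render_text_heatmap(rows, matrix))
```

A real heatmap would hand the same count matrix to a plotting library and use color instead of characters; the point here is only that the visualization is a direct picture of the cross-tabulated metadata.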

Further Reading and Resources

I highly recommend http://openrefine.org: it’s an excellent, free tool for data inspection and transformation. OpenRefine lives in your browser, but operates locally, so you don’t have to share data with any third parties. The site includes walkthrough videos for common operations.

For inspiration and some great tutorials on data visualization, including heat maps, check out https://flowingdata.com.

The R software includes extensive built-in help pages, which I found were enough to get my bearings (especially when supplemented with web searches that mostly led to Stack Overflow). But Springer has put out several books specific to learning R for DH, including Humanities Data in R (Arnold and Tilton), Text Analysis with R for Students of Literature (Jockers), and Corpus Linguistics and Statistics with R (Desagulier). I haven’t read any of them, so your mileage may vary, but if I were starting again I’d probably start with one of these.

If you do get into R, before too long you’ll likely want to use the interoperable collection of add-on packages known as “the tidyverse,” which makes it easier to rearrange and re-represent your data (without having to kick it back out to OpenRefine). There’s a series of free courses by the creators at https://www.datacamp.com/courses/introduction-to-the-tidyverse.

Distant Reading, Corpus Linguistics, Computational Linguistics, Text Mining and Analytics

Posted by

Benjamin Miller is an Assistant Professor of English at the University of Pittsburgh, focusing on digital research and pedagogy. He is the author of “Mapping the Methods of Composition/Rhetoric Dissertations: A ‘Landscape Plotted and Pieced,’” an article drawing on data visualization of several thousand documents, published in College Composition and Communication. Ben received a CCCC Emergent Research/er Grant in 2017 for work toward his multimodal book project, “Distant Readings of Disciplinarity: Knowing and Doing in Composition/Rhetoric Dissertations.”

Similar Projects by Discipline

English

How to grow data forests with XML trees

Elisa Beshero-Bondar

eXtensible Markup Language (XML).

Stylometry and Authorship Analysis

Patrick Juola

Machine learning to identify authors.

DocuScope

David Kaufer

Computer Support for Close Reading and Textual Analysis in DH.

Logistic Regression

Matthew J. Lavin

Machine learning for literary analysis.

The Historical TV Guide

Kathy M. Newman, Steven Gotzler

Using digitized text to study television history.

Data Visualization: Tableau

Emma Slayton

Data visualization with Tableau.

Shakespeare-VR

Stephen Wittek

Building immersive VR projects.

Literature

How to grow data forests with XML trees

Elisa Beshero-Bondar

eXtensible Markup Language (XML).

Stylometry and Authorship Analysis

Patrick Juola

Machine learning to identify authors.

DocuScope

David Kaufer

Computer Support for Close Reading and Textual Analysis in DH.

Logistic Regression

Matthew J. Lavin

Machine learning for literary analysis.

Shakespeare-VR

Stephen Wittek

Building immersive VR projects.

Beyond the Ant Brotherhood

Tatyana Gershkovich

Dynamic digital archives of writings and timelines.

The Latin American Comics Archive (LACA)

Felipe Gómez

Online archives in comic book markup language.

Similar Projects by Topics

Distant Reading

Marriage & Divorce of Capitalism & Democracy

Simon DeDeo

DH methods for interdisciplinary studies and results.

The Historical TV Guide

Kathy M. Newman, Steven Gotzler

Using digitized text to study television history.

Corpus Linguistics

Marriage & Divorce of Capitalism & Democracy

Simon DeDeo

DH methods for interdisciplinary studies and results.

The Historical TV Guide

Kathy M. Newman, Steven Gotzler

Using digitized text to study television history.

Building your own data set

AmyJo Brown

A journalist's approach.

Structure-based Network Analysis

S.E. Hackney

Structure-based network analysis.

Stylometry and Authorship Analysis

Patrick Juola

Machine learning to identify authors.

DocuScope

David Kaufer

Computer Support for Close Reading and Textual Analysis in DH.

Logistic Regression

Matthew J. Lavin

Machine learning for literary analysis.

Topic Modeling Subreddits

Chloe Perry

Computational techniques to topic model subreddits.

Computational Linguistics

Structure-based Network Analysis

S.E. Hackney

Structure-based network analysis.

Stylometry and Authorship Analysis

Patrick Juola

Machine learning to identify authors.

Logistic Regression

Matthew J. Lavin

Machine learning for literary analysis.

Topic Modeling Subreddits

Chloe Perry

Computational techniques to topic model subreddits.

Text Mining and Analytics

Marriage & Divorce of Capitalism & Democracy

Simon DeDeo

DH methods for interdisciplinary studies and results.

The Historical TV Guide

Kathy M. Newman, Steven Gotzler

Using digitized text to study television history.

Building your own data set

AmyJo Brown

A journalist's approach.

Structure-based Network Analysis

S.E. Hackney

Structure-based network analysis.

DocuScope

David Kaufer

Computer Support for Close Reading and Textual Analysis in DH.

Logistic Regression

Matthew J. Lavin

Machine learning for literary analysis.

Topic Modeling Subreddits

Chloe Perry

Computational techniques to topic model subreddits.

Last updated: August 29, 2019
https://github.com/cmu-lib/dhlg/blob/master/_projects/miller.md