Text Mining


From searching for information to computer-aided reading






Jeri Wieringa
PhD Student in History
George Mason University
@jeriwieringa

Big Data + Old History

by Adam Crymble


What is Distant Reading?



Data Mining: 


"The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."




Frequently Used Terms


  • N-Gram: A sequence of n consecutive words (a single word, a pair, a triple, and so on). N-gram tools report how frequently a word or phrase appears across a set of documents.
  • KWIC (Key Words in Context): Search results or index where the key word is displayed with the words that appear immediately before and after.
  • Collocate: Words that tend to appear together.
  • Stop Words: Very common words (such as "the," "and," and "of") that are removed before the text is analyzed.
  • Topic Models or Topic Modeling: A type of text mining that groups words that tend to appear together into "topics" and reports statistically how much of each topic appears in each document.
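
The first four terms above can be illustrated with a few lines of Python. This is only a minimal sketch using the standard library: the stop list, the sample sentence, and the function names (`tokenize`, `ngrams`, `kwic`, `collocates`) are made up for illustration and are not taken from any of the tools discussed in this talk.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "was", "is"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


def collocates(tokens, keyword, window=3):
    """Count the words found within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Made-up sample text, not drawn from any real corpus.
sample = ("The telegraph office was busy. "
          "The telegraph operator sent the message to the office.")
tokens = tokenize(sample)
bigram_counts = Counter(ngrams(tokens, 2))   # frequency of each two-word phrase
content_tokens = remove_stop_words(tokens)   # tokens left after the stop list

for left, word, right in kwic(tokens, "telegraph", window=2):
    print(f"{left:>12} [{word}] {right}")
```

Removing stop words before counting collocates, as in `collocates(content_tokens, "telegraph")`, keeps words like "the" from crowding out the more informative neighbors.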

What can historians do with Text Mining?


    • Categorize documents into groups
    • Contrast two groups of documents
    • Trace words and groupings of words over time
    • Cluster words that tend to be associated in documents
    • Find proper nouns in and across texts
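
As one example from the list above, tracing a word over time amounts to counting how often it appears in each period relative to the total number of words from that period. The sketch below assumes a corpus that is already available as (year, text) pairs. The four sample sentences and the function name `relative_frequency_by_decade` are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical (year, text) pairs standing in for a dated corpus.
documents = [
    (1905, "The telegraph carried the news."),
    (1908, "A telegraph line reached the town."),
    (1955, "The television and the computer arrived."),
    (1958, "A computer filled the room."),
]


def decade(year):
    """Map a year to the first year of its decade, e.g. 1958 -> 1950."""
    return year // 10 * 10


def relative_frequency_by_decade(docs, word):
    """Return occurrences of `word` per 1,000 tokens for each decade."""
    hits = Counter()
    totals = Counter()
    for year, text in docs:
        tokens = re.findall(r"[a-z']+", text.lower())
        d = decade(year)
        totals[d] += len(tokens)
        hits[d] += tokens.count(word)
    return {d: 1000 * hits[d] / totals[d] for d in sorted(totals)}


for d, rate in relative_frequency_by_decade(documents, "telegraph").items():
    print(f"{d}s: {rate:.1f} per 1,000 words")
```

Dividing by the total word count matters because most collections hold far more text from some decades than others, so raw counts alone can be misleading.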

How is this used in historical research?

Categorize Documents: Mapping Texts

How is this used? (2)

Contrast Documents: Separating by Genre

How is this used? (3)

Tracking Language Over Texts: Gender in Shakespeare


How is this used? (4)

Clustering words that appear together: Topic Modeling Martha Ballard's Diary


What do you need for Text Mining?


    • A Question
    • A large collection of texts, words, or other data
    • Tools for analyzing (or reading) the texts
    • A rubric for interpreting the results

What are some of the tools?


Google N-gram and BYU Google Interface  

Tools (2) 


Voyant  


Tools (3)


Programming languages and libraries 
  • Python
  • R
  • MALLET

What are some challenges of text mining for historical research?


  • Need to use math 

  • Need enough texts to answer your questions

  • Need clean texts 

What are some of the advantages?



  • Ask new questions

  • Work with a large number of texts

  • Learn valuable skills

In Class Project


  • Goal is to use the tools to answer the question


    • Choose a dataset
    • Break the question into smaller questions
    • Determine what tests you could run to answer those questions
    • Run the tests and gather the data
      • Keep track of what you tested and what the results were
    • Use the data to construct your answer to the question
      • Charts and graphs are encouraged.

In Class Project (2)


Question: How have descriptions of technology changed over the 20th century?

Goal: A one- or two-phrase description of the change.

Option 1: Google Book Corpus for US


Option 2: Time Magazine Corpus

If you are interested and want to learn more:


Introduction to Text Mining

By Jeri Wieringa

Short talk for History 390 at George Mason University