Text Mining
From searching for information
to computer-aided reading
Jeri Wieringa
PhD Student in History
George Mason University
@jeriwieringa
Big Data + Old History
by Adam Crymble
What is Distant Reading?
Data Mining:
"The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."
Frequently Used Terms
- N-Gram: A sequence of n consecutive words (a single word, a pair, a phrase). N-gram tools count how often a word or set of words appears across a set of documents.
- KWIC (Key Words in Context): Search results or index where the key word is displayed with the words that appear immediately before and after.
- Collocate: Words that tend to appear together.
- Stop Words: Words removed before the text is analyzed.
- Topic Models or Topic Modeling: Particular type of text mining that groups words that appear together as "topics" and gives a statistical report of which topics appear in each document.
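To make a few of these terms concrete, here is a minimal Python sketch of stop-word removal, n-gram counting, and a KWIC display. The sample text, the stop-word list, and all function names are invented for illustration; they are not taken from any particular tool, and real projects usually use a longer stop-word list.

```python
import re
from collections import Counter

# Hypothetical sample text and stop-word list, chosen only for illustration.
TEXT = "The telegraph changed the news. The telephone changed the home."
STOP_WORDS = {"the", "a", "an", "of", "and"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop the stop words before analysis."""
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def kwic(tokens, keyword, window=2):
    """Show each occurrence of keyword with `window` words on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = tokenize(TEXT)
content_words = remove_stop_words(tokens)
bigram_counts = Counter(ngrams(tokens, 2))  # 2-grams, also called bigrams

print(content_words)
print(bigram_counts.most_common(3))
for line in kwic(tokens, "changed"):
    print(line)
```

Counting pairs of neighboring words, as the bigram step does, is also one simple way to start looking for collocates.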
What can historians do
with Text-Mining?
- Categorize documents into groups
- Contrast two groups of documents
- Trace words and groupings of words over time
- Cluster words that tend to be associated in documents
- Find proper nouns in and across texts
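As one illustration of the list above, tracing a word over time can be sketched in a few lines of Python. The mini-corpus below is made up for the example, and the function name is hypothetical. The sketch reports the share of all words in each year that match the search word, so that years with more text do not look more important simply because they are longer.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: (year, text) pairs invented for illustration.
DOCUMENTS = [
    (1900, "The telegraph office opened. The telegraph was busy."),
    (1950, "The television arrived and the radio stayed."),
    (1950, "A television in every home."),
    (2000, "The computer and the internet replaced the television."),
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency_by_year(documents, word):
    """Return {year: occurrences of word / total words in that year}."""
    word_counts = Counter()
    total_counts = Counter()
    for year, text in documents:
        tokens = tokenize(text)
        word_counts[year] += tokens.count(word)
        total_counts[year] += len(tokens)
    return {year: word_counts[year] / total_counts[year]
            for year in sorted(total_counts)}

trend = relative_frequency_by_year(DOCUMENTS, "television")
for year, share in trend.items():
    print(year, round(share, 3))
```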
How is this used in historical research?
Categorize Documents: Mapping Texts
How is this used? (2)
Contrast Documents: Separating by Genre
How is this used? (3)
Tracking Language Over Texts: Gender in Shakespeare
How is this used? (4)
Clustering words that appear together:
Topic Modeling Martha Ballard's Diary
What do you need for Text Mining?
- A Question
- A large collection of texts, words, or other data
- Tools for analyzing (or reading) the texts
- A rubric for interpreting the results
What are some of the tools?
Google N-gram and BYU Google Interface
Tools (2)
Voyant
Tools (3)
What are some challenges of
text mining for historical research?
- Need to use math
- Need enough texts or data to answer your questions
- Need clean texts
What are some of the advantages?
- Ask new questions
- Work with a large number of texts
- Learn valuable skills
In Class Project
- Goal is to use the tools to answer the question
- Choose a dataset
- Break the question into smaller questions
- Determine what tests you could run to answer those questions
- Run the tests and gather the data
- Keep track of what you tested and what the results were
- Use the data to construct your answer to the question
- Charts and graphs are encouraged.
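One possible way to keep track of what you tested is a small log that can be saved as a spreadsheet. This is only a sketch: the column names, the helper functions, and the example entry are assumptions made for illustration, and the result field is a placeholder rather than a real finding.

```python
import csv
import io

# Hypothetical column names for the test log.
FIELDS = ["sub_question", "tool", "search_terms", "result"]

def record_test(log, sub_question, tool, search_terms, result):
    """Add one test and its outcome to the running log."""
    log.append({
        "sub_question": sub_question,
        "tool": tool,
        "search_terms": search_terms,
        "result": result,
    })

def log_to_csv(log):
    """Return the log as CSV text, ready to paste into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(log)
    return buffer.getvalue()

log = []
record_test(
    log,
    sub_question="Which devices are mentioned most in each decade?",
    tool="Google N-gram",
    search_terms="telephone, radio",
    result="(write down what you observed here)",  # placeholder, not real data
)
csv_text = log_to_csv(log)
print(csv_text)
```

A notebook page or a shared spreadsheet works just as well. The point is that each row pairs a test with its result, so the final answer can be traced back to the evidence.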
In Class Project
Question: How have descriptions of technology changed over the 20th century?
Goal: A one- or two-phrase description of the change.
Option 1: Google Book Corpus for US
Option 2: Time Magazine Corpus
If you are interested and want to learn more:
- Take a statistics course
- Play
- Read
  - Journal of Digital Humanities 2.1
  - Literary and Linguistic Computing
  - Blogs
    - Programming Historian (http://programminghistorian.org/)
    - Ted Underwood (tedunderwood.com/)
    - Scott Weingart (http://www.scottbot.net/HIAL/)
- Practice
Introduction to Text Mining
By Jeri Wieringa
Short talk for History 390 at George Mason University