Intro to Topic Modeling
For topic modeling, your text needs to be sliced into sections so you can see change within the text. The file of slave narratives we'll be using is sliced by paragraph: Column A is the paragraph identifier, Column B is the text identifier (with a year, so we can see change over time), and Column C is the text to be analyzed. If you prepare your own topic model file, note that there are no column headers.
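As a rough sketch of what that preparation might look like (the file names and the paragraph-splitting rule here are hypothetical, not part of the lesson's files), pandas can write the three-column, headerless file:

```python
# Hypothetical sketch of preparing a headerless three-column topic-model file.
import pandas as pd

rows = []
for text_id in ["jacobs_1861"]:  # hypothetical text identifier with a year
    full_text = open(f"{text_id}.txt", encoding="utf-8").read()
    # Slice each work into paragraphs so we can see change within the text.
    for i, para in enumerate(p.strip() for p in full_text.split("\n\n") if p.strip()):
        rows.append((f"{text_id}_{i:04d}", text_id, para))

# Column A: paragraph ID; Column B: text ID; Column C: text. No header row.
pd.DataFrame(rows).to_csv("narratives.csv", index=False, header=False)
```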
Our file includes more than five thousand rows from twenty works.
Topic modeling builds "bags of words" that commonly occur together in a text. The algorithm looks through the text for words that often occur near each other, then reports each topic as a cluster of its most strongly associated words (typically the top 8-12 are displayed).
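To make the idea concrete, here is a minimal sketch using scikit-learn's LDA implementation (an assumption on my part; it is not the browser tool this lesson uses) that prints each topic's top words:

```python
# Illustrative sketch with scikit-learn's LDA, not the browser tool itself.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["first paragraph ...", "second paragraph ..."]  # your sliced texts
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X)
vocab = vec.get_feature_names_out()

# Each topic is a weighting over the whole vocabulary; we show its top 10 words.
for k, weights in enumerate(lda.components_):
    print(k, [vocab[i] for i in weights.argsort()[::-1][:10]])
```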
Words can appear in multiple topics if they associate with other words in very different contexts. Think of topics sort of like modularity clusters in network analysis.
The topics themselves may be meaningless as sentences, but they can still tell us something about a text--if emotion words are often associated with a place, or if gendered words are often associated with actions, for example. Slicing a text into subsections can also show how the topics of a work shift over its course. Running topic models on multiple texts lets us compare the texts with one another, or trace change across texts over time.
This site runs topic modeling in your browser, and it's what we'll be using in class and for this lesson. For future work, you may want to run topic models with the Topic Model GUI software in case the website goes down, though the software can be tricky to install.
After uploading the Slave Narratives file, you need to upload a stopwords list. The stopwords list is a list of words the analysis will disregard. It includes things like the, and, he, they, is, were, etc.
We'll use this stopwords list of the most common 20th century English words. Depending on the time period and language of your texts, you may need a different stopwords list!
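If you're scripting the analysis yourself, a custom list can be swapped in the same way (the filename below is hypothetical):

```python
# Sketch: load a period-appropriate stopword list and pass it to the vectorizer.
from sklearn.feature_extraction.text import CountVectorizer

with open("en_stopwords.txt", encoding="utf-8") as f:  # hypothetical filename
    stops = [w.strip() for w in f if w.strip()]

vec = CountVectorizer(stop_words=stops)  # these words will be disregarded
```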
After uploading your texts, you need to train the algorithm to find the topics. We do this by setting the number of topic categories we want to see, then running the algorithm for many iterations so the results stabilize.
You may get different results each time you run a batch of iterations, or different results from those shown on the example slides. This is because each run starts from a random state and uses random samples of words to find co-occurring words. Running many iterations helps stabilize the topics, because the software compares results across those iterations.
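In script form, the same two levers look roughly like this (a sketch, reusing X from the earlier scikit-learn example): more iterations for stability, and a fixed seed if you want reproducible runs.

```python
from sklearn.decomposition import LatentDirichletAllocation

# More iterations stabilize topics; a fixed random_state makes runs repeatable.
lda = LatentDirichletAllocation(n_components=20, max_iter=500, random_state=42)
doc_topics = lda.fit_transform(X)  # rows: paragraphs, columns: topic proportions
```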
Depending on the number of topics you choose, words may cluster differently, since more topics means more fine-grained associations: the algorithm only looks for words that are very tightly associated. There is no single right number of topics, but generally, if you see the same word repeated across many topics, you have too many. Another sign of too many topics is when several topics appear in only one or a few texts rather than across all of them--the topics are so small and specific that they can't be compared across texts.
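A rough diagnostic, continuing the sketch above (it assumes lda and vocab from the earlier blocks): count how many topics each word appears in among the top words; words showing up in many topics suggest the topic count is too high.

```python
from collections import Counter

# Assumes lda and vocab from the earlier sketch.
seen = Counter()
for weights in lda.components_:
    for i in weights.argsort()[::-1][:10]:
        seen[vocab[i]] += 1

# Words in the top-10 of three or more topics hint at too many topics.
print([w for w, n in seen.items() if n >= 3])
```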
While the Topic Documents tab lets us see our topics in context, the Topic Correlations tab lets us see how often topics occur together. Small dots mark pairs of topics that appear together at about the average rate; large dots mark pairs that occur together much more (or less) often than average. This might lead us to questions for close reading: why was there an association between time, overseer, and locations away from a plantation?
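With a doc-topic matrix in hand, one way to approximate this view (a sketch, not the tool's exact calculation) is to correlate topic proportions across paragraphs:

```python
import numpy as np

# doc_topics from the training sketch: one row per paragraph, one column per topic.
corr = np.corrcoef(doc_topics.T)  # topic-by-topic correlation matrix
# Strongly positive entries: topics that rise and fall together across paragraphs.
print(np.round(corr, 2))
```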
The Time Series tab shows us how the proportions of topics change over the body of our texts (or change over time, if we've ordered our texts by time). We might use this to ask a close reading question: Why was there proportionally more discussion of God in earlier texts?
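If you export the doc-topic data, a rough version of this view can be redrawn with matplotlib (a sketch; topic 3 is an arbitrary choice):

```python
import matplotlib.pyplot as plt

# doc_topics rows are in reading/time order, so a column is a crude time series.
plt.plot(doc_topics[:, 3])
plt.xlabel("paragraph (in text/time order)")
plt.ylabel("proportion of topic 3")
plt.show()
```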
If we had too many topics, we would see topics with one very large spike and little or no other appearance, as in the example below. This means the topics are so small that they occur in only one work and nowhere else, so they're not helpful for comparing across texts!
The Vocabulary tab tells us how closely a word clusters with other words. A higher score (closer to 1) means a word occurs in fewer topics, and so is associated only with the words in its topic. This might tell us that new and left are associated with many ideas, while captain and mother are associated with only a few.
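One plausible way to approximate such a specificity score (an assumption; the site may compute it differently) is the share of a word's weight that falls in its single most common topic:

```python
import numpy as np

# Normalize columns of the topic-word matrix to get, roughly, P(topic | word).
word_topic = lda.components_ / lda.components_.sum(axis=0)
specificity = word_topic.max(axis=0)  # near 1: the word lives in one topic

# The five most topic-specific words under this rough measure.
for i in np.argsort(specificity)[::-1][:5]:
    print(vocab[i], round(float(specificity[i]), 2))
```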
We can't export our views from correlations, time series, or vocabulary, but we can download the data behind them to let us create our own visualizations elsewhere.
Document topics will let us create our own time series; topic words will let us determine specificity; topic summaries gives us the bag of words for each numeric topic ID (important for decoding our other files); topic-topic will create a correlation grid; and doc-topic will create a network of topics connected to texts.
To do further analysis in other programs, we might need to manipulate our files a bit. For example, our doctopics file labels each row only by its text subsection--we might want a column with just the text label so we can analyze topics by both text and text subsection. To do that, we'd add a column duplicating the subsection label and then use a regular expression to strip the numeric subsection ID pattern.
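In pandas, that manipulation might look like this (the column layout and the label pattern, e.g. jacobs_1861_0042, are assumptions about the export format):

```python
import pandas as pd

# Hypothetical layout: first column holds subsection labels like "jacobs_1861_0042".
dt = pd.read_csv("doctopics.csv", header=None)
dt["text"] = dt[0].str.replace(r"_\d+$", "", regex=True)  # strip the numeric ID
dt.to_csv("doctopics_labeled.csv", index=False)
```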
All the files will export topics labeled with just their numeric ID rather than a human-readable label. To give our topics human-readable labels in other programs, we'd copy the labels from our key.csv and use a transpose to paste this vertically formatted text horizontally instead. (Transpose appears in different places in different programs, but exists in all spreadsheet programs.)
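A scripted equivalent of that copy-and-transpose step (the key.csv layout here, topic ID then label, is an assumption):

```python
import pandas as pd

# Assumed layout: column 0 is the topic ID, column 1 is the human-readable label.
key = pd.read_csv("key.csv", header=None, names=["topic", "label"])
# Transpose so the labels run horizontally, ready to paste as column headers.
key.set_index("topic").T.to_csv("key_transposed.csv")
```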
In Gephi, we might want to color our topics by their modularity cluster and size them by the number of texts they appear in. With more data preparation, we could also make a dynamic graph showing how topic clusters shift over time or within a text. See here for more about text networks.
In Tableau, we could use our topic-topic.csv to create and style our own correlation matrix.
We could also use doctopics.csv to create an area or stream graph showing the change in distribution of our topics across texts and/or time.
Using a filter, we could look at the distribution of topics within one text. In this case, we can see how the topics change over the course of Harriet Jacobs' Incidents in the Life of a Slave Girl. The topic "master years mistress" dominates at a few points, but the topic "children mother child" is a constant low-level element.