Text analysis involves using digital tools and one’s own analytical skills to explore texts, be they literary works, historical documents, scientific literature, or tweets. Approaches range from the quantitative (e.g., word counting) to the qualitative (e.g., topic modeling and sentiment analysis), and tools can range from coding and scripting languages to "out of the box" platforms like Voyant and Lexos.
In the humanities, text analysis is closely associated with the concept of distant reading, which essentially means using computational methods to explore and query large (sometimes massive) corpora. These corpora, or datasets as they are more commonly called in the sciences and social sciences, can be structured or unstructured, and the results can have a data visualization component.
Related Terms:
Text mining (a term used more in the humanities), data mining (a term used more in the sciences and social sciences), and web scraping are techniques that use coding, scripting, and "out of the box" tools to gather text and create a corpus (or dataset).
In this text analysis example, Ted Underwood and David Bamman used BookNLP, a Java-based natural language processing pipeline, to explore gender in 93,708 English-language fiction volumes. They articulate one of their major discoveries as follows:
There is a clear decline from the nineteenth century (when women generally take up 40% or more of the “character space” in fiction) to the 1950s and 60s, when their prominence hovers around a low of 30%. A correction, beginning in the 1970s, almost restores fiction to its nineteenth-century state. (One way of thinking about this: second-wave feminism was a desperately-needed rescue operation.)
Visit their blog post to learn more about their methods and discoveries.
Here CORD-19, a database containing thousands of scholarly articles about COVID-19 and other related coronaviruses, provides a topic model and visualization of 2,437 journal articles. The approach used, latent Dirichlet allocation (LDA), is a generative statistical model common in natural language processing.
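Topic models like LDA start from a bag-of-words representation: each document is reduced to counts over a shared vocabulary, forming a document-term matrix. A minimal sketch of that preprocessing step, using only Python's standard library and a toy corpus (the documents here are invented for illustration, not drawn from CORD-19):

```python
from collections import Counter

# Toy corpus standing in for journal-article abstracts (invented for illustration)
documents = [
    "viral transmission rates in respiratory infection",
    "vaccine trial results and vaccine efficacy",
    "respiratory infection and viral load",
]

# Build a shared vocabulary across all documents
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Each row of the document-term matrix counts one document's words
dtm = []
for doc in documents:
    counts = Counter(doc.split())
    dtm.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

A topic-modeling library would then factor this matrix into topics (distributions over words) and per-document topic weights.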
Visit to interact with the visualization.
Text analysis can be done using "out of the box" tools or coding and scripting; the latter approach enables scholars to explore more nuanced research questions.
Using "out of the box" tools, which don't require coding or scripting, is a good way to get started in text analysis as it will help users begin to understand possibilities and techniques. Voyant and Lexos are examples of such tools. (Mallet, used for topic modeling, is an example of a tool that requires coding but also provides users with a lot of guidance and preexisting code.)
Here is a Voyant instance that contains all of Shakespeare's plays. A stopword list including "thou" and "sir" has been applied to prevent such words from dominating the results. (The selection of stopwords is part of the scholarly decision-making that goes into text analysis.)
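The effect of a stopword list is easy to see in code: the same frequency count changes substantially depending on which words are excluded. A minimal sketch (the sample text and stopword list here are illustrative, not Voyant's actual defaults):

```python
from collections import Counter

# A snippet of early modern dialogue (invented for illustration)
text = "thou art a villain sir thou art a knave"

# Scholar-chosen stopwords; choosing this list is itself an interpretive act
stopwords = {"thou", "sir", "a", "art"}

tokens = text.lower().split()
filtered = [t for t in tokens if t not in stopwords]
frequencies = Counter(filtered)
print(frequencies.most_common())
```

Without the stopword list, "thou" and "art" would top the results; with it, the content words "villain" and "knave" surface instead.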
Visit to interact with this Voyant instance.
Coding is an umbrella term for using programming languages to do things like create applications and websites. Scripting falls under coding and involves using those languages to do things like automate processes and make websites more dynamic. Coding and scripting are typically done using a computer's command line or platforms like Jupyter Notebooks.
To get a sense of what coding and scripting look like in text analysis, here is a basic example from the Natural Language Toolkit (NLTK), which uses the Python language. Here you can see a script being run that tags the parts of speech in the sentence, "And now for something completely different." (CC = coordinating conjunction, RB = adverb, IN = preposition, NN = noun, JJ = adjective.)
In this example from Programming Historian, you see a portion of a Python script used for counting word frequencies.
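The Programming Historian lesson builds its counts step by step; the core idea can be sketched with the standard library's `Counter` (the sample sentence here is illustrative, and a real script would read a file rather than a hard-coded string):

```python
from collections import Counter

# Sample text (illustrative); a real script would read this from a file
text = "it was the best of times it was the worst of times"

tokens = text.split()
frequencies = Counter(tokens)

# Print the most common words with their counts
for word, count in frequencies.most_common(3):
    print(word, count)
```

From here, a researcher would typically normalize case, strip punctuation, and remove stopwords before counting, as discussed above.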