
Out of the Box vs Coding and Scripting

Text analysis can be done using "out of the box" tools or with coding and scripting; the latter approach enables scholars to explore more nuanced research questions.

Out of the Box

Using "out of the box" tools, which don't require coding or scripting, is a good way to get started in text analysis as it will help users begin to understand possibilities and techniques. Voyantarrow-up-right and Lexosarrow-up-right are examples of such tools. (Malletarrow-up-right, used for topic modelingarrow-up-right, is an example of a tool that requires coding but also provides users with a lot of guidance and preexisting code.)

Here is a Voyant instance that contains all of Shakespeare's plays. Stopwords like "thou" and "sir" have been applied to prevent such words from dominating the results. (The selection of stopwords is part of the scholarly decision making that goes into text analysis.)
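To see why stopword selection matters, here is a toy sketch of stopword filtering in Python. The stopword list below is a small hand-picked sample for illustration, not Voyant's actual list:

```python
# Toy illustration of stopword filtering: frequent function words are
# removed so that content words stand out in the counts.
from collections import Counter

# A small sample stopword list (real lists are much longer).
stopwords = {"the", "and", "to", "a", "of", "thou", "sir", "or", "not", "that", "is"}

words = "to be or not to be that is the question".split()
counts = Counter(w for w in words if w not in stopwords)
# Without filtering, "to" and "the" would compete with content words;
# after filtering, "be" and "question" rise to the top.
```

Deciding which words belong on that list (should "be" stay or go?) is exactly the kind of scholarly judgment the paragraph above describes.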

Coding and Scripting

Coding is an umbrella term for using coding (or programming) languages to do things like create applications and websites. Scripting falls under coding and involves using those languages to do things like automate processes and make websites more dynamic. Coding and scripting are typically done at a computer's command line or on platforms like Jupyter Notebooks.

To get a sense of what coding and scripting look like in text analysis, here is a basic example from the Natural Language Toolkit, which uses the Python language. Here you can see a script being run that tags the parts of speech in the sentence, "And now for something completely different." (CC = coordinating conjunction, RB = adverb, IN = preposition, NN = noun, JJ = adjective.)

In this example from the Programming Historian, you see a portion of a Python script used for counting word frequencies.

Text Analysis Examples

Humanities Example

In this text analysis example, Ted Underwood and David Bamman used BookNLP, a Java-based natural language processing pipeline, to explore gender in 93,708 English-language fiction volumes. They articulate one of their major discoveries as follows:

There is a clear decline from the nineteenth century (when women generally take up 40% or more of the “character space” in fiction) to the 1950s and 60s, when their prominence hovers around a low of 30%. A correction, beginning in the 1970s, almost restores fiction to its nineteenth-century state. (One way of thinking about this: second-wave feminism was a desperately-needed rescue operation.)
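Underwood and Bamman measured "character space" with BookNLP's full character-level pipeline; as a much cruder illustration of the general idea, one could count gendered pronouns in a passage as a rough proxy. This sketch is purely illustrative and is not their method:

```python
# Toy proxy for "character space": the share of gendered pronouns that
# are feminine. (BookNLP's actual approach identifies characters and
# attributes words to them; this is far simpler.)
def pronoun_share(text):
    feminine = {"she", "her", "hers", "herself"}
    masculine = {"he", "him", "his", "himself"}
    words = text.lower().split()
    f = sum(w in feminine for w in words)
    m = sum(w in masculine for w in words)
    total = f + m
    return f / total if total else 0.0

share = pronoun_share("She opened the letter. He watched her as she read it.")
```

Even this crude measure hints at how a computational reading can quantify something as qualitative as narrative attention.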

Text Analysis

Text analysis involves using digital tools and one's own analytical skills to explore texts, be they literary works, historical documents, scientific literature, or tweets. Approaches can be quantitative (e.g., word counting) and qualitative (e.g., topic modeling and sentiment analysis), and tools can range from coding and scripting languages to "out of the box" platforms like Voyant and Lexos.
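As a small taste of the qualitative end of that spectrum, here is a toy lexicon-based sentiment score. The word lists are invented for illustration; real tools (such as NLTK's VADER) use much richer lexicons and rules:

```python
# Toy lexicon-based sentiment analysis: score = positive hits minus
# negative hits. Illustrative only.
positive = {"love", "great", "sweet", "noble"}
negative = {"hate", "vile", "bitter", "rotten"}

def sentiment(text):
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

score = sentiment("sweet love turned to bitter hate")  # +2 positives, -2 negatives
```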

In the humanities, text analysis is closely associated with the concept of distant reading, which essentially means using computational methods to explore and query large (sometimes massive) corpora. The corpora, or datasets as they are more commonly called in the sciences and social sciences, can be structured or unstructured, and the results can have a data visualization component.

Related Terms:

  • Text mining (a term used more in the humanities), data mining (a term used more in the sciences and social sciences), and web scraping are techniques that use coding, scripting, and "out of the box" tools to gather text and create a corpus (or dataset).
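A core step in web scraping is turning raw HTML into plain text for a corpus. Here is a minimal sketch using Python's standard library; the HTML snippet is hard-coded for illustration (in practice you would fetch pages, e.g. with urllib.request, and respect each site's terms of service):

```python
# Minimal sketch of extracting plain text from HTML for a corpus,
# using only the standard library.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-whitespace text between tags.
        if data.strip():
            self.chunks.append(data.strip())

page = "<html><body><h1>Hamlet</h1><p>To be, or not to be.</p></body></html>"
parser = TextExtractor()
parser.feed(page)
corpus_text = " ".join(parser.chunks)
```

The extracted text can then be tokenized, counted, or modeled with any of the techniques described on this page.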

Visit their blog post to learn more about their methods and discoveries.
A visualization from Underwood and Bamman's text analysis.

Science Example

Here CORD-19, a database containing thousands of scholarly articles about COVID-19 and related coronaviruses, provides a topic model and visualization of 2,437 journal articles. The approach used, latent Dirichlet allocation (LDA), is a generative statistical model widely applied in natural language processing.

Visit the CORD-19 site to interact with the visualization.

The topics identified, and the visualization (color indicates the topic; node size reflects the number of citations).
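To make LDA concrete, here is a minimal sketch using scikit-learn on a tiny invented corpus. This is not CORD-19's pipeline; the documents, topic count, and parameters are all toy choices for illustration:

```python
# Illustrative LDA topic model on a toy corpus with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "virus infection spread transmission virus",
    "vaccine trial immune response vaccine",
    "virus transmission spread infection",
    "immune vaccine antibody response",
]

# Turn the documents into a word-count matrix, then fit a 2-topic model.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_docs, n_topics)
```

Each row of `doc_topics` is a probability distribution over the two topics, which is what a visualization like CORD-19's colors and sizes: each article's dominant topic and weight.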
import nltk
from nltk import word_tokenize

# Requires the "punkt" and "averaged_perceptron_tagger" data packages:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
# Build a string of text, then split it into a list of words.
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

wordlist = wordstring.split()

# For each word, record how many times it appears in the full list.
wordfreq = []
for w in wordlist:
    wordfreq.append(wordlist.count(w))

print("String\n" + wordstring + "\n")
print("List\n" + str(wordlist) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")
print("Pairs\n" + str(list(zip(wordlist, wordfreq))))

String
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

List
['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was',
'the', 'worst', 'of', 'times', 'it', 'was', 'the', 'age',
'of', 'wisdom', 'it', 'was', 'the', 'age', 'of',
'foolishness']

Frequencies
[4, 4, 4, 1, 4, 2, 4, 4, 4, 1, 4, 2, 4, 4, 4, 2, 4, 1, 4,
4, 4, 2, 4, 1]

Pairs
[('it', 4), ('was', 4), ('the', 4), ('best', 1), ('of', 4),
('times', 2), ('it', 4), ('was', 4), ('the', 4),
('worst', 1), ('of', 4), ('times', 2), ('it', 4),
('was', 4), ('the', 4), ('age', 2), ('of', 4),
('wisdom', 1), ('it', 4), ('was', 4), ('the', 4),
('age', 2), ('of', 4), ('foolishness', 1)]