Out of the Box vs Coding and Scripting

Text analysis can be done with "out of the box" tools or with coding and scripting; the latter approach enables scholars to explore more nuanced research questions.

Out of the Box

Using "out of the box" tools, which don't require coding or scripting, is a good way to get started in text analysis as it will help users begin to understand possibilities and techniques. Voyant and Lexos are examples of such tools. (Mallet, used for topic modeling, is an example of a tool that requires coding but also provides users with a lot of guidance and preexisting code.)

Here is a Voyant instance that contains all of Shakespeare's plays. Stopwords like "thou" and "sir" have been applied to keep those words from dominating the results. (The selection of stopwords is part of the scholarly decision making that goes into text analysis.)
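To make the effect of stopwords concrete, here is a minimal Python sketch. It is not how Voyant itself is implemented; the sample line, the stopword list, and the helper name count_words are all assumptions chosen for illustration.

```python
from collections import Counter

# Illustrative sample text and stopword list (assumptions, not Voyant's defaults).
sample = "thou art kind sir and thou art wise sir but thou art late"
stopwords = {"thou", "art", "sir", "and", "but"}

def count_words(text, stopwords=frozenset()):
    """Count lowercase whitespace-separated words, skipping any stopwords."""
    words = text.lower().split()
    return Counter(w for w in words if w not in stopwords)

unfiltered = count_words(sample)
filtered = count_words(sample, stopwords)

# Compare the three most frequent words before and after filtering.
print(unfiltered.most_common(3))
print(filtered.most_common(3))
```

Without the stopword list, "thou", "art", and "sir" top the ranking; with it, only "kind", "wise", and "late" remain. Which words belong on the list is the scholarly decision described above.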

Coding and Scripting

Coding is an umbrella term for using programming languages to do things like create applications and websites. Scripting falls under coding and involves using those languages to do things like automate processes and make websites more dynamic. Coding and scripting are typically done using a computer's command line or platforms like Jupyter Notebooks.

To get a sense of what coding and scripting look like in text analysis, here is a basic example from the Natural Language Toolkit, which uses the Python language. Here you can see a script being run that tags the parts of speech in the sentence, "And now for something completely different." (CC = coordinating conjunction, RB = adverb, IN = preposition, NN = noun, JJ = adjective.)

In this example from Programming Historian, you see a portion of a Python script used for counting word frequencies.

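Part-of-speech tagging example (Natural Language Toolkit):

import nltk

# Assumes NLTK's tokenizer and tagger data have been installed with nltk.download().
# Split the sentence into tokens, then tag each token with its part of speech.
text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]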
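
The tag abbreviations in the output above can be expanded into readable labels with a small lookup table. This is only a sketch: it hard-codes the tagged output so that it runs without NLTK installed, and TAG_NAMES and describe are hypothetical names, not part of NLTK.

```python
# Tagged output copied from the example; hard-coded so NLTK is not required.
tagged = [("And", "CC"), ("now", "RB"), ("for", "IN"), ("something", "NN"),
          ("completely", "RB"), ("different", "JJ")]

# Readable names for the tags that appear in this sentence only.
TAG_NAMES = {
    "CC": "coordinating conjunction",
    "RB": "adverb",
    "IN": "preposition",
    "NN": "noun",
    "JJ": "adjective",
}

def describe(tagged_tokens):
    """Pair each word with a readable part-of-speech name."""
    return [(word, TAG_NAMES.get(tag, "unknown tag")) for word, tag in tagged_tokens]

for word, name in describe(tagged):
    print(f"{word}: {name}")
```

The table covers only the five tags in this sentence, so any other tag falls back to "unknown tag".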
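
Word frequency example (Programming Historian):

# Build the sample text from two string literals.
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

# Split on whitespace to get a list of words.
wordlist = wordstring.split()

# For each word, in order, record how many times it occurs in the whole list.
wordfreq = []
for w in wordlist:
    wordfreq.append(wordlist.count(w))

print("String\n" + wordstring + "\n")
print("List\n" + str(wordlist) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")
# Pair each word with its count; repeated words yield repeated pairs.
print("Pairs\n" + str(list(zip(wordlist, wordfreq))))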

String
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

List
['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was',
'the', 'worst', 'of', 'times', 'it', 'was', 'the', 'age',
'of', 'wisdom', 'it', 'was', 'the', 'age', 'of',
'foolishness']

Frequencies
[4, 4, 4, 1, 4, 2, 4, 4, 4, 1, 4, 2, 4, 4, 4, 2, 4, 1, 4,
4, 4, 2, 4, 1]

Pairs
[('it', 4), ('was', 4), ('the', 4), ('best', 1), ('of', 4),
('times', 2), ('it', 4), ('was', 4), ('the', 4),
('worst', 1), ('of', 4), ('times', 2), ('it', 4),
('was', 4), ('the', 4), ('age', 2), ('of', 4),
('wisdom', 1), ('it', 4), ('was', 4), ('the', 4),
('age', 2), ('of', 4), ('foolishness', 1)]
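
The script above records a count for every occurrence of a word, so repeated words appear several times in the pairs list. One possible alternative, which is not part of the Programming Historian example, is Python's standard-library collections.Counter, which tallies each distinct word once and can rank the results.

```python
from collections import Counter

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

# Counter maps each distinct word to the number of times it occurs.
wordcounts = Counter(wordstring.split())

# most_common() returns (word, count) pairs from most to least frequent.
for word, count in wordcounts.most_common():
    print(word, count)
```

Counter also makes a single pass over the word list, whereas calling wordlist.count(w) inside a loop rescans the whole list for every word.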