In this tutorial, you will use the tools Voyant and Lexos. They have been chosen because they are "out of the box" tools, meaning they don't require any coding, they are relatively easy to use, and they have many capabilities. As such, they are great tools for getting started in text analysis.
Generally speaking, "out of the box" tools tend to be blunter instruments in that they do not allow for the level of customization and specificity that coding and scripting languages like Python and R do. Consequently, if you want to run more in-depth text analysis queries, you will eventually need to gain some coding and scripting skills. If tools like Voyant and Lexos serve all of your text analysis requirements, however, they might be all you need.
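To give a sense of what that scripting route looks like, here is a minimal Python sketch of a task Voyant also performs: counting the most frequent words in a text. The file name is a hypothetical placeholder for any plain-text file you have on hand.

```python
# Count the ten most frequent words in a plain-text file.
# "speech.txt" is a hypothetical file name; substitute your own.
from collections import Counter
import re

with open("speech.txt", encoding="utf-8") as f:
    text = f.read().lower()

words = re.findall(r"[a-z']+", text)  # crude tokenization
for word, count in Counter(words).most_common(10):
    print(word, count)
```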
Lexos, which is more complex than Voyant, allows some more in-depth work and can be used for scraping, scrubbing, and cutting text in addition to conducting analyses. Voyant has a flexible and friendly interface that provides a lot of different ways into a text. In this tutorial, you will learn about a way in which Lexos and Voyant work well together.
When first using Voyant and Lexos, it is good to look over their guides and other helpful information. Voyant has an extensive guide, with the tools list and tool instructions being particularly helpful. The question marks on the Voyant interface also provide information. Lexos has helpful information within the tool. Click on a question mark to learn what something does, or click on "Help" in the top right of the navigation bar to open the help window on the left.
In this two-part exercise, you will dive straight into Voyant to get a sense of how the tool works and what the text analysis process can look like. In part one, you will learn how to copy and paste text into Voyant and see how different Voyant tools work together. In part two, you will learn how to upload a text file to Voyant and a little about preparing a text for text analysis.
1.) Copy Martin Luther King Jr.'s "I Have a Dream" speech from this site. (If the site is down, search for a full version of the speech.) Make sure only to copy the speech and not other text on the webpage. (Incorporating any other text will impact the results.)
2.) Go to Voyant.
3.) Paste the text into the "Add Text" box and click "Reveal":
The results should look something like this:
4.) To get a sense of how Voyant works, click on the word "freedom" in the Cirrus [A], or word cloud, then scroll through the Reader [B] and notice how the word "freedom" is highlighted throughout.
In Trends [C], notice how the line graph only shows "freedom." Now click on one of the line graph points and see how the Contexts [D] changes to show the context in which "freedom" appears throughout the speech. (It should look like the below).
Note: If you click the question mark in the upper right corner of each tool, e.g., Cirrus, you will get an explanation of that specific one.
Text analysis is more often associated with working with a large corpus (for example, all the works of a single author) or an enormous one (for example, all fiction publications from 1800-1900). With a smaller corpus, a single speech being a particularly small one, a text analysis tool like Voyant can facilitate close reading and is especially good for examining structure and word usage.
1.) Download the following text file. (It's The Complete Works of Shakespeare in Project Gutenberg.) It will likely download to your download folder or desktop.
2.) Go to Voyant again to launch a new instance and upload the text file by clicking on "Upload," navigating to the downloaded file, and selecting it. It will automatically "reveal" and will look something like the following:
Take a minute to look the Voyant instance over. Notice words like "shall," "hath," and any special characters in the word cloud. Notice the Project Gutenberg "boilerplate text" in the Reader. In Trends, notice that the horizontal line has numbers, and notice that as you scroll down in Contexts, the name in the Document column comes from the text file name and never changes.
3.) Now take a look at this Voyant instance (also seen in the image below), which contains The Complete Works of Shakespeare as well. This time the text was prepared before being uploaded.
Notice that words like "shall," "hath," and special characters are no longer in the word cloud. This is because stopwords were applied to remove them. Notice that the boilerplate text is gone. This is because it was deleted from the text file before being uploaded. In Trends, the play names can now be seen below the horizontal line, and they can also be seen in Contexts' Document column. This is because the file, which contained all of Shakespeare's plays, was cut up into individual text files, one for each play. The files were also renamed to represent the play titles. The sonnets, which were also in the text, were removed so that the instance would be solely focused on plays.
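For readers curious what that preparation looks like in code, here is a rough Python sketch of the same steps: stripping the boilerplate and cutting the file at each play's title. The marker strings, file name, and title list are hypothetical; inspect your copy of the file to find the real ones.

```python
# Strip Project Gutenberg boilerplate and cut the file into one text file
# per play. Marker strings and titles below are hypothetical examples.
from pathlib import Path

raw = Path("shakespeare.txt").read_text(encoding="utf-8")

# Boilerplate markers vary by edition; check your file.
start = raw.index("*** START OF THE PROJECT GUTENBERG EBOOK")
start = raw.index("\n", start) + 1
end = raw.index("*** END OF THE PROJECT GUTENBERG EBOOK")
body = raw[start:end]

# Cut at each play's title heading and name each file after the play.
titles = ["THE TRAGEDY OF HAMLET", "THE TRAGEDY OF MACBETH"]  # ...and so on
positions = [body.index(t) for t in titles] + [len(body)]
for title, i, j in zip(titles, positions, positions[1:]):
    Path(title.split()[-1].lower() + ".txt").write_text(body[i:j], encoding="utf-8")
```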
As you explore Voyant, keep in mind that neither it nor any other text analysis tool does the analysis for you. These tools provide ways into texts that enable users to come to conclusions based on their own knowledge and analytical skills.
When conducting a text analysis, it is important to keep in mind that:
1.) Word meaning changes over time.
While it might seem obvious, it's important not to forget that word meaning changes over time. One can use a source like the Oxford English Dictionary to look up the particular meaning of a word at a particular time.
2.) The word context is key.
In many, if not most, text analysis undertakings, word context is crucial to the analysis. Exceptions can occur when, for example, one is only interested in the number of times a word appears and not in the way the word is used.
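One common way tools surface context is the keyword-in-context (KWIC) display, which is the idea behind Voyant's Contexts tool. A minimal sketch of that idea in Python:

```python
# Print each occurrence of a keyword with a few words on either side,
# similar in spirit to Voyant's Contexts tool.
import re

def kwic(text, keyword, window=5):
    tokens = re.findall(r"\w+", text.lower())
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left} [{tok}] {right}")

kwic("Let freedom ring from the mighty mountains. Let freedom ring.", "freedom", 3)
```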
3.) There may be issues of omission in the corpus.
It's important to keep in mind when exploring or creating a corpus that there may be issues of omission. People of color, women, and other marginalized groups have been published less throughout history and, therefore, a massive corpus--like Google Books or HathiTrust--will skew white and male. (Other areas of omission can be based on things like language, geography, time period, etc.) Moreover, it's important to consider what gets digitized. There can be (and no doubt is) bias in the decisions that drive the selection and funding of what ends up online.
4.) There can be quality issues with the corpus.
Often the texts used in text analysis come from books and documents that have been OCR'd. OCR (optical character recognition) converts images of text into digital (machine-readable) text. Due to things like the quality of images and scanning mistakes, there can be OCR quality issues and, therefore, text errors.
Below are two examples of how OCR errors can occur. The one on the left is from a first edition of the 18th-century novel The Life and Opinions of Tristram Shandy, Gentleman. With books from this period, you get characters such as the long s ( ſ ) and, often, ink bleed-through and foxing (all of the little dots that come from age), which can impact OCR. (These kinds of issues used to be much more of a factor before advancements in OCR technology.) The example on the right shows a scanning mistake made when the book was moved during the process. (Even with the advancements of technology, OCR issues are unavoidable in this case.) When working with a large or massive corpus, these kinds of errors might be inconsequential as long as there is a small enough number of them. With smaller corpora, such errors can have a greater impact and skew text analysis results.
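Some predictable OCR errors, like the long s being read as an "f," can be patched with simple substitutions. This toy sketch is illustrative only; the replacement list is made up, and real OCR correction requires far more care to avoid over-correcting.

```python
# Toy illustration of patching predictable OCR errors.
# The replacement list is hypothetical and would over-correct on real text.
replacements = {
    "fhall": "shall",  # long s ( ſ ) misread as "f"
    "fome": "some",
    "beft": "best",
}

def clean_ocr(text):
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return text

print(clean_ocr("We fhall do our beft."))  # -> We shall do our best.
```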
Multiple Collections: anchor collections from JSTOR and Portico, with additional content sources continually added.
Data download in JSON format.
Open content: bibliographic metadata, full text, unigrams, bigrams, trigrams.
Dataset Dashboard: easily view datasets you have built or accessed.
Dataset ID: the unique identifier of an extracted dataset; it can be used to retrieve the dataset in research notebooks.
Analyze: a tutorial version for learning how to use the research notebooks.
Metadata downloads in CSV format; raw text data downloads in JSON format.
Built-in visualizations, available by clicking the link under the word cloud.
Step 1: Click "Jupyter" and go to the main directory.
Step 2: Go to the "Data" folder.
Step 3: Check "stop_words.csv" and click "Download."
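Once stop_words.csv is on your machine, reading it in a notebook takes only a few lines. This sketch assumes the file lists one word per row; check the actual file layout after downloading.

```python
# Read a downloaded stopword list (assumed one word per row) into Python.
import csv

with open("stop_words.csv", newline="", encoding="utf-8") as f:
    stop_words = [row[0] for row in csv.reader(f) if row]

print(len(stop_words), "stopwords loaded")
```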
The Introduction to Text Analysis tutorial, created by Melanie Hubbard, BC Digital Scholarship Librarian, provides a basic introduction to text analysis concepts, the tools Voyant and Lexos, and how to create a corpus. Example texts are humanities-oriented, but text analysis can be used in any disciplinary field.
The following resources provide text for text analysis projects.
Internet Archive Books (includes plain-text [“full text”] access to books, issues of magazines, etc.)
Early English Books Online (EEBO) (BC library resource)
Oxford Text Archive (large number of texts available in variety of forms, including plain text; texts are accessed one at a time)
HathiTrust (16 million volumes, mostly in English)
Chronicling America (12.8 million pages of American newspapers)
DocSouth Data (narratives & literature from the American South)
Perseus Digital Library (large collection of classical texts, much of it encoded in TEI/XML)
EEBO-TCP (ca. 50,000 early English books, many encoded in TEI/XML)
Old Bailey Online (197,745 London criminal trials, 1674-1913)
Canadian Hansard (debates & journals of the Canadian Senate & House of Commons)
Australian Hansard (Parliamentary debates, 1901-1980)
UK Hansard (UK Parliamentary debates)
Open Islamicate Texts Initiative (see also repositories; 10,000 premodern Islamicate texts)
Transkribus Corpus and READ (efforts to use computer vision to recognize handwriting)
ToposText (557 classical texts linked with a gazetteer of the ancient world)
BYU Corpora (widely used corpora of American English)
Wright American Fiction (American adult fiction, 1774–1900)
UCLA Broadcast NewsScape (170K hours of captioned news programs; see Red Hen Lab for information on access)
Media History Digital Library (nearly 2 million pages of media-related books and articles, 1875-1995)
Christian Classics Ethereal Library (classic Christian texts)
NYT Annotated Corpus (1.8 million NYT articles + NYT-supplied metadata)
Europeana Collections (many datasets from European libraries & archives, from papyri to photographs to newspapers)
Foreign Records of the US (nearly complete run of Foreign Relations of the United States; see these tools to obtain full text)
Internet Archive (a huge collection of websites, texts, audio, and other media, available for bulk download via wget)
Twitter Datasets (a catalog of Twitter datasets that are publicly available on the web)
BitCurator (an effort to develop tools to analyze features of digital texts)
Movie Quotes Corpus (“220,579 conversational exchanges between 10,292 pairs of movie characters”)
Europe PMC (repository of life sciences books, articles, and preprints)
Trove Australia (565 million documents collected by the National Library of Australia, including a sizeable collection of newspapers)
BNC-Baby (a 4-million-word subcorpus of the 100-million-word British National Corpus, with part-of-speech tagging in XML)
TEI-Encoded
Eighteenth Century Collections Online (BC library resource)
Resources from Laura Nelson’s “Analyzing Complex Digitized Data”
Constellate, the new text and data analytics service from JSTOR and Portico, is a platform for learning and performing text analysis, building datasets, and sharing analytics course materials. The platform provides value to users in three core areas: they can teach and learn text analytics, build datasets from across multiple content sources, and visualize and analyze their datasets.
Text analysis begins with a research question or curiosity and involves the use of digital tools and one's own analytical skills to explore texts, be they literary works, historical documents, journal articles, legal briefs, transcribed interviews, or tweets. It is used in a wide variety of disciplines. Approaches can be quantitative or qualitative, and tools can range from coding and scripting languages to "out of the box" platforms like Voyant and Lexos.
Text mining (a term used more in the humanities), data mining (a term used more in the sciences and social sciences), and web scraping are techniques that use coding, scripting, and "out of the box" tools to gather text and create a corpus (or dataset).
For this exercise, you will be given an imaginary research topic and questions.
Imagine you are studying Frederick Douglass' rhetorical arguments against slavery and you notice a lot of mentions of family. This sparks your curiosity and makes you wonder: How does Douglass evoke the idea of family in his arguments? How much and when does he mention it? What rhetorical purpose might these mentions serve? Does Douglass more often talk about family in the context of slaveholders or slaves?
You decide to focus on words like wife, mother, husband, father, child, baby, infant, family, and parent with the understanding that you can expand your list later.
To get started, you need to create your corpus. You will be acquiring the text from Project Gutenberg, which has thousands of texts covering a range of genres and topics. (There are numerous other text sources, some of which can be found on this text repositories list.)
If you want to skip this part of the process and go straight to the Working in Voyant section, you can download the prepared text files below. (You will need to unzip the file to upload the individual text files to Voyant.)
Go to Project Gutenberg and search "Frederick Douglass" (or see his works here). The results should look like the following:
From this list, Narrative of the Life of Frederick Douglass, an American Slave; My Bondage and My Freedom; Abolition Fanaticism in New York; and Collected Articles of Frederick Douglass will be used.
a.) Now you will "scrape" (or extract) the text from the site. This begins with getting the text URLs (web addresses). To do this in Project Gutenberg, you go to each text's landing page, select "Plain Text UTF-8," and copy the URL.
For example, here is the Narrative of the Life of Frederick Douglass landing page:
And here is the URL that you get after clicking on Plain Text UTF-8:
For a shortcut, you can get all of the URLs here:
b.) Go to Lexos. Copy and paste the URLs into the "Scrape" box on the Lexos landing page (also the "Upload" page). Click the "Scrape" button. (It should only take a few seconds since the texts aren't that big.) When it is done, you will see the texts in the "Upload List" box.
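If you prefer scripting, the same scrape step can be done with Python's standard library. The URLs and file names below are hypothetical placeholders; use the Plain Text UTF-8 URLs you copied from each landing page.

```python
# Download each Project Gutenberg plain-text file and save it locally.
# URLs and file names are hypothetical placeholders.
from urllib.request import urlopen

urls = {
    "narrative.txt": "https://www.gutenberg.org/ebooks/23.txt.utf-8",
    "bondage.txt": "https://www.gutenberg.org/ebooks/202.txt.utf-8",
}

for filename, url in urls.items():
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
    print("saved", filename)
```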
Now you will prepare the text. This means things like getting rid of punctuation, making all the text lowercase, using lemmas, getting rid of tags, cutting up the text if you want it divided into smaller units, and any other choices you might make.
a.) To prepare the text, click on "Prepare" in the top navigation menu (see image below) and then click on "Scrub." (A little bit below, you will find an explanation of the preparation choices being made.)
b.) On the Scrub page, you can make multiple decisions that will affect the text. You may need to experiment with how you scrub your text. For now, select "Make Lowercase," "Remove Digits," "Scrub Tags," "Remove Punctuation," and "Keep Hyphens."
It should look like this:
Click "Apply."
c.) Now you are going to apply lemmas. It is important that you scrub the text before applying them. Scrubbing with the settings above gets rid of the punctuation, and removing it is necessary for some of the lemmas to work.
Cut and paste these lemmas into the "Lemmas" box:
It should look like this:
Click "Apply."
"Make lowercase" made all characters lowercase. This choice is best when using case-sensitive tools, which treat capitalized and lowercase words differently. For example, in certain tools, words capitalized at the beginning of sentences are seen as being different from the same word appearing in lowercase within the sentence.
"Remove Digits" and "Scrub Tags" got rid of unnecessary digits or distracting HTML tags that might be in the text.
"Remove Punctuation" and "Keep Hyphens" got rid of punctuation that might impact the effectiveness of the lemmas being used but kept hyphenated words intact.
Lemmas group together words so they can be analyzed as a single item. After lemmatization, it is easier to count, search, and categorize the grouped words. It is also easier to create stopwords and white lists, as only one version of a word will need to be added.
For example, "wife," "wifes," and "wives" will all appear as "wives." Note that "wifes" was "wife's." The first scrub application got rid of the apostrophe. Were the apostrophe not removed before applying lemmas, the lemmatization process for that word would not have worked.
b.) When it is done, click the "Download" button, and a zip file should download to your computer. Find the file and open it. You should see a folder with individual text files. (They are likely in your download folder or on your desktop.)
c.) Open each file to look for text not related to Douglass's work, e.g., Project Gutenberg boilerplate information at the beginning (pictured below) and end of the text.
This is when you can also decide whether to delete paratextual information, e.g., introductions, prefaces, table of contents, indexes, etc. Choosing whether or not to keep this kind of information is part of the intellectual decision-making that goes into text analysis.
d.) When you are done, save your files.
e.) Rename the files (by clicking on each file name) to clarify which file is which text. Below are the recommended names: "abolition," "articles," "bondage," and "narrative." They use the first keyword from each title. Once this is done, you are ready to upload your files to Voyant.
For this second half of the tutorial, you will be introduced to some of Voyant's functions that will help you explore the proposed research questions. Here, again, are the premise and research questions:
Imagine you are studying Frederick Douglass' rhetorical arguments against slavery and you notice mentions in various works that evoke the idea of family. This sparks your curiosity and makes you wonder: How does Douglass evoke the idea of family in his arguments? How much and when does he mention family? What rhetorical purpose might these mentions serve? Does Douglass more often talk about family in the context of slaveholders or slaves?
Going forward, you are encouraged to follow the various steps presented and to explore on your own.
Launch a new Voyant instance, click on the "Upload" button, navigate to and select the edited Douglass files. When selecting the files, it should look something like this:
The results should look something like this:
The particular set of tools and layout you initially see is called the "default skin." It displays the following tools:
a.) Notice that there are other view options within each box. For example, "Cirrus" also has "Terms" and "Links." Take a moment to explore the tools and their various options.
b.) Notice that when you hover to the left of any of the question marks (even the one in the blue field at the very top right of the page), a toolbar of icons appears:
These provide access to a range of options and functionalities: The arrow icon [a] allows you to export a URL or embed code for a specific tool or the entire project. It also allows you to export images. The window icon [b] is where you go to change the tool to a different one. The switch icon [c] is where you go to define options for that specific tool. For example, it is where you go to add stopwords, create categories, and change fonts. The question mark provides information about that specific tool.
c.) Take a little time to explore this toolbar as it is key to using Voyant effectively, and we will be using it quite a bit below.
Stopwords are words that you don't want to incorporate in the results you see in certain tools, e.g., Cirrus and the Summary's "most frequent words in the corpus." When applying stopwords, you are not deleting the words; they are just not visible.
Choosing stopwords is part of the intellectual decision-making that goes into text analysis. For example, in the context of the research question posed here, one could decide to stop the words "slavery," "masters," and "slaves" since those terms are pervasive and are understood to be there. Getting rid of words that are considered inconsequential, at least within the context of the research question (e.g., "like" and "mr"), can also be helpful. By stopping these words, other words will become more visible and might inspire new ideas and inform the analysis.
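Under the hood, stopping words is simply filtering them out of what gets counted and displayed; the text itself is untouched. A minimal sketch with the stopwords chosen above:

```python
# Filter stopwords out of the counts, not out of the text.
from collections import Counter

stopwords = {"slaves", "slavery", "masters", "mr", "like"}
tokens = "mr covey spoke of the slavery of the slaves".split()

counts = Counter(tok for tok in tokens if tok not in stopwords)
print(counts)  # stopped words are hidden from counts, not deleted
```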
a.) To add stopwords, click on the switch or "define options" icon in the Cirrus tool (or in any other tool).
b.) Click "Edit List" next to the stopwords dropdown menu.
c.) Add the words, "slaves," "slavery," "masters," "mr," and "like," putting each one on a new line.
Save them and click "Confirm." You should see a change in the word cloud. If one of the words does not disappear, add it again; there may have been a typo.
If you only wanted to apply stopwords in that particular tool, you would uncheck "apply globally."
Please note: When your text is first loaded in Voyant, the app automatically applies stopwords. You can turn this off by selecting "None" in the stopword dropdown menu. You can also remove stopwords from the list simply by deleting them, saving, and confirming the changes.
A white list is essentially the opposite of stopwords. It involves creating a list of the only words that you want to see in the Cirrus results.
a.) To create a white list, click on the switch or "define options" icon in the Cirrus tool (or any other tool).
b.) Click "Edit List" next to the white list dropdown menu.
c.) Add the words "mothers," "fathers," "children," "babies," "infants," "husbands," "wives," "families," and "parents," putting each one on a new line. Then save them and click "Confirm."
The results should look something like the below. (If one of the words does not appear, add it again; there may have been a typo.)
You can turn off the white list by selecting "None" in the white list dropdown menu. You can also remove words by deleting them from the list and resaving and confirming the changes.
You can also create categories that group words. They can be applied in many but not all tools.
a.) To create a category, click the switch or "define options" icon in the Cirrus tool (or any other tool).
b.) Click the "Edit" next to the Categories dropdown menu.
When you open up Categories, you will see that Voyant has two default ones, "positive" and "negative." To add a new category, click "Add Category" [b] and give it a title such as "family." To add terms to that list, search for them in the search box [c], and when they appear in the Terms box [d], drag them to the new categories list. To remove a term from a list, select the word and then click "Remove Selected Term" [a].
c.) Create a new category using "family" as the name and add the terms "mothers," "fathers," "children," "babies," "infants," "husbands," "wives," "families," and "parents." (It should look like the image above.)
After creating your category, you can apply it to a specific tool. In the tool's search box, search the list name with an @ at the beginning, for example, @family. (Do not leave a space between the @ and the name.) The results will show only items containing terms in that category.
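Conceptually, a category is just a named set of terms counted together. Here is a tiny sketch of tallying a "family" category across documents; the document names and texts are hypothetical stand-ins.

```python
# Count occurrences of any term in a "family" category, per document.
# Document names and texts are hypothetical stand-ins.
family = {"mothers", "fathers", "children", "babies", "infants",
          "husbands", "wives", "families", "parents"}

docs = {
    "narrative": "the children and their mothers",
    "articles": "fathers of families",
}
for name, text in docs.items():
    hits = sum(1 for tok in text.split() if tok in family)
    print(name, hits)
```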
d.) In the Contexts tool, search @family in the search box and explore the results:
The following will show you how to change out tools. (To learn about the many Voyant tools to choose from see the tools list.)
a.) In the Trends (or any other tool), click on the window or "choose tool" icon.
b.) Click on "Visualization" and then "MicroSearch" and the tool should appear.
As Voyant describes MicroSearch, "each document in the corpus is represented as a vertical block where the height of the block indicates the relative size of the document compared to others in the corpus. The location of occurrences of search terms is located as red blocks...Multiple search terms are collapsed together." In the tool, you search the term(s) you want to see appear in the visualization.
c.) Search "children" in the search box.
You can choose individual texts that you want to visualize by selecting and deselecting them. For example, you can choose to use only Douglass' Narrative and Collected Articles.
a.) To select the texts, go to Summary (on the lower left) and click "Documents." Then select or deselect texts. This is also where you can modify texts, meaning you can delete them and upload new ones. To make these changes, click "Modify."
You can save or share an entire Voyant instance or individual tools by exporting the URL. (Exporting a tool launches that tool in its own window.)
To export the URL of an entire Voyant instance or a specific tool (see this Cirrus white list example), click on the arrow or "Export URL" icon in the very top right corner of the Voyant instance. Then click "Export," and a new window will launch. Copy and keep the URL for that window. To get the embed code, select the option "an HTML snippet for embedding this view..." and then click "Export." If you make changes to your project, you need to re-export to get a URL or embed code that reflects those changes.
Exporting the entire Voyant instance options:
Exporting specific tool options (notice that you can also export images by selecting "export a PNG image..."):
You have completed Exercise Two and the tutorial. If you haven't already, now is a good time to explore the Voyant user's guide. You are also encouraged to experiment with texts that are part of your own research interests.
This notebook finds the word frequencies for a dataset.
Constellate provides:
1. Access to over 29 million documents, including content from JSTOR and Portico. >> collection details
2. Research Notebooks (Jupyter Notebooks) with pre-built code snippets for a number of text analysis tasks. >> access to Research Notebooks
Text data and notebooks can be used together or separately; data is downloaded in JSON format.
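Since the notebook's task is word frequencies, here is a minimal sketch of the idea over a JSON-lines dataset download. The file name and the "unigramCount" field name are assumptions about the dataset schema; check your download before running.

```python
# Aggregate unigram counts across a JSON-lines dataset download.
# "dataset.jsonl" and the "unigramCount" field are assumed schema details.
import json
from collections import Counter

totals = Counter()
with open("dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        totals.update(doc.get("unigramCount", {}))

for word, count in totals.most_common(10):
    print(word, count)
```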