Exercise 1: Voyant

Tools & Materials

  • Voyant (documentation)

  • Lexos (documentation)

  • Frederick Douglass dataset (an already prepared dataset)

Part One: Preparing the Dataset in Lexos

1.) Go to Lexos.

2.) Cut and paste the URLs below into the Scrape box (right side) and click Scrape.

The URLs point to plain-text files on Project Gutenberg of Frederick Douglass' Narrative of the Life of Frederick Douglass, an American Slave; My Bondage and My Freedom; Abolition Fanaticism in New York; and Collected Articles of Frederick Douglass:

https://www.gutenberg.org/cache/epub/23/pg23.txt
https://www.gutenberg.org/files/202/202.txt
https://www.gutenberg.org/cache/epub/34915/pg34915.txt
https://www.gutenberg.org/cache/epub/99/pg99.txt

3.) Click on Prepare and then Scrub

Select: "Make Lowercase," "Remove Digits," "Scrub Tags," "Remove Punctuation," and "Keep Hyphens"

Click Apply
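If it helps to see what those scrubbing options are doing, here is a rough Python sketch of equivalent transformations (illustrative only; the function and regular expressions are mine, not part of Lexos):

import re

def scrub(text):
    text = text.lower()                    # Make Lowercase
    text = re.sub(r'<[^>]+>', '', text)    # Scrub Tags (strip markup)
    text = re.sub(r'\d+', '', text)        # Remove Digits
    return re.sub(r'[^\w\s-]', '', text)   # Remove Punctuation, Keep Hyphens

print(scrub("In 1845, Douglass's <i>Narrative</i> was self-written."))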

4.) In the Lemmas box, add the list below, click Apply, and then Download.

slave-child:children
slave-mother:mother
slave:slaves
master:masters
wife, wifes:wives
husband's:husband
husband:husbands
child, childs:children
baby, babys:babies
infant:infants
mother:mothers
father's:father
father:fathers
parent:parents
family, familys:families
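For a sense of what the Lemmas step does, the sketch below applies a few of the same variant-to-lemma substitutions in Python (illustrative only; Lexos performs this consolidation for you when you click Apply):

# Illustrative sketch of the Lemmas step: map variant forms to a single lemma
lemmas = {
    "slave-child": "children",
    "wife": "wives", "wifes": "wives",
    "child": "children", "childs": "children",
    "family": "families", "familys": "families",
}

def apply_lemmas(text):
    # Replace each whole word with its lemma; words not in the mapping pass through unchanged
    return " ".join(lemmas.get(word, word) for word in text.split())

print(apply_lemmas("the slave-child and his family"))  # the children and his families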

5.) Locate the downloaded text files and open one. We will discuss the results.

Part Two: Text Analysis with Voyant

1.) Download the Frederick Douglass dataset or use the one you created in Part One.

2.) Upload the dataset to Voyant.

3.) Explore some of the lemma words (above) with different Voyant tools. Look at the different ways you can view their frequencies, relationships to other words, and their locations within the texts.

Breakout Group Questions

1.) Which tools do you find most useful or promising, whether for analyzing these texts or other texts you are interested in exploring?

2.) What might be some of the challenges and pitfalls of Voyant as you understand it so far?

3.) Are there any ways you can see text analysis (in Voyant or another tool) fitting into your own research or teaching?

Alternative Texts for Further Exploration

  • Boston Public Schools related materials (Internet Archive - search "boston public school," facets selected: Texts, Always Available)

  • Boston health (Internet Archive - search "boston health," facets selected: Texts, Always Available)

  • Search Project Gutenberg for a variety of pre-1923 books.

Text Analysis

Contents

  • Voyant & Lexos Exercise

  • Python Exercise

Resources

  • BC Text & Data Mining Libguide

Project Examples:

  • Mining the Dispatch

  • The Virtual Text Project

  • Robots Reading Vogue

  • Cohort Succession Explains Most Change in Literary Culture (article)


Exercise 2: Python

Tools & Materials

  • Google Colab

  • Download dataset (the CSV of BC dissertation records used in this exercise)

Exercise

In this exercise, you will be using a small dataset of BC dissertations focused on segregation in Boston schools to conduct a word frequency analysis on the dissertation abstracts.

Getting Started

1.) Import Libraries

Import pandas (a library for data analysis and manipulation), csv (a module for reading and writing tabular data in CSV format), and the Natural Language Toolkit (NLTK, a platform for building Python programs that work with human language data, used here for statistical natural language processing).

import pandas as pd
import csv
import nltk

2.) Import Data

Upload the downloaded CSV file into the Colab session and read it into a pandas dataframe.

#Upload the csv file from your local directory
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('incubator_etd-dataset.csv')  #The filename of the uploaded csv file

3.) Preview Dataframe

a.) Cut and paste and run the code below to see the dataframe:

df.head() #PREVIEW A DATAFRAME

b.) Change df.head() to df.head(8) and run the code again. By default df.head() shows the first five rows; passing a number shows that many rows instead.
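In other words, part b runs:

df.head(8) #PREVIEW THE FIRST EIGHT ROWS OF THE DATAFRAME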

Text Cleaning

In this section, the text is cleaned. All of the cleaning is applied specifically to the abstract field, since that is what we will be analyzing.

1.) Remove Empty Cells, Remove Punctuation, Convert to Lowercase

The following code:

  • Gets rid of records (rows in the dataset spreadsheet) that do not have abstracts.

  • Removes punctuation so that it will not affect how tokens are created and how words are counted.

  • Makes all uppercase letters lowercase, because otherwise words containing uppercase letters would be counted as different from the same words in lowercase. For example, "Education" and "education" would be counted as two different words.

#DROP RECORDS WITH NO ABSTRACT TEXT
df.dropna(subset=['abstract'], inplace=True)

#REMOVE PUNCTUATION
import re
df['abstract'] = [re.sub(r'[^\w\s]+', '', s) for s in df['abstract'].tolist()]

#CONVERT TO LOWER CASE
df['abstract'] = df['abstract'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df.head() #PREVIEW DATAFRAME

After the code is run, a new dataframe will show the changes that have been made.
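To see what that punctuation pattern does on its own, here is a one-line check (the sample string is invented):

import re
print(re.sub(r'[^\w\s]+', '', "Boston's schools, 1974."))  # Bostons schools 1974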

2.) Tokenization

Here the Natural Language Toolkit (NLTK) library is used to tokenize the text so that each individual word is a token. 'punkt' is the Punkt sentence tokenizer, an NLTK model that word_tokenize relies on.

nltk.download('punkt')
def tokenize_text(row):
    d = row['abstract']
    tokens = nltk.word_tokenize(d)
    token_words = [w for w in tokens]
    return token_words

#ADD TOKENIZED TEXT TO NEW DATAFRAME COLUMN
df['abs_tokenize'] = df.apply(tokenize_text, axis=1)
df.head() #PREVIEW DATAFRAME

Again a new dataframe will be shown. Note that a new column, abs_tokenize, has been added; this is where the tokenized text is. The abstract column remains untouched.
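To make the idea of tokens concrete, here is a small standalone example (the sample phrase is invented) of what nltk.word_tokenize returns:

import nltk
nltk.download('punkt')
print(nltk.word_tokenize("Desegregation of the Boston schools"))  # ['Desegregation', 'of', 'the', 'Boston', 'schools']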

3.) Stopwords

a.) Download the NLTK stopwords list:

nltk.download('stopwords')   #DOWNLOAD STOPWORDS FROM NLTK
from nltk.corpus import stopwords
stops = set(stopwords.words("english")) #STORE STOPWORDS IN stops
print(stops) #SHOW STOPWORDS

b.) Apply the stopwords and add the changes to a new abs_nostops column:

def remove_stops(row):
    d = row['abs_tokenize']
    meaningful_words = [w for w in d if not w in stops]
    return (meaningful_words)
#ADD TOKENIZED TEXT WITH STOPWORDS REMOVED TO NEW DATAFRAME COLUMN
df['abs_nostops'] = df.apply(remove_stops, axis=1)
df.head() #PREVIEW DATAFRAME

c.) Add customized stopwords not included in the NLTK list:

own_stops = {'study', 'school', 'schools', 'public'}
stops.update(own_stops)
print(stops) #Stopwords list is updated

d.) Apply the updated stopwords and add the changes to the abs_nostops column:

def remove_stops(row):
    d = row['abs_tokenize']
    meaningful_words = [w for w in d if not w in stops]
    return (meaningful_words)
df['abs_nostops'] = df.apply(remove_stops, axis=1)
df.head() #PREVIEW DATAFRAME

Word Frequency

1.) Counting Words

a.) Gather the token lists from the abs_nostops field:

abs_count = []
for i, row in df.iterrows():
    abs_count.append(row['abs_nostops'])

b.) Use Counter, a container that keeps track of how many times equivalent values are added, to calculate word frequency (a small standalone example follows this step). Note that because each abstract's tokens are passed through set(), a word is counted at most once per abstract, so the totals are the number of abstracts a word appears in rather than its raw number of occurrences:

# Import Counter()
from collections import Counter
# Build a Counter object called `word_frequency`, counting each word once per abstract
word_frequency = Counter(x for xs in abs_count for x in set(xs))

c.) Display the 15 most common words:

word_frequency.most_common(15)
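If Counter is new to you, this tiny standalone example (with made-up tokens) shows the behavior the code above relies on:

from collections import Counter
counts = Counter(['busing', 'boston', 'busing'])
print(counts.most_common(1))  # [('busing', 2)]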

2.) Add more stopwords

Return to the code written for step 3.c, own_stops = {'study', 'school', 'schools', 'public'}, and add additional stopwords that you think should be included, as in the example below. Then rerun steps 3.c, 3.d, and the word counts.
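As one possible illustration (these extra words are suggestions, not part of the original exercise), the set could be extended like this:

own_stops = {'study', 'school', 'schools', 'public', 'boston', 'education', 'also'}  # add your own candidates
stops.update(own_stops)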

3.) Visualization

a.) Import Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python:

import matplotlib.pyplot as plt

b.) Identify the 30 most used words:

a = word_frequency.most_common(30)

c.) Display the results in a horizontal bar chart, with the words along one axis and their counts drawn as blue bars. Inverting the y-axis puts the most frequent word at the top:

bar_values = list(list(zip(*a)))
x_val = list(bar_values[0])
y_val = list(bar_values[1])

plt.figure(figsize=(12,8)) #Customize plot size
plt.barh(x_val, y_val, color='blue', height=0.3)
plt.xlabel("Word Counts")
plt.gca().invert_yaxis()
plt.show()
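Optionally, if you want to keep a copy of the chart, you can save it by adding a line like the following just before plt.show() (the filename is arbitrary):

plt.savefig('word_frequency.png', dpi=150, bbox_inches='tight')  #SAVE THE CHART AS A PNG (OPTIONAL)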
