Exercise 1: Voyant

Tools & Materials

  • Voyant (documentation)

  • Lexos (documentation)

  • Frederick Douglass dataset (an already prepared dataset)

Part One: Preparing the Dataset in Lexos

1.) Go to Lexos.

2.) Cut and paste the URLs below into the Scrape box (right side) and click Scrape.

The URLs point to plain-text files on Project Gutenberg of Frederick Douglass' Narrative of the Life of Frederick Douglass, an American Slave; My Bondage and My Freedom; Abolition Fanaticism in New York; and Collected Articles of Frederick Douglass:

https://www.gutenberg.org/cache/epub/23/pg23.txt
https://www.gutenberg.org/files/202/202.txt
https://www.gutenberg.org/cache/epub/34915/pg34915.txt
https://www.gutenberg.org/cache/epub/99/pg99.txt

3.) Click on Prepare and then Scrub

Select: "Make Lowercase," "Remove Digits," "Scrub Tags," "Remove Punctuation," and "Keep Hyphens"

Click Apply
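If it helps to see what those scrubbing options are doing, here is a rough Python sketch of equivalent transformations (illustrative only; the function and regular expressions are mine, not part of Lexos):

import re

def scrub(text):
    text = text.lower()                    # Make Lowercase
    text = re.sub(r'<[^>]+>', '', text)    # Scrub Tags (strip markup)
    text = re.sub(r'\d+', '', text)        # Remove Digits
    return re.sub(r'[^\w\s-]', '', text)   # Remove Punctuation, Keep Hyphens

print(scrub("In 1845, Douglass's <i>Narrative</i> was self-written."))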

4.) In the Lemmas box, add the list below, click Apply, and then Download.

slave-child:children
slave-mother:mother
slave:slaves
master:masters
wife, wifes:wives
husband's:husband
husband:husbands
child, childs:children
baby, babys:babies
infant:infants
mother:mothers
father's:father
father:fathers
parent:parents
family, familys:families
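For a sense of what the Lemmas step does, the sketch below applies a few of the same variant-to-lemma substitutions in Python (illustrative only; Lexos performs this consolidation for you when you click Apply):

# Illustrative sketch of the Lemmas step: map variant forms to a single lemma
lemmas = {
    "slave-child": "children",
    "wife": "wives", "wifes": "wives",
    "child": "children", "childs": "children",
    "family": "families", "familys": "families",
}

def apply_lemmas(text):
    # Replace each whole word with its lemma; words not in the mapping pass through unchanged
    return " ".join(lemmas.get(word, word) for word in text.split())

print(apply_lemmas("the slave-child and his family"))  # the children and his families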

5.) Locate the downloaded text files and open one. We will discuss the results.

Part Two: Text Analysis with Voyant

1.) Download the Frederick Douglass dataset or use the one you created in Part One.

2.) Upload the dataset to Voyant.

3.) Explore some of the lemma words (above) with different Voyant tools. Look at the different ways you can view their frequencies, relationships to other words, and their locations within the texts.

Breakout Group Questions

1.) Which tools do you find most useful or promising, whether for analyzing these texts or other texts you are interested in exploring?

2.) What might be some of the challenges and pitfalls of Voyant as you understand it so far?

3.) Are there any ways you can see text analysis (in Voyant or another tool) fitting into your own research or teaching?

Alternative Texts for Further Exploration

  • Boston Public Schools related materials (Internet Archive - search "boston public school," facets selected: Texts, Always Available)

  • Boston health (Internet Archive - search "boston health," facets selected: Texts, Always Available)

  • Search Project Gutenberg for a variety of pre-1923 books.

Text Analysis

Contents

  • Voyant & Lexos Exercise

  • Python Exercise

Resources

  • BC Text & Data Mining Libguide

Project Examples:

  • Mining the Dispatch

  • The Virtual Text Project

  • Robots Reading Vogue

  • Cohort Succession Explains Most Change in Literary Culture (article)


Exercise 2: Python

Tools & Materials

  • Google Colab

  • Download dataset (the CSV of BC dissertation records used in this exercise)

Exercise

In this exercise, you will be using a small dataset of BC dissertations focused on segregation in Boston schools to conduct a word frequency analysis on the dissertation abstracts.

Getting Started

1.) Import Libraries

Import pandas (a library for data analysis and manipulation), csv (a module for reading and writing tabular data in CSV format), and the Natural Language Toolkit (NLTK, a platform for building Python programs that work with human language data, used here for statistical natural language processing).

import pandas as pd
import csv
import nltk

2.) Import Data

Upload the downloaded CSV file into the Colab session and read it into a pandas dataframe.

#Upload the csv file from your local directory
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('incubator_etd-dataset.csv')  #The filename of the uploaded csv file

3.) Preview Dataframe

a.) Cut and paste and run the code below to see the dataframe:

df.head() #PREVIEW A DATAFRAME

b.) Change df.head() to df.head(8) and run the code again. By default df.head() shows the first five rows; passing a number shows that many rows instead.
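In other words, part b runs:

df.head(8) #PREVIEW THE FIRST EIGHT ROWS OF THE DATAFRAME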

Text Cleaning

In this section, the text is cleaned. All of the cleaning is applied specifically to the abstract field, since that is what we will be analyzing.

1.) Remove Empty Cells, Remove Punctuation, Convert to Lowercase

The following code:

  • Gets rid of records (rows in the dataset spreadsheet) that do not have abstracts.

  • Removes punctuation so that it will not affect how tokens are created and how words are counted.

  • Makes all uppercase letters lowercase, because otherwise words containing uppercase letters would be counted as different from the same words in lowercase. For example, "Education" and "education" would be counted as two different words.

#DROP RECORDS WITH NO ABSTRACT TEXT
df.dropna(subset=['abstract'], inplace=True)

#REMOVE PUNCTUATION
import re
df['abstract'] = [re.sub(r'[^\w\s]+', '', s) for s in df['abstract'].tolist()]

#CONVERT TO LOWER CASE
df['abstract'] = df['abstract'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df.head() #PREVIEW DATAFRAME

After the code is run, a new dataframe will show the changes that have been made.
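To see what that punctuation pattern does on its own, here is a one-line check (the sample string is invented):

import re
print(re.sub(r'[^\w\s]+', '', "Boston's schools, 1974."))  # Bostons schools 1974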

2.) Tokenization

Here the Natural Language Toolkit (NLTK) library is used to tokenize the text so that each individual word is a token. 'punkt' is the Punkt sentence tokenizer, an NLTK model that word_tokenize relies on.

nltk.download('punkt')
def tokenize_text(row):
    d = row['abstract']
    tokens = nltk.word_tokenize(d)
    token_words = [w for w in tokens]
    return token_words

#ADD TOKENIZED TEXT TO NEW DATAFRAME COLUMN
df['abs_tokenize'] = df.apply(tokenize_text, axis=1)
df.head() #PREVIEW DATAFRAME

Again a new dataframe will be shown. Note that a new column, abs_tokenize, has been added; this is where the tokenized text is. The abstract column remains untouched.
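To make the idea of tokens concrete, here is a small standalone example (the sample phrase is invented) of what nltk.word_tokenize returns:

import nltk
nltk.download('punkt')
print(nltk.word_tokenize("Desegregation of the Boston schools"))  # ['Desegregation', 'of', 'the', 'Boston', 'schools']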

3.) Stopwords

a.) Download the NLTK stopwords list:

nltk.download('stopwords')   #DOWNLOAD STOPWORDS FROM NLTK
from nltk.corpus import stopwords
stops = set(stopwords.words("english")) #STORE STOPWORDS IN stops
print(stops) #SHOW STOPWORDS

b.) Apply the stopwords and add the changes to a new abs_nostops column:

def remove_stops(row):
    d = row['abs_tokenize']
    meaningful_words = [w for w in d if not w in stops]
    return (meaningful_words)
#ADD TOKENIZED TEXT WITH STOPWORDS REMOVED TO NEW DATAFRAME COLUMN
df['abs_nostops'] = df.apply(remove_stops, axis=1)
df.head() #PREVIEW DATAFRAME

c.) Add customized stopwords not included in the NLTK list:

own_stops = {'study', 'school', 'schools', 'public'}
stops.update(own_stops)
print(stops) #Stopwords list is updated

d.) Apply the updated stopwords and add the changes to the abs_nostops column:

def remove_stops(row):
    d = row['abs_tokenize']
    meaningful_words = [w for w in d if not w in stops]
    return (meaningful_words)
df['abs_nostops'] = df.apply(remove_stops, axis=1)
df.head() #PREVIEW DATAFRAME

Word Frequency

1.) Counting Words

a.) Gather the token lists from the abs_nostops field:

abs_count = []
for i, row in df.iterrows():
    abs_count.append(row['abs_nostops'])

b.) Use Counter, a container that keeps track of how many times equivalent values are added, to calculate word frequency (a small standalone example follows this step). Note that because each abstract's tokens are passed through set(), a word is counted at most once per abstract, so the totals are the number of abstracts a word appears in rather than its raw number of occurrences:

# Import Counter()
from collections import Counter
# Build a Counter object called `word_frequency`, counting each word once per abstract
word_frequency = Counter(x for xs in abs_count for x in set(xs))

c.) Display the 15 most common words:

word_frequency.most_common(15)
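If Counter is new to you, this tiny standalone example (with made-up tokens) shows the behavior the code above relies on:

from collections import Counter
counts = Counter(['busing', 'boston', 'busing'])
print(counts.most_common(1))  # [('busing', 2)]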

2.) Add more stopwords

Return to the code written for step 3.c, own_stops = {'study', 'school', 'schools', 'public'}, and add additional stopwords that you think should be included, as in the example below. Then rerun steps 3.c, 3.d, and the word counts.
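As one possible illustration (these extra words are suggestions, not part of the original exercise), the set could be extended like this:

own_stops = {'study', 'school', 'schools', 'public', 'boston', 'education', 'also'}  # add your own candidates
stops.update(own_stops)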

3.) Visualization

a.) Import Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python:

import matplotlib.pyplot as plt

b.) Identify the 30 most used words:

a = word_frequency.most_common(30)

c.) Display the results in a horizontal bar chart, with the words along one axis and their counts drawn as blue bars. Inverting the y-axis puts the most frequent word at the top:

bar_values = list(list(zip(*a)))
x_val = list(bar_values[0])
y_val = list(bar_values[1])

plt.figure(figsize=(12,8)) #Customize plot size
plt.barh(x_val, y_val, color='blue', height=0.3)
plt.xlabel("Word Counts")
plt.gca().invert_yaxis()
plt.show()
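Optionally, if you want to keep a copy of the chart, you can save it by adding a line like the following just before plt.show() (the filename is arbitrary):

plt.savefig('word_frequency.png', dpi=150, bbox_inches='tight')  #SAVE THE CHART AS A PNG (OPTIONAL)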
