Exercise 2: Python

Tools & Materials

  • Google Colab
  • Download dataset

Exercise

In this exercise, you will be using a small dataset of BC dissertations focused on segregation in Boston schools to conduct a word frequency analysis on the dissertation abstracts.

Getting Started

1.) Import Libraries

Import pandas (a library for data analysis and manipulation), csv (a module for reading and writing tabular data in CSV format), and the Natural Language Toolkit (nltk, a platform for building Python programs that work with human language data, with applications in statistical natural language processing).

import pandas as pd #DATA ANALYSIS AND MANIPULATION
import csv #READ AND WRITE CSV FILES
import nltk #NATURAL LANGUAGE TOOLKIT

2.) Import Data

#Upload csv file from your local directory
from google.colab import files
uploaded = files.upload()
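
files.upload() opens a file picker so you can choose the CSV from your computer. If the dataset is hosted at a URL instead, pandas can read it directly; the URL below is only a placeholder for wherever the file actually lives, in which case the pd.read_csv() call in the next step would point at that URL:

#ALTERNATIVE (SKETCH): READ THE CSV DIRECTLY FROM A URL
#THE URL IS A PLACEHOLDER, NOT THE REAL DATASET LOCATION
df = pd.read_csv('https://example.com/incubator_etd-dataset.csv')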

3.) Preview Dataframe

a.) Copy and paste the code below, then run it to preview the dataframe.

df = pd.read_csv('incubator_etd-dataset.csv') #THE FILENAME OF THE UPLOADED CSV FILE
df.head() #PREVIEW A DATAFRAME

b.) Change df.head() to df.head(8) and run the code again.
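
df.head() shows the first five rows by default; the argument sets how many rows to preview:

df.head(8) #PREVIEW THE FIRST EIGHT ROWS INSTEAD OF THE DEFAULT FIVE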

Text Cleaning

In this section, the text is cleaned. All of the cleaning is applied specifically to the abstract field, since that is what we will be analyzing.

1.) Remove Empty Cells, Remove Punctuation, Convert to Lowercase

The following code

  • Gets rid of records (rows in the dataset spreadsheet) that do not have abstracts.

  • Removes punctuation so that it will not affect how tokens are created and how words are counted.

  • Converts all uppercase letters to lowercase; otherwise, words containing uppercase letters will be counted as different from the same words in lowercase. For example, "Education" and "education" would be counted as two different words.

After the code is run, a new dataframe will show the changes that have been made.

#DROP RECORDS WITH NO ABSTRACT TEXT
df.dropna(subset=['abstract'], inplace=True)

#REMOVE PUNCTUATION
import re
df['abstract'] = [re.sub(r'[^\w\s]+', '', s) for s in df['abstract'].tolist()]

#CONVERT TO LOWER CASE
df['abstract'] = df['abstract'].apply(lambda x: " ".join(w.lower() for w in x.split()))
df.head() #PREVIEW DATAFRAME

2.) Tokenization

Here the Natural Language Toolkit (NLTK) library is used to tokenize the text so that each individual word becomes a token. 'punkt' is the Punkt sentence tokenizer, a pre-trained NLTK model that the word tokenizer relies on.

Again, a new dataframe preview will be shown. Note that a new column, abs_tokenize, has been added; this is where the tokenized text is stored. The abstract column remains untouched.

nltk.download('punkt') #DOWNLOAD THE PUNKT TOKENIZER MODEL
def tokenize_text(row):
    d = row['abstract']
    tokens = nltk.word_tokenize(d)
    return tokens

#ADD TOKENIZED TEXT TO NEW DATAFRAME COLUMN 
df['abs_tokenize'] = df.apply(tokenize_text, axis=1)
df.head() #PREVIEW DATAFRAME
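
Depending on your NLTK version, word_tokenize may look for the newer punkt_tab resource instead of punkt; if the cell above raises a LookupError, downloading it should fix things (a version-dependent note, not part of the original exercise):

nltk.download('punkt_tab') #NEWER NLTK RELEASES LOAD THIS RESOURCE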

3.) Stopwords

a.) Download the NLTK stopwords list:

nltk.download('stopwords')   #DOWNLOAD STOPWORDS FROM NLTK
from nltk.corpus import stopwords
stops = set(stopwords.words("english")) #STORE STOPWORDS IN stops 
print(stops) #SHOW STOPWORDS

b.) Apply stopwords and add changes to abs_nostops column:

def remove_stops(row):
    d = row['abs_tokenize']
    meaningful_words = [w for w in d if w not in stops]
    return meaningful_words
#ADD TOKENIZED TEXT WITH STOPWORDS REMOVED TO NEW DATAFRAME COLUMN 
df['abs_nostops'] = df.apply(remove_stops, axis=1)
df.head() #PREVIEW DATAFRAME

c.) Add customized stopwords not included in the NLTK list:

own_stops = {'study', 'school', 'schools', 'public'}
stops.update(own_stops)
print(stops) #Stopwords List is updated

d.) Apply the updated stopwords list, overwriting the abs_nostops column:

def remove_stops(row):
    d = row['abs_tokenize']
    meaningful_words = [w for w in d if w not in stops]
    return meaningful_words
df['abs_nostops'] = df.apply(remove_stops, axis=1)
df.head() #PREVIEW DATAFRAME

Word Frequency

1.) Counting Words

a.) Collect the token lists from the abs_nostops field:

abs_count = []
for i, row in df.iterrows():
    abs_count.append(row['abs_nostops']) #ONE LIST OF TOKENS PER ABSTRACT

b.) Use Counter, a container that keeps track of how many times equivalent values are added, to calculate word frequency:

# Import Counter()
from collections import Counter
# Create an empty Counter object called `word_frequency`
word_frequency = Counter()

c.) Display the 15 most common words:

word_frequency = Counter(x for xs in abs_count for x in set(xs)) #set() COUNTS EACH WORD AT MOST ONCE PER ABSTRACT
word_frequency.most_common(15) #SHOW THE 15 MOST COMMON WORDS
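
Because of the set(xs) call, each word is counted at most once per abstract, so these numbers are document frequencies (how many abstracts a word appears in). If you want total occurrences across all abstracts instead, a small variant (a sketch, not part of the original exercise) simply drops the set():

word_frequency_total = Counter(x for xs in abs_count for x in xs) #COUNT EVERY OCCURRENCE OF EVERY WORD
word_frequency_total.most_common(15)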

2.) Add more stopwords

Return to the code written for step 3.c, own_stops = {'study', 'school', 'schools', 'public'}, add any additional stopwords that you think should be removed, and re-run steps 3.c and 3.d along with the word frequency cells. A sketch follows below.
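
For example (the added words here are purely illustrative; choose your own based on what the frequency list shows):

own_stops = {'study', 'school', 'schools', 'public', 'boston', 'education'} #'boston' AND 'education' ARE ILLUSTRATIVE ADDITIONS
stops.update(own_stops)
print(stops) #Stopwords list is updated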

3.) Visualization

a.) Import Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python.

import matplotlib.pyplot as plt

b.) Identify the 30 most used words.

a = word_frequency.most_common(30) #LIST OF (word, count) PAIRS
bar_values = list(zip(*a)) #SPLIT INTO A TUPLE OF WORDS AND A TUPLE OF COUNTS

c.) Display the results in a horizontal bar chart, with the words along the y-axis and blue bars showing their counts along the x-axis.

x_val = list(bar_values[0]) #WORDS
y_val = list(bar_values[1]) #COUNTS

plt.figure(figsize=(12,8)) #Customize plot size
plt.barh(x_val, y_val, color='blue', height=0.3)
plt.xlabel("Word Counts")
plt.gca().invert_yaxis() #PUT THE MOST FREQUENT WORD AT THE TOP
plt.show()
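
If you want to keep a copy of the chart, matplotlib can write it to a file; add the savefig() call just before plt.show() in the cell above. The filename here is only an example, and files.download() is the standard Colab helper for pulling a file to your machine:

#A MINIMAL SKETCH: SAVE THE CHART TO A PNG ('word_frequency.png' IS AN EXAMPLE FILENAME)
plt.savefig('word_frequency.png', dpi=150, bbox_inches='tight')

#IN COLAB, DOWNLOAD THE SAVED FILE TO YOUR OWN MACHINE
from google.colab import files
files.download('word_frequency.png')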