Exercise 2: Python

Tools & Materials

Exercise

In this exercise, you will use a small dataset of BC dissertations focused on segregation in Boston schools to conduct a word frequency analysis of the dissertation abstracts.

Getting Started

1.) Import Libraries

Import Pandas (a library for data analysis and manipulation), CSV (a module for reading and writing tabular data in CSV format), and the Natural Language Toolkit (a platform for building Python programs that work with human language data, used here for statistical natural language processing).

import pandas as pd
import csv
import nltk

2.) Import Data

#Upload csv file from your local directory
from google.colab import files
uploaded = files.upload()
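
Note that files.upload() is specific to Google Colab. If you are running this notebook in a local Jupyter environment instead, skip this step and simply place the csv file in the notebook's working directory before calling pd.read_csv in the next step.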

3.) Preview Dataframe

a.) Copy and paste the code below, then run it to see the dataframe.

df = pd.read_csv('incubator_etd-dataset.csv')  #The filename of the uploaded csv file
df.head() #PREVIEW THE DATAFRAME

b.) Change df.head() to df.head(8) and run the code again.
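
After the change, the preview line looks like this:

df.head(8) #PREVIEW THE FIRST 8 ROWS INSTEAD OF THE DEFAULT 5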

Text Cleaning

In this section, the text is cleaned. All of the cleaning is being applied specifically to the abstracts field since that is what we will be analyzing.

1.) Remove Empty Cells, Remove Punctuation, Convert to Lowercase

The following code:

  • Drops records (rows in the dataset spreadsheet) that do not have abstracts.

  • Removes punctuation so that it will not affect how tokens are created and how words are counted.

  • Converts all uppercase letters to lowercase; otherwise, words containing uppercase letters would be counted as different from the same words in lowercase. For example, "Education" and "education" would be counted as two different words.

After the code is run, a new dataframe will show the changes that have been made.

#DROP RECORDS WITH NO ABSTRACT TEXT
df.dropna(subset=['abstract'], inplace=True)

#REMOVE PUNCTUATION (A RAW STRING AVOIDS AN INVALID-ESCAPE WARNING)
import re
df['abstract'] = [re.sub(r'[^\w\s]+', '', s) for s in df['abstract'].tolist()]

#CONVERT TO LOWER CASE
df['abstract'] = df['abstract'].apply(lambda x: " ".join(word.lower() for word in x.split()))
df.head() #PREVIEW DATAFRAME

2.) Tokenization

Here the Natural Language Toolkit (NLTK) library is used to tokenize the text so that each individual word becomes a token. 'Punkt' is the Punkt Sentence Tokenizer, the NLTK model that word_tokenize relies on under the hood.

Again, the updated dataframe will be previewed. Note that a new column, abs_tokenize, has been added; this is where the tokenized text is stored. The abstract column remains untouched.

nltk.download('punkt')
def tokenize_text(row):
    #SPLIT THE ABSTRACT INTO INDIVIDUAL WORD TOKENS
    return nltk.word_tokenize(row['abstract'])

#ADD TOKENIZED TEXT TO NEW DATAFRAME COLUMN 
df['abs_tokenize'] = df.apply(tokenize_text, axis=1)
df.head() #PREVIEW DATAFRAME
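
To see what the tokenizer does, try it on a single string (a toy example, not part of the dataset):

nltk.word_tokenize("Boston schools were segregated.")
#RETURNS ['Boston', 'schools', 'were', 'segregated', '.']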

3.) Stopwords

a.) Download the NLTK stopwords list:

nltk.download('stopwords')   #DOWNLOAD STOPWORDS FROM NLTK
from nltk.corpus import stopwords
stops = set(stopwords.words("english")) #STORE STOPWORDS IN stops 
print(stops) #SHOW STOPWORDS

b.) Remove stopwords and store the results in a new abs_nostops column:

def remove_stops(row):
    d = row['abs_tokenize']
    #KEEP ONLY TOKENS THAT ARE NOT IN THE STOPWORDS SET
    meaningful_words = [w for w in d if w not in stops]
    return meaningful_words
#ADD TOKENIZED TEXT WITH STOPWORDS REMOVED TO NEW DATAFRAME COLUMN 
df['abs_nostops'] = df.apply(remove_stops, axis=1)
df.head() #PREVIEW DATAFRAME

c.) Add customized stopwords not included in the NLTK list:

own_stops = {'study', 'school', 'schools', 'public'}
stops.update(own_stops)
print(stops) #Stopwords List is updated

d.) Re-apply stopword removal so the updated list takes effect, overwriting the abs_nostops column:

def remove_stops(row):
    d = row['abs_tokenize']
    meaningful_words = [w for w in d if w not in stops]
    return meaningful_words
df['abs_nostops'] = df.apply(remove_stops, axis=1)
df.head() #PREVIEW DATAFRAME

Word Frequency

1.) Counting Words

a.) Collect the token lists from the abs_nostops field:

abs_count = []
for i, row in df.iterrows():
    abs_count.append(row['abs_nostops']) #COLLECT EACH ABSTRACT'S TOKEN LIST
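
An equivalent one-liner that produces the same list, using pandas directly:

abs_count = df['abs_nostops'].tolist() #SAME RESULT AS THE LOOP ABOVE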

b.) Use Counter, a container that keeps track of how many times equivalent values are added, to calculate word frequency:

# Import Counter()
from collections import Counter
# Create an empty Counter object called `word_frequency`
word_frequency = Counter()
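
As a quick illustration of how Counter behaves (a standalone toy example):

Counter(['apple', 'pear', 'apple']) #RETURNS Counter({'apple': 2, 'pear': 1})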

c.) Display the 15 most common words. Note that set(xs) turns each abstract's tokens into a set first, so a word is counted at most once per abstract; the result is the number of abstracts in which each word appears, rather than its total number of occurrences:

word_frequency = Counter(x for xs in abs_count for x in set(xs))
word_frequency.most_common(15)
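
If you would rather count every occurrence of each word across all abstracts (a small variant, not part of the original exercise), drop the set():

total_frequency = Counter(x for xs in abs_count for x in xs) #COUNTS REPEATS WITHIN AN ABSTRACT
total_frequency.most_common(15)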

2.) Add more stopwords

Return to the code written for step 3.c, own_stops = {'study', 'school', 'schools', 'public'}, and add additional stopwords that you think should be added; a sketch follows below.
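
For example (the extra words here are only illustrative; choose ones that dominate your own results):

own_stops = {'study', 'school', 'schools', 'public', 'boston', 'education'} #ILLUSTRATIVE ADDITIONS
stops.update(own_stops)

Remember to re-run step 3.d and the word-counting steps above so the new list takes effect.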

3.) Visualization

a.) Import Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python.

import matplotlib.pyplot as plt

b.) Identify the 30 most used words.

a = word_frequency.most_common(30) #LIST OF (word, count) PAIRS
bar_values = list(zip(*a)) #UNZIP INTO A SEQUENCE OF WORDS AND A SEQUENCE OF COUNTS

c.) Display the results in a horizontal bar chart, with the words along the y-axis and blue bars showing the counts along the x-axis.

x_val = list(bar_values[0]) #WORDS
y_val = list(bar_values[1]) #COUNTS

plt.figure(figsize=(12,8)) #Customize plot size
plt.barh(x_val, y_val, color='blue', height=0.3) #HORIZONTAL BARS: WORDS ON THE Y-AXIS
plt.xlabel("Word Counts")
plt.gca().invert_yaxis() #PUT THE MOST COMMON WORD AT THE TOP
plt.show()
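
If you want to keep the chart as an image file, add a plt.savefig call just before plt.show() (the filename here is only an example):

plt.savefig('word_frequency.png', bbox_inches='tight') #SAVE THE FIGURE; CALL BEFORE plt.show()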
