In this exercise, you will be using a small dataset of BC dissertations focused on segregation in Boston schools to conduct a word frequency analysis on the dissertation abstracts.
1.) Import Libraries
Import Pandas (a library for data analysis and manipulation), csv (a module for reading and writing tabular data in CSV format), and the Natural Language Toolkit (NLTK), a platform for building Python programs that work with human language data, commonly applied in statistical natural language processing.
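A minimal sketch of the imports, assuming the conventional pd alias for Pandas:

    import pandas as pd   # data analysis and manipulation
    import csv            # reading and writing tabular data in CSV format
    import nltk           # Natural Language Toolkit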
2.) Import Data
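A minimal sketch of loading the data into a dataframe; the file name bc_dissertations.csv is an assumption, not the lesson's actual file:

    # Load the dissertation dataset; replace the file name with your own CSV
    df = pd.read_csv('bc_dissertations.csv')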
3.) Preview Dataframe
a.) Copy and paste the code, then run it to see the dataframe.
b.) Change df.head() to df.head(8) and run the code again.
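For reference, the two preview calls look like this; df.head() shows the first five rows by default, and passing a number shows that many rows:

    df.head()    # first 5 rows (the default)
    df.head(8)   # first 8 rows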
In this section, the text is cleaned. All of the cleaning is applied specifically to the abstract field, since that is what we will be analyzing.
1.) Remove Empty Cells, Remove Punctuation, Convert to Lowercase
The following code:
Gets rid of records (rows in the dataset spreadsheet) that do not have abstracts.
Removes punctuation so that it does not affect how tokens are created and how words are counted.
Converts all uppercase letters to lowercase; otherwise, words containing uppercase letters would be counted as different from the same words in lowercase. For example, "Education" and "education" would be counted as two different words.
After the code is run, a new dataframe will show the changes that have been made.
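A minimal sketch of the cleaning, assuming the abstracts are in a column named abstract (the column referenced in the tokenization step below); the regex approach is one common way to strip punctuation, not necessarily the lesson's exact code:

    # Drop records (rows) that have no abstract
    df = df.dropna(subset=['abstract'])

    # Remove punctuation so it does not affect tokenization or word counts
    df['abstract'] = df['abstract'].str.replace(r'[^\w\s]', '', regex=True)

    # Lowercase everything so "Education" and "education" count as one word
    df['abstract'] = df['abstract'].str.lower()

    df.head()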
2.) Tokenization
Here the Natural Language Toolkit (NLTK) library is used to tokenize the text so that each individual word is a token. 'Punkt' is the Punkt sentence tokenizer, a pretrained NLTK model that the word tokenizer relies on.
Again, a new dataframe will be created. Note that a new column, abs_tokenize, has been added; this is where the tokenized text is stored. The abstract column remains untouched.
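A minimal sketch of the tokenization step, applying NLTK's word_tokenize to the cleaned abstract column:

    import nltk
    from nltk.tokenize import word_tokenize

    # Download the Punkt models that word_tokenize depends on
    nltk.download('punkt')

    # Split each cleaned abstract into a list of word tokens
    df['abs_tokenize'] = df['abstract'].apply(word_tokenize)

    df.head()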
3.) Stopwords
a.) Download the NLTK stopwords list:
b.) Apply stopwords and add changes to abs_nostops column:
c.) Add customized stopwords not included in the NLTK:
d.) Apply the expanded stopword list and update the abs_nostops column (a sketch of steps a through d follows):
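A minimal sketch of steps a through d, using the own_stops set quoted in the next section; the list-comprehension filter is one common approach, not necessarily the lesson's exact code:

    import nltk
    from nltk.corpus import stopwords

    # a) Download the NLTK stopwords list
    nltk.download('stopwords')
    stops = set(stopwords.words('english'))

    # b) Filter stopwords out of the tokens and store the result
    df['abs_nostops'] = df['abs_tokenize'].apply(
        lambda tokens: [t for t in tokens if t not in stops])

    # c) Add customized stopwords not included in NLTK
    own_stops = {'study', 'school', 'schools', 'public'}
    stops = stops | own_stops

    # d) Re-apply the expanded stopword list and update abs_nostops
    df['abs_nostops'] = df['abs_tokenize'].apply(
        lambda tokens: [t for t in tokens if t not in stops])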
1.) Counting Words
a.) Count words in abs_nostops field:
b.) Use Counter, a container that keeps track of how many times equivalent values are added, to calculate word frequency:
c.) Display the 15 most common words:
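A minimal sketch of steps a through c, assuming the counts are gathered across every row of abs_nostops:

    from collections import Counter

    # a) Gather every word from the abs_nostops field into one list
    all_words = [word for tokens in df['abs_nostops'] for word in tokens]

    # b) Tally how many times each word occurs
    word_freq = Counter(all_words)

    # c) Display the 15 most common words
    print(word_freq.most_common(15))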
2.) Add more stopwords
Return to the code written for step 3.c, own_stops = {'study', 'school', 'schools', 'public'}, and add any further stopwords that you think should be excluded.
3.) Visualization
a.) Import Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python.
b.) Identify the 30 most used words.
c.) Display the results in a bar chart, with the words along the x-axis and their frequencies drawn as blue bars.
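A minimal sketch of steps a through c, reusing the word_freq counter built above; the figure size and label rotation are assumptions added for readability:

    import matplotlib.pyplot as plt

    # b) Identify the 30 most used words
    top_words = word_freq.most_common(30)
    words = [w for w, c in top_words]
    counts = [c for w, c in top_words]

    # c) Bar chart: words along the x-axis, frequencies as blue bars
    plt.figure(figsize=(12, 6))
    plt.bar(words, counts, color='blue')
    plt.xticks(rotation=90)
    plt.xlabel('Word')
    plt.ylabel('Frequency')
    plt.show()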