Create A Stopwords List

Create a stopwords list from the Natural Language Toolkit (NLTK)

Research Notebook: Creating a Stopwords List for Research
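
The notebook builds its base list from NLTK. A minimal sketch of that step, assuming the notebook's data/ directory and the stop_words variable name used below:

import csv
import nltk

nltk.download('stopwords')            # Fetch the stopwords corpus (one-time)
from nltk.corpus import stopwords

stop_words = stopwords.words('english')   # Common English stopwords as a list

# Save as a single-row CSV, matching the notebook's data/stop_words.csv
with open('data/stop_words.csv', 'w', newline='') as f:
    csv.writer(f).writerow(stop_words)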

Download the stopwords list to your local device:

Step 1: Click "Jupyter" and go to the main directory Step 2: Go to folder "Data" Step 3: Check "stop_words.csv" and click Download

Customize a stopwords list by adding your own stopwords:

import csv

my_stopwords = ['my_word1', 'my_word2']   # Add two custom stopwords
stop_words = stop_words + my_stopwords
# print(len(stop_words))   # Optional: check the new list length

# Write the combined list back out as a single CSV row
with open('data/stop_words.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(stop_words)

Build A Dataset

Build a dataset >>

  • Multiple Collections: Anchor collections from JSTOR and Portico, with additional content sources continually added.

  • Data download in JSON format.

  • Open content: bibliographic metadata, full text, unigrams, bigrams, and trigrams.

  • Dataset Dashboard: Easily view datasets you have built or accessed.

Example of one extraction:

  • Dataset ID: a unique identifier for the extracted dataset; it can be used to retrieve the dataset in research notebooks (see the sketch after this list).

  • Analyze: a tutorial version for learning how to use the research notebooks.

  • Download: metadata in CSV format, raw text data in JSON format.
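
The Dataset ID feeds the tdm_client helper that the research notebooks use. A minimal sketch, assuming tdm_client.get_dataset works as in the Constellate notebooks and using a placeholder ID:

import tdm_client

dataset_id = "your-dataset-id"   # Placeholder: copy the Dataset ID from your dashboard

# Download the dataset and get the local file path
dataset_file = tdm_client.get_dataset(dataset_id)

# Each document is a dict carrying metadata and n-gram counts
for document in tdm_client.dataset_reader(dataset_file):
    print(document['id'])
    break   # Peek at the first document only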

Overview of Constellate

Roadmap of Constellate - Top Level

Constellate provides:

1. Access to over 29 million documents from sources including JSTOR and Portico.
2. Research Notebooks (Jupyter Notebooks) that provide pre-built code snippets for a number of text analysis tasks.

Text data and notebooks can be used together or separately; data is downloaded in JSON format.

Text Analysis in JSTOR

Constellate - new text analysis platform by JSTOR Labs

Constellate, the new text and data analytics service from JSTOR and Portico, is a platform for learning and performing text analysis, building datasets, and sharing analytics course materials. The platform provides value to users in three core areas: teaching and learning text analytics, building datasets from multiple content sources, and visualizing and analyzing those datasets.

Word Frequency

This notebook finds the word frequencies for a dataset.

Explore word frequency of your own extracted data

>>> Go to Constellate

Built-in visualizations are available by clicking the link under the word cloud.

examples

Roadmap of Constellate - Function Details and Features

>> collection details
>> access to Research Notebook

Modify "4. Find Word Frequencies" by collecting every word into a string:

Research Notebook: Exploring Word Frequencies for Research

from collections import Counter

# Hold our word counts in a Counter object
transformed_word_frequency = Counter()

word_str = ""   # Added: accumulates every word for the word cloud below

# Apply filter list
for document in tdm_client.dataset_reader(dataset_file):
    if use_filtered_list is True:
        document_id = document['id']
        # Skip documents not in our filtered_id_list
        if document_id not in filtered_id_list:
            continue
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = gram.lower()   # Lowercase the unigram
        word_str += " " + clean_gram   # Added: string of all words
        if clean_gram in stop_words:   # Skip unigrams in the stop words list
            continue
        if not clean_gram.isalpha():   # Skip unigrams that are not purely alphabetic
            continue
        transformed_word_frequency[clean_gram] += count

Create a bar chart for the 20 most frequently used words

import matplotlib.pyplot as plt

# Split the top 20 (word, count) pairs into two parallel sequences
top_words = transformed_word_frequency.most_common(20)
x_val, y_val = zip(*top_words)

plt.figure(figsize=(12, 8))   # Customize plot size
plt.barh(x_val, y_val, color='blue', height=0.3)
plt.xlabel("Word Counts")
plt.gca().invert_yaxis()      # Most frequent word at the top
plt.show()

Create a wordcloud chart for the extracted text data

# Install wordcloud (run once in its own notebook cell)
!pip install wordcloud

# Import WordCloud and matplotlib to build and display the chart
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Added: generate a word cloud from word_str, filtering with the custom stop words list
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(word_str)

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()