Create A Stopwords List

Create a stopwords list from the Natural Language Toolkit (NLTK)

Research Notebook: Creating a Stopwords List for Research
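
The notebook builds its base list from NLTK. A minimal sketch of that step, assuming the notebook's data/ directory and the stop_words variable name used below:

import csv
import nltk

nltk.download('stopwords')            # Fetch the stopwords corpus (one-time)
from nltk.corpus import stopwords

stop_words = stopwords.words('english')   # Common English stopwords as a list

# Save as a single-row CSV, matching the notebook's data/stop_words.csv
with open('data/stop_words.csv', 'w', newline='') as f:
    csv.writer(f).writerow(stop_words)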

Download the stopwords list to your local device:

Step 1: Click "Jupyter" and go to the main directory Step 2: Go to folder "Data" Step 3: Check "stop_words.csv" and click Download

Customize a stopwords list by adding your own stopwords:

import csv

my_stopwords = ['my_word1', 'my_word2']   # Add two custom stopwords
stop_words = stop_words + my_stopwords
# print(len(stop_words))   # Optional: check the new list length

# Write the combined list back out as a single CSV row
with open('data/stop_words.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(stop_words)

Build A Dataset

Build a dataset >>

  • Multiple Collections: Anchor collections from JSTOR and Portico, with additional content sources continually added.

  • Data download in JSON format.

  • Open content: bibliographic metadata, full text, unigrams, bigrams, and trigrams.

  • Dataset Dashboard: Easily view datasets you have built or accessed.

Example of one extraction:

  • Dataset ID: a unique identifier for the extracted dataset; it can be used to retrieve the dataset in research notebooks (see the sketch after this list).

  • Analyze: a tutorial version for learning how to use the research notebooks.

  • Download: metadata in CSV format, raw text data in JSON format.
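
The Dataset ID feeds the tdm_client helper that the research notebooks use. A minimal sketch, assuming tdm_client.get_dataset works as in the Constellate notebooks and using a placeholder ID:

import tdm_client

dataset_id = "your-dataset-id"   # Placeholder: copy the Dataset ID from your dashboard

# Download the dataset and get the local file path
dataset_file = tdm_client.get_dataset(dataset_id)

# Each document is a dict carrying metadata and n-gram counts
for document in tdm_client.dataset_reader(dataset_file):
    print(document['id'])
    break   # Peek at the first document only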

Overview of Constellate

Roadmap of Constellate - Top Level

Constellate provides:

1. Access to over 29 million documents from sources including JSTOR and Portico.
2. Research Notebooks (Jupyter Notebooks) that provide pre-built code snippets for a number of text analysis tasks.

Text data and notebooks can be used together or separately; data is downloaded in JSON format.

Text Analysis in JSTOR

Constellate - new text analysis platform by JSTOR Labs

Constellate, the new text and data analytics service from JSTOR and Portico, is a platform for learning and performing text analysis, building datasets, and sharing analytics course materials. The platform provides value to users in three core areas: teaching and learning text analytics, building datasets from multiple content sources, and visualizing and analyzing those datasets.

Word Frequency

This notebook finds the word frequencies for a dataset.

Explore word frequency of your own extracted data

>>> Go to Constellate

Built-in visualizations are available by clicking the link under the word cloud.

examples

Roadmap of Constellate - Function Details and Features

>> collection details
>> access to Research Notebook

Modify "4. Find Word Frequencies" by collecting every word into a string:

Research Notebook: Exploring Word Frequencies for Research

from collections import Counter

# Hold our word counts in a Counter object
transformed_word_frequency = Counter()

word_str = ""   # Added: accumulates every word for the word cloud below

# Apply filter list
for document in tdm_client.dataset_reader(dataset_file):
    if use_filtered_list is True:
        document_id = document['id']
        # Skip documents not in our filtered_id_list
        if document_id not in filtered_id_list:
            continue
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = gram.lower()   # Lowercase the unigram
        word_str += " " + clean_gram   # Added: string of all words
        if clean_gram in stop_words:   # Skip unigrams in the stop words list
            continue
        if not clean_gram.isalpha():   # Skip unigrams that are not purely alphabetic
            continue
        transformed_word_frequency[clean_gram] += count

Create a bar chart for the 20 most frequently used words

import matplotlib.pyplot as plt

# Split the top 20 (word, count) pairs into two parallel sequences
top_words = transformed_word_frequency.most_common(20)
x_val, y_val = zip(*top_words)

plt.figure(figsize=(12, 8))   # Customize plot size
plt.barh(x_val, y_val, color='blue', height=0.3)
plt.xlabel("Word Counts")
plt.gca().invert_yaxis()      # Most frequent word at the top
plt.show()

Create a wordcloud chart for the extracted text data

# Install wordcloud (run once in its own notebook cell)
!pip install wordcloud

# Import WordCloud and matplotlib to build and display the chart
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Added: generate a word cloud from word_str, filtering with the custom stop words list
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(word_str)

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()