In this text analysis example, Ted Underwood and David Bamman used BookNLP, a Java-based natural language processing pipeline, to explore gender in 93,708 English-language fiction volumes. They summarize one of their major findings as follows:
There is a clear decline from the nineteenth century (when women generally take up 40% or more of the “character space” in fiction) to the 1950s and 60s, when their prominence hovers around a low of 30%. A correction, beginning in the 1970s, almost restores fiction to its nineteenth-century state. (One way of thinking about this: second-wave feminism was a desperately-needed rescue operation.)
Visit their blog post to learn more about their methods and discoveries.
Here, CORD-19, a database containing thousands of scholarly articles about COVID-19 and related coronaviruses, provides a topic model and visualization of 2,437 journal articles. The approach used, latent Dirichlet allocation (LDA), is a generative statistical model from natural language processing that represents each document as a mixture of topics and each topic as a distribution over words.
Visit to interact with the visualization.
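To give a feel for how LDA assigns words to topics, here is a minimal sketch in pure Python that uses collapsed Gibbs sampling, one common way to fit the model. It is not the code behind the CORD-19 visualization. The function names (`fit_lda`, `top_words`), the parameter values, and the tiny word lists are all made up for illustration. Real projects typically rely on an established library and a far larger corpus.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit a toy LDA model with collapsed Gibbs sampling.

    docs is a list of tokenized documents (lists of words).
    Returns per-document topic counts and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    doc_topic = [[0] * num_topics for _ in docs]           # words in doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]    # times word w is assigned to topic k
    topic_total = [0] * num_topics                         # total words assigned to topic k
    assignments = []                                       # current topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                # Weight of each topic: how much the document uses the topic
                # times how much the topic uses this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


def top_words(word_counts, n=3):
    """Return up to n of the most frequent words currently assigned to a topic."""
    return [word for word, count in word_counts.most_common(n) if count > 0]


# Made-up miniature "articles", already split into words (not CORD-19 data).
docs = [
    ["vaccine", "trial", "dose", "vaccine", "immune"],
    ["genome", "sequence", "mutation", "genome", "protein"],
    ["vaccine", "dose", "immune", "trial"],
    ["protein", "mutation", "sequence", "genome"],
]

doc_topic, topic_word = fit_lda(docs, num_topics=2)

for k, counts in enumerate(topic_word):
    print(f"Topic {k}: {top_words(counts)}")
for d, counts in enumerate(doc_topic):
    print(f"Document {d} topic counts: {counts}")
```

Each printed topic is a list of words that tend to appear together, and each document is described by how many of its words fall into each topic. Deciding what a topic "means" is left to the reader, which is why interactive visualizations are a common companion to topic models.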