Text Repositories

The following resources provide text for text analysis projects.

Internet Archive Books (includes plain-text [“full text”] access to books, issues of magazines, etc.)
Early English Books Online (EEBO) (BC library resource)
(large number of texts available in a variety of forms, including plain text; texts are accessed one at a time)
(16 million volumes, mostly in English)
(12.8 million pages of American newspapers)
(narratives & literature from the American South)
(large collection of classical texts, much of it encoded in TEI/XML)
(ca. 50,000 early English books, many encoded in TEI/XML)
(197,745 London criminal trials, 1674–1913)
(debates & journals of the Canadian Senate & House of Commons)
(Parliamentary debates, 1901–1980)
(UK Parliamentary debates)
(see also ; 10,000 premodern Islamicate texts)
and (efforts to use computer vision to recognize handwriting)
(557 classical texts linked with a gazetteer of the ancient world)
(widely used corpora of American English)
(American adult fiction, 1774–1900)
(170K hours of captioned news programs; see for information on access)
(nearly 2 million pages of media-related books and articles, 1875–1995)
(classic Christian texts)
(1.8 million NYT articles + NYT-supplied metadata)
(many datasets from European libraries & archives, from papyri to photographs to newspapers)
(nearly complete run of Foreign Relations of the United States; see to obtain full text)
(a huge collection of websites, texts, audio, and other media, available for bulk download via wget)
(a catalog of Twitter datasets that are publicly available on the web)
(an effort to develop tools to analyze features of digital texts)
(“220,579 conversational exchanges between 10,292 pairs of movie characters”)
(repository of life sciences books, articles, and preprints)
(565 million documents collected by the National Library of Australia, including a sizeable collection of newspapers)
(4-million-word subcorpus of the 100-million-word British National Corpus, with part-of-speech tagging in XML)

TEI-Encoded

(BC library resource)
