Text Repositories
The following resources provide text for text analysis projects.
Internet Archive Books (includes plain-text [“full text”] access to books, issues of magazines, etc.)
Early English Books Online (EEBO) (BC library resource)
Oxford Text Archive (large number of texts available in a variety of forms, including plain text; texts are accessed one at a time)
HathiTrust (16 million volumes, mostly in English)
Chronicling America (12.8 million pages of American newspapers)
DocSouth Data (narratives & literature from the American South)
Perseus Digital Library (large collection of classical texts, much of it encoded in TEI/XML)
EEBO-TCP (ca. 50,000 early English books, many encoded in TEI/XML)
Old Bailey Online (197,745 London criminal trials, 1674–1913)
Canadian Hansard (debates & journals of the Canadian Senate & House of Commons)
Australian Hansard (Parliamentary debates, 1901–1980)
UK Hansard (UK Parliamentary debates)
Open Islamicate Texts Initiative (see also repositories; 10,000 premodern Islamicate texts)
Transkribus Corpus and READ (efforts to use computer vision to recognize handwriting)
ToposText (557 classical texts linked with a gazetteer of the ancient world)
BYU Corpora (widely used corpora of American English)
Wright American Fiction (American adult fiction, 1774–1900)
UCLA Broadcast NewsScape (170K hours of captioned news programs; see Red Hen Lab for information on access)
Media History Digital Library (nearly 2 million pages of media-related books and articles, 1875–1995)
Christian Classics Ethereal Library (classic Christian texts)
NYT Annotated Corpus (1.8 million NYT articles + NYT-supplied metadata)
Europeana Collections (many datasets from European libraries & archives, from papyri to photographs to newspapers)
Foreign Records of the US (a nearly complete run of Foreign Relations of the United States; see these tools to obtain full text)
Internet Archive (a huge collection of websites, texts, audio, and other media, available for bulk download via wget)
Twitter Datasets (a catalog of Twitter datasets that are publicly available on the web)
BitCurator (an effort to develop tools to analyze features of digital texts)
Movie Quotes Corpus (“220,579 conversational exchanges between 10,292 pairs of movie characters”)
Europe PMC (repository of life sciences books, articles, and preprints)
Trove Australia (565 million documents collected by the National Library of Australia, including a sizeable collection of newspapers)
BNC-Baby (4 million-word subcorpus of the 100 million-word British National Corpus, with part-of-speech tagging in XML)
TEI-Encoded
Eighteenth Century Collections Online (BC library resource)
Resources from Laura Nelson’s “Analyzing Complex Digitized Data”
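Several of the collections listed here (Perseus, EEBO-TCP, BNC-Baby) distribute texts as TEI/XML rather than plain text. As a rough sketch of how such a file might be reduced to plain text for analysis, the Python below parses a small made-up TEI fragment using only the standard library. The sample document and the function name `tei_body_text` are illustrative assumptions and are not taken from any of these repositories; files that do not declare the TEI namespace would need a different element lookup.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents (assumed here; older files may omit it).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A tiny invented TEI document, standing in for a downloaded file.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>First <hi rend="italic">marked-up</hi> paragraph.</p>
    <p>Second paragraph.</p>
  </body></text>
</TEI>"""


def tei_body_text(xml_string):
    """Return the text of the TEI <body>, with markup removed and whitespace collapsed."""
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() walks every text node under <body>, skipping the header metadata.
    raw = "".join(body.itertext())
    return " ".join(raw.split())


if __name__ == "__main__":
    print(tei_body_text(SAMPLE_TEI))
```

Restricting the search to the body element keeps header metadata such as the title out of the extracted text, which usually matters for word counts and similar measures.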