
# Text Repositories

The following resources provide text for text analysis projects.

* [**Internet Archive Books**](https://archive.org/details/internetarchivebooks) **(includes plain-text \[“full text”] access to books, issues of magazines, etc.)**
* [**Early English Books Online (EEBO)**](https://bc-primo.hosted.exlibrisgroup.com/permalink/f/l6ucgu/ALMA-BC61414345060001021) **(BC library resource)**
* [**Early Caribbean Digital Archive (ECDA)**](http://omekasites.northeastern.edu/ECDA/)
* [**Oxford Text Archive**](http://ota.ox.ac.uk/catalogue/index.html) **(large number of texts available in a variety of forms, including plain text; texts are accessed one at a time)**
* [**Project Gutenberg**](https://www.gutenberg.org/)
* [**HathiTrust**](https://www.hathitrust.org/) **(16 million volumes, mostly in English)**
* [**Chronicling America**](https://chroniclingamerica.loc.gov/) **(12.8 million pages of American newspapers)**
* [**DocSouth Data**](http://docsouth.unc.edu/docsouthdata/) **(narratives & literature from the American South)**
* [**Perseus Digital Library**](http://www.perseus.tufts.edu/hopper/) **(large collection of classical texts, much of it encoded in TEI/XML)**
* [**EEBO-TCP**](http://www.textcreationpartnership.org/tcp-eebo/) **(ca. 50,000 early English books, many encoded in TEI/XML)**
* [**Old Bailey Online**](https://www.oldbaileyonline.org/) **(197,745 London criminal trials, 1674–1913)**
* [**Canadian Hansard**](http://parl.canadiana.ca/) **(debates & journals of the Canadian Senate & House of Commons)**
* [**Australian Hansard**](http://historichansard.net/) **(Parliamentary debates, 1901–1980)**
* [**UK Hansard**](http://hansard.millbanksystems.com/) **(UK Parliamentary debates)**
* [**Open Islamicate Texts Initiative**](http://iti-corpus.github.io/) **(see also** [**repositories**](https://github.com/OpenITI)**; 10,000 premodern Islamicate texts)**
* [**Transkribus Corpus**](https://transkribus.eu/Transkribus/) **and** [**READ**](https://read.transkribus.eu/) **(efforts to use computer vision to recognize handwriting)**
* [**ToposText**](https://topostext.org/) **(557 classical texts linked with a gazetteer of the ancient world)**
* [**BYU Corpora**](https://corpus.byu.edu/) **(widely used corpora of American English)**
* [**Wright American Fiction**](http://webapp1.dlib.indiana.edu/TEIgeneral/welcome.do?brand=wright) **(American adult fiction, 1774–1900)**
* [**UCLA Broadcast NewsScape**](http://tvnews.library.ucla.edu/) **(170K hours of captioned news programs; see** [**Red Hen Lab**](https://sites.google.com/site/distributedlittleredhen/home/what-kind-of-red-hen-are-you/access-to-red-hen-tools-and-data) **for information on access)**
* [**Media History Digital Library**](http://mediahistoryproject.org/) **(nearly 2 million pages of media-related books and articles, 1875–1995)**
* [**Christian Classics Ethereal Library**](https://www.ccel.org/) **(classic Christian texts)**
* [**NYT Annotated Corpus**](https://catalog.ldc.upenn.edu/ldc2008t19) **(1.8 million NYT articles + NYT-supplied metadata)**
* [**Europeana Collections**](https://pro.europeana.eu/pages/data-collections/data/itemtype/newspapers) **(many datasets from European libraries & archives, from papyri to photographs to newspapers)**
* [**Foreign Relations of the United States (FRUS)**](https://uwdc.library.wisc.edu/collections/frus/) **(nearly complete run of the series; see** [**these tools**](https://github.com/thomasgpadilla/webscraping) **to obtain full text)**
* [**Internet Archive**](https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/) **(a huge collection of websites, texts, audio, and other media, available for bulk download via wget)**
* [**Twitter Datasets**](https://catalog.docnow.io/) **(a catalog of Twitter datasets that are publicly available on the web)**
* [**BitCurator**](https://bitcurator.net/bitcurator-nlp/) **(an effort to develop tools to analyze features of digital texts)**
* [**Movie Quotes Corpus**](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) **(“220,579 conversational exchanges between 10,292 pairs of movie characters”)**
* [**Europe PMC**](https://europepmc.org/downloads) **(repository of life sciences books, articles, and preprints)**
* [**Trove Australia**](https://trove.nla.gov.au/) **(565 million documents collected by the National Library of Australia, including a sizeable collection of newspapers)**
* [**BNC-Baby**](http://www.natcorp.ox.ac.uk/getting/index.xml) **(4 million-word subcorpus of the 100 million-word British National Corpus, with part-of-speech tagging in XML)**

## TEI-Encoded

* [**Women Writers Online**](http://www.wwp.northeastern.edu/wwo/)
* [**Eighteenth Century Collections Online**](https://bc-primo.hosted.exlibrisgroup.com/permalink/f/l6ucgu/ALMA-BC61418232430001021) **(BC library resource)**
* [**Documenting the American South**](http://docsouth.unc.edu/)
* **Resources from** [**Laura Nelson’s “Analyzing Complex Digitized Data”**](http://www.lauraknelson.com/p/teaching.html)
* [**Demonstration Corpora, by Alan Liu**](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora)
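
Several of the repositories above distribute texts as TEI/XML rather than plain text, so a text analysis project often needs to strip the markup first. The following is a minimal sketch of one way to do that in Python, not a method prescribed by any of these repositories. It assumes the file uses the TEI P5 namespace (`http://www.tei-c.org/ns/1.0`); older files may use no namespace, in which case the lookup would need adjusting. The function name `tei_body_text` and the `SAMPLE` document are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the document declares the TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_body_text(xml_string: str) -> str:
    """Return the text inside the TEI <body>, with markup removed
    and runs of whitespace collapsed to single spaces."""
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() yields every text node under <body> in document order,
    # which skips the <teiHeader> metadata entirely.
    raw = "".join(body.itertext())
    return " ".join(raw.split())


# Hypothetical, hand-written sample; not taken from any repository.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc>
  </teiHeader>
  <text><body>
    <p>Call me <name>Ishmael</name>.</p>
    <p>Some years ago.</p>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    print(tei_body_text(SAMPLE))
```

For a file on disk, the same function can be applied to the file's contents, for example `tei_body_text(open(path, encoding="utf-8").read())`. Real TEI files often contain notes, page breaks, and editorial apparatus inside `<body>`, so a project may want to filter specific elements before joining the text.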