# Text Repositories

The following resources provide text for text analysis projects.

* [**Internet Archive Books**](https://archive.org/details/internetarchivebooks) **(includes plain-text \[“full text”] access to books, issues of magazines, etc.)**
* [**Early English Books Online (EEBO)**](https://bc-primo.hosted.exlibrisgroup.com/permalink/f/l6ucgu/ALMA-BC61414345060001021) **(BC library resource)**
* [**Early Caribbean Digital Archive (ECDA)**](http://omekasites.northeastern.edu/ECDA/)
* [**Oxford Text Archive**](http://ota.ox.ac.uk/catalogue/index.html) **(large number of texts available in a variety of forms, including plain text; texts are accessed one at a time)**
* [**Project Gutenberg**](https://www.gutenberg.org/)
* [**HathiTrust**](https://www.hathitrust.org/) **(16 million volumes, mostly in English)**
* [**Chronicling America**](https://chroniclingamerica.loc.gov/) **(12.8 million pages of American newspapers)**
* [**DocSouth Data**](http://docsouth.unc.edu/docsouthdata/) **(narratives & literature from the American South)**
* [**Perseus Digital Library**](http://www.perseus.tufts.edu/hopper/) **(large collection of classical texts, much of it encoded in TEI/XML)**
* [**EEBO-TCP**](http://www.textcreationpartnership.org/tcp-eebo/) **(ca. 50,000 early English books, many encoded in TEI/XML)**
* [**Old Bailey Online**](https://www.oldbaileyonline.org/) **(197,745 London criminal trials, 1674-1913)**
* [**Canadian Hansard**](http://parl.canadiana.ca/) **(debates & journals of the Canadian Senate & House of Commons)**
* [**Australian Hansard**](http://historichansard.net/) **(Parliamentary debates, 1901-1980)**
* [**UK Hansard**](http://hansard.millbanksystems.com/) **(UK Parliamentary debates)**
* [**Open Islamicate Texts Initiative**](http://iti-corpus.github.io/) **(see also** [**repositories**](https://github.com/OpenITI)**; 10,000 premodern Islamicate texts)**
* [**Transkribus Corpus**](https://transkribus.eu/Transkribus/) **and** [**READ**](https://read.transkribus.eu/) **(efforts to use computer vision to recognize handwriting)**
* [**ToposText**](https://topostext.org/) **(557 classical texts linked with a gazetteer of the ancient world)**
* [**BYU Corpora**](https://corpus.byu.edu/) **(widely used corpora of American English)**
* [**Wright American Fiction**](http://webapp1.dlib.indiana.edu/TEIgeneral/welcome.do?brand=wright) **(American adult fiction, 1774–1900)**
* [**UCLA Broadcast NewsScape**](http://tvnews.library.ucla.edu/) **(170K hours of captioned news programs; see** [**Red Hen Lab**](https://sites.google.com/site/distributedlittleredhen/home/what-kind-of-red-hen-are-you/access-to-red-hen-tools-and-data) **for information on access)**
* [**Media History Digital Library**](http://mediahistoryproject.org/) **(nearly 2 million pages of media-related books and articles, 1875-1995)**
* [**Christian Classics Ethereal Library**](https://www.ccel.org/) **(classic Christian texts)**
* [**NYT Annotated Corpus**](https://catalog.ldc.upenn.edu/ldc2008t19) **(1.8 million NYT articles + NYT-supplied metadata)**
* [**Europeana Collections**](https://pro.europeana.eu/pages/data-collections/data/itemtype/newspapers) **(many datasets from European libraries & archives, from papyri to photographs to newspapers)**
* [**Foreign Relations of the US**](https://uwdc.library.wisc.edu/collections/frus/) **(nearly complete run of Foreign Relations of the United States; see** [**these tools**](https://github.com/thomasgpadilla/webscraping) **to obtain full text)**
* [**Internet Archive**](https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/) **(a huge collection of websites, texts, audio, and other media, available for bulk download via wget)**
* [**Twitter Datasets**](https://catalog.docnow.io/) **(a catalog of Twitter datasets that are publicly available on the web)**
* [**BitCurator**](https://bitcurator.net/bitcurator-nlp/) **(an effort to develop tools to analyze features of digital texts)**
* [**Movie Quotes Corpus**](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) **(“220,579 conversational exchanges between 10,292 pairs of movie characters”)**
* [**Europe PMC**](https://europepmc.org/downloads) **(repository of life sciences books, articles, and preprints)**
* [**Trove Australia**](https://trove.nla.gov.au/) **(565 million documents collected by the National Library of Australia, including a sizeable collection of newspapers)**
* [**BNC-Baby**](http://www.natcorp.ox.ac.uk/getting/index.xml) **(4 million-word subcorpus of the 100 million-word British National Corpus, with part-of-speech tagging in XML)**

## TEI-Encoded

* [**Women Writers Online**](http://www.wwp.northeastern.edu/wwo/)
* [**Eighteenth Century Collections Online**](https://bc-primo.hosted.exlibrisgroup.com/permalink/f/l6ucgu/ALMA-BC61418232430001021) **(BC library resource)**
* [**Documenting the American South**](http://docsouth.unc.edu/)
* **Resources from** [**Laura Nelson’s “Analyzing Complex Digitized Data”**](http://www.lauraknelson.com/p/teaching.html)
* [**Demonstration Corpora, by Alan Liu**](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora)
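
Most text analysis tools expect plain text, so TEI/XML files from repositories like these usually need their markup stripped first. The following is a minimal sketch of one way to do that with Python's standard library. It is not the method any particular repository prescribes. The sample document, the `tei_body_text` function name, and the assumption that the text of interest sits inside a single `<body>` element are all illustrative; real TEI files vary, and some older files do not declare the TEI namespace, which is why the sketch falls back to an un-namespaced lookup.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# A tiny made-up TEI document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Text</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First <hi rend="italic">marked-up</hi> paragraph.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""


def tei_body_text(xml_string):
    """Return the text inside <body> with tags removed and whitespace collapsed."""
    root = ET.fromstring(xml_string)
    # Look for a namespaced <body> first, then fall back to a plain <body>.
    body = root.find(f".//{TEI_NS}body")
    if body is None:
        body = root.find(".//body")
    if body is None:
        return ""
    # itertext() yields every text fragment in document order, ignoring tags.
    return " ".join("".join(body.itertext()).split())


plain_text = tei_body_text(SAMPLE_TEI)
print(plain_text)
```

This approach discards the header metadata and all inline markup, which is often acceptable for word counts or topic modeling but loses information such as italics, speaker labels, or editorial notes. If those matter for a project, the elements would need to be handled individually rather than flattened.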


