Step 3: Important Considerations

When conducting a text analysis, it is important to keep in mind that:

1.) Word meaning changes over time.

Although this may seem obvious, it is easy to forget that a word's meaning can shift over time. A source such as the Oxford English Dictionary can be used to look up what a word meant at a particular time.

2.) The word context is key.

In many, if not most, text analysis projects, word context is crucial to the analysis. An exception is when, for example, one is interested only in the number of times a word appears and not in the way the word is used.
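The difference between counting a word and looking at its context can be sketched in a few lines of Python. This is only an illustration, not a recommended toolchain: the sample sentences are invented, and the helper names (tokenize, count_word, keyword_in_context) are hypothetical ones chosen for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_word(text, word):
    """Count how often a word appears, ignoring how it is used."""
    return Counter(tokenize(text))[word.lower()]

def keyword_in_context(text, word, window=3):
    """Return each occurrence of a word with `window` tokens on either side."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == word.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# Invented sample text: the same word used in two different senses.
sample = (
    "She sat on the bank of the river. "
    "He walked to the bank to deposit a cheque."
)

print(count_word(sample, "bank"))
for left, keyword, right in keyword_in_context(sample, "bank"):
    print(f"{left} [{keyword}] {right}")
```

The count treats both occurrences of "bank" as the same thing, while the context lines make it clear that one refers to a riverbank and the other to a financial institution.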

3.) There may be issues of omission in the corpus.

When exploring or creating a corpus, keep in mind that it may have gaps. People of color, women, and other marginalized groups have been published less throughout history, so a massive corpus such as Google Books or HathiTrust will skew white and male. (Omissions can also follow language, geography, time period, and so on.) It is equally important to consider what gets digitized: there can be (and no doubt is) bias in the decisions that drive the selection and funding of what ends up online.

4.) There can be quality issues with the corpus.

Texts used in text analysis often come from books and documents that have been OCR'd. OCR (optical character recognition) converts images of text into digital, machine-readable text. Factors such as poor image quality and scanning mistakes can degrade the OCR output and introduce errors into the text.

Below are two examples of how OCR errors can occur. The one on the left is from a first edition of the 18th-century novel The Life and Opinions of Tristram Shandy, Gentleman. Books from this period contain characters such as the long s ( ſ ) and often show ink bleed-through and foxing (the little spots that come with age), all of which can affect OCR. (These kinds of issues were much more of a factor before advancements in OCR technology.) The example on the right shows a scanning mistake made when the book was moved during the process. (Even with the advancements in technology, OCR issues are unavoidable in this case.) When working with a large or massive corpus, these kinds of errors might be inconsequential as long as there are few enough of them. With smaller corpora, such errors can have a greater impact and skew text analysis results.
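As a rough illustration of how such errors can distort counts in a small corpus, the Python sketch below uses an invented three-sentence "OCR output". It assumes that the long s is sometimes preserved as ſ and sometimes misread as f; the function names are hypothetical and chosen only for this example.

```python
import re
from collections import Counter

LONG_S = "\u017f"  # the long s character (ſ)

def normalize_long_s(text):
    """Replace every long s with a modern s."""
    return text.replace(LONG_S, "s")

def word_counts(text):
    """Count lowercase word tokens made up of letters only."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))

# Invented sample: the same word appears three times, once with the
# long s preserved, once misread as "f", and once in modern spelling.
ocr_text = "The mi\u017ftre\u017fs spoke. The miftrefs wept. The mistress left."

raw = word_counts(ocr_text)
cleaned = word_counts(normalize_long_s(ocr_text))

print(raw["mistress"])      # occurrences found before any cleanup
print(cleaned["mistress"])  # occurrences found after normalizing the long s
print(cleaned["miftrefs"])  # the misread form is still counted separately
```

Normalizing the long s recovers one of the missed occurrences, but the form misread as "f" remains a separate "word". Blindly replacing every f with s would corrupt genuine words, so errors of this kind usually call for a dictionary check or manual correction.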
