Full-Text Corpus

This page contains the Nickels and Dimes text corpus, provided for use by anyone interested in text mining our dime novel collections. The corpus will be updated after each phase of the project concludes.

The data set consists of a separate text file for each dime novel, accompanied by metadata in a CSV file. The text files and metadata are keyed to one another using the repository identifier. Click here to download the zip archive (135 MB).
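As a starting point, the pairing of text files and metadata might be loaded along the following lines. This is a minimal sketch, not a documented interface: the identifier column name ("id"), the "<identifier>.txt" file naming, and the paths passed in are all assumptions that should be checked against the contents of the downloaded archive.

```python
import csv
from pathlib import Path


def load_corpus(metadata_csv, text_dir, id_column="id"):
    """Pair each metadata row with the text of the matching novel.

    Assumptions (not confirmed by the corpus documentation): the identifier
    column is named `id_column`, and each text file is named
    "<identifier>.txt" inside `text_dir`.
    """
    text_dir = Path(text_dir)
    corpus = {}
    with open(metadata_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            identifier = row[id_column]
            text_path = text_dir / f"{identifier}.txt"
            if not text_path.exists():
                # Skip metadata rows that have no matching text file.
                continue
            # OCR output may contain stray bytes, so replace undecodable ones.
            text = text_path.read_text(encoding="utf-8", errors="replace")
            corpus[identifier] = {"metadata": row, "text": text}
    return corpus
```

Keying the result by identifier keeps each novel's metadata and text together, so later filtering (for example, by word count) can be done on the metadata before any text is processed.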

Total novels: 1,609

Total pages: 78,390

Total words*: 13,875,024

Please note that this is uncorrected text, computer-generated using Optical Character Recognition (OCR). The quality of the OCR can vary significantly from title to title, depending on the condition of the item, typographical features of the novel (e.g., font style and size), and the scan itself. When possible, we have tried to improve the OCR quality by manipulating the page images, but some pages still contain few recognizable words. To help researchers make more informed decisions about which texts to use, a word count and a page count are provided with each dime novel. The word count is calculated after removing punctuation, stop words (a, and, the, etc.), and any words not found in the Unix word list. Note that this filtering applies only to the word count; the corpus itself is still raw and unprocessed.
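The word-count filter described above can be approximated as follows. This is a rough sketch under stated assumptions, not the project's actual script: the stop list here is a small illustrative sample, the word list path (/usr/share/dict/words) is a common but not universal location, and lowercasing and whitespace tokenization are choices made for this example.

```python
import string

# A tiny illustrative stop list (assumption); the list used for the
# published word counts is not specified on this page.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}

# Map every ASCII punctuation character to a space so that "dog,and"
# splits into two tokens. Typographic quotes are not covered here.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def load_word_list(path="/usr/share/dict/words"):
    """Read a one-word-per-line file into a lowercase set (path is an assumption)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def count_recognized_words(text, dictionary, stop_words=STOP_WORDS):
    """Count tokens that are neither stop words nor absent from `dictionary`."""
    tokens = text.translate(PUNCTUATION_TO_SPACE).lower().split()
    return sum(
        1 for token in tokens
        if token not in stop_words and token in dictionary
    )
```

Comparing this count against a file's raw token count gives a rough indication of how much of a given novel's OCR is usable.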

Beyond the sometimes-poor OCR quality, keep in mind that many dime novels are actually periodicals. A single issue of a series may contain multiple stories by different authors, some of them serialized. Issues may also contain advertisements and non-fiction, such as news articles or advice columns. Often these features take up only a few pages at the end of an issue, but in some cases an issue contains two complete dime novels.

If there are additional formats that would be useful for your research or if you have suggestions for how we can improve the quality of the corpus, please contact us.

*Word count excludes stop words and noise

What is text mining?

Text mining is a method of textual analysis applied to a large body of text. It generally involves processing raw text and then applying statistical methods to identify patterns, the analysis of which may yield new knowledge. Methods of analysis include topic modeling, document classification, and named entity extraction, among others. Literary scholars, historians, and social scientists in the digital humanities have used these methods to examine text at a distance. To learn more about text mining, see Ted Underwood's Where to start with text mining.

Using the dime novel corpus and these methods, you might:

  • Analyze an author’s style to identify the person responsible for writing a novel that has been attributed to a pseudonym or that has been incorrectly attributed.
  • Study how words or phrases used to describe Chinese immigrants change over time.
  • Train a computer to recognize genres, like “Western stories” or “Detective and mystery stories,” or to extract proper names to aid in cataloging the collection.
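
As a small illustration of the second idea, the sketch below computes how often a single word appears per 1,000 tokens in each year. It assumes you have already paired each text with a publication year (whether and how the metadata records dates is not described on this page), and it uses plain whitespace tokenization, so it handles single words only and does not strip punctuation.

```python
from collections import defaultdict


def term_rate_by_year(documents, term):
    """Return {year: occurrences of `term` per 1,000 tokens}.

    `documents` is an iterable of (year, text) pairs; how the year is
    obtained for each novel is left to the caller (assumption).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    term = term.lower()
    for year, text in documents:
        tokens = text.lower().split()
        hits[year] += tokens.count(term)
        totals[year] += len(tokens)
    # Years with no tokens at all are omitted to avoid dividing by zero.
    return {
        year: 1000 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]
    }
```

Normalizing by the number of tokens per year matters here because the number of novels, and the quality of their OCR, may differ considerably from one year to the next.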