Preparing Literary Data

Preparing a literary text for analysis is a multi-step process. In brief, the steps are:

Download the text from your source repository
Remove extraneous material from the text
Transform the text so that you can answer your research questions

After completing these steps, it is possible to load the text into R for analysis. A more detailed walkthrough of each of these steps follows below:

Download the text from your source repository

There are many different sources where you can obtain texts online. This tutorial will describe how to work with two of them: the Greek and Latin texts that are downloadable from the Perseus open-source archive and the English language texts that are downloadable from Project Gutenberg.

Downloading Texts from Perseus

To download the Perseus texts, navigate to http://www.perseus.tufts.edu/hopper/opensource/download and then download the 'Greek and Roman Collection texts'. (The Perseus Greek and Roman text is also mirrored locally here.)

Once you download and expand the text archive, you will find a folder named 'Classics' that contains an individual folder for each author. In the author folders, you will find an 'Open Source' folder that contains all of the Perseus XML files for that author. These files are generally named using the following convention: AuthorName.AbbreviatedTitle_Language.xml
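If you later want to pick out files by author or language from within R, the naming convention can be taken apart with a regular expression. The following is only a sketch: the three file names and the language codes "gk" and "eng" are made-up examples of the AuthorName.AbbreviatedTitle_Language.xml pattern, and the names pattern, catalog, and greek.files are introduced here purely for illustration.

```r
# Hypothetical file names that follow the convention described above.
# With a real download you might collect names with something like
# list.files("Classics", pattern = "\\.xml$", recursive = TRUE).
files <- c("aesch.ag_gk.xml", "aesch.ag_eng.xml", "hdt.hist_gk.xml")

# Three capture groups: author, abbreviated title, language.
pattern <- "^([^.]+)\\.(.+)_([^_.]+)\\.xml$"
parts <- regmatches(files, regexec(pattern, files))

# Each element of parts holds the full match followed by the three groups.
catalog <- data.frame(
  file     = sapply(parts, `[`, 1),
  author   = sapply(parts, `[`, 2),
  title    = sapply(parts, `[`, 3),
  language = sapply(parts, `[`, 4),
  stringsAsFactors = FALSE
)

# Keep only the files whose language part is "gk" (assumed code for Greek).
greek.files <- catalog$file[catalog$language == "gk"]
print(greek.files)
```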

Downloading Text from Project Gutenberg

To download an English text from Project Gutenberg, navigate to http://www.gutenberg.org and then use the 'Search Book Catalog' link in the left-hand sidebar to search for the text that you want to use. For the purposes of this tutorial, type 'Frankenstein' in the box, and you will go to a list of books that have Frankenstein in the title or author fields. Click on the link for the text that you want to use and then look for the download link labeled 'Plain Text UTF-8'. The Project Gutenberg Frankenstein text is also mirrored locally here.

Clean the Text

Cleaning Gutenberg Texts

Both Perseus and Project Gutenberg texts contain information that you will not want to consider when analyzing them with R. Gutenberg texts have licensing information at both the beginning and ending of the texts. Perseus texts contain XML tags that encode named entities, geographic regions, cross references to other texts, etc. that are all used as part of the Perseus text display system. All of this information needs to be discarded in order to use these texts in R.

Although there are ways to accomplish these tasks in R, I generally prefer to do this preliminary cleaning in a plain text editor that supports regular expressions. On a Macintosh, I use the text editor 'TextWrangler', which can be downloaded for free at http://www.barebones.com/products/textwrangler/. Windows users can download the free Notepad++ at http://notepad-plus-plus.org.

Project Gutenberg texts contain both metadata and licensing information in the header and footer. You should open the file in your text editor and delete this information: generally this falls in the first 25 - 50 lines and the last 400 - 500 lines. The license information at the end generally begins with the text *** END OF THIS PROJECT GUTENBERG EBOOK, which you can locate using the 'Find' command in your text editor.
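For readers who would rather do this step in R, the same trimming can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes that the file also has a matching *** START OF marker line at the end of the header (only the END marker is described above), and it uses a few placeholder lines in place of a real Gutenberg file. The variable names are illustrative.

```r
# Placeholder lines standing in for a downloaded Gutenberg file.
# With a real file you would use: gutenberg.lines <- readLines(file.choose())
gutenberg.lines <- c(
  "(metadata line)",
  "*** START OF THIS PROJECT GUTENBERG EBOOK FRANKENSTEIN ***",
  "(first line of the novel)",
  "(second line of the novel)",
  "*** END OF THIS PROJECT GUTENBERG EBOOK FRANKENSTEIN ***",
  "(license line)"
)

# fixed = TRUE treats the asterisks literally instead of as regex operators.
start.line <- grep("*** START OF", gutenberg.lines, fixed = TRUE)[1]
end.line   <- grep("*** END OF", gutenberg.lines, fixed = TRUE)[1]

# Keep only what lies strictly between the two marker lines.
novel.lines <- gutenberg.lines[(start.line + 1):(end.line - 1)]
print(novel.lines)
```

It is worth checking the first and last few lines of the result by eye, since the exact wording of the marker lines can vary from file to file.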

Cleaning Perseus Texts

It is a more complex job to clean Perseus texts. In addition to removing the metadata at the beginning of the file, you also need to remove the XML tags that are used to mark-up other information such as geographic place names, named entities, cross-references to other texts, etc. The quickest way to do this is using regular expressions.

A regular expression is a method that can be used to match character patterns in a text rather than the text itself. As a very simple example, imagine that you wanted to use the 'Find Command' to locate the beginning of every chapter in the Gutenberg text of Frankenstein. In our source text, chapters begin with the word Chapter followed by a Roman numeral followed by a period. You would represent this as a regular expression by searching for the text:

Chapter [IVX]+\.

This regular expression is broken down as follows:

The text Chapter functions like a normal find command and looks for the word 'Chapter' followed by a space in the text.
The next section - [IVX] - asks the find command to look for any one of the characters I, V, or X.
The next character - + - asks the find command to look for any of the characters I, V, or X one or more times.
The next piece of the regular expression - \. - asks the find command to locate a single period character. In a regular expression, a period is reserved to mean any single character. The backslash indicates that we are literally looking for a period rather than any single character.
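The same pattern can be tried out in R with the grepl function. This is a small sketch on made-up sample strings (the vector headings is illustrative); note that inside an R string the backslash itself has to be doubled, so \. is written as \\.

```r
# Made-up sample strings: three chapter headings and two non-headings.
headings <- c("Chapter I.", "Chapter IV.", "Chapter XXIV.",
              "In this chapter, I went home.", "Chapter one")

# TRUE wherever "Chapter", a space, one or more of I/V/X, and a period occur.
is.heading <- grepl("Chapter [IVX]+\\.", headings)
print(headings[is.heading])
```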

This is a relatively simple task that could have been accomplished by searching for the word Chapter or the substring Chap. However, the power of regular expressions becomes clear if you consider the problem posed by the letters from Robert Walton to his sister Margaret at the beginning of Frankenstein. For the sake of this example, I will consider each of these letters as a 'chapter'. Using the simpler search string Chap would skip over these letters and take you directly to Chapter I. Our regular expression can be modified, however, with a boolean operator | that tells the regular expression to look for either one pattern or another. Because | applies to everything on either side of it, the two alternatives need to be grouped in parentheses: the modified regular expression (Chapter|Letter) [IVX]+\. will properly find the beginning of both the chapters and the letters in the text of Frankenstein.
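One detail worth checking when combining patterns with | is how far the alternation reaches. The sketch below, on made-up sample strings, compares an ungrouped pattern with one that wraps the two alternatives in parentheses; the variable names are illustrative.

```r
# Made-up sample strings: two headings and one ordinary phrase.
samples <- c("Letter I.", "Chapter II.", "Chapter and verse")

# Ungrouped: reads as "Chapter" OR "Letter [IVX]+\.", so the bare word
# Chapter matches anywhere, even without a numeral.
ungrouped <- grepl("Chapter|Letter [IVX]+\\.", samples)

# Grouped: either word must be followed by a space, a numeral, and a period.
grouped <- grepl("(Chapter|Letter) [IVX]+\\.", samples)

print(rbind(ungrouped, grouped))
```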

On their own, regular expressions can be a powerful tool for exploring a text. A more detailed overview of regular expression syntax is available at http://www.w3schools.com/jsref/jsref_obj_regexp.asp. For our purposes, they will be useful to help us clean the XML from a Perseus Greek text.

Perseus XML texts are marked up according to the standards of the Text Encoding Initiative. The earliest texts in Perseus predate XML and the web; they were originally marked up in SGML - a precursor of XML - and still encode some of the original text breaks that were used in the HyperCard versions of Perseus in the 1980s. (See http://www.perseus.tufts.edu/hopper/help/archived/P2/ch3.html for a reminder of just how far things have come!) Although these elements may be useful for breaking the text down into smaller chunks, they can generally be ignored when processing the texts with R.

As with the Gutenberg texts, the first 50 - 100 lines of every Perseus text contain a header with metadata and encoding information about the text. Everything down to and including the tag can simply be deleted. The XML tags that are embedded within the text itself can be removed with a regular expression. XML tags are enclosed in angle brackets - < > - and both the brackets and anything inside them should be removed in order to analyze the text using R.

This can be accomplished by using the regular expression <[^>]+> to replace everything in angle brackets with a space. (The simpler pattern <.+> is greedy: on a line that contains more than one tag, it matches from the first < to the last > and deletes the text between the tags as well.) This transforms the first lines of a text such as Herodotus' History from XML into the following:

*(hrodo/tou *(alikarnhsse/os i(stori/hs a)po/decis h(/de, w(s mh/te ta\ geno/mena e)c a)nqrw/pwn tw=| xro/nw| e)ci/thla ge/nhtai, mh/te e)/rga mega/la te kai\ qwmasta/, ta\ me\n *(/ellhsi ta\ de\ barba/roisi a)podexqe/nta, a)klea= ge/nhtai, ta/ te a)/lla kai\ di' h(\n ai)ti/hn e)pole/mhsan a)llh/loisi. *perse/wn me/n nun oi( lo/gioi *foi/nikas ai)ti/ous fasi\ gene/sqai th=s diaforh=s. tou/tous ga\r a)po\ th=s *)eruqrh=s kaleome/nhs qala/sshs a)pikome/nous e)pi\ th/nde th\n qa/lassan, kai\ oi)kh/santas tou=ton to\n xw=ron to\n kai\ nu=n oi)ke/ousi, au)ti/ka nautili/h|si makrh=|si e)piqe/sqai, a)pagine/ontas de\ forti/a *ai)gu/ptia/ te kai\ *)assu/ria th=| te a)/llh| e)sapikne/esqai kai\ dh\ kai\ e)s *)/argos.
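The tag-stripping step can also be sketched in R with gsub. When several tags share a line, the greediness of the pattern matters, so the sketch compares <.+> with <[^>]+> on a made-up line of markup (the line and the tag names in it are illustrative, not taken from a Perseus file).

```r
# A made-up line of markup with several tags on one line.
xml.line <- "<p><placeName>Athens</placeName> and <placeName>Sparta</placeName></p>"

# Greedy: <.+> runs from the first < to the last >, swallowing the words.
greedy <- gsub("<.+>", " ", xml.line)

# Tag by tag: [^>]+ cannot cross a closing bracket, so the words survive.
stripped <- gsub("<[^>]+>", " ", xml.line)

# Collapse the runs of spaces left behind and trim both ends.
clean <- trimws(gsub("\\s+", " ", stripped))
print(clean)
```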

Lemmatizing A Text

The question of whether to lemmatize a text depends very much on the research questions that you want to ask. For some applications, it might be more appropriate to consider a word such as 'creature' in Frankenstein as something different from the word 'creatures' or 'sleep' as something different from 'slept', but in other applications such as a word frequency list, it might be preferable to consider the lexical forms (lemmas) rather than the forms that appear in the text.
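The difference shows up directly in a frequency count. The following sketch uses a made-up list of five word forms and a hand-written lookup table (lemma.lookup is hypothetical, standing in for the output of a real lemmatizer) to compare counts of surface forms with counts of lemmas.

```r
# Made-up word forms as they might appear in a text.
forms <- c("creature", "creatures", "sleep", "slept", "creature")

# Hypothetical hand-written lookup from surface form to lemma.
lemma.lookup <- c(creature = "creature", creatures = "creature",
                  sleep = "sleep", slept = "sleep")

# Replace each form with its lemma.
lemmas <- unname(lemma.lookup[forms])

# Four distinct forms, but only two distinct lemmas.
form.counts  <- table(forms)
lemma.counts <- table(lemmas)
print(form.counts)
print(lemma.counts)
```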

Lemmatizing English Language Texts from Project Gutenberg

For English language texts, there are several tools that can be used to generate a lemmatized text, such as the Stanford CoreNLP software (available from http://nlp.stanford.edu/software/corenlp.shtml) or the Natural Language Toolkit (NLTK) in Python. One tool that can create a lemmatized text without requiring knowledge of a language such as Python or Perl is Morphadorner, available from http://morphadorner.northwestern.edu/.

After downloading and installing Morphadorner, you can use the command adornplaintext to transform the first sentence of the first chapter of Frankenstein from: I am by birth a Genevese; and my family is one of the most distinguished of that republic. into the following:

I	I	pns11	I	i	0
am	am	vbm	am	be	0
by	by	p-acp	by	by	0
birth	birth	n1	birth	birth	0
a	a	dt	a	a	0
Genevese	Genevese	np1	Genevese	Genevese	0
;	;	;	;	;	0
and	and	cc	and	and	0
my	my	po11	my	my	0
family	family	n1	family	family	0
is	is	vbz	is	be	0
one	one	crd	one	one	0
of	of	pp-f	of	of	0
the	the	dt	the	the	0
most	most	av-ds	most	most	0
distinguished	distinguished	vvn	distinguished	distinguish	0
of	of	pp-f	of	of	0
that	that	d	that	that	0
republic	republic	n1	republic	republic	0
.	.	.	.	.	1

Morphadorner is designed to work on 18th and 19th century fiction and much of the output is intended to be used to normalize spelling variants found in those texts. This output also contains a part of speech tag for more advanced linguistic analysis. For the purposes of creating a lemmatized text, we only need to be concerned with the fifth column that contains the lemma from which the lexical form in the text is derived. This output can then be processed using R. The text of the first chapter of Frankenstein that has been processed with Morphadorner can be downloaded here.
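Once the adorned output is in a data frame, pulling out the lemmas is a matter of selecting one column. The sketch below builds a small two-column data frame by hand from a few of the rows shown above; the column names token and lemma are my own labels, and in a full six-column data frame the lemma column would be the fifth one (for example adorned[[5]]).

```r
# A few rows from the adorned sentence above, reduced to two columns.
adorned <- data.frame(
  token = c("family", "is", "one", ";", "distinguished"),
  lemma = c("family", "be", "one", ";", "distinguish"),
  stringsAsFactors = FALSE
)

# Keep only rows whose lemma contains a letter, dropping punctuation rows.
is.word <- grepl("[[:alpha:]]", adorned$lemma)
lemmas <- tolower(adorned$lemma[is.word])

# Join the lemmas back into a single lemmatized string.
lemmatized <- paste(lemmas, collapse = " ")
print(lemmatized)
```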

Lemmatizing Perseus Texts

Ancient Greek texts can also be lemmatized if this is appropriate for the research questions that you want to ask. The primary tool for lemmatizing Greek texts is called morpheus and it was written as part of the original instantiation of the Perseus project with ongoing refinements since that time.

Although the actual source code for morpheus cannot be downloaded, there are several web services that provide access to this data. These include a morphological services API that is described at http://sites.tufts.edu/perseusupdates/2012/11/01/morphology-service-beta/, a web based morphological service that is available at http://archimedes.mpiwg-berlin.mpg.de/arch/doc/xml-rpc.html, and a series of Python scripts written by Timothy Mallon available at https://github.com/tmallon/morpheus2. In addition, a SQL data file containing the parses entitled hib_parses.sql is available from the Perseus open source download page at http://www.perseus.tufts.edu/hopper/opensource/download.

The simplest way to access the Perseus morphological analysis engine is via the Perseus XML morphological service that is available at http://www.perseus.tufts.edu/hopper/xmlmorph. A request can be sent to this server with arguments for the language and the word to be looked up. The word should be in beta-code without any accents or breathing marks. The service returns an XML document that contains the morphological analysis for that word.
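A request URL of this kind can be assembled in R by pasting the two arguments onto the service address. This is only a sketch: morph.url is a hypothetical helper name, the parameter names lang and lookup are the ones that appear in the example query in this section, and the sketch builds the URL without sending it.

```r
# Hypothetical helper that builds a query URL for the xmlmorph service.
morph.url <- function(word, lang = "greek") {
  base <- "http://www.perseus.tufts.edu/hopper/xmlmorph"
  # URLencode escapes any characters that are not safe in a URL.
  paste0(base, "?lang=", lang, "&lookup=", URLencode(word, reserved = TRUE))
}

query <- morph.url("luw")
print(query)
```

The resulting string could then be passed to a function such as readLines(query) to fetch the response, assuming the service is reachable from your machine.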

For example, the query http://www.perseus.tufts.edu/hopper/xmlmorph?lang=greek&lookup=luw returns the following result.

<analyses>
<analysis>
<form lang="greek">λύω</form>
<lemma>λύω</lemma>
<expandedForm>λύω</expandedForm>
<pos>verb</pos>
<person>1st</person>
<number>sg</number>
<tense>pres</tense>
<mood>subj</mood>
<voice>act</voice>
<dialect>epic</dialect>
<feature/>
</analysis>
<analysis>
<form lang="greek">λύω</form>
<lemma>λύω</lemma>
<expandedForm>λύω</expandedForm>
<pos>verb</pos>
<person>1st</person>
<number>sg</number>
<tense>pres</tense>
<mood>ind</mood>
<voice>act</voice>
<dialect>epic</dialect>
<feature/>
</analysis>
<analysis>
<form lang="greek">λύω</form>
<lemma>λυάω</lemma>
<expandedForm>λύω</expandedForm>
<pos>verb</pos>
<person>2nd</person>
<number>sg</number>
<tense>pres</tense>
<mood>imperat</mood>
<voice>mp</voice>
<dialect/>
<feature>contr</feature>
</analysis>
<analysis>
<form lang="greek">λύω</form>
<lemma>λυάω</lemma>
<expandedForm>λύω</expandedForm>
<pos>verb</pos>
<person>2nd</person>
<number>sg</number>
<tense>imperf</tense>
<mood>ind</mood>
<voice>mp</voice>
<dialect>homeric ionic</dialect>
<feature>contr unaugmented</feature>
</analysis>
<analysis>
<form lang="greek">λύω</form>
<lemma>λυάω</lemma>
<expandedForm>λύω</expandedForm>
<pos>verb</pos>
<person>1st</person>
<number>sg</number>
<tense>pres</tense>
<mood>subj</mood>
<voice>act</voice>
<dialect>attic epic doric ionic</dialect>
<feature>contr</feature>
</analysis>
<analysis>
<form lang="greek">λύω</form>
<lemma>λυάω</lemma>
<expandedForm>λύω</expandedForm>
<pos>verb</pos>
<person>1st</person>
<number>sg</number>
<tense>pres</tense>
<mood>ind</mood>
<voice>act</voice>
<dialect>attic epic ionic</dialect>
<feature>contr</feature>
</analysis>
</analyses>

This XML can be parsed for further analysis and the service can be called for every word in an entire text to generate a lemmatized version of the text. A very simple script (with much room for improvement) that accomplishes this is available here and you can download a lemmatized version of the first book of the Perseus text of Herodotus here.
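As a rough illustration of the parsing step, the lemma elements can be pulled out of a response with base R regular expressions. This is a sketch under assumptions, not the script mentioned above: the response string is a shortened copy of three of the analyses shown earlier, and a dedicated XML parser would be a more robust choice for real work.

```r
# A shortened response: three analyses, keeping only form and lemma.
response <- paste0(
  "<analyses>",
  "<analysis><form lang=\"greek\">λύω</form><lemma>λύω</lemma></analysis>",
  "<analysis><form lang=\"greek\">λύω</form><lemma>λύω</lemma></analysis>",
  "<analysis><form lang=\"greek\">λύω</form><lemma>λυάω</lemma></analysis>",
  "</analyses>"
)

# Find every <lemma>...</lemma> element, then strip the tags themselves.
elements <- regmatches(response, gregexpr("<lemma>[^<]+</lemma>", response))[[1]]
lemmas <- gsub("</?lemma>", "", elements)

# One word form can have several candidate lemmas, so list each only once.
candidate.lemmas <- unique(lemmas)
print(candidate.lemmas)
```

Note that the service returns every possible analysis, so choosing a single lemma for an ambiguous form is a separate decision that this sketch does not make.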

Load This Data Into R

The adorned text of the first chapter of Frankenstein can be loaded into R in the same way that we imported the data about the composition dates and lengths of Greek Tragedy on the Preparing and Importing Data page. Since morphadorner outputs tab delimited text, we use the command frank.ch1 <- read.table(file.choose(), sep = "\t", header = TRUE, quote="") to import the adorned Frankenstein file that we created above.
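One thing to check before importing is whether your adorned file begins with a row of column names. The sketch below assumes a file that, like the sample output shown earlier, starts directly with the first token; it writes three of those rows to a temporary file and compares the two settings of header. The column names are my own labels and are not produced by Morphadorner.

```r
# Write three adorned rows (no header row) to a temporary file.
adorned.file <- tempfile(fileext = ".txt")
writeLines(c("I\tI\tpns11\tI\ti\t0",
             "am\tam\tvbm\tam\tbe\t0",
             "by\tby\tp-acp\tby\tby\t0"), adorned.file)

# header = TRUE treats the first token row as column names, so it is lost.
with.header <- read.table(adorned.file, sep = "\t", header = TRUE, quote = "")

# header = FALSE keeps every row; col.names supplies illustrative labels.
no.header <- read.table(adorned.file, sep = "\t", header = FALSE, quote = "",
                        stringsAsFactors = FALSE,
                        col.names = c("token", "spelling", "pos",
                                      "standard", "lemma", "eos"))

print(nrow(with.header))
print(nrow(no.header))
```

If the file you downloaded does have a header row, header = TRUE is the right choice and col.names is unnecessary.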

Our lemmatized Herodotus text is in a different format, with every line of the original XML text as one line in our text file. Loading this type of file for analysis in R requires a few more steps. (NB: You would also use this technique to work with unlemmatized files from Project Gutenberg. The technique is described in detail in Chapter 2 of Jockers's Text Analysis with R for Students of Literature and also in a very helpful blog post at A Simple Frequency List Using R.)

The first step is to read the file into a variable. The command is similar to the one we have been using to read tabular data; instead of read.table you use the command scan. The command hdt.lines <- scan(file.choose(), what="char", sep="\n") will bring up a dialogue box that will allow you to select the lemmatized Herodotus file and then read every line into a variable named hdt.lines. This variable is a vector that can be accessed using the same methods that were discussed on the R Environment page. Typing hdt.lines[1] will display the first lemmatized line of Herodotus' history.
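Because file.choose() opens an interactive dialogue, the following sketch substitutes a temporary file containing two placeholder lines so that the behavior of scan can be seen without the real Herodotus file. The file contents are made up for illustration.

```r
# Write two placeholder lines to a temporary file.
hdt.file <- tempfile(fileext = ".txt")
writeLines(c("first line of lemmas", "second line of lemmas"), hdt.file)

# sep = "\n" makes each line of the file one element of the vector.
hdt.lines <- scan(hdt.file, what = "character", sep = "\n")

# The result is an ordinary character vector that can be indexed.
print(length(hdt.lines))
print(hdt.lines[1])
```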

We need to split this variable into individual words in order to actually process it with R. This is done using two commands. First, hdt.words <- strsplit(hdt.lines, "\\W"). This command uses the regular expression \\W, which matches any non-word character, to split each line into individual words. Where two non-word characters occur in a row (a comma followed by a space, for example), the result will contain empty strings, which can be filtered out afterwards. The strsplit command generates a data-type known as a list that is more complicated than we need for our immediate purposes, so the next step is to transform this data-type into a vector with the command hdt.words <- unlist(hdt.words). The variable hdt.words can now be accessed as a vector so that the command hdt.words[1:10] will show the first ten lemmas in book 1 of Herodotus.
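The two commands can be tried on a pair of made-up English lines, which also shows what happens when punctuation is followed by a space. The sample lines and the final filtering step are additions for illustration; the sketch assumes words made up of plain letters.

```r
# Two made-up lines standing in for hdt.lines.
sample.lines <- c("the quick, brown fox", "jump over; the dog")

# Split at every non-word character; the result is a list, one item per line.
sample.words <- strsplit(sample.lines, "\\W")

# Flatten the list into a single character vector.
sample.words <- unlist(sample.words)

# A comma or semicolon followed by a space leaves an empty string behind.
print(sample.words)

# Drop the empty strings before counting or indexing words.
sample.words <- sample.words[sample.words != ""]
print(sample.words)
```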


Statistical Methods for Studying Literature Using R

Jeff Rydberg-Cox, The University of Missouri-Kansas City