For the past three years, the Cultural Heritage Language Technologies consortium [1], situated at eight institutions in four countries [2], has received funding from the National Science Foundation and the European Commission International Digital Libraries program to research the most effective ways to apply technologies and techniques from computational linguistics, natural language processing, and information retrieval to the challenges faced by students and scholars working with texts written in Greek, Latin, and Old Norse [3]. In the broadest terms, our work has focused on four primary areas: 1) providing access to primary source materials that are often rare and fragile, 2) helping readers understand texts written in difficult languages, 3) enabling researchers to conduct new types of scholarship, and 4) preserving digital resources for the future. Our research has produced concrete results in each of these areas; over the past three years, we have:
1. Provided Access to Rare and Fragile Source Materials
2. Helped Readers Understand Texts Written in Difficult Languages
3. Enabled Scholars to Conduct New Types of Scholarship
4. Worked to Preserve Digital Resources for the Future
As we began this project, many of our partners had texts already available to us, and we also created some new digital corpora. The existing texts included a six-million-word corpus of classical Greek and a four-million-word corpus of classical Latin from the Perseus project, selected works of Isaac Newton from the Newton Project at Imperial College, and 16th-century Latin texts in the history of science provided by the Archimedes Project, a Deutsche Forschungsgemeinschaft / National Science Foundation International Digital Libraries project [4] for texts related to the history of mechanics based at Harvard University and the Max Planck Institute for the History of Science in Berlin.

We also developed new corpora under the aegis of this project. We worked with the History of Science and Special Collections department at the Linda Hall Library of Science, Technology, and Engineering in Kansas City to create a second corpus of early printed scientific works thematically focused on early geology and earth science. The Arnamagnaean Institute has created diplomatic transcriptions and page images of the Old Norse sagas from its collections, while the UCLA group has gathered a corpus of some 20 saga texts. Finally, the Stoa Consortium at the University of Kentucky has developed a corpus of Early-Modern Latin colloquia, collections of sayings and dialogues designed to help students learn to speak Latin. The impetus for this corpus arose from the Stoa's partnership with the Institute for Latin Studies, a scholarly institute that teaches spoken Latin and conducts summer seminars in conversational Latin. This work began with a collection of colloquia by Erasmus. As we investigated Erasmus' work, however, it became apparent that colloquia by some twenty other authors existed that had not been widely published. The University of Kentucky group therefore began to create a digital corpus of these works, including colloquia by Corderius and Duncanus, Juan-Luis Vives' Exercitatio Linguae Latinae, Petrus Mosellanus' Paedologia, Laurentius Corvinus' Latinum Ydeoma, Nicolaus Beraldus' Dialogus (de ratione dicendi), and Jacobus Fontanus' Hortulus puerorum. In addition to these texts, we have located colloquia by some 16 other authors that we hope to digitize after the completion of the CHLT consortium's work, forming a coherent corpus of Early-Modern Latin that can be used with our Latin parser to help people learn both to read and to speak Latin.
Natural language processing and multi-lingual information retrieval are mature research areas that focus largely on commercial and national security applications. There are established fora and standards for evaluating work in these areas, including TREC (Text Retrieval Conference) [5], TIDES (Translingual Information Detection, Extraction, and Summarization) [6], and CLEF (Cross-Language Evaluation Forum) [7]. Our research has focused on understanding which of these technologies are language independent, applying them to our research areas, and optimizing them both for the needs of our user communities and for the highly-inflected cultural heritage languages that we are studying. In our work over the past three years, we have achieved many of our goals, particularly in the creation of parsers, reading environments, and multi-lingual search and visualization facilities, and we have laid the foundations for substantial new works of scholarship dealing with ancient languages.
The most important baseline technology we have developed is a suite of morphological analysis tools. These parsers have been essential both for computational analysis and for delivering the results of our work to a large audience. From the point of view of computational analysis, the key challenge we face arises from the fact that our languages are highly inflected, with complex morphologies and relatively free word order. Addressing these difficulties requires modifications to the approaches used for English and other modern languages, but it also provides a basis for determining which parts of these algorithms are language-specific and which are not.

First, Ancient Greek and Latin are both highly inflected languages, and their inflected forms often share few (or no) common surface features with their lexical forms. For example, the Greek noun system has five grammatical cases with singular and plural forms and two dual inflections, each with a distinct morphological form. The Greek verb system allows for even more variation; a single Greek verb can appear in thousands of different morphological forms. Second, Greek makes much more extensive use of definite articles than English. In fact, the most frequent word in any Greek text will almost always be the definite article and, according to Zipf's law, it will usually occur twice as often as the next most frequent word in the corpus. This problem is compounded by the prevalence of particles in Greek texts. Greek particles are short words that provide emphasis, mark the tone of a sentence, or indicate the relationship of a sentence to the preceding and following sentences. These particles are also among the most common words in any corpus of ancient Greek. Finally, Greek and Latin word order is far less structured than English word order. Because both languages are highly inflected, grammatical relationships such as subject, direct object, and indirect object, as well as the agreement between adjectives and nouns, are expressed by the gender, number, and case endings of nouns and adjectives rather than by their position in a sentence. This problem is further compounded because Greek and Latin authors consciously varied their word order for rhetorical and stylistic effect. These three problems present both a challenge and an opportunity for our work, because we can develop new techniques to analyze language more effectively while also understanding what modifications existing algorithms require to gain language independence. Over the course of this project, we have found that simple stemming techniques such as Porter's algorithm are not precise enough; in highly inflected languages, lexical normalization is required in order to have enough data to obtain statistically significant results.
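To make the contrast concrete, the following minimal sketch compares naive suffix stripping with dictionary-based lexical normalization; the suffix list and the tiny Latin lookup table are illustrative assumptions, not data from any of the parsers described below.

# A minimal sketch contrasting suffix stripping with dictionary-based
# lexical normalization for an inflected language. The tiny Latin
# lookup table below is illustrative only, not actual parser data.

# Naive Porter-style suffix stripping: trims common endings blindly.
SUFFIXES = ["ibus", "orum", "arum", "um", "us", "is", "ae", "am", "as", "a", "i", "o"]

def strip_suffix(form: str) -> str:
    for s in SUFFIXES:
        if form.endswith(s) and len(form) > len(s) + 2:
            return form[: -len(s)]
    return form

# Lexical normalization: map each attested inflected form to its lemma.
LEMMA_TABLE = {
    "rex": "rex", "regis": "rex", "regi": "rex", "regem": "rex", "regibus": "rex",
    "fero": "fero", "fers": "fero", "tulit": "fero", "latum": "fero",  # suppletive stems
}

def lemmatize(form: str) -> str:
    return LEMMA_TABLE.get(form, form)

for w in ["regibus", "tulit", "latum"]:
    print(w, "->", strip_suffix(w), "vs", lemmatize(w))

Suffix stripping reduces "regibus" to the stem "reg" but cannot connect the suppletive forms "tulit" and "latum" to the lemma "fero"; counts computed over stems therefore fragment the evidence for a single lexeme, which is why lexical normalization matters for statistical significance.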
In addition to being required for the computational analysis of language, these parsers have also been essential in the construction of a reading environment that brings the results of our work to large audiences. The Perseus Digital Library has long used its Classical Greek and Latin parsers to automatically generate linguistic hypertexts [8]. These hypertexts allow a reader to move from an inflected form as it appears in a text to a 'word study tool' that shows the lexical form along with frequency data, links to entries for the word in dictionaries, and grammatical aids [9]. Log analysis by the Perseus Project has shown that users of this functionality range from experts who look up the occasional word to novice readers (either students or scholars from another field working with a Greek or Latin text) who painstakingly move through a chunk of text word by word trying to understand the original language. This support for the automatic generation of hypertext is one of the key elements in our efforts to lower the barriers to reading cultural heritage languages; if other parsers can be integrated with original-language primary source materials, people can read these texts in the original language at an earlier point in their study, opening them to much broader audiences. To this end, we have supplemented the Classical Greek and Latin parsers provided to the project by the Perseus Digital Library with parsers for Early-Modern Latin and Old Norse, and we have done very preliminary work on a parser for Old English.
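The core of such a reading environment can be suggested with a short sketch that wraps every word of a passage in a link to a word study tool; the URL pattern here is a hypothetical placeholder, not the actual Perseus routing.

# A minimal sketch of automatic linguistic hypertext generation: every
# word in a passage is wrapped in a link to a 'word study tool'. The
# URL pattern is a hypothetical placeholder.
import html
import re

def hyperlink_text(passage: str, lang: str = "greek") -> str:
    def link(match: re.Match) -> str:
        word = match.group(0)
        return (f'<a href="/wordstudy?lang={lang}&form={html.escape(word)}">'
                f"{html.escape(word)}</a>")
    # \w+ with re.UNICODE also matches polytonic Greek letters.
    return re.sub(r"\w+", link, passage, flags=re.UNICODE)

print(hyperlink_text("arma virumque cano", lang="latin"))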
The Early-Modern Latin parser has been developed by the Istituto di Linguistica Computazionale del Consiglio Nazionale delle Ricerche (CNR) in Pisa. This parser contains word stems drawn from extensive lexical sources that cover both the ancient and early-modern periods, including Lewis and Short's Latin lexicon, the Oxford Latin Dictionary, and the Latin dictionaries of DuCange and Forcellini. This broad collection of source lexica allows the parser to be applied to a larger text base than the one covered by the Perseus parser. In addition, the parser presents its output using the code standards developed by the EAGLES consortium [10], an EU-funded working group that addressed the standardization of computational lexicons, textual corpora, and their annotations. Texts parsed with this parser can therefore be used with the variety of tools that also follow these standards.
The Old Norse parser has been developed by researchers at the University of California at Los Angeles. This group has written two rule-based parsers: one that takes an inflected form as its input and outputs the lexical form, and another that generates complete paradigms for a word based on its lexical form. The source data for these parsers is drawn from Geir Zoega's Concise Dictionary of Old Icelandic. In the corpus of Old Icelandic sagas created for this project, the parser currently provides feedback for 67% of the running words. The vast majority of unparsed words are either proper names that do not appear in the lexicon or very common irregular forms that will be addressed with a manual lookup table.
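The paradigm-generating direction of such a rule-based parser can be illustrated with a brief sketch; the single a-stem declension class below is a simplification, and real Old Norse morphology also requires umlaut rules and the irregular forms mentioned above.

# A minimal sketch of a rule-based paradigm generator: a declension-class
# table of endings is applied to a stem. The a-stem class below is a
# simplified illustration.

A_STEM_MASC = {  # strong masculine a-stem endings
    "nom.sg": "r", "acc.sg": "", "dat.sg": "i", "gen.sg": "s",
    "nom.pl": "ar", "acc.pl": "a", "dat.pl": "um", "gen.pl": "a",
}

def paradigm(stem: str, endings: dict) -> dict:
    return {tag: stem + e for tag, e in endings.items()}

def parse(form: str, lexicon: dict) -> list:
    # Inverse direction: find (lemma, tag) pairs whose generated form matches.
    return [(lemma, tag)
            for lemma, (stem, endings) in lexicon.items()
            for tag, f in paradigm(stem, endings).items() if f == form]

LEXICON = {"hestr": ("hest", A_STEM_MASC)}      # 'horse', Zoega-style data
print(paradigm("hest", A_STEM_MASC)["dat.pl"])  # -> hestum
print(parse("hestar", LEXICON))                 # -> [('hestr', 'nom.pl')]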
In addition to being available as part of automatically generated hypertexts, the Perseus Greek parser and the CNR Latin parser have also been deployed as a web service, in a framework that will also allow for the easy integration of the Old Norse parser once its irregular forms have been entered. This service allows users to enter individual words, upload text files, or submit a request via SOAP for processing, thereby allowing them to apply these tools to any text that is not in our system.
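A client for such a service might look like the following sketch; the endpoint URL, XML namespace, and operation name are placeholders of our own invention, not the consortium's published interface.

# A hedged sketch of calling a morphological analysis web service via
# SOAP. The endpoint, namespace, and operation name are hypothetical
# placeholders.
import urllib.request

ENDPOINT = "http://services.example.org/morphology"  # placeholder URL

def analyze(word: str, lang: str = "la") -> str:
    envelope = f"""<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <analyze xmlns="urn:example:morphology">
      <form>{word}</form>
      <language>{lang}</language>
    </analyze>
  </soap:Body>
</soap:Envelope>"""
    req = urllib.request.Request(
        ENDPOINT,
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "urn:example:morphology#analyze"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")  # XML listing possible parses

# print(analyze("amavit"))  # would return analyses such as lemma 'amo'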
While recognizing the importance of morphological analysis tools for our work, we have also been mindful of cost/benefit ratios. Parsers are expensive and difficult to develop, so we have explored three different methodologies in order to determine cost-effective ways to help students and lifelong learners read texts in the original languages while also providing accurate data for natural language processing and information retrieval applications. The most expensive approach is the one adopted for the Early-Modern Latin parser, in which data is extracted from lexica and the morphological type of each stem is hand coded according to the EAGLES standards. A less expensive, more automated approach was adopted for the Old Norse parser, in which morphological data was automatically extracted from the lexicon and the rule-based parser generated the database of possible lexical forms used in the hypertextual reading environment. Finally, our work on Old English has been the least expensive; here, we extracted citation information from specific lexica and used this data to bootstrap a series of pattern-matching algorithms that move from words in context to the dictionary. This proved an inexpensive way to provide reading support, but it was not robust enough for information retrieval or natural language processing applications. While the results of the hand-coded method are superior and more comprehensive, the middle approach strikes an effective balance between precision, recall, and expense.
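The bootstrapping idea behind the Old English work can be suggested with a small sketch: citation forms harvested from dictionary entries become a form-to-headword table against which running text is matched. The two entries and their forms below are purely illustrative.

# A minimal sketch of citation bootstrapping: forms quoted inside
# dictionary entries are harvested into a form->headword table, and
# running text is matched against it. The lexicon here is invented.

ENTRIES = {
    "cyning": ["cyning", "cyninges", "cyningas"],   # citation forms in entry
    "habban": ["habban", "haefde", "haefdon"],
}

FORM_TO_HEADWORD = {form: head for head, forms in ENTRIES.items() for form in forms}

def link_words(text: str) -> list:
    # Returns (word, headword-or-None); unmatched words get no reading support.
    return [(w, FORM_TO_HEADWORD.get(w)) for w in text.lower().split()]

print(link_words("haefde cyningas"))
# [('haefde', 'habban'), ('cyningas', 'cyning')]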
With baseline parsers and corpora in place, we have been able to investigate applications of information retrieval and natural language processing technologies to these texts. In the initial conception of this project, we envisioned not just the development of collections of Greek, Latin, and Old Norse texts with tools to allow beginning and intermediate readers the opportunity to read them, but also the creation of tools that would allow scholars to perform traditional tasks more efficiently and to study texts in new ways. Our work in this area has revolved around facilities for multi-lingual information retrieval and visualization, applications for lexicography, and tools for the computational analysis of style.
The multi-lingual search facility is essentially a reverse lookup tool for a collection of multi-lingual dictionaries that is used to construct a query for a mono-lingual search engine [11]. For Greek searches, users can enter their query terms in English and ask the tool to return entries from different dictionaries, including the 'unabridged' Greek-English lexicon of Liddell, Scott, and Jones, the 'abridged' lexicon of Liddell and Scott, and author-specific lexica for Homer and Pindar. The Latin interface searches the definitions of the abridged and unabridged Lewis and Short lexica. The search interface also allows users to show only words that appear in works written by a specific author (for example, a user can ask for Greek words with 'goat' in their definition that also appear in the works of Homer). In a second step, the program provides a checklist from which users select the words that are ultimately sent as a query to the search engine. This interface provides a list of translation equivalents for the word or words the user entered, along with an automatically extracted and abridged English definition of each word, a link to its full definition, a list of authors who use it, and the number of times it appears in the corpus. The tool also suggests related words that might be of interest, based on a simple similarity calculation over other definitions in the dictionary.
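A minimal sketch of the reverse-lookup step might look like the following; the three toy entries stand in for the dictionary data, and the matching is far cruder than the actual tool's.

# A minimal sketch of reverse dictionary lookup: an inverted index from
# English definition words to Greek headwords, filtered by author. The
# three entries are illustrative, not drawn from the LSJ data.
from collections import defaultdict

DEFINITIONS = {  # headword -> (English gloss, authors who use it)
    "aix":    ("a goat",               {"Homer", "Herodotus"}),
    "tragos": ("a he-goat",            {"Pindar", "Herodotus"}),
    "poimen": ("a shepherd, herdsman", {"Homer"}),
}

index = defaultdict(set)
for head, (gloss, _) in DEFINITIONS.items():
    for token in gloss.replace(",", " ").split():
        index[token].add(head)

def reverse_lookup(english: str, author: str = None) -> list:
    hits = index.get(english, set())
    if author:
        hits = {h for h in hits if author in DEFINITIONS[h][1]}
    return sorted(hits)

# 'he-goat' stays one token here; a real tool would match subtokens too.
print(reverse_lookup("goat"))                  # -> ['aix']
print(reverse_lookup("goat", author="Homer"))  # -> ['aix']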
After users select their search terms with this tool, the terms are sent to a mono-lingual search engine with several visualization front ends. These visualization tools graphically group related documents and label each cluster with a keyword or group of keywords. The visualization interfaces include a tree view that presents documents as the nodes of a binary tree flattened into a circular pattern; a Sammon map that represents each cluster as a circle, with the radius of each circle indicating the relative size of the cluster and the distance between circles representing their relative similarity; and a radial visualization that places the twelve highest-ranked keywords around a circle and represents each document as a point within it, positioned according to the keywords the document contains, with each keyword pulling the point toward its place on the circle. Each of these display modes allows users to drill down into the clusters, eliminate keywords, see document fragments, and link to the complete document in the digital library.
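The placement rule of the radial view can be illustrated with a generic RadViz-style computation, sketched below; this is a standard formulation under our own simplifying assumptions, not the consortium's code.

# A minimal sketch of radial placement: keywords sit at fixed points on
# a circle, and each document is drawn inside at the weighted average
# of the keywords it contains.
import math

def radial_layout(keywords: list, doc_weights: dict) -> tuple:
    # Anchor each keyword at an equally spaced angle on the unit circle.
    anchors = {
        kw: (math.cos(2 * math.pi * i / len(keywords)),
             math.sin(2 * math.pi * i / len(keywords)))
        for i, kw in enumerate(keywords)
    }
    total = sum(doc_weights.get(kw, 0.0) for kw in keywords) or 1.0
    x = sum(anchors[kw][0] * doc_weights.get(kw, 0.0) for kw in keywords) / total
    y = sum(anchors[kw][1] * doc_weights.get(kw, 0.0) for kw in keywords) / total
    return x, y

kws = ["war", "ship", "sacrifice", "law"]  # four keywords for brevity
print(radial_layout(kws, {"war": 3, "ship": 1}))  # pulled toward 'war'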
The primary impact of our work in a traditional field has been the creation of a citation database of lexicographic 'slips' for a new intermediate-level Greek-English lexicon currently being written at Cambridge University. The database is the main repository of source material for the new dictionary; it provides a key-word-in-context display for every occurrence of each word in the corpus, along with English translations from the Perseus corpus where possible. These passages are presented in chronological order and are accompanied by an author-by-author frequency summary. Links are also provided to the online edition of the Liddell-Scott-Jones Greek-English Lexicon (LSJ), which integrates statistical information about the word, including comparative frequency data, word collocation information, and an automatically extracted list of words with similar definitions. One of the primary problems faced by lexicographers is information overload; the word grapho (to write), for example, appears more than 2,000 times in the corpus, and the creation of a definition that accounts for all of its senses and nuances can be a daunting task. While we have done some experiments in automatic categorization, we have also addressed this problem by reintegrating previously existing lexicographic resources such as earlier dictionaries. When constructing the citation database, our program mines citations from the existing LSJ, flags these passages, and presents them apart from the other citations in the order in which they appear in the lexicon.
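A minimal sketch of the key-word-in-context display at the heart of these slips appears below; the real database matches lemmatized forms via the morphological parser rather than surface strings as here.

# A minimal sketch of a key-word-in-context (KWIC) display: every
# occurrence of a target word is shown with a fixed window of context.
# The sample sentence is invented for illustration.

def kwic(tokens: list, target: str, window: int = 3) -> list:
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines

text = "the scribe will write and write again what the king commands".split()
for line in kwic(text, "write"):
    print(line)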
In addition to our lexicographic work, we have also developed tools for use within the digital library environment that allow for the computational study of style, including the common subjects and objects of verbs, the relative frequency of their different grammatical forms, and the ways that grammatical patterns are distributed and used in literary works. One example of the sort of work these tools make possible can be seen in a forthcoming article about the use of the Greek participle in the works of Lysias, an Athenian orator of the 5th century BCE [12]. Using the statistical tools developed by the consortium, it is possible to show that narrative descriptions of violence are marked by frequent use of the participle. It is also possible to show that clusters of participles serve as structuring devices that mark the conclusion of narrative arcs and lines of argument while also lending immediacy and momentum to the argument or narrative description of events.
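The kind of measurement underlying such a study can be suggested with a short sketch that computes the share of participles in successive chunks of a tagged text, so that clusters stand out; the tagged tokens below are invented for illustration, with tags assumed to come from the morphological parser.

# A minimal sketch of a stylometric measurement: the relative frequency
# of a grammatical category (here, participles) over successive chunks
# of a text, so that clusters become visible.

def participle_density(tagged: list, chunk: int = 5) -> list:
    # tagged: list of (word, part_of_speech) pairs in text order.
    out = []
    for start in range(0, len(tagged), chunk):
        piece = tagged[start:start + chunk]
        share = sum(1 for _, pos in piece if pos == "PART") / len(piece)
        out.append(round(share, 2))
    return out

tagged = [("w%d" % i, "PART" if i in (3, 4, 5, 11) else "VERB") for i in range(15)]
print(participle_density(tagged))  # -> [0.4, 0.2, 0.2]: a cluster early on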
As a consortium, we have worked on facilities for long-term preservation and federation of digital libraries. Our initial efforts towards data federation were focused primarily on locating and linking to shared resources across digital libraries through a system of abstract document identifiers. The Thesaurus Linguae Graecae project has developed a system for uniquely identifying classical works using author and work numbers [13]. The Perseus Digital Library used these author and work numbers as the basis for a system of 'Abstract Bibliographic Objects (ABO)' that allowed us to automatically connect different versions of a text. In this system, individual texts are declared to be a version of a particular ABO in their metadata. When the user requests a particular document, the system is able to provide links to all versions of the document in Perseus. In the early phases of this project, we extended this system so that we could publish information about our versions of texts and discover other versions of texts in other libraries via the OAI protocols [14]. For example, if a text on a server in London cites a passage from Homer's Iliad, this system can automatically discover the text of the Iliad in the Perseus Digital Library using OAI protocols and link to that text in its context.
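A minimal sketch of the ABO mechanism appears below; the identifiers follow the TLG author.work pattern described above, but the catalog records themselves are invented for illustration.

# A minimal sketch of Abstract Bibliographic Objects: each concrete
# text declares in its metadata which abstract work it is a version of,
# and a citation resolves to every known version. Records are invented.

CATALOG = [  # metadata records, e.g. as harvested via OAI-PMH
    {"id": "perseus:text:greek-iliad",   "abo": "tlg0012.tlg001", "label": "Iliad (Greek)"},
    {"id": "perseus:text:english-iliad", "abo": "tlg0012.tlg001", "label": "Iliad (English)"},
    {"id": "perseus:text:greek-odyssey", "abo": "tlg0012.tlg002", "label": "Odyssey (Greek)"},
]

def versions_of(abo: str) -> list:
    return [rec for rec in CATALOG if rec["abo"] == abo]

# A citation of Homer's Iliad resolves to every version of that work.
for rec in versions_of("tlg0012.tlg001"):
    print(rec["id"], "-", rec["label"])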
Standards and thinking have evolved more rapidly in the area of open access than in any other area we have explored. There has been a burst of energy surrounding open access initiatives; the Perseus Digital Library has long been committed to free and open access to all of its materials, and we have continued that tradition in this project. All of the texts and tools we have described are freely available on the internet. Further, we have negotiated a contract with Cambridge University Press that will allow free on-line access to the Greek-English lexicon as part of the Perseus Digital Library simultaneously with the publication of the printed edition.
The OAI (Open Archives Initiative) protocols have changed substantially during the life of this project. When we first formulated this project four or five years ago, we had hoped that the OAI would be an appropriate venue for sharing more detailed data, such as morphological analyses, in addition to catalog-level metadata. As the OAI has evolved, however, it has become clear that publishing this sort of data via the OAI would 'break' existing applications that expect catalog data. As an alternative, we have created the SOAP services described above for tools such as the morphological analysis engine. Our initial work, which brought us into collaboration with the Classical Text Services Protocol, also made it apparent that we needed to explore the ways in which semantic interoperability and cultural heritage ontologies would allow for federation at the level of individual passages within single works. We have therefore been working with FRBR, EAD, TEI, and the CIDOC Conceptual Reference Model [15] to see how our work might be harmonized at the metadata level with the larger ontological frameworks being developed for museums, libraries, and archives. This has led us into new areas of research that could not have been anticipated at the beginning of the project. Advances in metadata sharing and cultural heritage ontologies have led us to the idea of a semantically integrated GRID-Digital Library that supports knowledge extraction from a wide variety of distributed data sources.
As the Cultural Heritage Language Technologies consortium wraps up its work, we are pleased with the progress we have made in applying technologies and techniques from computational linguistics, natural language processing, and information retrieval to texts written in Greek, Latin, and Old Norse. Looking to the future, we have identified areas that require further investigation, such as how to encode elements of early-modern typography, how to integrate our work beyond the boundaries of our consortium via digital library services, and the potential use of computing grid technologies for real-time presentation and manipulation of computationally intensive statistical analyses of cultural heritage texts. Our work on this project has, however, laid the foundation for these new directions through its construction of new corpora and tools that allow students and scholars to study cultural heritage languages more effectively.
1. Cultural Heritage Language Technologies home page, <http://www.chlt.org>. The results of our work are available on-line at <http://www.chlt.org/CHLT/demonstrations.html>.
2. The partners from the United States include Jeff Rydberg-Cox from the Classical Studies Program and Department of English at the University of Missouri at Kansas City; Gregory R. Crane from the Perseus Digital Library at Tufts University; Ross Scaife of the Stoa Consortium at the University of Kentucky; and Timothy R. Tangherlini from the Scandinavian Section at the University of California at Los Angeles. Our European partners include Dolores Iorizzo, Rob Iliffe, and Stephan Rueger from The Newton Project and the Department of Computer Science, Imperial College; A.A. Thompson and Bruce Fraser from the Intermediate Greek Lexicon project and the Faculty of Classics at Cambridge University; Andrea Bozzi at the Istituto di Linguistica Computazionale del Consiglio Nazionale delle Ricerche in Pisa, Italy; and Matthew Driscol from the Arnamagnaean Institute at the University of Copenhagen.
3. We wrote an article describing our aims at the outset of this project: Rydberg-Cox, J. (2003). "Building an Infrastructure for Collaborative Digital Libraries in the Humanities." Ariadne 34: <http://www.ariadne.ac.uk/issue34/rydberg-cox/>. As we conclude the project, it has been instructive to compare our initial aims with our final products.
4. National Science Foundation DLI2 International Projects, <http://www.dli2.nsf.gov/intl.html>.
5. Text Retrieval Conference (TREC), <http://trec.nist.gov/>.
6. TIDES (Translingual Information Detection, Extraction, and Summarization), DARPA Information Processing Technology Office, <http://www.darpa.mil/ipto/programs/tides/>.
7. CLEF (Cross Language Evaluation Forum), <http://clef.isti.cnr.it/>.
8. See Crane, G. (1991). "Generating and Parsing Classical Greek." Literary and Linguistic Computing 6: 243-245 and Crane, G. (1998). "New Technologies for Reading: The Lexicon and the Digital Library." Classical World: 471-501.
9. Mahoney, A. (2001). "Studying the Word Study Tool." New England Classical Journal 28(3): 181-183.
10. EAGLES on line home page, <http://www.ilc.cnr.it/EAGLES96/home.html>.
11. A detailed description of this tool is available in Rydberg-Cox, J., L. Vetter, et al. (2004). "Approaching the Problem of Multi-Lingual Information and Visualization in Greek, Latin and Old Norse Texts." Lecture Notes in Computer Science 32: 168-178.
12. Rydberg-Cox, J. (forthcoming, 2005). "Talking About Violence: Clustered Participles in the Speeches of Lysias." Literary and Linguistic Computing.
13. Thesaurus Linguae Graecae, <http://www.tlg.uci.edu/>.
14. Open Archives Initiative (OAI) home page, <http://www.openarchives.org/>.
15. FRBR (Functional Requirements for Bibliographic Records), <http://www.oclc.org/research/projects/frbr/default.htm>; EAD (Encoded Archival Description), <http://www.loc.gov/ead/>; TEI (Text Encoding Initiative), <http://www.tei-c.org/>; and the CIDOC Conceptual Reference Model, <http://cidoc.ics.forth.gr/>.