Workpackage 1: Key Term Extraction and Document Clustering
Imperial College, London will focus on developing tools for extracting key terms and phrases from document collections, clustering these phrases into related groups, and visualizing and accessing the resulting clusters. The group will also address the problem of concept emergence and change in collections of thematically related documents. Its goals are to develop a method for automated document representation based on computed keywords that are suitable for clustering; to propose and implement clustering and visualization algorithms on sub-repositories in the document space; to propose and implement clustering and visualization algorithms on sub-repositories in the word space using language-specific morphological analysis; and to demonstrate the viability and usefulness of this tool in a user evaluation.
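As an illustration of what a keyword-based document representation might look like, the following minimal sketch ranks the terms of each document by TF-IDF weight. TF-IDF is an assumption chosen for the example; the workpackage does not commit to a particular weighting scheme, and the function names `tokenize` and `key_terms` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def key_terms(documents, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document.

    This is an illustrative weighting only, not the method the
    workpackage will necessarily adopt.
    """
    token_lists = [tokenize(doc) for doc in documents]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    results = []
    for tokens in token_lists:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results
```

The ranked term lists (or the underlying score vectors) could then serve as input to a clustering step; terms that occur in every document receive a weight of zero and so drop out of the representation.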
Workpackage 2: Word Profile Tools, Multi-Lingual Information Retrieval, and Syntactic Parsing Tools
In this workpackage, the University of Missouri at Kansas City and the Lexicon group at Cambridge University will develop multi-lingual information retrieval facilities for the digital library, a series of tools that provide vocabulary profiles for texts and corpora within the digital library system, and tools for syntactic parsing of Greek and Latin texts. The deliverables in this workpackage call for a twelve-month development cycle, beginning with the word profile tools, followed by multi-lingual information retrieval and, finally, tools to discover selectional preferences and categorization frames for Greek verbs.
Workpackage 3: Collaborative Infrastructure and Metadata Sharing
This workpackage provides the core infrastructure on which all other partners in the consortium will build. The first deliverable of this infrastructure is a CVS repository containing the core digital library code. A functional version of this infrastructure is already in place, and it will be made available to the other members of the consortium under the GNU General Public License at the outset of our work. As the infrastructure developments envisaged in this grant are realized, they will be propagated to each member of the group. This infrastructure runs on a variety of UNIX platforms, including Compaq's Tru64 UNIX, Red Hat Linux, SuSE Linux, and Apple's Mac OS X. This flexibility means that we do not need to define a strict hardware standard: each project needs only a UNIX platform that can compile the software of the core infrastructure.
The Perseus Project, based at Tufts, and the Stoa Consortium at the University of Kentucky will also work together to deliver the basic infrastructure for metadata sharing and collaborative resource discovery. This infrastructure includes the establishment of naming rules for objects in the digital library group, generalization of an existing OAI data provider for use by all the cooperating projects, implementation of a metadata harvester, and incorporation of metadata from remote systems into a local DL repository. To leverage the resources collaboratively created in other parts of the grant, they will also implement cross-searching of the full content of our various repositories.
The technical framework for the project will be the Open Archives Initiative (OAI) protocol, which we will use to share more detailed metadata and so allow automatic generation of links among resources in separate systems. To achieve this, we will modify the existing OAI data provider and incorporate harvesting and service provider functions into the DL infrastructure. The Perseus Project has been involved in efforts to create cross-collection searching. In collaboration with the Stoa Consortium, we will extend the established metadata sharing procedures (described above) to allow harvesting and indexing of collection content. To keep barriers to inclusion as low as possible, the indexing will be developed as a separate middleware application rather than as a feature that each repository must implement on its own. In addition, it will be possible to define virtual corpora, which will permit searching over sub-collections that are distributed across repositories in physically disparate locations.
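The harvesting step described above can be sketched as follows. The sketch assumes version 2.0 of the OAI Protocol for Metadata Harvesting with unqualified Dublin Core (`oai_dc`) records; the function names, the example base URL, and the sample response are hypothetical and do not describe the existing Perseus data provider.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH 2.0 responses and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body, shortened for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:text-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Principia</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract the identifier and first title from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append({"identifier": identifier, "title": title})
    return records
```

A harvester would fetch the URL returned by `list_records_url`, pass the response body to `parse_records`, and store the results in the local repository. A production harvester would also need to follow resumption tokens and handle deleted records, which this sketch omits.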
Workpackage 4: Old Norse Morphological Analyzer, Texts, and Reading Environment
The Scandinavian Section at the University of California at Los Angeles and the Arnamagnaean Institute will digitize lexical resources and write a morphological analyzer for Old Norse that will also be used to help students and scholars read and understand Old Norse texts. They will also edit a corpus of Old Norse literature and link the tagged texts to digitized manuscript images as a testbed for the integrated reading environment. As with the Neo-Latin parser, this morphological analyzer will be built around existing standards for both input and output so that it can serve as a 'plug in' module for the existing digital library infrastructure and participate in the data sharing models that we will develop with this grant.
The Istituto di Linguistica Computazionale del CNR will work to create a morphological analyzer for Classical, Renaissance, and later Latin that can be used in the integrated reading environment of our digital library. This work will focus on the design and development of a morphological analyzer for Neo-Latin in order to retrieve information from large textual archives. The basic approach will use a machine dictionary organized as a look-up table in which each word-form, generated by an automatic system, is linked with the lemma (or lemmas, in the case of homography) to which it is morphologically related. The output of this program will conform to standards already used by the Greek and Classical Latin parsers in use at the Perseus Project so that it can be seamlessly integrated with the Perseus Project's reading environment.
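The look-up table approach can be illustrated with a minimal sketch in which each word-form maps to a list of (lemma, analysis) pairs, so that a homograph simply has more than one entry. The function names, the tuple layout, and the analysis labels are hypothetical; the actual dictionary format and output standard are not specified here.

```python
from collections import defaultdict


def build_table(entries):
    """Build a word-form look-up table from (form, lemma, analysis) triples.

    A form that belongs to several lemmas (homography) keeps all of them.
    """
    table = defaultdict(list)
    for form, lemma, analysis in entries:
        table[form.lower()].append((lemma, analysis))
    return dict(table)


def analyze(word, table):
    """Return every (lemma, analysis) pair recorded for a word-form."""
    return table.get(word.lower(), [])


# Hypothetical generated forms; "canis" is both a noun and a form of "cano".
ENTRIES = [
    ("canis", "canis", "noun, nominative singular"),
    ("canis", "cano", "verb, 2nd person singular present active indicative"),
    ("cano", "cano", "verb, 1st person singular present active indicative"),
]
TABLE = build_table(ENTRIES)
```

Looking up `analyze("Canis", TABLE)` returns both the noun and the verb analysis, leaving disambiguation to the reader or to a later processing stage; an unknown form returns an empty list.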
Collaborators at the Stoa Consortium, the University of Missouri-Kansas City, and the Newton Project at Imperial College, London will develop collections of documents containing approximately 300 MB (more than 60,000 printed pages) of early modern Latin literary and scientific texts that can be used as testbeds for the advanced tools described in this project. These texts will be marked up in XML or SGML in conformance with the standards defined by the Text Encoding Initiative.
There will be two main coordination and management sites: the Newton Project at Imperial College, London (EU), which will act as European coordinator and management anchor for CHLT, and the two co-principal investigator sites at the University of Missouri, Kansas City (US), which will act as anchor for the American partners. Bi-yearly collaboration and integration meetings at alternating US and EU partner sites will play a vital role in coordinating effort and monitoring progress.