Workpackage 3: Collaborative Infrastructure

 

Year 2 Executive Summary

 

 

            The Perseus Project has made steady progress towards the completion of Workpackage 3. The goal of this workpackage is to provide infrastructure for a distributed cultural heritage library where any addition immediately increases the value of the existing collections. We have made significant accomplishments in several areas.

           

            We have spent considerable time over the past year improving the underlying Perseus text processing system. For a federated digital library system to work, it must be simple to install in multiple locations. Our development effort has focused on reducing the complexity of the setup process, limiting the number of third-party packages and libraries that must be installed, and building clear, publishable specifications for extensions and customizations. Although the benefits of this type of work are not always easy to demonstrate, the progress has been substantial.

 

            In addition to the often invisible structural changes, we have made important improvements to the user interface. The default interface is now "cleaner": common tasks require fewer clicks, and we have worked to improve both clarity and performance. The specifics of the user interface can, of course, be customized by individual libraries.

 

            As we approach the end of the second year, one of the most exciting new features is the ability to share catalogs between related digital libraries. Catalog and metadata sharing is particularly useful for commentaries and notes on cultural heritage documents. Works of this type typically draw connections and comparisons between the document in question and other related works. A commentary on Beowulf, for example, can contain a direct link to the Odyssey even if the specific digital library site does not itself contain Homer. The naming conventions developed by Perseus are critical for this component, because they allow document editors to specify links to texts without having to personally discover individual URLs.

 

            Another, more specialized instance of digital library integration that we have made progress on this year is morphological analysis. Morphology, and linguistic analysis in general, is extremely important in increasing access to cultural heritage documents. We have produced a working system for integrating morphological analysis data from a variety of sources into a single web service model. We are working with UCLA and Pisa to incorporate Old Norse and LEMLAT parses into this system.

 

            A third aspect of digital library integration is the ability to identify and share individual entries in reference works.


Year 2 Progress Reports

 


 

Cultural Heritage Language Technologies

 IST-2001-32745

 

1 June - 30 November 2003

 

Workpackage 3: Collaborative Infrastructure and Metadata Sharing

 

Perseus Project

 

Gregory R. Crane

Anne Mahoney

David Mimno

 

 

 

1. Summary of key indicators of project progress

 

 

The primary objective of this phase is to continue to create a robust infrastructure for data sharing, which is based on a system of unique name identifiers for organizations, collections, and individual digital objects.

           

2. Work Progress Overview

           

            2.1 Progress of Workpackage/Tasks

 

Perseus is building infrastructure for Cultural Heritage information sharing in three areas:

 

First, we are developing our existing digital library system (known as "the Hopper") to simplify the installation process, increase efficiency, and build a clear, well-documented API to accelerate the development of advanced digital library applications.

 

Second, we are developing abstract interfaces to our morphological databases. These include SOAP and XML-RPC web services as well as simple XML documents and human-readable web pages. This infrastructure allows us to rapidly integrate newly generated morphological information from the UCLA Old Norse parsers and Pisa's LEMLAT. Furthermore, the publicly available APIs make this valuable data easily accessible to Cultural Heritage application developers around the world. An example of this system is at http://www.chlt.org/morph

 

Third, we are experimenting with new ways to use Open Archives Initiative protocols to share large volumes of data derived from reference works. The following is a discussion of this work.

 

Current State

 

OAI data providers are currently collections of document-level metadata. This is good for sharing collection data, but not much else. James Allan (an Information Retrieval expert at UMass Amherst) reported at the 2003 NSDL all projects meeting that only 5% of the 700,000 OAI records in the NSDL even had usable URLs.

 

Low Resolution

 

In current practice, OAI data provides a very low-resolution look at documents in the digital library. Here is the average amount of source data represented by each OAI record in three Perseus subcollections.

 

Greek: 109236 kb / 234 records = 466.8 kb/record

Latin: 70428 kb / 129 records = 546.0 kb/record

Secondary Materials: 334272 kb / 68 records = 4.9 Mb/record
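
The averages above can be recomputed with a few lines of Python. This is only a sketch of the arithmetic: the sizes and record counts are copied from the figures in this section, and the conversion of 1 Mb = 1000 kb is an assumption, since the report does not say which convention it uses.

```python
# Collection sizes (in kb) and OAI record counts, copied from this section.
collections = {
    "Greek": (109236, 234),
    "Latin": (70428, 129),
    "Secondary Materials": (334272, 68),
}

KB_PER_MB = 1000  # assumption: decimal megabytes


def kb_per_record(total_kb: int, records: int) -> float:
    """Average amount of source data represented by one OAI record."""
    return total_kb / records


averages = {name: kb_per_record(kb, n) for name, (kb, n) in collections.items()}

for name, avg in averages.items():
    if avg >= KB_PER_MB:
        print(f"{name}: {avg / KB_PER_MB:.1f} Mb/record")
    else:
        print(f"{name}: {avg:.1f} kb/record")
```

The larger the figure, the less a single record tells a harvester about what the collection actually contains.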

 

In one case (Harper's Dictionary of Classical Antiquities) there is one record for a 12 Mb SGML document made up of 50,000 individual entries. The record publicizes the existence of the document, but offers no help to the harvester in making use of it.

 

Case Study: PlanetMath.org Encyclopedia

 

PlanetMath.org is a community-driven website that hosts an encyclopedia of mathematical terminology. Unlike the Perseus encyclopedias and other entry-based works, the PlanetMath encyclopedia is, from the perspective of its OAI data provider, a collection of single-page documents rather than a single monolithic document.

 

The implication is that a harvester can extract the headwords of the encyclopedia directly from the OAI provider. Direct access to the subdocuments can give harvesters detailed access to the document. This increases the potential for linking between sites and opens the possibility for uses of the data that the original document maintainers would never have envisioned.
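
To make the harvester's side concrete, the following sketch pulls headwords out of an OAI-PMH ListRecords response in which each record describes one entry. The XML namespaces are the standard OAI-PMH and Dublin Core ones; the sample response, its identifiers, and the assumption that the headword is carried in dc:title are illustrative only and are not taken from the PlanetMath provider.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Illustrative response: identifiers and titles are made up for this sketch.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:entry-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>group</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:entry-2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>ring</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_headwords(response_xml: str) -> dict:
    """Map each record identifier to its dc:title, treated as the headword."""
    root = ET.fromstring(response_xml)
    headwords = {}
    for record in root.iterfind("oai:ListRecords/oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext("oai:metadata/oai_dc:dc/dc:title", namespaces=NS)
        if identifier and title:
            headwords[identifier] = title
    return headwords


headwords = extract_headwords(SAMPLE_RESPONSE)
for identifier, headword in headwords.items():
    print(identifier, headword)
```

With a monolithic record, the same loop would yield a single title for the whole work, which is the low-resolution problem described above.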

 

Licensing

 

Expanding the coverage of the OAI provider has the effect of making it easier for users to download entire digital works. It doesn't make the difference between possible and impossible, but it will remove one or two steps. For this reason it will be necessary to clarify the licensing status of certain high-value reference works like the Greek and Latin lexicons. We propose to provide single-page OAI records for the entries in a number of reference works, specifically the Harper's encyclopedia and the Zoega Lexicon of Old Icelandic.
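
A provider-side sketch of the proposed single-page records follows. It is not the Perseus implementation: the repository name (example.org), the work key (harpers), the entry keys, and the choice of Dublin Core fields are all hypothetical. The identifiers are assumed to follow the oai:domain:local-id pattern recommended in the OAI identifier guidelines.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def entry_record(repository: str, work: str, entry_id: str, headword: str):
    """Return (identifier, oai_dc XML string) for one reference-work entry."""
    # Hypothetical naming scheme: repository, work, and entry key joined by colons.
    identifier = f"oai:{repository}:{work}:{entry_id}"
    dc = ET.Element(f"{{{OAI_DC}}}dc")
    ET.SubElement(dc, f"{{{DC}}}title").text = headword
    ET.SubElement(dc, f"{{{DC}}}identifier").text = identifier
    return identifier, ET.tostring(dc, encoding="unicode")


# Illustrative entries only; a real provider would read them from the SGML source.
entries = [("abacus", "Abacus"), ("academia", "Academia")]

records = [entry_record("example.org", "harpers", key, head) for key, head in entries]
for identifier, xml in records:
    print(identifier)
    print(xml)
```

One record per entry means that a 50,000-entry work would expose 50,000 identifiers, which is what makes the licensing question above more pressing.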

 

                       

 

            2.2 Work planned for next reporting period

 

We will continue work on identifying and exposing linkable reference materials.

 

 


Cultural Heritage Language Technologies

 IST-2001-32745

 

 

Workpackage 3: Collaborative Infrastructure and Metadata Sharing

 

 

1 December 2003 - 28 February 2004

 

Perseus Project

 

Gregory R. Crane

Anne Mahoney

David Mimno

 

 

1. Summary of key indicators of project progress

 

1.1  Overview of objectives

 

The primary objective of this phase is to continue to create a robust infrastructure for data sharing, which is based on a system of unique name identifiers for organizations, collections, and individual digital objects.

           

 

1.2  Overall assessment of main milestones, results, or deliverables

 

We are continuing to develop a prototype of a metadata sharing system between two digital libraries for the June 2004 deliverable. The foundation of this system is each library's OAI data provider. Numerous implementations of OAI providers and harvesters exist. We are focusing our development on integrating OAI-derived metadata into every aspect of the digital library, from catalog browsing to full text searching to implicit linking between documents.

 

Traditional Dublin Core metadata provides a starting point for digital library integration, but the potential of this infrastructure is much larger. The same protocols and naming standards can be used to share much deeper and more complex information. We are actively developing methods of sharing tables of contents, reference entries, and service-oriented information such as morphological analysis.

 

 

 

2. Work Progress Overview

           

                        2.1 Progress of Workpackage/Tasks

 

We are actively developing systems to provide metadata integration between digital libraries.

 

                        2.2 Deliverables

 

            D 3.4   Report on Naming Conventions for DL Objects

 

            D 3.5   Maintenance Procedures for Naming Conventions