Workpackage 4: Old Norse Analyser and Texts

Executive Summary

Year 2 Quarterly Progress Reports

Cultural Heritage Language Technologies

IST-2001-32745

June 1 – August 31, 2003

Workpackage 4: Old Norse Morphological Analyser

University of California, Los Angeles

Timothy R. Tangherlini

1. Summary of key indicators of project progress

Reduced error rate in Analyzer A

Produced program specifications for Analyzer B

Ran output tests on Analyzer A for integration

2. Work Progress Overview

1. Completed beta-version of Analyzer A

2. Began refining code of Analyzer A to reduce error rates

3. Outlined code for Analyzer B

4. Designed web-forms based system for the input and correction of irregular forms; Beta version to be implemented in next quarter

5. Proofread dictionary entries from Zoega

6. Completed regularization of tagging of dictionary entries and look-up tool

7. Developed plans for integration of Analyzers A and B into the Perseus environment

8. In conjunction with the AMI team at University of Copenhagen, regularized the mark-up of diplomatic editions. This is an added feature, so that the analyzer will be able to work on <reg> fields of transcribed mss along with the regularized SE of the sagas

9. Planned for addition of approximately 40,000 headwords and part-of-speech information from the Ordbog over det norrøne prosasprog (Dictionary of Old Norse Prose) at the Univ. of Copenhagen

10. Planned for expansion of database of irregular forms

11. Planned for integrated text environment, linking mss to transcriptions to SE

12. Completed a database of secondary sources of scholarship on the Fornaldar søgur. This complements the primary source database compiled by the AMI team at the Univ. of Copenhagen.

3. Project Management

The project has been managed at UCLA by Prof. Timothy Tangherlini and at the Univ. of Copenhagen by Prof. Matthew Driscoll. Additional coordination and project management have been provided by Dr. Zoe Borovsky at the Center for Digital Humanities at UCLA. Essential project management has been provided through the central CHLT office by Dolores Iorizzo, who coordinates the efforts of all the CHLT partners.

1. Summary of key indicators of project progress

1.1 Overview of objectives

1.1.1 Main objectives for quarter

To develop analyzer A with a relatively low error rate

To begin the coding of analyzer B

To design a web-forms based system for adding irregular forms and for error correction

To develop a database of secondary sources for the Fornaldar søgur

To continue to collaborate with the AMI on the integration of all of these resources in a rich environment

1.1.2 User feedback

User feedback, as measured by the limited-access release, has been overwhelmingly positive. Experts in Old Norse language and literature have begun working with the analyzer, and have helped us recognize weaknesses (generally in the database of irregular forms) and suggested refinements for our analyzer A. This feedback has helped us begin refining the underlying code of analyzer A.

1.1.3 Task allocation

At UCLA, Kryztof Urban continued to take the lead role in coding analyzer A and in structuring the database of irregular forms. Bruce Dumes provided additional coding support to Kryztof Urban. Additional work on the database of secondary sources was provided by Randall Gordon. Zoe Borovsky coordinated server space and support, and provided additional intellectual guidance to Randall Gordon. Prof. Timothy Tangherlini continues to provide overall coordination and direction to the various tasks that are being undertaken by the UCLA team.

1.2 Overall assessment of main milestones, results, or deliverables

The project team believes that we have not only achieved our main milestones for this period but have also made an early start on some of our future milestones:

1.2.1 Our main analyzer is complete, and now being refined.

1.2.2 We have mapped out integration for the analyzer and SE texts into the Perseus environment.

1.2.3 We have expanded the coverage of our analyzer to diplomatic transcriptions of mss.

1.2.4 Our dictionary is complete, and will now be expanded with additional headwords from the ONP, thereby covering all of the prose language of Old Norse.

1.2.5 We have designed a clear mechanism for correcting forms generated by the analyzer, and

1.2.6 We have developed a database of secondary sources for the study of our pilot saga texts, the Fornaldar søgur.

2. Work Progress Overview

2.1 Specific objectives for the reporting period

· Completion of Analyzer A Beta

· Design of Analyzer B

· Design of error-correction reporting forms

· Completion of Dictionary mark-up and integration into Analyzer

· Completion of database of secondary sources

2.2 Achievements

2.2.1 List of Deliverables

· Completed Analyzer A Beta

· Designed Analyzer B

· Designed error-correction reporting forms

· Completed markup and integration of Dictionary into Analyzer

2.2.2 Progress of Workpackage/Tasks

All completed as planned.

2.2.3 Deviations if any and corrective action

· Decided to begin the process of expanding the headwords in the dictionary to include all of the headwords from ONP

· Decided to integrate the analyzer with the diplomatic transcriptions of mss as well

2.2.4 Work planned for next reporting period

· Complete Analyzer B beta

· Implement error correction forms for irregular forms

· Greatly expand database of irregular forms

· Continue integration of analyzer, transcriptions and SE texts into rich text environment

· Test visualization tools on lemmatized (pseudo)texts

3. Co-operation within the consortium, including project meetings

-Urban and Tangherlini attended the successful consortium meeting held at Imperial College, London in early June 2003

-Stefan Rueger from the consortium came to UCLA in July 2003 to work with us on integrating our work with the visualization tools

3.1 Participation in workshops, conferences, publications

-Urban presented the work on the Analyzer at the Society for the Advancement of Scandinavian Study

CONFERENCES:

Society for the Advancement of Scandinavian Study Annual Conference

Cultural Heritage Language Technologies
(IST-2001-32745)

Workpackage 4: Old Norse Morphological Analyser

1 September – 30 November 2003

Scandinavian Section, UCLA

Det Arnamagnæanske Institut, Københavns Universitet

Tim Tangherlini and Mat Driscoll

1 Introduction

This report summarizes the previous and most recent achievements for the Old Norse morphological analyzer, look-up tool, and digital corpus.

2 Previous achievements

This project can best be divided into two parts. The main work on the morphological analyser and the look-up tool has taken place at UCLA. The main work on the mark-up of diplomatic editions of mss has taken place at Københavns Universitet.

The initial development of a morphological parser has been described in detail in our previous report. In summary, the Los Angeles group implemented a morphological analyzer for normalized Old Norse texts using the standard Fornrit normalizations. As input, the parser accepts headwords from the complete Zoega Old Norse dictionary. For this purpose, a cleaned-up electronic version of the dictionary was created. The parser outputs declension tables for the headword. A trial version of the program can be found at http://ecampusdev.humnet.ucla.edu/curban/index.html. It was written in Perl using a MySQL database to ensure high portability.

The Copenhagen group of researchers has produced a preliminary markup of six sagas from the Fornaldar saga corpus in XML format based on conventions developed by Matthew Driscoll and his colleagues. The marked-up texts include both diplomatic and facsimile documents.

3 Current work

The project members are currently focusing on performance improvement and integration of the parser. In terms of the former, a first implementation of a second parser has been released. Instead of accepting only dictionary headwords, the second parser ideally accepts all declined word forms. This improved analyzer can be connected to the XML Old Norse saga texts. A usage sample can be found in the "Old Norse Morphological Analyzer" report. The Los Angeles group is currently improving the performance of both parsers.
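One way to picture the difference between the two parsers is as a reverse index over generated paradigms: the first parser goes from headword to declension table, while the second goes from any declined form back to its headword and grammatical description. The following is a minimal sketch in Python under stated assumptions, not the project's Perl code. The names PARADIGMS, build_form_index and analyse are hypothetical, and the single hard-coded paradigm (the strong masculine noun armr) stands in for tables that the first parser would generate.

```python
from collections import defaultdict

# Illustrative data only: one hand-entered paradigm standing in for
# the declension tables produced by the first parser.
PARADIGMS = {
    "armr": {
        ("nom", "sg"): "armr",
        ("acc", "sg"): "arm",
        ("dat", "sg"): "armi",
        ("gen", "sg"): "arms",
        ("nom", "pl"): "armar",
        ("acc", "pl"): "arma",
        ("dat", "pl"): "ǫrmum",
        ("gen", "pl"): "arma",
    },
}


def build_form_index(paradigms):
    """Map every declined form to the (headword, case, number) readings it can have."""
    index = defaultdict(list)
    for headword, table in paradigms.items():
        for (case, number), form in table.items():
            index[form].append((headword, case, number))
    return index


def analyse(form, index):
    """Return all readings of a declined form, or an empty list if it is unknown."""
    return index.get(form, [])


FORM_INDEX = build_form_index(PARADIGMS)
```

An ambiguous form such as arma yields two readings (accusative plural and genitive plural), which is the kind of output a look-up tool attached to the saga texts would have to present to the reader.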

With respect to integration, we are working with the Perseus group around Jeff Rydberg-Cox to integrate the Old Norse material into their text collection. In particular, the parser output needs to be adjusted to meet the Perseus requirements, and the underlying dictionary has to be ported into XML. The latter requirement has been fulfilled using the Text Encoding Initiative (TEI) standard XML conventions. For example, the electronic Zoega entry

abbadis abbadis pl. -ar f abbess

has been modified into the following format:

<entry>
  <form>
    <orth>abbadis</orth>
  </form>
  <form>
    <number>pl.</number>
    <orth extent='part'>-ar</orth>
  </form>
  <gramGrp>
    <gram type='pos'>f</gram>
  </gramGrp>
  <sense n='1'>abbess</sense>
</entry>
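The conversion shown above can be sketched in a few lines. This is a hedged illustration in Python rather than the project's own conversion code. It assumes that a flat entry always has the whitespace-separated layout seen in the example (look-up key, headword, number label, ending, part of speech, gloss), which real Zoega entries will often not follow. The function name entry_to_tei is hypothetical.

```python
import xml.etree.ElementTree as ET


def entry_to_tei(line):
    """Convert one flat dictionary line into a TEI-style <entry> element.

    Assumed layout (hypothetical): key headword number ending pos gloss...
    """
    key, headword, number, ending, pos, *gloss = line.split()
    entry = ET.Element("entry")

    # First <form>: the headword itself.
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword

    # Second <form>: the partial (ending-only) inflected form.
    inflected = ET.SubElement(entry, "form")
    ET.SubElement(inflected, "number").text = number
    ET.SubElement(inflected, "orth", extent="part").text = ending

    # Grammatical information and the English gloss.
    gram_grp = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram_grp, "gram", type="pos").text = pos
    ET.SubElement(entry, "sense", n="1").text = " ".join(gloss)
    return entry


xml_text = ET.tostring(
    entry_to_tei("abbadis abbadis pl. -ar f abbess"), encoding="unicode"
)
```

The serialized result has the same elements and attributes as the entry above, without the indentation and with double-quoted attribute values.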

We are currently completing the Perl filter to adjust parser output.

Cultural Heritage Language Technologies

IST-2001-32745

1 December 2003 – 31 May 2004

Workpackage 4: Old Norse Morphological Analyser

University of California, Los Angeles

University of Copenhagen

Timothy R. Tangherlini

Matthew Driscoll

1. Summary of key indicators of project progress

a. Parser error reduced

b. Parser working with SE and Diplomatic texts in beta-environment, Perseus and Greenstone DL.

2. Work Progress Overview

a. Restructured parser

b. Added beta version of B-parser

c. Developed more articulated rules for morphology

d. Incorporated these rules into the parser, eliminating significant errors in output

e. Tested Standard Edition texts and diplomatic edition texts with the B-parser

f. Tested integration of parser with new java-based version of Perseus

g. Began integration of parser with texts in Greenstone DL, and visualization tools developed by Imperial College consortium members.

3. Project Management

The project continues to be managed at UCLA by Prof. Timothy Tangherlini and, for our closest collaborators at the Univ. of Copenhagen, by Prof. Matthew Driscoll. Aurelijus Vijunas continues to work with the team as a Graduate Student Researcher, and Dr. Kryztof Urban continues in his post as Postdoctoral researcher – project development lead at UCLA. Other project management has been provided through the central CHLT office by Dr. Dolores Iorizzo.

1. Summary of key indicators of project progress

1.1 Overview of objectives

1.1.1 User feedback

User feedback, as measured by the limited-access release, continues to be overwhelmingly positive. Experts in Old Norse language and literature in the United States, Denmark and Iceland continue testing the analyzer in a systematic fashion, suggesting refinements through a newly implemented online feedback system. This feedback, together with the work of Aurelijus Vijunas in detailing more elaborate rules for adjectives, strong verbs and compounds, and a more detailed series of rules for ablaut and syncope, has helped us to continue refining the underlying code of analyzer A, to implement a faster and more efficient version of analyzer B, and to test a series of texts, both Standard Edition and diplomatic edition, for use with morphological look-up and dictionary look-up. Because of this feedback, our error rate is now within the target range of < 1.0% for all but the strong classes of verbs.

1.1.2 Task allocation

Tasks are allocated at UCLA as follows:

a. Timothy Tangherlini is in charge of coordination, overall theoretical focus, collaboration, and first-line testing of routines to be implemented. Tangherlini is also in charge of coordinating the efforts of scholars at other institutions to provide consistent feedback on the analyzer and to develop a series of queries that can be tested with the visualization tools developed by the team headed by Prof. Stefan Rueger at Imperial College.

b. Kryztof Urban is charged with implementing the refined rule set for morphological classes, with making the analyzer code more efficient, with developing new ideas for efficiency, scalability and interoperability in the parser, and with error testing. Urban is also in charge of integration and of understanding the APIs of other systems for which the analyzer is intended to be a “plug in.”

c. Aurelijus Vijunas is charged with developing a series of detailed rules for various difficult morphological cases, and with developing elaborated explanations of the forms as a means of increasing the accuracy of the analyzer. He is also in charge of first-level testing of all forms generated by the analyzer.

1.2 Overall assessment of main milestones, results, or deliverables

1.2.1 Significant increase in accuracy of parser as a result of restructuring the underlying code.

1.2.2 Significant increase in accuracy of parser as a result of implementing more detailed rules based on the work of Aurelijus Vijunas

1.2.3 Development of an available series of descriptive documents detailing the more articulated morphological rules functional in the restructured parser.

2. Work Progress Overview

2.1 Specific objectives for the reporting period

a. Restructure parser for efficiency and to reduce error rate

b. Develop clearer rules for paradigm generation

c. Develop well-described sets of information for irregular forms

d. Test implementation of B-parser with Standard Edition texts and Diplomatic transcriptions

e. Work on integration of parser into other Digital Library environments based on an understanding of the APIs for these environments – in this instance, Greenstone DL and Perseus DL.

f. Continue to collaborate with the AMI on the integration of all of these resources in a rich environment

2.2 Achievements

2.2.1 List of Deliverables

a. Beta version of B-parser

b. Test version of SE and diplomatic edition texts

2.2.2 Progress of Workpackage/Tasks

Parser update

Since the last report, the parser has undergone major structural changes. In addition, its performance and scope have been improved.

Structural changes

Initially, the parser was designed to create declension tables on the fly using the Zoega dictionary. As the online test version demonstrated, it performed well for most nouns and adjectives, and also for roughly half of the verbs in the dictionary. However, the program’s general design needed to be improved for the following reasons:

Source over-specialization: To maximize code efficiency, the structure of the Zoega dictionary had been built into the code to a high degree. To work with sources other than Zoega, dictionary information and parsing code needed to be separated using a modular design.

Output over-specialization: The old code was designed to work as the backbone for an online parser. Because other scholars are interested in integrating our program and its output into their own projects, the parser had to become more flexible with respect to its output (for example, plain text files, HTML files, database tables, etc.).

Clarity of code: The success of our project led to interest in the code from a number of scholars in Europe and the US. What started as a coding project conducted by an individual now needs to be accessible to the public. Following accepted coding standards, the parser code was made more object-oriented.

Redundancy: The old parser treated every request for a declension table separately. Thus, if the same table was requested multiple times, the parser generated it anew each time. The new parser generates each table only once, thereby increasing efficiency.

Environmental dependency: The old parser relied heavily on Perl, CGI and MySQL. However, recreating the required computational environment can be difficult when moving code from one server to another. The new code requires Perl only.
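The redundancy point above is, in effect, a description of memoization: a table is computed on first request and reused afterwards. A minimal sketch in Python, assuming an in-memory cache (the report does not say how the Perl code avoids regeneration, and it may simply precompute all tables once). The function declension_table and the CALLS counter are hypothetical, and the table it returns is a placeholder.

```python
from functools import lru_cache

# Counts how often a table is actually generated (for illustration only).
CALLS = {"count": 0}


@lru_cache(maxsize=None)
def declension_table(headword):
    """Generate the table for a headword once; repeat requests hit the cache."""
    CALLS["count"] += 1
    # Placeholder for real paradigm generation.
    return (("nom", "sg", headword),)
```

Requesting the table for the same headword twice leaves CALLS["count"] at 1, whereas the old behaviour described above corresponds to incrementing it on every request.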

Specifically, the old code was split into three module types. Module 1 rewrites the Zoega dictionary into a parser-readable format. Module 2 consists of the parser itself; its output contains all declension paradigms in a simple text file. Now that the two processes are separated, additional sources can be added by writing another specialized Module 1 for that source without having to change the parser code. Both modules are object-oriented (i.e., they use packages written in Perl) and minimize code by sharing frequently used subroutines. Finally, Module 3 converts the results into different output formats, such as HTML pages.
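The three-module split can be illustrated with a small pipeline. This is a sketch in Python under stated assumptions, not a port of the Perl packages. The function names, the two-field input format ("headword part-of-speech") and the placeholder rule in the parser stage are all hypothetical. The point is only that the source adapter and the output writers can be swapped without touching the parser stage.

```python
from html import escape


def read_source(lines):
    """Module 1 (source adapter): turn source-specific lines into neutral records."""
    for line in lines:
        headword, pos = line.split()[:2]
        yield {"headword": headword, "pos": pos}


def generate_paradigms(records):
    """Module 2 (parser): emit paradigm rows; the rule used here is a placeholder."""
    for rec in records:
        yield (rec["headword"], rec["pos"], "nom.sg", rec["headword"])


def to_text(rows):
    """Module 3, variant a: tab-separated plain text."""
    return "\n".join("\t".join(row) for row in rows)


def to_html(rows):
    """Module 3, variant b: a bare HTML table."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"
```

Supporting a dictionary other than Zoega would then mean writing a second function alongside read_source, while generate_paradigms and the output writers stay unchanged.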

The new parser design addresses all of the issues listed above. In addition, it brings new clarity to a fairly complex coding project.

Morphology: More Articulated Rules

Over the course of the past three months, under the lead of Aurelijus Vijunas, we have developed far more articulated descriptions of Old Norse morphology. In over thirty files, we describe the inflection of many morphological classes of Old Norse words as well as various phonological processes that characterise Old Norse inflections (such as vowel assimilations [umlauts], syncope, shortening, and various other euphonic changes). To date, the nouns, the adjectives, the numerals, and some of the verbs have been fully covered. What remains is the rest of the verbs, the pronouns, and a review. We anticipate that the creation of verb paradigms can be finished by the end of June, and that the remaining time at this stage will be used for review. These files will also be available on our site and form the basis for the new rules incorporated into the parser described above. The files to date describe:

1. adjectives in –ligr

2. consonantal nouns

3. feminine wō-stems

4. feminine monosyllaba (á, gjá, slá)

5. I class str. v.

6. II class str. v.

7. III class str. v.

8. IV class str. v.

9. V class str. v.

10. VI class str. v. (review !)

11. i-stems

12. ja-stems

13. jō-stems

14. j-stem adjectives (sekr, nýr)

15. i-stems of bekkr-type

16. nouns of relationship

17. n-stems

18. numerals

19. rk (don’t count, but think of various phonological – computeristic problems)

20. singulare tantum nouns (think of plurale tantum!)

21. ija-stems

22. system words

23. u-umlaut (a/i- as well, if needed)

24. u-stems

25. wa-nouns

26. words in –an

27. words in –andi

28. words in –ll

29. words with -nnr-

30. words with stem in –r

31. w-stem adjectives
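Phonological processes such as the u-umlaut in item 23 lend themselves to small rule functions that are applied when a stem and an ending are joined. The following Python sketch uses a deliberately simplified rule: a stem whose only vowel is a changes that vowel to ǫ when the ending begins with u. The function name and the vowel inventory are hypothetical and are not taken from the project's rule files. The real rules also have to handle unstressed syllables, lost endings, and exceptions.

```python
# Simplified vowel inventory for normalized Old Norse (an assumption).
VOWELS = set("aáeéiíoóuúyýæœøǫ")


def apply_u_umlaut(stem, ending):
    """Join stem and ending, changing a lone root vowel a to ǫ before an ending in u-."""
    stem_vowels = [ch for ch in stem if ch in VOWELS]
    if ending.startswith("u") and stem_vowels == ["a"]:
        stem = stem.replace("a", "ǫ")
    return stem + ending
```

With this rule, the stem arm with the dative plural ending -um gives ǫrmum, while armar (ending -ar) and hestum (root vowel e) are left unchanged.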

2.2.4 Work planned for next reporting period

a. Continue work on the description of irregular word classes

b. Correct input errors and scanning errors in underlying lookup database.

c. Develop a simple interface for adding the necessary information to headwords harvested from the Ordbog over det norrøne prosasprog at the University of Copenhagen, to take advantage of changes in the underlying parser structure.

3.1 Co-operation within the consortium, including project meetings

a. Kryztof Urban will attend the consortium meeting in Pisa/London in June, 2004

b. Timothy Tangherlini and Aurelijus Vijunas have met in Los Angeles with Stefan Rueger to discuss aspects of integration with Greenstone DL, as well as the visualization tools, keyword extraction routines, and clustering developed at Imperial by Rueger’s team.

c. Timothy Tangherlini has met in Los Angeles with Gregory Crane to discuss aspects of integration into Perseus and future ideas for development.

3.2 Participation in workshops, conferences, publications

a. Timothy Tangherlini presented interim results at the Society for the Advancement of Scandinavian Study, April 2004.

CONFERENCES:

Society for the Advancement of Scandinavian Study, April 2004.