Workpackage 4: Old Norse Analyser and Texts
Executive Summary
Year 2 Quarterly Progress Reports
Cultural Heritage Language Technologies
IST-2001-32745
June 1 – August 31, 2003
Workpackage 4: Old Norse Morphological Analyser
University of California, Los Angeles
Timothy R. Tangherlini
1. Summary of key indicators of project progress
Reduced error rate in Analyzer A
Produced program specifications for Analyzer B
Ran output tests on Analyzer A for integration
2. Work Progress Overview
1. Completed beta-version of Analyzer A
2. Began refining code of Analyzer A to reduce error rates
3. Outlined code for Analyzer B
4. Designed a web-forms-based system for the input and correction of irregular forms; beta version to be implemented in next quarter
5. Proofread dictionary entries from Zoega
6. Completed regularization of tagging of dictionary entries and look-up tool
7. Developed plans for integration of Analyzers A and B into the Perseus environment
8. In conjunction with the AMI team at University of Copenhagen, regularized the mark-up of diplomatic editions. This is an added feature, so that the analyzer will be able to work on <reg> fields of transcribed mss along with the regularized SE of the sagas
9. Planned for addition of approximately 40,000 headwords and part-of-speech information from the Ordbog over det norrøne prosasprog (Dictionary of Old Norse Prose) at the Univ. of Copenhagen
10. Planned for expansion of database of irregular forms
11. Planned for integrated text environment, linking mss to transcriptions to SE
12. Completed a database of secondary sources of scholarship on the Fornaldar søgur. This complements the primary source database compiled by the AMI team at the Univ. of Copenhagen.
3. Project Management
The project has been managed at UCLA by Prof. Timothy Tangherlini and at the Univ. of Copenhagen by Prof. Matthew Driscoll. Additional coordination and project management have been provided by Dr. Zoe Borovsky at the Center for Digital Humanities at UCLA. Essential project management has been provided through the central CHLT office by Dolores Iorizzo, who coordinates the efforts of all the CHLT partners.
1. Summary of key indicators of project progress
1.1 Overview of objectives
1.1.1 Main objectives for quarter
To develop analyzer A with a relatively low error rate
To begin the coding of analyzer B
To design a web-forms-based system for adding irregular forms and for error correction
To develop a database of secondary sources for the Fornaldar søgur
To continue to collaborate with the AMI on the integration of all of these resources in a rich environment
1.1.1 User feedback
User feedback from the limited-access release has been overwhelmingly positive. Experts in Old Norse language and literature have begun working with the analyzer, and have helped by identifying weaknesses (generally in the database of irregular forms) and suggesting refinements for our analyzer A. This feedback has helped us begin making refinements to the underlying code of analyzer A.
1.1.2 Task allocation
At UCLA, Kryztof Urban continued to take the lead role in coding analyzer A and in designing the structure of the database of irregular forms. Bruce Dumes provided additional coding support to Kryztof Urban. Additional work on the database of secondary sources was provided by Randall Gordon. Zoe Borovsky coordinated server space and support, and provided additional intellectual guidance to Randall Gordon. Prof. Timothy Tangherlini continues to provide overall coordination and direction for the various tasks undertaken by the UCLA team.
1.2 Overall assessment of main milestones, results, or deliverables
The project team believes that we have not only achieved our main milestones for this period, but have also made early progress on some of our future milestones:
1.2.1 Our main analyzer is complete, and now being refined.
1.2.2 We have mapped out integration for the analyzer and SE texts into the Perseus environment.
1.2.3 We have expanded the coverage of our analyzer to diplomatic transcriptions of mss.
1.2.4 Our dictionary is complete, and will now be expanded with additional headwords from the ONP, thereby covering all of the prose language of Old Norse.
1.2.5 We have designed a clear mechanism for correcting forms generated by the analyzer.
1.2.6 We have developed a database of secondary sources for the study of our pilot saga texts, the Fornaldar søgur.
2. Work Progress Overview
2.1 Specific objectives for the reporting period
- Completion of Analyzer A Beta
- Design of Analyzer B
- Design of error-correction reporting forms
- Completion of Dictionary mark-up and integration into Analyzer
- Completion of database of secondary sources
2.2 Achievements
2.2.1 List of Deliverables
- Completed Analyzer A Beta
- Designed Analyzer B
- Designed error-correction reporting forms
- Completed markup and integration of Dictionary into Analyzer
2.2.2 Progress of Workpackage/Tasks
All completed as planned.
2.2.3 Deviations if any and corrective action
- Decided to begin the process of expanding the headwords in the dictionary to include all of the headwords from ONP
- Decided to integrate the analyzer with the diplomatic transcriptions of mss as well
2.2.4 Work planned for next reporting period
- Complete Analyzer B beta
- Implement error correction forms for irregular forms
- Greatly expand database of irregular forms
- Continue integration of analyzer, transcriptions and SE texts into rich text environment
- Test visualization tools on lemmatized (pseudo)texts
3. Co-operation within the consortium, including project meetings
- Urban and Tangherlini attended the successful consortium meeting held at Imperial College, London, in early June 2003.
- Stefan Rueger from the consortium came to UCLA in July 2003 to work with us on integrating our work with the visualization tools.
3.1 Participation in workshops, conferences, publications
- Urban presented the work on the Analyzer at the Society for the Advancement of Scandinavian Study.
CONFERENCES:
Society for the Advancement of Scandinavian Study Annual Conference
Cultural Heritage Language Technologies (IST-2001-32745)
Workpackage 4: Old Norse Morphological Analyser
1 September - 30 November 2003
Scandinavian Section, UCLA
Det Arnamagnæanske Institut, Københavns Universitet
Tim Tangherlini and Mat Driscoll
1 Introduction
This report summarizes the previous and most recent achievements for the Old Norse morphological analyzer, look-up tool, and digital corpus.
2 Previous achievements
This project can best be divided into two parts. The main work on the morphological analyser and the look-up tool has taken place at UCLA. The main work on the mark-up of diplomatic editions of mss has taken place at Københavns Universitet.
The initial development of a morphological parser has been described in detail in our previous report. In summary, the Los Angeles group implemented a morphological analyzer for normalized Old Norse texts using the standard Fornrit normalizations. As input, the parser accepts headwords from the complete Zoega Old Norse dictionary. For this purpose, a cleaned-up electronic version of the dictionary was created. The parser outputs declension tables for the headword. A trial version of the program can be found at http://ecampusdev.humnet.ucla.edu/curban/index.html. It was written in Perl using a MySQL database to ensure high portability.
The Copenhagen group of researchers has produced a preliminary markup of six sagas from the Fornaldar saga corpus in XML format based on conventions developed by Matthew Driscoll and his colleagues. The marked-up texts include both diplomatic and facsimile documents.
3 Current work
The project members are currently focusing on performance improvement and integration of the parser. On the performance side, a first implementation of a second parser has been released. Instead of accepting only dictionary headwords, the second parser is intended to accept all declined word forms. This improved analyzer can be connected to the XML Old Norse saga texts. A usage sample can be found in the "Old Norse Morphological Analyzer" report. The Los Angeles group is currently improving the performance of both parsers.
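One way to picture the difference between the two parsers is as a reverse index over generated paradigms: the first parser goes from a headword to its forms, while the second has to go from an attested form back to a headword and its grammatical description. The following Python sketch illustrates that idea only. It is not the project's Perl code, the function and variable names are hypothetical, and the single paradigm (the standard declension of the noun armr, "arm") is entered by hand purely as sample data.

```python
from collections import defaultdict

# Hand-entered sample data (an assumption for illustration): the kind of
# paradigm a headword-based parser could produce for the noun "armr".
PARADIGMS = {
    "armr": {
        ("nom", "sg"): "armr",
        ("acc", "sg"): "arm",
        ("dat", "sg"): "armi",
        ("gen", "sg"): "arms",
        ("nom", "pl"): "armar",
        ("acc", "pl"): "arma",
        ("dat", "pl"): "ǫrmum",
        ("gen", "pl"): "arma",
    },
}


def build_form_index(paradigms):
    """Map every inflected form to the (headword, case, number) readings it can have."""
    index = defaultdict(list)
    for headword, table in paradigms.items():
        for (case, number), form in table.items():
            index[form].append((headword, case, number))
    return index


def analyse(form, index):
    """Return all readings of a form, or an empty list if the form is unknown."""
    return index.get(form, [])


index = build_form_index(PARADIGMS)
```

With this index, an ambiguous form such as "arma" comes back with both its accusative and genitive plural readings, which is the kind of output a text-linked look-up tool would need to present to the reader.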
With respect to integration, we are working with the Perseus group around Jeff Rydberg-Cox to integrate the Old Norse material into their text collection. In particular, the parser output needs to be adjusted to meet the Perseus requirements, and the underlying dictionary has to be ported into XML. The latter requirement has been fulfilled using the Text Encoding Initiative (TEI) standard XML conventions. For example, the electronic Zoega entry
abbadis abbadis pl. -ar f abbess
has been modified into the following format:
<entry>
  <form>
    <orth>abbadis</orth>
  </form>
  <form>
    <number>pl.</number>
    <orth extent='part'>-ar</orth>
  </form>
  <gramGrp>
    <gram type='pos'>f</gram>
  </gramGrp>
  <sense n='1'>abbess</sense>
</entry>
We are currently completing the Perl filter to adjust parser output.
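The dictionary-to-TEI step described above can be sketched in a few lines. The example below is a minimal Python illustration, not the project's Perl filter: the function name `make_entry` and its parameters are hypothetical, and it assumes the flat entry has already been split into headword, inflection, part of speech and gloss. It reproduces the element structure of the abbadis entry using only the standard library.

```python
import xml.etree.ElementTree as ET


def make_entry(headword, pos, gloss, inflections=()):
    """Build a TEI-style <entry> element from already-separated dictionary fields.

    `inflections` is a sequence of (number_label, ending) pairs, e.g. [("pl.", "-ar")].
    """
    entry = ET.Element("entry")

    # Main form: the headword itself.
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword

    # One additional <form> per listed inflectional ending.
    for number, ending in inflections:
        infl = ET.SubElement(entry, "form")
        ET.SubElement(infl, "number").text = number
        ET.SubElement(infl, "orth", extent="part").text = ending

    # Grammatical information and a single sense.
    gram_grp = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram_grp, "gram", type="pos").text = pos
    ET.SubElement(entry, "sense", n="1").text = gloss
    return entry


entry = make_entry("abbadis", "f", "abbess", [("pl.", "-ar")])
xml_text = ET.tostring(entry, encoding="unicode")
```

Building the tree with an XML library rather than by string concatenation guarantees well-formed output and correct escaping, which matters once glosses contain characters such as "&" or "<".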
Cultural Heritage Language Technologies
IST-2001-32745
1 December 2003 – 31 May 2004
Workpackage 4: Old Norse Morphological Analyser
University of California, Los Angeles
University of Copenhagen
Timothy R. Tangherlini
Matthew Driscoll
1. Summary of key indicators of project progress
a. Parser error reduced
b. Parser working with SE and Diplomatic texts in beta-environment, Perseus and Greenstone DL.
2. Work Progress Overview
a. Restructured parser
b. Added beta version of B-parser
c. Developed more articulated rules for morphology
d. Incorporated these rules into the parser, eliminating significant errors in output
e. Tested Standard Edition texts and diplomatic edition texts with the B-parser
f. Tested integration of parser with new Java-based version of Perseus
g. Began integration of parser with texts in Greenstone DL, and visualization tools developed by Imperial College consortium members.
3. Project Management
The project continues to be managed at UCLA by Prof. Timothy Tangherlini and, for our closest collaborators at the Univ. of Copenhagen, by Prof. Matthew Driscoll. Aurelijus Vijunas continues to work with the team as a Graduate Student Researcher, and Dr. Kryztof Urban continues in his post as Postdoctoral Researcher and project development lead at UCLA. Other project management has been provided through the central CHLT office by Dr. Dolores Iorizzo.
1. Summary of key indicators of project progress
1.1 Overview of objectives
1.1.1 User feedback
User feedback from the limited-access release continues to be overwhelmingly positive. Experts in Old Norse language and literature in the United States, Denmark and Iceland continue to test the analyzer systematically and to suggest refinements through a newly implemented online feedback system. This feedback, together with the work of Aurelijus Vijunas in detailing more elaborate rules for adjectives, strong verbs and compounds, and a more detailed series of rules for ablaut and syncope, has helped us to continue refining the underlying code of analyzer A, to implement a more efficient and faster version of analyzer B, and to test a series of texts, both Standard Edition and diplomatic edition, for use with morphological look-up and dictionary look-up. As a result, our error rate is now within the target range of < 1.0% for all but the strong classes of verbs.
1.1.2 Task allocation
Tasks are allocated at UCLA as follows:
a. Timothy Tangherlini is in charge of coordination, overall theoretical focus, collaboration, and first-line testing of routines to be implemented. Tangherlini is also in charge of coordinating the efforts of scholars at other institutions to provide consistent feedback on the analyzer and to develop a series of queries that can be tested with the visualization tools developed by the team headed by Prof. Stefan Rueger at Imperial College.
b. Kryztof Urban is charged with implementing the refined rule set for morphological classes, with making the analyzer code more efficient, with developing new ideas to improve the efficiency, scalability and interoperability of the parser, and with error testing. Urban is also in charge of integration and of understanding the APIs of other systems for which the analyzer is intended to be a "plug-in."
c. Aurelijus Vijunas is charged with developing a series of detailed rules for various difficult morphological cases, and with developing elaborated explanations of the forms as a means of increasing the accuracy of the analyzer. He is also in charge of first-level testing of all forms generated by the analyzer.
1.2 Overall assessment of main milestones, results, or deliverables
1.2.1 Significant increase in accuracy of parser as a result of restructuring the underlying code.
1.2.2 Significant increase in accuracy of parser as a result of implementing more detailed rules based on the work of Aurelijus Vijunas.
1.2.3 Development of an available series of descriptive documents detailing the more articulated morphological rules functional in the restructured parser.
2. Work Progress Overview
2.1 Specific objectives for the reporting period
a. Restructure parser for efficiency and to reduce error rate
b. Develop clearer rules for paradigm generation
c. Develop well-described sets of information for irregular forms
d. Test implementation of B-parser with Standard Edition texts and Diplomatic transcriptions
e. Work on integration of parser into other Digital Library environments based on an understanding of the APIs for these environments, in this instance Greenstone DL and Perseus DL.
f. Continue to collaborate with the AMI on the integration of all of these resources in a rich environment
2.2 Achievements
2.2.1 List of Deliverables
a. Beta version of B-parser
b. Test version of SE edition and diplomatic texts
2.2.2 Progress of Workpackage/Tasks
Since the last report, the parser has undergone major structural changes. In addition, its performance and scope have been improved.
Initially, the parser was designed to create declension tables on the fly using the Zoega dictionary. As the online test version demonstrated, it performed well for most nouns and adjectives, and also for roughly half of the verbs in the dictionary. However, the program's general design needed to be improved for the following reasons:
In particular, the old code was split up into three module types. Module 1 specializes in rewriting the Zoega dictionary into a parser-readable format. Module 2 consists of the parser itself. Its output contains all declension paradigms in a simple text file. Now that the two processes are separated, additional sources can be added by writing another specialized Module 1 for that source without having to change the parser code. Both modules are object-oriented (i.e., using packages written in Perl) and minimize code by sharing frequently used subroutines. Finally, Module 3 converts the results into different output formats, such as HTML pages.
The new parser design answers all requests listed above. In addition, it brings new clarity to a fairly complex coding project.
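The three-module separation can be illustrated with a small Python sketch. This is an assumption-laden illustration rather than the project's Perl packages: all function names, the pipe-delimited input format, and the toy rule set (regular strong masculine nouns in -r with no sound changes, such as hestr, "horse") are hypothetical and chosen only to show how the stages stay independent.

```python
from html import escape

# Toy rule set (assumption): endings for a regular strong masculine noun in -r.
# Sound changes such as u-umlaut are deliberately not handled here.
A_STEM_MASC_ENDINGS = [
    ("nom.sg", "r"), ("acc.sg", ""), ("dat.sg", "i"), ("gen.sg", "s"),
    ("nom.pl", "ar"), ("acc.pl", "a"), ("dat.pl", "um"), ("gen.pl", "a"),
]


def read_source(lines):
    """Module 1: rewrite one source's raw lines into parser-readable records.

    The 'headword|pos|gloss' input format is hypothetical; a different source
    would get its own reader that returns the same record shape.
    """
    records = []
    for line in lines:
        headword, pos, gloss = (field.strip() for field in line.split("|"))
        records.append({"headword": headword, "pos": pos, "gloss": gloss})
    return records


def generate_paradigm(record):
    """Module 2: generate (label, form) pairs; knows nothing about the source format."""
    stem = record["headword"][:-1]  # strip the nominative singular -r
    return [(label, stem + ending) for label, ending in A_STEM_MASC_ENDINGS]


def to_text(headword, paradigm):
    """Module 3a: plain tab-separated text output."""
    return "\n".join(f"{headword}\t{label}\t{form}" for label, form in paradigm)


def to_html(headword, paradigm):
    """Module 3b: the same paradigm rendered as an HTML table."""
    rows = "".join(
        f"<tr><td>{escape(label)}</td><td>{escape(form)}</td></tr>"
        for label, form in paradigm
    )
    return f"<table><caption>{escape(headword)}</caption>{rows}</table>"


records = read_source(["hestr | m | horse"])
paradigm = generate_paradigm(records[0])
```

Because `generate_paradigm` only ever sees the normalized record, adding a second dictionary means writing a second reader, and adding a new output format means writing a second formatter; the parser stage itself is untouched in both cases.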
Morphology: More Articulated Rules
Over the course of the past three months, under the lead of Aurelijus Vijunas, we have developed far more articulated descriptions of Old Norse morphology. In over thirty files, we describe the inflection of many morphological classes of Old Norse words as well as various phonological processes that characterise Old Norse inflections (such as vowel assimilations [umlauts], syncope, shortening, and various other euphonic changes). To date, the nouns, the adjectives, the numerals, and some of the verbs have been fully covered. The remaining part includes the rest of the verbs, the pronouns, and a review. We anticipate that the creation of verb paradigms can be finished by the end of June, and the remaining time at this stage would be used for review. These files will also be available on our site and form the basis for the new rules incorporated into the parser described above. The files to date describe:
1. adjectives in -ligr
2. consonantal nouns
3. feminine wō-stems
4. feminine monosyllaba (á, gjá, slá)
5. I class str. v.
6. II class str. v.
7. III class str. v.
8. IV class str. v.
9. V class str. v.
10. VI class str. v. (review!)
11. i-stems
12. ja-stems
13. jō-stems
14. j-stem adjectives (sekr, nýr)
15. i-stems of bekkr-type
16. nouns of relationship
17. n-stems
18. numerals
19. rk (don't count, but think of various phonological and computational problems)
20. singulare tantum nouns (think of plurale tantum!)
21. ija-stems
22. system words
23. u-umlaut (a/i- as well, if needed)
24. u-stems
25. wa-nouns
26. words in -an
27. words in -andi
28. words in -ll
29. words with -nnr-
30. words with stem in -r
31. w-stem adjectives
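To give a sense of how one of the phonological processes in this list might be expressed as a parser rule, the Python sketch below applies u-umlaut (a becomes ǫ when the following syllable contains u) while attaching an ending. It is a deliberately narrow illustration and not the project's rule files: the function name is hypothetical, and it only covers stems with a single short a in the root syllable. Polysyllabic stems, where an unstressed a behaves differently, are outside its scope.

```python
def attach_with_u_umlaut(stem, ending):
    """Attach an ending to a stem, applying u-umlaut (a -> ǫ) where it is triggered.

    Simplifying assumption: the stem has one short 'a' in its root syllable.
    Long 'á' is a different character and is left untouched.
    """
    if ending.startswith("u"):
        pos = stem.rfind("a")
        if pos != -1:
            stem = stem[:pos] + "ǫ" + stem[pos + 1:]
    return stem + ending


# Dative plural -um triggers the change; nominative plural -ar does not.
dat_pl = attach_with_u_umlaut("dag", "um")
nom_pl = attach_with_u_umlaut("dag", "ar")
```

Keeping each sound change in its own small function, applied at the point where stem and ending meet, is one way to let the written rule descriptions and the code be checked against each other form by form.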
2.2.4 Work planned for next reporting period
a. Continue work on the description of irregular word classes
b. Correct input errors and scanning errors in the underlying lookup database.
c. Develop a simple interface to add necessary information to headwords harvested from the Ordbog over det norrøne prosasprog at the University of Copenhagen, to take advantage of changes in the underlying parser structure.
3.1 Co-operation within the consortium, including project meetings
a. Kryztof Urban will attend the consortium meeting in Pisa/London in June 2004.
b. Timothy Tangherlini and Aurelijus Vijunas have met in Los Angeles with Stefan Rueger to discuss aspects of integration with Greenstone DL, as well as the visualization tools, keyword extraction routines, and clustering developed at Imperial by Rueger's team.
c. Timothy Tangherlini has met in Los Angeles with Gregory Crane to discuss aspects of integration into Perseus and future ideas for development.
3.2 Participation in workshops, conferences, publications
a. Timothy Tangherlini presented interim results at the Society for the Advancement of Scandinavian Study, April 2004.
CONFERENCES:
Society for the Advancement of Scandinavian Study, April 2004.