Workpackage 4: Old Norse Morphological Analyser
Year 3 Executive Summary
Timothy Tangherlini and Matthew Driscoll
UCLA and University of Copenhagen
In year three we rewrote and refined the code for a morphological analyzer and English-language look-up tool for Old Icelandic that is interoperable with the CHLT-Perseus and Greenstone digital library systems. Our teams at UCLA and the Arnamagnaean Institute (AMI) at the University of Copenhagen linked the morphological analyzer to diplomatic and normalized editions of manuscripts (transcribed and hand-tagged in XML-TEI for CHLT at AMI), which are in turn linked to images of the manuscript pages (created for CHLT at AMI). We integrated these into the CHLT-Perseus Digital Library System at Tufts University, where the morphological analyzer/look-up tool now works with the diplomatic texts and manuscripts as well as with the Standard Edition texts of the legendary sagas (Fornaldar sögur). We have also worked with Imperial College London to incorporate our tools and texts within the visualization programme developed for CHLT.
The details of this work break down into six thematic areas:
(i) Underlying Code: We changed the underlying code of the morpho-syntactic parser to make it 'object oriented' so that the rules set for Old Norse took the form of a 'module' (rather than hard coded in the actual parser).
(ii) Rules: We developed far more precise rules than the Year 2 version for the various word classes.
(iii) Phenomena: We planned a strategy to deal with certain phenomena unique to the language (umlaut, Verner's Law, syncope) and implemented it to increase parser accuracy.
(iv) Lexical Sets: We fixed most of the problems with the lexical sets, making sure that we had eliminated most spelling errors.
(v) Integration: We integrated the parser with (a) the diplomatic marked-up texts and (b) the normalized marked-up texts, (c) integrated both with Perseus, and (d) integrated the results with the Imperial College visualization and clustering tool.
(vi) Dissemination and Exploitation of Results: We began work on a proposal for future work that builds on the results of CHLT.
These results mean that, for the first time, both students and scholars can search across the Fornaldar sögur with morphological analysis tools that provide paradigms for all Old Norse words in Zoëga. This opens up new ways of studying both the language and the literature of Old Norse, which is for the most part inaccessible to the uninitiated and elusive even to those who know the tradition.
The most tangible results of CHLT are the following:
1) Old Norse Morphological Analyzer (http://krispy.humnet.ucla.edu/~curban/)
2) Old Norse TEI-Transcription Guidelines (see below)
3) Creation of TEI hand-tagged transcriptions of Old Norse MSS
4) Creation of Images of Old Norse MSS
5) Integration of Old Norse Texts with Morphological Analyzer in Integrated Reading Environment
D. 4.7: www.chlt.org and http://www.perseus.tufts.edu/hopper/collection.jsp?collection=Perseus:Collection:Germanic (this is a dynamic web-based deliverable)
6) Co-authored article, by Urban, Aurelijus Vijunas and Tangherlini, “Toward an automated morphological analyzer for the study of Old Icelandic Texts”, in preparation for the Journal of English and Germanic Philology.
7) Strategy for dissemination and continuation of CHLT (see NSF Proposal below)
Appendix
(2) Transcription Guidelines for Old Norse on CHLT Project
A. Text
1. The text should be transcribed exactly as it is with respect to orthography and spacing. With the exception of small capitals, used to denote geminates (principally N and R, but potentially also D, G, M, S and T), variant forms of the same letter (allographs) need not be distinguished. It may, in some cases, be deemed necessary also to distinguish between:
high and round s
ordinary and round r (r-rotunda)
ordinary and insular forms of f and v
ordinary and uncial forms of d, e, m and t
Note that only ligatures with an independent phonemic value (a and e, double a etc.) are to be represented (using the entities defined by MENOTA); ligatures which are the result of graphic economy should be treated as two separate characters (high s + t, for example).
2. Expand abbreviations in accordance with the normal spelling of the scribe in question, using <expan> to indicate supplied letters:
se<expan>m</expan>
It is not necessary in first-level transcription to indicate the means by which the abbreviation is achieved, although one may choose to do so:
se<expan abbr="&bar;">m</expan>
Abbreviation by suspension may be distinguished from abbreviation by other means (contraction, supralinear symbol etc.) by means of the type attribute:
Hald<expan type="susp">anar</expan>
Expansions of such abbreviations can then be made to display in round brackets, as in a traditional printed edition.
3. Use <supplied reason="omitted"> to indicate letters or words assumed to have been inadvertently omitted by the scribe (which in a printed edition would be placed in angle brackets):
gieck sijdan <supplied reason="omitted">j burt</supplied>
4. Use <supplied reason="illegible"> to indicate letters now unreadable but assumed originally to have been in the manuscript (which in a printed edition would be placed in square brackets):
lid<supplied reason="illegible">z</supplied>
5. Where necessary to the sense, certain emendations and alterations may be made to the text; obvious misspellings, for example, should be corrected using <corr>, with the original reading given as the value of the sic attribute:
<corr sic="giorit">giorir</corr>
With <supplied> and <corr> the attribute resp can be used (although this will not generally be necessary) to indicate the scholar or previous editor responsible for the conjectural emendation:
<corr sic="giorit" resp="MO">giorir</corr>
With <supplied> a source attribute is also available (but for first-level encoding normally not necessary), where the reading is taken from another witness:
ath þeir <supplied reason="omitted" source="AM 152 fol., 76ra">mundu</supplied> sundr ganga
Note that source is not available on <corr>, although logically it should be.
The <supplied> element should only be used when the missing text can be reconstructed with a very high degree of certainty. When such is not the case <gap> should be used instead, with both a reason and an extent attribute. The extent should be given as the number of characters presumed missing, which can then be made to display as a series of small noughts, as is customary in a printed edition.
6. Additions and deletions made in the manuscript by the scribe or in another hand should be indicated with the <add> and <del> elements; further information may (but need not) be given as attribute values.
<add place="margin" hand="scribe">því ha&nscap; var komi&nscap; fra godonum ok kallaþur &stall;on Odins</add>
s<del rend="superpunction">a</del>umra
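The display conventions mentioned in section A (round brackets for expansions, angle brackets for omitted text, square brackets for illegible text, noughts for gaps) can be sketched in a few lines of Python. This is only an illustration, not part of the CHLT toolchain: the function names `render` and `render_fragment`, the wrapper element `seg` and the use of "0" for a small nought are assumptions, and character entities such as `&bar;` would need to be declared before a standard XML parser could read them.

```python
import xml.etree.ElementTree as ET

# Assumed display conventions, following the printed-edition practice
# described in section A: (expansion), <omitted>, [illegible].
BRACKETS = {
    ("expan", None): ("(", ")"),
    ("supplied", "omitted"): ("<", ">"),
    ("supplied", "illegible"): ("[", "]"),
}


def render(elem):
    """Return the print-style display string for an element and its tail."""
    if elem.tag == "gap":
        # extent = number of characters presumed missing, shown as noughts
        body = "0" * int(elem.get("extent", "0"))
    else:
        reason = elem.get("reason") if elem.tag == "supplied" else None
        opening, closing = BRACKETS.get((elem.tag, reason), ("", ""))
        inner = (elem.text or "") + "".join(render(child) for child in elem)
        body = opening + inner + closing
    return body + (elem.tail or "")


def render_fragment(xml_fragment):
    """Render a transcription fragment (mixed text and elements)."""
    # Wrap the fragment in a hypothetical <seg> so it parses as one document.
    return render(ET.fromstring("<seg>" + xml_fragment + "</seg>"))


print(render_fragment('Hald<expan type="susp">anar</expan>'))
print(render_fragment('lid<supplied reason="illegible">z</supplied>'))
```

Elements not listed in `BRACKETS` (for example `<add>` or `<del>`) are passed through as plain text in this sketch; a real stylesheet would treat them separately.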
B. Structure
1. Indicate line-, column- and page-boundaries using the empty milestone tags <lb/>, <cb/> and <pb/>, giving a number for each as the value of the n attribute.
<pb n="1v"/>
These tags should come at the beginning of the line/column/page to which they refer.
2. Large structural divisions in the text, i.e. chapters, should be tagged using <div type="chapter"> and given a number; each <div> will contain one or more <p> elements.
3. Chapter headings should be tagged using <head>, which is placed immediately after <div> and before the first <p>. The nature of the <head>, i.e. whether it is found in the manuscript itself or supplied by an editor, should be indicated in the value of the type attribute.
<head type="rubric">I. Cap<expan>itulum</expan></head>
<head type="supplied">Chapter 3</head>
4. Verses in the text should be tagged using <lg> (line-group) for stanzas and <l> (line) for individual lines:
<lg>
<l>Aukum nu elldana</l>
<l>ad Adilz borg.</l>
</lg>
As they frequently are part of direct speech, verses will normally occur within a <p> (the DTD has been changed to allow for this).
C. Normalisation
1. Place each word inside an <orig> element, giving the normalised form as the value of the reg attribute:
<orig reg="líkaði">lijkadi</orig>
2. Compound words written separately in the manuscript should be grouped together within a single set of <orig> tags:
<orig reg="stórilla">&stall;tór illa</orig>
In the opposite situation, where for example a preposition and its object are written as a single word, the two parts should be treated as separate words, each placed within a set of <orig> tags, but with no space between them:
<orig reg="á">a</orig><orig reg="landi">lande</orig>
3. Marks of punctuation should be outside the <orig> tags.
4. Care must be taken to ensure that tags are placed in such a way so as not to overlap. Where one or more letters have been supplied within a word, for example, the <supplied> tag will obviously come inside the <orig> tag:
<orig reg="þriðjung">þ<supplied reason="omitted">r</supplied>idiung</orig>
When an entire word has been supplied, it is in theory immaterial which element contains the other, but when two or more words are supplied the <supplied> tag must stand outside the <orig> tags:
<supplied reason="omitted"><orig reg="í">j</orig> <orig reg="burt">burt</orig></supplied>
Problems can thus arise in particular when supplied (or added or deleted) text begins within one word and ends within another. In such cases two sets of tags must be used, as in the following example (where the rend attribute has been used in order to ensure proper display):
<orig reg="Frigg">f<expan>ri</expan><supplied reason="illegible" rend="noclose">gg</supplied></orig> <supplied reason="illegible" rend="noopen"><orig reg="heyrir">heyrir</orig> <orig reg="bæn">bęn</orig> <orig reg="þeira">þeirra</orig> <orig reg="ok">ok</orig> <orig reg="segir">segir</orig></supplied>
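Because every word carries its normalised form in the reg attribute, a diplomatic and a normalised reading can be pulled from the same markup. The following Python sketch shows one way this might be done; the function name `readings` and the wrapping `<p>` are assumptions, and the example avoids entities such as `&stall;`, which would have to be declared for a standard XML parser.

```python
import xml.etree.ElementTree as ET


def readings(xml_fragment):
    """Return (diplomatic, normalised) word lists from <orig reg="..."> markup."""
    root = ET.fromstring("<p>" + xml_fragment + "</p>")
    diplomatic, normalised = [], []
    for orig in root.iter("orig"):
        # itertext() also collects letters inside nested <supplied> or <expan>.
        diplomatic.append("".join(orig.itertext()))
        normalised.append(orig.get("reg"))
    return diplomatic, normalised


# A preposition and its object written as one word in the manuscript
# still yield two separate normalised words.
dipl, norm = readings('<orig reg="á">a</orig><orig reg="landi">lande</orig>')
print(dipl, norm)
```

Iterating over `<orig>` elements rather than splitting on whitespace is what allows words written together in the manuscript to be separated in the normalised reading.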
M. J. Driscoll
Last update: 23.05.2005.
7) Strategy for Dissemination, Exploitation and Continuation of CHLT:
2005 NSF Project Proposal based on CHLT results:
From a morphonological point of view, Old Icelandic is the most complex ancient Germanic language (Iversen 1994, Jónsson 1908, Krahe 1969, Noreen 1884, Steblin-Kamenskij 1953 and 1955). The study of lexical, syntactic and morphological change in Old Icelandic, and of word use in Old Icelandic texts, has been greatly enhanced by the NSF-EC funded Cultural Heritage Language Technologies Project (CHLT). CHLT has made possible a quantum leap in the study of Old Norse by creating, for the first time, sophisticated tools and modes of analysis for scholars in this field. The creation of CHLT digital editions of Old Icelandic texts and images tagged in morpho-syntactic detail has made it possible to achieve disambiguation scores for this difficult language.
To develop the results of CHLT further, we propose to:
Augment our morphological analyzer for Old Icelandic to include grammatical disambiguation (morphological and syntactic) in context; develop automated orthographic normalization routines, coupled to a timelining function related to orthographic change; refine the output of the morphological analyzer; greatly expand the underlying lexical dataset with which the analyzer currently works; expand the English language lookup tool to include all of the entries in Cleasby-Vigfusson; integrate these tools with a widening corpus of both Standard Edition and diplomatically transcribed Old Icelandic texts; integrate these components into several digital library environments; and adapt the morphological analyzer code on a test basis to Old English.
Our next goal is to develop a series of automated grammatical disambiguation routines for Old Icelandic. Because of the phonological and morphological complexity of Old Icelandic, a great number of ambiguous forms exist. Either identical forms can be derived from multiple lemmata (e.g. fara: gen. pl. of the neut. noun far ‘a means of passage’; or inf. / 3rd pl. ind. prs. / 1st sg. sub. prs. of the verb fara ‘to travel’), or identical forms exist within the paradigm of a single lemma; for example, oblique forms of nouns are often ambiguous with regard to case (e.g. bleytu, acc./gen./dat. sg. of bleyta ‘mud’). In some instances, both are true (e.g. unni: a) 3. p. sg. preterite indicative active of the verb unna ‘to love’; or b) 3. p. sg./pl. present subjunctive active of the verb unna ‘to love’; or c) acc./dat. sg. of the feminine noun unnr ‘wave’). Our system will incorporate rules that disambiguate between these forms in context, offering greater precision in identifying part of speech and case/person for a given form. We are not proposing to undertake automated semantic disambiguation at this stage, i.e. differentiating between possible meanings of a single lemma (e.g. far can mean (1) ‘a means of passage, ship’ (2) ‘passage’ (3) ‘trace, print, track’ (4) ‘life, conduct, behavior’ (5) ‘state, condition’).
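The two kinds of grammatical ambiguity can be made concrete with a small data structure. The Python sketch below records only the candidate analyses quoted in the paragraph above; the names `ANALYSES` and `ambiguity` and the tuple layout are illustrative assumptions, not the project's internal representation.

```python
# Candidate analyses (lemma, part of speech, grammatical tag), transcribed
# from the examples given in the text. The layout is hypothetical.
ANALYSES = {
    "fara": [
        ("far", "noun", "gen. pl."),
        ("fara", "verb", "inf."),
        ("fara", "verb", "3rd pl. ind. prs."),
        ("fara", "verb", "1st sg. sub. prs."),
    ],
    "bleytu": [
        ("bleyta", "noun", "acc. sg."),
        ("bleyta", "noun", "gen. sg."),
        ("bleyta", "noun", "dat. sg."),
    ],
    "unni": [
        ("unna", "verb", "3rd sg. pret. ind. act."),
        ("unna", "verb", "3rd sg. prs. sub. act."),
        ("unna", "verb", "3rd pl. prs. sub. act."),
        ("unnr", "noun", "acc. sg."),
        ("unnr", "noun", "dat. sg."),
    ],
}


def ambiguity(form):
    """Classify a form: ambiguous across lemmata, within one lemma, or both."""
    analyses = ANALYSES.get(form, [])
    lemmata = [(lemma, pos) for lemma, pos, _ in analyses]
    return {
        "across_lemmata": len(set(lemmata)) > 1,
        "within_lemma": any(lemmata.count(key) > 1 for key in set(lemmata)),
    }


print(ambiguity("unni"))
```

A disambiguation routine of the kind proposed would take such a candidate list as its input and use the sentence context to narrow it down.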
In cases where grammatical disambiguation cannot occur with 100% surety, the routines we develop will provide a measure of statistical likelihood between the possible choices (described in greater detail below in section 2.c). The disambiguation routines will allow for greater precision of the morphological analyzer’s output when used in conjunction with the growing digital corpus of Old Icelandic prose texts (predominantly sagas). Searches for words and phrases in a text corpus in which ambiguous forms have been further tagged with information concerning the disambiguation of those forms will yield far better results than similar searches on unmarked texts or marked texts in which ambiguous forms are not properly tagged, as the “noise” generated by ambiguous forms will be greatly reduced or eliminated. Consequently, evaluations of the results will be more meaningful to researchers. Such queries will allow for quite specific searches concerning word use, overall vocabulary, linguistic change over time, regional usage of words (to the extent that this can be determined) as well as specific aspects of syntax and grammar. Our disambiguation and tagging of Old Icelandic texts is an important first step toward building a linguistic tree bank and eventually a parsed corpus of Old Icelandic.
To expand the range of the morphological analyzer and the disambiguation routines, we propose to increase greatly the lexical set to which the current analyzer is linked. Currently, the analyzer has a limited lexical database derived from Zoëga’s Old Icelandic Dictionary (1910). This dictionary is a subset of the standard English language dictionary for Old Icelandic, Cleasby–Vigfusson’s An Icelandic-English Dictionary (1874). We will incorporate all of Cleasby–Vigfusson in our new dictionary tool, and further expand the lexical set with the Ordbog over det norrøne prosasprog (ONP) (Degnbol et al. 1995-), a comprehensive list of all lemmata in the Old Icelandic prose language (approx. 68,000 lemmata). While Cleasby–Vigfusson includes adequate information concerning irregular forms, ONP does not. Accordingly, our database of exceptions and irregular forms will initially only cover the 40,000 or so lemmata found in Cleasby–Vigfusson; we will develop an easy user interface for updating the irregular forms of words found in ONP but not occurring in Cleasby–Vigfusson so that, by the end of the grant period, the underlying database of exceptions and irregular forms should be nearly complete. In turn, this more complete table of exceptions will increase the accuracy of the morphological analyzer.
We will also develop normalizing routines for Medieval Icelandic orthography. The architecture of the normalizer already exists in our “Normalizer” module, used by the morphological analyzer to standardize lexical database input (see section 2.b below). The advanced orthographic normalizer will include a timelining feature that will take into account orthographical changes in Icelandic up to the 15th century. Changes in Icelandic orthography were often related to changes in phonology. Unlike phonological changes, orthographical changes were reflected rather inconsistently, particularly in later manuscripts. Incorporating a time-lining feature will help map orthographic change and will also allow for standardized searching on forms across manuscripts written in different time periods. We consider this time-lining of orthographic change to be an important feature since the morphological analyzer/disambiguator is intended to work with normalized text as well as diplomatic transcriptions of manuscripts from various periods and various scriptoria. The normalizer will also allow us to back-normalize Old Icelandic texts written with modern Icelandic orthography, greatly expanding the number of texts available to the system.
We will continue to use the Fornaldar sögur (Legendary Sagas) as our test platform. The corpus includes Standard Edition (SE) texts as well as diplomatic transcriptions of variant manuscripts on which those SE texts are based. We propose to apply our disambiguation routines to the standard edition (normalized) texts, as well as to the automatically marked-up diplomatic editions of manuscript variants of Fornaldar sögur texts that we have developed during the past three years. The morphological analyzer, the disambiguation routines, and the XML-tagged SE and diplomatic edition texts will be incorporated into the Perseus digital library at Tufts University and into a Greenstone Digital library environment at UCLA. Automatic XML markup will proceed according to the conventions for Early Scandinavian mark-up described in the MENOTA handbook (Menota 2003).
Because of the multiple structural similarities between Old
English and Old Icelandic, we plan to adapt the architecture of our
morphological analyzer/disambiguator on a test basis for Old English. We will
likely base the look-up tool on the available headwords of the Dictionary of
Old English from the University of Toronto. The goal of this adaptation is to
show how the architecture of our morphological analyzer can be applied to the
other ancient Germanic languages. From the morph(on)ological point of view, the
other ancient Germanic languages are less complex than Old Icelandic, and the
derivation of analyzers and disambiguation for them should be less difficult to
implement. We expect our work to have import for the development of
morphological analyzers for other Indo-European languages.
Objectives of the Project
Develop disambiguation routines for addressing ambiguous forms in Old Icelandic texts (Sections 2.c and 2.d)
Develop a method for automatically scoring the results of disambiguation, and integrate these statistical scores into the XML markup (Sections 2.c and 2.d)
Develop an orthographical normalizer and a time-lining feature to account for orthographical change in Old Icelandic; expand the number of digital texts available in standard Old Icelandic orthography (Section 4)
Refine the output of the existing Old Icelandic morphological analyzer, primarily by increasing the size and accuracy of the table of exceptions (Sections 2.a, 2.b and 3)
Expand the underlying lexical set to include all of the lemmata in Cleasby-Vigfusson; supplement this set with lemmata from Ordbog over det norrøne prosasprog using a simple web form accessible to project developers and expert users (Section 3)
Expand the English language lookup tool (Section 3)
Port the code of the morphological analyzer to Old English on a test basis (Section 6)
Integrate the second generation morphological analyzer/disambiguator with digital library systems (Perseus and Greenstone), visualization tools (Greenstone, Spire, Cascade, Navigational View Builder etc.) and other text analysis tools (e.g. Wordstat, Xaira, Juxta); explore export of the system as a SCORM learning object (Section 5)
Morphology, Disambiguation and Old Icelandic
2.a Morphological Complexity of Old Icelandic and automated morphological analysis
Compared to the
other ancient Germanic languages, the morphonological system of Old Icelandic
is relatively complex and in many ways irregular. This complexity stems from
the instability of the phonological system and multiple irregular developments,
as well as from an ambiguity of endings and active processes of analogy. Many
of these phenomena took place before Icelandic was established as an individual
language, while others affected the language in the course of its internal
development.
The nominal system of Old Icelandic consists of sixteen inflectional classes, most of which can be further subdivided into subclasses. The Old Icelandic adjectives can be grouped into two morphologically and semantically distinct classes of “strong” (indefinite) and “weak” (definite) adjectives. The verb system consists of three large classes of the so-called “strong”, “weak” and “preterito-present” verbs. Each of these classes poses specific challenges to the development of an automated morphological analyzer.
In the Germanic proto-language, nouns of different classes were characterized primarily by different suffixes and at times by different endings. In the course of development of the Germanic languages, the various suffixes frequently merged with endings by means of various phonological processes, and eventually disappeared as independent morphemes. The new endings, which in early Germanic were still quite different from each other, were affected by special Germanic phonological rules (Verner’s law, reduction of unstressed vowels) and in several instances became homophonous, cf. feminine ō-stem genitive ending *-ōR (< *-ās) and u-stem genitive ending *-ōR (< *-ous), both of which evolved into -ar in Old Icelandic. Also, many endings could be added to nouns of more than one gender. In numerous cases, the ambiguity of endings caused confusion of inflections, transfer of nouns from one class to another, or paradigmatic split (Gutenbrunner 1951, Krahe 1969, Noreen 1884).
The endings of
Old Icelandic verbs are relatively straightforward. Much more problematic are
the morphologically and phonologically conditioned vowel alternations in the
root, which can significantly affect the shape of the root in different forms
of the same word. Analogical restorations and transformations, which always
work against regular phonological development, have contributed to the creation
of numerous by-forms and parallel paradigms.
As with the nouns, there have been numerous transfers of verbs between classes and paradigmatic splits. Due to their phonological shape, many archaic strong verbs developed in irregular ways, eventually developing abnormal paradigms. Already at an early stage, native speakers created alternative regular paradigms for such verbs, and in many instances irregular verbs possessed more than one paradigm (in some cases as many as six, cf. the verb gørva ‘make’, which due to its aberrant shape in the course of development acquired five by-forms, cf. gera, gerva, gøra, gjöra, gjörva, each having its own paradigm). Our morphological analyzer deals well with this type of complexity, relying both on calculation of regularly produced forms and on an underlying table of exceptions for irregular forms that cannot be calculated. For example, it currently returns the following complex paradigm for gøra:
Active
| | Indicative Present | Indicative Past | Subjunctive Present | Subjunctive Past |
| 1sg | gøri, geri, gjöri | gørða, gerða, gjörða | gøra | gørða |
| 2sg | gørir, gerir, gjörir | gørðir, gerðir, gjörðir | gørir | gørðir |
| 3sg | gørir, gerir, gjörir | gørði, gerði, gjörði | gøri | gørði |
| 1pl | gørum, gerum, gjörum | gørðum, gerðum, gjörðum | gørim | gørðim |
| 2pl | gørið, gerið, gjörið | gørðuð, gerðuð, gjörðuð | gørið | gørðið |
| 3pl | gøra, gera, gjöra | gørðu, gerðu, gjörðu | gøri | gørði |
Infinitive: gøra
Present Participle: gørandi
Past Participle: görr
Medio-Passive
| | Indicative Present | Indicative Past | Subjunctive Present | Subjunctive Past |
| 1sg | gørumk | gørðumk | gørumk | gørðumk |
| 2sg | gørisk | gørðisk | gørisk | gørðisk |
| 3sg | gørisk | gørðisk | gørisk | gørðisk |
| 1pl | gørumk | gørðumk | gørimk | gørðimk |
| 2pl | gørizk | gørðuzk | gørizk | gørðizk |
| 3pl | gørask | gørðusk | gørisk | gørðisk |
Infinitive: gørask
Present Participle: gørandisk
Past Participle: gørzk
Imperative
| | Active | Medio-Passive |
| 2sg | gør | gørask |
| 1pl | gørum | gørumk |
| 2pl | gørið | gørizk |
Despite the ability of the morphological analyzer to deal with complex paradigms, the current lexical database does not account for all five secondary forms, but rather uses Zoëga’s pointers from gera to gøra, gerva to gørva, and gjør- to gør- or gör-; Zoëga’s standard form gøra includes the secondary form gørva, which is actually the original form and should be the default form. This lack of clarity regarding secondary forms in our underlying lexical database will be addressed in the extension of the lexical database and the expansion of the table of exceptions.
A large part of the complexity of Old Icelandic morphonology can be attributed to the phonological processes of umlaut and breaking, which affect the shape of the stem in various ways, cf. sag-a ‘saga’ (nom. sg.) vs. sög-u (obl. sg.; u-umlaut changes /a/ to /ö/), or berg ‘save’ (1. p. sg. pres.) vs. bjargið ‘save’ (2. p. pl. pres.; a-breaking changes /e/ to /ja/), etc. In many cases, more than one umlaut (or umlaut and breaking) obtains, cf. søkkva ‘sink (transitive v.)’ (< *sankwijan; the root vowel /a/ undergoes u/w-umlaut and then the resulting */ö/ undergoes i-umlaut to /ø/).
In those word-forms where the conditions for an umlaut did not exist, it did not occur. This lack of umlaut resulted in different forms of the same word having different shapes (“allomorphy”), cf. sök (nom. sg.) vs. sak-ar (gen. sg.), or sag-a (nom. sg.) vs. sög-u (obl. sg.). In those cases where more than one umlaut (or umlaut and breaking) operated, the number of allomorphs rose accordingly, cf. fjörðr ‘fiord’ (nom. sg.; < *ferþ-uR; u-breaking: e > jö /_Cu) vs. firði (dat. sg.; < *ferþ-ī; i-umlaut: e > i /_Ci) vs. fjarðar (gen. sg.; < *ferþ-aR; a-breaking: e > ja /_Ca). As a result, paradigms can become quite complex, such as the paradigm for fjörðr:
| | Singular | Plural |
| Nom. | fjörðr (< *ferþ-ur) | firðir (< *ferþ-īr) |
| Acc. | fjörð (< *ferþ-un) | fjörðu (< *ferþ-unn) |
| Gen. | fjarðar (< *ferþ-ar) | fjarða (< *ferþ-an) |
| Dat. | firði (< *ferþ-ī) | fjörðum (< *ferþ-umm) |
Again, our
morphological analyzer deals quite well with this type of complexity, and
accurately returns:
| | Singular | Plural |
| Nom | fjörðr | firðir |
| Acc | fjörð | fjörðu |
| Gen | fjarðar | fjarða |
| Dat | firði | fjörðum |
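The three sound changes quoted for this paradigm (u-breaking, i-umlaut, a-breaking) are enough to predict which stem shape appears in each cell. The Python sketch below is a deliberately narrow illustration restricted to the root *ferþ-: the names `ROOT_VOWEL`, `PROTO_ENDINGS` and `stem_shape` are assumptions, the proto-endings are copied from the reconstructed paradigm above, and the change of þ to ð after r is simply hard-coded.

```python
# Root-vowel outcomes quoted in the text, keyed by the first vowel of the
# proto-ending:  e > jö /_Cu,  e > i /_Ci,  e > ja /_Ca
ROOT_VOWEL = {"u": "jö", "i": "i", "ī": "i", "a": "ja"}

# Proto-endings as listed in the reconstructed paradigm of fjörðr.
PROTO_ENDINGS = {
    ("nom", "sg"): "ur", ("nom", "pl"): "īr",
    ("acc", "sg"): "un", ("acc", "pl"): "unn",
    ("gen", "sg"): "ar", ("gen", "pl"): "an",
    ("dat", "sg"): "ī",  ("dat", "pl"): "umm",
}


def stem_shape(proto_ending, onset="f", coda="rð"):
    """Return the Old Icelandic stem of *ferþ- before a given proto-ending."""
    return onset + ROOT_VOWEL[proto_ending[0]] + coda


for slot, ending in PROTO_ENDINGS.items():
    print(slot, stem_shape(ending))
```

The three resulting stems (fjörð-, firð-, fjarð-) are the allomorphs visible in the paradigm; the attested endings themselves are not derived here.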
In addition to umlaut and breaking, Old Icelandic exhibits other complex phonological features. Another common phenomenon is syncope of unstressed vowels. However, the rules for its operation are not easy to define. Syncope tends to occur in words which in the protolanguage were trisyllabic (or longer). However, it is reflected in by no means a regular way, cf. jötn-ar ‘giants’ (nom. pl.; 2 syllables) < *jöt-un-ar (3 syllables), but skrif-ar-ar ‘scribes’ (3 syllables). Syncope is quite irregular among adjectives, operating in some words, and not operating in others, even though they may belong to the same derivational type, cf. mál-i-gr ‘talkative’ – acc. sg. masc. mál-gan, but kunn-i-gr ‘expert’ – acc. sg. masc. kunn-i-gan. For máligr, for example, our morphological analyzer accurately returns:
| | Singular | Plural |
| Nom | máligr | málgir |
| Acc | málgan | málga |
| Gen | máligs | máligra |
| Dat | málgum | málgum |
In other cases,
our morphological analyzer returns results that are incorrect. Syncope is one
of the ongoing challenges as we refine our automated morphological analyzer.
Along with syncope, phenomena related to the phonological changes to consonants and consonant clusters as a result of the processes of assimilation, dissimilation, degemination, devoicing in word-final position and Verner’s law pose a challenge to our automated morphological analyzer. All of these irregularities, while fairly well addressed in the current morphological analyzer, require a degree of attention that we have as yet been unable to apply consistently across all word classes. However, we have been able to describe these phenomena well, and will implement these descriptions as refined rule-sets in the Target Language module described below, in conjunction with our planned expansion of the lexical dataset, the table of irregular forms and the English language lookup tool.
2.b The Old Icelandic Morphological Analyzer: Architecture and Implementation
The morphological analyzer produces word form tables based on lemmata from Zoëga’s lexicon. In addition, it can report the computations by which it arrives at the final output. For example, given the head word barn, the analyzer performs a lexicon lookup to retrieve the following information from its digital copy of Zoëga’s lexicon:
barn | barn | E | n | (1) bairn, child; vera með barni, to be with child; ganga með barni, to go with child; barns hafandi or hafandi at barni, with child, pregnant; frá blautu barni, from one's tender years; (2) = mannsbarn; hvert b, every man, every living soul
Each lexicon entry consists of five fields: the headword itself, its original form in the lexicon, declension information (which in this case is empty as signaled by the symbol 'E'), its part-of-speech, and finally its translation and usages.
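The five-field layout lends itself to a very small parser. The following Python sketch shows how such a line might be read into a record; the class name `LexiconEntry`, the field names and the function `parse_entry` are illustrative assumptions, and the analyzer's own Perl code is not reproduced here.

```python
from typing import NamedTuple, Optional


class LexiconEntry(NamedTuple):
    headword: str              # normalized headword
    original_form: str         # headword as printed in the lexicon
    declension: Optional[str]  # None when the field holds the marker 'E'
    pos: str                   # part of speech, e.g. "n" or "n pl"
    gloss: str                 # translation and usages


def parse_entry(line):
    """Split one pipe-delimited lexicon line into its five fields."""
    # maxsplit=4 keeps any later '|' characters inside the gloss field.
    fields = [field.strip() for field in line.split("|", 4)]
    if len(fields) != 5:
        raise ValueError("expected five fields, got %d" % len(fields))
    headword, original, declension, pos, gloss = fields
    return LexiconEntry(
        headword, original, None if declension == "E" else declension, pos, gloss
    )


entry = parse_entry("barn | barn | E | n | (1) bairn, child; vera með barni, to be with child")
print(entry.headword, entry.pos, entry.declension)
```

Mapping the 'E' marker to `None` keeps the "no declension information" case explicit for whatever code consumes the record.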
Given this information and an internal representation of the phonology and morphology of the target language Old Icelandic, the morphological analyzer determines and outputs all potential paradigms:
barn, noun, gender: n, a-stem
| | Singular | Plural |
| Nom | barn | börn |
| Acc | barn | börn |
| Gen | barns | barna |
| Dat | barni | börnum |
(1) bairn, child; vera með barni, to be with child; ganga með barni, to go with child; barns hafandi or hafandi at barni, with child, pregnant; frá blautu barni, from one's tender years; (2) = mannsbarn; hvert b, every man, every living soul
The user can choose to output the analyzer’s internal application of its linguistic rules, allowing students of Old Icelandic to understand the derivation of the forms found in the paradigm. For each output form, it lists the phonological and/or morphological rule underlying its change:
Lexeme: barn
Gender (if any): n
Declension info: nom_sg E
Stem (if any): a
The root is barn.
I found a stem: a.
Root consonants: b - r n
Root vowels: - a - -
Root vowels only: a
Sound changes for element Nom Sg:
None.
Sound changes for element Acc Sg:
None.
Sound changes for element Gen Sg:
None.
Sound changes for element Dat Sg:
None.
Sound changes for element Nom Pl:
u-mutation to neut a-stem, nom & acc pl ...
Sound changes for element Acc Pl:
u-mutation to neut a-stem, nom & acc pl ...
Sound changes for element Gen Pl:
None.
Sound changes for element Dat Pl:
Regular u-mutation ...
Figure 1: Parsing detail from the Old Icelandic Morphological Analyzer
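The trace in Figure 1 can be mirrored by a toy declension routine for neuter a-stems. The Python sketch below is read off the barn paradigm and its trace only; the table `NEUT_A_STEM` and the functions `u_mutate` and `decline` are assumptions, and the mutation is simplified to roots with a single root vowel a.

```python
def u_mutate(root):
    """u-mutation as shown in the trace: the root vowel a becomes ö.

    Simplification: assumes a root with exactly one vowel, as in barn.
    """
    return root.replace("a", "ö")


# (ending, u-mutation applies?) per cell, read off the paradigm and trace.
NEUT_A_STEM = {
    ("nom", "sg"): ("", False),  ("nom", "pl"): ("", True),
    ("acc", "sg"): ("", False),  ("acc", "pl"): ("", True),
    ("gen", "sg"): ("s", False), ("gen", "pl"): ("a", False),
    ("dat", "sg"): ("i", False), ("dat", "pl"): ("um", True),
}


def decline(root):
    """Return the eight forms of a neuter a-stem noun with the given root."""
    return {
        slot: (u_mutate(root) if mutated else root) + ending
        for slot, (ending, mutated) in NEUT_A_STEM.items()
    }


for slot, form in decline("barn").items():
    print(slot, form)
```

Keeping the endings and the mutation flags in a table, rather than in the function body, follows the separation of rules and code that the analyzer's design aims for.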
The design and implementation of our morphological analyzer are guided by two main principles. An object-oriented layout allows for its adaptation to languages other than Old Icelandic. In addition, its separation of linguistic rules, natural language resources, and the code itself enables the user to add new language resources. Both design principles are of major importance with regard to the scalability of the analyzer. The analyzer essentially works as a two-level morphological analyzer as described in part by Koskenniemi (1983; 1986) and Karttunen (1983) and later refined by others (Karttunen, L., Koskenniemi, K., and Kaplan, R. M. 1987; Antworth, E. L. 1990; Pulman, S. 1991; see also Karttunen, L. and Beesley, K. R. 2001). Our analyzer accepts two kinds of input: lemmata from the lexical database, for which it outputs the paradigm of that headword; or forms from a text, for which it outputs all possible lemmata and their paradigms (with the input form clearly marked).
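The second input mode (from a text form back to its possible lemmata) can be pictured as a reverse index over generated paradigms. The Python sketch below uses only the two noun paradigms printed in this section; `PARADIGMS`, `build_index` and `analyse` are hypothetical names, and this lookup-table approach is a stand-in for illustration, not the two-level method the analyzer itself uses.

```python
from collections import defaultdict

# Paradigms for barn and fjörðr as given in this section.
PARADIGMS = {
    "barn": {
        "nom sg": "barn", "acc sg": "barn", "gen sg": "barns", "dat sg": "barni",
        "nom pl": "börn", "acc pl": "börn", "gen pl": "barna", "dat pl": "börnum",
    },
    "fjörðr": {
        "nom sg": "fjörðr", "acc sg": "fjörð", "gen sg": "fjarðar", "dat sg": "firði",
        "nom pl": "firðir", "acc pl": "fjörðu", "gen pl": "fjarða", "dat pl": "fjörðum",
    },
}


def build_index(paradigms):
    """Map every inflected form to the (lemma, slot) pairs that produce it."""
    index = defaultdict(list)
    for lemma, cells in paradigms.items():
        for slot, form in cells.items():
            index[form].append((lemma, slot))
    return index


INDEX = build_index(PARADIGMS)


def analyse(form):
    """Return all (lemma, slot) candidates for a form found in a text."""
    return INDEX.get(form, [])


print(analyse("börn"))
```

A form that maps to more than one pair, such as börn, is exactly the kind of ambiguous form that the proposed disambiguation routines would have to resolve in context.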
The analyzer code is written in Perl (http://www.perl.com), a programming language particularly suited to the manipulation of Unicode and plain text strings. In addition, it allows for the creation of classes, i.e. an object-oriented architecture. Some attractive features of object-oriented programming are the hierarchical structuring of classes, the control over variable declarations and user permissions, and a high degree of convergence between the application design and its problem space. Figure 2 illustrates the general architecture of the analyzer:
Figure 2: General architecture of the morphological analyzer.
Currently, the Lexicon module consists of an electronic copy of Zoëga's Old Icelandic lexicon, excerpts from Old Icelandic sagas and a table of exceptions that overrides the output of forms calculated by the morphological analyzer where appropriate. To ensure scalability, the analyzer has been designed to accept input from various language resources. This design feature makes the incorporation of other lexical databases quite straightforward, and will allow us to implement the Cleasby–Vigfusson additions, as well as the ONP additions, in an efficient manner.
In operation, the morphological analyzer expects a normalized form of lexical entries as its input. This is accomplished by the Normalizer module. Compare for example the following entry in Zoëga:
barna-börn, n. pl. grandchildren;
with its normalized version which is accessed by the analyzer:
barnabörn | barna-börn | E | n pl | grandchildren
The normalization occurs automatically. The general features of the Normalizer module will be expanded as the underlying structure of the Orthographic Normalizer module (see Section 4 below). This latter module will be used in conjunction with input texts rather than lexical databases.
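The step from the raw entry to the normalized record can be sketched as follows. This is a minimal illustration in Python, not the project's Perl code. The function name `normalize_zoega_entry`, the regular expression, and the assumed entry layout (headword, then abbreviations ending in a period, then a gloss) are assumptions made for this example. The third field, E, is not defined in the text, so the sketch simply passes it through as an opaque flag.

```python
import re

# Assumed raw layout: "<headword>, <abbreviations ending in '.'> <gloss>;"
ENTRY_RE = re.compile(
    r"^(?P<head>[^,]+),\s*(?P<gram>(?:\w+\.\s*)+)(?P<gloss>.*?);?\s*$"
)

def normalize_zoega_entry(raw, flag="E"):
    """Turn one raw dictionary line into a pipe-delimited record."""
    m = ENTRY_RE.match(raw.strip())
    if m is None:
        raise ValueError(f"unrecognized entry layout: {raw!r}")
    head = m.group("head").strip()
    key = head.replace("-", "")                      # lookup key without hyphens
    gram = " ".join(m.group("gram").replace(".", "").split())  # "n. pl." -> "n pl"
    gloss = m.group("gloss").strip()
    # `flag` is copied through unchanged; its meaning is not given in the text.
    return " | ".join([key, head, flag, gram, gloss])

print(normalize_zoega_entry("barna-börn, n. pl. grandchildren;"))
# barnabörn | barna-börn | E | n pl | grandchildren
```

A gloss that itself begins with an abbreviation would be misread by this simple pattern, which is one reason a real layout rule would need more care than this sketch takes.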
The Target Language module contains information regarding the target language such as phonological and morphological rules. For example, the morpho-phonetic rule for the excision of consonants in Old Icelandic is represented as Perl pseudo-code:
RULE: excision_consonant
CONDITION: rootc(-2) ne '-' && rootc(-1) ne '-' && tmp(0) eq rootc(-1)
ACTION: shift tmp
In this rule, the morphological analyzer deletes a given consonant if certain conditions regarding the consonantal structure of the lexeme root are met. The rule set in the Target Language module contains the majority of phonological and morphological rules for Old Icelandic. Accordingly, only a few linguistic rules are hard coded into the analyzer. Our goal is to achieve complete separation between the Target Language module and the morphological analyzer itself. By implementing this separation, the Morphological Analyzer will be able to interact with Target Language modules and Lexicon Modules of other languages, such as Old English. In addition to a linguistic rule set, the Target Language module consists of several databases for language specific data such as exceptions, umlaut information, and word ending paradigms.
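One possible reading of the excision rule is sketched below in Python, for illustration only. The interpretation is an assumption: `rootc(i)` is taken to return the consonant at position i of the root, or '-' when that position does not hold a consonant; `tmp` is taken to be the ending about to be attached; and `shift` is taken to remove the first character of `tmp`, as Perl's `shift` removes the first element of a list. The vowel inventory and the sample inputs are likewise chosen for the example.

```python
# Assumed vowel inventory used to decide what counts as a consonant.
VOWELS = set("aeiouyáéíóúýæœøöǫ")

def rootc(root, i):
    """Return the character at root position i if it is a consonant, else '-'."""
    try:
        ch = root[i]
    except IndexError:
        return "-"
    return "-" if ch.lower() in VOWELS else ch

def excision_consonant(root, tmp):
    """If the root ends in two consonants and the ending starts with a copy
    of the last one, drop that first character of the ending."""
    if (rootc(root, -2) != "-" and rootc(root, -1) != "-"
            and tmp[:1] == rootc(root, -1)):
        tmp = tmp[1:]                      # ACTION: shift tmp
    return root + tmp

print(excision_consonant("send", "di"))   # rule fires
print(excision_consonant("tal", "di"))    # rule does not fire
```

Under this reading, a root such as send- combined with an ending -di yields sendi rather than senddi, while tal- plus -di is left untouched because the root does not end in two consonants.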
The third module in the architecture is the morphological analyzer itself. Upon being called, it determines the root structure of a word from the Lexicon module based on the rules and definitions in the Target Language module entry for Old Icelandic. Once it has determined the word's part of speech, the analyzer creates a paradigm, performs the appropriate morpho-phonetic changes, and finally outputs the paradigm.
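The sequence described here (build the paradigm, apply sound changes, let the table of exceptions override computed forms) can be sketched for a single noun class. The Python below is illustrative only. The ending table, the simplified u-mutation (last a of the root becomes ö), and the names `build_paradigm` and `u_mutate` are assumptions for this example, and the choice of slots for u-mutation follows the parsing detail shown in Figure 1 for a neuter a-stem.

```python
# Illustrative endings for a neuter a-stem noun (not the project's data files).
NEUT_A_STEM_ENDINGS = {
    ("Nom", "Sg"): "", ("Acc", "Sg"): "", ("Gen", "Sg"): "s", ("Dat", "Sg"): "i",
    ("Nom", "Pl"): "", ("Acc", "Pl"): "", ("Gen", "Pl"): "a", ("Dat", "Pl"): "um",
}
# Slots in which u-mutation applies: nom and acc pl, plus dat pl before -um.
U_MUTATION_SLOTS = {("Nom", "Pl"), ("Acc", "Pl"), ("Dat", "Pl")}

def u_mutate(root):
    """Simplified u-mutation: replace the last 'a' of the root with 'ö'."""
    i = root.rfind("a")
    return root if i < 0 else root[:i] + "ö" + root[i + 1:]

def build_paradigm(root, exceptions=None):
    """Compute all eight forms, then let listed exceptions override them."""
    paradigm = {}
    for slot, ending in NEUT_A_STEM_ENDINGS.items():
        stem = u_mutate(root) if slot in U_MUTATION_SLOTS else root
        paradigm[slot] = stem + ending
    paradigm.update(exceptions or {})     # exception table wins over rules
    return paradigm

for slot, form in build_paradigm("barn").items():
    print(slot, form)
```

For the root barn this produces barn, barn, barns, barni in the singular and börn, börn, barna, börnum in the plural. Keeping the endings and the mutation slots in data tables rather than in the function body mirrors the separation between rules and code that the text describes.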
2.c Grammatical Disambiguation
Our proposed system for the grammatical disambiguation (both morphological and syntactic) of Old Icelandic will rely directly on accurate output from the morphological analyzer coupled to a significant library of digital versions of Old Icelandic texts. While there are a variety of disambiguation strategies, this type of "supervised" disambiguation yields better results than "unsupervised" disambiguation (Manning and Schütze 2000).
Ambiguous forms arise in several ways in Old Icelandic. The ambiguity of the grammatical endings is generally a result of convergent phonological development. In turn, ambiguity of the endings is one of the causes of morphological analogy and paradigmatic reformation. Analogy and paradigmatic levelling can also be caused directly by phonological processes. In such cases, aberrant phonological development creates allomorphy within a single paradigm. The development of an allomorphic paradigm can follow several different courses. Sometimes, the more prominent allomorph may push out the less prominent one, cf. the present singular active paradigm of the verb eta 'eat', in which the more prominent allomorph et- pushed out the less prominent *jöt-, expected in the 1. p. sg. present. Conversely, allomorphy may be preserved, cf. the paradigm of fjörðr above (section 2.a). Or, finally, one may encounter paradigmatic split as in the verb gørva (see section 2.a; on analogy see Sturtevant 1957).
The disambiguation routines we plan to develop will allow for varying levels of end-user expertise and rely on "scoring" the results for each ambiguous form. Basic users will likely want to accept the high-score suggestions of the disambiguation routines, while users with a strong background in Old Icelandic may want the ability to override the suggestions of the disambiguation routines, or to consider all of the scored output. Searches on the corpus will allow users to toggle the disambiguation functions on and off; the results of these searches can then be passed to various statistical tools (estimates of proportion for the occurrence of forms in a corpus, calculation of z-scores for such forms, and other standard measures of word use, co-occurrence and vocabulary incorporated into textual analysis systems such as Wordstat or Xaira) and visualization tools (such as those developed at Imperial College and incorporated in the most recent release of Greenstone). Disambiguation will also contribute significantly to meaningful clustering and key-term extraction routines. Finally, this automated, supervised disambiguation is an important component of developing a linguistic tree bank for Old Icelandic and subsequently a parsed corpus.
2.d Design and Implementation of Disambiguation Routines
In Germanic and most other natural languages, word order follows patterns (Duda et al. 2000). To varying degrees, these patterns may be enforced by the grammar of a language. On a sub-sentence level, words can often be combined to form phrases. For example, the English phrase "the old man" is an instantiation of the abstract pattern "Determiner – Adjective – Noun". Old Icelandic, too, contains patterns of word order. This fact lies at the heart of disambiguation based on phrase structure dependencies. For example, in the following excerpt from The Saga of Grettir the Strong, notice the context of menn 'men, envoys':
En er þeir fréttu þat, Þórir haklangr ok Kjötvi konungr, þá sendu þeir menn til móts við þá ok báðu þá liðs ok hétu þeim sœmðum.
[…] and when Thorir Long-chin and Kjötvi the King heard of their landing they sent envoys to ask for their aid, promising to treat them with honor.
The form menn can be found in the paradigm for maðr:
maðr, noun, gender: m, r-stem
Case | Singular | Plural
Nom | maðr | menn
Acc | mann | menn
Gen | manns | manna
Dat | manni | mönnum
According to the paradigm, menn could be either nominative or accusative plural. To resolve this ambiguity and determine the correct form, we can analyze the context window in which menn occurs:
þá sendu þeir menn til móts við
The form sendu is uniquely identified by the morphological analyzer as an active indicative 3rd person plural verb, '(they) sent'. In addition, þeir is uniquely identified as masculine nominative plural 'they'. Given the fact that a pattern such as Verb – Subject – Object occurs with high frequency in Old Icelandic texts, the disambiguation tool would correctly determine that the above instance of menn is accusative plural.
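The reasoning for menn can be sketched as a small scoring routine. The Python below is only an illustration: the tag labels, the candidate analyses, and above all the pattern frequencies are hypothetical values invented for the example, since the text notes that the real set of word order patterns does not yet exist.

```python
from itertools import product

# Candidate analyses per token, as an analyzer might return them (hypothetical).
ANALYSES = {
    "sendu": ["V.3pl"],
    "þeir": ["Nom.pl"],
    "menn": ["Nom.pl", "Acc.pl"],
}
# Hypothetical frequencies: verb + nominative subject + accusative object is
# assumed to be much more common than verb followed by two nominatives.
PATTERN_FREQ = {
    ("V.3pl", "Nom.pl", "Acc.pl"): 120,
    ("V.3pl", "Nom.pl", "Nom.pl"): 3,
}

def disambiguate(tokens):
    """Score every combination of candidate tags; return them best first."""
    scored = [
        (PATTERN_FREQ.get(combo, 0), combo)
        for combo in product(*(ANALYSES[t] for t in tokens))
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

for score, tags in disambiguate(["sendu", "þeir", "menn"]):
    print(score, tags)
```

Returning the full scored list rather than only the winner leaves room for expert users to inspect or override the suggestion, as the discussion of scoring envisages.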
The set of word order patterns is currently not available. To create it, we will analyze each word of our text corpus using the morphological analyzer. For each cluster of uniquely identified forms, their pattern of grammatical dependency will be added to the pool of possible phrases. Thus, given a phrase like sendu þeir menn, the first two words will be uniquely identified as Verb (active past, 3rd plural) – Noun (nominative) and added to the pool of permissible phrases. At the end of this process, we will have an inventory of permissible phrase structures together with their frequency of occurrence in the corpus.
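Building this inventory amounts to counting tag sequences over runs of unambiguous tokens. A minimal Python sketch follows. The function `collect_patterns`, the tag strings, and the toy lexicon are assumptions for illustration, and a "cluster" is taken here to mean a maximal run of at least two uniquely identified forms.

```python
from collections import Counter

def collect_patterns(sentences, analyze):
    """Count tag sequences over maximal runs of unambiguous tokens.

    `analyze` maps a token to its list of candidate tags; a token counts as
    uniquely identified when that list has exactly one element.
    """
    inventory = Counter()
    for tokens in sentences:
        run = []
        for tok in list(tokens) + [None]:        # None flushes the final run
            tags = analyze(tok) if tok is not None else []
            if len(tags) == 1:
                run.append(tags[0])
                continue
            if len(run) >= 2:                    # keep clusters of two or more
                inventory[tuple(run)] += 1
            run = []
    return inventory

# Toy lexicon standing in for the morphological analyzer (hypothetical).
LEXICON = {
    "sendu": ["Verb.past.3pl"],
    "þeir": ["Noun.nom"],
    "menn": ["Noun.nom", "Noun.acc"],
}
inventory = collect_patterns(
    [["sendu", "þeir", "menn"]],
    lambda tok: LEXICON.get(tok, []),
)
print(inventory)
```

In the example, menn is ambiguous and therefore ends the run, so only the pair for sendu þeir enters the inventory, as in the description above.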
Our disambiguation strategies depend on local clues to correctly disambiguate a form. If no clue is provided, these algorithms fail. A straightforward method to improve their success rate is to expand their application to a global level, i.e. corpus-wide analysis. Here, the idea is to include similar or identical phrases that occur elsewhere in the corpus in the decision-making process.
In Saussurean linguistics, words form a paradigmatic relationship if they occur in the same linguistic environment. For example:
directing {my, the, a, …} call
In this case, the words "my, the, a, …" form a paradigmatic relationship. Conversely, a syntagmatic cluster of words shares the property of occurring with the same form, as in
fiscal {policy, institution, responsibility, year, …}
During a corpus-based paradigmatic analysis, the algorithm finds all occurrences of the context of a given form. Thus, given the text excerpt
Þá mælti Guðrún til sinnar vinkonu
from Völsunga Saga with Guðrún being the current form to disambiguate, a search for phrases with identical context mælti ___ til yields the following results:
Þá mælti Guðrún til Gunnars
Ok er þau vöknuðu, mælti hún til Högna
Þá mælti Bikki til Randvés
The search results provide the disambiguation algorithm with three more opportunities (Guðrún, hún, Bikki) to apply its local context analysis to determine the correct grammatical form.
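The paradigmatic search can be sketched as a scan for a fixed frame around an open slot. The Python below is an illustration only: the function name and the whitespace tokenization are assumptions, and the three sentences are the search results quoted above.

```python
def paradigmatic_search(sentences, left, right):
    """Return every token that fills the slot in the frame `left ___ right`."""
    fillers = []
    for tokens in sentences:
        for i in range(1, len(tokens) - 1):
            if tokens[i - 1] == left and tokens[i + 1] == right:
                fillers.append(tokens[i])
    return fillers

corpus = [
    "Þá mælti Guðrún til Gunnars".split(),
    "Ok er þau vöknuðu, mælti hún til Högna".split(),
    "Þá mælti Bikki til Randvés".split(),
]
print(paradigmatic_search(corpus, "mælti", "til"))
# ['Guðrún', 'hún', 'Bikki']
```

Each filler can then be handed to the local context analysis in turn.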
In similar fashion to the paradigmatic search, the syntagmatic search looks for all occurrences of the form in question. Using the same form Guðrún, a search of the saga text yields multiple results:
Guðrún hét dóttir hans.
Eitt sinn segir Guðrún meyjum sínum at hún má eigi glöð vera.
Guðrún svarar:
"Þar mun vera Guðrún Gjúkadóttir," segir hún.
[…]
For each of these search results, the local dependency algorithms can be applied.
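A syntagmatic search can be sketched as collecting the immediate neighbours of every occurrence of the form. In the Python illustration below, the function name, the one-word context window, the regular-expression tokenization, and the "[Empty]" marker for a missing neighbour at a sentence edge are assumptions for the example.

```python
import re

def syntagmatic_search(sentences, form):
    """Return (left, right) neighbours for each occurrence of `form`."""
    contexts = []
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence)    # drops punctuation and quotes
        for i, tok in enumerate(tokens):
            if tok == form:
                left = tokens[i - 1] if i > 0 else "[Empty]"
                right = tokens[i + 1] if i + 1 < len(tokens) else "[Empty]"
                contexts.append((left, right))
    return contexts

saga = [
    "Guðrún hét dóttir hans.",
    "Eitt sinn segir Guðrún meyjum sínum at hún má eigi glöð vera.",
    "Guðrún svarar:",
    '"Þar mun vera Guðrún Gjúkadóttir," segir hún.',
]
for left, right in syntagmatic_search(saga, "Guðrún"):
    print(left, "___", right)
```

For the four sentences quoted above this yields the contexts [Empty] ___ hét, segir ___ meyjum, [Empty] ___ svarar, and vera ___ Gjúkadóttir.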
Global searches have the advantage of offering multiple opportunities to the disambiguation tool to determine the grammatical nature of a form. Their downside is, however, that a corpus-based search may yield more than one possible solution. The most commonly applied strategy for decision-making is based on calculations of frequency or probability. One such way of deciding on a form which yields multiple solutions is to calculate the mean and variance of the contexts of a particular result. For example, given the above form Guðrún and its multiple contexts from the syntagmatic analysis, we would like to find out which of the contexts
[Empty] ___ hét
segir ___ meyjum
[Empty] ___ svarar
vera ___ Gjúkadóttir
occur relatively often in the corpus at roughly the same distance. To that end, we compute the sample variance of the offsets:
$$ s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 $$
where N is the number of times the context occurs, x_i is the offset between the two contexts at the i-th occurrence, and \bar{x} is the sample mean of the offsets. The square root of this quantity is the standard deviation of a given context; the smaller the variance, the more consistently a given context occurs at the same distance. In turn, this indicates that a context with low variance is more likely to yield the correct grammatical interpretation of a form. This calculation of variance will allow us to assign a score to each result. For each ambiguous form, these scores, the part of speech information and lemma can be automatically encoded in the XML tag. The definition of this element will be added to the Menota handbook.
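The scoring step can be illustrated with the Python standard library. This is a sketch under assumptions: the function names are invented, the offsets are hypothetical integers supplied by the caller, and the sample variance with an N - 1 denominator is used, which is what `statistics.variance` computes.

```python
from statistics import mean, variance

def context_score(offsets):
    """Return (sample mean, sample variance) of the offsets for one context."""
    if len(offsets) < 2:
        raise ValueError("need at least two occurrences to compute a variance")
    return mean(offsets), variance(offsets)      # variance() divides by N - 1

def rank_contexts(offsets_by_context):
    """Order contexts from lowest to highest variance (lower = more consistent)."""
    return sorted(
        offsets_by_context,
        key=lambda ctx: context_score(offsets_by_context[ctx])[1],
    )

# Hypothetical offsets, invented purely to exercise the functions.
observed = {
    "segir ___ meyjum": [1, 1, 1, 1],
    "vera ___ Gjúkadóttir": [1, 4, 2, 7],
}
print(rank_contexts(observed))
```

A context observed only once has no defined sample variance, so the sketch rejects it rather than assigning it a misleading score of zero.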
The level of ambiguity in the above examples, and in Old Icelandic texts in general, ranges from the very low (or nonexistent) to quite high. While the goal of the disambiguation program is not to provide absolute disambiguation (nor is it intended for automatic translation, although it can certainly assist in machine-assisted translation), it should give users of various backgrounds the opportunity to undertake sophisticated and nuanced searches of a large text corpus. Grammatical disambiguation is a multi-faceted linguistic and computational problem. In our opinion, it should be approached by a multi-tiered strategy of local, global, and probability-based solutions.
Expanding the underlying lexical set
A current limitation of the morphological analyzer is the fairly small lexical set of its corpus. Zoëga's subset of Cleasby–Vigfusson has been instrumental in our ability to develop the morphological analyzer but needs to be expanded in order to deal with the lexical diversity of the text corpus. Expanding the lexical set will also result in a refinement of the table of exceptions. Both developments will greatly improve the performance of the morphological analyzer in a real textual environment. Furthermore, expansion of these underlying lexical sets will greatly improve the accuracy of the disambiguation routines.
Initially, we intend to focus on incorporating all of the Cleasby–Vigfusson lexical data into the underlying database. Definitions from Cleasby–Vigfusson will also greatly enhance the usability of the English language lookup tool. As unexpected, rare or unusual forms arise in the saga texts (words not covered by Cleasby–Vigfusson), we will supplement the dataset with information from the ONP. In collaboration with researchers at the ONP, we have already harvested all of the headwords from that project, along with the minimal part-of-speech information currently in their database. As we encounter lemmata not in Cleasby–Vigfusson, we can input information from the non-digital ONP archive via a webform that we will develop specifically for this purpose.
Because of the architecture of our system, all normalized lexicon entries share the same structure regardless of their source document. The normalization process relies on a library of rule objects, and each object contains the layout rules for a particular lexicon, thereby allowing the Normalizer module to correctly interpret lexicon entries. Currently there exists only one rule object that our Normalizer accesses, namely that for Zoëga's dictionary. To integrate the Cleasby and Vigfusson lexicon and harvested lemmata from the ONP (or any other Old Icelandic lexicon, for that matter), our team will create new rule objects and add them to the library. This system of rule objects will allow us to expand the underlying lexical database incrementally, while continuing to work on the more challenging tasks of disambiguation and orthographic normalization.
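The idea of a library of rule objects can be sketched as a registry that maps each lexicon to an object with a common `parse` method. The Python below is illustrative only. The class names, the dictionary keys of the shared record, and in particular the tab-separated headword layout are hypothetical: the text does not describe the layout of the Cleasby–Vigfusson or ONP data.

```python
class ZoegaRule:
    """Layout rule for lines such as 'barna-börn, n. pl. grandchildren;'."""
    def parse(self, raw):
        head, rest = raw.rstrip("; \n").split(",", 1)
        words = rest.split()
        gram = [w.rstrip(".") for w in words if w.endswith(".")]
        gloss = [w for w in words if not w.endswith(".")]
        return {"key": head.replace("-", ""), "headword": head.strip(),
                "pos": " ".join(gram), "gloss": " ".join(gloss)}

class HeadwordListRule:
    """Hypothetical layout: 'headword<TAB>part of speech', with no gloss."""
    def parse(self, raw):
        head, pos = raw.rstrip("\n").split("\t")
        return {"key": head.replace("-", ""), "headword": head,
                "pos": pos, "gloss": ""}

# The library: adding a lexicon means registering one more rule object.
RULE_LIBRARY = {"zoega": ZoegaRule(), "headword-list": HeadwordListRule()}

def normalize(source, raw):
    """Dispatch to the rule object registered for `source`."""
    return RULE_LIBRARY[source].parse(raw)

print(normalize("zoega", "barna-börn, n. pl. grandchildren;"))
print(normalize("headword-list", "maðr\tm"))
```

Because every rule object returns a record with the same keys, the code that consumes the lexicon does not need to know which source an entry came from.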
Orthographic change and time-lining
Many of the Old Icelandic texts available in digital form use different orthographic conventions. While some of these conventions are a matter of simple substitution, others are significantly more complex. Furthermore, diplomatic editions of manuscripts follow orthographic conventions in place during the time of writing. All of these orthographic differences need to be "normalized" for morphological analysis and disambiguation to take place. At the same time, significant information concerning language development exists in the orthographic conventions of a particular era.
We propose to develop a series of normalization routines that will allow any medieval Icelandic text to be normalized to a standard orthography. This standard orthography will be used as a REG field as defined by the MENOTA handbook for the XML markup of the text in question, allowing the original orthography to be accessed by the end user. The end user will also be able to toggle between texts to take advantage of the morphological analyzer and disambiguation routines within a digital library environment. A time-lining function will allow an end user to call up texts written with a particular orthographic convention, as well as the normalized version of that text. There are significant challenges associated with developing such time-lining protocols. Perhaps one of the most challenging elements is that archaic orthographic features tend to re-occur in later manuscripts. It may well be that we will need to develop a specific orthographic module for each individual manuscript; descriptions of the orthographic features of the document will be incorporated into the metadata describing the digital text, allowing it to function with both the normalizer and the time-lining functions. Significantly, the orthographic normalizer will allow us to use digital versions of Old Icelandic texts normalized to modern Icelandic orthography, by renormalizing these texts to Old Icelandic orthography.
We have begun describing the rules for the orthographic normalizer and believe the implementation of these rules will be relatively straightforward, given our implementation of the Normalizer module described above. This does not mean that the task is without challenges. The differences between a diplomatic transcription of a manuscript, standardized Old Icelandic and Modern Icelandic orthography, for example, can be seen in the following short text samples:
Text samples from Victors saga ok Blávus (Loth 1962)
Diplomatic text | Standardized Old Icelandic | Modernized Old Icelandic
...kongr gerdjzt hliodr eirn | ...kóngr gørðisk hljóðr einn | ...kóngur gerðist hljóður einn
dag er þau Alba satu bæði | dag er þau Alba sátu bæði | dag er þau Alba sátu bæði
samt ok tavlvdvzt vid... | samt ok tǫluðusk við... | samt og töluðust við...
...at veitzlunj vt endadri | ...at veizlunni út endaðri | ...að veislunni út endaðri
uoru allir herrar ok | váru allir herrar ok | voru allir herrar og
haufdjngiar vt leyster | hǫfðingjar út leystir | höfðingjar út leystir
med agiætum giofum... | með ágætum gjǫfum... | með ágætum gjöfum...
...sau þeir fostbrædur | sá þeir fóstbrœðr | ...sáu þeir fóstbræður
at þar var allr sioR svartR | at þar var allr sjór svartr | að þar var allur sjór svartur
sem kolum wæri saad... | sem kolum væri sát... | sem kolum væri sáð...
The rules we
expect to develop fall into two main areas, phonology (vowels and consonants)
and morphology. As we expand the range of the orthographic normalizer, rules
will be added to account for incremental changes in orthography from the
earliest writing up through the present (this latter category is of course only
applicable for Old Icelandic texts that have been normalized in the digital
realm to modern Icelandic spelling).
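The "simple substitution" end of this rule set can be sketched as an ordered list of rewrite rules. The Python below is an illustration only, covering a few modern-to-Old-Icelandic correspondences that can be read off the sample table (og/ok, að/at, -ur/-r after a consonant, -ist or -ust/-isk or -usk). The rule list and the function name are assumptions. The rules deliberately ignore vowel changes such as e/ø and æ/œ, and they would wrongly rewrite words in which -ur or -st is not an ending of this kind.

```python
import re

# Ordered (pattern, replacement) pairs; illustrative, not a complete rule set.
MODERN_TO_OLD = [
    (re.compile(r"\bog\b"), "ok"),
    (re.compile(r"\bað\b"), "at"),
    # -ur after a consonant (a word character that is not a listed vowel)
    (re.compile(r"(?<=[^\Waeiouyáéíóúýæö])ur\b"), "r"),
    # middle-voice endings -ist / -ust
    (re.compile(r"(?<=[iu])st\b"), "sk"),
]

def renormalize(text):
    """Apply each substitution rule in order to the whole text."""
    for pattern, replacement in MODERN_TO_OLD:
        text = pattern.sub(replacement, text)
    return text

print(renormalize("að þar var allur sjór svartur"))
print(renormalize("kóngur gerðist hljóður einn"))
```

The first line comes out as at þar var allr sjór svartr, matching the standardized column of the table. The second comes out as kóngr gerðisk hljóðr einn, which still differs from the standardized gørðisk and so shows where phonological rules beyond plain substitution would be needed.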
Ongoing expansion of the text corpus and integration with other systems
The development of normalization routines will immediately allow us to expand the digital corpus on which the morphological analyzer, lookup tool, and disambiguation routines operate to all extant digital editions of Old Icelandic texts. Collaboration with the University of Iceland (Arnamagnaean Institute), the University of Copenhagen (Arnamagnaean Institute) and the Ordbog over det Norrøne Prosasprog will greatly facilitate this process. A collaboration with Matthew Driscoll at the Arnamagnaean Institute in Copenhagen on the ongoing digitization of diplomatic editions of the manuscripts that form the basis of all standard edition Old Icelandic texts further ensures that the corpus will not be limited solely to standardized texts, but rather will afford researchers the opportunity to work online with variant manuscript texts. The XML encoding of all these texts with normalized spelling, part of speech information (from the morphological analyzer) and disambiguation scores (from the disambiguation routines) will greatly enhance the ability of end users to carry out sophisticated searches and analyses of a significant component of the extant Old Icelandic corpus. It will also likely contribute to the eventual creation of a parsed corpus of Old Icelandic.
We will continue to work closely with the Perseus project to integrate the texts and the tools into the Perseus digital library project. We will also continue the development of our own Greenstone Digital Library site at UCLA, and will mirror this site at the University of Copenhagen. We will continue to explore ways in which to integrate the system and the tagged texts with developing systems so as to take advantage of the latest advances in textual analysis and visualization tools, and will also explore exporting the system as a SCORM learning object.
Porting to Old English
Porting our work to another early Germanic language will allow us to test the rules-based approach to automatic morphological analysis and our underlying architecture that separates the Target Language rules from the analyzer itself. At the same time, it will provide a quick and efficient way for the automatic morphosyntactic markup of Standard Edition Old English texts.
We have chosen Old English as our test project for several reasons. Although Old Icelandic and Old English belong to different branches of the Germanic group of the Indo-European language family, their morphological systems are relatively similar to each other. Both languages share the division of nouns, adjectives and verbs into "strong" and "weak", which is inherited from the Germanic proto-language. The stem classes of the various parts of speech are also essentially the same in both languages (see Krahe 1969; specifically for OE, see Campbell 1959).
The Old English morphological analyzer will work primarily with the currently limited online Dictionary of Old English at the University of Toronto as the input for its Lexicon module. Of course, given the architecture of the system, any Old English lexicon can be attached once a rule object for that lexicon has been developed. We will make information on how to write a rule object readily available on our project site so that interested parties can write their own and import their lexica. We expect that the underlying lexical set can be expanded to include the online edition of Bosworth–Toller (1898) as it becomes available.
We consider the porting of the morphological analyzer to be an important test of the scalability of our architecture to other Germanic languages. Old English is complex, yet sufficiently related to Old Icelandic that developing a Target Language module for the morphological analyzer should proceed smoothly. Indeed, similar to the Natural Language component of the analyzer, the ability to handle multiple target languages will be accomplished by adding language objects into the library of target languages. For a specific request, the morphological analyzer accesses the appropriate language object to apply the necessary phonological and morphological rules. We will limit the scope of our Old English Target Language module to the West Saxon dialect (the "standard" Old English dialect), and specifically to nouns and verbs in the first instance. This adaptation of the underlying architecture of our morphological analyzer to Old English will not only help substantiate the applicability of our approach to morphological analysis for Germanic languages in general, but may also extend it to other Indo-European languages.
Work Plan
We propose a three year horizon for the development and implementation of our proposed project.
In the first year:
Assemble and describe rules for orthographic normalization. (Vijunas, months 1-3)
Complete digitization of Cleasby–Vigfusson, and ensure that the lexical database conforms with the requirements of our lexicon module; these materials will be ported to Perseus to expand the reach of the lookup tool for Old Icelandic in their system (Tangherlini and graduate student researchers (GSR), months 1-12)
Develop a system for the incorporation of lemmata from the ONP into the lexical database (Tangherlini, months 1-2)
Optimize the current morphological analyzer for speedier lookup; and refine several routines that occasionally do not return the proper output (Urban, months 1-6)
Devise the second generation orthographic normalization module; and implement the first set of orthographic normalization routines (Urban and PA, months 7-9)
Draw up rules for the most common situations in which ambiguous forms arise (Vijunas, months 3-9)
Develop the alpha version of the disambiguator, including scoring of ambiguous forms in the Legendary sagas. (Urban and PA, months 10-12)
Develop rules for the Old English lexicon normalizer and implement them (Vijunas, months 10-12)
In the second year:
Develop and implement routines for disambiguation of the most commonly occurring situations based on a computer-driven analysis of the Legendary sagas (Vijunas and Urban, months 13-15)
Refine and optimize our proposed algorithms for disambiguation (Urban and PA, months 13-15)
Incorporate The Family Sagas (back-normalized to Old Icelandic) into the underlying text corpus to increase the accuracy of the disambiguation routines and scoring (Tangherlini, months 13-15)
Incorporate our disambiguated texts into a Greenstone Digital Library implementation at UCLA (Tangherlini and GSR, months 15-18)
Continue to analyze and describe ambiguity in Old Icelandic (Vijunas, months 16-20)
Refine the disambiguation routines (Urban and PA, months 16-20)
Identify all ambiguous forms for which the disambiguator cannot provide adequate scoring; explore if routines can be developed for these forms (Vijunas and GSR, months 16-24)
Develop rules for West Saxon verbs and nouns (Vijunas, months 16-24)
Incorporate these rules into a test Target Language module for Old English (Urban and PA, months 21-24)
In the third year:
Expand the orthographical normalizer to account for orthographic change from the 11th to the 15th centuries (GSR, Vijunas and PA, months 25-28)
Refine disambiguation routines and scoring (GSR, months 25-36; Urban and PA, months 25-31)
Refine Old English (West Saxon) analyzer (Urban, Vijunas and PA, months 25-31)
Release Beta-version of the disambiguator, and publish all parameters for rule sets for the adaptation of the system to other Germanic languages (Tangherlini, months 34-36)
Outcomes
Among the most significant outcomes of the project will be a well integrated series of tools that provide for an accurate morphological analysis that accounts for the phonological and morphological complexity of Old Icelandic; an English language lookup based on a nearly comprehensive lexical set for Old Icelandic; orthographic normalization routines that allow for searches, analysis and visualization on a wide range of Old Icelandic texts, irrespective of the orthographic conventions used; and the disambiguation of forms in context allowing for more accurate textual analysis (including pattern matching, clustering, and keyword extraction) as a first step toward a parsed corpus of Old Icelandic.
Our work will make a significant corpus of morpho-syntactically marked texts more accessible for linguistic and comparative research, serving researchers, students and members of the broader public who may have little understanding of the complexity of Old Icelandic or other ancient Scandinavian languages. Coupled to the expanded English-language look-up tool, the morphological analyzer/disambiguator will allow scholars with little background in early Scandinavian languages access to this rich, early prose narrative tradition, and allow them to answer questions of significant complexity. The system can also function as an integral component in the teaching of Old Icelandic. Our extension of this work to Old English will greatly enhance access for the community of scholars, students and members of the general public interested in materials written in that language. Adapting the underlying program to work with other ancient Germanic languages will pave the way for the development of a series of morphological analyzers for Germanic languages in general, as well as potentially allow for cross-corpora comparisons of specific phenomena.
By integrating the analyzer and the disambiguation extensions, along with the lookup tool, into established digital library systems, we take advantage of statistical and visualization tools being developed at other institutions, such as those included in Wordstat and Xaira; those developed at Imperial College as part of CHLT; and those developed as part of the Perseus Project. Tools that make use of texts marked for morpho-syntactical detail allow for highly accurate searches and comparisons within and across corpora. Such searches and analysis can lead to new understandings of relationships between texts, as well as the discovery of hitherto unrecognized aspects of the historical development of these languages.
Finally, our morphological analyzer and the disambiguation extensions will be shared in the open-source community, and will be cognizant of the APIs for various shared learning environments. We will explore the packaging of the system as a SCORM learning object.