Workpackage 5: Neo-Latin Morphological Analyser

Andrea Bozzi, Marco Passorotti, Paolo Ruffullo

ILC – Pisa

Year 3 Executive Summary

May, 2005

In year 3 we completed our CHLT work on the Neo-Latin Lemmatizer, focusing on five areas: (i) management of non-segmented word-forms, (ii) writing DTDs for CHLT-LEMLAT, (iii) creating a reference manual for the use of CHLT-LEMLAT, (iv) integration of LEMLAT into the CHLT-Perseus Digital Library System, and (v) development of future work with CHLT-LEMLAT.

The modifications to the Lemmatiser that took place in Year 3 are the following:

a. Adding gender codes to the LES belonging to ambiguous morphological categories

b. Implementation of new algorithms for the management of non-segmented wordforms

c. Implementation of new algorithms to analyse wordforms with the structure LES + SM + SF

d. Coding of the Type of each adjectival LES

e. Testing the lemmatization results for wordforms with the structure LES + SF

f. Continued modification of the source code to make it clearer and easier to modify

g. Documentation of the implemented functions, data structures and algorithms

h. Development of an automatic morpho-syntactic disambiguator for semi-automatic morpho-syntactic lemmatization

i. Adding an Onomasticon to the LEMLAT lexical basis

j. Structuring the LEMLAT lexical basis according to Word Formation Rules

k. Developing a user-friendly lexicographic workstation for LEMLAT disambiguation

l. Creation of a Latin Lexical Database, in which each LEMLAT lexical entry is related to its dictionary entry

Our CHLT work has transformed the way scholars can work with Latin texts in the following ways:

(i) Managing Latin texts in electronic form, with automatic morphological lemmatisation

(ii) Ability to add new information to the LEMLAT lexical basis (adding lemmas)

(iii) Ability to modify the LEMLAT source code for personal purposes

(iv) Ability to modify the LEMLAT morphological codes for personal purposes

(v) Greater integration between Cultural Heritage documentation in Latin texts and ICT tools and applications

(vi) Implementation of open source versions of software that was previously available only under licence

(vii) Greater collaboration between centres of excellence in the US and Europe in the study of ancient texts and the development of ICT tools for digital scholarship.

Conclusions of CHLT WP5

CHLT-LEMLAT is a useful tool for analysing and filtering large Latin corpora covering a wide span of the history of the language. It fills an urgent need: in a digital environment users can access a multitude of documents on-line, but have no way of filtering their linguistic content. CHLT-LEMLAT offers for the first time a way of lemmatising Latin corpora for the purposes of sophisticated linguistic analysis, and (at the moment) is the most powerful tool available anywhere in the world for the Latin language. The most important thing that CHLT-LEMLAT provides is a lemmatizer whose output can feed a tool for syntactic disambiguation: the disambiguator receives the text as input, reads the word-forms in the syntax parser and chooses the correct analysis of each word-form from those offered by the lemmatizer. For instance, the word-form puella is analysed by the lemmatizer in three possible ways (common noun, first declension, feminine, singular, in the nominative, vocative or ablative case), but in a given syntactic context only one of these values is correct. The task of a syntactic disambiguator is to choose the correct one.

Future Work: Dissemination and Exploitation of Results

The aim should be the development of a multi-modular tool that allows the user to query a corpus of Latin texts, with the thought that it will stand as a paradigm for future work in other languages.

The kinds of query we would like to be able to answer are the following:

Ø Morphological Queries: on a purely morphological level, for instance, the user can find all the wordforms inflected as first-declension singular genitives in -ai occurring in the texts of Cicero. The homographs are not disambiguated;

Ø Morpho-syntactic Queries: the homographs are disambiguated. The user can find out whether and where a particular kind of syntactic structure occurs in the texts of Cicero;

Ø Semantic Queries: on a semantic level, the user searches for the word love (in English) and obtains all the lemmas whose semantic definition contains love as first, second, third,... meaning, metaphorical use, technical use...;

Ø Statistical Levels: each lemma is accompanied by its frequency of use in the corpus (structured per author, age, book, style of the book,...). Each wordform is bound to its morphological (no disambiguation of the homographs) and morpho-syntactic frequency in the corpus. Each lemma is part of a "semantic family" (SF) and a "morphological family" (MF): an SF contains all the lemmas having a common meaning in the definition; an MF contains all the lemmas that have a common stem in the stemming procedure.

Ø Multilingual Queries: Greek-Latin relationship through English: all the Latin lemmas are linked to the corresponding Greek lemma, selected through the common meaning in the dictionary.

The general structure of the analysis of a text is the following (the example is in Latin, but is suitable for other languages):

1. Input Latin text (from the CHLT corpus),

2. Morphological analysis (CHLT LEMLAT),

3. Morpho-syntactic analysis (Stemming and Syntactic Parser),

4. Dictionary entry (lemma) with (a) statistical information, (b) structured semantic description (SF and MF) and (c) link to Greek dictionary.

The division of possible Workpackages:

Ø WP1: development of the actual CHLT corpus of Latin texts (we need even more texts);

Ø WP2: development of CHLT LEMLAT. We need:

o a wider lexical basis, in order to cover at least the medieval lexical extension and the proper names (Onomasticon),

o for the stemming, to reduce the number of LES by adding lists of affixes and, thus, rules of morphological derivation (for instance, designing a set of rules such as the one that creates adjectives in -bilis from verbs: amabilis);

Ø WP3: a syntactic parser (to disambiguate the homographs);

Ø WP4: to extract statistical information from the CHLT corpus;

Ø WP5: structuring the semantic description of the lemmas in the dictionary and Greek-Latin linking.

The results of such a multi-modular tool can be applied in a more general framework and be extended to the following areas:

Ø Education: e-learning,

Ø Digital libraries: information retrieval from Latin texts in digital format,

Ø Research: linguistics, lexicography, grammatical theories.

CHLT Deliverable 5.3: Documentation for Lemmatisation Module for Early Modern Latin (Month 30)

Reference Manual for CHLT-LEMLAT

LEMLAT

Wordforms analysis

Database description

Key to Codes

o LES: the invariable part of the inflected forms;

o SM (Segmento Mediano): the middle part of the inflected forms;

o SF (Segmento Finale): the final part of the inflected forms;

o SI (Segmento Iniziale): the initial part of the inflected forms;

o SPF (Segmento Post Finale): a segment added on the right side of the final part of a wordform;

o COD LES: it is the code assigned to each LES; each COD LES refers to a particular type of inflexion;

o COD LEM: it is the code assigned to each output lemma; each COD LEM refers to a general type of inflexion;

o FE (Forma Eccezionale): exceptional wordform. A wordform inflected in an exceptional way that cannot be regularly segmented and recognised;

o LE (Lemma Eccezionale): exceptional lemma. A lemma created in an exceptional way that cannot be automatically created;

o CLEM (Costellazione LEMmatica): contains all the LES related to a common lemma, or common dictionary entry; it is referred to through a unique N_ID

o Ipolemma: intermediate lemma produced in output, not referring to a dictionary entry;

o Iperlemma: lemma produced in output referring to a dictionary entry

o N_ID: alphanumeric code applied to all the LES. More LES can share the same N_ID: all the LES related to a common lemma, or common dictionary entry are registered with the same N_ID (forming a CLEM)

o CodLE: numeric code of LE, related to pattern(s) of 7 EAGLES codes bringing morphological information about the wordforms

o EAGLES (Expert Advisory Group on Language Engineering Standards): standard coding of morphological, morpho-syntactic and semantic information of the words. In LEMLAT, 3 EAGLES codes are related to lemmas, and 7 to wordforms

Analysis of Wordforms

Given an input wordform that can be analysed, LEMLAT produces in output:

- the corresponding lemma(s);

- a code expressing the inflexional paradigm of the lemma(s) (codlem)

- the n_id of the lemma(s) CLEM (see table “lessario”)

- 3 EAGLES codes (converted from codlem) related to the lemma (one pattern of 3 EAGLES codes for each lemma produced in output), with information about (see “cod_morf” table):

o P(osition)1: PoS

o P2: Type (different possible types of each PoS; for instance, a noun can have Type “common”, or “proper”)

o P3: Flexional Category (declension, conjugation,…)

- pattern(s) of 7 EAGLES codes related to the wordform, with information about (see “cod_morf” table):

o P4: mood

o P5: tense

o P6: case

o P7: gender

o P8: number

o P9: person

o P10: degree
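The ten positions listed above can be treated as a fixed-width record. The following is a minimal sketch, not LEMLAT's own code: the key names are our own spellings of the position labels, and the sample string is the pattern given for the fe amarei in the Examples section of this manual. The meaning of each individual code is defined in the “cod_morf” table and is not interpreted here.

```python
from typing import Dict

# Labels follow the P1-P10 list above; the key spellings are assumptions.
POSITIONS = (
    "pos", "type", "flexional_category",            # P1-P3: lemma codes
    "mood", "tense", "case", "gender",              # P4-P7: wordform codes
    "number", "person", "degree",                   # P8-P10: wordform codes
)


def split_pattern(pattern: str) -> Dict[str, str]:
    """Map a 10-character output pattern onto its named positions."""
    if len(pattern) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} codes, got {len(pattern)}")
    return dict(zip(POSITIONS, pattern))


def lemma_codes(pattern: str) -> str:
    """Positions 1-3, the codes related to the lemma."""
    return pattern[:3]


def wordform_codes(pattern: str) -> str:
    """Positions 4-10, the codes related to the wordform."""
    return pattern[3:]


fields = split_pattern("VmFh1-----")
```

Keeping the pattern as a plain string and slicing it mirrors the split between the 3 lemma codes and the 7 wordform codes described above.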

This analysis is obtained through a process of segmentation/recognition of input wordforms.

For each input wordform, LEMLAT makes a number of segmentation attempts.

When one of these attempts is consistent with the LEMLAT data about the possible segments of wordforms, the analysis of the wordform is produced in output.

There are three possible segmentation structures:

1. LES + SF

2. LES + SM + SF

3. LES + SM + SM + SF

Each of these structures can be preceded by an SI and followed by an SPF.

In addition to the segmentation process, a wordform can also be recognised (and, thus, analysed) with no segmentation, in the following cases:

- Input wordform is a FE

- Input wordform is a LE

- Input wordform is a les with codles “i” (invariables)

- Input wordform is a les with codles “n” (uninflected nouns)

- Input wordform is a les with codles “v” (verbs not related to a specific conjugation)

- Input wordform is a les with codles “pr”, or “p1-p9”, or “p18” (not segmented pronominals)

Each of these structures, too, can be preceded by an SI and followed by an SPF.

A segmentation is valid if its segments are found to be compatible with each other (on the left and/or on the right side). The compatibility of the segments is coded along with the segments themselves (see “lessario”, “tabsf”, “tabsm”, “tabsi”, “tabspf” tables).

For instance, a structure such as

LES + SM + SF

is found valid if:

- left compatibility of SM corresponds to codles (that is, to the right compatibility of LES)

- right compatibility of SM corresponds to left compatibility of SF
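The two conditions above can be sketched as a simple check. This is only an illustration under stated assumptions: each table row is assumed to carry a single compatibility code that is compared by equality, the class names are hypothetical, and the SM and SF rows (including the code “x1”) are invented for the demo. Only the les “am” with codles “v1r” is taken from the examples in this manual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Les:
    text: str
    codles: str          # acts as the right compatibility of the LES


@dataclass(frozen=True)
class Sm:
    text: str
    comp_cod_prec: str   # code accepted on the SM left side
    comp_cod_succ: str   # code offered on the SM right side


@dataclass(frozen=True)
class Sf:
    text: str
    comp_cod: str        # code accepted on the SF left side


def is_valid_les_sm_sf(les: Les, sm: Sm, sf: Sf) -> bool:
    """Check the two compatibility conditions for LES + SM + SF."""
    left_ok = sm.comp_cod_prec == les.codles
    right_ok = sm.comp_cod_succ == sf.comp_cod
    return left_ok and right_ok


les = Les("am", "v1r")                 # from the amo example in this manual
sm = Sm("ab", "v1r", "x1")             # invented row, hypothetical code "x1"
sf_good = Sf("at", "x1")               # invented row
sf_bad = Sf("at", "x2")                # invented row with a mismatching code

valid = is_valid_les_sm_sf(les, sm, sf_good)
invalid = is_valid_les_sm_sf(les, sm, sf_bad)
```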

In order to produce output information:

If the input wordform is segmented:

- lemma and codlem (3 EAGLES lemma codes): produced according to codles (see “eagles” table and annex –2-)

- pattern(s) of 7 EAGLES wordform codes: from SF (and SM) coding (see “tabsf” and “tabsm” tables)

If the input wordform is not segmented:

- in case of LE:

o codlem (3 EAGLES lemma codes): according to codles (see “eagles” table and annex –2-)

o pattern(s) of 7 EAGLES wordform codes: from codLE (each LE is related to a codLE, that brings the seven EAGLES codes pattern(s) of the wordform; see “cod_le” and “tab_le” tables)

o lemma: LE itself (possibly, reduced to an iperlemma)

- in case of LES with codles “i”:

o patterns of 10 EAGLES codes (3 lemma codes + 7 wordform codes): 1-3 converted from codlem (see “eagles” table); 4-10 automatically assigned as “-------”

o lemma: produced according to codles (see annex –2-), or to information related to concerned les on table “lessario”

- in case of les with codles “FE”, “n”, “v”, “pr”, “p1-p9”, or “p18”:

o pattern(s) of 10 EAGLES codes (3 lemma codes + 7 wordform codes): from hard-coding of each les with codles “FE”, “n”, “v”, “pr”, “p1-p9”, or “p18” (see “forme_ecc” table)

o lemma: produced according to codles (see annex –2-), or to information related to concerned les on table “lessario”

Each segmentation can produce analysis related to more than one lemma.

When a segmentation is found valid and the analysis is performed, LEMLAT does not stop the process, but makes further segmentation/recognition attempts: a wordform can be segmented (and analysed) in more than one way. Equally, the same wordform can be analysed both through segmentation and without segmentation (for example, a wordform showing homography between a regular, segmented form and an unsegmented one such as an FE).
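The behaviour described above (collect every valid analysis instead of stopping at the first) can be sketched as follows. The sketch covers only the LES + SF structure and unsegmented recognition, and it is not LEMLAT's implementation. The les “abig” with codles “n1” comes from the abiga example in this manual; the SF rows and the FE homograph are assumptions invented for the demo.

```python
from typing import Dict, List, Tuple

# Toy tables standing in for "lessario" and "tabsf".
LES_TABLE: Dict[str, List[str]] = {"abig": ["n1"]}            # les -> codles
SF_TABLE: Dict[str, List[str]] = {"a": ["n1"], "ae": ["n1"]}  # SF -> compatible codles (assumed)
UNSEGMENTED: Dict[str, List[str]] = {"abiga": ["fe"]}         # invented FE homograph

Analysis = Tuple[str, str, str]  # (les, sf, codles)


def analyse(wordform: str) -> List[Analysis]:
    """Return every analysis found, segmented or not."""
    results: List[Analysis] = []

    # Recognition without segmentation (FE, invariables, and so on).
    for codles in UNSEGMENTED.get(wordform, []):
        results.append((wordform, "", codles))

    # Try every split point; do not stop at the first valid segmentation.
    for cut in range(1, len(wordform)):
        les, sf = wordform[:cut], wordform[cut:]
        for codles in LES_TABLE.get(les, []):
            if codles in SF_TABLE.get(sf, []):
                results.append((les, sf, codles))
    return results


both = analyse("abiga")
```

With these toy tables, `both` holds one unsegmented and one segmented analysis of the same wordform, which is the homography case mentioned above.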

The analysis of a wordform performed by LEMLAT can be summarised according to the following schema:

Database Tables

o lessario

o cod_le

o cod_morf

o eagles

o forme_ecc

o tab_le

o tabsf

o tabsm

o tabspf

o tabsai

o tabsi

o graph_vars

lessario

List of the les.

- n_id

o clem identification number

o values:

§ letter (first letter of the lemma)

§ four numbers

- gen

o gender

o values: see “cod_morf” table, field “field_pos”, value “7”

- clem

o in a clem containing more than 1 les, identifies the les through which the lemma has to be created

o values:

§ v: identifies the les through which the lemma has to be created

§ i: for superlative and comparative forms of irregular participle and irregular gerundive, the second lemma created (participle, or gerundive at positive degree) is an ipo- and not an iperlemma

§ k: stops the creation of the iperlemma (value “v” is inhibited)

- si (Segmento Iniziale)

o initial alteration h

o value:

§ h: the les appears also with an initial h

- smv (Segmento Mediano Verbale)

o automatic insertion/exclusion of smv

o values:

§ +: adds a smv to the les, to automatically create the regular basis for perfect and future participle, and perfectum

§ –: adds a smv to the les, to automatically create the regular basis for comparative, superlative, present participle, gerund and gerundive

§ blank: no smv to be added (irregular inflections)

- spf (Segmento PostFinale)

o adds/cuts a spf to les

o values:

§ 3: exclusion of –que (enclitic)

§ see “tabspf” table, field “comp_cod”

- les

- codles

o values: see annex -1-; see table “eagles”, field “codles”

- lem

o LE:

§ a complete form

NOTE: in case of homography between two, or more lemmas, if the only difference among them is the length of a vowel, this is recorded in LE as follows:

· one quote (‘) after the involved vowel: the vowel is short

· two quotes (‘’) after the involved vowel: the vowel is long

§ a SF to be added to les

§ =: the lemma is identical to the les

if more than one LE is concerned, the LE are divided by a slash

o if no LE is recorded, the lemma is created by automatically adding a SF to the les, according to a rule depending on codles; see annex –2-

- s_omo

o homographic lemma

o values:

§ A: homographic lemma A

§ B: homographic lemma B

- più

o more les in the same clem, but none with “v” in clem field

o values:

§ +

- codlem

o manually recorded if cannot be automatically assigned according to codles

o see annex –3-; see table “eagles”, field “codlem” for the correspondence codles/codlem

- type

o manual recording of Type

- codLE

o in case of LE, exclusion of the 7-10 position codes in output patterns

o values: see “cod_le” table

- pt

o pluralia tantum

o values:

§ x: exclusion of patterns with code “s” in position 8

- a_gra

o graphic alteration

o values: see “tabsai” table

- gra_u

o les possibly divided in two parts

o values

§ x

- notes

- pr_key

o identification number of the les

- ts

o Time Stamp: last time when the line has been modified
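As an illustration of how one of the fields above acts on the output, the “pt” field (pluralia tantum) can be read as a filter on the 10-position patterns. This is a minimal sketch, not LEMLAT's code: the function name is hypothetical and the two sample patterns are invented (in particular, the code “p” in position 8 is an assumption).

```python
from typing import Iterable, List


def filter_pluralia_tantum(patterns: Iterable[str], pt: str) -> List[str]:
    """Drop patterns with code "s" in position 8 when pt is "x"."""
    if pt != "x":
        return list(patterns)
    # Position 8 of the pattern is index 7 in a 0-based string.
    return [p for p in patterns if p[7] != "s"]


candidates = ["N------s--", "N------p--"]   # invented sample patterns
kept = filter_pluralia_tantum(candidates, "x")
```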

cod_le

List of codes and values for LE analysis.

- cod_LE

o codLE: in the analysis of an LE, adds the codes from c04 to c10. See “cod_morf” table for codes values

- c04

o codes in position 4

- c05

o codes in position 5

- c06

o codes in position 6

- c07

o codes in position 7

- c08

o codes in position 8

- c09

o codes in position 9

- c10

o codes in position 10

- pr_key

o identification number of the codLE

- ts

o Time Stamp: last time when the line has been modified

cod_morf

Description of codes/values/attributes occurring in the 10 positions output patterns.

- field_pos

o position in the pattern

o values: 1-10

- field_descr

o description of the field value

- value_descr

o description of the attribute for each field

- value

o description of the code for each attribute/field

- ts

o Time Stamp: last time when the line has been modified

EAGLES

Conversion codles/codlem/1-3 position codes (lemma codes)

- codles

o codles list

- codlem

o codlem corresponding to codles recorded on the same line

- c01

o codes in position 1

- c02

o codes in position 2

- c03

o codes in position 3
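The codles/codlem conversion held in this table amounts to a lookup. The sketch below is an assumption about how such a lookup could be written, not the actual implementation: the pairs are copied from Appendix I of this manual, the c01-c03 values are left out because they are not listed there, and the override reflects the “codlem” field of “lessario”, which is recorded manually when the codlem cannot be assigned from codles.

```python
from typing import Dict, Optional

# A few codles -> codlem pairs as listed in Appendix I.
CODLES_TO_CODLEM: Dict[str, str] = {
    "n1": "n1", "n1e": "n1",
    "n2": "n2", "n2i": "n2", "n2n": "n2",
    "n21": "n2g",
    "n31": "n3b",
    "n41": "n4s",
    "n5": "n5",
}


def codlem_for(codles: str, manual_codlem: Optional[str] = None) -> str:
    """Return the codlem for a codles, preferring a manually recorded value."""
    if manual_codlem:
        return manual_codlem
    if codles not in CODLES_TO_CODLEM:
        raise KeyError(f"no codlem known for codles {codles!r}")
    return CODLES_TO_CODLEM[codles]


automatic = codlem_for("n31")
manual = codlem_for("fe", manual_codlem="vp")   # as in the amassint example
```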

forme_ecc

Hard-Coding of exceptional wordforms pattern(s).

- les_id

o link to corresponding line in lessario table (pr_key field)

- add_lem

o link to a second lemma through pr_key field in lessario table

- enc

o presence of an enclitic

- c01

o codes in position 1

- c02

o codes in position 2

- c03

o codes in position 3

- c04

o codes in position 4

- c05

o codes in position 5

- c06

o codes in position 6

- c07

o codes in position 7

- c08

o codes in position 8

- c09

o codes in position 9

- c10

o codes in position 10

- pr_key

o identification number of the line

- ts

o Time Stamp: last time when the line has been modified

tab_le

List of LE recorded along with its own codLE

- lemma

o list of LE

- codLE

o codLE

o Value: see “cod_le” table, field “cod_LE”

- les_id

o link to corresponding line in lessario table (pr_key field)

- pr_key

o identification number of the line

tabsf

List of SF and related codes patterns.

- segment

o SF

- comp_cod

o codles compatible on SF left side

- c01

o codes in position 1

- c02

o codes in position 2

- c03

o codes in position 3

- c04

o codes in position 4

- c05

o codes in position 5

- c06

o codes in position 6

- c07

o codes in position 7

- c08

o codes in position 8

- c09

o codes in position 9

- c10

o codes in position 10

- ex

o example

- pr_key

o identification number of the line

- ts

o Time Stamp: last time when the line has been modified

tabsm

List of SM and related codes patterns.

- segment

o SM

- pm

o +: if “+” is recorded in field “smv” (table “lessario”), automatically creates ipolemmas of perfectum, supine, future and perfect participle

o –: no ipolemma is created through the SM occurring in field “segment”

- comp_cod_prec

o codles compatible on SM left side

- comp_cod_succ

o codles compatible on SM right side

- c01

o codes in position 1

- c02

o codes in position 2

- c03

o codes in position 3

- c04

o codes in position 4

- c05

o codes in position 5

- c06

o codes in position 6

- c07

o codes in position 7

- c08

o codes in position 8

- c09

o codes in position 9

- c10

o codes in position 10

- ex

o example

- pr_key

o identification number of the line

- ts

o Time Stamp: last time when the line has been modified

tabspf

SPF compatibility.

- segment

o SPF

- comp_cod

o compatibility on SF left side

- pr_key

o identification number of the line

- ts

o Time Stamp: last time when the line has been modified

tabsai

Initial graphic alteration. Related to “a_gra” field in lessario.

- segment

o Initial altered segment

- comp_cod

o compatibility code

- pr_key

o identification number of the line

- ts

o Time Stamp: last time when the line has been modified

tabsi

Initial segment. Related to “si” field in lessario.

- segment

o Initial segment

- comp_cod

o compatibility code

- pr_key

o identification number of the line

- ts

o Time Stamp: last time when the line has been modified

graph_vars

Graphical variation in the les.

- gv_code

o Code of graphical variation; recorded in field “a_gra” of “lessario” table

- gv_pos

o Ordinal number of occurrence position (in the les) of the letter after/before which the variation is applied

o Numeric code:

§ 1: the variation is applied after/before the first occurrence of the letter concerned by the variation (this letter is recorded in “gv_out” field)

§ 2: the variation is applied after/before the second occurrence of the letter concerned by the variation (this letter is recorded in “gv_out” field)

§ …

- gv_in

o graphical form appearing in input wordform

- gv_out

o output graphical variation (letter concerned by graphical variation)

o graphical variation is applied to input wordform to retrieve the involved les in table “lessario” (the les is recorded with no graphical variation)

- ts

o Time Stamp: last time when the line has been modified

New Codes

Adding a new les

To add a new les into an already existing clem:

- table “lessario”: identify the clem into which the new les has to be added

- add a new line in the table

- write clem n_id in “n_id” field

- write the new les in “les” field

- compile the “codles” field

Compiling these three fields is obligatory; the others should be compiled according to the inflection of the les to be added.

Remember:

- if the new les has been added into a clem previously formed by only one les

and

- if the wordforms formed with the newly added les should be lemmatized under the lemma created with the previously registered les,

- then, on the line of the previously registered les, add the code “v” into the field “clem”

- if the codles of the new les is “FE”, or “v”, or “n”, or “pr”, or “p1-p9”, or “p18”

- write the code pattern(s) related to the analysis of the new les in “forme_ecc” table, linking the two tables (“lessario” and “forme_ecc”) by pasting in field “les_id” (in “forme_ecc” table) the numeric value appearing in the field “pr_key” of the new les line in “lessario”.

- if the new les should be lemmatised under a new LE (recorded in field “lem”)

- this LE has to be recorded in “tab_le” table along with its own codLE

Adding a new clem

To add a new clem:

- table “lessario”: identify a clem n_id still available

- add a new line in the table

- write clem n_id in “n_id” field

- write the new les in “les” field

- compile the “codles” field and other necessary fields

Adding a new SF

To add a new SF along with its compatibility code(s) and EAGLES codes pattern(s):

- in “tabsf” table: add a new line

- in “segment” field, write the new SF

- in “comp_cod” field, write codles compatible on SF left side

- in “c01-c10” fields, write the code pattern(s) related to the new SF. Note: only the codes recorded in “c04-c10” are active: in output analysis, the first 3 codes are, in fact, converted from codlem. The fields “c01”, “c02”, “c03” are nevertheless included to make recording and viewing the data more comfortable; otherwise, they can be hidden.

Adding a new SM

To add a new SM along with its compatibility code(s) and EAGLES codes pattern(s):

- in “tabsm” table: add a new line

- in “segment” field, write the new SM

- in “comp_cod_prec” and “comp_cod_succ” fields, write the codles compatible on the SM left and right sides, respectively

- in “c01-c10” fields, write the code pattern(s) related to the new SM. Note: only the codes recorded in “c04-c10” are active: in output analysis, the first 3 codes are, in fact, converted from codlem. The fields “c01”, “c02”, “c03” are nevertheless included to make recording and viewing the data more comfortable; otherwise, they can be hidden.

In SM coding the code “=” means that in the final analysis of the input wordform, the code to appear in this position of the pattern is the coding appearing in the same position of the pattern in the coding of the SF occurring in that wordform (on the right side of SM).
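The role of “=” can be sketched as a position-by-position merge of the SM and SF patterns. This is an illustration only: the manual defines the meaning of “=” but does not say what happens in the other positions, so the sketch assumes that any SM code other than “=” is kept as it is. The function name and the two 7-code sample patterns (positions 4-10) are invented.

```python
def merge_sm_sf(sm_codes: str, sf_codes: str) -> str:
    """Resolve "=" in the SM pattern with the SF code in the same position."""
    if len(sm_codes) != len(sf_codes):
        raise ValueError("SM and SF patterns must have the same length")
    return "".join(
        sf if sm == "=" else sm          # assumption: other SM codes win
        for sm, sf in zip(sm_codes, sf_codes)
    )


# Invented patterns for positions 4-10; letters are placeholders, not real codes.
merged = merge_sm_sf("ab==-=-", "--cd-e-")
```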

Adding a new SI

To add a new SI along with its compatibility code:

- in “tabsi” table: add a new line

- in “segment” field, write the new SI

- in “comp_cod” field, write compatible code appearing in “si” field of “lessario” table

Adding a new SPF

To add a new SPF along with its compatibility code:

- in “tabspf” table: add a new line

- in “segment” field, write the new SPF

- in “comp_cod” field, write compatible code appearing in “spf” field of “lessario” table

Adding a new codLE:

To add a new codLE along with its EAGLES codes pattern(s):

- in “cod_le” table: add a new line

- in “cod_LE” field, write the new code (make sure to use a code that is not already in use)

- in “c04-c10” fields, write code pattern(s) related to the new codLE

Adding a new morphological code (EAGLES)

To add a new morphological code:

- in “cod_morf” table: add a new line

- in “field_pos” field, write the position of the new code in the analysis pattern

- in “field_descr” field write the value related to the chosen position

- in “value_descr” field write the attribute related to the new morphological code

- in “value” field write the new code. Pay attention that the code is not already used

Adding a new codles

To add a new codles along with related codlem and first 3 EAGLES codes:

- in “eagles” table: add a new line

- in “codles” field, write the new codles

- in “codlem” field, write the codlem related to the new codles

- in “c01/c02/c03” write the first three EAGLES codes (lemma codes) corresponding to codlem related to new codles

Examples

I declension noun

- lexical entry: abiga, -ae

- inflection: regular; I declension

- gender: feminine

- table: lessario

- add a new line

- write in field “n_id” a new n_id (not already used)

remember that the first letter of n_id is the same as the first letter of the lemma

- write “f” (feminine) in field “gen”

- write “abig” in field “les”

- write “n1” in field “codles”
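The steps for abiga can be sketched as building one “lessario” row with a few sanity checks. This is a hedged illustration, not a description of the actual database interface: the function name is hypothetical, “four numbers” is read as four digits, the first-letter comparison is assumed to be case-insensitive, and the n_id “a9999” is an invented placeholder rather than a real free identifier.

```python
import re
from typing import Dict

# n_id: first letter of the lemma followed by four digits (assumed format).
N_ID_RE = re.compile(r"[A-Za-z][0-9]{4}")


def new_les_row(n_id: str, lemma: str, les: str, codles: str,
                **optional: str) -> Dict[str, str]:
    """Build a lessario row; n_id, les and codles are obligatory."""
    if not N_ID_RE.fullmatch(n_id):
        raise ValueError(f"malformed n_id: {n_id!r}")
    if n_id[0].lower() != lemma[0].lower():
        raise ValueError("n_id must start with the first letter of the lemma")
    if not les or not codles:
        raise ValueError("les and codles are obligatory")
    row = {"n_id": n_id, "les": les, "codles": codles}
    row.update(optional)                 # e.g. gen, clem, smv, a_gra
    return row


# "a9999" is a made-up n_id used only for this example.
row = new_les_row("a9999", "abiga", "abig", "n1", gen="f")
```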

I conjugation verb with some fe

- lexical entry: amo, -are

- inflection: regular; I conjugation;

presence of the following fe:

· amarei: present passive infinitive

· amassint: active perfect subjunctive, third person plural; active future perfect indicative, third person plural

· amassis: active perfect subjunctive, second person singular; active future perfect indicative, second person singular

· amasso: active future perfect indicative, first person singular

· ameminor: passive future imperative, second person plural

- table: lessario

- add a new line

- write in field “n_id” a new n_id (not already used)

- write “v” in field “clem”, to use the data on this line to create the iperlemma

- write “+” in field “smv”, to automatically create the regular basis for perfect and future participle, and perfectum

- write “am” in field “les”

- write “v1r” in field “codles”

- add another line

- write in field “n_id” the same n_id used for the previous line

- write “amarei” in field “les”

- write “fe” in field “codles”

- add another line

- write in field “n_id” the same n_id used for the previous line

- write “amassint” in field “les”

- write “fe” in field “codles”

- write “amaui” in field “lem”, to create ipolemma “amaui” (basis of perfectum)

- write “vp” in field “codlem”, to assign codlem “vp” to ipolemma “amaui”

- add another line

- write in field “n_id” the same n_id used for the previous line

- write “amassis” in field “les”

- write “fe” in field “codles”

- write “amaui” in field “lem”, to create ipolemma “amaui” (basis of perfectum)

- write “vp” in field “codlem”, to assign codlem “vp” to ipolemma “amaui”

- add another line

- write in field “n_id” the same n_id used for the previous line

- write “amasso” in field “les”

- write “fe” in field “codles”

- write “amaui” in field “lem”, to create ipolemma “amaui” (basis of perfectum)

- write “vp” in field “codlem”, to assign codlem “vp” to ipolemma “amaui”

- add another line

- write in field “n_id” the same n_id used for the previous line

- write “ameminor” in field “les”

- write “fe” in field “codles”

- table “forme_ecc”

- in field “les_id”, copy/paste the number occurring in field “pr_key” in table “lessario” on the line of fe “amarei”

- in the fields “c01-c10”, write the following codes: VmFh1----- (I conjugation verb, present passive infinitive). See table “cod_morf” for details about codes and positions

- the same for the other fe
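The “forme_ecc” step can be sketched as spreading the 10-character pattern over the fields “c01” to “c10” and linking the row to “lessario” through “les_id”. This is a minimal sketch under assumptions: the function name is hypothetical, and the value 12345 stands in for whatever number appears in “pr_key” on the amarei line.

```python
from typing import Dict, Union


def forme_ecc_row(les_id: int, pattern: str) -> Dict[str, Union[int, str]]:
    """Build a forme_ecc row from a lessario pr_key and a 10-code pattern."""
    if len(pattern) != 10:
        raise ValueError("a pattern must contain exactly 10 codes")
    row: Dict[str, Union[int, str]] = {"les_id": les_id}
    for position, code in enumerate(pattern, start=1):
        row[f"c{position:02d}"] = code   # c01 ... c10
    return row


# 12345 is a placeholder for the pr_key of the "amarei" line in lessario.
amarei = forme_ecc_row(12345, "VmFh1-----")
```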

III conjugation verb with irregular perfect/future participle

- lexical entry: abigo, -ere

- inflection: III conjugation;

variant: abago, -ere

perfectum basis: abeg

perfect participle basis: abact

- table: lessario

- add a new line

- write in field “n_id” a new n_id (not already used)

- write “v” in field “clem”, to use the data on this line to create the iperlemma

- write “abig” in field “les”

- write “v3r” in field “codles”

- add a new line

- write in field “n_id” the same n_id used for the previous line

- write “abag” in field “les”

- write “v3r” in field “codles”

- add a new line

- write in field “n_id” the same n_id used for the previous line

- write “abeg” in field “les”

- write “v7s” in field “codles”

- add a new line

- write in field “n_id” the same n_id used for the previous line

- write “abact” in field “les”

- write “n41” in field “codles” (for supine)

- add a new line

- write in field “n_id” the same n_id used for the previous line

- write “i” in field “clem”

- write “abact” in field “les”

- write “n6p1” in field “codles” (for perfect participle)

- add a new line

- write in field “n_id” the same n_id used for the previous line

- write “i” in field “clem”

- write “abactur” in field “les”

- write “n6p2” in field “codles” (for future participle)

III declension noun with lemma created through substitution of codles ending

- lexical entry: raucedo, -inis

- inflection: III declension –o, -inis

- gender: feminine

- table: lessario

- add a new line

- write in field “n_id” a new n_id (not already used)

- write “f” (feminine) in field “gen”

- write “raucedin” in field “les”

- write “n31” in field “codles”

III declension noun with initial graphic alteration

- lexical entry: abscessio, -onis

- inflection: III declension –o, -onis

variant: apscessio, -onis

- gender: feminine

- table: lessario

- add a new line

- write in field “n_id” a new n_id (not already used)

- write “f” (feminine) in field “gen”

- write “abscession” in field “les”

- write “n31” in field “codles”

- write “b02” in field “a_gra” (for details: see “tabsai” table)

I declension noun with a graphical variation

- lexical entry: carruca, -ae

- inflection: I declension

variant: carrucha, -ae

caruca, -ae

carucha, -ae

- gender: feminine

- table: lessario

- add a new line

- write in field “n_id” a new n_id (not already used)

- write “f” (feminine) in field “gen”

- write “carruc” in field “les”

- write “n1” in field “codles”

- write “h12” in field “a_gra” (for details: see “graph_vars” table)

- add a new line

- write in field “n_id” the same n_id used for the previous line

- write “f” (feminine) in field “gen”

- write “caruc” in field “les”

- write “n1” in field “codles”

- write “h12” in field “a_gra” (for details: see “graph_vars” table)

Code “h12”:

- “h12”: variation “ch(“gv_in”)/c(“gv_out”)”

- “h12”: graphical variation is related to the second occurrence (“2”) of “c” in the les

o “c(1)arruc(2)”: graphical variation “ch/c” is related to “c” (2)

o input les “carruch” (“ch”: “gv_in”) is transformed into “carruc” (“c”: “gv_out”) to retrieve the involved les in table “lessario”
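The “h12” behaviour can be sketched as a small normalisation function. This is an illustration, not the LEMLAT algorithm: the function name is hypothetical, and the sketch assumes that “gv_in” begins with the letter recorded in “gv_out” (true for “ch”/“c”), so it does not cover variations applied before the letter.

```python
def normalise(form: str, gv_in: str, gv_out: str, gv_pos: int) -> str:
    """Replace gv_in with gv_out at the gv_pos-th occurrence of gv_out."""
    start = -1
    for _ in range(gv_pos):
        start = form.find(gv_out, start + 1)
        if start == -1:
            return form                  # fewer occurrences than gv_pos
    if form.startswith(gv_in, start):
        return form[:start] + gv_out + form[start + len(gv_in):]
    return form                          # no variation at that position


# Code "h12": variation ch/c on the second occurrence of "c".
les_a = normalise("carruch", "ch", "c", 2)
les_b = normalise("caruch", "ch", "c", 2)
```

The normalised strings can then be looked up in “lessario”, where the les is recorded without the graphical variation.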

Appendix I

cod-les

List of the available cod-les, each with its cod-lem and semantic description.

Not related to SF

- fe (codlem: iperlemma codlem): exceptional wordforms

- i (codlem: i): invariables

- n (codlem: n): uninflected nouns

- pr; p1-p9; p18 (codlem: pr): pronominals

- v: verbs not related to a specific conjugation

Related to SF

I declension nouns (codlem: n1)

- n1: I declension nouns

- n1e: I declension irregular nouns

II declension nouns (codlem: n2)

- n2: II declension nouns (masculine, and feminine)

- n2e: II declension irregular nouns

- n2i: II declension masculine nouns in -ius

- n2n: II declension neuter nouns

- n2ni: II declension neuter nouns in -ium

Gerund (codlem: n2g)

- n21: gerund

Only neuter perfect participle (cod-lem: n2np)

- n2np: only neuter perfect participle

III declension nouns (cod-lem: n3b)

- n3: III declension nouns (masculine and feminine) with plural genitive in -um/-ium

- n31: III declension nouns (masculine and feminine) with plural genitive in -um

- n32: III declension nouns (masculine and feminine) with plural genitive in -ium

- n3e: III declension irregular nouns; singular ablative in -e

- n3n: III declension nouns (neuter) with plural genitive in -um/-ium

- n3n1: III declension nouns (neuter) with plural genitive in -um

- n3n2: III declension nouns (neuter) with plural genitive in -ium

- n3p: III declension nouns with singular ablative in -e/-i

IV declension nouns (cod-lem: n4)

- n4: IV declension nouns

Supine (cod-lem: n4s)

- n41: supine

V declension nouns (cod-lem: n5)

- n5: V declension nouns

I class adjectives; perfect and future participles; gerundives (cod-lem: n2/1)

- n6: I class adjectives; perfect and future participles; gerundives

- n6i: I class adjectives in –ius

- n6r: I class adjectives in –er

- n6s: I class superlative degree adjectives

I class pronominal adjectives (cod-lem: pr)

- n6p: I class pronominal adjectives with singular genitive in –ius and singular dative in -i

- n6p3: I class pronominal adjectives inflected as regular first class adjectives

Perfect, future irregular participles; irregular gerundives (codlem: n1/2)

- n6g: irregular gerundives

- n6p1: perfect irregular participles

- n6p2: future irregular participles

Only neuter irregular gerundive (codlem: n2np)

- n6gn: only neuter irregular gerundive

II class adjectives (codlem: n3a)

- n7: II class adjectives with singular nominative masculine and feminine ending in –is, neuter in –e, and singular ablative in –i

- n71: II class adjectives with singular nominative the same for masculine, feminine and neuter (-s; -x; -r; -l), and singular ablative in –e/-i

- n72: II class adjectives with singular nominative feminine ending in –is, masculine in –er, neuter in –e, and singular ablative in –i

- n7c: II class comparative degree adjectives

II class pronominal adjectives (codlem: pr)

- n7p: II class pronominal adjectives

Present irregular participle (codlem: n3p)

- n7p3: present irregular participle

Pronominals (codlem: pr)

- p10-p17; p19-p23: see table “tabsf”

Verbs

A) Infectum

Each codles beginning with v- can have, in fourth position, one of the following letters:

Infectum

- v**a: compatibility with present indicative SF

- v**b: compatibility with present conjunctive SF

- v**c: compatibility with future indicative SF

- v**d: compatibility with imperfect indicative SF

- v**e: compatibility with imperfect conjunctive SF

- v**f: compatibility with present imperative SF

- v**g: compatibility with present infinitive SF

Perfectum

- v**a: compatibility with active perfect indicative SF

- v**b: compatibility with active perfect conjunctive SF

- v**c: compatibility with active future perfect indicative SF

- v**d: compatibility with active pluperfect indicative SF

- v**e: compatibility with active pluperfect conjunctive SF

- v**g: compatibility with active perfect infinitive SF

I conjugation verbs (cod-lem: v1)

- v1d: I conjugation deponent verbs

- v1e: I conjugation verbs, impersonal

- v1i: I conjugation verbs, intransitive

- v1r: I conjugation verbs, transitive

- v1s: I conjugation verbs, only active diathesis

II conjugation verbs (cod-lem: v2)

- v2d: II conjugation deponent verbs

- v2e: II conjugation verbs, impersonal

- v2i: II conjugation verbs, intransitive

- v2r: II conjugation verbs, transitive

- v2s: II conjugation verbs, only active diathesis

III conjugation verbs (cod-lem: v3)

- v3d: III conjugation deponent verbs

- v3e: III conjugation verbs, impersonal

- v3i: III conjugation verbs, intransitive

- v3r: III conjugation verbs, transitive

- v3s: III conjugation verbs, only active diathesis

IV conjugation verbs (cod-lem: v4)

- v4d: IV conjugation deponent verbs

- v4e: IV conjugation verbs, impersonal

- v4i: IV conjugation verbs, intransitive

- v4r: IV conjugation verbs, transitive

- v4s: IV conjugation verbs, only active diathesis

e/i conjugation verbs (cod-lem: v5)

- v5d: e/i conjugation deponent verbs

- v5e: e/i conjugation verbs, impersonal

- v5i: e/i conjugation verbs, intransitive

- v5r: e/i conjugation verbs, transitive

- v5s: e/i conjugation verbs, only active diathesis

Not regular conjugation verbs (codlem: va)

- v6d: not regular conjugation deponent verbs

- v6i: not regular conjugation verbs, intransitive

- v6r: not regular conjugation verbs, transitive

- v6s: not regular conjugation verbs, only active diathesis

- v61a: not regular conjugation verbs; compatibility with present indicative SF

- v62a: not regular conjugation verbs; compatibility with active imperfect conjunctive SF

- v63a: not regular conjugation verbs; compatibility with active future perfect indicative SF

- v64a: not regular conjugation verbs; compatibility with present active conjunctive SF

- v65a: not regular conjugation verbs; compatibility with imperfect active indicative SF

- v66a: not regular conjugation verbs; compatibility with perfect active conjunctive SF

- v67a: not regular conjugation verbs; compatibility with passive future perfect indicative SF

- v68a: not regular conjugation verbs; compatibility with present conjunctive SF

- v69a: not regular conjugation verbs; compatibility with present indicative SF (passive: only SF –tur)

B) Perfectum (codlem: vp)

- v7s: perfectum

- v7e: impersonal perfectum

- v8s: syncopated perfectum

Appendix II

Automatic Creation of Lemma

If no LE is recorded in field “lem” of table “lessario”, the lemma is created automatically by adding an SF to the les; the SF that is added depends on the cod-les.

Cod-les automatic SF

n1 -a

n1e -a

n2 -us

n2e -us

n2i -ius

n2n -um

n2ni -ium

n21 -i

n2np -um

n3 -is

n31 -is

n32 -is

n3e -is

n3n -is

n3n1 -is

n3n2 -is

n3p -is

n4 -us

n41 -um

n5 -es

n6 -us

n6i -ius

n6r -us

n6s -us

n6p -us

n6p3 -us

n6g -us

n6p1 -us

n6p2 -us

n6gn -us

n7 -is

n71 -is

n72 -is

n7c -is

n7p -is

n7p3 -is

p10-p23 LE (always)

v1d LE (always)

v1e -at

v1i -o

v1r -o

v1s -o

v2d LE (always)

v2e -et

v2i -eo

v2r -eo

v2s -eo

v3d LE (always)

v3e -it

v3i -o

v3r -o

v3s -o

v4d LE (always)

v4e -it

v4i -o

v4r -o

v4s -o

v5d LE (always)

v5e -it

v5i -io

v5r -io

v5s -io

v6d LE (always)

v6i -o

v6r -o

v6s -o

v61a -o

v62a -o

v63a -o

v64a -o

v65a -o

v66a -o

v67a -o

v68a -o

v69a -o

v7s -i

v7e -it

v8s -i

i =les

n =les

v =les

pr =les

With cod-les n3* and n7*, the following les endings are substituted with the automatic SF shown:

les ending automatic SF

-in -o

-on -o

-c -x

-g -x

-d -s

-t -s
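The rules of this appendix can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the LEMLAT code: the names `create_lemma`, `AUTOMATIC_SF` and `ENDING_SUBSTITUTIONS` are hypothetical, only an excerpt of the cod-les table is included, and the sketch assumes that for n3* and n7* the ending substitution takes precedence over appending “-is”.

```python
# Excerpt of the "Cod-les / automatic SF" table (not the full list).
AUTOMATIC_SF = {
    "n1": "a", "n1e": "a", "n2": "us", "n2i": "ius", "n2n": "um",
    "n2ni": "ium", "n3": "is", "n31": "is", "n32": "is", "n4": "us",
    "n5": "es", "n6": "us", "n7": "is", "v1r": "o", "v2r": "eo",
    "v5r": "io", "v7s": "i",
}

# Cod-les whose lemma is the les itself ("=les").
SAME_AS_LES = {"i", "n", "v", "pr"}

# Cod-les marked "LE (always)": the lemma must be recorded in field "lem".
LE_ALWAYS = {f"v{n}d" for n in range(1, 7)} | {f"p{n}" for n in range(10, 24)}

# Ending substitutions that apply with cod-les n3* and n7*.
ENDING_SUBSTITUTIONS = [
    ("in", "o"), ("on", "o"), ("c", "x"), ("g", "x"), ("d", "s"), ("t", "s"),
]


def create_lemma(les, codles, lem=None):
    """Return the recorded LE if present, otherwise build the lemma."""
    if lem:
        return lem
    if codles in LE_ALWAYS:
        raise ValueError(f"cod-les {codles} requires a recorded LE")
    if codles in SAME_AS_LES:
        return les
    if codles.startswith(("n3", "n7")):
        # Assumption: substitution is tried before the default "-is".
        for ending, sf in ENDING_SUBSTITUTIONS:
            if les.endswith(ending):
                return les[: -len(ending)] + sf
    return les + AUTOMATIC_SF[codles]


lemma_a = create_lemma("abscession", "n31")  # ending "-on" replaced by "-o"
lemma_b = create_lemma("carruc", "n1")       # automatic SF "-a" appended
```

For the two entries used earlier in this manual, the sketch yields “abscessio” and “carruca”. A cod-les missing from the excerpt raises a `KeyError`, which a complete table would avoid.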

Appendix III

Cod-lem

List of the available codlem, each with its semantic description:

- enc: enclitics

- i: invariables

- n: uninflected nouns 1

- n1: I declension nouns

- n1/2: perfect and future participles; gerundives

- n2: II declension nouns

- n2/1: I class adjectives

- n2g: gerunds

- n2np: only neuter gerundives and only neuter past participles

- n3a: II class adjectives and only neuter gerundive comparative degree

- n3b: III declension nouns

- n3p: present participles

- n4: IV declension nouns

- n4s: supines

- n5: V declension nouns

- pr: pronominals

- nx: uninflected nouns 2

- ny: uninflected adjectives

- v: verbs not related to a specific conjugation

- v1: I conjugation verbs

- v2: II conjugation verbs

- v3: III conjugation verbs

- v4: IV conjugation verbs

- v5: e/i conjugation verbs

- va: not regular conjugation verbs

- vp: verbs at perfectum

Appendix IV

DTD for CHLT LEMLAT

Text lemmatization results, wordform-oriented.

<!ELEMENT TextAnalysis (Analyses*)>

<!-- form: raw input wordform; alt_form: modified input wordform -->

<!ELEMENT Analyses (form, alt_form?, Analysis*)>

<!ELEMENT form (#PCDATA)>

<!ELEMENT alt_form (#PCDATA)>

<!-- enc: enclitics; part: particle -->

<!ELEMENT Analysis (enc?, part?, segmentation?, morphological_analyses, Lemmas)>

<!ELEMENT enc (#PCDATA)>

<!ELEMENT part (#PCDATA)>

<!-- alt: initial alteration; spf: post-final segment -->

<!ELEMENT segmentation (alt?, les, sm1?, sm2?, sf?, spf?)>

<!ELEMENT les (#PCDATA)>

<!ELEMENT alt (#PCDATA)>

<!ELEMENT sm1 (#PCDATA)>

<!ELEMENT sm2 (#PCDATA)>

<!ELEMENT sf (#PCDATA)>

<!ELEMENT spf (#PCDATA)>

<!ELEMENT morphological_analyses (morphological_codes*)>

<!ELEMENT Lemmas (Lemma+)>

<!ELEMENT morphological_codes (Mood?, Tense?, Case?, Gender?, Number?, Person?, Degree?)>

<!ELEMENT Mood (#PCDATA)>

<!ELEMENT Tense (#PCDATA)>

<!ELEMENT Case (#PCDATA)>

<!ELEMENT Gender (#PCDATA)>

<!ELEMENT Number (#PCDATA)>

<!ELEMENT Person (#PCDATA)>

<!ELEMENT Degree (#PCDATA)>

<!ELEMENT Lemma (lemma, lemma_gender?, lemma_morphological_codes?)>

<!ATTLIST Lemma type (iper|ipo) #IMPLIED>

<!ELEMENT lemma (#PCDATA)>

<!ELEMENT lemma_gender (#PCDATA)>

<!ELEMENT lemma_morphological_codes (PoS, Type?, Flexional_category?)>

<!ELEMENT PoS (#PCDATA)>

<!ELEMENT Type (#PCDATA)>

<!ELEMENT Flexional_category (#PCDATA)>
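As an illustration of how output structured according to the wordform-oriented DTD above might be consumed, the sketch below parses a small hand-written document with Python's standard `xml.etree.ElementTree`. The sample content is invented for the example: the values placed in Case, Gender and Number are placeholders, since the actual code values produced by CHLT-LEMLAT are not listed here, and the function name `lemmas_by_form` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample following the element order required by the DTD.
# The morphological code values are placeholders, not LEMLAT's real codes.
SAMPLE = """<TextAnalysis>
  <Analyses>
    <form>carruca</form>
    <Analysis>
      <segmentation><les>carruc</les><sf>a</sf></segmentation>
      <morphological_analyses>
        <morphological_codes>
          <Case>nominative</Case><Gender>feminine</Gender><Number>singular</Number>
        </morphological_codes>
      </morphological_analyses>
      <Lemmas>
        <Lemma type="ipo"><lemma>carruca</lemma><lemma_gender>f</lemma_gender></Lemma>
      </Lemmas>
    </Analysis>
  </Analyses>
</TextAnalysis>"""


def lemmas_by_form(xml_text):
    """Map each input wordform to the lemmas proposed by its analyses."""
    root = ET.fromstring(xml_text)
    result = {}
    for analyses in root.findall("Analyses"):
        form = analyses.findtext("form")
        lemmas = [
            lemma_el.findtext("lemma")
            for lemma_el in analyses.findall("Analysis/Lemmas/Lemma")
        ]
        result.setdefault(form, []).extend(lemmas)
    return result


forms = lemmas_by_form(SAMPLE)
```

Note that `ElementTree` does not validate against a DTD; checking conformance would need a validating parser.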

Text lemmatization results, lemma-oriented (Rationarium).

<!ELEMENT rationarium (rat_item*)>

<!ELEMENT rat_item (txtlemma, lem_analyses)>

<!-- lem_occ_ipo: number of occurrences of lemma as hypolemma; lem_frq_ipo: number of different wordforms related to lemma as hypolemma; lem_occ_iper: number of occurrences of lemma as hyperlemma; lem_frq_iper: number of different wordforms related to lemma as hyperlemma -->

<!ELEMENT txtlemma ( Lemma, lem_occ_ipo, lem_frq_ipo, lem_occ_iper, lem_frq_iper)>

<!ELEMENT Lemma (lemma, lem_id?, lemma_morphological_codes?)>

<!ATTLIST Lemma type (iper|ipo) #IMPLIED>

<!ELEMENT lemma (#PCDATA)>

<!-- lem_id: identification number of lemma in ‘lessario’ table -->

<!ELEMENT lem_id (#PCDATA)>

<!ELEMENT lemma_morphological_codes (PoS, Type?, Flexional_category?)>

<!ELEMENT PoS (#PCDATA)>

<!ELEMENT Type (#PCDATA)>

<!ELEMENT Flexional_category (#PCDATA)>

<!ELEMENT lem_occ_ipo (#PCDATA)>

<!ELEMENT lem_frq_ipo (#PCDATA)>

<!ELEMENT lem_occ_iper (#PCDATA)>

<!ELEMENT lem_frq_iper (#PCDATA)>

<!ELEMENT lem_analyses (lem_analysis*)>

<!-- a_type: kind of relation between lemma and wordform specified in ‘txtwf’ -->

<!ELEMENT lem_analysis (a_type, txtwf, morphological_analyses)>

<!ELEMENT a_type (#PCDATA)>

<!-- wf_occ: wordform occurrence -->

<!ELEMENT txtwf (wordform, wf_occ, lem_occ_ipo, lem_occ_iper)>

<!ELEMENT wordform (#PCDATA)>

<!ELEMENT wf_occ (#PCDATA)>

<!ELEMENT morphological_analyses (morphological_codes*)>

<!ELEMENT morphological_codes (Mood?, Tense?, Case?, Gender?, Number?, Person?, Degree?)>

<!ELEMENT Mood (#PCDATA)>

<!ELEMENT Tense (#PCDATA)>

<!ELEMENT Case (#PCDATA)>

<!ELEMENT Gender (#PCDATA)>

<!ELEMENT Number (#PCDATA)>

<!ELEMENT Person (#PCDATA)>

<!ELEMENT Degree (#PCDATA)>

List of wordforms (Formario)

<!ELEMENT formario (txtwf*)>

<!ELEMENT txtwf (wordform, wf_occ, lem_occ_ipo, lem_occ_iper)>

<!ELEMENT wordform (#PCDATA)>

<!ELEMENT wf_occ (#PCDATA)>

<!ELEMENT lem_occ_ipo (#PCDATA)>

<!ELEMENT lem_occ_iper (#PCDATA)>

List of lemmas (Lemmario)

<!ELEMENT lemmario (txtlemma*)>

<!ELEMENT txtlemma ( Lemma, lem_occ_ipo, lem_frq_ipo, lem_occ_iper, lem_frq_iper)>

<!ELEMENT Lemma (lemma, lem_id?, lemma_morphological_codes?)>

<!ATTLIST Lemma type (iper|ipo) #IMPLIED>

<!ELEMENT lemma (#PCDATA)>

<!ELEMENT lem_id (#PCDATA)>

<!ELEMENT lemma_morphological_codes (PoS, Type?, Flexional_category?)>

<!ELEMENT PoS (#PCDATA)>

<!ELEMENT Type (#PCDATA)>

<!ELEMENT Flexional_category (#PCDATA)>

<!ELEMENT lem_occ_ipo (#PCDATA)>

<!ELEMENT lem_frq_ipo (#PCDATA)>

<!ELEMENT lem_occ_iper (#PCDATA)>

<!ELEMENT lem_frq_iper (#PCDATA)>
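A Lemmario document can be read into simple records in the same way. The sketch below is an assumption-laden example rather than part of CHLT-LEMLAT: the sample counts and the lem_id values are invented, the class `LemmaEntry` and the function `read_lemmario` are hypothetical names, and the counters are assumed to be integers (the DTD only declares them as #PCDATA).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Invented sample data that follows the Lemmario DTD above.
SAMPLE_LEMMARIO = """<lemmario>
  <txtlemma>
    <Lemma type="ipo"><lemma>abscessio</lemma><lem_id>900001</lem_id></Lemma>
    <lem_occ_ipo>1</lem_occ_ipo><lem_frq_ipo>1</lem_frq_ipo>
    <lem_occ_iper>0</lem_occ_iper><lem_frq_iper>0</lem_frq_iper>
  </txtlemma>
  <txtlemma>
    <Lemma type="ipo"><lemma>carruca</lemma><lem_id>900002</lem_id></Lemma>
    <lem_occ_ipo>3</lem_occ_ipo><lem_frq_ipo>2</lem_frq_ipo>
    <lem_occ_iper>0</lem_occ_iper><lem_frq_iper>0</lem_frq_iper>
  </txtlemma>
</lemmario>"""


@dataclass
class LemmaEntry:
    lemma: str
    lemma_type: Optional[str]  # "iper", "ipo", or None (attribute is #IMPLIED)
    lem_id: Optional[str]      # optional in the DTD
    occ_ipo: int
    frq_ipo: int
    occ_iper: int
    frq_iper: int


def read_lemmario(xml_text: str) -> List[LemmaEntry]:
    """Parse a Lemmario document, most frequent hypolemma first."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.findall("txtlemma"):
        lemma_el = item.find("Lemma")
        entries.append(
            LemmaEntry(
                lemma=lemma_el.findtext("lemma"),
                lemma_type=lemma_el.get("type"),
                lem_id=lemma_el.findtext("lem_id"),
                occ_ipo=int(item.findtext("lem_occ_ipo")),
                frq_ipo=int(item.findtext("lem_frq_ipo")),
                occ_iper=int(item.findtext("lem_occ_iper")),
                frq_iper=int(item.findtext("lem_frq_iper")),
            )
        )
    return sorted(entries, key=lambda entry: entry.occ_ipo, reverse=True)


entries = read_lemmario(SAMPLE_LEMMARIO)
```

Sorting by `occ_ipo` is just one possible presentation; the DTD itself does not prescribe any ordering of the txtlemma items.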