CHLT Documentation for Encoders



Getting Started
Encoding the Header
Encoding the Text
General Guidelines
Front Matter
Body
Finishing Up



Last updated 06/23/04 by lv.




 

Getting Started

To encode a document, you'll need an XML editor (e.g., jEdit or oXygen), the XML file of the document, the LHL XML template (LHLheader.xml), the list of entities (entities.txt), and the page images of the original document. You will also need to make sure that the TEI DTD library is on your computer and that the XML editor has been set up to recognize the catalog file associated with it.

DATA ENTRY POST-PROCESSING: Open the template and the document file in the XML editor. Paste the entire contents of the XML file into the template file (between the <body> tags), close the document file, then rename the template file using the document file name.

Initially there will be many parsing errors; to address these, you will need to replace the shorthand tags used offsite with TEI tags. Using the original tagging specification as a guide, run a search and replace routine on some or all of these:

  • replace <B> with <hi rend="bold"> and </B> with </hi>
  • replace <I> with <hi rend="ital"> and </I> with </hi>
  • replace <C> with <hi rend="center"> and </C> with </hi>
  • replace <D> with <div1> and </D> with </div1>
  • replace <D1> with <div1> and </D1> with </div1> (and do the same for <D2> for <div2>, <D3> for <div3>, etc.)
  • replace <H> with <head> and </H> with </head>
  • replace <M> with <note place='marg'> and </M> with </note>
  • replace <P> with <p> and </P> with </p>
  • replace <TABLE> with <table> and </TABLE> with </table>, <TR> with <row> and </TR> with </row>, and <TD> with <cell> and </TD> with </cell> (leave any <table> attribute values alone for now; see section on tables below)
  • search for the string <PB and make lowercase the tag and attributes if they are uppercase
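
As a rough illustration of the replacements above, here is a hypothetical fragment as it might arrive from offsite data entry, followed by the same fragment after the search and replace routine. The sample text and the id value are invented for illustration; the shorthand actually present depends on the original tagging specification.

```xml
<!-- Before: offsite shorthand (hypothetical sample) -->
<D1>
<H>CHAPTER I</H>
<PB ID="page0012"/>
<P>The <I>first</I> principle is stated <B>plainly</B>.<M>See Book II</M></P>
</D1>

<!-- After: TEI tags, all lowercase -->
<div1>
<head>CHAPTER I</head>
<pb id="page0012"/>
<p>The <hi rend="ital">first</hi> principle is stated <hi rend="bold">plainly</hi>.<note place='marg'>See Book II</note></p>
</div1>
```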

Additionally, unknown characters are denoted with a variety of marks (including, for instance, <*>). Eventually, you will work through the document and replace as many of these as possible with the appropriate special character entity; for now, replace whatever marker the data entry personnel used with two question marks (i.e., ??).

After the replacements, there may still be parsing errors. Check the "error list," click on an error and go back to the document to locate it and fix it. Pay particular attention to valid tag names that are in all-caps instead of lowercase (which the DTD requires).



Encoding the Header

Enter the following information in the <fileDesc>:

  • the title of the work, in <title> within both the <titleStmt> and the <sourceDesc>
  • the author of the work, in <author> within both the <titleStmt> and the <sourceDesc>
  • today's date in <date> of <publicationStmt>
  • the place of the work's publication in <pubPlace> within <sourceDesc>
  • the publisher of the work in <publisher> within <sourceDesc>
  • the date of the work's publication in <date> within <sourceDesc>
If you do not have some of the publishing information, remove the placeholders in the header and leave the tag content blank.
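
A filled-in <fileDesc> might look roughly like the sketch below. The values are adapted from the title page example later in this document. The <bibl> wrapper inside <sourceDesc>, the date format, and the normalized spellings are assumptions; keep whatever structure the LHLheader.xml template already provides.

```xml
<fileDesc>
<titleStmt>
<title>De Motu Grauium, &amp; Leuium</title>
<author>Hieronymus Borrius Arretinus</author>
</titleStmt>
<publicationStmt>
<!-- any other elements supplied by the template are omitted here -->
<!-- today's date; the format shown is an assumption -->
<date>06/23/04</date>
</publicationStmt>
<sourceDesc>
<!-- the <bibl> wrapper is an assumption; use the template's own wrapper -->
<bibl>
<title>De Motu Grauium, &amp; Leuium</title>
<author>Hieronymus Borrius Arretinus</author>
<pubPlace>Florentiae</pubPlace>
<publisher>Georgius Marescottus</publisher>
<date>1575</date>
</bibl>
</sourceDesc>
</fileDesc>
```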

In the <profileDesc>, remove any <language> that is not the primary language of the document. For instance, if the document is written in Latin, this is what the <langUsage> section of the <profileDesc> should look like:


<langUsage>
<language id="la">Latin</language>
</langUsage>

You will add other languages, as necessary, after the primary language; see section on language below.



Encoding the Text

General Guidelines

These guidelines apply to any part of the <text>, including the front matter (the <front>) and the main body of the work (the <body>).

FORMATTING: In addition to the formatting performed by offsite data entry personnel, you may need to use <hi> to indicate subscript, superscript, or writing in small caps; use the rend attribute of <hi> to register the type of special formatting needed, and one of these values for rend:

  • sub
  • super
  • small caps
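
For instance, the three rend values might be applied as in this small sketch (the sample text is invented for illustration):

```xml
<p>H<hi rend="sub">2</hi>O</p>
<p>the 2<hi rend="super">nd</hi> edition</p>
<p><hi rend="small caps">Anno Domini</hi> 1575</p>
```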

PAGE BREAKS: All page breaks will have been marked with <pb/>s at the beginning of the pages they denote, and an id attribute value that denotes the path to the page image. Working from the beginning of the document, and using page images as your guide, you will assign an n attribute value to each <pb/> that matches the page number on which the text occurs. For unnumbered pages, assign sequential letters instead. The first <pb/> you find with no number will be given an n attribute value of "a"; the second is "b"; and so forth. If you run out of letters, begin with double letters (i.e., "aa," "bb," and so forth).
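
A minimal sketch of the numbering scheme, assuming a document that opens with two unnumbered pages and contains one unnumbered page later on. The id values are hypothetical placeholders for the page image paths already present in the file.

```xml
<!-- unnumbered pages get sequential letters -->
<pb id="page0001" n="a"/>
<pb id="page0002" n="b"/>
<!-- pages with printed page numbers take those numbers -->
<pb id="page0003" n="1"/>
<pb id="page0004" n="2"/>
<!-- a later unnumbered page continues the letter sequence -->
<pb id="page0005" n="c"/>
<pb id="page0006" n="3"/>
```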

ENTITIES: You will need to work through the document and check nearly all of the special characters used, using the page images of the original as a guide. You will also need to go back through all of the unknown characters (marked with ??) and determine the appropriate special character entity. The list of entities contains common entities with their expansions. If the special character you need isn't listed there, there are several websites devoted to these; you might try: http://www.aim-higher.net/EntityReferenceViewer.asp or http://hotwired.lycos.com/webmonkey/geektalk/97/36/iso.html.

In Latin documents, the long S is sometimes marked with a $; replace these with &longs;.

If you use an entity in the document, it may give you a parsing error; this means that you will need to add the appropriate entity declaration to the document. For instance, if you are working in Latin, you will probably encounter a long S; the special character entity is &longs; and the entity declaration looks like this: <!ENTITY longs 's'>

Insert this entity declaration into the DOCTYPE declaration at the top of the document, before the final ending square bracket, like so:

%PersTeiHead;

<!ENTITY longs 's'>

]>

Additionally, in both jEdit and oXygen, an ampersand (i.e., &) will give you a parsing error. You will need to replace each & with its special character entity, &amp;.
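
As a small illustration of both replacements, the sketch below shows a hypothetical line as received (in the comment) and the same line after editing. It assumes the longs entity has been declared in the DOCTYPE as described above; the sample text is invented.

```xml
<!-- As received: <p>De Motu Grauium, & Leuium, pa$$im</p> -->
<p>De Motu Grauium, &amp; Leuium, pa&longs;&longs;im</p>
```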

LANGUAGES: Determine the primary language of the document, and denote this in the lang attribute of <text>. The code you will need for the appropriate language is found in the <langUsage> within the <profileDesc> of the <teiHeader>, and it must match exactly the id attribute value of <language>.

Within the text, if you encounter a character, word, or phrase in a language other than the primary language of the document, tag it with <foreign>, and again enter the appropriate language code as the value of the lang attribute.

If the language you need is not already listed in the <langUsage>, you will need to take two steps. First, determine the language code: if it is not English ('en'), Greek ('greek'), or Latin ('la'), go to http://xml.coverpages.org/iso639a.html to find the two-letter code; then enter it into the lang attribute of <foreign>. Next, return to the <langUsage> and add a new <language>. The content of the <language> tag is the unabbreviated name of the language; enter the language code as its id attribute value.

Here is an example. The tagged text would look like this:

....tho what they said was much more absurd than Aristotle's <foreign lang='greek'>entele/xeia</foreign>, or the Schoolmens substantial formes....

And here is the newly modified <langUsage> in the header:


<langUsage>
<language id="en">English</language>
<language id="greek">Greek</language>
</langUsage>


Front Matter

If present, front matter is tagged with <front>, and it precedes the <body> within <text>. Because the template assumes there is no front matter, you will need to enter opening and closing <front> tags before <body>, remove the text that is front matter from <body>, and paste it in the <front> tags.

Like the body of the text, the front matter is divided up into chunks of data called <div1>s. These may include such data as a title page (<titlePage>), table of contents (marked up as a <list>), or preface (marked up as <p>s or <lg>s, as appropriate).

A <titlePage> may consist of some combination of the following: <docTitle> (composed of individual <titlePart>s), a <byline> (typically containing a <docAuthor>), an <epigraph>, and/or publishing information (a <docImprint> and <docDate>). Treat a dedication or figure as a <titlePart>. For example:

<text lang="la"><front>
<titlePage>

<byline>
<docAuthor>HIERONYMVS BORRIVS ARRETINVS</docAuthor>
</byline>
<docTitle>
<titlePart>De Motu Grauium, &amp; Leuium</titlePart>
<titlePart type="dedication">Ad FRANCISCVM Medicem Magnum Etruria Ducem II.</titlePart>
</docTitle>
<docImprint>FLORENTIAE, In Officina Georgii Marescotti. </docImprint>
<docDate>MDLXXV.</docDate>
</titlePage>
</front>

A table of contents (contained within a <div1>) may or may not have a <head> and is tagged as a <list> or a series of <list>s. A <list> may consist of some combination of the following: a <head>, and one or more <item>s. See "Body" below for more on tagging <list>s. Here's a simple example:


<div1>
<head>Table of Contents</head>
<list>
<item>Preface</item>
</list>
<list>
<head>CHAPTER I</head>
<item>Aristophanes</item>
<item>Sophocles</item>
</list>
</div1>

A preface (contained within a <div1>) is tagged with <p>s or <lg>s, as appropriate, just like the <body> of the text (see "Body"). It may also contain a <closer> at the end, if it is a signed preface.
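
A signed preface might be tagged roughly as follows. The sample text is invented, and the <signed> element inside <closer> is an assumption (it is a standard TEI element but is not prescribed by these guidelines).

```xml
<div1>
<head>Preface</head>
<p>Text of the first paragraph of the preface.</p>
<p>Text of the second paragraph of the preface.</p>
<closer>
<!-- <signed> is an assumption; the guidelines above only call for <closer> -->
<signed>The Author</signed>
</closer>
</div1>
```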


Body

HEAD: A <head> is the title of a section of text. These should have been tagged by the offsite data entry personnel. A <head> can only occur at the beginning of a <div1> (or another numbered <div>), before prose or verse begins.

PROSE: Each paragraph of prose within a <div1> (or another numbered <div>) is tagged with a <p>. These will have been tagged by the offsite data entry personnel.
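
As a sketch of how <head> and <p> sit within numbered <div>s, consider the following fragment (the sample text is invented for illustration):

```xml
<div1>
<head>CHAPTER I</head>
<p>Text of the first paragraph.</p>
<p>Text of the second paragraph.</p>
<div2>
<head>Section 1</head>
<p>Text of a paragraph within the subsection.</p>
</div2>
</div1>
```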

VERSE: The offsite taggers will not have tagged verse, so you will need to go through the document and tag any verse that you find. Verse is tagged with <l> (for individual lines). For example:

<div1>
<head>Book I</head>
<l>Seamen with sailing art their vessels move;</l>
<l>Art guides the chariot: art instructs to love.</l>
</div1>

If verse is grouped in stanzas, wrap them in <lg>s:

<lg>
<l>Content of stanza 1, line 1</l>
<l>Content of stanza 1, line 2</l>
</lg>
<lg>
<l>Content of stanza 2, line 1</l>
<l>Content of stanza 2, line 2</l>
</lg>

To denote indentations of poetic lines, use the rend attribute of the <l>. For one tab, place a value of "tab1" in the rend attribute; for two tabs, the value is "tab2"; and for three tabs, the value is "tab3". For instance, the following text

Tandem venit amor, qualem texisse pudori
quam nudasse alicui sit mihi fama magis.

is tagged like this:


<lg>
<l>Tandem venit amor, qualem texisse pudori</l>
<l rend="tab1">quam nudasse alicui sit mihi fama magis.</l>
</lg>

TABLES: Individual <table>s can vary in terms of number of <row>s and <cell>s. However, within a particular <table>, each <row> must contain the same number of <cell>s as the other <row>s of that <table>. Imagine the first cells in the rows are one vertical column, the second cells another vertical column, and so forth. Within a <table>, the first <cell>s of every <row> should always contain the same kind of data, the second <cell>s should always contain the same kind of data, etc.

For example, here's a simple table:


Santa Claus     North Pole       sleigh
Donald Trump    New York         limousine
Ahab            Massachusetts    whaling ship

It would be tagged like this:

<table>
<row>
<cell>Santa Claus</cell>
<cell>North Pole</cell>
<cell>sleigh</cell>
</row>
<row>
<cell>Donald Trump</cell>
<cell>New York</cell>
<cell>limousine</cell>
</row>
<row>
<cell>Ahab</cell>
<cell>Massachusetts</cell>
<cell>whaling ship</cell>
</row>
</table>

There is no TEI equivalent to HTML's attribute colspan. This means that empty <cell>s will need to be inserted in order to ensure that each <row> has the same number of <cell>s. Just make sure that they are added consistently such that an empty <cell> in one <row> corresponds in position to empty <cell>s in every <row>. If, for instance, you have a table like the one above, with three items (i.e., <cell>s) in each row (i.e., <row>), but the first <row> has one <cell> with an attribute colspan='3', then you will enter two empty <cell>s after it:

<table>
<row>
<cell>Transportation Preferences</cell>
<cell></cell>
<cell></cell>
</row>
<row>
<cell>Santa Claus</cell>
<cell>North Pole</cell>
<cell>sleigh</cell>
</row>
<row>
<cell>Donald Trump</cell>
<cell>New York</cell>
<cell>limousine</cell>
</row>
<row>
<cell>Ahab</cell>
<cell>Massachusetts</cell>
<cell>whaling ship</cell>
</row>
</table>

LINKING: Each <head> should be assigned an n attribute equal to the page number of the page on which it occurs. In other words, the <head> and the <pb/> that precedes it (even if there is intervening text between them) should have matching values on their respective n attributes. It is possible to have multiple <head>s on one page, so in that case more than one <head> will have the same n attribute value. Here's an example:

<pb n='36'/>
<p>.... the principal valley on too high or too low a level,--a
circumstance which would be infinitely improbable if
each of these vallies were not the work of the stream
that flows in it.”</p>
<div2 type='subsection'>
<head n="36">BIBLIOGRAPHY</head>
<div3 type='subsubsection'>
<head n="36">I. ORIGINAL WORKS.</head>

ILLEGIBLE TEXT: If part of the text is illegible and cannot be deciphered, enter a <gap/>, and enter "illegible" in its reason attribute.
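
For example, an illegible stretch in the middle of a sentence might be marked like this (the sample text is invented for illustration):

```xml
<p>The stream descends from <gap reason="illegible"/> and joins the principal valley.</p>
```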




Finishing Up

When you are finished with the encoding, proofread carefully against the original journal issue. Parse the document and correct any errors that are reported. Save the document, and use the CVS system to send it to the server.