DH 190: Scholarly Text Encoding

 

Week 3

January 30, 2015

Blog post #1

Digital text formats

  • HTML
  • PDF
  • EPUB
  • DOCX
  • DRM
  • RTF
  • DjVu
  • ibook
  • AZW

Text Editors

  • Sublime
  • Komodo
  • Notepad++
  • TextMate

What is XML?

Extensible Markup Language

Extensible

  • Able to expand
  • Does not contain a fixed set of tags

 

Markup

  • Tags around content
  • Vocabulary
  • Grammar
<title>Example of Using Proofreader's Marks</title>
<p>Traditional proof reading is becoming a dying 
art since so many clients no longer use hardcopies 
to mark-up their corrections by hand. Today most clients simply
write corrections in an email for you to decipher and follow
up with a call, while others write them diligently in a Word 
document.</p>

Descriptive

<title>Example of Using Proofreader's Marks</title>

Presentational

XML cares about what the text is, not how it looks.  
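A small contrast may make the distinction concrete. This is only a sketch: the `<i>` tag is borrowed from HTML as a stand-in for presentational markup, and the `<examples>` wrapper is a hypothetical element added so the fragment has a single root.

```xml
<examples>
    <!-- Presentational: records only how the text looks (italic) -->
    <i>Example of Using Proofreader's Marks</i>

    <!-- Descriptive: records what the text is (a title) -->
    <title>Example of Using Proofreader's Marks</title>
</examples>
```

The descriptive version leaves the question of italics, bold, or font size to whatever later displays the text.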

Not just a language, but a metalanguage.

A language used to formally express the components or structure of another language.

<element>content</element>

<title>Hamlet</title>
 

 

<author>William Shakespeare</author>

 

<forename>William</forename>

<surname>Shakespeare</surname>

<element attribute="value">content</element>

<l n="1710">To be or not to be, that is the question </l>

<date when="2015-01-30">Today</date>
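The element and attribute snippets above can be combined into one small document. This is an illustrative sketch: the `<play>` wrapper is a hypothetical root element, not something the slides define.

```xml
<play>
    <title>Hamlet</title>
    <!-- An element can contain other elements instead of plain text -->
    <author>
        <forename>William</forename>
        <surname>Shakespeare</surname>
    </author>
    <!-- The n attribute carries the line number without changing the text -->
    <l n="1710">To be or not to be, that is the question</l>
</play>
```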

XML is hierarchical.

<groceryList>
    <vegetable>
        <item>Broccoli</item>
        <item>Kale</item>
    </vegetable>
    <dairy>
        <item>Milk</item>
        <item>Yogurt</item>
    </dairy>
    <meat>
        <beef>
            <item>Ground beef</item>
            <item>Steak</item>
        </beef>
        <poultry>
            <item>Drumsticks</item>
        </poultry>
    </meat>
    <alcohol>
        <wine color="white">
            <item>Pinot Grigio</item>
        </wine>
    </alcohol>
</groceryList>

XML should be well formed.

 

<groceryList>
    <vegetable>
        <item>Broccoli</item>
        <item>Kale</item>
    <dairy>
        <item>Milk</item>
        <item>Yogurt</item>
    <dairy>
    <meat>
        <beef>
            <item>Ground beef
            <item>Steak</item>
        </beef>
        <Poultry>
            <item>Drumsticks</item>
        </poultry>
    </meat>
    <alcohol>
        <item type=wine>Pinot Grigio</item>
        <item><beer>Devils Backbone</item></beer>
    </alcohol>
<groceryList>

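This XML is not well formed in 7 places:

  • <vegetable> is never closed
  • The second <dairy> should be the closing tag </dairy>
  • <item>Ground beef has no closing </item>
  • <Poultry> does not match </poultry> (tag names are case sensitive)
  • The attribute value in type=wine is not quoted
  • <item> and <beer> overlap instead of nesting
  • The final <groceryList> should be the closing tag </groceryList>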
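One possible repair of the broken list is sketched below. Most fixes are mechanical; the one judgment call is the beer line, where this sketch assumes `<beer>` was meant to sit inside `<item>`.

```xml
<groceryList>
    <vegetable>
        <item>Broccoli</item>
        <item>Kale</item>
    </vegetable>
    <dairy>
        <item>Milk</item>
        <item>Yogurt</item>
    </dairy>
    <meat>
        <beef>
            <item>Ground beef</item>
            <item>Steak</item>
        </beef>
        <poultry>
            <item>Drumsticks</item>
        </poultry>
    </meat>
    <alcohol>
        <item type="wine">Pinot Grigio</item>
        <!-- Assumption: beer nests inside item; the reverse order would also be well formed -->
        <item><beer>Devils Backbone</beer></item>
    </alcohol>
</groceryList>
```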

XML should be valid.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="groceryschema.rnc" type="application/relax-ng-compact-syntax"?>
<groceryList xmlns="http://www.grocerylist.com">
    <vegetable>
        <item>Broccoli</item>
        <item>Kale</item>
    </vegetable>
    <dairy>
        <item>Milk</item>
        <item>Yogurt</item>
    </dairy>
    <meat>
        <beef>
            <item>Ground beef</item>
            <item>Steak</item>
        </beef>
        <poultry>
            <item>Drumsticks</item>
        </poultry>
    </meat>
    <alcohol>
        <beer>
            <item>Devils Backbone</item>
        </beer>
    </alcohol>
    <chore>Run the dishwasher</chore>
    <chore>Fold laundry</chore>
</groceryList>
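Validity is checked against a schema. The contents of groceryschema.rnc are not shown in the slides, so the following is a hypothetical schema, written in the XML syntax of RELAX NG rather than the compact syntax the file name implies. Under a schema like this one, the two `<chore>` elements would presumably be reported as invalid, even though the document's tags nest correctly.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical schema for illustration; not the actual groceryschema.rnc -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.grocerylist.com">
    <start>
        <element name="groceryList">
            <zeroOrMore>
                <choice>
                    <element name="vegetable"><ref name="items"/></element>
                    <element name="dairy"><ref name="items"/></element>
                    <element name="meat">
                        <oneOrMore>
                            <choice>
                                <element name="beef"><ref name="items"/></element>
                                <element name="poultry"><ref name="items"/></element>
                            </choice>
                        </oneOrMore>
                    </element>
                    <element name="alcohol">
                        <oneOrMore>
                            <choice>
                                <element name="beer"><ref name="items"/></element>
                                <element name="wine">
                                    <optional><attribute name="color"/></optional>
                                    <ref name="items"/>
                                </element>
                            </choice>
                        </oneOrMore>
                    </element>
                </choice>
            </zeroOrMore>
        </element>
    </start>

    <!-- Every category holds one or more item elements containing text -->
    <define name="items">
        <oneOrMore>
            <element name="item"><text/></element>
        </oneOrMore>
    </define>
</grammar>
```

Because no pattern in this sketch mentions `chore`, a validator has no rule that permits it inside `groceryList`.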

XML Declaration

<?xml version="1.0" encoding="UTF-8"?>

Namespace Declaration

<groceryList xmlns="http://www.grocerylist.com">

Schema

<?xml-model href="groceryschema.rnc" type="application/relax-ng-compact-syntax"?>

Encode this recipe:

  • 1 1/2 cups of flour
  • 3 1/2 tsp baking soda
  • 1 tsp salt
  • 1 tbsp sugar
  • 1 1/4 cups milk
  • 1 egg
  • 3 tbsp butter

 

In a large bowl, mix dry ingredients.

Make a well in the center and pour in wet ingredients.

Mix until smooth.

Heat a lightly oiled griddle or frying pan over medium high heat.

Pour or scoop the batter onto the griddle, using approximately 1/4 cup for each pancake.

Brown on both sides and serve hot.
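There is no single right answer to this exercise. One possible encoding is sketched below; every element and attribute name in it is invented for illustration, and a different vocabulary would be equally legitimate as long as it is well formed and applied consistently.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical vocabulary: recipe, ingredients, ingredient, instructions, step -->
<recipe>
    <ingredients>
        <ingredient quantity="1 1/2" unit="cup">flour</ingredient>
        <ingredient quantity="3 1/2" unit="tsp">baking soda</ingredient>
        <ingredient quantity="1" unit="tsp">salt</ingredient>
        <ingredient quantity="1" unit="tbsp">sugar</ingredient>
        <ingredient quantity="1 1/4" unit="cup">milk</ingredient>
        <ingredient quantity="1">egg</ingredient>
        <ingredient quantity="3" unit="tbsp">butter</ingredient>
    </ingredients>
    <instructions>
        <step n="1">In a large bowl, mix dry ingredients.</step>
        <step n="2">Make a well in the center and pour in wet ingredients.</step>
        <step n="3">Mix until smooth.</step>
        <step n="4">Heat a lightly oiled griddle or frying pan over medium high heat.</step>
        <step n="5">Pour or scoop the batter onto the griddle, using approximately 1/4 cup for each pancake.</step>
        <step n="6">Brown on both sides and serve hot.</step>
    </instructions>
</recipe>
```

Putting quantity and unit in attributes keeps the element content down to the ingredient name. Making them child elements instead would be just as defensible.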

Why XML?

  • Plain text

  • Easy to parse by humans and computers

  • Non-proprietary

  • Interoperability (your transcripts for instance)

  • Preservation 

The Text Encoding Initiative

"[The TEI Guidelines define] a markup language for representing the structural, renditional, and conceptual features of texts. They focus (though not exclusively) on the encoding of documents in the humanities and social sciences, and in particular on the representation of primary source materials for research and analysis."

http://www.tei-c.org/Guidelines/

Poughkeepsie Principles (1987)

  • Provide a standard format for data interchange;
  • Provide guidance for encoding of texts in this format;
  • Support the encoding of all kinds of features of all kinds of texts studied by researchers;
  • Allow the rigorous definition and efficient processing of texts;
  • Provide for user-defined extensions;
  • Be application independent;
  • Be simple, clear, and concrete;
  • Be simple for researchers to use without specialized software.

21 modules

 

503 elements

1 The TEI Infrastructure
2 The TEI Header
3 Elements Available in All TEI Documents
4 Default Text Structure
5 Characters, Glyphs, and Writing Modes
6 Verse
7 Performance Texts
8 Transcriptions of Speech
9 Dictionaries
10 Manuscript Description
11 Representation of Primary Sources
12 Critical Apparatus
13 Names, Dates, People, and Places
14 Tables, Formulæ, Graphics and Notated Music
15 Language Corpora
16 Linking, Segmentation, and Alignment
17 Simple Analytic Mechanisms
18 Feature Structures
19 Graphs, Networks, and Trees
20 Non-hierarchical Structures
21 Certainty, Precision, and Responsibility
22 Documentation Elements
23 Using the TEI

TEI Modules

Basic document structure

<TEI>
    <teiHeader>
        <!--...-->
    </teiHeader>
    <text>
        <!--...-->
    </text>
</TEI>
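Filled in, the skeleton above might look like the following minimal sketch. The TEI namespace and the three required parts of `<fileDesc>` come from the TEI Guidelines; the title, the publication and source notes, and the choice of a single verse line as content are placeholder assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>Hamlet</title>
                <author>William Shakespeare</author>
            </titleStmt>
            <publicationStmt>
                <!-- Placeholder text for a classroom example -->
                <p>Unpublished classroom example.</p>
            </publicationStmt>
            <sourceDesc>
                <!-- Placeholder text; a real header would describe the source edition -->
                <p>Transcribed for demonstration purposes.</p>
            </sourceDesc>
        </fileDesc>
    </teiHeader>
    <text>
        <body>
            <lg>
                <l n="1710">To be or not to be, that is the question</l>
            </lg>
        </body>
    </text>
</TEI>
```

The header describes the file; the text element holds the transcription itself.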

Image credits

Proofreader's Marks

http://www.designersinsights.com/wp-content/uploads/2012/03/Proofreader_Marks.png

DH190-W15-Week3

By Mackenzie Brooks
