perseus greek & latin treebanks

Anna-Sophia Zingarelli-Sweet

University of Pittsburgh School of Information Sciences

LIS 2975: Digital Scholarship

16 October 2013

anz31@pitt.edu // @aszingarelli

what is a treebank?

parsed text corpus
originated in the early 1990s in computational linguistics
usually constructed from a corpus already annotated with part-of-speech tags
can involve specific XML schemes
12th International Workshop on Treebanks and Linguistic Theories
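
As a rough illustration of what "parsed text corpus" in XML can look like, the sketch below reads a small hand-written fragment with Python's standard library. The element and attribute names (`sentence`, `word`, `form`, `lemma`, `postag`, `head`, `relation`) and the tag values are assumptions made for this example, not a statement of any project's official schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "puella rosam amat" ("the girl loves the rose").
# Attribute names and tag values are illustrative assumptions.
SAMPLE = """
<treebank>
  <sentence id="1">
    <word id="1" form="puella" lemma="puella" postag="n-s---fn-" head="3" relation="SBJ"/>
    <word id="2" form="rosam" lemma="rosa" postag="n-s---fa-" head="3" relation="OBJ"/>
    <word id="3" form="amat" lemma="amo" postag="v3spia---" head="0" relation="PRED"/>
  </sentence>
</treebank>
"""

def read_sentences(xml_text):
    """Return a list of sentences, each a list of per-word attribute dicts."""
    root = ET.fromstring(xml_text.strip())
    sentences = []
    for sentence in root.iter("sentence"):
        words = [dict(word.attrib) for word in sentence.iter("word")]
        sentences.append(words)
    return sentences

sentences = read_sentences(SAMPLE)
for word in sentences[0]:
    # One row per word: its id, surface form, part-of-speech tag, head, relation
    print(word["id"], word["form"], word["postag"], word["head"], word["relation"])
```

Each word carries both its part-of-speech tag and a pointer to the word it depends on, which is what turns a tagged corpus into a treebank.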

what is dependency structure?

modern syntactic theory
finite verb as structural center
all other words "depend" on the verb, i.e. they are described in relationship to the verb
Lucien Tesnière, Éléments de syntaxe structurale, pub. posthumously 1959
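
The idea of the finite verb as structural center can be sketched as a small tree. In this toy example every word stores the id of its head, and the verb hangs from an artificial root with id 0. The sentence, the relation labels, and the function names are assumptions for illustration only.

```python
from collections import defaultdict

# Toy sentence "puella rosam amat" ("the girl loves the rose").
# Each tuple is (id, form, head_id, relation); head 0 marks the root.
# The relation labels are illustrative assumptions.
WORDS = [
    (1, "puella", 3, "SBJ"),
    (2, "rosam", 3, "OBJ"),
    (3, "amat", 0, "PRED"),
]

def build_tree(words):
    """Map each head id to the list of word ids that depend on it."""
    children = defaultdict(list)
    for word_id, _form, head_id, _relation in words:
        children[head_id].append(word_id)
    return children

def render(words, children, head_id=0, depth=0):
    """Return indented lines, one per word, walking down from the root."""
    by_id = {w[0]: w for w in words}
    lines = []
    for word_id in children.get(head_id, []):
        _, form, _, relation = by_id[word_id]
        lines.append("  " * depth + f"{form} ({relation})")
        lines.extend(render(words, children, word_id, depth + 1))
    return lines

tree = build_tree(WORDS)
print("\n".join(render(WORDS, tree)))
```

The verb prints at the top level and both nouns are indented beneath it, regardless of where they stand in the sentence.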

The Ancient Greek and Latin Dependency Treebanks

"are an attempt to create a linguistic genome: a large database of Classical texts where the morphological, syntactic, and lexical information for each sentence has been explicitly encoded.

The point? To put linguistic research in Greek and Latin on a new quantitative foundation. To help drive a new generation of computational analysis. And above all, to get students and faculty both involved in the production of data that can be useful to the wider scholarly community."

Labor intensive: 200+ researchers annotating 350,000+ words (and this is a tiny fraction of the corpus)
"standard" production method: 2 researchers annotate independently, then a 3rd reconciles differences
"scholarly" production method: single researcher "publishes" their own annotation
XML files all available under Creative Commons
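
The "standard" production method can be pictured as a comparison step followed by a reconciliation step. The sketch below is a simplified assumption about how that might be modelled, not the project's actual tooling: each annotation maps a word id to a (head, relation) pair, and the third researcher's rulings are passed in as `decisions`.

```python
# Two hypothetical independent annotations of the same three-word sentence.
# Each maps word id -> (head id, relation label).
annotator_a = {1: (3, "SBJ"), 2: (3, "OBJ"), 3: (0, "PRED")}
annotator_b = {1: (3, "SBJ"), 2: (3, "ATR"), 3: (0, "PRED")}

def find_disagreements(a, b):
    """Return the word ids where two annotations differ, with both readings."""
    return {
        word_id: (a.get(word_id), b.get(word_id))
        for word_id in sorted(set(a) | set(b))
        if a.get(word_id) != b.get(word_id)
    }

def reconcile(a, b, decisions):
    """Merge two annotations; `decisions` holds the third annotator's rulings."""
    merged = {}
    for word_id in sorted(set(a) | set(b)):
        if a.get(word_id) == b.get(word_id):
            merged[word_id] = a[word_id]
        else:
            # Raises KeyError if the reconciler left a conflict unresolved
            merged[word_id] = decisions[word_id]
    return merged

conflicts = find_disagreements(annotator_a, annotator_b)
final = reconcile(annotator_a, annotator_b, decisions={2: (3, "OBJ")})
print(conflicts)
print(final)
```

Only the words on which the two annotators disagree need the third researcher's attention, which is why the comparison step comes first.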

Perseus Digital Library

begun 1987
contains 3.4 million words of Latin and 4.9 million words of Greek
public domain texts, digitized with OCR and encoded in XML
Treebanks both benefit from this corpus & provide new services for the library
recommender service offers most likely interpretation

machine translation

Treebanks allow for extraction of rules
Use known translations to map parallel trees
Train a program to map unknown translations accurately
(Technical description of this process: Gideon Kotze et al., "Large Aligned Treebanks for Syntax-based Machine Translation", in Proceedings of the International Conference on Language Resources and Evaluation, 2012)
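
To make "extraction of rules" from parallel trees a little more concrete, here is a deliberately tiny sketch. It is not the method of the paper cited above. It assumes a word alignment is already given, and the only kind of "rule" it records is whether a dependent stands before or after its head in each language. The sentence pair, the alignment, and all names are hypothetical, and the English side omits articles for simplicity.

```python
from collections import Counter

# Hypothetical aligned sentence pair. Each word: (id, form, head_id, relation).
LATIN = [(1, "puella", 3, "SBJ"), (2, "rosam", 3, "OBJ"), (3, "amat", 0, "PRED")]
ENGLISH = [(1, "girl", 2, "SBJ"), (2, "loves", 0, "PRED"), (3, "rose", 2, "OBJ")]
# Word alignment: Latin word id -> English word id (assumed to be given).
ALIGNMENT = {1: 1, 2: 3, 3: 2}

def side(dep_id, head_id):
    """Say whether a dependent precedes or follows its head in the sentence."""
    return "before-head" if dep_id < head_id else "after-head"

def extract_rules(source, target, alignment):
    """Count (relation, source order, target order) over aligned edges."""
    target_heads = {w[0]: w[2] for w in target}
    rules = Counter()
    for dep_id, _form, head_id, relation in source:
        if head_id == 0:
            continue  # the root has no head to be ordered against
        t_dep, t_head = alignment.get(dep_id), alignment.get(head_id)
        # Keep only edges that exist in both trees under the alignment
        if t_dep is None or t_head is None or target_heads.get(t_dep) != t_head:
            continue
        rules[(relation, side(dep_id, head_id), side(t_dep, t_head))] += 1
    return rules

rules = extract_rules(LATIN, ENGLISH, ALIGNMENT)
for rule, count in sorted(rules.items()):
    print(rule, count)
```

On this single pair the counts suggest that a Latin object placed before its verb corresponds to an English object placed after it. A real system would gather such statistics over many aligned sentence pairs.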
