perseus greek & latin treebanks
Anna-Sophia Zingarelli-Sweet
University of Pittsburgh School of Information Sciences
LIS 2975: Digital Scholarship
16 October 2013
anz31@pitt.edu // @aszingarelli
what is dependency structure?
- modern syntactic theory
- finite verb as structural center
- all other words "depend" on the verb, i.e. they are described in relationship to the verb
- Lucien Tesnière, Éléments de syntaxe structurale, pub. posthumously 1959
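To make the idea concrete, here is a minimal sketch of one way a dependency structure could be stored: each word records the word it depends on, and the finite verb is the only word with no head. The sentence ("puella rosam amat", "the girl loves the rose"), the `Token` class, and the relation labels are illustrative assumptions, not taken from the treebank files themselves.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int        # 1-based position in the sentence
    form: str      # the word as it appears in the text
    head: int      # id of the word this one depends on; 0 marks the root
    relation: str  # illustrative label for the kind of dependency


# "puella rosam amat": both nouns depend on the finite verb "amat".
sentence = [
    Token(1, "puella", 3, "SBJ"),
    Token(2, "rosam", 3, "OBJ"),
    Token(3, "amat", 0, "PRED"),
]


def root(tokens):
    """Return the structural center: the one token with no head."""
    return next(t for t in tokens if t.head == 0)


def dependents(tokens, head_id):
    """Return every token that depends directly on the given head."""
    return [t for t in tokens if t.head == head_id]


center = root(sentence)
print(center.form, [t.form for t in dependents(sentence, center.id)])
```

Storing only a head pointer per word is enough to recover the whole tree, which is why word order (free in Greek and Latin) does not have to match the structure.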
"are an attempt to create a linguistic genome: a large database of Classical texts where the morphological, syntactic, and lexical information for each sentence has been explicitly encoded.
The point? To put linguistic research in Greek and Latin on a new quantitative foundation. To help drive a new generation of computational analysis. And above all, to get students and faculty both involved in the production of data that can be useful to the wider scholarly community."
- Labor intensive: 200+ researchers annotating 350,000+ words (and this is a tiny fraction of the corpus)
- "standard" production method: 2 researchers annotate independently, then a 3rd reconciles differences
- "scholarly" production method: single researcher "publishes" their own annotation
- XML files all available under Creative Commons
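The "standard" method above can be sketched as a comparison step: load two independent annotations of the same sentence and list the words where they differ, which is what the third researcher would then need to reconcile. The XML snippets, the element and attribute names (`word`, `id`, `form`, `head`, `relation`), and the helper functions are assumptions made for illustration; check them against the published files before relying on them.

```python
import xml.etree.ElementTree as ET

# Two hypothetical annotations of the same sentence.
ANNOTATOR_A = """<sentence id="1">
  <word id="1" form="puella" head="3" relation="SBJ"/>
  <word id="2" form="rosam" head="3" relation="OBJ"/>
  <word id="3" form="amat" head="0" relation="PRED"/>
</sentence>"""

ANNOTATOR_B = """<sentence id="1">
  <word id="1" form="puella" head="3" relation="SBJ"/>
  <word id="2" form="rosam" head="3" relation="ATR"/>
  <word id="3" form="amat" head="0" relation="PRED"/>
</sentence>"""


def load(xml_text):
    """Map each word id to its (form, head, relation) triple."""
    sentence = ET.fromstring(xml_text)
    return {
        w.get("id"): (w.get("form"), w.get("head"), w.get("relation"))
        for w in sentence.iter("word")
    }


def disagreements(a_xml, b_xml):
    """List (word id, annotation A, annotation B) wherever the two differ."""
    a, b = load(a_xml), load(b_xml)
    return [(wid, a[wid], b.get(wid)) for wid in a if a[wid] != b.get(wid)]


for wid, first, second in disagreements(ANNOTATOR_A, ANNOTATOR_B):
    print(f"word {wid}: A says {first}, B says {second}")
```

Only the disputed words reach the reconciler, so the third pass is much cheaper than a full annotation.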
Perseus Digital Library
- begun 1987
- contains 3.4 million words of Latin and 4.9 million words of Greek
- public domain texts, digitized with OCR and encoded in XML
- Treebanks both benefit from this corpus & provide new services for the library
- recommender service offers most likely interpretation
machine translation
- Treebanks allow for extraction of rules
- Use known translations to map parallel trees
- Train program to accurately map out unknown translations
- (Technical description of this process found in Gideon Kotze et al., "Large Aligned Treebanks for Syntax-based Machine Translation", in Proceedings of the International Conference on Language Resources and Evaluation 2012)
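As a toy illustration of "extracting rules" from parallel trees (a simplified sketch, not the method described in the cited paper), the code below takes a Latin tree, an English tree, and a word alignment between them, and counts how each dependency relation is reordered in translation. The sentences, labels, alignment, and the `extract_rules` helper are all assumptions made for the example.

```python
from collections import Counter

# token id -> (form, head id, relation); head 0 marks the root
latin = {
    1: ("puella", 3, "SBJ"),
    2: ("rosam", 3, "OBJ"),
    3: ("amat", 0, "PRED"),
}
english = {
    1: ("the", 2, "DET"),
    2: ("girl", 3, "SBJ"),
    3: ("loves", 0, "PRED"),
    4: ("the", 5, "DET"),
    5: ("rose", 3, "OBJ"),
}
# Latin token id -> aligned English token id (from a known translation)
alignment = {1: 2, 2: 5, 3: 3}


def extract_rules(src, tgt, align):
    """Count (relation, source order, target order) for aligned edges."""
    rules = Counter()
    for sid, (_, shead, srel) in src.items():
        tid = align.get(sid)
        if tid is None or shead == 0:
            continue  # unaligned word or root: no edge to compare
        thead = tgt[tid][1]
        if align.get(shead) != thead:
            continue  # the two trees disagree about this edge
        s_order = "dep<head" if sid < shead else "head<dep"
        t_order = "dep<head" if tid < thead else "head<dep"
        rules[(srel, s_order, t_order)] += 1
    return rules


for rule, count in extract_rules(latin, english, alignment).items():
    print(rule, count)
```

On this one sentence pair the counts show the subject staying before the verb while the object moves after it; accumulated over many aligned trees, counts like these are the kind of evidence a program could be trained on.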