"are an attempt to create a linguistic genome: a large database of Classical texts where the morphological, syntactic, and lexical information for each sentence has been explicitly encoded.
The point? To put linguistic research in Greek and Latin on a new quantitative foundation. To help drive a new generation of computational analysis. And above all, to get students and faculty both involved in the production of data that can be useful to the wider scholarly community."
Labor intensive: 200+ researchers annotating 350,000+ words (and this is a tiny fraction of the corpus)
"standard" production method: 2 researchers annotate independently, then a 3rd reconciles differences
"scholarly" production method: single researcher "publishes" their own annotation
XML files all available under Creative Commons
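A rough sketch of the "standard" production method, to make the reconciliation step concrete: two annotators analyse the same sentence independently, and only the tokens where they disagree go to the third researcher. The `Token` structure, the `disagreements` helper, the sample sentence, and the relation labels are all assumptions made for illustration; they are not the treebank's actual XML schema or tooling.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str      # the word as it appears in the text
    head: int      # 1-based position of the syntactic head (0 = root)
    relation: str  # dependency label (illustrative values only)


def disagreements(a, b):
    """Return (position, token_a, token_b) for every token analysed differently."""
    if len(a) != len(b):
        raise ValueError("both annotations must cover the same tokens")
    return [
        (i, x, y)
        for i, (x, y) in enumerate(zip(a, b), start=1)
        if x != y
    ]


# Hypothetical annotations of "puella rosam amat" by two researchers.
annotator_1 = [
    Token("puella", 3, "SBJ"),
    Token("rosam", 3, "OBJ"),
    Token("amat", 0, "PRED"),
]
annotator_2 = [
    Token("puella", 3, "SBJ"),
    Token("rosam", 3, "ATR"),  # differs from annotator_1
    Token("amat", 0, "PRED"),
]

# Only these tokens would need the third researcher's attention.
to_reconcile = disagreements(annotator_1, annotator_2)
for position, first, second in to_reconcile:
    print(position, first.form, first.relation, "vs", second.relation)
```

In this sketch the reconciler sees one token out of three, which is the point of annotating twice: agreement is accepted automatically and human effort goes to the differences.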
Perseus Digital Library
begun 1987
contains 3.4 million words of Latin and 4.9 million words of Greek
public-domain texts, digitized with OCR and encoded in XML
Treebanks both benefit from this corpus & provide new services for the library
recommender service offers most likely interpretation
machine translation
Treebanks allow for extraction of rules
Use known translations to map parallel trees
Train a program to map unknown translations accurately
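One way to picture "extraction of rules" from a treebank, as a minimal sketch only: count how often each (head part of speech, relation, dependent part of speech) pattern occurs across the annotated sentences. The tuple layout, the `extract_rules` helper, the two sample sentences, and the labels are hypothetical and chosen for illustration; a real system would read the treebank XML and use much richer features.

```python
from collections import Counter

# Each token: (form, part_of_speech, head_position, relation).
# head_position is 1-based; 0 marks the root of the sentence.
sentences = [
    [
        ("puella", "noun", 3, "SBJ"),
        ("rosam", "noun", 3, "OBJ"),
        ("amat", "verb", 0, "PRED"),
    ],
    [
        ("poeta", "noun", 2, "SBJ"),
        ("cantat", "verb", 0, "PRED"),
    ],
]


def extract_rules(treebank):
    """Count (head POS, relation, dependent POS) patterns in the treebank."""
    rules = Counter()
    for sentence in treebank:
        for _form, pos, head, relation in sentence:
            # Look up the head's part of speech, or use ROOT for the top node.
            head_pos = "ROOT" if head == 0 else sentence[head - 1][1]
            rules[(head_pos, relation, pos)] += 1
    return rules


rules = extract_rules(sentences)
for rule, count in rules.most_common():
    print(rule, count)
```

Counts like these are the raw material for the services listed above: the most frequent pattern can be offered as the likeliest interpretation, and patterns counted on both sides of a known translation give a starting point for aligning parallel trees.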