New Languages for NLP

DH 2019

$ pip install standoffconverter

from lxml import etree
import requests
from standoffconverter import Converter

# 1. download the Macbeth TEI file and convert it to standoff format
url = "https://firstfolio.bodleian.ox.ac.uk/download/xml/F-mac.xml"
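response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# parse the raw bytes so lxml can honour the file's own encoding declaration
tree = etree.fromstring(response.content)
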
macbeth = Converter.from_tree(tree)

# 2. create new annotations (automatically) and add them to the original
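# begin and end are character offsets into the plain text; "SOMETAG" and
# attributes (a dict of XML attributes) are placeholders to fill in
macbeth.add_annotation(begin, end, "SOMETAG", 0, attributes)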

# 3. store the modified XML
new_tree = macbeth.to_tree()
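
To make the idea of standoff annotation concrete, here is a minimal sketch in plain Python. It does not use standoffconverter and is not how that library is implemented; the names Annotation, find_spans and to_inline, the sample sentence, and the persName tag are all assumptions made for illustration. The point it shows is that the text stays untouched while each annotation is just a pair of character offsets plus a tag and attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A hypothetical standoff annotation: offsets into the plain text."""
    begin: int   # character offset where the span starts (inclusive)
    end: int     # character offset where the span ends (exclusive)
    tag: str
    attributes: dict = field(default_factory=dict)


def find_spans(text, term, tag, attributes=None):
    """Return one Annotation per non-overlapping occurrence of term in text."""
    spans = []
    start = text.find(term)
    while start != -1:
        spans.append(Annotation(start, start + len(term), tag, dict(attributes or {})))
        start = text.find(term, start + len(term))
    return spans


def to_inline(text, annotations):
    """Render non-overlapping annotations as inline XML-like markup.

    Illustration only: no XML escaping and no support for nested spans.
    """
    out, cursor = [], 0
    for ann in sorted(annotations, key=lambda a: a.begin):
        attrs = "".join(f' {k}="{v}"' for k, v in ann.attributes.items())
        out.append(text[cursor:ann.begin])
        out.append(f"<{ann.tag}{attrs}>{text[ann.begin:ann.end]}</{ann.tag}>")
        cursor = ann.end
    out.append(text[cursor:])
    return "".join(out)


# The plain text is never modified; annotations live beside it.
plain = "Enter Macbeth and Banquo. Macbeth speaks."
annotations = find_spans(plain, "Macbeth", "persName", {"ref": "#macbeth"})
inline = to_inline(plain, annotations)
print(inline)
```

Because the offsets are stored separately, automatically generated annotations can be added, reviewed, or discarded without rewriting the source text.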

DH Budapest

DARIAH

Cadet

New languages

  • literature
  • historical texts
  • scientific, diplomatic, military

Domain-specific models

Bulk-update seed terms

spaCy defaults and models for auto-suggestion

Model in the loop (Prodigy, LightTag, tagtog)
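
As a rough illustration of how seed terms can drive auto-suggestion, the sketch below proposes labelled spans wherever a seed term occurs, so that an annotator only has to accept or reject them. It uses only the Python standard library and is not the implementation used by Cadet, spaCy, or any of the tools named above; the SEED_TERMS dictionary, the labels, the suggest function, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical seed terms, grouped by the label they should suggest.
SEED_TERMS = {
    "PERSON": ["Macbeth", "Banquo", "Lady Macbeth"],
    "PLACE": ["Dunsinane", "Birnam Wood"],
}


def suggest(text, seed_terms):
    """Return (begin, end, label, matched_text) suggestions in text order.

    When two candidate spans overlap, the longer one is kept.
    """
    candidates = []
    for label, terms in seed_terms.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                candidates.append((m.start(), m.end(), label, m.group(0)))

    # Consider longer spans first, then skip anything overlapping an accepted span.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    accepted = []
    for cand in candidates:
        if all(cand[1] <= a[0] or cand[0] >= a[1] for a in accepted):
            accepted.append(cand)
    return sorted(accepted)


text = "Lady Macbeth waits while Macbeth marches from Birnam Wood to Dunsinane."
suggestions = suggest(text, SEED_TERMS)
for begin, end, label, matched in suggestions:
    print(begin, end, label, matched)
```

Updating the seed list and re-running the function refreshes every suggestion at once, which is the sense of "bulk update" assumed here; a trained model in the loop would replace the exact-match step with predictions.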

New Languages for NLP seeks to:

  • Empower scholars to create the linguistic data and language models they need to use NLP in their research
  • Give humanities scholars greater insight into how machines reason about the content of texts
  • Create fine-tuned models for project-specific languages and domains
  • Change the relationship between humanists and NLP research: from consumers to contributors

Thank you!  ajanco@haverford.edu