New Languages for NLP
DH 2019
$ pip install standoffconverter
from lxml import etree
import requests
from standoffconverter import Converter
# 1. download the Macbeth TEI file and convert it to standoff format
url = "https://firstfolio.bodleian.ox.ac.uk/download/xml/F-mac.xml"
response = requests.get(url)
tree = etree.fromstring(response.text.encode('utf-8'))
macbeth = Converter.from_tree(tree)
# 2. create new annotations (automatically) and add them to the original
# begin, end and attributes are placeholders: offsets into the text and a dict of XML attributes
macbeth.add_annotation(begin, end, "SOMETAG", 0, attributes)
# 3. store the modified XML
new_tree = macbeth.to_tree()
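One way the "automatic" annotations in step 2 could be produced is by searching the plain text and recording character offsets. The sketch below is an illustration only and does not use standoffconverter: the `Annotation` class, the `find_annotations` helper, the sample sentence, and the `persName`/`ref` values are all hypothetical. Each result carries the kind of values (begin, end, tag, attributes) that the `add_annotation` call above takes.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A standoff annotation: character offsets plus a tag and attributes."""
    begin: int
    end: int
    tag: str
    attributes: dict = field(default_factory=dict)


def find_annotations(plain_text, pattern, tag, attributes=None):
    """Return one Annotation per regex match in plain_text (hypothetical helper)."""
    return [
        Annotation(m.start(), m.end(), tag, dict(attributes or {}))
        for m in re.finditer(pattern, plain_text)
    ]


# Hypothetical sample text, not taken from the First Folio file.
text = "Enter Macbeth and Banquo. Macbeth speaks."
annotations = find_annotations(text, r"\bMacbeth\b", "persName", {"ref": "#macbeth"})

for a in annotations:
    # The offsets point back into the unchanged plain text.
    print(a.begin, a.end, text[a.begin:a.end], a.tag, a.attributes)
```

Because the annotations only store offsets, the source text is never modified; the markup is layered on afterwards.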
DH Budapest
DARIAH
Cadet
New languages
- literature
- historical texts
- scientific, diplomatic, military
Domain-specific models
Bulk-update seed terms
spaCy defaults and models for auto-suggestion
Model in the loop (Prodigy, LightTag, tagtog)
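The idea of seed terms feeding auto-suggestion can be sketched with the standard library alone. This is a simplified stand-in, not Cadet's or spaCy's implementation: the `suggest` function, the labels, and the sample sentence are assumptions made for illustration. A dictionary of seed terms is updated in bulk, and every match in a text becomes a suggested span for an annotator to accept or reject.

```python
import re


def suggest(text, seed_terms):
    """Return (start, end, label) for each seed-term match (hypothetical helper)."""
    if not seed_terms:
        return []
    lookup = {term.lower(): label for term, label in seed_terms.items()}
    # Longest terms first, so "Thane of Cawdor" wins over "Thane".
    terms = sorted(seed_terms, key=len, reverse=True)
    matcher = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), lookup[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


# Start with a few seed terms, then bulk-update with more.
seed_terms = {"Thane": "TITLE", "Cawdor": "PLACE"}
seed_terms.update({"Thane of Cawdor": "TITLE", "Glamis": "PLACE"})

text = "Hail to thee, Thane of Cawdor! Hail, Thane of Glamis!"
suggestions = suggest(text, seed_terms)

for start, end, label in suggestions:
    print(text[start:end], label)
```

In a model-in-the-loop setup, accepted suggestions would become training data and the trained model would replace this simple matcher as the source of suggestions.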
New Languages for NLP seeks to:
- Empower scholars to create the linguistic data and language models they need to use NLP in their research
- Give humanities scholars greater insight into how machines reason about the content of texts
- Create fine-tuned models for project-specific languages and domains
- Change the relationship between humanists and NLP research: from consumers to contributors
Thank you! ajanco@haverford.edu
By Andrew Janco