Digital Linguistics for Language Documentation

Daniel W. Hieber

University of California, Santa Barbara

May 24, 2019

What is Digital Linguistics (DLx)?

Digital Linguistics (DLx) is the science of digital data management for linguistics, including the digital storage, representation, manipulation, and dissemination of linguistic data. It concerns itself with how to represent linguistic data in digital form and with best practices for working with that data, while remaining attentive to ethical concerns and established practice in language documentation, sociocultural linguistics, and language revitalization.

DLx Resources

Data Management

Types of things called "data" in linguistics:

  • audiovisual media
  • (time-aligned) annotations
  • metadata
  • lexical databases
  • corpora
  • publications containing any of the above

Metadata

Data that describes another set of data.

  • location(s)
  • date(s)
  • speakers / researchers
  • sociocultural context
  • documentary context
  • folder/repository structure
  • file formats / naming conventions
  • terminology / glossary / abbreviations
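
As a rough illustration, the kinds of metadata listed above could be kept in a plain-text JSON "sidecar" file stored next to a recording. The field names and values below are hypothetical and do not follow any particular metadata standard.

```python
import json

# Hypothetical metadata record for one recording session.
# Field names and values are illustrative only, not taken from any standard.
session_metadata = {
    "location": "Example Village",
    "date": "2019-05-24",
    "speakers": ["Speaker A"],
    "researchers": ["Researcher B"],
    "socioculturalContext": "informal conversation at home",
    "documentaryContext": "third elicitation session of a field trip",
    "files": ["session03.wav", "session03.eaf"],
    "abbreviations": {"PL": "plural", "PST": "past"},
}


def to_sidecar_json(metadata: dict) -> str:
    """Serialize a metadata record to human-readable JSON text."""
    return json.dumps(metadata, indent=2, ensure_ascii=False, sort_keys=True)


sidecar_text = to_sidecar_json(session_metadata)
print(sidecar_text)
```

Because the result is plain text, it can be read in any text editor and tracked in version control alongside the data it describes.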

Metadata Standards

 

Different tools use different metadata formats, or define their own.

Data Management Plan (DMP)

  • Required by most funding organizations
  • Current practice has a focus on archiving
  • Good DMPs plan for the entire lifecycle of the data

 

Data Lifecycle

  1. data entry
  2. data cleaning
  3. data editing
  4. data use

Data Workflow

  1. recording
  2. metadata
  3. (time-aligned) annotation
  4. presentation

 

Back up and/or archive at every stage

Back up and/or archive at every version
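
One way to check that a backup copy is still identical to the original is to compare checksums. The sketch below is a minimal example using SHA-256; the file names are hypothetical and the "recording" is just a few demo bytes written to a temporary folder.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, backup: Path) -> bool:
    """True if the backup has exactly the same bytes as the original."""
    return sha256_of(original) == sha256_of(backup)


# Demonstration with throwaway files (hypothetical names, not real audio).
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "session03.wav"
    backup = Path(tmp) / "session03.backup.wav"
    original.write_bytes(b"not real audio, just demo bytes")
    backup.write_bytes(original.read_bytes())
    copy_is_intact = backup_matches(original, backup)

    # Simulate a damaged backup.
    backup.write_bytes(b"corrupted")
    corruption_detected = not backup_matches(original, backup)

print(copy_is_intact, corruption_detected)
```

Storing the checksum with the metadata for each version makes it possible to repeat this check later, for example after files have been copied to an archive.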

Primary ("Raw") Data

  • audiovisual recordings
  • images / scans

 

Data are in "binary" format files (i.e. non-text files)

Must have specialized software to read

Not human-readable

Images             .jpg, .jpeg, .png, .svg
Scans / Documents  .pdf, .docx
Audio              .wav, .mp3, .wma
Video              .mpeg, .avi, .mov, .mp4
Databases          .xlsx, .accdb, .fmp

JPEG file

JPEG file (as text)
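
The difference between viewing a binary file with the right software and viewing it "as text" can be sketched in a few lines. The example below builds the first bytes of a typical JPEG (JFIF) file by hand rather than reading a real image, and compares them with the bytes of a short plain-text string.

```python
# First bytes of a typical JPEG (JFIF) file: the start-of-image marker
# (FF D8), an APP0 marker (FF E0), a segment length, and the "JFIF" tag.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]) + b"JFIF\x00"

# A plain-text file is also stored as bytes, but those bytes follow a
# text encoding (here UTF-8), so any text editor can display them.
text_bytes = "hello world".encode("utf-8")


def is_utf8_text(data: bytes) -> bool:
    """True if the bytes can be decoded as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_utf8_text(jpeg_header))
print(is_utf8_text(text_bytes))

# Forcing the binary data into a one-byte-per-character encoding "works",
# but the result is not meaningful to a human reader.
print(repr(jpeg_header.decode("latin-1")))
```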

Structured Data (Text)

Markup

 

Non-Proprietary

  • .txt (Text)
  • .md (Markdown)
  • .json (JavaScript Object Notation / JSON)
  • .sql (Structured Query Language / SQL)
  • .yml (YAML)
  • .xml (Extensible Markup Language / XML)
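
A practical advantage of formats like these is that general-purpose parsers can read them without any linguistics-specific software. The sketch below parses a small JSON utterance and rebuilds its gloss line; the structure and field names are hypothetical, chosen only for illustration.

```python
import json

# A hypothetical utterance stored as JSON text (field names are assumptions).
utterance_json = """
{
  "transcription": "los perros",
  "translation": "the dogs",
  "words": [
    { "form": "los",    "gloss": "the.PL" },
    { "form": "perros", "gloss": "dog-PL" }
  ]
}
"""

# Any JSON parser can turn the text into ordinary data structures.
utterance = json.loads(utterance_json)

# Rebuild the gloss line from the word-level annotations.
gloss_line = " ".join(word["gloss"] for word in utterance["words"])
print(gloss_line)  # the.PL dog-PL
```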

Proprietary

  • EAF (ELAN)
  • FlexText (FLEx)
  • SFM (Toolbox)
  • TextGrid (Praat)
  • SayMore
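
Because tool-specific formats like these are still structured text, a short script can often pull data out of them. The sketch below reads a simplified fragment modeled on ELAN's XML-based EAF format. The fragment is an assumption for illustration: it is not a complete or valid EAF file, and real files contain many more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Simplified, EAF-like fragment (illustrative only, not a valid EAF file).
eaf_like = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>los perros</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(eaf_like)

# Map each time slot ID to its time in milliseconds.
times = {
    slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
    for slot in root.iter("TIME_SLOT")
}

# Collect the time-aligned annotations from every tier.
annotations = []
for tier in root.iter("TIER"):
    for ann in tier.iter("ALIGNABLE_ANNOTATION"):
        annotations.append({
            "tier": tier.get("TIER_ID"),
            "start_ms": times[ann.get("TIME_SLOT_REF1")],
            "end_ms": times[ann.get("TIME_SLOT_REF2")],
            "text": ann.findtext("ANNOTATION_VALUE"),
        })

print(annotations)
```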

Tools

  • Audacity
  • database software (Access, FileMaker Pro)
  • ELAN
  • Elpis
  • FLEx
  • keyboards (Keyman, typeit.org)
  • Kratylos
  • Lexique Pro
  • open source projects (DLx)
  • Praat
  • SayMore
  • scripts (JavaScript, Python, R)
  • spreadsheet software (Excel, OpenOffice)
  • SQL (HeidiSQL)
  • text editors (Atom, Notepad++)
  • Toolbox
  • Transcriber
  • Webonary
  • WeSay

Data Workflow + Tools

Problems

  • operating system-specific
  • task-specific
  • variety of formats
  • access / licensing
  • do not synchronize (easily)
  • few backup / archiving solutions
  • not easily citable / shareable

Recommendations

  • version control
  • single source of truth
  • document your workflow (for yourself as much as others)
  • document your formats / fields
  • avoid manual transformations / processes
  • write scripts (document their inputs and outputs carefully)
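
Several of these recommendations can be combined in a single small script. The sketch below converts a tab-separated wordlist to JSON instead of retyping it by hand, and states its expected input and output in the docstring. The column names (form, gloss, pos) and the sample rows are hypothetical.

```python
import csv
import io
import json


def wordlist_tsv_to_json(tsv_text: str) -> str:
    """Convert a tab-separated wordlist to JSON.

    Input:  TSV text whose header row contains the columns
            'form', 'gloss', and 'pos' (hypothetical column names).
    Output: JSON text: a list of objects with those three keys,
            in the same order as the input rows.
    Raises: ValueError if a required column is missing.
    """
    columns = ("form", "gloss", "pos")
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = set(columns) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    entries = [
        {key: (row[key] or "").strip() for key in columns}
        for row in reader
    ]
    return json.dumps(entries, indent=2, ensure_ascii=False)


sample_tsv = "form\tgloss\tpos\nperro\tdog\tnoun\ncorrer\trun\tverb\n"
converted = wordlist_tsv_to_json(sample_tsv)
print(converted)
```

Keeping the spreadsheet or TSV file as the single source of truth and regenerating the JSON with the script means the same transformation can be repeated exactly whenever the data changes.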

Principles

  • Open Web Platform
  • open source
  • web-based
  • standards-based
  • discoverable / open access

Goals

  • data format (JSON)
  • open-source tools ecosystem
  • education

Formats

Tools

Contact