Digital Linguistics for Language Documentation

Daniel W. Hieber

University of California, Santa Barbara

May 24, 2019

What is Digital Linguistics (DLx)?

Digital Linguistics (DLx) is the science of digital data management for linguistics, including the digital storage, representation, manipulation, and dissemination of linguistic data. It concerns how to represent linguistic data in digital form and how best to work with that data, while remaining attentive to best practices and ethical concerns in language documentation, sociocultural linguistics, and language revitalization.

DLx Resources

Data Management

Types of things called "data" in linguistics:

  • audiovisual media
  • (time-aligned) annotations
  • metadata
  • lexical databases
  • corpora
  • publications containing any of the above


Metadata: data that describes another set of data.

  • location(s)
  • date(s)
  • speakers / researchers
  • sociocultural context
  • documentary context
  • folder/repository structure
  • file formats / naming conventions
  • terminology / glossary / abbreviations
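The metadata categories above could be recorded in a small machine-readable file kept next to each recording. The following is a minimal sketch only: the field names and values are hypothetical and do not follow any particular metadata standard.

```python
import json

# Hypothetical list of fields a project might require for every session.
REQUIRED_FIELDS = [
    "location",
    "date",
    "speakers",
    "researchers",
    "socioculturalContext",
    "documentaryContext",
]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]


# Illustrative placeholder values, not real project data.
session_metadata = {
    "location": "Example Town",
    "date": "2019-01-01",
    "speakers": ["Speaker A"],
    "researchers": ["Researcher B"],
    "socioculturalContext": "storytelling at a family gathering",
    "documentaryContext": "first recording session of the project",
}

# Serialize to plain-text JSON so the record can be stored with the recording.
metadata_json = json.dumps(session_metadata, ensure_ascii=False, indent=2)
```

Running `missing_fields` before archiving is one way to catch incomplete records early.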

Metadata Standards


Different tools use different metadata formats, or define their own.

Data Management Plan (DMP)

  • Required by most funding organizations
  • Current practice focuses on archiving
  • Good DMPs plan for the entire lifecycle of the data


Data Lifecycle

  1. data entry
  2. data cleaning
  3. data editing
  4. data use

Data Workflow

  1. recording
  2. metadata
  3. (time-aligned) annotation
  4. presentation


Back up and/or archive at every stage

Back up and/or archive at every version
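Backing up at every version can be partly automated. The sketch below copies a file to a timestamped name and checks the copy against the original with a checksum. The file name `session01.eaf` and the `backups` folder are hypothetical, and this is one possible approach rather than a recommended archiving procedure.

```python
import hashlib
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` under a timestamped name and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target


# Demo in a throwaway directory with a hypothetical annotation file.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "session01.eaf"
    original.write_text("<ANNOTATION_DOCUMENT/>", encoding="utf-8")
    copy = backup_file(original, Path(tmp) / "backups")
    backup_matches = sha256_of(copy) == sha256_of(original)
    backup_name = copy.name
```

The timestamp has one-second resolution, so two backups of the same file within the same second would share a name. A version control system avoids that problem and also records what changed.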

Primary ("Raw") Data

  • audiovisual recordings
  • images / scans


Data are in "binary" format files (i.e. non-text files)

Must have specialized software to read

Not human-readable

Images: .jpg, .jpeg, .png, .svg
Scans / Documents: .pdf, .docx
Audio: .wav, .mp3, .wma
Video: .mpeg, .avi, .mov, .mp4
Databases: .xlsx, .accdb, .fmp

JPEG file

JPEG file (as text)
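To illustrate why a binary file is unreadable when opened as text, the sketch below compares the opening bytes of a JPEG (JFIF) file with a short piece of JSON. The helper names are hypothetical, and the text check is a rough heuristic, not a general file-type detector.

```python
# The first bytes of a typical JPEG (JFIF) file, and a small piece of JSON text.
JPEG_HEADER = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
JSON_TEXT = '{"headword": "example"}'.encode("utf-8")


def looks_like_text(data: bytes) -> bool:
    """Heuristic: text data has no NUL bytes and decodes cleanly as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_jpeg(data: bytes) -> bool:
    """JPEG files begin with the bytes FF D8 FF."""
    return data.startswith(b"\xff\xd8\xff")


# Forcing the binary data into text yields replacement characters,
# which is roughly what a text editor shows for a JPEG file.
as_text = JPEG_HEADER.decode("utf-8", errors="replace")
```

Here `looks_like_text(JPEG_HEADER)` is false and `looks_like_text(JSON_TEXT)` is true, which is the practical difference between the two kinds of file.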

Structured Data (Text)




  • .txt (Text)
  • .md (Markdown)
  • .json (JavaScript Object Notation / JSON)
  • .sql (Structured Query Language / SQL)
  • .yml (YAML)
  • .xml (Extensible Markup Language / XML)
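Because these formats are plain text, the same record can be read in a text editor and parsed with standard libraries. The sketch below stores one hypothetical lexical entry as both JSON and XML and reads each back. The element and key names are invented for illustration and do not come from any existing schema.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical lexical entry in two structured text formats.
ENTRY_JSON = '{"headword": "example", "gloss": "illustration", "partOfSpeech": "noun"}'

ENTRY_XML = """<entry>
  <headword>example</headword>
  <gloss>illustration</gloss>
  <partOfSpeech>noun</partOfSpeech>
</entry>"""


def entry_from_json(text: str) -> dict:
    """Parse a JSON object into a dictionary."""
    return json.loads(text)


def entry_from_xml(text: str) -> dict:
    """Map each child element of <entry> to a key/value pair."""
    root = ET.fromstring(text)
    return {child.tag: (child.text or "") for child in root}


json_entry = entry_from_json(ENTRY_JSON)
xml_entry = entry_from_xml(ENTRY_XML)
```

Both calls produce the same dictionary, so the choice of format can follow the tools a project already uses.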

Structured Data (Text)


  • EAF (ELAN)
  • FlexText (FLEx)
  • SFM (Toolbox)
  • TextGrid (Praat)
  • SayMore
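Since these formats are text-based, their contents can be read by a script as well as by the original tool. As a sketch, the code below extracts time-aligned annotations from a heavily simplified EAF-like fragment. Real ELAN files carry more elements and attributes (and may contain time slots with no time value), and the tier name `transcription` is an assumption.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical fragment in the style of an ELAN .eaf file.
EAF_FRAGMENT = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_annotations(eaf_xml: str, tier_id: str) -> list:
    """Return (start_ms, end_ms, text) tuples for one tier."""
    root = ET.fromstring(eaf_xml)
    # Map each time slot ID to its time in milliseconds.
    times = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    results = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = times[ann.get("TIME_SLOT_REF1")]
            end = times[ann.get("TIME_SLOT_REF2")]
            text = ann.findtext("ANNOTATION_VALUE", default="")
            results.append((start, end, text))
    return results


annotations = read_annotations(EAF_FRAGMENT, "transcription")
```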


  • Audacity
  • database software (Access, FileMaker Pro)
  • ELAN
  • Elpis
  • FLEx
  • keyboards (Keyman,
  • Kratylos
  • Lexique Pro
  • open source projects (DLx)
  • Praat
  • SayMore
  • scripts (JavaScript, Python, R)
  • spreadsheet software (Excel, OpenOffice)
  • SQL (HeidiSQL)
  • text editors (Atom, Notepad++)
  • Toolbox
  • Transcriber
  • Webonary
  • WeSay

Data Workflow + Tools


  • operating system-specific
  • task-specific
  • variety of formats
  • access / licensing
  • do not synchronize (easily)
  • few backup / archiving solutions
  • not easily citeable / shareable

Data Workflow + Tools


  • version control
  • single source of truth
  • document your workflow (for yourself as much as others)
  • document your formats / fields
  • avoid manual transformations / processes
  • write scripts (document their inputs and outputs carefully)
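As a sketch of the last two points, a small documented script can replace a manual copy-and-paste step. The example below converts a tab-separated word list to JSON. The column names `headword` and `gloss` and the sample rows are hypothetical.

```python
import csv
import io
import json


def wordlist_tsv_to_json(tsv_text: str) -> str:
    """Convert a tab-separated word list to JSON.

    Input:  TSV text whose header row contains the columns
            `headword` and `gloss` (other columns are ignored).
    Output: a JSON array of objects with the keys `headword` and `gloss`,
            in the same order as the input rows.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    entries = [
        {"headword": row["headword"].strip(), "gloss": row["gloss"].strip()}
        for row in reader
    ]
    return json.dumps(entries, ensure_ascii=False, indent=2)


# Illustrative placeholder rows, not real language data.
SAMPLE_TSV = "headword\tgloss\nexample\tillustration\nsample \tspecimen\n"

converted = wordlist_tsv_to_json(SAMPLE_TSV)
```

Because the conversion is scripted, it can be rerun whenever the source list changes, and the docstring records what the script expects and produces.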


  • Open Web Platform
  • open source
  • web-based
  • standards-based
  • discoverable / open access


  • data format (JSON)
  • open-source tools ecosystem
  • education