Digital Linguistics for Language Documentation
Daniel W. Hieber
University of California, Santa Barbara
May 24, 2019
Slides available at:
https://slides.com/dwhieb/digital-linguistics-for-language-documentation
What is Digital Linguistics (DLx)?
Digital Linguistics (DLx) is the science of digital data management for linguistics, including the digital storage, representation, manipulation, and dissemination of linguistic data. It concerns how to represent linguistic data in digital form and how best to work with that data, while remaining attentive to best practices and ethical concerns in language documentation, sociocultural linguistics, and language revitalization.
DLx Resources
Data Management
Types of things called "data" in linguistics:
- audiovisual media
- (time-aligned) annotations
- metadata
- lexical databases
- corpora
- publications containing any of the above
Metadata
Data that describes another set of data.
- location(s)
- date(s)
- speakers / researchers
- sociocultural context
- documentary context
- folder/repository structure
- file formats / naming conventions
- terminology / glossary / abbreviations
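As a minimal sketch, the kinds of metadata listed above could be stored next to a recording as a small structured record. The class name `SessionMetadata`, its field names, and the sample values are all hypothetical and are not drawn from any published metadata standard; a real project would follow whatever schema its archive expects.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class SessionMetadata:
    """Hypothetical metadata record for one recording session."""
    location: str
    date: str                 # ISO 8601 date string
    speakers: List[str]
    researchers: List[str]
    context: str = ""         # sociocultural / documentary context
    abbreviations: Dict[str, str] = field(default_factory=dict)


def missing_fields(record: SessionMetadata) -> List[str]:
    """Return the names of fields that were left empty."""
    return [name for name, value in asdict(record).items() if not value]


# Placeholder values for illustration only.
session = SessionMetadata(
    location="(placeholder location)",
    date="2019-05-24",
    speakers=["Speaker A"],
    researchers=["Researcher B"],
)

# The record serializes to plain, human-readable text.
print(json.dumps(asdict(session), indent=2))

# Fields still to be filled in before archiving.
print(missing_fields(session))  # ['context', 'abbreviations']
```

A check like `missing_fields` is one way to catch incomplete metadata early, before the gaps become hard to fill.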
Metadata Standards
- Open Language Archives Community (OLAC)
- ISLE Metadata Initiative (IMDI)
- Data Format for Digital Linguistics (DaFoDiL)
Different tools use different metadata formats, or define their own.
Data Management Plan (DMP)
- Required by most funding organizations
- Current practice has a focus on archiving
- Good DMPs plan for the entire lifecycle of the data
Data Lifecycle
- data entry
- data cleaning
- data editing
- data use
Data Workflow
- recording
- metadata
- (time-aligned) annotation
- presentation
Back up and/or archive at every stage
Back up and/or archive at every version
Primary ("Raw") Data
- audiovisual recordings
- images / scans
Data are in "binary" format files (i.e. non-text files)
Must have specialized software to read
Not human-readable
Type | Extensions
Images | .jpg, .jpeg, .png, .svg
Scans / Documents | .pdf, .docx
Audio | .wav, .mp3, .wma
Video | .mpeg, .avi, .mov, .mp4
Databases | .xlsx, .accdb, .fmp
JPEG file
JPEG file (as text)
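The contrast between the two views of a JPEG can be sketched in a few lines. The example does not read a real image: as an assumption made to keep it self-contained, it uses only the four bytes that typically open a JPEG (JFIF) file, and shows that they cannot be decoded as UTF-8 text, whereas the bytes of a text format decode cleanly. The helper name `is_readable_text` is hypothetical.

```python
# First bytes of a typical JPEG (JFIF) file: the start-of-image marker
# FF D8 followed by an APP0 marker FF E0. Illustrative only, not a full image.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])

# A text-based format is just encoded characters.
json_text = '{"word": "hello"}'.encode("utf-8")


def is_readable_text(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_readable_text(jpeg_header))  # False
print(is_readable_text(json_text))    # True
```

Decoding failure is only a rough heuristic for telling binary from text, but it illustrates why opening a binary file in a text editor produces unreadable output.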
Structured Data (Text)
Markup
Non-Proprietary
- .txt (Text)
- .md (Markdown)
- .json (JavaScript Object Notation / JSON)
- .sql (Structured Query Language / SQL)
- .yml (YAML)
- .xml (Extensible Markup Language / XML)
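To illustrate why plain-text structured formats are attractive, here is a minimal sketch that writes one lexical entry as JSON and reads it back using only Python's standard library. The entry and its field names (`lemma`, `transcription`, `gloss`, `partOfSpeech`) are hypothetical and are not taken from any particular data format.

```python
import json

# Hypothetical lexical entry; field names are illustrative only.
entry = {
    "lemma": "example",
    "transcription": "ɪɡˈzæmpəl",
    "gloss": "a sample word",
    "partOfSpeech": "noun",
}

# Serialize to human-readable text. ensure_ascii=False keeps non-ASCII
# characters (common in orthographies and IPA) legible in the output.
text = json.dumps(entry, ensure_ascii=False, indent=2)
print(text)

# Any JSON-aware tool can parse the text back into the same structure.
restored = json.loads(text)
print(restored == entry)  # True
```

Because the result is ordinary text, it can be opened in any text editor and compared line by line between versions.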
Structured Data (Text)
Proprietary
- EAF (ELAN)
- FlexText (FLEx)
- SFM (Toolbox)
- TextGrid (Praat)
- SayMore
Tools
- Praat
- SayMore
- scripts (JavaScript, Python, R)
- spreadsheet software (Excel, OpenOffice)
- SQL (HeidiSQL)
- text editors (Atom, Notepad++)
- Toolbox
- Transcriber
- Webonary
- WeSay
Data Workflow + Tools
Problems
- operating system-specific
- task-specific
- variety of formats
- access / licensing
- do not synchronize (easily)
- few backup / archiving solutions
- not easily citable / shareable
Data Workflow + Tools
Recommendations
- version control
- single source of truth
- document your workflow (for yourself as much as others)
- document your formats / fields
- avoid manual transformations / processes
- write scripts (document their inputs and outputs carefully)
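As a sketch of the last two recommendations, the script below replaces a manual copy-and-paste step with a small function whose inputs and outputs are documented in its docstring. The function name `tsv_to_json`, the tab-separated layout, and the column names (`lemma`, `gloss`) are assumptions made for illustration; a real project would substitute its own fields.

```python
import csv
import io
import json


def tsv_to_json(tsv_text: str) -> str:
    """Convert a tab-separated word list to JSON.

    Input:  text whose first line is a header row (e.g. "lemma<TAB>gloss")
            and whose remaining lines hold one entry each.
    Output: a JSON string containing a list of objects, one per entry,
            keyed by the header names.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    entries = [dict(row) for row in reader]
    return json.dumps(entries, ensure_ascii=False, indent=2)


# Hypothetical sample data standing in for an exported word list.
sample = "lemma\tgloss\nwater\tliquid\nfire\tflame\n"
print(tsv_to_json(sample))
```

Because the transformation lives in a script rather than in a sequence of manual edits, it can be rerun whenever the source data changes and kept under version control alongside the data.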
Principles
- Open Web Platform
- open source
- web-based
- standards-based
- discoverable / open access
Goals
- data format (JSON)
- open-source tools ecosystem
- education
Formats
Tools
- tools.digitallinguistics.io
- app.digitallinguistics.io
- scripts (GitHub)
- converters
- transliterator