Oracc and RITS


New tools for editing the world's oldest texts

Prof. Eleanor Robson

Raquel Alegre

Dr. James Hetherington

Dr. Jens Nielsen

UCL History Department

UCL Research Software Development Group

RSDG work for ORACC

ASCII Transliteration Format (ATF)

Editing ATF now:

  • Very good functionality, but:
    • Emacs plugin
    • Difficult to install
    • Difficult to use
    • Not very user friendly
    • Steep learning curve for users
    • Needs internet connection to validate files

UCL RSDG objectives:

  • Appropriate ATF grammar description
  • Create a user friendly intuitive interface
  • Reduce the time PIs spend training and retraining users

RSDG work for ORACC

 

  • Custom Software Development of two tools:
    • PyORACC: Python tools for validating ATF files
    • Nammu: GUI for editing ATF files

 

  • SOAP Web Services
    • Communication with the ORACC Server

 

  • Easy deployment
    • JAR files

PyORACC

  • Developed in Python using the Ply module
    • Implementation of Lex and Yacc parsing tools
      • Lexical analysis: Splits the input text file into tokens 
      • Syntactic and semantic analysis: Finds the hierarchical structure of the text

 

  • Allows validation of ATF files offline

ASCII Transliteration Format

Metadata: project info, lang, protocols...

Transliteration and lemmatization

Translation

Comments

Descriptions: rulings, blank, ...

Sections: object, parts, ...

Lexical Analysis

Breaks the input text into a stream of tokens, each matched by a regular expression (RE):

Input line: #project: cams/gkab

  • #  →  t_HASH  r'\#'
  • project  →  PROJECT
  • :  →  t_COLON  r'\:'
  • cams/gkab  →  t_ID  r'[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+'
  • [new line]  →  t_NEWLINE  r'\n'
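
The tokenisation step above can be sketched with Python's standard `re` module alone. This is a minimal illustration, not the PyORACC lexer: it does not use Ply, the token names simply mirror the slide, and the `SKIP` rule for spaces and the `tokenize` helper are assumptions added for the example.

```python
import re

# Token names follow the slide; SKIP is an assumption added to drop blanks.
TOKEN_SPEC = [
    ("HASH", r"\#"),
    ("PROJECT", r"project\b"),
    ("COLON", r"\:"),
    ("ID", r"[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),
]
# One combined pattern; alternatives are tried in the order listed above.
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_name, matched_text) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError("Illegal character %r at position %d" % (text[pos], pos))
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("#project: cams/gkab\n")
for name, value in tokens:
    print(name, repr(value))
```

Listing `PROJECT` before `ID` matters here, because the keyword would otherwise be consumed by the more general identifier pattern.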

Semantic Analysis

Yacc parses and does semantic processing on the stream of tokens produced by Lex, following a grammar description:

expression : expression + term
           | expression - term
           | expression * term
           | expression / term
           | term

Example input: 3 * 5 + 1

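
As a rough illustration of how this grammar drives evaluation, the sketch below applies the rule `expression : expression OP term` from left to right. It is hand-written rather than generated by Yacc or Ply, the `evaluate` helper is hypothetical, and it assumes that a `term` is a plain non-negative integer.

```python
import re


def evaluate(source):
    """Evaluate source using: expression : expression OP term | term."""
    # Assumption: terms are non-negative integers, operators are + - * /
    tokens = re.findall(r"\d+|[-+*/]", source)
    value = int(tokens[0])  # base case, expression : term
    # Each further (operator, term) pair extends the expression on its left
    for operator, term in zip(tokens[1::2], tokens[2::2]):
        number = int(term)
        if operator == "+":
            value = value + number
        elif operator == "-":
            value = value - number
        elif operator == "*":
            value = value * number
        else:
            value = value / number
    return value


result = evaluate("3 * 5 + 1")
print(result)  # 16
```

Because the grammar as written places all four operators at the same level, grouping is strictly left to right: under this sketch `1 + 3 * 5` gives 20, not 16.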
ATF document grammar example

def p_document(self, p):
    """document : text
                | object
                | composite"""

def p_text_language(self, p):
    "text : text language_protocol"
    # Carry forward the text built so far instead of starting a new one
    p[0] = p[1]
    p[0].language = p[2]

def p_language_protocol(self, p):
    "language_protocol : ATF LANG ID newline"
    # The ID token holds the language code
    p[0] = p[3]
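
To show what the two language rules above compute, here is a small hand-written sketch that consumes a token list and fills in a text object. It is not the Ply-generated parser: the `Text` class is a hypothetical stand-in, the token names and values are assumed, and the base case for `text` (an empty text) is an assumption, since the slide does not show it.

```python
class Text(object):
    """Hypothetical stand-in for the parsed text model."""

    def __init__(self):
        self.language = None


def parse_language_protocol(tokens, pos):
    """language_protocol : ATF LANG ID newline"""
    expected = ["ATF", "LANG", "ID", "NEWLINE"]
    found = [kind for kind, _ in tokens[pos:pos + 4]]
    if found != expected:
        raise SyntaxError("expected %s, found %s" % (expected, found))
    # The value of the third symbol (ID) is the result, like p[3] in the rule
    return tokens[pos + 2][1], pos + 4


def parse_text(tokens):
    """text : text language_protocol, applied repeatedly from the left."""
    text = Text()  # assumed base case: start from an empty text
    pos = 0
    while pos < len(tokens):
        language, pos = parse_language_protocol(tokens, pos)
        text.language = language
    return text


# Assumed token stream for the line "#atf: lang akk"
tokens = [("ATF", "#atf:"), ("LANG", "lang"), ("ID", "akk"), ("NEWLINE", "\n")]
text = parse_text(tokens)
print(text.language)  # akk
```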

Parse tree

Nammu

  • Graphical User Interface for editing ATF texts

 

  • Uses PyORACC:
    • ATF file validation
    • Syntax highlighting

 

  • Can talk to the ORACC server to validate and lemmatise ATF files.

Nammu

  • Developed using Jython:
    • Python for the Java Virtual Machine
    • Platform independent - runs on any computer that has a JVM installed
    • Access to both Python and Java libraries
    • Swing - powerful and widely used GUI widget toolkit for Java

Functionality

  • Text editing
  • Offline (partial) validation using PyORACC
  • ATF validation using ORACC Server
  • Lemmatisation
  • Interactive Model View
  • Console log
  • Syntax Highlighting based on PyORACC
  • Error highlighting
  • Easy to install - JAR file

 

Web Services

  • Client-Server architecture:
    • ORACC Server (UPenn)
    • Tools for validating and lemmatising ATF files.
    • Listens for requests from clients such as Nammu or the Emacs plugin for ORACC

Web Services

 

  • SOAP: Simple Object Access Protocol
    • Specifies how to exchange structured information via web services over HTTP
    • WSDL: Web Services Description Language
    • XML format for describing network services
      • SOAP Envelope
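
A SOAP message is an XML envelope wrapping a body that carries the request. The sketch below builds one with the standard library only. The SOAP 1.2 namespace is an assumption (the slides do not say which SOAP version the ORACC server expects), and the operation name `ValidateRequest` and its parameters are hypothetical, not the server's real interface.

```python
import xml.etree.ElementTree as ET

# Assumption: SOAP 1.2 envelope namespace
SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"


def build_envelope(operation, parameters):
    """Return a SOAP envelope string with one request element in the body."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    request = ET.SubElement(body, operation)
    # Each parameter becomes a child element of the request
    for name, value in parameters.items():
        ET.SubElement(request, name).text = value
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameter names, for illustration only
xml_text = build_envelope(
    "ValidateRequest",
    {"project": "cams/gkab", "filename": "example.atf"},
)
print(xml_text)
```

A client would send this string as the body of an HTTP POST to the service endpoint described by the WSDL.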

Asynchronous communication

Software Development

 

  • Open Source
  • Continuous Integration
    • Deployment and changes tested daily on Jenkins
    • ~90% of the whole ORACC corpus covered by PyORACC tests
  • Version Control: GitHub 
  • Maven
    • Compiles, tests and prepares JAR file
    • Maven-Jython plugin update

Future steps on development

Feedback and suggestions are welcome!

  • 1st functional version of Nammu ready in July
    • Text editing, validation and lemmatisation are ready, but other functionality needs to be added
  • Interactive Model View for new users needs more work
  • ORACC Unicode keyboard 
  • Remote editing of files
  • Allow creation of new projects
  • Other server functionality using glossaries

If we were inventing ORACC now,

what would we do differently?

Principles:

we got this right, I think

  • Open access content at all levels
  • Free to create projects, however large or small
  • Balance between flexibility and coherence
  • Commitment to documentation (could do more)
  • Open linked data (could do more)

Staff and infrastructure:

opportunities much better now!

  • Involve research software engineers from (before) day 1
  • Acknowledge and plan for the ‘bus problem’
  • Plan for scalability
  • Make home institutions aware, proud, supportive
  • Fund ongoing programming and operational support

Code:

planned design, not ad hoc evolution!

  • Treat input, processing, output as separate issues (as we did)
  • Weigh cost-benefits of domain-specific language design (e.g. ATF vs EpiDoc)
  • Accessible interface for content creators:
    low entry barriers vs speed & control

  • Practical tools, not just principles, to make content searchable and mineable
  • Document and get user feedback on ALL of it!


UCL Research Software Development Group

Research Software Developers

Researcher

Software Developer

Research Software Developer

Research Software Developers

 

  • Not independent researchers
    • No personal research agenda
  • Facilitative, Supportive and Collaborative
    • Deep engagement with research groups
    • Understand, study and be part of group research activities
  • Sustainable outputs
    • Institutional memory
    • Continuity, stability and maintenance

UCL Research Software Development Group

  • Started in 2012, grown from 1 to 8 RSDs!
    • Helped UCL win £1.5M in research income
    • Part of UCL Research IT Services
  • Custom software development for research projects
    • From simple scripts to HPC
    • From theoretical physics to humanities
  • Infrastructure
    • Version control, testing, DevOps, HPC, ...

UCL Research Software Development Group

 

  • Training
    • Software Carpentry, C++/Python for research, ...
  • Networking
    • UCL Programming Hub, SSI, UK RSE, ...
  • Tech support for research groups
    • Mentoring, assessments, recruitment panels, ...

UCL Programming Hub

  • Research groups or individuals across campus involved in software development for research.
  • Monthly tech socials on a range of RSD topics and tools
  • Weekly coffee mornings on Wednesday at SCR

 

http://research-programming.ucl.ac.uk

UCL RSE community

UCL Research Software Dashboard

  • Central repositories for UCL Research Software

http://dashboard.rc.ucl.ac.uk

Work with us!

  • Avoid well-known problems in research software:
    • Low levels of reuse
    • Poor standards of verification
  • Why have an RSD?
    • More and more research uses software
    • General programmers don't understand research
    • Post-docs and PhD students don't always write reliable, reusable, maintainable code

Contact us

@uclrcsoftdev

 

rc-softdev@ucl.ac.uk

 

r.alegre@ucl.ac.uk

 

 

 

Thank you!

Questions?

Oracc and RSDG

By Raquel Alegre

Presentation for UCL Digital Humanities on work carried out for UCL History as part of Oracc.