Open Richly Annotated Cuneiform Corpus

RSDG Team Meeting - 9th October 2018

Raquel Alegre

ASCII Transliteration Format (ATF)

  • Metadata: project info, lang, protocols, ...
  • Transliteration and lemmatization
  • Translation
  • Comments
  • Descriptions: rulings, blank, ...
  • Sections: object, parts, ...

Editing ATF files before:

  • Emacs plug-in: good functionality, but
    • Difficult to install
    • Difficult to use
    • Not very user friendly
    • Steep learning curve for users
    • Needs internet connection to validate files

RSDG project objectives:

  • Appropriate ATF grammar description
  • Create a user-friendly, intuitive interface
  • Allow for offline validation of texts
  • Take into account Arabic-speaking users

RSDG work for Oracc


  • PyOracc:
    • Python tool for validating ATF files
  • Nammu:
    • GUI for editing ATF files that incorporates PyOracc
    • SOAP Web Services
      • Communication with the Oracc Server
    • Cross platform and easy deployment
      • JAR 
  • Website update
    • Search and browse catalogue
    • Angular, Flask+ElasticSearch 

PyOracc

  • Developed in Python using the PLY module
    • Implementation of Lex and Yacc parsing tools
      • Lexical analysis: Splits the input text file into tokens 
      • Semantic analysis: Finds hierarchical structure of the text 

 

  • Allows validation of ATF files offline

Lexical Analysis

Lex breaks the input text into a stream of tokens by matching it against regular expressions (RE). For example, the line "#project: cams/gkab" yields:

  Input text    Token        RE
  #             t_HASH       r'\#'
  project       PROJECT      -
  :             t_COLON      r'\:'
  cams/gkab     t_ID         r'[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+'
  [new line]    t_NEWLINE    r'\n'
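
The tokenisation step can be sketched with Python's standard `re` module alone. This is only an illustration of the idea, not the PyOracc lexer: it does not use PLY, the `SKIP` rule and the `tokenize` helper are assumptions made for the example, and the token names are borrowed from the slide.

```python
import re

# Token names follow the slide; the patterns are simplified stand-ins,
# not the actual PyOracc lexer rules.
TOKEN_SPEC = [
    ("HASH", r"\#"),
    ("PROJECT", r"project\b"),
    ("COLON", r":"),
    ("ID", r"[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),  # assumed rule: whitespace between tokens is dropped
]
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return a list of (token_name, matched_text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER_RE.match(text, pos)
        if match is None:
            raise ValueError(f"Illegal character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for name, value in tokenize("#project: cams/gkab\n"):
    print(name, repr(value))
```

Rule order matters in this sketch: `PROJECT` is listed before `ID` so that the keyword is not swallowed by the more general identifier pattern.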

Semantic Analysis

Yacc parses and does semantic processing on the stream of tokens produced by Lex, following a grammar description:

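expression : expression + term
           | expression - term
           | expression * term
           | expression / term
           | term

Example input: 3 * 5 + 1

In PyOracc, each grammar rule is written as the docstring of a parser method:

def p_document(self, p):
    """document : text
                | object
                | composite"""
    # Pass the matched element up unchanged
    p[0] = p[1]

def p_text_language(self, p):
    "text : text language_protocol"
    # Reuse the text parsed so far instead of starting a new one
    p[0] = p[1]
    p[0].language = p[2]

def p_language_protocol(self, p):
    "language_protocol : ATF LANG ID newline"
    # The language code is the third symbol (ID)
    p[0] = p[3]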

Parse tree
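
As a rough illustration, the tree for the example input 3 * 5 + 1 under the toy expression grammar can be built with a few lines of plain Python. This sketch does not use PLY or PyOracc; the function names and the nested-tuple representation are assumptions made for the example, and a term is restricted to an integer literal.

```python
import re

OPERATORS = {"+", "-", "*", "/"}


def tokenize(text):
    """Split an arithmetic expression into number and operator tokens."""
    return re.findall(r"\d+|[-+*/]", text)


def parse_term(token):
    """A term is a single integer literal in this sketch."""
    if not token.isdigit():
        raise SyntaxError(f"expected a number, got {token!r}")
    return int(token)


def parse_expression(tokens):
    """Parse 'expression : expression OP term | term' into nested tuples.

    The left-recursive rule is unrolled into a loop: each new operator
    wraps the tree built so far, which gives left-to-right grouping.
    """
    if not tokens:
        raise SyntaxError("empty expression")
    tree = parse_term(tokens[0])
    rest = tokens[1:]
    if len(rest) % 2 != 0:
        raise SyntaxError("operator without a right-hand term")
    for op, term in zip(rest[0::2], rest[1::2]):
        if op not in OPERATORS:
            raise SyntaxError(f"expected an operator, got {op!r}")
        tree = (op, tree, parse_term(term))
    return tree


print(parse_expression(tokenize("3 * 5 + 1")))
# prints ('+', ('*', 3, 5), 1)
```

Because the toy grammar puts all four operators at the same level, the tree is grouped strictly left to right; a grammar with separate rules for terms and factors would be needed to give * and / higher precedence.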

PyOracc summary

 

  • Developed 2014 - 2016 by James Hetherington, Jens Nielsen and Raquel Alegre.
  • Extension of PyOracc by external collaborator in 2018.
  • Released on PyPI: https://pypi.org/project/pyoracc
    > pip install pyoracc
  • Can be used as a CLI or as a Python module.
  • Integrated into Nammu for offline text validation and syntax highlighting.

Nammu

  • Graphical User Interface for editing ATF texts

  • Uses PyOracc:
    • ATF file validation
    • Syntax highlighting
  • Can talk to the Oracc server to validate and lemmatise ATF files.
  • Developed since 2015 by Raquel and Stuart with the help of Jens, Anastasis and Roma.

Nammu

  • Developed using Jython:
    • Python for the Java Virtual Machine
    • Platform independent - runs on any computer that has a JVM installed
    • Access to both Python and Java libraries
    • Swing - powerful and widely used GUI widget toolkit for Java
  • Maven
    • Compiles, tests and builds JAR file
    • Maven-Jython plugin development

Functionality

  • Text editing (find/replace, undo/redo, split view, ...)
  • Offline (partial) validation using PyOracc
  • Communication with Oracc Server
    • Lemmatisation and validation
  • Console log
  • Syntax highlighting based on PyOracc
  • Error highlighting
  • Cross-platform and easy to install - JAR file
  • Configurable
  • Nahrein - Arabic translation mode

 

Next steps

  • Consolidation of Arabic translation mode functionality
  • Maintenance and user support
  • Bugfixing:
    • Bugs reported by users
  • GUI testing improvements
  • Update of Maven-Jython plugin
  • More functionality (e.g. creating Oracc projects in the server)
  • Consider moving to a combination of Electron + Ace web editor

Web Services

  • Client-Server architecture:
    • ORACC Server (UPenn)
    • Tools for validating and lemmatising ATF files.
    • Listens for requests from clients such as Nammu or the Emacs plug-in for Oracc

Web Services

 

  • SOAP: Simple Object Access Protocol
    • Specifies how to exchange structured information via web services over HTTP
    • WSDL: Web Services Description Language
    • XML format for describing network services
      • SOAP Envelope
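
To make the SOAP Envelope concrete, here is a minimal sketch that builds one with Python's standard `xml.etree.ElementTree`. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the `ValidateRequest` operation and the `project` parameter are hypothetical and do not describe the actual Oracc server interface.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace
SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace, for illustration only
SERVICE_NS = "urn:example:atf-service"


def build_envelope(operation, params):
    """Wrap a request in a SOAP 1.1 Envelope/Body and return it as a string."""
    ET.register_namespace("soapenv", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    # The operation element and its children live in the service namespace
    request = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(request, f"{{{SERVICE_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameter names
xml_text = build_envelope("ValidateRequest", {"project": "cams/gkab"})
print(xml_text)
```

A client would send this text as the body of an HTTP POST; the element names a real service expects are defined by its WSDL file.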

Asynchronous communication

Website

  • Old website:
    • Broken and outdated
    • Not designed for smaller screens
    • Search is per Oracc project
    • Only for English speakers
  • New website:
    • Content updated and redesigned
    • Will take into account smaller screens
    • Search is global across the whole Oracc glossary
    • Translated to Arabic

Website

  • Tech stack:
    • Front-End - Angular2
    • Back-End - Flask app + ElasticSearch
  • Next steps:
    • UI design for small screens and Arabic translations with possible external help
    • Navigation through catalogue texts
    • More work on Arabic translations

Thank you!

Questions?

RSDG team meeting: Oracc

By Raquel Alegre


Informal presentation about work done on Oracc by UCL RSDG from 2014 to October 2018
