Open Richly Annotated Cuneiform Corpus
RSDG Team Meeting - 9th October 2018
Raquel Alegre
ASCII Transliteration Format (ATF)
Metadata: project info, lang, protocols...
Transliteration and lemmatization
Translation
Comments
Descriptions: rulings, blank, ...
Sections: object, parts, ...
Editing ATF files before the project:
- Emacs plug-in: good functionality, but
  - Difficult to install
  - Difficult to use
  - Not very user friendly
  - Steep learning curve for users
  - Needs an internet connection to validate files
RSDG project objectives:
- Appropriate ATF grammar description
- Create a user-friendly, intuitive interface
- Allow for offline validation of texts
- Take into account Arabic-speaking users
RSDG work for Oracc
- PyOracc:
  - Python tool for validating ATF files
- Nammu:
  - GUI for editing ATF files that incorporates PyOracc
- SOAP Web Services
  - Communication with the Oracc Server
- Cross-platform and easy deployment
  - JAR
- Website update
  - Search and browse catalogue
  - Angular, Flask+ElasticSearch
PyOracc
- Developed in Python using the PLY module
  - Implementation of the Lex and Yacc parsing tools
  - Lexical analysis: splits the input text file into tokens
  - Semantic analysis: finds the hierarchical structure of the text
- Allows validation of ATF files offline
Lexical Analysis
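Breaks the input text into a stream of tokens, matching each one against a regular expression (RE). For the input "#project: cams/gkab" followed by a new line:

Input text    Token        Regular expression
#             t_HASH       r'\#'
project       PROJECT
:             t_COLON      r'\:'
cams/gkab     t_ID         r'[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+'
[new line]    t_NEWLINE    r'\n'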
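The tokenizing step can be sketched with the standard library alone. This is only an illustration of the idea, not PyOracc's actual lexer: the patterns for HASH, COLON and ID are the ones shown on the slide, while the PROJECT pattern, the whitespace rule and the function name `tokenize` are assumptions made for the example.

```python
import re

# Token rules in priority order. PROJECT comes before ID so that the keyword
# wins over the generic identifier rule.
TOKEN_RULES = [
    ("HASH", r"\#"),
    ("PROJECT", r"project\b"),  # assumed pattern; the slide shows no regex for it
    ("COLON", r"\:"),
    ("ID", r"[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),  # assumed rule: spaces and tabs are discarded
]

# One master pattern with a named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))


def tokenize(text):
    """Return a list of (token_type, matched_text) pairs for `text`."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"Illegal character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize("#project: cams/gkab\n"):
        print(token)
```

Running it on the example line yields the five tokens HASH, PROJECT, COLON, ID and NEWLINE, in that order.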
Semantic Analysis
Yacc parses and does semantic processing on the stream of tokens produced by Lex, following a grammar description:
expression : expression + term
| expression - term
| expression * term
| expression / term
| term
3 * 5 + 1
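To make the grammar concrete, here is a small hand-written parser for it. It is a sketch under stated assumptions, not what Yacc or PLY generate: a `term` is assumed to be an integer literal, the left recursion is replaced by a loop, and the names `parse_expression` and `evaluate` are invented for the example. The parse tree is represented as nested tuples.

```python
import operator
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/])")
OPERATORS = {"+": operator.add, "-": operator.sub,
             "*": operator.mul, "/": operator.truediv}


def tokenize(text):
    """Split an arithmetic expression into number and operator strings."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"Unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_term(tokens, pos):
    """term : NUMBER (assumption: terms are integer literals)."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise ValueError("Expected a number")
    return ("term", int(tokens[pos])), pos + 1


def parse_expression(text):
    """expression : expression OP term | term, built left to right."""
    tokens = tokenize(text)
    term, pos = parse_term(tokens, 0)
    tree = ("expression", term)
    while pos < len(tokens):
        op = tokens[pos]
        if op not in OPERATORS:
            raise ValueError(f"Expected an operator, got {op!r}")
        term, pos = parse_term(tokens, pos + 1)
        tree = ("expression", tree, op, term)
    return tree


def evaluate(tree):
    """Walk the parse tree and compute its value."""
    if tree[0] == "term":
        return tree[1]
    if len(tree) == 2:  # expression : term
        return evaluate(tree[1])
    return OPERATORS[tree[2]](evaluate(tree[1]), evaluate(tree[3]))


if __name__ == "__main__":
    tree = parse_expression("3 * 5 + 1")
    print(tree)
    print(evaluate(tree))
```

Because the grammar as written places all four operators on the same level, expressions group strictly left to right: "3 * 5 + 1" evaluates to 16, but "1 + 3 * 5" would evaluate to 20. A real grammar would normally add precedence rules or split the rule into separate levels.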
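# PLY reads each grammar rule from the docstring of a p_* method.
# p[0] is the value of the rule's left-hand side; p[1], p[2], ... are
# the values of the symbols on the right-hand side.
def p_document(self, p):
    """document : text
                | object
                | composite"""

def p_text_language(self, p):
    "text : text language_protocol"
    p[0] = Text()
    p[0].language = p[2]

def p_language_protocol(self, p):
    "language_protocol : ATF LANG ID newline"
    p[0] = p[3]  # the ID token carries the language code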
Parse tree
PyOracc summary
- Developed 2014 - 2016 by James Hetherington, Jens Nielsen and Raquel Alegre.
- Extension of PyOracc by an external collaborator in 2018.
- Released on PyPI https://pypi.org/project/pyoracc
> pip install pyoracc
- Can be used as a CLI or as a Python module.
- Integrated into Nammu for offline text validation and syntax highlighting.
Nammu
- Graphical user interface for editing ATF texts
- Uses PyOracc:
  - ATF file validation
  - Syntax highlighting
- Can talk to the Oracc server to validate and lemmatise ATF files.
- Developed from 2015 by Raquel and Stuart with the help of Jens, Anastasis and Roma.
- Developed using Jython:
  - Python for the Java Virtual Machine
  - Platform independent - runs on any computer that has a JVM installed
  - Access to both Python and Java libraries
  - Swing - powerful and widely used GUI widget toolkit for Java
- Maven
  - Compiles, tests and builds the JAR file
  - Maven-Jython plugin development
Functionality
- Text editing (find/replace, undo/redo, split view, ...)
- Offline (partial) validation using PyOracc
- Communication with Oracc Server
- Lemmatisation and validation
- Console log
- Syntax highlighting based on PyOracc
- Error highlighting
- Cross-platform and easy to install - JAR file
- Configurable
- Nahrein - Arabic translation mode
Next steps
- Consolidation of Arabic translation mode functionality
- Maintenance and user support
- Bug fixing:
  - Bugs reported by users
- GUI testing improvements
- Update of Maven-Jython plugin
- More functionality (e.g. creating Oracc projects in the server)
- Consider moving to a combination of Electron + Ace web editor
Web Services
- Client-server architecture:
  - Oracc Server (UPenn)
    - Tools for validating and lemmatising ATF files.
    - Listens for requests from clients such as Nammu or the Emacs plug-in for Oracc
- SOAP: Simple Object Access Protocol
  - Specifies how to exchange structured information via web services over HTTP
- WSDL: Web Services Description Language
  - XML format for describing network services
- SOAP Envelope
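As a rough illustration of what a SOAP envelope looks like, the sketch below builds one with Python's standard library. Only the envelope namespace is real (the SOAP 1.2 one, assumed here rather than confirmed for the Oracc server); the service namespace, the element names `Request`, `Command` and `Project`, and the function `build_envelope` are all hypothetical and do not describe the actual Oracc web service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
SERVICE_NS = "http://example.org/atf-service"        # hypothetical service namespace


def build_envelope(command, project):
    """Return a SOAP envelope (as a string) wrapping a hypothetical request."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Everything inside Body is application-defined; these names are made up.
    request = ET.SubElement(body, f"{{{SERVICE_NS}}}Request")
    ET.SubElement(request, f"{{{SERVICE_NS}}}Command").text = command
    ET.SubElement(request, f"{{{SERVICE_NS}}}Project").text = project
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_envelope("validate", "cams/gkab"))
```

The envelope is then sent as the body of an HTTP POST request; the WSDL document tells a client which operations and message shapes the server accepts.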
Asynchronous communication
Website
- Old website: broken and outdated
  - Not designed for smaller screens
  - Search is per Oracc project
  - Only for English speakers
- New website: content updated and redesigned
  - Will take smaller screens into account
  - Search is global across the whole Oracc glossary
  - Translated into Arabic
- Tech stack:
  - Front-End - Angular2
  - Back-End - Flask app + ElasticSearch
- Next steps:
  - UI design for small screens and Arabic translations, possibly with external help
  - Navigation through catalogue texts
  - More work on Arabic translations
Thank you!
Questions?
Informal presentation about the work done on Oracc by UCL RSDG from 2014 to October 2018