Oracc and RITS
New tools for editing the world's oldest texts
Prof. Eleanor Robson
Raquel Alegre
Dr. James Hetherington
Dr. Jens Nielsen
UCL History Department
UCL Research Software Development Group
RSDG work for ORACC
ASCII Transliteration Format (ATF)
Editing ATF now:
- Very good functionality, but:
- Emacs plugin
- Difficult to install
- Difficult to use
- Not very user friendly
- Steep learning curve for users
- Needs internet connection to validate files
UCL RSDG objectives:
- Appropriate ATF grammar description
- Create a user-friendly, intuitive interface
- Reduce PIs' time spent training and retraining
RSDG work for ORACC
- Custom software development of two tools:
- PyORACC: Python tools for validating ATF files
- Nammu: GUI for editing ATF files
- SOAP Web Services
- Communication with the ORACC Server
- Easy deployment
- JAR files
PyORACC
- Developed in Python using the PLY (Python Lex-Yacc) module
- Implementation of Lex and Yacc parsing tools
- Lexical analysis: Splits the input text file into tokens
- Semantic analysis: Finds hierarchical structure of the text
- Allows validation of ATF files offline
ASCII Transliteration Format
Metadata: project info, lang, protocols...
Transliteration and lemmatization
Translation
Comments
Descriptions: rulings, blank, ...
Sections: object, parts, ...
Lexical Analysis
Breaks the input text into a stream of tokens, matching each one against a regular expression (RE):
# + project + : + cams/gkab + [new line]
- #: t_HASH, r'\#'
- project: PROJECT
- : (colon): t_COLON, r'\:'
- cams/gkab: t_ID, r'[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+'
- [new line]: t_NEWLINE, r'\n'
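The tokenisation step can be tried out with a minimal sketch. It uses only Python's standard `re` module, not PLY, so it is an illustration of the idea and not PyORACC's actual lexer. The patterns for `HASH`, `COLON` and `ID` are taken from the slide; the `TOKEN_SPEC` table, the `tokenize` function, the `project\b` keyword pattern and the whitespace-skipping rule are assumptions made for this example.

```python
import re

# Token names and patterns. HASH, COLON and ID follow the slide;
# PROJECT, NEWLINE and SKIP are assumptions made for this sketch.
TOKEN_SPEC = [
    ("HASH", r"\#"),
    ("PROJECT", r"project\b"),  # keyword, tried before the generic ID
    ("COLON", r":"),
    ("ID", r"[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),  # spaces and tabs are dropped
]
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return a list of (token_type, value) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER_RE.match(text, pos)
        if match is None:
            raise ValueError(f"Illegal character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("#project: cams/gkab\n")
```

Here `tokens` holds one pair per token, in the same order as the boxes on the slide: hash, project keyword, colon, ID, new line.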
Semantic Analysis
Yacc parses and does semantic processing on the stream of tokens produced by Lex, following a grammar description:
expression : expression + term
| expression - term
| expression * term
| expression / term
| term
3 * 5 + 1
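To make the grammar concrete, here is a small hand-written parser that builds a parse tree for `3 * 5 + 1`. It is a sketch only: Yacc and PLY generate a table-driven parser from the grammar, whereas this version handles the left-recursive rule with a plain loop. The function name `parse_expression` and the assumption that a `term` is a whole number are choices made for this example.

```python
import re


def parse_expression(text):
    """Parse 'term (op term)*' into nested (op, left, right) tuples.

    Assumption for this sketch: a term is a non-negative integer.
    Taken literally, the grammar on the slide gives all four operators
    the same precedence, so the tree simply grows to the left.
    """
    tokens = re.findall(r"\d+|[-+*/]", text)
    if not tokens or len(tokens) % 2 == 0:
        raise ValueError("expected: term (operator term)*")
    tree = int(tokens[0])  # expression : term
    for i in range(1, len(tokens), 2):
        operator, term = tokens[i], int(tokens[i + 1])
        tree = (operator, tree, term)  # expression : expression op term
    return tree


tree = parse_expression("3 * 5 + 1")
# tree == ("+", ("*", 3, 5), 1)
```

The inner tuple is the first reduction (`3 * 5`) and the outer tuple is the second (`... + 1`), which is the hierarchical structure the parser recovers from the flat token stream.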
def p_document(self, p):
    """document : text
                | object
                | composite"""
ATF document grammar example
def p_text_language(self, p):
    "text : text language_protocol"
    p[0] = Text()
    p[0].language = p[2]

def p_language_protocol(self, p):
    "language_protocol : ATF LANG ID newline"
    p[0] = p[3]
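In these rules, `p[1]`, `p[2]`, ... are the values of the symbols on the right-hand side of the grammar rule, and `p[0]` is the value the rule produces. The sketch below imitates that convention with a plain list so the two rules can be run without PLY. The `Text` class, the `apply_rule` helper and the token values `"#atf"`, `"lang"` and `"akk"` are hypothetical stand-ins for this example, not PyORACC's own code.

```python
class Text:
    """Hypothetical stand-in for the document model class on the slide."""

    def __init__(self):
        self.language = None


def p_language_protocol(p):
    "language_protocol : ATF LANG ID newline"
    p[0] = p[3]  # keep only the language code (the ID token)


def p_text_language(p):
    "text : text language_protocol"
    p[0] = Text()
    p[0].language = p[2]  # value produced by p_language_protocol


def apply_rule(rule, symbols):
    """Imitate a parser reduction: slot 0 is the result, 1..n the symbols."""
    p = [None] + list(symbols)
    rule(p)
    return p[0]


# Assumed token values for a language protocol line.
language = apply_rule(p_language_protocol, ["#atf", "lang", "akk", "\n"])
text = apply_rule(p_text_language, [Text(), language])
```

After the two reductions, `text.language` carries the language code that started out as a single token in the input.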
Parse tree
Nammu
- Graphical user interface for editing ATF texts
- Uses PyORACC:
- ATF file validation
- Syntax highlighting
- Can talk to the ORACC server to validate and lemmatise ATF files.
Nammu
- Developed using Jython:
- Python for the Java Virtual Machine
- Platform independent: runs on any computer that has a JVM installed
- Access to both Python and Java libraries
- Swing - powerful and widely used GUI widget toolkit for Java
Functionality
- Text editing
- Offline (partial) validation using PyORACC
- ATF validation using ORACC Server
- Lemmatisation
- Interactive Model View
- Console log
- Syntax Highlighting based on PyORACC
- Error highlighting
- Easy to install - JAR file
Web Services
- Client-Server architecture:
- ORACC Server (UPenn)
- Tools for validating and lemmatising ATF files.
- Listens for requests from clients such as Nammu or the Emacs plugin for ORACC
Web Services
- SOAP: Simple Object Access Protocol
- Specifies how to exchange structured information via web services over HTTP
- WSDL: Web Services Description Language
- XML format for describing network services
- SOAP Envelope
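As an illustration of what a SOAP envelope looks like, the sketch below builds one with Python's standard `xml.etree.ElementTree`. The namespace is the standard SOAP 1.1 envelope namespace. The operation name `ValidateRequest` and the `project` parameter are hypothetical and are not taken from the ORACC server's WSDL.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def build_envelope(operation, params):
    """Wrap an operation element and its parameters in Envelope/Body."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, operation)
    for name, value in params.items():
        ET.SubElement(request, name).text = value
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameter names, for illustration only.
message = build_envelope("ValidateRequest", {"project": "cams/gkab"})
```

A client would send a message like this as the body of an HTTP POST; the real element names and types are whatever the service's WSDL describes.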
Asynchronous communication
Software Development
- Open Source
- Continuous Integration
- Deployment and changes tested daily on Jenkins
- ~90% of the whole ORACC corpus covered by PyORACC tests
- Version Control: GitHub
- Nammu: https://github.com/oracc/nammu
- PyORACC: https://github.com/oracc/pyoracc
- Maven
- Compiles, tests and prepares JAR file
- Maven-Jython plugin update
Future steps on development
Feedback and suggestions are welcome!
- 1st functional version of Nammu ready in July
- Text editing, validation and lemmatisation are ready, but other functionality needs to be added
- Interactive Model View for new users needs more work
- ORACC Unicode keyboard
- Remote editing of files
- Allow creation of new projects
- Other server functionality using glossaries
If we were inventing ORACC now, what would we do differently?
Principles: we got this right, I think
- Open access content at all levels
- Free to create projects, however large or small
- Balance between flexibility and coherence
- Commitment to documentation (could do more)
- Open linked data (could do more)
Staff and infrastructure: opportunities much better now!
- Involve research software engineers from (before) day 1
- Acknowledge and plan for the ‘bus problem’
- Plan for scalability
- Make home institutions aware, proud, supportive
- Fund ongoing programming and operational support
Code: planned design, not ad hoc evolution!
- Treat input, processing, output as separate issues (as we did)
- Weigh cost-benefits of domain-specific language design (e.g. ATF vs EpiDoc)
- Accessible interface for content creators: low entry barriers vs speed & control
- Practical tools, not just principles, to make content searchable and mineable
- Document and get user feedback on ALL of it!
UCL Research Software Development Group
Research Software Developers
[Diagram labels: Researcher, Software Developer, Research Software Developer]
Research Software Developers
- Not independent researchers
- No personal research agenda
- Facilitative, Supportive and Collaborative
- Deep engagement with research groups
- Understand, study and be part of group research activities
- Sustainable outputs
- Institutional memory
- Continuity, stability and maintenance
UCL Research Software Development Group
- Started in 2012, grown from 1 to 8 RSDs!
- Helped UCL win £1.5M in research income
- Part of UCL Research IT Services
- Custom software development for research projects
- From simple scripts to HPC
- From theoretical physics to humanities
- Infrastructure
- Version control, testing, DevOps, HPC, ...
UCL Research Software Development Group
- Training
- Software Carpentry, C++/Python for research, ...
- Networking
- UCL Programming Hub, SSI, UK RSE, ...
- Tech support for research groups
- Mentoring, assessments, recruitment panels, ...
UCL Programming Hub
- Research groups or individuals across campus involved in software development for research.
- Monthly tech socials on a range of RSD topics and tools
- Weekly coffee mornings on Wednesday at SCR
UCL RSE community
UCL Research Software Dashboard
- Central repositories for UCL Research Software
Work with us!
- Avoid well-known problems in research software:
- Low levels of reuse
- Poor standards of verification
- Why have an RSD?
- More and more research uses software
- General programmers don't understand research
- Post-docs and PhD students don't always write reliable, reusable, maintainable code
Contact us
@uclrcsoftdev
rc-softdev@ucl.ac.uk
r.alegre@ucl.ac.uk
Thank you!
Questions?
Oracc and RSDG
By Raquel Alegre
Presentation for UCL Digital Humanities on work carried out for UCL History as part of Oracc.