Senior Research Software Developer
UCL Research IT Services
slides.com/raquelalegre/libraries
Training:
Project work:
Outreach and collaboration
Web-based interfaces to data are a common response to these, but rarely fulfil the complex needs of humanities research, and limit further research work.
Current barriers include:
Adapt methods usually applied to big data processing in scientific research to the humanities.
Data Spring
London Times Digital Archive
ORACC and Nahrein
Pilot project at UCL in collaboration with the British Library, funded by Jisc Research Data Spring in 2015.
How does the occurrence of diseases in published literature compare to known epidemics in the 19th century?
How did changes in image techniques and the size of images map onto the different genres over time? How do the findings made possible using digital humanities techniques and digital sources compare to those using traditional methods and small, hand-crafted collections?
Research Data Storage Team:
Storage and upkeep of the digitised text on iRODS.
Research Computing:
Configuration of Legion and Grace for high performance computing using MPI and Spark to run parallel queries on the texts, making it possible to run a query through the whole archive in under 20 minutes.
Research Software Development Group:
Development of software modules that understand the data model of the digitised texts, and of the glueware that runs users' queries against the data stored by RDS on the HPC systems managed by RC.
Facilitation Team:
Enabling this work and constantly seeking ways of making our work more accessible to UCL researchers.
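The parallel query pattern mentioned under Research Computing can be sketched as a map step over individual articles followed by a reduce step that merges the partial results. This is only an illustration in plain Python, not the project's code: the sample articles, `map_article`, `merge_counts` and `query` are hypothetical names, and on a cluster the map and reduce steps would be distributed by a framework such as Spark or MPI rather than run in one process.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample: (year, text) pairs standing in for digitised articles.
ARTICLES = [
    (1850, "The cholera outbreak spread through London"),
    (1850, "Parliament debated the cholera response"),
    (1851, "The exhibition opened in Hyde Park"),
]


def map_article(article, term):
    """Map step: count occurrences of `term` in one article, keyed by year."""
    year, text = article
    words = text.lower().split()
    return Counter({year: words.count(term.lower())})


def merge_counts(left, right):
    """Reduce step: combine two partial per-year counts.

    Counter addition drops years whose total is zero.
    """
    return left + right


def query(articles, term):
    """Run the map step on every article, then reduce to one result."""
    partials = (map_article(article, term) for article in articles)
    return dict(reduce(merge_counts, partials, Counter()))


result = query(ARTICLES, "cholera")
print(result)
```

Because each article is mapped independently and the merge is associative, the same two functions could in principle be handed to a parallel framework without changing the query logic.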
London Times - # words per year
London Times - % mentions of `Professor`
London Times - `he said` vs `she said`
London Times - More on Gender bias through time
London Times - Sentiment Analysis on gender bias
London Times - `men are` vs `women are`
Other queries run on the TDA
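A query such as `he said` vs `she said` could be sketched as a phrase count per year. The sample data and the function name `phrase_counts_by_year` below are assumptions made for illustration; the real queries ran over the full archive on the HPC systems.

```python
import re
from collections import defaultdict

# Hypothetical sample: (year, text) pairs standing in for newspaper articles.
SAMPLE = [
    (1900, "He said the bill would pass. She said it would not."),
    (1900, "He said nothing more."),
    (1950, "She said the result was clear, and she said so twice."),
]


def phrase_counts_by_year(articles, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase per year."""
    # The \b word boundaries stop "he said" from matching inside "she said".
    patterns = {
        phrase: re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for phrase in phrases
    }
    counts = defaultdict(lambda: {phrase: 0 for phrase in phrases})
    for year, text in articles:
        for phrase, pattern in patterns.items():
            counts[year][phrase] += len(pattern.findall(text))
    return dict(counts)


counts = phrase_counts_by_year(SAMPLE, ["he said", "she said"])
for year in sorted(counts):
    print(year, counts[year])
```

Raw counts like these would normally be divided by the number of words published that year before comparing decades, since the size of the paper changes over time.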
Metadata: project info, lang, protocols...
Sections: object, parts. ...
Comments
Transliteration and lemmatization
Descriptions: rulings, blank, ...
Translation
ASCII Transliteration Format (ATF)
The lexer breaks the input text into a stream of tokens, matching each one with a regular expression:
Input: #project: cams/gkab [new line]
Text | Token | Regular expression |
# | t_HASH | r'\#' |
project | PROJECT | |
: | t_COLON | r'\:' |
cams/gkab | t_ID | r'[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+' |
[new line] | t_NEWLINE | r'\n' |
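The tokenising step above can be sketched with Python's standard `re` module. This is a minimal illustration, not the project's actual lexer: the token names and regular expressions come from the slide (with the newline pattern written as `\n`), while the `SKIP` rule for whitespace, the `KEYWORDS` table and the `tokenize` function are assumptions added so that the example runs.

```python
import re

# Token rules taken from the slide. SKIP is an assumed extra rule that
# discards spaces and tabs between tokens.
TOKEN_SPEC = [
    ("t_HASH", r"\#"),
    ("t_COLON", r"\:"),
    ("t_NEWLINE", r"\n"),
    ("t_ID", r"[a-zA-Z0-9]+[/]?[a-zA-Z0-9]+"),
    ("SKIP", r"[ \t]+"),
]

# Assumed keyword table: identifiers listed here get their own token type.
KEYWORDS = {"project": "PROJECT"}

# One combined pattern with a named group per token rule.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_type, value) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"Unexpected character {text[pos]!r} at position {pos}")
        kind = match.lastgroup
        value = match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        if kind == "t_ID":
            # Promote reserved words such as "project" to keyword tokens.
            kind = KEYWORDS.get(value, kind)
        tokens.append((kind, value))
    return tokens


tokens = tokenize("#project: cams/gkab\n")
for kind, value in tokens:
    print(kind, repr(value))
```

Note that the `t_ID` pattern as written on the slide needs at least two characters, so a one-character identifier would raise an error in this sketch.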
Text mining can be overwhelming:
Content structure and organisation | Text processing | Statistical Analysis |
Machine Learning | Classification methods | Model Evaluation |
So many text mining tools have appeared in recent years that it is difficult to choose!
Get coding!
... Any questions?