Terrell Russell, Ph.D.
Executive Director
iRODS Consortium
DAViDD: Initial data management
solution for UNC's
READDI AViDD Center
May 28-31, 2024
iRODS User Group Meeting 2024
Amsterdam, Netherlands
The Rapidly Emerging Antiviral Drug Development Initiative AViDD Center (READDI-AC) is an NIH-funded public-private partnership focused on developing effective antiviral drugs to combat emerging viruses.
The READDI-AC at UNC-Chapel Hill is one of nine Antiviral Drug Discovery (AViDD) Centers funded by the US National Institute of Allergy and Infectious Diseases (NIAID) at the National Institutes of Health.
$65M in 2022 - 40 Investigators, 23 Research Sites, 5 Countries
NIH Award 1U19AI171292-01
READDI-AC
The response to viral outbreaks has historically been reactive – vaccines and medications are developed only after a new virus emerges. Our mission is to proactively prepare for emerging viruses by developing antiviral drugs that are active against more than one virus in a family. These broad-spectrum antivirals will help safeguard the well-being of communities worldwide against existing viruses and will be more likely to be effective against future novel viruses in the same family.
Four families:
READDI-AC Mission
RENCI, as a subawardee, was tasked to assess, design, and develop the data management solution for the READDI-AC project.
Timeline
If you use instruments in your work, what is the format of files they produce?
Where do you currently record and store chemical or biological data? In what data format?
What keywords or other terms do you use to search for previously recorded data?
Do you use existing or public vocabularies, ontologies or other references?
Are you familiar with FAIR data sharing principles?
What are typical dataset sizes for each unit of work?
What is a typical data generation rate for your lab / unit?
What is a typical number of data files you generate per week?
Do you protect the stored data (do you require secure login/authentication to access your data)?
How do you currently share your data with (i) your labmates, (ii) within UNC, (iii) with the rest of the world?
What software do you use to process or visualize your data?
Do you need to format your data for publication or presentation?
What steps are manually processed today? What steps are automated today?
Where is manual processing required? Where can data processing be automated?
What limitations are your lab / group / team running into?
What is your highest priority or need for data capture or storage?
Discovery Questions
Not Big Data (yet)
1000s of files over the course of a year
Maybe 10s of GBs, but most much smaller
Human scale rate of ingest
A few formats, mostly open / convertible
doc, pdf, xls, ppt, csv, txt, prism, jpg
Some electronic lab notebooks
Mostly xls
Manually calculated / generated
Very little currently automated
Manual transcription from paper notebooks
Graphing done in Excel or (rarely) Jupyter notebooks
Discovery Findings
No shared naming conventions
For either data filenames or metadata
Sometimes consistent within a lab
Due to necessity
But not over time
Highest priority is access / sharing
This project's data needs a 'home'
No centralized data repository
No standardized data processing (raw to publication quality) and data upload protocols
No versioning protocols
Metadata currently non-existent
Negative results / Failed attempts not recorded
HEIGHTENED NEED for research data management (RDM) due to new NIH requirements
There was very little process to automate - we were starting from scratch and these labs did not have much in common. Different instruments, different chemistry, different software, different processes, different formats.
Not their fault - they'd never been required to coordinate and collaborate in the past except via publications. This was a new mandate.
There would be two projects:
Design and Preparation
Software Engineering Requirements
January 2023 - Initial Deployment
Angular Application
iRODS Policy - Four recurring rules
DAViDD - Application and Policy
irule davidd_add_sweeper_to_queue
February 2023 - Upload
Associated metadata from upload form available to search and browse
March 2023 - File Metadata
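One way to picture how upload-form fields could become searchable metadata: iRODS stores metadata as attribute-value-unit (AVU) triples, so each form field can map to one or more AVUs. The sketch below is an illustration only, not the DAViDD implementation; the `davidd::` attribute prefix, the field names, and the `form_to_avus` helper are all assumptions.

```python
from typing import NamedTuple


class AVU(NamedTuple):
    """An iRODS-style attribute-value-unit triple."""
    attribute: str
    value: str
    unit: str = ""


def form_to_avus(form: dict, prefix: str = "davidd::") -> list:
    """Flatten upload-form fields into AVU triples.

    The prefix and the field names are hypothetical; multi-valued
    fields (lists or tuples) produce one AVU per value, and empty
    or missing values are skipped.
    """
    avus = []
    for field, raw in sorted(form.items()):
        values = raw if isinstance(raw, (list, tuple)) else [raw]
        for item in values:
            if item is None:
                continue
            text = str(item).strip()
            if text:
                avus.append(AVU(prefix + field, text))
    return avus


if __name__ == "__main__":
    # Hypothetical form submission
    example = {"assay": "plaque", "tags": ["screen", "replicate"], "notes": ""}
    for avu in form_to_avus(example):
        print(avu)
```

Prefixing attribute names keeps application metadata distinguishable from any other AVUs attached to the same data object.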
GenQuery
March 2023 - Search
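A search over file metadata can be expressed as a GenQuery string using the standard `META_DATA_ATTR_NAME` and `META_DATA_ATTR_VALUE` columns. The sketch below only builds such a string; it is not the query DAViDD actually issues, and the `build_metadata_query` helper, the attribute name, and the collection path are assumptions made for illustration.

```python
from typing import Optional


def quote(value: str) -> str:
    """Wrap a value in single quotes for GenQuery.

    This sketch rejects embedded single quotes instead of guessing
    at an escaping rule.
    """
    if "'" in value:
        raise ValueError("single quotes are not supported in this sketch")
    return "'" + value + "'"


def build_metadata_query(attribute: str,
                         value_substring: str,
                         collection_root: Optional[str] = None) -> str:
    """Build a GenQuery string that finds data objects by metadata.

    Matches objects whose AVU attribute equals `attribute` and whose
    value contains `value_substring`. If `collection_root` is given,
    results are limited to collections below that path (the root
    collection itself is not matched by the pattern).
    """
    conditions = [
        "META_DATA_ATTR_NAME = " + quote(attribute),
        "META_DATA_ATTR_VALUE like " + quote("%" + value_substring + "%"),
    ]
    if collection_root:
        pattern = collection_root.rstrip("/") + "/%"
        conditions.append("COLL_NAME like " + quote(pattern))
    return "select COLL_NAME, DATA_NAME where " + " and ".join(conditions)


if __name__ == "__main__":
    # Hypothetical attribute name and zone path
    print(build_metadata_query("davidd::assay", "plaque", "/exampleZone/home/davidd"))
```

The resulting string could be passed to `iquest` on the command line or to any client that accepts GenQuery text.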
irule davidd_add_compound_profile_sweeper_to_queue
irule davidd_add_compound_profile_remover_to_queue
September 2023 - Compound Profile
irule davidd_add_assays_sweeper_to_queue
January 2024 - Assays
Discovery and prototyping were a success
353 datafiles uploaded in the first year
Summary
Having identified the main requirements and bench-to-data process, the project selected an existing commercial vendor for its extensive GUI and compound-specific analysis tooling.
RENCI continues to develop database-level tooling focused on chemical compound information and linkages with other tools in the ecosystem.
Acknowledgements
The Future
Thank you!
Questions?