NoSQL approaches with common bioinformatic examples







Toni Hermoso Pulido
Bioinformatics Unit
Core Facilities, CRG

@toniher

NoSQL


NoSQL -> Not Only SQL


NoSQL


Alternative approach to RDBMS (relational model)


NoSQL DB types


Key-value


Document


Graph





Ref software: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis


Key-value


Collection of key-values known as:

dictionary, associative array, hashes, maps, etc.
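
As a minimal sketch, a key-value collection in Python is a plain dict. The taxonomy IDs and names below are only illustrative keys and values.

```python
# A key-value collection: NCBI taxonomy ID (key) -> scientific name (value).
taxid_to_name = {
    9606: "Homo sapiens",
    10090: "Mus musculus",
}

# Lookup by key; .get() returns a default when the key is absent.
human = taxid_to_name[9606]
missing = taxid_to_name.get(0, "unknown")

# Insert or overwrite by key.
taxid_to_name[7227] = "Drosophila melanogaster"
```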



Redis

Key-value storage.

Nowadays more than a simple key-value store: values can also be lists, sets, hashes or sorted sets.

Persistent (on disk) or in-memory only

Examples:

  • Queues
  • Caching
  • etc.
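
The two use cases above can be sketched without a server. This is not the Redis client API: `TinyCache` and its methods are hypothetical names, and the code only imitates, in process, the idea of keys that expire and of a list used as a queue.

```python
import time
from collections import deque


class TinyCache:
    """In-process stand-in for a key-value cache with expiring keys."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires = None if ttl is None else now + ttl
        self._data[key] = (value, expires)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        value, expires = self._data.get(key, (None, None))
        if expires is not None and now >= expires:
            del self._data[key]  # expired: drop it and report a miss
            return None
        return value


cache = TinyCache()
cache.set("blast:query1", "cached result", ttl=60, now=0)
hit = cache.get("blast:query1", now=30)    # still within the 60 s lifetime
miss = cache.get("blast:query1", now=61)   # past the lifetime

# A queue: producers append on one side, workers pop from the other (FIFO).
jobs = deque()
jobs.append("job-1")
jobs.append("job-2")
first = jobs.popleft()
```

The `now` argument exists only so the expiry can be shown without waiting; a real cache would rely on the clock.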


With MediaWiki: Wikipedia, AnnoWiki


http://ttltheory.wordpress.com/tag/redis-examples/

http://highscalability.com/blog/2011/7/6/11-common-web-use-cases-solved-in-redis.html


JSON

JavaScript Object Notation

Textual way to share objects

In JavaScript, associative arrays are objects.
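
A short sketch of the round trip between an object and its JSON text, here with Python's standard `json` module. The record fields are illustrative.

```python
import json

# A Python dict plays the role of the JavaScript object.
record = {"taxid": 9606, "scientific_name": "Homo sapiens", "rank": "species"}

text = json.dumps(record, sort_keys=True)   # object -> text
restored = json.loads(text)                 # text -> object
```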

JSON vs XML



JSON vs XML

Convert XML to JSON

xsltproc (XSLT)

XML DOM, XPath, etc. are not efficient for big files!

Reference: Pierre Lindenbaum

The new BLAST format will not need this conversion
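
For large files, a streaming parser is one alternative to loading a whole DOM. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library rather than xsltproc; the `<hit>` element and its children are hypothetical and do not reproduce the real BLAST XML schema.

```python
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical, tiny XML input (not the real BLAST schema).
xml_data = io.BytesIO(
    b"<hits>"
    b"<hit><id>seqA</id><score>52</score></hit>"
    b"<hit><id>seqB</id><score>40</score></hit>"
    b"</hits>"
)


def hits_as_json_lines(source):
    """Yield one JSON string per <hit>, processing one element at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "hit":
            record = {child.tag: child.text for child in elem}
            yield json.dumps(record, sort_keys=True)
            elem.clear()  # free the children of the element just processed


lines = list(hits_as_json_lines(xml_data))
```

Writing one JSON object per line keeps the output streamable as well, so neither side needs the full file in memory.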

Document stores


Semi-structured model

Schema-free

No separation between data and schema


Document formats:

XML, YAML, JSON, BSON


CouchDB

Popular document store (apart from MongoDB)

Can have different databases

Replication (master-master, master-slave, etc.)

Focus on consistency - ACID

(Atomicity, consistency, isolation, durability)




CouchDB - document



What is a Document?

JSON!

  • _id
  • _rev
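
A document is a JSON object carrying an identifier and a revision. CouchDB stores these as `_id` and `_rev` and rejects an update whose revision is stale. The sketch below imitates that rule in memory; `TinyStore` and `Conflict` are hypothetical names, and the revision here is a plain counter string, not CouchDB's real revision format.

```python
class Conflict(Exception):
    """Raised when an update carries a stale revision."""


class TinyStore:
    """In-memory imitation of revision-checked document updates."""

    def __init__(self):
        self._docs = {}

    def put(self, doc):
        current = self._docs.get(doc["_id"])
        if current is not None and doc.get("_rev") != current["_rev"]:
            raise Conflict(doc["_id"])
        number = 1 if current is None else int(current["_rev"]) + 1
        stored = dict(doc, _rev=str(number))
        self._docs[doc["_id"]] = stored
        return dict(stored)


store = TinyStore()
v1 = store.put({"_id": "seq-1", "name": "example"})
v2 = store.put(dict(v1, organism="Homo sapiens"))  # carries the current _rev
try:
    store.put(dict(v1, name="stale"))  # v1's _rev is now outdated
    conflict = False
except Conflict:
    conflict = True
```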


Using Futon as interface

CouchDB - REST API


Everything is WEB

EVERYTHING, for the good and for the bad…


CRUD

Operation          SQL      HTTP
Create             INSERT   PUT / POST
Read (Retrieve)    SELECT   GET
Update (Modify)    UPDATE   PUT / PATCH
Delete (Destroy)   DELETE   DELETE
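
The mapping between CRUD operations and HTTP verbs can be sketched with Python's standard `urllib`. The requests are only built, not sent; the address (CouchDB's default port 5984), the database name `proteins` and the helper `build` are assumptions for illustration.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:5984/proteins"   # assumed server and database name


def build(method, doc_id, body=None):
    """Build (but do not send) an HTTP request for one document."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        f"{BASE}/{doc_id}", data=data, headers=headers, method=method
    )


create = build("PUT", "seq-1", {"name": "example"})
read = build("GET", "seq-1")
delete = build("DELETE", "seq-1")
# urllib.request.urlopen(create) would send it to a running server.
```

In CouchDB, updating or deleting an existing document also requires its current revision, which this sketch leaves out.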

CouchDB - Views


Design document


JavaScript: Map/reduce



Temporary and Permanent views


Map/Reduce


Map


Procedure that performs filtering and sorting

Outcome:

key : value (which can be composite)

Map/Reduce


Reduce


Procedure that performs an aggregation operation

on the values emitted by the map step
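
CouchDB views are written in JavaScript; the following is only a Python simulation of the same two steps, under the assumption of three hypothetical documents. Map emits key/value pairs, the rows are sorted by key, and reduce aggregates the values of each key.

```python
from itertools import groupby

# Hypothetical documents.
docs = [
    {"_id": "a", "organism": "Homo sapiens", "length": 393},
    {"_id": "b", "organism": "Mus musculus", "length": 390},
    {"_id": "c", "organism": "Homo sapiens", "length": 120},
]


def map_doc(doc):
    """Map: filter documents and emit (key, value) pairs."""
    if "organism" in doc:
        yield doc["organism"], doc["length"]


def reduce_values(values):
    """Reduce: aggregate the values emitted for one key."""
    return sum(values)


# Emitted rows are sorted by key, then reduced per key.
rows = sorted(pair for doc in docs for pair in map_doc(doc))
totals = {
    key: reduce_values([value for _key, value in group])
    for key, group in groupby(rows, key=lambda row: row[0])
}
```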

Map/Reduce


Some interesting docs:


Map Reduce in CouchDB

http://www.slideshare.net/okurow/couchdb-mapreduce-13321353


View Cookbook for SQL Jockeys

http://guide.couchdb.org/draft/cookbook.html


Writing reduce functions

http://www.bitsbythepound.com/writing-a-reduce-function-in-couchdb-370.html

CouchDB - world friendly

Thanks to PouchDB

Sync DBs in:

  • terminal (e.g. levelDB)
  • browser (e.g. indexedDB)
  • server (couchDB)



with the same RESTful syntax.

CouchDB - other libraries


PHP - JS


Python

https://pythonhosted.org/CouchDB/

Example application


http://bypass.uab.cat


Blast-Bypass pipeline






Prediction of protein function improving sequence remote alignment search by a fuzzy logic algorithm. Antonio Gómez, Juan Cedano, Jordi Espadaler, Antonio Hermoso, Jaume Piñol, Enrique Querol (2008) The Protein Journal 27 (2) p. 130-139



GraphDB

Vertices (nodes) vs. edges (relationships)

Self-explanation:


Types of graphs

NCBI Taxonomy - Simple Hierarchy

Gene Ontology (molecular function, biological process, cellular component) - 3 DAGs

Related: NCBI Taxonomy in MySQL
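
The difference between a simple hierarchy and a DAG can be sketched with parent lists: in a hierarchy each node has at most one parent, in a DAG a node may have several. Node names and both helper functions are illustrative, not taken from the taxonomy or ontology files.

```python
# Child -> list of parents. Node names are illustrative only.
tree = {          # simple hierarchy: every node has at most one parent
    "species": ["genus"],
    "genus": ["family"],
    "family": [],
}
dag = {           # DAG: a term may have several parents
    "term_c": ["term_a", "term_b"],
    "term_a": ["root"],
    "term_b": ["root"],
    "root": [],
}


def is_simple_hierarchy(parents):
    """True when no node has more than one parent."""
    return all(len(p) <= 1 for p in parents.values())


def ancestors(parents, node):
    """All nodes reachable by following parent links from node."""
    seen = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen
```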


Neo4J


Most popular GraphDB nowadays. Java-based.

One database per instance (one port each; default 7474)

You can have different data, with different labels


Nodes and relations are imported as JSON documents

It's very important to properly define indexes (Lucene backend)



Cypher


SQL-like language

Querying


MATCH (s)-[*0..3]->(t:TAXID { rank:"family", scientific_name:"Hominidae" })
WHERE s.rank="genus"
RETURN s.scientific_name AS name, s.rank AS rank LIMIT 50;
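
What the query asks for can be imitated in plain Python: keep the nodes of rank genus that reach the family Hominidae in 0 to 3 steps along their outgoing links. This is a sketch on a toy subset, assuming links point from child to parent and leaving intermediate ranks out; `reaches` is a hypothetical helper, not part of any Neo4J library.

```python
# Toy subset: node -> (rank, parent). Intermediate ranks are left out.
nodes = {
    "Hominidae": ("family", None),
    "Homo": ("genus", "Hominidae"),
    "Pan": ("genus", "Hominidae"),
    "Homo sapiens": ("species", "Homo"),
    "Pan troglodytes": ("species", "Pan"),
}


def reaches(nodes, start, target, max_hops):
    """True if target is at most max_hops parent links away from start."""
    current = start
    for _hop in range(max_hops + 1):
        if current == target:
            return True
        if current is None:
            return False
        current = nodes[current][1]
    return False


genera = sorted(
    name
    for name, (rank, _parent) in nodes.items()
    if rank == "genus" and reaches(nodes, name, "Hominidae", 3)
)
```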

REST API


Query

http://127.0.0.1:7474/db/data/index/node/TAXID/id/9606



Upload (in batches)

In Python: py2neo


Other libraries

Lowest Common Ancestor



http://bio4j.com/blog/2012/02/finding-the-lowest-common-ancestor-of-a-set-of-ncbi-taxonomy-nodes-with-bio4j/
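
The idea behind a lowest common ancestor search can be sketched on a parent-pointer tree. This is not the Bio4j implementation described in the post above, only a minimal version on a toy subset with intermediate ranks left out; both function names are hypothetical.

```python
def path_to_root(parent, node):
    """List of nodes from node up to the root, following parent links."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(parent, nodes):
    """Deepest node shared by the root paths of all given nodes."""
    paths = [path_to_root(parent, n) for n in nodes]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first node, the first shared node is the deepest.
    for node in paths[0]:
        if node in common:
            return node
    return None


# Toy subset: child -> parent.
parent = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Pan troglodytes": "Pan",
    "Pan paniscus": "Pan",
    "Pan": "Hominidae",
    "Hominidae": None,
}

lca_pan = lowest_common_ancestor(parent, ["Pan troglodytes", "Pan paniscus"])
lca_all = lowest_common_ancestor(parent, ["Homo sapiens", "Pan paniscus"])
```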


Java extensions


Jersey - REST-API for Java

Maven (project management)


Documentation


Nowadays much faster than using Cypher :(



Example of API implementation


NodeJS Express interface accessing Neo4J and MySQL


http://prgdb.crg.eu/api/






PRGdb 2.0: towards a community-based database model for the analysis of R-genes in plants.
Walter Sanseverino, Antonio Hermoso, Raffaella D'Alessandro, Anna Vlasova, Giuseppe Andolfo, Luigi Frusciante, Ernesto Lowy, Guglielmo Roma, Maria Raffaella Ercolano (2013)
Nucleic acids research 41 (Database issue) p. D1167-71

New challengers and curiosities

or rather, things I'd like to try...



ArangoDB (key-value, document and graph, 3-in-1)


MariaDB (MySQL fork) with JSON support and dynamic columns http://www.slideshare.net/blueskarlsson/using-json-with-mariadb-and-mysql


Other approaches? Distributed-oriented: Riak, RethinkDB.





There can be only one?






toni.hermoso@crg.eu


@toniher