A whirlwind tour of NoSQL databases

Piotr Grzesik

Agenda

What is NoSQL?
Why consider NoSQL databases?
CAP theorem, ACID and BASE
Key-Value Pair Databases
Graph Databases
Document-oriented Databases
Wide-column Databases
Time-series Databases

What is NoSQL?

There isn't a strict definition of NoSQL - originally it meant non-SQL, but nowadays it is often presented as Not Only SQL. In general NoSQL databases model data in a different way than relational databases, but that's not always the case.

Why consider NoSQL databases?

Scalability
Flexibility
Availability
Cost-effectiveness

Scalability

As applications grow, experience spikes in traffic, and serve millions of users, there is a pressing need for the underlying databases to be able to scale.

With relational databases it is usually easy to scale up (upgrading hardware on an existing server) but much harder to scale out (adding more servers). Many NoSQL databases are designed with scaling in mind, which makes them a good fit for the "scale out" approach.
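
One common way to scale out is to split keys across servers by hash. The sketch below is a minimal illustration of that idea, not the scheme of any particular database; the function name shard_for and the sample keys are assumptions made for this example.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number in [0, shard_count) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then take the modulo.
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical keys spread over 4 shards (servers).
placement = {}
for user_id in range(10):
    key = f"user:{user_id}"
    placement.setdefault(shard_for(key, 4), []).append(key)

for shard, keys in sorted(placement.items()):
    print(shard, keys)
```

Plain modulo hashing moves most keys when the number of shards changes, which is why real systems tend to use consistent hashing or range partitioning instead.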

Flexibility

Unlike relational databases, NoSQL databases often do not require a fixed schema or table structure. For use cases where data models are dynamic, NoSQL databases can accommodate new fields without altering the database structure.

Availability

One of the most important requirements for an application is its availability. Many NoSQL databases are designed with that in mind: they take advantage of multiple servers, offering high availability and resistance to single points of failure.

Cost-effectiveness

To this day, many relational database management systems require licensing fees, which makes it hard to scale and handle peak traffic while staying cost-effective. Most NoSQL databases are open source projects without licensing fees.

ACID and BASE

Atomicity - transactions are treated as single, indivisible units: all operations of a transaction succeed or none of them succeed
Consistency - transactions always leave the database in a state that doesn't violate the data integrity requirements
Isolation - all transactions are isolated and "invisible" until completed
Durability - once a transaction is completed, it's durable even in the event of power loss or a similar failure
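
Atomicity and consistency can be seen in a small experiment with SQLite from the Python standard library. This is only a sketch: the accounts table, the CHECK constraint and the helper names are assumptions made for the example. A transfer that would push a balance below zero fails on the second statement, and the first statement is rolled back with it.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two accounts (hypothetical example data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on success."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            # Credit first, so a failing debit has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; nothing was applied.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = make_bank()
print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer returns False and leaves both balances exactly as they were after the first one, including the credit that had already been executed inside the failed transaction.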

ACID and BASE

Basically Available - this property means that the database can tolerate failures of some of the servers and still be available
Soft state - the state of the system can change over time, data might be overwritten with more recent data
Eventually consistent - database might be at times in an inconsistent state, but it will eventually become consistent after it stops receiving inputs
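
Eventual consistency can be simulated with a few in-memory replicas. The sketch below assumes a simple last-write-wins rule based on a version number supplied by the writer; the Replica class and the sync function are hypothetical names for this example, and real databases use more elaborate mechanisms.

```python
class Replica:
    """One server holding its own copy of the data."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Last write wins: keep the entry with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(replicas):
    """One exchange round: every replica pushes its entries to every other one."""
    for source in replicas:
        for target in replicas:
            if source is not target:
                for key, (version, value) in list(source.data.items()):
                    target.write(key, value, version)


a, b, c = Replica(), Replica(), Replica()
a.write("cart", ["book"], version=1)
b.write("cart", ["book", "pen"], version=2)

# Before the exchange the replicas disagree (soft state).
print([r.read("cart") for r in (a, b, c)])

sync([a, b, c])

# Once writes stop and the replicas have exchanged data, they agree.
print([r.read("cart") for r in (a, b, c)])
```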

Types of NoSQL databases

Key-Value Pair Databases
Graph Databases
Document-oriented Databases
Wide-column Databases
Time-series Databases
... more that we won't discuss today

Key-Value Pair databases

Stores keys and corresponding values
Minimal set of constraints on database structure, keys have to be unique
Designed for simplicity and speed, keeping as much as possible in RAM
Values usually don't require strong typing
Limited possibilities when it comes to querying - only key-based operations
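
The properties above can be illustrated with a toy in-memory store. This is a sketch built on a Python dict, with hypothetical class and method names, and says nothing about how any real product is implemented. Looking up by key is a single dictionary access, while finding entries by value requires scanning everything.

```python
class KeyValueStore:
    """Toy key-value store: unique keys, untyped values, key-based access only."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # Keys are unique, so writing an existing key overwrites its value.
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False

    def keys_with_value(self, value):
        # Not a native operation: the whole store has to be scanned.
        return sorted(k for k, v in self._items.items() if v == value)


store = KeyValueStore()
store.put("session:1", {"user": "ann"})   # values do not need a fixed type
store.put("counter", 42)
store.put("counter", 43)                  # same key, value replaced
print(store.get("counter"))
print(store.get("missing", "n/a"))
```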

Key-Value Pair databases

Redis (https://redis.io/)
Riak (https://riak.com/index.html)
DynamoDB (https://aws.amazon.com/dynamodb/)
FoundationDB (https://www.foundationdb.org/)
Memcached (http://memcached.org/)
LevelDB (https://github.com/google/leveldb)
etcd (https://etcd.io/)

Redis

In-memory data structure store, often used as cache, key-value pair database or message broker.
Written in C, open source (BSD 3-clause license), originally created by Salvatore Sanfilippo, currently developed by Redis Labs
Support for strings, maps, lists, sets, sorted sets among others
Offers optional durability
According to "DB-Engines" ranking, it's the most popular key-value database

Redis

Supports standalone and cluster mode
Embedded Lua scripting language
Can be extended with Redis modules
Small memory footprint, an empty instance uses about 3 MB of memory
Amazing documentation https://redis.io/documentation

Redis use cases

Caching (session cache, full page cache, db query cache)
Used as message broker
Used for building leaderboards (using sorted sets)
Operational DB, used for storing ad-hoc values for optimization purposes
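
The leaderboard use case can be sketched without a running server. The class below is a pure-Python stand-in for a sorted set, loosely modelled on what the Redis commands ZADD, ZINCRBY and ZREVRANGE provide; it is not the Redis client API, and the tie-breaking by member name is an assumption of this example.

```python
class Leaderboard:
    """Pure-Python stand-in for a sorted set of (member, score) pairs."""

    def __init__(self):
        self._scores = {}

    def add(self, member, score):
        # Similar in spirit to ZADD: set the score for a member.
        self._scores[member] = score

    def increment(self, member, amount):
        # Similar in spirit to ZINCRBY: add to a score, starting from 0.
        self._scores[member] = self._scores.get(member, 0) + amount
        return self._scores[member]

    def top(self, count):
        # Highest score first; ties broken by member name (sketch assumption).
        ranked = sorted(self._scores.items(), key=lambda item: (-item[1], item[0]))
        return ranked[:count]


board = Leaderboard()
board.add("ann", 120)
board.add("bob", 95)
board.add("cy", 130)
board.increment("bob", 40)
print(board.top(2))
```

This sketch re-sorts all members on every read. A real sorted set keeps members ordered as they are written, which is what makes it attractive for leaderboards.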

Redis demo

Graph databases

Based on graph theory, use nodes and edges for storing data
Nodes represent "entities" and edges represent "relationships" between nodes
Edges can be weighted (have associated numeric property)
Edges can be directed or undirected
Allow for easy modeling of data that can be represented as a network
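
The node and edge model can be sketched with a dictionary of neighbours. The Graph class, the sample names and the friends_of_friends helper are assumptions made for this example; a graph database would store and index this structure natively and expose it through a query language.

```python
from collections import defaultdict


class Graph:
    """Nodes are entities; edges are relationships with an optional weight."""

    def __init__(self):
        self.edges = defaultdict(dict)  # node -> {neighbour: weight}

    def add_edge(self, a, b, weight=1.0, directed=False):
        self.edges[a][b] = weight
        if directed:
            self.edges.setdefault(b, {})  # register b even without outgoing edges
        else:
            self.edges[b][a] = weight     # undirected: store both directions


def friends_of_friends(graph, person):
    """Nodes exactly two hops away that are not already direct neighbours."""
    direct = set(graph.edges.get(person, {}))
    suggestions = set()
    for friend in direct:
        for candidate in graph.edges.get(friend, {}):
            if candidate != person and candidate not in direct:
                suggestions.add(candidate)
    return sorted(suggestions)


social = Graph()
social.add_edge("alice", "bob")
social.add_edge("bob", "carol")
social.add_edge("alice", "dave")
social.add_edge("dave", "erin")
social.add_edge("carol", "erin")
print(friends_of_friends(social, "alice"))
```

The two-hop traversal is the core of a simple "people you may know" recommendation: it follows relationships directly instead of joining tables.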

Graph databases

Neo4j (https://neo4j.com/)
DGraph (https://dgraph.io/)
OrientDB (https://orientdb.com/)
Amazon Neptune (https://aws.amazon.com/neptune/)
ArangoDB (https://www.arangodb.com/)

Neo4j

Graph database management system, written in Java and developed by Neo4j, Inc.
Offers GPL-3 licensed open-source version as well as closed-source extensions, available under commercial license
Most popular graph database according to "DB-Engines" ranking
Uses Cypher query language
ACID transactions

Neo4j use cases

Real time analytics based on relationships (e.g. fraud detection)
Recommendation engines / systems
Social networks graphs
Network and infrastructure monitoring

Neo4j demo

Document-oriented databases

Also referred to as document stores, used for managing semi-structured data, often in the form of JSON-like documents
Schemaless, do not require predefined schemas, which makes them well suited for storing dynamic, unstructured data
Often there's no need for object-relational mapping on application level
Advanced querying capabilities in comparison to Key-Value stores
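
Querying by document content, rather than only by key, can be sketched over a list of Python dicts. The find, matches and get_path helpers are hypothetical names for this example; the dotted path for nested fields is loosely inspired by the dot notation document stores such as MongoDB offer, but this is not a database query engine.

```python
_MISSING = object()  # sentinel for "field not present in this document"


def get_path(document, path):
    """Follow a dotted path such as 'attrs.size' through nested dicts."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def matches(document, query):
    """True if every field named in the query has the expected value."""
    return all(get_path(document, path) == expected for path, expected in query.items())


def find(collection, query):
    return [doc for doc in collection if matches(doc, query)]


# Documents in one collection do not need to share the same fields.
products = [
    {"name": "T-shirt", "category": "clothing", "attrs": {"size": "M", "color": "red"}},
    {"name": "Laptop", "category": "electronics", "attrs": {"ram_gb": 16}},
    {"name": "Hoodie", "category": "clothing", "attrs": {"size": "L"}},
]

print([p["name"] for p in find(products, {"category": "clothing"})])
print([p["name"] for p in find(products, {"attrs.size": "M"})])
```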

Document-oriented databases

MongoDB (https://www.mongodb.com/)
Apache CouchDB (https://couchdb.apache.org/)
Azure Cosmos DB (https://azure.microsoft.com/en-us/services/cosmos-db/)
Elasticsearch (https://www.elastic.co/)
RethinkDB (https://rethinkdb.com/)
Couchbase (https://www.couchbase.com/)

MongoDB

General purpose, document oriented database, developed by MongoDB, Inc.
Licensed under Server Side Public License (SSPL)
JSON documents, queries also in form of JSON
Support for ACID transactions
Two types of relationships - reference and embedded
Part of a popular MEAN/MERN stack
Horizontal scaling using sharding

MongoDB use cases

Applications that operate on JSON data structures
Applications that manage dynamic data, with variable attributes (e.g. shop that sells product with multiple different properties)

MongoDB demo

Elasticsearch

Open source search and analytics engine, developed by Elastic
Stores data as JSON documents
Built on top of Apache Lucene, open source search engine
Takes advantage of an "inverted index" data structure, which allows for fast full-text searches
Accessible via REST API
Distributes data into shards, is horizontally scalable
Part of the popular Elastic Stack (formerly ELK: Elasticsearch, Logstash, Kibana)
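
The inverted index idea can be sketched in a few lines: instead of scanning every document for a word, the index maps each word to the documents that contain it. The function names and sample documents are assumptions for this example, and the sketch leaves out what a real search engine adds on top, such as text analysis and relevance scoring.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


documents = {
    1: "Redis is an in-memory key-value store",
    2: "Elasticsearch is a search engine built on Lucene",
    3: "Lucene is a search library",
}
index = build_index(documents)
print(sorted(search(index, "search lucene")))
print(sorted(search(index, "search engine")))
```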

Elasticsearch use cases

Applications that need to support full-text search
Logging and log analytics
Applications that perform geospatial analytics and visualisations

Elasticsearch demo

Wide-column databases

Database management systems that store data within columns and rows, but can be better interpreted as a two-dimensional key-value store
Still uses SQL or SQL-like language for querying (in most cases)
Aimed at workloads that consider columns (specific values) more than whole records (rows)
Very often used in analytical applications

Wide-column databases

Apache Cassandra (http://cassandra.apache.org/)
Apache HBase (https://hbase.apache.org/)
Google BigTable (https://cloud.google.com/bigtable/)
Scylla (https://www.scylladb.com/)

Google BigTable

Closed-source, wide-column store developed by Google
It uses three-dimensional mapping (row key, column key and timestamp)
Can be defined as sparse, distributed, multi-dimensional sorted map
Designed to scale into petabyte range
It can scale without downtime
Supports high read and write throughput at low latency for fast access to large amounts of data
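
The three-dimensional mapping (row key, column key, timestamp) can be sketched as a nested dictionary. The SparseTable class, its method names and the sensor data are assumptions for this example; it is an in-memory illustration of the data model only, not of how BigTable stores or distributes data.

```python
class SparseTable:
    """Toy model of a sparse, sorted map: (row key, column key, timestamp) -> value."""

    def __init__(self):
        self._cells = {}  # (row_key, column_key) -> {timestamp: value}

    def put(self, row_key, column_key, timestamp, value):
        # Only cells that are written exist, so the table stays sparse.
        self._cells.setdefault((row_key, column_key), {})[timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Latest value, or the latest value at or before `timestamp`."""
        versions = self._cells.get((row_key, column_key))
        if not versions:
            return None
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def scan(self, start_row, end_row):
        """Latest values for rows in [start_row, end_row), in sorted key order."""
        rows = {}
        for row_key, column_key in sorted(self._cells):
            if start_row <= row_key < end_row:
                rows.setdefault(row_key, {})[column_key] = self.get(row_key, column_key)
        return rows


table = SparseTable()
table.put("sensor-001", "metrics:temp", 100, 20.5)
table.put("sensor-001", "metrics:temp", 200, 21.0)
table.put("sensor-002", "metrics:temp", 150, 19.0)
print(table.get("sensor-001", "metrics:temp"))
print(table.get("sensor-001", "metrics:temp", timestamp=150))
print(table.scan("sensor-001", "sensor-002"))
```

Because rows are kept in sorted key order, choosing the row key decides which rows can be read together in one range scan.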

Google BigTable use cases

Applications that need to ingest, store and analyse large volumes of e.g. sensor data
Financial transaction analysis and fraud detection
Integration of large amount of unrefined data from many sources to find underlying patterns e.g. in AdTech

Apache Cassandra

Open source, distributed, wide-column database, initially developed at Facebook, currently developed by Apache Software Foundation
Designed to scale both writes and reads as more machines are added to the cluster
Uses CQL (Cassandra Query Language)
Integrates with Hadoop
Automatically replicates data to multiple nodes to provide fault tolerance

Apache Cassandra use cases

Facebook's Inbox Search
Heavy-write applications
Perfect for multiple datacenters in different geographical regions
For applications that require high availability

Apache Cassandra Demo

Time-series databases

Database management systems that are optimized to handle timestamped or time-series data
Data is characterized by a low number of relationships and temporal ordering of records
Data is stored in form of measurements, events or metrics, often numerical
Data is inserted and often queried (aggregated), updates are rare
Time-Series databases are built in two ways - as a standalone database or as an extension to an existing database
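
A typical time-series query groups points into fixed time windows and aggregates each window. The sketch below shows that idea in plain Python; the bucket_mean function and the sample readings are assumptions for this example. Time-series databases expose the same operation in their query languages, for instance GROUP BY time(...) in InfluxQL or the time_bucket function in TimescaleDB.

```python
from collections import defaultdict
from statistics import mean


def bucket_mean(points, bucket_seconds):
    """Average values per time bucket.

    points: iterable of (unix_timestamp, value) pairs.
    Returns {bucket_start_timestamp: mean value}, ordered by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Round the timestamp down to the start of its bucket.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical temperature readings: (seconds since start, degrees).
readings = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 23.0), (150, 25.0)]
print(bucket_mean(readings, 60))
```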

Time-series databases

InfluxDB (https://www.influxdata.com/)
TimescaleDB (https://www.timescale.com/)
OpenTSDB (http://opentsdb.net/)
Riak TS (https://riak.com/)
Amazon Timestream (https://aws.amazon.com/timestream/)
Prometheus (https://prometheus.io/)

InfluxDB

Open source, time-series database written in Go, developed and maintained by InfluxData, Inc.
Uses InfluxQL, custom SQL-like query language
Has support for aggregation functions over time-series data
Part of the popular TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
Scalable and Highly Available thanks to support for clustering

InfluxDB use cases

Applications that need to ingest, store and analyse large volumes of e.g. sensor data
Real-time analytics on time-series data
DevOps monitoring

InfluxDB Demo

TimescaleDB

Open source, time-series PostgreSQL extension written in C, developed and maintained by Timescale, Inc.
Uses SQL and is compatible with "native" PostgreSQL
Supports the same client libraries and CLI tools as PostgreSQL
Adds support for aggregation functions over time-series data

TimescaleDB use cases

Applications that need to ingest, store and analyse large volumes of e.g. sensor data
Real-time analytics on time-series data
DevOps monitoring

TimescaleDB Demo

Useful links

Aiven - hosting DBs in the cloud (https://aiven.io/)
Timescale Cloud (https://www.timescale.com/cloud/)
MongoDB Atlas (https://www.mongodb.com/atlas/database)
Google Cloud Platform (https://cloud.google.com/)
Neo4j Sandbox (https://sandbox.neo4j.com/)

Q&A + Contact

Twitter: @p_grzesik

contact@pgrzesik.com

pgrzesik.com
