A whirlwind tour of NoSQL databases

Piotr Grzesik

Agenda

  • What is NoSQL?
  • Why consider NoSQL databases?
  • CAP theorem, ACID and BASE
  • Key-Value Pair Databases
  • Graph Databases
  • Document-oriented Databases
  • Wide-column Databases
  • Time-series Databases

What is NoSQL?

There is no strict definition of NoSQL. Originally it meant "non-SQL", but nowadays it is often read as "Not Only SQL". In general, NoSQL databases model data differently from relational databases, although that is not always the case.

Why consider NoSQL databases?

  • Scalability
  • Flexibility
  • Availability
  • Cost-effectiveness

Scalability

As applications grow, experience traffic spikes and serve millions of users, there is a pressing need for the underlying databases to scale.

Very often, relational databases are easy to scale up (upgrading the hardware of an existing server) but challenging to scale out (adding more servers). Dedicated NoSQL databases are often designed with scaling in mind, which makes them well suited to the "scale out" approach.

Flexibility

Unlike relational databases, NoSQL databases often do not require a fixed schema or table structure. For use cases where data models are dynamic, NoSQL databases can accommodate such changes automatically, without altering the database structure.

Availability

One of the most important requirements for an application is its availability. Many NoSQL databases are designed with that in mind: they can take advantage of multiple servers, offering high availability and resistance to single points of failure.

Cost-effectiveness

To this day, many relational database management systems require licensing fees, which makes it hard to plan for scaling and to handle peak traffic cost-effectively. Most NoSQL databases are open source projects without licensing fees.

ACID and BASE

  • Atomicity - transactions are treated as single, indivisible units; either all operations of a transaction succeed or none of them succeed
  • Consistency - transactions always leave the database in a state that doesn't violate the data integrity requirements
  • Isolation - all transactions are isolated and "invisible" until completed
  • Durability - once a transaction is completed, its effects persist even in the event of a power loss or similar failure
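
The atomicity and consistency properties above can be sketched with SQLite from Python's standard library, used here only as a convenient relational stand-in. The `accounts` table, the `transfer` helper and the "no negative balance" rule are assumptions made for illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Hypothetical integrity rule: balances must never go negative.
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # both updates are rolled back
print(ok, failed, balances(conn))
```

The second transfer performs both updates before the rule is checked, yet neither of them is visible afterwards: the transaction is undone as a whole.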

ACID and BASE

  • Basically Available - this property means that the database can tolerate failures of some of the servers and still be available
  • Soft state - the state of the system can change over time, data might be overwritten with more recent data
  • Eventually consistent - database might be at times in an inconsistent state, but it will eventually become consistent after it stops receiving inputs
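
A toy model of eventual consistency, assuming two replicas that reconcile with a simple "highest version wins" rule. The `Replica` class and its methods are hypothetical; real databases use more elaborate mechanisms for replication and conflict resolution.

```python
class Replica:
    """One copy of the data; each key stores a (version, value) pair."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Keep the incoming value only if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        # Background replication: pull every entry the other replica holds.
        for key, (version, value) in other.data.items():
            self.write(key, value, version)

a, b = Replica(), Replica()
a.write("cart", ["book"], version=1)
b.sync_from(a)                               # replicas agree
a.write("cart", ["book", "pen"], version=2)  # the write lands on replica a only
stale = b.read("cart")                       # b still serves the older value

b.sync_from(a)                               # once writes stop and replication
a.sync_from(b)                               # catches up, the replicas converge
print(stale, a.read("cart"), b.read("cart"))
```

Both replicas stay available for reads throughout; the price is that `b` temporarily returns stale data.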

Types of NoSQL databases

  • Key-Value Pair Databases
  • Graph Databases
  • Document-oriented Databases
  • Wide-column Databases
  • Time-series Databases
  • ... more that we won't discuss today

Key-Value Pair databases

  • Stores keys and corresponding values
  • Minimal set of constraints on database structure; keys have to be unique
  • Designed for simplicity and speed, keeping as much as possible in RAM
  • Values usually don't require strong typing
  • Limited querying possibilities - only key-based operations

Key-Value Pair databases

Redis

 
  • In-memory data structure store, often used as cache, key-value pair database or message broker.
  • Written in C, open source (BSD 3-clause license), originally created by Salvatore Sanfilippo, currently developed by Redis Labs
  • Support for strings, maps, lists, sets, sorted sets among others
  • Offers optional durability
  • According to "DB-Engines" ranking, it's the most popular key-value database

Redis 

 
 
  • Supports standalone and cluster mode
  • Embedded Lua scripting language
  • Can be extended with Redis modules
  • Small memory footprint, an empty instance uses about 3 MB of memory
  • Amazing documentation https://redis.io/documentation

Redis use cases

 
  • Caching (session cache, full page cache, db query cache)
  • Used as message broker
  • Used for building leaderboards (using sorted sets)
  • Operational DB, used for storing ad-hoc values for optimization purposes
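
The caching use case usually follows a "cache-aside" pattern, sketched below. To stay runnable without a server, the sketch uses a hypothetical `InMemoryCache` stand-in that mimics only the `get(name)` and `set(name, value, ex=seconds)` calls of the redis-py client. The `cached_query` helper, the `user:42` key and `slow_lookup` are likewise assumptions.

```python
import json
import time

class InMemoryCache:
    """Hypothetical stand-in for a Redis client, limited to get/set with expiry."""

    def __init__(self):
        self._items = {}  # name -> (value, expiry time or None)

    def get(self, name):
        entry = self._items.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._items[name]  # the entry has expired
            return None
        return value

    def set(self, name, value, ex=None):
        expires_at = time.monotonic() + ex if ex is not None else None
        self._items[name] = (value, expires_at)
        return True

def cached_query(cache, key, load, ttl_seconds=60):
    """Return the cached result for `key`, calling `load()` only on a miss."""
    raw = cache.get(key)
    if raw is not None:
        return json.loads(raw)  # cache hit
    result = load()             # cache miss: run the expensive query
    cache.set(key, json.dumps(result), ex=ttl_seconds)
    return result

calls = []

def slow_lookup():
    # Stands in for a slow database query.
    calls.append(1)
    return {"id": 42, "name": "Ada"}

cache = InMemoryCache()
first = cached_query(cache, "user:42", slow_lookup)
second = cached_query(cache, "user:42", slow_lookup)
print(first, second, len(calls))
```

With a real server, a `redis.Redis(...)` client could be passed in place of `InMemoryCache`, since `cached_query` relies only on those two calls.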

Redis demo

 

Graph databases

  • Based on graph theory, use nodes and edges for storing data
  • Nodes represent "entities" and edges represent "relationships" between nodes
  • Edges can be weighted (have associated numeric property)
  • Edges can be directed or undirected
  • Allow for easy modeling of data that can be represented as a network
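
The node and edge model above can be sketched in plain Python. The `Graph` class, its method names and the sample people are assumptions for illustration; a graph database would persist and index this structure and expose it through a query language.

```python
from collections import defaultdict, deque

class Graph:
    """Nodes are plain names; edges are stored as node -> {neighbour: weight}."""

    def __init__(self):
        self.edges = defaultdict(dict)

    def add_edge(self, src, dst, weight=1, directed=True):
        self.edges[src][dst] = weight
        if not directed:
            self.edges[dst][src] = weight  # undirected: store both directions

    def within_hops(self, start, max_hops):
        """Breadth-first search for every node reachable in at most max_hops."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for neighbour in self.edges.get(node, {}):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
        return seen - {start}

g = Graph()
g.add_edge("alice", "bob", directed=False)
g.add_edge("bob", "carol", directed=False)
g.add_edge("carol", "dave", directed=False)
g.add_edge("alice", "erin", weight=5)  # directed edge: alice -> erin only

friends_of_friends = g.within_hops("alice", 2)
print(sorted(friends_of_friends))
```

Relationship queries such as "friends of friends" become short traversals, whereas in a relational schema they typically require repeated self-joins.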

Graph databases

Neo4j

  • Graph database management system, written in Java and developed by Neo4j, Inc.
  • Offers GPL-3 licensed open-source version as well as closed-source extensions, available under commercial license
  • Most popular graph database according to "DB-Engines" ranking
  • Uses Cypher query language
  • ACID transactions

Neo4j use cases

  • Real time analytics based on relationships (e.g. fraud detection)
  • Recommendation engines / systems
  • Social network graphs
  • Network and infrastructure monitoring

Neo4j demo

Document-oriented databases

  • Also referred to as document stores, used for managing semi-structured data, often in the form of JSON-like documents
  • Schemaless: they do not require predefined schemas, which makes them well suited to storing dynamic, semi-structured data
  • Often there's no need for object-relational mapping on application level
  • Advanced querying capabilities in comparison to Key-Value stores
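
A small sketch of schemaless documents and query-by-example, using plain Python dicts. The `matches` and `find` helpers, the dotted-path convention and the sample products are assumptions; document databases implement far richer query operators and back them with indexes.

```python
def matches(document, query):
    """True if every dotted path in `query` exists in `document` with that value."""
    for path, expected in query.items():
        value = document
        for part in path.split("."):
            if not isinstance(value, dict) or part not in value:
                return False  # the path does not exist in this document
            value = value[part]
        if value != expected:
            return False
    return True

def find(collection, query):
    """Return all documents in the collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]

# Documents in one collection do not have to share the same fields.
products = [
    {"name": "T-shirt", "category": "clothing", "attrs": {"size": "M", "color": "blue"}},
    {"name": "Laptop", "category": "electronics", "attrs": {"ram_gb": 16}},
    {"name": "Jeans", "category": "clothing", "attrs": {"size": "M", "waist": 32}},
]

medium_clothes = find(products, {"category": "clothing", "attrs.size": "M"})
print([doc["name"] for doc in medium_clothes])
```

Unlike a key-value lookup, the query inspects the content of the values, including nested fields.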

Document-oriented databases

MongoDB

  • General purpose, document oriented database, developed by MongoDB, Inc.
  • Licensed under Server Side Public License (SSPL)
  • JSON-like documents (stored as BSON), queries are also expressed as JSON
  • Support for ACID transactions
  • Two types of relationships - reference and embedded
  • Part of the popular MEAN/MERN stacks
  • Horizontal scaling using sharding

MongoDB use cases

  • Applications that operate on JSON data structures
  • Applications that manage dynamic data with variable attributes (e.g. a shop that sells products with many different properties)

MongoDB demo

Elasticsearch

  • Open source search and analytics engine, developed by Elastic
  • Stores data as JSON documents
  • Built on top of Apache Lucene, open source search engine
  • Takes advantage of an "inverted index" data structure, which allows for fast full-text searches
  • Accessible via REST API
  • Distributes data into shards, is horizontally scalable
  • Part of the popular Elastic (formerly ELK - Elasticsearch, Logstash, Kibana) stack
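
The "inverted index" mentioned above maps each term to the set of documents that contain it. A minimal sketch follows, assuming a naive lowercase tokenizer and AND semantics between query terms; real search engines add analysis steps such as stemming, relevance scoring and compressed posting lists.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Build a term -> set of document ids mapping (the inverted index)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

docs = {
    1: "Redis is an in-memory key-value store",
    2: "Elasticsearch is a search engine built on Lucene",
    3: "Lucene is a search library",
}
index = build_index(docs)
print(search(index, "search lucene"), search(index, "Search Engine"))
```

A lookup touches only the entries for the query terms instead of scanning every document, which is what makes full-text search fast.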

Elasticsearch use cases

  • Applications that need to support full-text search
  • Logging and log analytics
  • Applications that perform geospatial analytics and visualisations

Elasticsearch demo

Wide-column databases

  • Database management systems that store data in columns and rows, but can be better interpreted as a two-dimensional key-value store
  • Still use SQL or an SQL-like language for querying (in most cases)
  • Aimed at workloads that care about columns (specific values) more than whole records (rows)
  • Very often used in analytical applications

Wide-column databases

Google BigTable

  • Closed-source, wide-column store developed by Google
  • It uses three-dimensional mapping (row key, column key and timestamp)
  • Can be defined as sparse, distributed, multi-dimensional sorted map
  • Designed to scale into petabyte range
  • It can scale without downtime
  • Supports high read and write throughput at low latency for fast access to large amounts of data
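
The three-dimensional mapping (row key, column key, timestamp) can be modelled as a sparse map, sketched below. The `SparseTable` class, its methods and the sensor-style keys are assumptions for illustration only and say nothing about how the real system stores or distributes data.

```python
class SparseTable:
    """Maps (row key, column key) to timestamped versions of a value."""

    def __init__(self):
        self.cells = {}  # (row_key, column_key) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, at=None):
        """Return the latest value, or the latest one written at or before `at`."""
        versions = self.cells.get((row, column), {})
        candidates = [ts for ts in versions if at is None or ts <= at]
        return versions[max(candidates)] if candidates else None

    def scan(self, row_prefix):
        """Yield (row, column, latest value) in sorted row-key order."""
        for row, column in sorted(self.cells):
            if row.startswith(row_prefix):
                yield row, column, self.get(row, column)

table = SparseTable()
table.put("sensor#001", "metrics:temp", 100, 21.5)
table.put("sensor#001", "metrics:temp", 200, 22.0)  # newer version, same cell
table.put("sensor#002", "metrics:humidity", 150, 0.4)

latest = table.get("sensor#001", "metrics:temp")
earlier = table.get("sensor#001", "metrics:temp", at=150)
rows = list(table.scan("sensor#"))
print(latest, earlier, rows)
```

The map is sparse because a row stores only the columns it actually has: `sensor#002` has no `metrics:temp` cell at all, rather than an empty one.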

Google BigTable use cases

  • Applications that need to ingest, store and analyse large volumes of e.g. sensor data
  • Financial transaction analysis and fraud detection
  • Integration of large amounts of unrefined data from many sources to find underlying patterns, e.g. in AdTech

Apache Cassandra

  • Open source, distributed, wide-column database, initially developed at Facebook, currently developed by the Apache Software Foundation
  • Designed to scale both writes and reads as more machines are added to the cluster
  • Uses CQL (Cassandra Query Language)
  • Integrates with Hadoop
  • Automatically replicates data to multiple nodes to provide fault tolerance
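
Replication to multiple nodes is commonly built on a token ring: each partition key hashes to a position on the ring and is stored on the next few nodes clockwise. The sketch below assumes MD5 as the hash and one token per node; the `Ring` class and node names are hypothetical, and Cassandra's own partitioner and replication strategies are more sophisticated.

```python
import bisect
import hashlib

def token(value):
    """Hash a string to a position on the ring (MD5 is used only for the sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replication_factor=3):
        # Cannot place more replicas than there are nodes.
        self.replication_factor = min(replication_factor, len(nodes))
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [node_token for node_token, _ in self.ring]

    def replicas(self, partition_key):
        """Return the nodes that should hold a copy of this partition key."""
        # First node clockwise from the key's token, wrapping around the ring.
        start = bisect.bisect_right(self.tokens, token(partition_key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]

ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
print(owners)
```

Every key lives on three distinct nodes here, so losing any single node still leaves two copies available.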

Apache Cassandra use cases

  • Facebook's Inbox Search
  • Write-heavy applications
  • Deployments that span multiple data centers in different geographical regions
  • Applications that require high availability

Apache Cassandra demo

Time-series databases

  • Database management systems that are optimized to handle timestamped or time-series data
  • Data is characterized by few relationships and a temporal ordering of records
  • Data is stored in the form of measurements, events or metrics, often numerical
  • Data is mostly inserted and queried (often aggregated); updates are rare
  • Time-series databases are built in two ways: as a standalone database or as an extension to an existing database
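
The typical "insert, then aggregate over time" workload can be sketched as downsampling raw points into fixed time buckets. The `downsample` helper and the sample readings are assumptions; time-series databases provide such aggregations natively and optimize storage for them.

```python
from collections import defaultdict
from statistics import mean

def downsample(points, bucket_seconds):
    """Group (timestamp, value) points into fixed buckets and average each one."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align the timestamp to the start of its bucket.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Hypothetical temperature readings as (seconds since start, degrees Celsius).
readings = [(0, 20.0), (15, 22.0), (59, 24.0), (60, 30.0), (119, 32.0), (130, 18.0)]

per_minute = downsample(readings, 60)
print(per_minute)
```

Six raw points become three one-minute averages; the same idea lets long-term data be kept at a coarser resolution.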

Time-series databases

InfluxDB

  • Open source, time-series database written in Go, developed and maintained by InfluxData, Inc.
  • Uses InfluxQL, a custom SQL-like query language
  • Has support for aggregation functions over time-series data
  • Part of the popular TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
  • Scalable and highly available thanks to support for clustering (offered in the commercial editions)

InfluxDB use cases

  • Applications that need to ingest, store and analyse large volumes of e.g. sensor data
  • Real-time analytics on time-series data
  • DevOps monitoring

InfluxDB demo

TimescaleDB

  • Open source, time-series PostgreSQL extension written in C, developed and maintained by Timescale, Inc. 
  • Uses SQL and is compatible with "native" PostgreSQL
  • Supports the same client libraries and CLI tools as PostgreSQL
  • Adds support for aggregation functions over time-series data

TimescaleDB use cases

  • Applications that need to ingest, store and analyse large volumes of e.g. sensor data
  • Real-time analytics on time-series data
  • DevOps monitoring

TimescaleDB demo

Useful links

Q&A + Contact

Twitter: @p_grzesik

contact@pgrzesik.com

pgrzesik.com
