NoSQL Databases
Piotr Grzesik
Agenda
- What is NoSQL?
- Why consider NoSQL databases?
- CAP theorem, ACID and BASE
- Key-Value Pair Databases
- Graph Databases
- Document-oriented Databases
- Wide-column Databases
- Time-series Databases
What is NoSQL?
There isn't a strict definition of NoSQL - originally it meant non-SQL, but nowadays it is often presented as Not Only SQL. In general NoSQL databases model data in a different way than relational databases, but that's not always the case.
Why consider NoSQL databases?
- Scalability
- Flexibility
- Availability
- Cost-effectiveness
Scalability
As applications grow, experience traffic spikes and serve millions of users, the underlying databases must be able to scale with them.
With relational databases it is usually easy to scale up (upgrade the hardware of an existing server) but much harder to scale out (add more servers). Many NoSQL databases are designed with horizontal scaling in mind, which makes them a good fit for the "scale out" approach.
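As a rough illustration of scaling out, the sketch below spreads keys across several servers by hashing each key. The server names and the `shard_for` helper are hypothetical and exist only for this example. Real databases typically use more elaborate schemes (for example consistent hashing), so that adding a server moves fewer keys.

```python
import hashlib

# Hypothetical server names, used only for illustration.
SERVERS = ["db-0", "db-1", "db-2"]


def shard_for(key: str, servers: list[str]) -> str:
    """Pick a server for a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    for user in ["alice", "bob", "carol", "dave"]:
        print(user, "->", shard_for(user, SERVERS))
```

The same key always maps to the same server, so reads can find the data that earlier writes stored.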
Flexibility
Unlike relational databases, NoSQL databases often do not require a fixed schema or table structure. For use cases where data models are dynamic, NoSQL databases can accommodate new fields without altering the database structure.
Availability
One of the most important requirements for an application is its availability. Many NoSQL databases are designed with that in mind: they spread data across multiple servers, which offers high availability and resilience against single points of failure.
Cost-effectiveness
To this day, many relational database management systems require licensing fees, which makes it hard to plan for scaling and to handle peak traffic cost-effectively. Most NoSQL databases are open source projects without licensing fees.
ACID and BASE
- Atomicity - transactions are treated as single, indivisible units; either all operations of a transaction succeed or none of them do
- Consistency - transactions always leave the database in a state that doesn't violate the data integrity requirements
- Isolation - all transactions are isolated and "invisible" until completed
- Durability - once a transaction is completed, its changes persist even in the event of power loss or a similar failure
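Atomicity and consistency can be seen in a small sketch that uses Python's built-in `sqlite3` module. SQLite is a relational database and is used here only because it ships with Python. The `accounts` table and the `transfer` helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If the second update violates the `CHECK` constraint, the first update is rolled back as well, so a failed transfer leaves the balances untouched.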
ACID and BASE
- Basically Available - this property means that the database can tolerate failures of some of the servers and still be available
- Soft state - the state of the system can change over time, data might be overwritten with more recent data
- Eventually consistent - database might be at times in an inconsistent state, but it will eventually become consistent after it stops receiving inputs
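Eventual consistency can be pictured with two toy replicas that accept writes independently and later exchange their data. This is a simplified sketch under the assumption of a "last write wins" rule with caller-supplied timestamps. The `Replica` class and `sync` function are hypothetical and do not describe any particular database.

```python
class Replica:
    """A toy replica that keeps the newest (timestamp, value) per key."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Last write wins: keep the value with the newest timestamp.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries so both replicas converge to the newest values."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


r1, r2 = Replica(), Replica()
r1.write("color", "red", timestamp=1)
r2.write("color", "blue", timestamp=2)
before = (r1.read("color"), r2.read("color"))  # replicas disagree
sync(r1, r2)
after = (r1.read("color"), r2.read("color"))  # replicas agree
```

Between the writes and the call to `sync`, a reader may see either value depending on which replica it asks. Once the replicas have exchanged data and no new writes arrive, both return the same value.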
Types of NoSQL databases
- Key-Value Pair Databases
- Graph Databases
- Document-oriented Databases
- Wide-column Databases
- Time-series Databases
- ... more that we won't discuss today
Key-Value Pair databases
- Stores keys and corresponding values
- Minimal set of constraints on database structure; keys have to be unique
- Designed for simplicity and speed, keeping as much as possible in RAM
- Values usually don't require strong typing
- Limited querying possibilities - typically only key-based operations
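The properties above can be sketched with a minimal in-memory store backed by a Python dict. The `KeyValueStore` class is a hypothetical illustration, not the API of any of the databases discussed here.

```python
class KeyValueStore:
    """A minimal in-memory key-value store backed by a dict."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # Keys are unique: writing an existing key overwrites its value.
        self._items[key] = value

    def get(self, key, default=None):
        # Lookups are by key only; there is no way to query by value.
        return self._items.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False


store = KeyValueStore()
store.put("session:42", {"user": "alice"})  # values need no fixed type
store.put("counter", 1)
store.put("counter", 2)  # overwrites the previous value
```

Everything is kept in RAM and addressed by key, which is what makes such stores simple and fast, and also why querying is limited.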
Key-Value Pair databases
- Redis (https://redis.io/)
- Riak (https://riak.com/index.html)
- DynamoDB (https://aws.amazon.com/dynamodb/)
- FoundationDB (https://www.foundationdb.org/)
- Memcached (http://memcached.org/)
- LevelDB (https://github.com/google/leveldb)
- etcd (https://etcd.io/)
Redis
- In-memory data structure store, often used as cache, key-value pair database or message broker.
- Written in C, open source (BSD 3-clause license), originally created by Salvatore Sanfilippo, currently developed by Redis Labs
- Support for strings, hashes (maps), lists, sets and sorted sets, among others
- Offers optional durability
- According to "DB-Engines" ranking, it's the most popular key-value database
Redis
- Supports standalone and cluster mode
- Embedded Lua scripting language
- Can be extended with Redis modules
- Small memory footprint; an empty instance uses about 3 MB of memory
- Amazing documentation https://redis.io/documentation
Redis use cases
- Caching (session cache, full page cache, db query cache)
- Used as message broker
- Used for building leaderboards (using sorted sets)
- Operational DB, used for storing ad-hoc values for optimization purposes
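The leaderboard use case can be sketched without a running server by imitating a sorted set in plain Python. The `Leaderboard` class is a hypothetical stand-in: its methods are only similar in spirit to the Redis commands named in the comments, and it sorts on every read, which a real sorted set avoids.

```python
class Leaderboard:
    """A toy stand-in for a sorted set: unique members ordered by score."""

    def __init__(self):
        self._scores = {}

    def add(self, member, score):
        # Similar in spirit to ZADD: set (or replace) a member's score.
        self._scores[member] = score

    def increment(self, member, amount):
        # Similar in spirit to ZINCRBY: add to a member's score.
        self._scores[member] = self._scores.get(member, 0) + amount
        return self._scores[member]

    def top(self, n):
        # Similar in spirit to ZREVRANGE with scores: highest scores first.
        # Ties are broken by member name here, which is an assumption
        # of this sketch.
        ranked = sorted(
            self._scores.items(), key=lambda item: (-item[1], item[0])
        )
        return ranked[:n]


board = Leaderboard()
board.add("alice", 120)
board.add("bob", 95)
board.add("carol", 150)
board.increment("bob", 60)
```

After these calls, `board.top(2)` returns bob (155) followed by carol (150).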
Redis demo
Graph databases
- Based on graph theory, use nodes and edges for storing data
- Nodes represent "entities" and edges represent "relationships" between nodes
- Edges can be weighted (have associated numeric property)
- Edges can be directed or undirected
- Allow for easy modeling of data that can be represented as network
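The node and edge model can be sketched with adjacency lists in plain Python. The `Graph` class and the example names are hypothetical, and a real graph database adds storage, indexing and a query language on top of this idea.

```python
from collections import defaultdict


class Graph:
    """A small graph with weighted edges, stored as adjacency lists."""

    def __init__(self):
        self._edges = defaultdict(dict)  # source -> {target: weight}

    def add_edge(self, source, target, weight=1.0, directed=True):
        self._edges[source][target] = weight
        if not directed:
            # An undirected edge is stored as two directed edges.
            self._edges[target][source] = weight

    def neighbors(self, node):
        return set(self._edges.get(node, {}))

    def friends_of_friends(self, node):
        """Nodes reachable in two hops, excluding direct neighbors and the node."""
        direct = self.neighbors(node)
        two_hops = set()
        for friend in direct:
            two_hops |= self.neighbors(friend)
        return two_hops - direct - {node}


g = Graph()
g.add_edge("alice", "bob", directed=False)
g.add_edge("bob", "carol", directed=False)
g.add_edge("carol", "dave", weight=0.5)  # directed: carol -> dave only
```

A "friends of friends" lookup only follows edges from the starting node, which is the kind of relationship-heavy query that graph databases are built for.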
Graph databases
- Neo4j (https://neo4j.com/)
- DGraph (https://dgraph.io/)
- OrientDB (https://orientdb.com/)
- Amazon Neptune (https://aws.amazon.com/neptune/)
- ArangoDB (https://www.arangodb.com/)
Neo4j
- Graph database management system, written in Java and developed by Neo4j, Inc.
- Offers GPL-3 licensed open-source version as well as closed-source extensions, available under commercial license
- Most popular graph database according to "DB-Engines" ranking
- Uses Cypher query language
- ACID transactions
Neo4j use cases
- Real time analytics based on relationships (e.g. fraud detection)
- Recommendation engines / systems
- Social network graphs
- Network and infrastructure monitoring
Neo4j demo
Document-oriented databases
- Also referred to as document stores, used for managing semi-structured data, often in the form of JSON-like documents
- Schemaless: they do not require predefined schemas, which makes them well suited for storing dynamic, loosely structured data
- Often there's no need for object-relational mapping on application level
- Advanced querying capabilities in comparison to Key-Value stores
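A document collection can be pictured as a list of JSON-like objects that do not share a schema. In the sketch below, the `products` data and the `find` helper are made up for illustration. `find` only matches top-level fields by equality, which is far less than a real document store offers.

```python
products = [
    {"_id": 1, "name": "T-shirt", "size": "M", "color": "blue"},
    {"_id": 2, "name": "Laptop", "specs": {"ram_gb": 16, "cpu": "8-core"}},
    {"_id": 3, "name": "Novel", "author": "Jane Doe", "pages": 320},
]


def find(collection, query):
    """Return documents whose top-level fields equal every entry in `query`."""
    return [
        doc
        for doc in collection
        if all(
            field in doc and doc[field] == value
            for field, value in query.items()
        )
    ]
```

Unlike a key-value lookup, `find(products, {"color": "blue"})` selects documents by their content, and documents without a `color` field are simply skipped.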
Document-oriented databases
- MongoDB (https://www.mongodb.com/)
- Apache CouchDB (https://couchdb.apache.org/)
- Azure Cosmos DB (https://azure.microsoft.com/en-us/services/cosmos-db/)
- Elasticsearch (https://www.elastic.co/)
- RethinkDB (https://rethinkdb.com/)
- Couchbase (https://www.couchbase.com/)
MongoDB
- General-purpose, document-oriented database, developed by MongoDB, Inc.
- Licensed under Server Side Public License (SSPL)
- JSON documents, queries also in form of JSON
- Support for ACID transactions
- Two types of relationships - reference and embedded
- Part of the popular MEAN and MERN stacks
- Horizontal scaling using sharding
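The two relationship types can be shown with plain Python dicts standing in for documents. The field names and the `resolve_user` helper are hypothetical. In a real deployment the lookup would be a second query or a join-like operation provided by the database.

```python
# Embedded: the address lives inside the user document.
user_embedded = {
    "_id": "u1",
    "name": "Alice",
    "address": {"city": "Krakow", "country": "PL"},
}

# Reference: the order stores only the id of the user it belongs to.
users = {"u1": {"_id": "u1", "name": "Alice"}}
order = {"_id": "o1", "user_id": "u1", "total": 42}


def resolve_user(order, users):
    """Follow the reference from an order to its user document."""
    return users[order["user_id"]]
```

Embedding keeps related data together and readable in one fetch, while references avoid duplicating data that many documents share.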
MongoDB use cases
- Applications that operate on JSON data structures
- Applications that manage dynamic data with variable attributes (e.g. a shop that sells products with many different properties)
MongoDB demo
Elasticsearch
- Open source search and analytics engine, developed by Elastic
- Stores data as JSON documents
- Built on top of Apache Lucene, open source search engine
- Takes advantage of an "inverted index" data structure, which allows for fast full-text searches
- Accessible via REST API
- Distributes data into shards, is horizontally scalable
- Part of the popular Elastic Stack (formerly ELK: Elasticsearch, Logstash, Kibana)
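The "inverted index" mentioned above maps each term to the documents that contain it, so a search does not have to scan every document. The sketch below is a simplified version with whitespace tokenization. The function names and sample documents are assumptions for this example, and real engines add analysis, ranking and much more.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {
    1: "Redis is an in-memory data store",
    2: "Elasticsearch is a search engine",
    3: "Lucene is a search library",
}
index = build_inverted_index(docs)
```

A query such as `search(index, "search engine")` intersects the posting sets of its tokens, which here leaves only document 2.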
Elasticsearch use cases
- Applications that need to support full-text search
- Logging and log analytics
- Applications that perform geospatial analytics and visualisations
Elasticsearch demo
Wide-column databases
- Database management systems that store data in columns and rows, but can be better interpreted as two-dimensional key-value stores
- Still use SQL or an SQL-like language for querying (in most cases)
- Aimed at workloads that consider columns (specific values) more than whole records (rows)
- Very often used in analytical applications
Wide-column databases
- Apache Cassandra (http://cassandra.apache.org/)
- Apache HBase (https://hbase.apache.org/)
- Google BigTable (https://cloud.google.com/bigtable/)
- Scylla (https://www.scylladb.com/)
Google BigTable
- Closed-source, wide-column store developed by Google
- It uses three-dimensional mapping (row key, column key and timestamp)
- Can be defined as sparse, distributed, multi-dimensional sorted map
- Designed to scale into petabyte range
- It can scale without downtime
- Supports high read and write throughput at low latency for fast access to large amounts of data
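The three-dimensional mapping (row key, column key, timestamp) can be sketched as a sparse map in plain Python. The `SparseTable` class is a hypothetical, single-process illustration of the data model only. It says nothing about how BigTable distributes or stores data.

```python
class SparseTable:
    """A toy map keyed by (row key, column key, timestamp)."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self._cells.get((row, column), {})
        candidates = [
            ts for ts in versions if timestamp is None or ts <= timestamp
        ]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, end_row):
        """Yield (row, column) keys in sorted order for start_row <= row < end_row."""
        for row, column in sorted(self._cells):
            if start_row <= row < end_row:
                yield row, column


table = SparseTable()
table.put("row-a", "temp", 1, 20.5)
table.put("row-a", "temp", 2, 21.0)
table.put("row-b", "humidity", 1, 40)
```

The table is sparse because only cells that were written take up space: `row-b` has no `temp` column at all, and each cell keeps several timestamped versions.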
Google BigTable use cases
- Applications that need to ingest, store and analyse large volumes of e.g. sensor data
- Financial transaction analysis and fraud detection
- Integration of large amount of unrefined data from many sources to find underlying patterns e.g. in AdTech
Apache Cassandra
- Open source, distributed, wide-column database, initially developed at Facebook, currently developed by the Apache Software Foundation
- Designed to scale both writes and reads as more machines are added to the cluster
- Uses CQL (Cassandra Query Language)
- Integrates with Hadoop
- Automatically replicates data to multiple nodes to provide fault tolerance
Apache Cassandra use cases
- Facebook's Inbox Search
- Heavy-write applications
- Perfect for multiple datacenters in different geographical regions
- For applications that require high availability
Apache Cassandra Demo
Time-series databases
- Database management systems that are optimized to handle timestamped or time-series data
- Data is characterized by a low number of relationships and temporal ordering of records
- Data is stored in form of measurements, events or metrics, often numerical
- Data is inserted and often queried (aggregated), updates are rare
- Time-Series databases are built in two ways - as a standalone database or as an extension to an existing database
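The typical access pattern, appending timestamped measurements and then aggregating them over time windows, can be sketched in plain Python. The `bucket_mean` function and the sample readings are assumptions for this example. Time-series databases provide this kind of aggregation natively and far more efficiently.

```python
from collections import defaultdict
from statistics import mean


def bucket_mean(points, window_seconds):
    """Average values per fixed time window.

    `points` is an iterable of (unix_timestamp, value) pairs.
    Returns {window_start: mean_value}, ordered by window start.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical sensor readings: (seconds since start, temperature).
readings = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 23.0), (120, 25.0)]
per_minute = bucket_mean(readings, 60)
```

Here the five readings collapse into three one-minute averages: 21.0, 22.0 and 25.0.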
Time-series databases
- InfluxDB (https://www.influxdata.com/)
- TimescaleDB (https://www.timescale.com/)
- OpenTSDB (http://opentsdb.net/)
- Riak TS (https://riak.com/)
- Amazon Timestream (https://aws.amazon.com/timestream/)
- Prometheus (https://prometheus.io/)
InfluxDB
- Open source, time-series database written in Go, developed and maintained by InfluxData, Inc.
- Uses InfluxQL, custom SQL-like query language
- Has support for aggregation functions over time-series data
- Part of the popular TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
- Scalable and Highly Available thanks to support for clustering
InfluxDB use cases
- Applications that need to ingest, store and analyse large volumes of e.g. sensor data
- Real-time analytics on time-series data
- DevOps monitoring
InfluxDB Demo
TimescaleDB
- Open source, time-series PostgreSQL extension written in C, developed and maintained by Timescale, Inc.
- Uses SQL and is compatible with "native" PostgreSQL
- Supports the same client libraries and CLI tools as PostgreSQL
- Adds support for aggregation functions over time-series data
TimescaleDB use cases
- Applications that need to ingest, store and analyse large volumes of e.g. sensor data
- Real-time analytics on time-series data
- DevOps monitoring
TimescaleDB Demo
ArangoDB
- Open source, multi-model database, written in C++ and JavaScript
- Natively supports documents, graphs and key-values
- Supports AQL query language
- Supports ACID transactions
- Supports documents without defined schema
- Uses JSON as data format
- Uses sharding to enable horizontal scaling
Useful links
- Aiven - hosting DBs in the cloud (https://aiven.io/)
- Timescale Cloud (https://www.timescale.com/cloud/)
- MongoDB Atlas (https://www.mongodb.com/atlas/database)
- Google Cloud Platform (https://cloud.google.com/)
- Neo4j Sandbox (https://sandbox.neo4j.com/)
Q&A + Contact
Twitter: @p_grzesik
contact@pgrzesik.com
pgrzesik.com
A whirlwind tour of NoSQL databases
By progressive