Cassandra - an introduction

"Yet another database"

"Marketing"

Hi! I'm Hanneli (@hannelita)

  • Computer Engineer
  • Programming
  • Electronics
  • Math <3 <3
  • Physics
  • Lego
  • Meetups
  • Animals
  • Coffee
  • GIFs
  • Pokémon

Cassandra - Useful! 

Agenda

  • Motivation
  • Architecture
  • Write and Read
  • Data Model
  • Best Combos!
  • How to get started

Disclaimer

Introductory content

Crash course (nothing advanced)

Lots of theory

And some GIFs

Motivation

What happens if you have more than 300 TB in a relational database?

1. Denormalise it.

It's easy to mess up the data. DBAs may be sad.

2. Master/Slave

Single point of failure!

3. Sharding

Schema updates and data consistency problems

Questions for you:

  • Do you need consistency 100% of the time?
  • Do you really need ACID?
  • How can you ensure HA?
  • Wouldn't the previous strategies cause some of these problems?

  • Denormalise: 3rd normal form
  • Master/Slave: Consistency
  • Sharding: Consistency/Availability

What would the ideal scenario be?

Better scenario

  • Put Consistency aside sometimes
  • No Master/Slave
  • Scale with commodity hardware, linearly
  • No manual sharding

Cassandra first impressions

./bin/cassandra
./bin/cqlsh

CREATE KEYSPACE confoo WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'
};

USE confoo;

CREATE TABLE users (
  firstname text,
  lastname text,
  age int,
  email text,
  city text,
  PRIMARY KEY (lastname)
);

CQL - Very similar to SQL

Brief history

Facebook - Open sourced in 2008

Written mostly in Java

Lots of libs

Potential contributors

Compiled

Performance

JMX for monitoring

Tools for profiling

JVM Tuning + Performance

Convenient to manipulate data structures

Architecture

Nodes (machines)

Node (machine)

JVM with Cassandra

nodetool => information about each node

Nodes interact with each other

Hello!

Hi!

Hey

Yo!

No master/slave

Gossip Protocol

I handle 0-24!

I handle 25-49!

I handle 50-74!

I handle 75-99!

Hash ring

DATA

Coordinator

DATA

Where do these ranges come from?

CREATE TABLE users (
  firstname text,
  lastname text,
  age int,
  email text,
  city text,
  PRIMARY KEY (lastname)
);

Part of Partition Key

It goes through a hash function, and the resulting token falls into one of the ranges

"Tavante" -> 11

(Actual tokens range from -2^63 to 2^63 - 1 with the default Murmur3Partitioner)

Your driver (Python, Java, Ruby) can calculate this token and send the request to the proper node

You can choose among several partitioner algorithms
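The mapping from partition key to node can be sketched in a few lines of Python. This is only a toy, under stated assumptions: it uses the 0-99 ring from the slides instead of the real token space, MD5 stands in for the real partitioner hash, and the node names and the helpers `token_for` and `owner_of` are made up for illustration.

```python
import bisect
import hashlib

# Toy ring matching the slides: four nodes, each owning 25 tokens out of 0-99.
# Each entry is (last token owned, node name); the node names are hypothetical.
RING = [(24, "node-a"), (49, "node-b"), (74, "node-c"), (99, "node-d")]


def token_for(partition_key: str) -> int:
    """Hash the partition key into the toy 0-99 token space.

    MD5 is only a stand-in here; it is not Cassandra's default hash.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 100


def owner_of(token: int) -> str:
    """Return the node whose range contains the token."""
    upper_bounds = [upper for upper, _ in RING]
    index = bisect.bisect_left(upper_bounds, token)
    return RING[index][1]


token = token_for("Tavante")
print(token, owner_of(token))
print(owner_of(11))  # node-a, the range 0-24 from the slides
```

The toy hash will not necessarily give 11 for "Tavante"; the point is only that the same key always lands on the same range.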

How can we guarantee availability? 

Replication Factor (RF)

CREATE KEYSPACE confoo WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'
};

USE confoo;

RF - the number of copies of each piece of data in your cluster

RF=3

confoo

DATA

Async copy (no worries! If a node goes down before the replication completes, there is a mechanism to help with that - 'hinted handoff')

Replication is async, and the RF is tied to the keyspace.
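A rough sketch of how replicas could be chosen for RF=3 on the toy ring. It assumes the behaviour of SimpleStrategy, which takes the node owning the token and then the next nodes clockwise; the node names and the function `replicas_for` are hypothetical, and real clusters with several datacenters would use a different strategy.

```python
# Toy ring from the slides; node names are made up for illustration.
RING = [(24, "node-a"), (49, "node-b"), (74, "node-c"), (99, "node-d")]


def replicas_for(token: int, replication_factor: int) -> list[str]:
    """Pick the owner of the token, then the next nodes clockwise on the ring."""
    if replication_factor > len(RING):
        raise ValueError("RF cannot exceed the number of nodes")
    owner_index = next(i for i, (upper, _) in enumerate(RING) if token <= upper)
    return [
        RING[(owner_index + step) % len(RING)][1]
        for step in range(replication_factor)
    ]


print(replicas_for(11, 3))  # ['node-a', 'node-b', 'node-c']
print(replicas_for(80, 3))  # ['node-d', 'node-a', 'node-b'] (wraps around)
```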

Consistency problems

Is there a way to 'increase' the consistency?

Yes! When we perform a read or write, we can ask for a specific number of responses and check if they agree. This is the idea of the Consistency Level (CL) 

CL=ALL

Read query

RF = 3

CL is per query.

CL=ALL, CL=ONE, CL=QUORUM

Example: CL=ONE, RF=3 - You will have 3 copies, but one acknowledged write is enough to reply OK to the client

Trade-off between availability and consistency
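The trade-off can be made concrete with a little arithmetic. The sketch below assumes the usual definition of QUORUM (more than half of the replicas) and the rule of thumb that a read is guaranteed to see the latest write when read replicas + write replicas > RF; the function names are made up for illustration.

```python
def required_responses(consistency_level: str, replication_factor: int) -> int:
    """How many replicas must answer before the client gets a reply."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def reads_see_latest_write(write_cl: str, read_cl: str, replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    w = required_responses(write_cl, replication_factor)
    r = required_responses(read_cl, replication_factor)
    return r + w > replication_factor


def tolerated_failures(consistency_level: str, replication_factor: int) -> int:
    """Replicas that can be down while the query still succeeds."""
    return replication_factor - required_responses(consistency_level, replication_factor)


for cl in ("ONE", "QUORUM", "ALL"):
    print(cl, required_responses(cl, 3), tolerated_failures(cl, 3))
```

With RF=3, CL=ALL tolerates no replica being down, while CL=ONE tolerates two but may return stale data.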

Write and Read

Writing data

Data

RAM: Memtable

Disk: Commit Log, SSTable

Data comes with timestamps

Immutable structures

Fast Write
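A very rough model of the write path, to show why writes are fast: append to a log, update an in-memory table, and occasionally flush to an immutable file. Everything here is an assumption made for illustration (the class `ToyNode`, the `memtable_limit` threshold, plain Python lists and tuples instead of files); it is not how Cassandra is implemented.

```python
import time


class ToyNode:
    """Toy model of one node's write path; not Cassandra's real code."""

    def __init__(self, memtable_limit: int = 3):
        self.commit_log = []   # append-only, stands in for the on-disk log
        self.memtable = {}     # in RAM: key -> (timestamp, value)
        self.sstables = []     # flushed tables, never modified afterwards
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp=None):
        timestamp = time.time() if timestamp is None else timestamp
        self.commit_log.append((key, value, timestamp))  # durability first
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)      # newest timestamp wins
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted, read-only SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log


node = ToyNode(memtable_limit=2)
node.write("Tavante", "Hanneli", timestamp=1)
node.write("Smith", "Ann", timestamp=2)   # reaches the limit and flushes
node.write("Tavante", "H.", timestamp=3)  # new version, old SSTable untouched
print(len(node.sstables), node.memtable)
```

Note that the update to "Tavante" does not touch the flushed SSTable; the old and new versions coexist until they are merged later.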

Too many SSTables

Compaction
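The idea behind compaction, as a minimal sketch: merge several immutable SSTables into one and keep only the newest version of each key. It assumes each SSTable is a sequence of (key, (timestamp, value)) pairs; the function `compact` is made up for illustration and ignores deletes and the different compaction strategies.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest cell per key.

    Each SSTable is a sequence of (key, (timestamp, value)) pairs.
    """
    newest = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable:
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    return tuple(sorted(newest.items()))


old = (("Smith", (2, "Ann")), ("Tavante", (1, "Hanneli")))
new = (("Tavante", (3, "H.")),)
print(compact([old, new]))
```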

Reading data

Data to read

RAM

Disk

SSTable

For CL=ALL or CL=QUORUM, Cassandra compares the responses from the replicas and returns the most recent one (by timestamp)
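Since data comes with timestamps, picking the most recent result can be sketched as "highest timestamp wins". The replica names, the sample values, and both functions below are hypothetical; replicas found to be behind are the ones a read repair would bring up to date.

```python
def resolve_read(responses):
    """Pick the value with the highest timestamp among replica responses.

    `responses` maps a replica name to a (timestamp, value) pair.
    """
    return max(responses.values(), key=lambda cell: cell[0])[1]


def stale_replicas(responses):
    """Replicas whose answer is older than the newest one."""
    newest_timestamp = max(timestamp for timestamp, _ in responses.values())
    return sorted(
        replica
        for replica, (timestamp, _) in responses.items()
        if timestamp < newest_timestamp
    )


responses = {
    "node-a": (3, "H."),
    "node-b": (1, "Hanneli"),  # missed the latest write
    "node-c": (3, "H."),
}
print(resolve_read(responses), stale_replicas(responses))
```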

Data Model

A simplified scenario for Facebook

Entities: User, Post, Comment, Like

A simplified scenario for Music

Entities: Music app, Playlist, Track, Artist, Post, Comment, Like

Several tables with duplicated information, one per query, to speed up writing and reading

track_by_id, track_by_user, track_by_artist, track_by_style, artist_by_name, artist_by_style

PRIMARY KEY
((k_part_one, k_part_two), 
k_clust_one, 
k_clust_two, 
k_clust_three)

Clustering Columns - define data order
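A small model of what the two parts of the primary key do: the partition key groups rows together (and decides which node holds them), and the clustering columns define the order of rows inside that partition. The class `ToyTable` and the columns `artist`, `year` and `title` are assumptions made for illustration; only the table name track_by_artist comes from the slides.

```python
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key and ordered by clustering columns."""

    def __init__(self, partition_columns, clustering_columns):
        self.partition_columns = partition_columns
        self.clustering_columns = clustering_columns
        # partition key -> {clustering key -> row}
        self.partitions = defaultdict(dict)

    def insert(self, row: dict):
        partition_key = tuple(row[c] for c in self.partition_columns)
        clustering_key = tuple(row[c] for c in self.clustering_columns)
        # The same full primary key overwrites the previous row (an upsert).
        self.partitions[partition_key][clustering_key] = row

    def select(self, *partition_values):
        """Return one partition's rows, sorted by the clustering columns."""
        rows = self.partitions.get(tuple(partition_values), {})
        return [rows[key] for key in sorted(rows)]


# Hypothetical track_by_artist: PRIMARY KEY ((artist), year, title)
track_by_artist = ToyTable(["artist"], ["year", "title"])
track_by_artist.insert({"artist": "artist-1", "year": 2005, "title": "song-c"})
track_by_artist.insert({"artist": "artist-1", "year": 2001, "title": "song-b"})
track_by_artist.insert({"artist": "artist-1", "year": 2001, "title": "song-a"})
track_by_artist.insert({"artist": "artist-2", "year": 1999, "title": "song-z"})

print([row["title"] for row in track_by_artist.select("artist-1")])
# ['song-a', 'song-b', 'song-c']
```

Reading all tracks for one artist touches a single partition and comes back already ordered, which is why each query gets its own table.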

Best Combos!

Cassandra + Spark

Event tracking

Cassandra + Spark + Kafka

Event tracking / Realtime streaming

Cassandra + Solr

Cassandra + ...

Cassandra is very flexible and might be a good fit for several scenarios

How to get started

CCM - Cassandra Cluster Manager

 A good tool to simulate a cluster on your local machine

DataStax Academy

References

Extra - challenges

Cassandra top complaints

Repairs - update outdated data

Problems with CL

Modelling

Special Thanks

  • @PatrickMcFadin and @wheresLINA
  • B.C., for the constant support

Thank you :)

Questions?

 

hannelita@gmail.com

@hannelita

Cassandra - an introduction - Devoxx US 2017

By Hanneli Tavante (hannelita)
