Cassandra - an introduction

"Yet another database"

"Marketing"

Hi!

  • Computer Engineer
  • Programming
  • Electronics
  • Math <3 <3
  • Physics
  • Lego
  • Meetups
  • Animals
  • Coffee
  • GIFs

#OSB2015 - video

Related motivation

Cassandra - Useful! 

Agenda

  • Motivation
  • Architecture
  • Write and Read
  • Data Model
  • Best Combos!
  • How to get started

Disclaimer

Introductory content

Crash course (nothing advanced)

Lots of theory

And some GIFs

Motivation

What happens when you put more than 300 TB into a relational database?

1. Denormalise it.

It's easy to mess up the data. DBAs may be sad.

2. Master/Slave

Single point of failure!

3. Sharding

Schema updates and data consistency problems

Questions for you:

  • Do you need consistency 100% of the time?
  • Do you really need ACID?
  • How can you ensure HA?
  • Wouldn't the previous strategies cause some of these problems?

[Diagram: Denormalise (3rd normal form), Master/Slave (Consistency), Sharding (Consistency/Availability)]

What would the ideal scenario be?

Better scenario

  • Put Consistency aside sometimes
  • No Master/Slave
  • Scale with commodity hardware, linearly
  • No manual sharding

Cassandra first impressions

./bin/cassandra
./bin/cqlsh
CREATE KEYSPACE osbridge WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'
};

USE osbridge;
CREATE TABLE users (
  firstname text,
  lastname text,
  age int,
  email text,
  city text,
  PRIMARY KEY (lastname)
);

CQL - Very similar to SQL

Architecture

Nodes (machines)

Nodes interact with each other

Hello!

Hi!

Hey

Yo!

No master/slave

Gossip Protocol

I handle 0-24!

I handle 25-49!

I handle 50-74!

I handle 75-99!

Hash ring

Where do these ranges come from?

CREATE TABLE users (
  firstname text,
  lastname text,
  age int,
  email text,
  city text,
  PRIMARY KEY (lastname)
);

lastname is the partition key

Its value runs through a hash function, and the resulting number falls into one of the ranges

"Tavante" -> 11 (handled by the 0-24 node)
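A minimal sketch of the lookup above, in Python. Assumptions: the node names are made up, the ring uses the 0-99 ranges from the slides, and MD5 folded into 0-99 stands in for the real partitioner, so the number printed for "Tavante" will not necessarily be 11.

```python
import hashlib
from bisect import bisect_left

# Inclusive upper bound of the range each (hypothetical) node handles.
RING = [(24, "node-a"), (49, "node-b"), (74, "node-c"), (99, "node-d")]


def token_for(partition_key):
    """Toy partitioner: hash the key and fold the result into 0-99."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 100


def owner_of(token):
    """Return the node whose range contains the token."""
    bounds = [upper for upper, _ in RING]
    return RING[bisect_left(bounds, token)][1]


token = token_for("Tavante")
print(token, owner_of(token))
```

The same key always hashes to the same token, so any node can work out where a row lives without asking a master.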

How can we guarantee availability? 

Replication Factor (RF)

CREATE KEYSPACE osbridge WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'
};

USE osbridge;

RF - the number of copies of each piece of data in your cluster

[Diagram: with RF=3, the value "OSBridge" ends up on three nodes of the ring]

Async copy (no worries! If a node goes down before it receives its copy, there is a mechanism to help with that - 'hinted handoff')

Replication is async, and RF is tied to the keyspace.
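A rough sketch of how SimpleStrategy picks the nodes that hold the copies: the node that owns the range comes first, then the next nodes clockwise on the ring. The node names and the four-node ring are assumptions for illustration, and the RF guard is a toy-ring simplification, not Cassandra behaviour.

```python
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical ring, clockwise


def replicas_for(owner_index, replication_factor):
    """Owner first, then the next nodes clockwise until RF copies exist."""
    if replication_factor > len(NODES):
        # Guard for this toy ring only.
        raise ValueError("RF is larger than the number of nodes in the toy ring")
    return [NODES[(owner_index + step) % len(NODES)]
            for step in range(replication_factor)]


print(replicas_for(3, 3))  # ['node-d', 'node-a', 'node-b']
```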

Consistency problems

Is there a way to 'increase' the consistency?

Yes! When we perform a read or write, we can ask for a specific number of responses and check if they agree. This is the idea of the Consistency Level (CL) 

[Diagram: with CL=ALL, the read query goes to every replica and all of them must answer]

CL is per query.

CL=ALL, CL=ONE, CL=QUORUM

Example

CL=ONE, RF=3: you will have 3 copies, but one acknowledged write is enough to reply OK to the client

Trade-off between availability and consistency
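The trade-off can be put into numbers. A small sketch, covering only the three levels named above: how many replicas must answer for each CL, and the usual rule of thumb that a read is guaranteed to see the latest write when read responses + write responses > RF. The function names are made up for this example.

```python
def required_responses(consistency_level, replication_factor):
    """Number of replicas that must answer before the client gets a reply."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,  # a majority of the replicas
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def reads_see_latest_write(write_cl, read_cl, replication_factor):
    """True when every read set overlaps every write set (R + W > RF)."""
    writes = required_responses(write_cl, replication_factor)
    reads = required_responses(read_cl, replication_factor)
    return writes + reads > replication_factor


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_responses(level, 3))
```

With RF=3, ONE/ONE is the most available choice but may return stale data; QUORUM/QUORUM tolerates one node being down and still overlaps.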

Write and Read

Writing data

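[Diagram: incoming Data is appended to the Commit Log (Disk) and stored in the Memtable (RAM); the Memtable is later flushed to an SSTable (Disk)]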

Data comes with timestamps

Immutable structures

Fast Write
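A toy model of this write path, to show why writes are fast: an append to a log, an in-memory update, and an occasional flush to an immutable structure. The class name, the tiny memtable limit, and the in-memory lists standing in for disk are all assumptions of the sketch; real Cassandra also keeps the newest timestamp per cell and recycles the commit log after a flush, which is not modelled here.

```python
import time


class ToyNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only, stands in for the on-disk log
        self.memtable = {}     # in RAM: key -> (timestamp, value)
        self.sstables = []     # flushed tables, never modified afterwards
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp=None):
        timestamp = time.time() if timestamp is None else timestamp
        self.commit_log.append((timestamp, key, value))  # durability first
        self.memtable[key] = (timestamp, value)          # then RAM
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, read-only SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


node = ToyNode()
node.write("ana", "Lisbon", timestamp=1)
node.write("bob", "Porto", timestamp=2)
print(len(node.sstables), node.memtable)
```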

[Diagram: SSTables pile up on disk]

Too many SSTables

Compaction

[Diagram: compaction merges them into fewer SSTables]
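Since SSTables are immutable and data comes with timestamps, compaction can be pictured as a merge that keeps the newest version of each key. A simplified sketch, assuming each SSTable is a dict of key -> (timestamp, value); deletes (tombstones) and the real compaction strategies are left out.

```python
def compact(sstables):
    """Merge several SSTables into one, keeping the newest value per key."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return dict(sorted(merged.items()))


older = {"ana": (1, "Lisbon"), "bob": (1, "Porto")}
newer = {"ana": (5, "Portland")}
print(compact([older, newer]))
# {'ana': (5, 'Portland'), 'bob': (1, 'Porto')}
```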

Reading data

[Diagram: a read looks for the Data in the Memtable (RAM) and in the SSTables (Disk)]

For CL=ALL or CL=QUORUM, Cassandra compares the answers from the replicas and returns the most recent one
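A sketch of that comparison, assuming each replica answers with a (timestamp, value) pair and that node names are made up. The newest timestamp wins; the list of stale replicas is what a repair step would then bring up to date.

```python
def resolve_read(responses):
    """Return the newest value and the replicas that answered with older data."""
    newest_ts, newest_value = max(responses.values(), key=lambda r: r[0])
    stale = sorted(node for node, (ts, _) in responses.items() if ts < newest_ts)
    return newest_value, stale


answers = {"node-a": (10, "Portland"), "node-b": (7, "Lisbon")}
print(resolve_read(answers))  # ('Portland', ['node-b'])
```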

Data Model

Several tables with duplicated information, one per query, to speed up writing and reading

track_by_id, track_by_user, track_by_artist, track_by_style, artist_by_name, artist_by_style

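PRIMARY KEY ((k_part_one, k_part_two), k_clust_one, k_clust_two, k_clust_three)

Partition Key (k_part_one, k_part_two) - defines which node stores the data

Clustering Columns (k_clust_one, k_clust_two, k_clust_three) - define data order inside the partition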
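One way to picture the two halves of the primary key: the partition key picks a bucket, and the clustering columns keep the rows inside that bucket sorted. The sketch below uses a hypothetical track_by_user table with made-up columns (user, played_at, title); it does not model the fact that writing the same full primary key twice overwrites the row.

```python
from collections import defaultdict


def insert(table, row, partition_cols, clustering_cols):
    """Store a row in its partition, ordered by the clustering columns."""
    partition_key = tuple(row[c] for c in partition_cols)
    clustering_key = tuple(row[c] for c in clustering_cols)
    rows = table[partition_key]
    rows.append((clustering_key, row))
    rows.sort(key=lambda entry: entry[0])  # clustering columns define order


track_by_user = defaultdict(list)
for user, played_at, title in [("ana", 3, "C"), ("ana", 1, "A"), ("bob", 2, "B")]:
    insert(track_by_user,
           {"user": user, "played_at": played_at, "title": title},
           partition_cols=["user"], clustering_cols=["played_at"])

print([row["title"] for _, row in track_by_user[("ana",)]])  # ['A', 'C']
```

Reading one user's tracks touches a single partition and the rows come back already ordered, which is why each query gets its own table.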

Best Combos!

Cassandra + Spark

Event tracking

Cassandra + Spark + Kafka

Event tracking / Realtime streaming

Cassandra + Solr

Cassandra + ...

Cassandra is very flexible and might be a good fit for several scenarios

How to get started

CCM - Cassandra Cluster Manager

A good tool to simulate a cluster on your local machine

DataStax Academy

References

Special Thanks

  • @PatrickMcFadin
  • @wheresLINA
  • @planetcassandra
  • @lafp, @romulostorel and @pedrofelipee (GIFs)

Thank you :)

Questions?

 

hannelita@gmail.com

@hannelita