Cassandra - an introduction

"Yet another database"

"Marketing"

Hi!

  • Computer Engineer
  • Programming
  • Electronics
  • Math <3 <3
  • Physics
  • Lego
  • Meetups
  • Animals
  • Coffee
  • GIFs

Cassandra - Useful! 

  • Motivation
  • Architecture
  • Write and Read
  • Data Model
  • Best Combos!
  • How to get started

Agenda

Disclaimer

Introductory content

Crash course (nothing advanced)

Lots of theory

And some GIFs

Install Cassandra for the hands-on

Datastax distribution or raw Apache Cassandra:

http://www.planetcassandra.org/cassandra/

(v 3.9)

Motivation

What happens if you put > 300 TB into a relational database?

1. Denormalise it.

It's easy to mess up the data. DBAs may be sad.

2. Master/Slave

Single point of failure!

3. Sharding

Schema updates and data consistency problems

Questions for you:

  • Do you need consistency 100% of the time?
  • Do you really need ACID?
  • How can you ensure HA?
  • Would the previous strategies cause any of these problems?

[Diagram: Denormalise - 3rd normal form; Master/Slave - Consistency; Sharding - Consistency/Availability]

What would be the ideal scenario?

Better scenario

  • Put Consistency aside sometimes
  • No Master/Slave
  • Scale with commodity hardware, linearly
  • No manual sharding

Cassandra first impressions

./bin/cassandra
./bin/cqlsh

CREATE KEYSPACE javaone WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'
};

USE javaone;
CREATE TABLE users (
  firstname text,
  lastname text,
  age int,
  email text,
  city text,
  PRIMARY KEY (lastname));

CQL - Very similar to SQL

Brief history

Facebook - Open sourced in 2008

Written mostly in Java

Lots of libs

Potential contributors

Compiled

Performance

JMX for monitoring

Tools for profiling

JVM Tuning + Performance

Convenient to manipulate data structures

Architecture

Nodes (machines)

Nodes interact with each other

Hello!

Hi!

Hey

Yo!

No master/slave

Gossip Protocol

I handle 0-24!

I handle 25-49!

I handle 50-74!

I handle 75-99!

Hash ring

Where do these ranges come from?

CREATE TABLE users (
  firstname text,
  lastname text,
  age int,
  email text,
  city text,
  PRIMARY KEY (lastname));

lastname - the Partition Key

The partition key goes through a hash function, and the result tells us which range (and so which node) owns the row:

"Tavante" -> 11
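
The lookup above can be sketched in a few lines of Python. This is only a toy model: the node names are made up, the ring is the simplified 0-99 one from the slides, and MD5 is a stand-in hash (Cassandra's default partitioner is Murmur3 over a much larger token space), so the token it prints for "Tavante" will not be the illustrative 11.

```python
import hashlib

# Toy ring: four hypothetical nodes, each owning a slice of the 0-99 token
# space, mirroring the ranges on the slides.
RING = [
    (range(0, 25), "node-a"),
    (range(25, 50), "node-b"),
    (range(50, 75), "node-c"),
    (range(75, 100), "node-d"),
]


def token_for(partition_key: str) -> int:
    """Map a partition key to a token in 0-99 (MD5 is a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 100


def owner_of(partition_key: str) -> str:
    """Return the node whose range contains the key's token."""
    token = token_for(partition_key)
    for token_range, node in RING:
        if token in token_range:
            return node
    raise AssertionError("the ranges cover every token in 0-99")


if __name__ == "__main__":
    for lastname in ("Tavante", "Silva", "Souza"):
        print(lastname, token_for(lastname), owner_of(lastname))
```

The same key always lands on the same node, which is what lets any node route a request without asking a master.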

Hands-on #1

Installing the driver

pip install cassandra-driver

Python 3 is recommended

Cassandra 3.x is recommended

Create a keyspace and a table with cqlsh

Let's connect to Cassandra, then insert, retrieve and delete rows from this table.
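
One possible shape for this hands-on, as a sketch rather than the official solution. It assumes a node running on 127.0.0.1, a keyspace called `pythonbr` (the name is an assumption here; use whatever you created in cqlsh) and the `users` table from the earlier slide. The helper names are made up for this example.

```python
def insert_user(session, lastname, firstname, city, email):
    """Insert (or overwrite) one row; lastname is the primary key."""
    session.execute(
        "INSERT INTO users (lastname, firstname, city, email) "
        "VALUES (%s, %s, %s, %s)",
        (lastname, firstname, city, email),
    )


def find_user(session, lastname):
    """Return the row for this lastname, or None if there is none."""
    rows = session.execute(
        "SELECT firstname, city, email FROM users WHERE lastname = %s",
        (lastname,),
    )
    return rows.one()


def delete_user(session, lastname):
    """Delete the row identified by its primary key."""
    session.execute("DELETE FROM users WHERE lastname = %s", (lastname,))


def main():
    # Imported here so the helpers above can be read and tested
    # without the driver installed or a cluster running.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])        # assumption: a local node
    session = cluster.connect("pythonbr")   # assumption: keyspace name
    insert_user(session, "Tavante", "Hanneli", "Sao Paulo", "hanneli@example.com")
    print(find_user(session, "Tavante"))
    delete_user(session, "Tavante")
    print(find_user(session, "Tavante"))
    cluster.shutdown()
```

Call `main()` once Cassandra is up. Note the `%s` placeholders: the driver binds the values for you, so there is no need to build the CQL string by hand.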

How can we guarantee availability? 

Replication Factor (RF)

CREATE KEYSPACE pythonbr WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'
};

USE pythonbr;

RF - the number of copies of each piece of data in your cluster

[Diagram: with RF=3, the pythonbr data is copied to three nodes of the ring]

Async copy (no worries! If a server goes down before the replication completes, there is a mechanism to help with that: 'hinted handoff')

Replication is async, and the RF is tied to the keyspace.
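
A toy sketch of where the copies go, assuming SimpleStrategy: the node that owns the token keeps the first copy and the following nodes around the ring keep the others. The node names and the list-based ring are made up for illustration.

```python
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical ring order


def replicas_for(owner_index, replication_factor, nodes=NODES):
    """Pick the owning node plus the next ones around the ring, no repeats."""
    count = min(replication_factor, len(nodes))
    return [nodes[(owner_index + step) % len(nodes)] for step in range(count)]


if __name__ == "__main__":
    # Row owned by the last node, RF=3: the placement wraps around the ring.
    print(replicas_for(3, 3))
```

With RF=3 on a four-node ring, any single node can go down and two copies of every row are still reachable.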

Creating keyspaces and tables with the Python Driver
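
A minimal sketch of doing this from Python instead of cqlsh. It reuses the `pythonbr` keyspace and the `users` table from the slides; the `IF NOT EXISTS` clauses and the `create_schema` helper are additions for this example, so the script can be run more than once.

```python
KEYSPACE_CQL = """
CREATE KEYSPACE IF NOT EXISTS pythonbr WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '3'
}
"""

TABLE_CQL = """
CREATE TABLE IF NOT EXISTS pythonbr.users (
  firstname text,
  lastname text,
  age int,
  email text,
  city text,
  PRIMARY KEY (lastname)
)
"""


def create_schema(session):
    """Run the DDL in order: the keyspace first, then the table inside it."""
    for statement in (KEYSPACE_CQL, TABLE_CQL):
        session.execute(statement)


def main():
    # Imported here so the rest can be read without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # assumption: a local node
    session = cluster.connect()       # no keyspace yet, we are creating it
    create_schema(session)
    cluster.shutdown()
```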

Consistency problems

Is there a way to 'increase' the consistency?

Yes! When we perform a read or write, we can ask for a specific number of responses and check if they agree. This is the idea of the Consistency Level (CL) 

[Diagram: with CL=ALL, the read query is sent to every replica]

CL is per query.

CL=ALL, CL=ONE, CL=QUORUM

CL=ONE, RF=3 - You will have 3 copies, but one acknowledged write is enough to reply OK to the client

Example

Trade-off between availability and consistency
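
The trade-off comes down to small arithmetic, sketched below. A quorum is a majority of the replicas, and a read is guaranteed to see the latest acknowledged write when the read and write levels add up to more than RF. The function names are made up for this example and only the three levels from the slides are covered.

```python
def quorum(replication_factor):
    """A majority of the replicas."""
    return replication_factor // 2 + 1


def acks_required(level, replication_factor):
    """How many replicas must answer for the given consistency level."""
    table = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return table[level]


def overlap_guaranteed(write_level, read_level, replication_factor):
    """True when every read must touch a replica that took the write (R + W > RF)."""
    needed = (acks_required(write_level, replication_factor)
              + acks_required(read_level, replication_factor))
    return needed > replication_factor


def tolerated_failures(level, replication_factor):
    """How many replicas can be down while the query still succeeds."""
    return replication_factor - acks_required(level, replication_factor)
```

For RF=3: QUORUM needs 2 answers and survives 1 node down, ALL needs 3 and survives none, ONE needs 1 and survives 2 but gives no overlap guarantee on its own.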

Defining the consistency level with the Python driver
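
One way to do it, as a sketch: attach the level to a `SimpleStatement`, so it applies to that query only. It assumes the `users` table from the slides; `quorum_lookup` is a made-up helper name.

```python
def quorum_lookup(session, lastname):
    """Read one user, asking a majority of the replicas to answer."""
    # Imported here so the sketch can be read without the driver installed.
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    statement = SimpleStatement(
        "SELECT firstname, city FROM users WHERE lastname = %s",
        consistency_level=ConsistencyLevel.QUORUM,  # per query, not per table
    )
    return session.execute(statement, (lastname,)).one()
```

Swap in `ConsistencyLevel.ONE` or `ConsistencyLevel.ALL` to move along the availability/consistency scale for that single query.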

Write and Read

Writing data

[Diagram: Data goes to the Commit Log (Disk) and to the Memtable (RAM); the Memtable is later flushed to an SSTable (Disk)]

Data comes with timestamps

Immutable structures

Fast Write
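
A very small model of one node's write path, to make the steps concrete. This is not Cassandra's real code: the class, the tiny flush threshold and the tuple-based SSTable are all made up, and things like commit log cleanup are left out.

```python
import time


class ToyNode:
    """Toy write path: commit log, memtable, immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only, stands in for the log on disk
        self.memtable = {}     # in RAM: key -> (timestamp, value)
        self.sstables = []     # flushed tables, never modified again
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp=None):
        timestamp = time.time() if timestamp is None else timestamp
        self.commit_log.append((timestamp, key, value))  # 1. durability first
        self.memtable[key] = (timestamp, value)          # 2. then RAM
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                                 # 3. spill when full

    def flush(self):
        """Freeze the memtable into an immutable, key-sorted SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
```

Writes are fast because nothing on disk is ever updated in place: the log is appended to and each SSTable is written once.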

Too many SSTables

Compaction
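
The idea behind compaction, as a toy sketch: merge several SSTables into one and keep only the newest version of each key, using the timestamps. Here an SSTable is modelled as a tuple of `(key, (timestamp, value))` pairs, which is an assumption of this example; deletes (tombstones) are not modelled.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest value per key."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable:
            # Keep this version only if it is newer than what we have so far.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return tuple(sorted(merged.items()))


if __name__ == "__main__":
    older = (("a", (1, "old")),)
    newer = (("a", (2, "new")), ("b", (1, "x")))
    print(compact([older, newer]))
```

The inputs are never edited; compaction writes a new table and the old ones can then be dropped.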

Reading data

[Diagram: data to read is looked up in RAM and in the SSTables on Disk]

For CL=ALL or CL=QUORUM, Cassandra compares the answers from the replicas and returns the most recent data (by timestamp)
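
A toy sketch of that comparison: each replica answers with a `(timestamp, value)` pair, or nothing if it has no data, and the newest timestamp wins. The function name and the response format are assumptions made for this example.

```python
def reconcile(responses):
    """Return the value with the newest timestamp among replica answers."""
    present = [answer for answer in responses if answer is not None]
    if not present:
        return None
    _timestamp, value = max(present, key=lambda answer: answer[0])
    return value


if __name__ == "__main__":
    # Three replicas, one of them still holding an older version.
    print(reconcile([(1, "Sao Paulo"), (3, "Campinas"), (3, "Campinas")]))
```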

Data Model

A simplified scenario for Facebook

[Diagram: User, Posts, Likes, Comments]

A simplified scenario for Music

[Diagram: Music app, Playlists, Tracks, Artists, Posts, Comments, Likes]

Several tables with replicated information to speed up writing and reading

track_by_id, track_by_user, track_by_artist, track_by_style, artist_by_name, artist_by_style

PRIMARY KEY
((k_part_one, k_part_two), 
k_clust_one, 
k_clust_two, 
k_clust_three)

Clustering Columns - define data order
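
A toy model of what the two parts of the key do: the partition key decides which group a row belongs to, and the clustering columns decide the order of the rows inside that group. The class is made up for illustration, and the example data imagines a `track_by_artist` table with hypothetical columns (artist as partition key, album and track number as clustering columns).

```python
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key, read back in clustering-key order."""

    def __init__(self):
        # partition key -> {clustering key: row}
        self.partitions = defaultdict(dict)

    def insert(self, partition_key, clustering_key, row):
        self.partitions[partition_key][clustering_key] = row

    def read_partition(self, partition_key):
        """Return the rows of one partition, sorted by clustering key."""
        rows = self.partitions.get(partition_key, {})
        return [rows[key] for key in sorted(rows)]


if __name__ == "__main__":
    tracks = ToyTable()
    tracks.insert(("Artist X",), ("Album B", 1), "Track 3")
    tracks.insert(("Artist X",), ("Album A", 2), "Track 2")
    tracks.insert(("Artist X",), ("Album A", 1), "Track 1")
    print(tracks.read_partition(("Artist X",)))
```

Reading one partition returns rows already in order, which is why the key is designed around the query you want to run.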

Notes about cluster

1. load_balancing_policy

2. RF and CL problems

Best Combos!

Cassandra + Spark

Event tracking

Cassandra + Spark + Kafka

Event tracking / Realtime streaming

Cassandra + Solr

Cassandra + ...

Cassandra is very flexible and might be a good fit for several scenarios

How to get started

CCM - Cassandra Cluster Manager

A good tool to simulate a cluster on your local machine

cqlengine

Object Mapper for Python, bundled with the driver

Datastax Academy

References

References - code

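from cassandra.cluster import Cluster

# Connect to a local node and use the 'playlist' keyspace
cluster = Cluster()
session = cluster.connect('playlist')

# Insert one user (lastname is the primary key)
session.execute("""
    INSERT INTO users (lastname, city, email, firstname)
    VALUES ('Tavante', 'Sao Paulo', 'hanneli@example.com', 'Hanneli')
""")

# Read the row back; one() returns the first row, or None if nothing matched
result = session.execute("SELECT * FROM users WHERE lastname = 'Tavante'").one()
print(result.firstname, result.city)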

References - code

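# Alternative cluster ref

from cassandra.cluster import Cluster
from cassandra.policies import (
    DCAwareRoundRobinPolicy, RetryPolicy, TokenAwarePolicy)

cluster = Cluster(
    contact_points=['127.0.0.1'],
    # Prefer the replicas that own the token, within the local datacenter
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='datacenter1')),
    default_retry_policy=RetryPolicy(),
)
session = cluster.connect('demo')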

References - Github official Example

https://github.com/dkoepke/cassandra-python-driver/blob/master/example.py

 

Extra - challenges

Garbage Collector (GC)

Repairs - update outdated data

Modelling

Special Thanks

  • @PatrickMcFadin
  • @wheresLINA
  • B.C.
  • @planetcassandra
  • @lafp, @romulostorel and @pedrofelipee (GIFs)

Thank you :)

Questions?

 

hannelita@gmail.com

@hannelita

Python Brasil - Cassandra - an introduction

By Hanneli Tavante (hannelita)
