Cassandra - an introduction
"Yet another database"
"Marketing"
Hi! I'm Hanneli (@hannelita)
- Computer Engineer
- Programming
- Electronics
- Math <3 <3
- Physics
- Lego
- Meetups
- Animals
- Coffee
- GIFs
- Pokémon
Cassandra - Useful!
Agenda
- Motivation
- Architecture
- Write and Read
- Data Model
- Best Combos!
- How to get started
Disclaimer
Introductory content
Crash course (nothing advanced)
Lots of theory
And some GIFs
Motivation
What happens if you have more than 300 TB in a relational database?
1. Denormalise it.
It's easy to mess up the data. DBAs may be sad.
2. Master/Slave
Single point of failure!
3. Sharding
Schema updates and data consistency problems
Questions for you:
- Do you need consistency 100% of the time?
- Do you really need ACID?
- How can you ensure HA?
- Wouldn't the previous strategies cause any of these problems?
Denormalise - 3rd normal form
Master/Slave - Consistency
Sharding - Consistency/Availability
What would be the ideal scenario?
Better scenario
- Put Consistency aside sometimes
- No Master/Slave
- Scale with commodity hardware, linearly
- No manual sharding
Cassandra first impressions
./bin/cassandra
./bin/cqlsh
CREATE KEYSPACE confoo WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': '3'
};
USE confoo;
CREATE TABLE users (
    firstname text,
    lastname text,
    age int,
    email text,
    city text,
    PRIMARY KEY (lastname));
CQL - Very similar to SQL
Brief history
Facebook - Open sourced in 2008
Written mostly in Java
Lots of libs
Potential contributors
Compiled
Performance
JMX for monitoring
Tools for profiling
JVM Tuning + Performance
Convenient to manipulate data structures
Architecture
Nodes (machines)
Node (machine)
JVM with Cassandra
nodetool => information about each node
Nodes interact with each other
Hello!
Hi!
Hey
Yo!
No master/slave
Gossip Protocol
I handle 0-24!
I handle 25-49!
I handle 50-74!
I handle 75-99!
Hash ring
DATA
Coordinator
Where do these ranges come from?
CREATE TABLE users (
    firstname text,
    lastname text,
    age int,
    email text,
    city text,
    PRIMARY KEY (lastname));
Part of the Partition Key
The partition key runs through a hash function, and the resulting token tells us which range it belongs to
"Tavante" -> 11
(Actual token ranges go from -2^63 to 2^63 - 1)
Your driver (Python, Java, Ruby) may be able to calculate this token and send the data to the proper node
You can choose among several Partitioner algorithms
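To make the hash ring concrete, here is a minimal Python sketch of the toy 0-99 ring from the slides. The node names are made up, and MD5 is only a stand-in hash for illustration; it is not the partitioner Cassandra actually uses, and real tokens span a much larger range.

```python
import bisect
import hashlib

# Toy ring from the slides: four nodes, tokens 0-99.
# Each entry is (last token the node owns, hypothetical node name).
RING = [(24, "node-a"), (49, "node-b"), (74, "node-c"), (99, "node-d")]


def toy_token(partition_key: str) -> int:
    """Hash a partition key to a token in 0-99 (MD5 is a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def owner(token: int) -> str:
    """Return the node whose range contains the token."""
    upper_bounds = [upper for upper, _ in RING]
    return RING[bisect.bisect_left(upper_bounds, token)][1]


if __name__ == "__main__":
    token = toy_token("Tavante")
    print(token, owner(token))
```

The same partition key always hashes to the same token, so any node (or a token-aware driver) can work out where a row lives without asking a master.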
How can we guarantee availability?
Replication Factor (RF)
CREATE KEYSPACE confoo WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': '3'
};
USE confoo;
RF - number of copies of each piece of data in your cluster
RF=3
confoo
DATA
Async copy (no worries! If a server goes down before the replication completes, there is a mechanism to help with that: 'hinted handoff')
Replication is async, and RF is tied to the keyspace.
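A rough sketch of how RF copies could be placed on the ring, assuming SimpleStrategy-like behaviour: the node that owns the token gets the first copy, and the next nodes clockwise get the others. The node names and the `replicas` helper are hypothetical, for illustration only.

```python
from typing import List

# Hypothetical nodes, listed in ring order.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def replicas(owner_index: int, replication_factor: int) -> List[str]:
    """Return the owner plus the next nodes clockwise, one per copy."""
    if not 1 <= replication_factor <= len(NODES):
        raise ValueError("RF must be between 1 and the number of nodes")
    return [
        NODES[(owner_index + step) % len(NODES)]
        for step in range(replication_factor)
    ]


if __name__ == "__main__":
    # The last node owns the token; the ring wraps around for the copies.
    print(replicas(owner_index=3, replication_factor=3))
```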
Consistency problems
Is there a way to 'increase' the consistency?
Yes! When we perform a read or write, we can ask for a specific number of responses and check if they agree. This is the idea of the Consistency Level (CL)
CL=ALL
Read query
RF = 3
CL is per query.
CL=ALL, CL=ONE, CL=QUORUM
Example
CL=ONE, RF=3 - You will have 3 copies, but one acknowledged write is enough to reply OK to the client
Trade-off between availability and consistency
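The arithmetic behind CL can be sketched in a few lines of Python. The function names are made up for this example; the rule of thumb it encodes is that a read is guaranteed to overlap with the latest acknowledged write when (write acks + read acks) > RF.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """How many replicas must answer for the given CL."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,  # a majority of the replicas
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def read_overlaps_write(write_cl: str, read_cl: str, rf: int) -> bool:
    """True when every read set must share a replica with every write set."""
    return required_acks(write_cl, rf) + required_acks(read_cl, rf) > rf


if __name__ == "__main__":
    for write_cl, read_cl in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
        print(write_cl, read_cl, read_overlaps_write(write_cl, read_cl, rf=3))
```

With RF=3, QUORUM means 2 replicas, so QUORUM writes plus QUORUM reads overlap while ONE plus ONE does not. The price of the stricter levels is availability: CL=ALL fails as soon as one replica is down.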
Write and Read
Writing data
Data
RAM - Memtable
Disk - Commit Log, SSTable
Data comes with timestamps
Immutable structures
Fast Write
Too many SSTables
Compaction - merges many SSTables into fewer ones
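A toy model of the write path, as a sketch only: a write is appended to the commit log, stored in the memtable, and flushed to an immutable SSTable when the memtable is full; compaction then merges SSTables and keeps the newest timestamp per key. The `ToyNode` class and its tiny `memtable_limit` are invented for illustration, and everything lives in plain Python structures instead of real files.

```python
class ToyNode:
    """Very simplified write path: commit log -> memtable -> SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only list, stands in for the on-disk log
        self.memtable = {}    # in RAM: key -> (timestamp, value)
        self.sstables = []    # flushed tables, never modified after a flush
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((timestamp, key, value))  # durability first
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the full memtable into a new SSTable and start a fresh one."""
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def compact(self):
        """Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:
            for key, (timestamp, value) in table.items():
                if key not in merged or timestamp > merged[key][0]:
                    merged[key] = (timestamp, value)
        self.sstables = [merged]
```

Writes never update an SSTable in place, which is why they are fast and why old versions pile up until compaction runs.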
Reading data
Data to read
RAM
Disk
SSTable
For CL=ALL or CL=QUORUM, Cassandra compares the replicas' responses and returns the most recent data (by timestamp)
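Since data comes with timestamps, picking the "most updated" result can be sketched as "newest timestamp wins". The helper names and the sample responses below are hypothetical; a real coordinator does much more than this.

```python
from typing import Dict, List, Tuple

Response = Tuple[int, str]  # (write timestamp, value)


def newest(responses: Dict[str, Response]) -> Response:
    """Pick the response with the highest write timestamp."""
    return max(responses.values(), key=lambda response: response[0])


def stale_nodes(responses: Dict[str, Response]) -> List[str]:
    """Nodes whose answer differs from the newest one (candidates for repair)."""
    latest = newest(responses)
    return sorted(node for node, response in responses.items() if response != latest)


if __name__ == "__main__":
    answers = {
        "node-a": (10, "old@example.com"),
        "node-b": (20, "new@example.com"),
        "node-c": (20, "new@example.com"),
    }
    print(newest(answers), stale_nodes(answers))
```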
Data Model
A simplified scenario for Facebook
User
Post
Like
Comment
A simplified scenario for Music
Music app
Playlist
Track
Artist
Post
Comment
Like
Several tables with duplicated information to speed up writing and reading
track_by_id, track_by_user, track_by_artist, track_by_style, artist_by_name, artist_by_style
PRIMARY KEY (
    (k_part_one, k_part_two),
    k_clust_one,
    k_clust_two,
    k_clust_three)
(k_part_one, k_part_two) - Partition Key
Clustering Columns - define data order within a partition
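One way to picture this primary key in Python: the partition key decides which group a row lands in, and the clustering columns decide the order of rows inside that group. The `build_partitions` helper and the track columns are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict


def build_partitions(rows, partition_cols, clustering_cols):
    """Group rows by partition key, then sort each group by clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        partition_key = tuple(row[col] for col in partition_cols)
        partitions[partition_key].append(row)
    for stored_rows in partitions.values():
        stored_rows.sort(key=lambda row: tuple(row[col] for col in clustering_cols))
    return dict(partitions)


if __name__ == "__main__":
    tracks = [
        {"artist": "A", "year": 2015, "title": "Zeta"},
        {"artist": "B", "year": 2010, "title": "Solo"},
        {"artist": "A", "year": 2012, "title": "Beta"},
        {"artist": "A", "year": 2012, "title": "Alpha"},
    ]
    by_artist = build_partitions(tracks, ["artist"], ["year", "title"])
    for key, stored in by_artist.items():
        print(key, [(row["year"], row["title"]) for row in stored])
```

A query that names the full partition key only needs to touch one group, and the rows come back already sorted.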
Best Combos!
Cassandra + Spark
Event tracking
Cassandra + Spark + Kafka
Event tracking / Realtime streaming
Cassandra + Solr
Cassandra + ...
Cassandra is very flexible and might be a good fit for several scenarios
How to get started
CCM - Cassandra Cluster Manager
A good tool to simulate a cluster on your local machine
DataStax Academy
References
Extra - challenges
Cassandra top complaints
Repairs - update outdated data
Problems with CL
Modelling
Special Thanks
- @PatrickMcFadin and @wheresLINA
- B.C., for the constant support
Thank you :)
Questions?
hannelita@gmail.com
@hannelita
Cassandra - an introduction - Devoxx US 2017
By Hanneli Tavante (hannelita)