Cassandra data modeling 101

the ring



KEYSPACES & REPLICATION

CREATE KEYSPACE xdcr WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 2
};

01_simple_replication_strategy.cql
SimpleStrategy
  • Intended for a single data center only
  • Places the first replica on the node determined by the partitioner
  • Places additional replicas on the next nodes clockwise in the ring, without considering topology (rack or data center location)
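
The clockwise placement described above can be sketched in a few lines of Python. This is only an illustration, assuming one token per node (no virtual nodes) and small made-up token values; the function name, node names and ring are hypothetical and are not Cassandra APIs.

```python
from bisect import bisect_left

def simple_strategy_replicas(ring, key_token, replication_factor):
    """Return the nodes that would hold replicas for key_token.

    ring: list of (token, node_name) pairs, one token per node (assumption).
    """
    ordered = sorted(ring)
    tokens = [token for token, _ in ordered]
    # First replica: the first node whose token is >= the key's token,
    # wrapping around to the start of the ring.
    start = bisect_left(tokens, key_token) % len(ordered)
    count = min(replication_factor, len(ordered))
    # Additional replicas: the next nodes clockwise, ignoring racks and data centers.
    return [ordered[(start + i) % len(ordered)][1] for i in range(count)]

# Hypothetical four-node ring with made-up tokens.
ring = [(0, "node_a"), (25, "node_b"), (50, "node_c"), (75, "node_d")]
print(simple_strategy_replicas(ring, key_token=30, replication_factor=2))
```

With this ring, a key whose token is 30 lands on node_c and is then copied to node_d, the next node clockwise.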

keyspaces & replication 2


CREATE KEYSPACE xdcr WITH replication = {
     'class': 'NetworkTopologyStrategy', 
     'datacenter_1': 1, 
     'datacenter_2': 2 
};



02_network_replication_strategy.cql
NetworkTopologyStrategy
  • Use when the cluster is deployed across multiple data centers
  • Specify how many replicas you want in each data center
  • Within each data center, places replicas by walking the ring clockwise until it reaches the first node in another rack
  • Attempts to place replicas on distinct racks, because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues
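
A rough Python sketch of the rack-aware walk described above. It is a simplification under stated assumptions (one token per node, made-up tokens, hypothetical node and rack names) and does not reproduce Cassandra's actual implementation.

```python
from bisect import bisect_left

def nts_replicas(ring, key_token, replicas_per_dc):
    """ring: list of (token, node, dc, rack); replicas_per_dc: {dc: count}."""
    ordered = sorted(ring)
    tokens = [entry[0] for entry in ordered]
    start = bisect_left(tokens, key_token) % len(ordered)
    # Nodes in clockwise order, starting from the key's position on the ring.
    clockwise = [ordered[(start + i) % len(ordered)] for i in range(len(ordered))]

    result = {}
    for dc, wanted in replicas_per_dc.items():
        chosen, seen_racks, skipped = [], set(), []
        for _, node, node_dc, rack in clockwise:
            if node_dc != dc:
                continue
            if rack in seen_racks:
                skipped.append(node)  # same rack as an earlier replica
            else:
                chosen.append(node)
                seen_racks.add(rack)
        # Fall back to already-used racks only if distinct racks run out.
        result[dc] = (chosen + skipped)[:wanted]
    return result

# Hypothetical two-data-center ring.
ring = [
    (0, "a1", "datacenter_1", "rack1"),
    (10, "b1", "datacenter_2", "rack1"),
    (20, "b2", "datacenter_2", "rack1"),
    (30, "b3", "datacenter_2", "rack2"),
    (40, "a2", "datacenter_1", "rack2"),
]
placement = nts_replicas(ring, key_token=5, replicas_per_dc={"datacenter_1": 1, "datacenter_2": 2})
print(placement)
```

In this example datacenter_2 gets b1 and b3: b2 is skipped because it shares rack1 with b1.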





Partition keys 

clustering columns

Cassandra table



 SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
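
To make the nested sorted map concrete, here is a minimal Python sketch. Python has no built-in sorted map, so a plain dict that is sorted on read stands in for one; the helper names put and scan, and the sample keys, are hypothetical.

```python
def put(table, row_key, column_key, value):
    """Store a value under table[row_key][column_key]."""
    table.setdefault(row_key, {})[column_key] = value

def scan(table):
    """Yield (row_key, column_key, value) with both levels in sorted order."""
    for row_key in sorted(table):
        columns = table[row_key]
        for column_key in sorted(columns):
            yield row_key, column_key, columns[column_key]

# Column keys are modeled as (clustering value, column name) tuples.
table = {}
put(table, 2, ("2015-01-31 20:00Z", "bool1"), True)
put(table, 1, ("2015-01-31 20:00Z", "bool2"), False)
put(table, 1, ("2014-12-31 20:00Z", "bool1"), True)

for entry in scan(table):
    print(entry)
```

Entries come back grouped by row key and, within a row, ordered by column key, regardless of insertion order.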

PRIMARY KEY



 PRIMARY KEY '(' <partition-key> ( ',' <identifier> )* ')'



In CQL, the order in which columns are listed in the PRIMARY KEY matters. The first component of the key is the partition key. All rows sharing the same partition key (even across tables, in fact) are stored on the same physical node (and on its replicas). Also, inserts, updates, and deletes on rows sharing the same partition key in a given table are performed atomically and in isolation. A partition key can be composite, i.e. made of multiple columns; an extra set of parentheses marks which columns form the partition key. Any remaining columns in the PRIMARY KEY are clustering columns, which determine the sort order of rows within a partition.
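
A small Python sketch of why only the partition key decides where a row lives. It makes several loudly stated assumptions: MD5 stands in for Cassandra's partitioner, a modulo over a hypothetical node list stands in for the token ring, and the names token_for and node_for are invented for illustration.

```python
import hashlib

NODES = ["node_a", "node_b", "node_c"]  # hypothetical cluster

def token_for(partition_key):
    """Hash the partition key columns into an integer token (MD5 as a stand-in)."""
    raw = ":".join(str(part) for part in partition_key).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big")

def node_for(partition_key):
    """Pick a node from the token (modulo is a simplification of the ring)."""
    return NODES[token_for(partition_key) % len(NODES)]

# Rows for PRIMARY KEY ((id, year_month), last_seen):
# the partition key is (id, year_month); last_seen is a clustering column.
rows = [
    {"id": 3, "year_month": 201501, "last_seen": "2014-12-31 20:10Z"},
    {"id": 3, "year_month": 201501, "last_seen": "2015-01-15 08:00Z"},
]
placements = [node_for((row["id"], row["year_month"])) for row in rows]
print(placements)
```

Both rows share the partition key (3, 201501), so they map to the same node; last_seen never enters the hash and only affects ordering inside the partition.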


simple PK example


CREATE TABLE example (
    id int PRIMARY KEY
);

INSERT INTO example (id) VALUES (1);
INSERT INTO example (id) VALUES (2);
INSERT INTO example (id) VALUES (3);




-------------------
RowKey: 1
=> (name=, value=, timestamp=1426254945765723)
-------------------
RowKey: 2
=> (name=, value=, timestamp=1426254945772053)
-------------------
RowKey: 3
=> (name=, value=, timestamp=1426254945777575)







03_single_PK.cql

partitioning & clustering


CREATE TABLE example (
    id int,
    last_seen timestamp,
    bool1 boolean,
    bool2 boolean,
    PRIMARY KEY (id, last_seen)
);

-------------------
RowKey: 1
=> (name=2014-12-31 20\:00Z:, value=, timestamp=1426255185754510)
=> (name=2014-12-31 20\:00Z:bool1, value=01, timestamp=1426255185754510)
=> (name=2014-12-31 20\:00Z:bool2, value=01, timestamp=1426255185754510)
=> (name=2015-01-31 20\:00Z:, value=, timestamp=1426255185822199)
=> (name=2015-01-31 20\:00Z:bool1, value=01, timestamp=1426255185822199)
=> (name=2015-01-31 20\:00Z:bool2, value=00, timestamp=1426255185822199)
04_PK_with_clustering.cql

clustering columns


CREATE TABLE example (
    id int,
    last_seen timestamp,
    bool1 boolean,
    bool2 boolean,
    PRIMARY KEY (id, last_seen, bool1)
);

-------------------
RowKey: 1
=> (name=2014-12-31 20\:00Z:true:, value=, timestamp=1426255345902457)
=> (name=2014-12-31 20\:00Z:true:bool2, value=01, timestamp=1426255345902457)
=> (name=2015-01-31 20\:00Z:true:, value=, timestamp=1426255345951879)
=> (name=2015-01-31 20\:00Z:true:bool2, value=00, timestamp=1426255345951879)
05_PK_with_clustering.cql
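Note: you can't filter on the last clustering column (bool1) on its own; the clustering columns before it (here last_seen) must be restricted as well.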
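
The reason follows from the sorted storage layout, and can be sketched in Python. The clustering keys below are hypothetical sample data, and the sorted list with bisect only approximates how cells are ordered within a partition.

```python
from bisect import bisect_left, bisect_right

# Clustering keys (last_seen, bool1) of one partition, kept in sorted order.
partition = sorted([
    ("2014-12-31 20:00Z", True),
    ("2015-01-31 20:00Z", False),
    ("2015-01-31 20:00Z", True),
    ("2015-02-28 20:00Z", False),
])

def by_last_seen(rows, last_seen):
    """Prefix lookup: matching rows form one contiguous slice."""
    lo = bisect_left(rows, (last_seen, False))
    hi = bisect_right(rows, (last_seen, True))
    return rows[lo:hi]

def by_bool1(rows, bool1):
    """No usable prefix: every row in the partition has to be examined."""
    return [row for row in rows if row[1] == bool1]

print(by_last_seen(partition, "2015-01-31 20:00Z"))
print(by_bool1(partition, True))
```

Restricting last_seen finds one contiguous range with two binary searches, while the rows with bool1 = true are scattered across the partition and can only be found by scanning it.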

composite partition keys


CREATE TABLE example (
    id int,
    year_month int,
    last_seen timestamp,
    bool1 boolean,
    bool2 boolean,
    PRIMARY KEY ((id, year_month), last_seen)
);

-------------------
RowKey: 2:201502
=> (name=2015-01-31 20\:10Z:, value=, timestamp=1426256283358777)
=> (name=2015-01-31 20\:10Z:bool1, value=01, timestamp=1426256283358777)
=> (name=2015-01-31 20\:10Z:bool2, value=00, timestamp=1426256283358777)
-------------------
RowKey: 3:201501
=> (name=2014-12-31 20\:10Z:, value=, timestamp=1426256283327688)
=> (name=2014-12-31 20\:10Z:bool1, value=01, timestamp=1426256283327688)
=> (name=2014-12-31 20\:10Z:bool2, value=00, timestamp=1426256283327688)









06_compound_PK.cql

General ideas


  • Don't think of a relational table; think of a nested sorted map instead
  • There are many ways to model data in Cassandra; the best one depends on your use case and query patterns
  • Model column families (tables) around query patterns, but start with entities and relationships
  • Denormalize and duplicate for read performance, but don't denormalize if you don't need to
  • Think about query patterns and indexes up front
  • Think about the physical storage structure: keep data that is accessed together close together on disk