CASSANDRA SUMMIT 2015

Mario Lazaro

September 24th 2015

#CassandraSummit2015

Multi-Region Cassandra in AWS

WHOAMI

  • Mario Cerdan Lazaro

  • Big Data Engineer

  • Born and raised in Spain

  • Joined GumGum 18 months ago
  • About a year and a half experience with Cassandra
  • #5 Ad Platform in the U.S

  • 10B impressions / month

  • 2,000 brand-safe premium publisher partners

  • 1B+ global unique visitors

  • Daily inventory Impressions processed - 213M

  • Monthly Image Impressions processed - 2.6B

  • 123 employees in seven offices

Agenda

  • Old cluster
  • International Expansion
    • Challenges
    • Testing
    • Modus Operandi
    • Tips
    • Questions & Answers
  •  25 Classic nodes cluster
  • 1 Region / 1 Rack hosted in AWS EC2 US East
  • Version 2.0.8
  • Datastax CQL driver
  • GumGum's metadata including visitors, images, pages, and ad performance
  • Usage: realtime data access and analytics (MR jobs)

OLD C* CLUSTER - MARCH 2015

OLD C* CLUSTER - REALTIME USE CASE

  • Billions of rows
  • Heavy read workload 
    • 60/40
  • TTLs everywhere - Tombstones
  • Heavy and critical use of counters
  • RTB - Read Latency constraints (total execution time ~50 ms) 

OLD C* CLUSTER - ANALYTICS USE CASE

  • Daily ETL jobs to extract / join data from C*
    • ​Hadoop MR jobs
    • AdHoc queries with Presto

International expansion

First Steps

  • Start C* test datacenters in US East & EU West and test how C* multi region works in AWS
  • Run capacity/performance tests. We expect 3x times more traffic in 2015 Q4

First thoughts

  • Use AWS Virtual Private Cloud (VPC)
  • Cassandra & VPC present some connectivity issues challenges
  • Replicate entire data with same number of replicas

TOO GOOD TO BE TRUE ...

CHALLENGES

  • Problems between Cassandra in EC2 Classic / VPC and Datastax Java driver
    • EC2MultiRegionSnitch uses public IPs. EC2 instances do not have an interface with public IP address - Cannot connect between instances in the same region using Public IPs
    • Region to Region connectivity will use public IPs - Trust those IPs or use software/hardware VPN
    • Your application needs to connect to C* using private IPs - Custom EC2 translator
/**
 * Implementation of {@link AddressTranslater} used by the driver that
 * translate external IPs to internal IPs.
 * @author Mario <mario@gumgum.com>
 */
public class Ec2ClassicTranslater implements AddressTranslater {
    private static final Logger LOGGER = LoggerFactory.getLogger(Ec2ClassicTranslater.class);

    private ClusterService clusterService;
    private Cluster cluster;
    private List<Instance> publicDnss;

    @PostConstruct
    public void build() {
        publicDnss = clusterService.getInstances(cluster);
    }

    /**
     * Translates a Cassandra {@code rpc_address} to another address if necessary.
     * <p>
     *
     * @param address the address of a node as returned by Cassandra.
     * @return {@code address} translated IP address of the source.
     */
    public InetSocketAddress translate(InetSocketAddress address) {
        for (final Instance server : publicDnss) {
            if (server.getPublicIpAddress().equals(address.getHostString())) {
                LOGGER.info("IP address: {} translated to {}", address.getHostString(), server.getPrivateIpAddress());
                return new InetSocketAddress(server.getPrivateIpAddress(), address.getPort());
            }
        }
        return null;
    }

    public void setClusterService(ClusterService clusterService) {
        this.clusterService = clusterService;
    }

    public void setCluster(Cluster cluster) {
        this.cluster = cluster;
    }
}
  • Problems between Cassandra in EC2 Classic / VPC and Datastax Java driver
    • EC2MultiRegionSnitch uses public IPs. EC2 instances do not have an interface with public IP address - Cannot connect between instances in the same region using Public IPs

 

  • Problems between Cassandra in EC2 Classic / VPC and Datastax Java driver
    • EC2MultiRegionSnitch uses public IPs. EC2 instances do not have an interface with public IP address - Cannot connect between instances in the same region using Public IPs
    • Region to Region connectivity will use public IPs - Trust those IPs or use software/hardware VPN

 

Challenges

  • Datastax Java Driver Load Balancing
    • Multiple choices
  • Datastax Java Driver Load Balancing
    • Multiple choices
      • DCAware + TokenAware
  • Datastax Java Driver Load Balancing
    • Multiple choices
      • DCAware + TokenAware + ?

 

 

Clients in one AZ attempt to always communicate with C* nodes in the same AZ. We call this zone-aware connections. This feature is built into Astyanax, Netflix’s C* Java client library.

CHALLENGES

  • Zone Aware Connection:
    • Webapps in 3 different AZs: 1A, 1B, and 1C
    • C* datacenter spanning 3 AZs with 3 replicas

1A

1B

1C

1B

1B

CHALLENGES

  • We added it! -  Rack/AZ awareness to TokenAware Policy

CHALLENGES

  • Third Datacenter: Analytics
    • Do not impact realtime data access
    • Spark on top of Cassandra
    • Spark-Cassandra Datastax connector
    • Replicate specific keyspaces
    • Less nodes with larger disk space
    • Settings are different
      • Ex: Bloom filter chance

CHALLENGES

  • Third Datacenter: Analytics

Cassandra Only DC

Realtime

Cassandra + Spark DC

Analytics

Challenges

  • Upgrade from 2.0.8 to 2.1.5
    • Counters implementation is buggy in pre-2.1 versions

My code never has bugs. It just develops random unexpected features

CHALLENGES

To choose, or not to choose VNodes. That is the question.

(M. Lazaro, 1990  - 2500)

  • Previous DC using Classic Nodes
    • Works with MR jobs

    • Complexity for adding/removing nodes

    • Manual manage token ranges

  • New DCs will use VNodes
    • Apache Spark + Spark Cassandra Datastax connector
    • Easy to add/remove new nodes as traffic increases

testing

Testing

  • Testing requires creating and modifying many C* nodes
    • Create and configuring a C* cluster is time-consuming / repetitive task
    • Create fully automated process for creating/modifying/destroying Cassandra clusters with Ansible
# Ansible settings for provisioning the EC2 instance
---
  ec2_instance_type: r3.2xlarge

  ec2_count: 
    - 0 # How many in us-east-1a ?
    - 7 # How many in us-east-1b ?

  ec2_vpc_subnet:
    - undefined
    - subnet-c51241b2
    - undefined
    - subnet-80f085d9
    - subnet-f9138cd2

  ec2_sg: 
    - va-ops
    - va-cassandra-realtime-private
    - va-cassandra-realtime-public-1
    - va-cassandra-realtime-public-2
    - va-cassandra-realtime-public-3

TESTING - Performance

  • Performance tests using new Cassandra 2.1 Stress Tool:
    • Recreate GumGum metadata / schemas
    • Recreate workload and make it 3 times bigger
    • Try to find limits / Saturate clients
# Keyspace Name
keyspace: stresscql
 
#keyspace_definition: |
#    CREATE KEYSPACE stresscql WITH replication = {'class': #'NetworkTopologyStrategy', 'us-eastus-sandbox':3,'eu-westeu-sandbox':3 }

### Column Distribution Specifications ###

columnspec:
  - name: visitor_id
    size: gaussian(32..32)       #domain names are relatively short
    population: uniform(1..999M)  #10M possible domains to pick from

  - name: bidder_code
    cluster: fixed(5)
  - name: bluekai_category_id
  - name: bidder_custom
    size: fixed(32)
  - name: bidder_id
    size: fixed(32)
  - name: bluekai_id
    size: fixed(32)
  - name: dt_pd
  - name: rt_exp_dt
  - name: rt_opt_out
   
### Batch Ratio Distribution Specifications ###

insert:
  partitions: fixed(1)            # Our partition key is the visitor_id so only insert one per batch
 
  select:    fixed(1)/5        # We have 5 bidder_code per domain so 1/5 will allow 1 bidder_code per batch
 
  batchtype: UNLOGGED             # Unlogged batches
 
 
#
# A list of queries you wish to run against the schema
#
queries:
  getvisitor:
      cql: SELECT bidder_code, bluekai_category_id, bidder_custom, bidder_id, bluekai_id, dt_pd, rt_exp_dt, rt_opt_out FROM partners WHERE visitor_id = ?
      fields: samerow

TESTING - Performance

  • Main worry:
    • Latency and replication overseas
      • Use LOCAL_X consistency levels in your client
      • Only one C* node will contact only one C* node in a different DC for sending replicas/mutations

TESTING - Performance

  • Main worries:
    • Latency

TESTING - Instance type

  • Test all kind of instance types. We decided to go with r3.2xlarge machines for our cluster:
    • 60 GB RAM
    • 8 Cores
    • 160GB Ephemeral SSD Storage for commit logs and saved caches

    • RAID 0 over 4 SSD EBS Volumes for data

  • Performance / Cost and GumGum use case makes r3.2xlarge the best option
  • Disclosure: I2 instance family is the best if you can afford it

TESTING - Upgrade

  • Upgrade C* Datacenter from 2.0.8 to 2.1.5
    • Both versions can cohabit in the same DC
    • New settings and features tried
      • ​DateTieredCompactionStrategy: Compaction for Time Series Data
      • Incremental repairs
      • Counters new architecture

MODUS OPERANDI

MODUS OPERANDI

  • Sum up
    • From: One cluster / One DC in US East 
    • To: One cluster / Two DCs in US East and one DC in EU West

MODUS OPERANDI

  • First step:
    • Upgrade old cluster snitch from EC2Snitch to EC2MultiRegionSnitch
    • Upgrade clients to handle it (aka translators)
    • Make sure your clients do not lose connection to upgraded C* nodes (JIRA DataStax -  JAVA-809)

MODUS OPERANDI

  • Second step:
    • Upgrade old datacenter from 2.0.8 to 2.1.5
      • nodetool upgradesstables (multiple nodes at a time)
      • Not possible to rebuild a 2.1.X C* node from a 2.0.X C* datacenter.
WARN  [Thread-12683] 2015-06-17 10:17:22,845 IncomingTcpConnection.java:91 -
UnknownColumnFamilyException reading from socket;
closing org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=XXX
rebuild

MODUS OPERANDI

  • Third step:
    • Start EU West and new US East DCs within the same cluster
      • Replication factor in new DCs: 0
      • Use dc_suffix to differentiate new Virginia DC from old one
      • Clients do not talk to new DCs. Only C* knows they exist
      • Replication factor to 3 on all except analytics 1 
        • ​Start receiving new data
      • Nodetool rebuild <old-datacenter>
        • ​Old data

MODUS OPERANDI

RF 3

RF 3:0:0:0

RF 3:3:3:1

Clients

US East Realtime

EU West Realtime

US East Analytics

Rebuild

Rebuild

Rebuild

MODUS OPERANDI

From 39d8f76d9cae11b4db405f5a002e2a4f6f764b1d Mon Sep 17 00:00:00 2001
From: mario <mario@gumgum.com>
Date: Wed, 17 Jun 2015 14:21:32 -0700
Subject: [PATCH] AT-3576 Start using new Cassandra realtime cluster

---
 src/main/java/com/gumgum/cassandra/Client.java     | 30 ++++------------------
 .../com/gumgum/cassandra/Ec2ClassicTranslater.java | 30 ++++++++++++++--------
 src/main/java/com/gumgum/cluster/Cluster.java      |  3 ++-
 .../resources/applicationContext-cassandra.xml     | 13 ++++------
 src/main/resources/dev.properties                  |  2 +-
 src/main/resources/eu-west-1.prod.properties       |  3 +++
 src/main/resources/prod.properties                 |  3 +--
 src/main/resources/us-east-1.prod.properties       |  3 +++
 .../CassandraAdPerformanceDaoImplTest.java         |  2 --
 .../asset/cassandra/CassandraImageDaoImplTest.java |  2 --
 .../CassandraExactDuplicatesDaoTest.java           |  2 --
 .../com/gumgum/page/CassandraPageDoaImplTest.java  |  2 --
 .../cassandra/CassandraVisitorDaoImplTest.java     |  2 --
 13 files changed, 39 insertions(+), 58 deletions(-)

Start using new Cassandra DCs

MODUS OPERANDI

RF 0:3:3:1

Clients

US East Realtime

EU West Realtime

US East Analytics

RF 3:3:3:1

MODUS OPERANDI

RF 0:3:3:1

Clients

US East Realtime

EU West Realtime

US East Analytics

RF 3:3:1

Decomission

TIPS

tips - automated Maintenance

Maintenance in a multi-region C* cluster:

  • Ansible + Cassandra maintenance keyspace + email report = zero human intervention!
CREATE TABLE maintenance.history (
  dc text,
  op text,
  ts timestamp,
  ip text,
  PRIMARY KEY ((dc, op), ts)
) WITH CLUSTERING ORDER BY (ts ASC) AND
  bloom_filter_fp_chance=0.010000 AND
  caching='{"keys":"ALL", "rows_per_partition":"NONE"}' AND
  comment='' AND
  dclocal_read_repair_chance=0.100000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.000000 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

CREATE INDEX history_kscf_idx ON maintenance.history (kscf);
gumgum@ip-10-233-133-65:/opt/scripts/production/groovy$ groovy CassandraMaintenanceCheck.groovy -dc us-east-va-realtime -op compaction -e mario@gumgum.com

tips - spark

  • Number of workers above number of total C* nodes in analytics
  • Each worker uses: 
    • 1/4 number of cores of each instance
    • 1/3 total available RAM of each instance
  • Cassandra-Spark connector
    • ​SpanBy
    • .joinWithCassandraTable(:x, :y)
    • Spark.cassandra.output.batch.size.bytes
    • Spark.cassandra.output.concurrent.writes

TIPS - spark

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", cassandraNodes)
  .set("spark.cassandra.connection.local_dc", "us-east-va-analytics")
  .set("spark.cassandra.connection.factory", "com.gumgum.spark.bluekai.DirectLinkConnectionFactory")
  .set("spark.driver.memory","4g")
  .setAppName("Cassandra presidential candidates app")
  • Create "translator" if using EC2MultiRegionSnitch
    • Spark.cassandra.connection.factory

since c* in EU WEST ...

US West Datacenter!

EU West DC

US East DC

Analytics DC

US West DC

Q&A

GumGum is hiring!

Cassandra Summit 2015 - GumGum

By mario2

Cassandra Summit 2015 - GumGum

  • 1,001