Mario Lazaro
September 24th 2015
#CassandraSummit2015
Mario Cerdan Lazaro
Big Data Engineer
Born and raised in Spain
#5 Ad Platform in the U.S.
10B impressions / month
2,000 brand-safe premium publisher partners
1B+ global unique visitors
Daily inventory Impressions processed - 213M
Monthly Image Impressions processed - 2.6B
123 employees in seven offices
/**
 * Implementation of {@link AddressTranslater} used by the driver that
 * translates external IPs to internal IPs.
 * @author Mario <mario@gumgum.com>
 */
public class Ec2ClassicTranslater implements AddressTranslater {
    private static final Logger LOGGER = LoggerFactory.getLogger(Ec2ClassicTranslater.class);
    private ClusterService clusterService;
    private Cluster cluster;
    private List<Instance> publicDnss;

    // Load the cluster's instances once, after the dependencies have been injected.
    @PostConstruct
    public void build() {
        publicDnss = clusterService.getInstances(cluster);
    }

    /**
     * Translates a Cassandra {@code rpc_address} to another address if necessary.
     *
     * @param address the address of a node as returned by Cassandra.
     * @return the private address of the matching instance, or {@code null} if no instance matches.
     */
    @Override
    public InetSocketAddress translate(InetSocketAddress address) {
        for (final Instance server : publicDnss) {
            if (server.getPublicIpAddress().equals(address.getHostString())) {
                LOGGER.info("IP address: {} translated to {}", address.getHostString(), server.getPrivateIpAddress());
                return new InetSocketAddress(server.getPrivateIpAddress(), address.getPort());
            }
        }
        return null;
    }

    public void setClusterService(ClusterService clusterService) {
        this.clusterService = clusterService;
    }

    public void setCluster(Cluster cluster) {
        this.cluster = cluster;
    }
}
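The lookup rule in the class above can be exercised without the driver or an EC2 client. The following is a minimal sketch under stated assumptions: the class name `IpTranslationTable` and the sample addresses are hypothetical, the instances are indexed in a map instead of being scanned, and an unmatched address is returned unchanged rather than as `null`.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the public-to-private lookup; not the class from the slides. */
final class IpTranslationTable {
    private final Map<String, String> publicToPrivate;

    IpTranslationTable(Map<String, String> publicToPrivate) {
        // Defensive copy so later changes to the caller's map do not affect lookups.
        this.publicToPrivate = new HashMap<>(publicToPrivate);
    }

    /** Returns the private address for a known public IP, or the input address otherwise. */
    InetSocketAddress translate(InetSocketAddress address) {
        String privateIp = publicToPrivate.get(address.getHostString());
        if (privateIp == null) {
            return address; // no mapping: keep the address Cassandra reported
        }
        return new InetSocketAddress(privateIp, address.getPort());
    }
}
```

Building the map once keeps each translation a constant-time lookup, which may matter if the driver calls the translater for every node it discovers.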
Clients in one AZ attempt to always communicate with C* nodes in the same AZ. We call this zone-aware connections. This feature is built into Astyanax, Netflix’s C* Java client library.
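To make the idea concrete, here is a minimal sketch of zone-aware ordering: nodes in the client's own availability zone are tried first, and the rest serve as fallback. It is not how Astyanax implements the feature; the class `ZoneAwareOrdering`, the `Node` type and the zone names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of zone-aware host ordering. */
final class ZoneAwareOrdering {

    /** A node address paired with the availability zone it runs in. */
    static final class Node {
        final String address;
        final String zone;

        Node(String address, String zone) {
            this.address = address;
            this.zone = zone;
        }
    }

    /** Returns node addresses with same-zone nodes first, keeping the input order within each group. */
    static List<String> order(List<Node> nodes, String clientZone) {
        List<String> local = new ArrayList<>();
        List<String> remote = new ArrayList<>();
        for (Node node : nodes) {
            if (node.zone.equals(clientZone)) {
                local.add(node.address);
            } else {
                remote.add(node.address);
            }
        }
        local.addAll(remote); // remote-zone nodes are only a fallback
        return local;
    }
}
```

A client in zone 1b given nodes in 1a, 1b, 1c and 1b would contact the two 1b nodes before the others.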
[Diagram labels: availability zones 1A, 1B, 1C, 1B, 1B; Cassandra Only DC (Realtime); Cassandra + Spark DC (Analytics)]
My code never has bugs. It just develops random unexpected features.
To choose, or not to choose VNodes. That is the question.
(M. Lazaro, 1990 - 2500)
Works with MR jobs
Complexity for adding/removing nodes
Manually managed token ranges
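To show what managing tokens by hand involves, here is a minimal sketch that computes evenly spaced `initial_token` values for a ring without vnodes. It assumes the Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1; the class name `InitialTokens` is hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: evenly spaced tokens for a single-token-per-node ring. */
final class InitialTokens {

    /** Returns one token per node, spaced 2^64 / nodeCount apart, starting at -2^63. */
    static List<BigInteger> evenlySpaced(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("nodeCount must be at least 1");
        }
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE); // -2^63
        BigInteger ringSize = BigInteger.ONE.shiftLeft(64);  // 2^64 possible tokens
        BigInteger count = BigInteger.valueOf(nodeCount);
        List<BigInteger> tokens = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            BigInteger offset = ringSize.multiply(BigInteger.valueOf(i)).divide(count);
            tokens.add(min.add(offset));
        }
        return tokens;
    }
}
```

Changing the node count changes every value after the first, which is why adding or removing a node without vnodes means recalculating tokens and moving existing nodes.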
# Ansible settings for provisioning the EC2 instance
---
ec2_instance_type: r3.2xlarge
ec2_count:
- 0 # How many in us-east-1a ?
- 7 # How many in us-east-1b ?
ec2_vpc_subnet:
- undefined
- subnet-c51241b2
- undefined
- subnet-80f085d9
- subnet-f9138cd2
ec2_sg:
- va-ops
- va-cassandra-realtime-private
- va-cassandra-realtime-public-1
- va-cassandra-realtime-public-2
- va-cassandra-realtime-public-3
# Keyspace Name
keyspace: stresscql
#keyspace_definition: |
#  CREATE KEYSPACE stresscql WITH replication = {'class': 'NetworkTopologyStrategy', 'us-eastus-sandbox': 3, 'eu-westeu-sandbox': 3};
### Column Distribution Specifications ###
columnspec:
  - name: visitor_id
    size: gaussian(32..32)        # visitor IDs are always 32 characters long
    population: uniform(1..999M)  # up to 999M distinct visitor IDs to pick from
  - name: bidder_code
    cluster: fixed(5)
  - name: bluekai_category_id
  - name: bidder_custom
    size: fixed(32)
  - name: bidder_id
    size: fixed(32)
  - name: bluekai_id
    size: fixed(32)
  - name: dt_pd
  - name: rt_exp_dt
  - name: rt_opt_out
### Batch Ratio Distribution Specifications ###
insert:
  partitions: fixed(1)  # Our partition key is the visitor_id so only insert one per batch
  select: fixed(1)/5    # We have 5 bidder_code per visitor so 1/5 will allow 1 bidder_code per batch
  batchtype: UNLOGGED   # Unlogged batches
#
# A list of queries you wish to run against the schema
#
queries:
  getvisitor:
    cql: SELECT bidder_code, bluekai_category_id, bidder_custom, bidder_id, bluekai_id, dt_pd, rt_exp_dt, rt_opt_out FROM partners WHERE visitor_id = ?
    fields: samerow
160GB Ephemeral SSD Storage for commit logs and saved caches
RAID 0 over 4 SSD EBS Volumes for data
WARN [Thread-12683] 2015-06-17 10:17:22,845 IncomingTcpConnection.java:91 - UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=XXX
rebuild
RF 3
RF 3:0:0:0
RF 3:3:3:1
Clients
US East Realtime
EU West Realtime
US East Analytics
Rebuild
Rebuild
Rebuild
From 39d8f76d9cae11b4db405f5a002e2a4f6f764b1d Mon Sep 17 00:00:00 2001
From: mario <mario@gumgum.com>
Date: Wed, 17 Jun 2015 14:21:32 -0700
Subject: [PATCH] AT-3576 Start using new Cassandra realtime cluster
---
src/main/java/com/gumgum/cassandra/Client.java | 30 ++++------------------
.../com/gumgum/cassandra/Ec2ClassicTranslater.java | 30 ++++++++++++++--------
src/main/java/com/gumgum/cluster/Cluster.java | 3 ++-
.../resources/applicationContext-cassandra.xml | 13 ++++------
src/main/resources/dev.properties | 2 +-
src/main/resources/eu-west-1.prod.properties | 3 +++
src/main/resources/prod.properties | 3 +--
src/main/resources/us-east-1.prod.properties | 3 +++
.../CassandraAdPerformanceDaoImplTest.java | 2 --
.../asset/cassandra/CassandraImageDaoImplTest.java | 2 --
.../CassandraExactDuplicatesDaoTest.java | 2 --
.../com/gumgum/page/CassandraPageDoaImplTest.java | 2 --
.../cassandra/CassandraVisitorDaoImplTest.java | 2 --
13 files changed, 39 insertions(+), 58 deletions(-)
Start using new Cassandra DCs
RF 0:3:3:1
Clients
US East Realtime
EU West Realtime
US East Analytics
RF 3:3:3:1
RF 0:3:3:1
Clients
US East Realtime
EU West Realtime
US East Analytics
RF 3:3:1
Decommission
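The replication factors on these slides (3, then 3:0:0:0, 3:3:3:1, 0:3:3:1 and finally 3:3:1) read as a sequence of keyspace changes: add the new DCs, replicate to them and rebuild, move clients over, drop the old DC from replication, then decommission its nodes. Below is a minimal sketch that renders one step of such a plan as CQL. The helper `ReplicationPlan`, the keyspace name and any DC names not shown elsewhere in this deck are hypothetical, and a DC with a factor of 0 is simply left out of the statement.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that renders a per-DC replication map as an ALTER KEYSPACE statement. */
final class ReplicationPlan {

    /** Builds the statement; DCs with a replication factor of 0 are omitted. */
    static String alterKeyspace(String keyspace, Map<String, Integer> replicationByDc) {
        StringJoiner options = new StringJoiner(", ");
        options.add("'class': 'NetworkTopologyStrategy'");
        for (Map.Entry<String, Integer> entry : replicationByDc.entrySet()) {
            if (entry.getValue() > 0) {
                options.add("'" + entry.getKey() + "': " + entry.getValue());
            }
        }
        return "ALTER KEYSPACE " + keyspace + " WITH replication = {" + options + "};";
    }
}
```

Passing a `LinkedHashMap` keeps the DCs in a stable order. After the 3:3:3:1 step, `nodetool rebuild` with the source DC name is typically run on each node of the new DCs; after the 0:3:3:1 step, `nodetool decommission` retires the old nodes.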
Maintenance in a multi-region C* cluster:
CREATE TABLE maintenance.history (
    dc text,
    op text,
    ts timestamp,
    ip text,
    kscf text,  -- column referenced by the index below
    PRIMARY KEY ((dc, op), ts)
) WITH CLUSTERING ORDER BY (ts ASC) AND
    bloom_filter_fp_chance=0.010000 AND
    caching='{"keys":"ALL", "rows_per_partition":"NONE"}' AND
    comment='' AND
    dclocal_read_repair_chance=0.100000 AND
    gc_grace_seconds=864000 AND
    read_repair_chance=0.000000 AND
    compaction={'class': 'SizeTieredCompactionStrategy'} AND
    compression={'sstable_compression': 'LZ4Compressor'};
CREATE INDEX history_kscf_idx ON maintenance.history (kscf);
gumgum@ip-10-233-133-65:/opt/scripts/production/groovy$ groovy CassandraMaintenanceCheck.groovy -dc us-east-va-realtime -op compaction -e mario@gumgum.com
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", cassandraNodes)
  .set("spark.cassandra.connection.local_dc", "us-east-va-analytics")
  .set("spark.cassandra.connection.factory", "com.gumgum.spark.bluekai.DirectLinkConnectionFactory")
  .set("spark.driver.memory", "4g")
  .setAppName("Cassandra presidential candidates app")
US West Datacenter!
EU West DC
US East DC
Analytics DC
US West DC
GumGum is hiring!