ClickHouse at Segment
Alan Braithwaite
@caust1c
Collect
Standardize
Synthesize
Dispatch
Background & Motivation
ClickHouse & Kubernetes
Performance & Observability
Legacy Counters on Redis
MySQL (Aurora)
Google BigQuery
Snowflake
Spark
Flink
Druid
Amazon's Elastic Container Service
If Kubernetes had an annoying younger sibling, ECS would be it
Used i3 nodes with local NVMe disks
Started 1 node for POC
Lasted longer than we wanted it to
Struggled to get Distributed Config working
EKS Ready for use internally — April 19
New i3en Instances Announced — May 8
Design inspired by LogHouse
https://github.com/flant/loghouse
github.com/yandex/ClickHouse/issues/5287
Initial Report — May 15
Root Cause Identified — May 20
Fix Implemented — June 5
Shipped in release — June 24
9 * i3en.3xlarge nodes
7.5TB NVMe SSD, 96GB memory, 12 vCPUs each
ClickHouse gets full use of each node
Kubernetes mounts the NVMe disk inside the container
Cross replication using the default-databases trick
Env vars for the macros are set via scripts in a ConfigMap directory
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: counters
spec:
  selector:
    matchLabels:
      app: clickhouse
  updateStrategy:
    type: OnDelete
  podManagementPolicy: Parallel
  serviceName: "clickhouse"
  replicas: 9
  template:
    metadata:
      labels:
        app: clickhouse
    spec:
      terminationGracePeriodSeconds: 10
      nodeSelector:
        segment.com/pool: clickhouse
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - clickhouse
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: clickhouse
        image: "<segment-private-docker-repo>/clickhouse-server:eff47da"
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/clickhouse
        - name: configdir
          mountPath: /etc/clickhouse-server/config.d
      - name: clickhouse-mon
        image: "<segment-private-docker-repo>/clickhouse-mon:587ea6b"
      volumes:
      - name: datadir
        hostPath:
          path: /data
          type: Directory
      - name: configdir
        configMap:
          name: clickhouse-config
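serviceName: "clickhouse" implies a headless Service, which is what gives each pod the stable DNS name (counters-0.clickhouse.default.svc.cluster.local) referenced in remote_servers later. A minimal sketch; the port names and the HTTP port are assumptions, not shown in the deck:
apiVersion: v1
kind: Service
metadata:
  name: clickhouse
spec:
  clusterIP: None   # headless: per-pod DNS records instead of a virtual IP
  selector:
    app: clickhouse
  ports:
  - name: native
    port: 9000      # ClickHouse native protocol, matching the port in remote_servers
  - name: http
    port: 8123      # HTTP interface (assumed)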
FROM yandex/clickhouse-server:19.9.2.4
RUN apt-get update && apt-get install -y iptables ca-certificates
COPY start-clickhouse.sh /usr/local/bin/start-clickhouse
ENTRYPOINT ["/usr/local/bin/start-clickhouse"]
#!/usr/bin/env bash
set -xe

HOST=$(hostname -s)
if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
  NAME=${BASH_REMATCH[1]}
  ORD=${BASH_REMATCH[2]}
else
  echo "Failed to parse name and ordinal of Pod"
  exit 1
fi

# set the environment before we exec
. "/etc/clickhouse-server/config.d/${ORD}.sh"

exec /entrypoint.sh
Dockerfile
start-clickhouse.sh
<yandex>
    <macros replace="replace">
        <r0shard from_env="CH_SHARD_R0"/>
        <r0replica from_env="CH_REPLICA_R0"/>
        <r1shard from_env="CH_SHARD_R1"/>
        <r1replica from_env="CH_REPLICA_R1"/>
        <cluster>counters_cluster</cluster>
    </macros>
</yandex>
macros.xml
$ ls config.d
0.sh 1.sh 2.sh
3.sh 4.sh 5.sh
6.sh 7.sh 8.sh
$ cat config.d/1.sh
export CH_SHARD_R0=01
export CH_REPLICA_R0=00
export CH_SHARD_R1=02
export CH_REPLICA_R1=01
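Packaged up, the clickhouse-config ConfigMap mounted at config.d might look roughly like this (a sketch: only ordinal 1 is shown, using the values above; the other eight keys follow the same pattern):
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-config
data:
  # one snippet per pod ordinal, sourced by start-clickhouse before exec
  "1.sh": |
    export CH_SHARD_R0=01
    export CH_REPLICA_R0=00
    export CH_SHARD_R1=02
    export CH_REPLICA_R1=01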
Cross Replication
ENGINE = Distributed(
    'counters_cluster',
    -- Database name: use the default database of whichever node is connected to.
    -- This will either be r0 or r1, depending on the (shard, node) combination.
    -- This mapping is configured in `locals.xml`.
    '',
    -- Table name: the table name within r0/r1 that will be used to perform the query.
    global_events_with_properties,
    -- Sharding key: randomly choose a shard to write to for every insertion.
    rand()
);
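Underneath the Distributed table, each node carries two local ReplicatedMergeTree tables, one in r0 and one in r1, parameterized by the macros above. A minimal sketch of the r0 copy; the columns, ZooKeeper path, and keys are illustrative assumptions, not Segment's actual schema:
-- {r0shard} / {r0replica} expand from macros.xml via the per-ordinal env vars
CREATE TABLE r0.global_events_with_properties
(
    ts         DateTime,
    source_id  String,
    event      String,
    properties String
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{r0shard}/global_events_with_properties',
    '{r0replica}'
)
PARTITION BY toYYYYMMDD(ts)
ORDER BY (source_id, event, ts);

-- The r1 table is identical apart from using {r1shard} and {r1replica}.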
<remote_servers>
    <counters_cluster>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <default_database>r0</default_database>
                <host>counters-0.clickhouse.default.svc.cluster.local</host>
                <port>9000</port>
            </replica>
            <replica>
                <default_database>r1</default_database>
                <host>counters-1.clickhouse.default.svc.cluster.local</host>
                <port>9000</port>
            </replica>
        </shard>
        <!-- ... -->
    </counters_cluster>
</remote_servers>
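Once this config is loaded, the shard/replica layout can be sanity-checked from any node with something like:
SELECT cluster, shard_num, replica_num, host_name, port, default_database
FROM system.clusters
WHERE cluster = 'counters_cluster';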
Insertion Performance
Hit a wall inserting 100k-row batches into the Distributed table (over the raw data)
Each shard got batches of only ~11k rows (100k / 9 nodes)
Active part count climbed
Distributed writes are asynchronous, so there's no backpressure
Solution: insert directly into the shards
Drawback: the client doesn't fall back to the shard's other replica
If a node fails, its shard stops receiving data
Merges were now able to keep up
Active part count went down (see the query below)
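The "active part count" we watched is exposed in system.parts; a query along these lines is one way to see whether merges are keeping up:
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;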
Convert more existing systems to ClickHouse
Better Observability features for Segment Customers
Open Source
Kubernetes Chart & "clickhouse-mon"
Alan Braithwaite
@caust1c