ClickHouse at Segment
Alan Braithwaite
@caust1c
Collect
Standardize
Synthesize
Dispatch
Background & Motivation
ClickHouse & Kubernetes
Performance & Observability
Legacy Counters on Redis
MySQL (Aurora)
Google BigQuery
Snowflake
Spark
Flink
Druid
Amazon's Elastic Container Service
If Kubernetes had an annoying younger sibling, ECS would be it
Used i3 nodes with local NVMe disks
Started 1 node for POC
Lasted longer than we wanted it to
Struggled to get Distributed Config working
EKS Ready for use internally — April 19
New i3en Instances Announced — May 8
Design inspired by LogHouse
https://github.com/flant/loghouse
github.com/yandex/ClickHouse/issues/5287
Initial Report — May 15
Root Cause Identified — May 20
Fix Implemented — June 5
Shipped in release — June 24
9 * i3en.3xlarge nodes
7.5TB NVMe SSD, 96GB memory, 12 vCPUs each
ClickHouse gets full use of each node
Kubernetes mounts the NVMe disk inside the container
Cross replication using the default-databases trick
Env vars for the macros are set via scripts in a ConfigMap directory
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: counters
spec:
  selector:
    matchLabels:
      app: clickhouse
  updateStrategy:
    type: OnDelete
  podManagementPolicy: Parallel
  serviceName: "clickhouse"
  replicas: 9
  template:
    metadata:
      labels:
        app: clickhouse
    spec:
      terminationGracePeriodSeconds: 10
      nodeSelector:
        segment.com/pool: clickhouse
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - clickhouse
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: clickhouse
        image: "<segment-private-docker-repo>/clickhouse-server:eff47da"
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/clickhouse
        - name: configdir
          mountPath: /etc/clickhouse-server/config.d
      - name: clickhouse-mon
        image: "<segment-private-docker-repo>/clickhouse-mon:587ea6b"
      volumes:
      - name: datadir
        hostPath:
          path: /data
          type: Directory
      - name: configdir
        configMap:
          name: clickhouse-config
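serviceName: "clickhouse" implies a headless Service, which is what gives each pod the stable DNS name (counters-0.clickhouse.default.svc.cluster.local) referenced in remote_servers later. A minimal sketch; the port names and the HTTP port are assumptions, not shown in the deck:
apiVersion: v1
kind: Service
metadata:
  name: clickhouse
spec:
  clusterIP: None   # headless: per-pod DNS records instead of a virtual IP
  selector:
    app: clickhouse
  ports:
  - name: native
    port: 9000      # ClickHouse native protocol, matching the port in remote_servers
  - name: http
    port: 8123      # HTTP interface (assumed)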
FROM yandex/clickhouse-server:19.9.2.4
RUN apt-get update && apt-get install -y iptables ca-certificates
COPY start-clickhouse.sh /usr/local/bin/start-clickhouse
ENTRYPOINT ["/usr/local/bin/start-clickhouse"]
#!/usr/bin/env bash
set -xe

HOST=$(hostname -s)
if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
  NAME=${BASH_REMATCH[1]}
  ORD=${BASH_REMATCH[2]}
else
  echo "Failed to parse name and ordinal of Pod"
  exit 1
fi

# set the environment before we exec
. "/etc/clickhouse-server/config.d/${ORD}.sh"

exec /entrypoint.sh
Dockerfile
start-clickhouse.sh
<yandex>
    <macros replace="replace">
        <r0shard from_env="CH_SHARD_R0"/>
        <r0replica from_env="CH_REPLICA_R0"/>
        <r1shard from_env="CH_SHARD_R1"/>
        <r1replica from_env="CH_REPLICA_R1"/>
        <cluster>counters_cluster</cluster>
    </macros>
</yandex>
macros.xml
$ ls config.d
0.sh 1.sh 2.sh
3.sh 4.sh 5.sh
6.sh 7.sh 8.sh
$ cat config.d/1.sh
export CH_SHARD_R0=01
export CH_REPLICA_R0=00
export CH_SHARD_R1=02
export CH_REPLICA_R1=01
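Packaged up, the clickhouse-config ConfigMap mounted at config.d might look roughly like this (a sketch: only ordinal 1 is shown, using the values above; the other eight keys follow the same pattern):
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-config
data:
  # one snippet per pod ordinal, sourced by start-clickhouse before exec
  "1.sh": |
    export CH_SHARD_R0=01
    export CH_REPLICA_R0=00
    export CH_SHARD_R1=02
    export CH_REPLICA_R1=01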
Cross Replication
ENGINE = Distributed(
    'counters_cluster',
    -- Database name: use the default database of whichever node is connected to.
    -- This will either be r0 or r1, depending on the (shard, node) combination.
    -- This mapping is configured in `locals.xml`.
    '',
    -- Table name: the table name within r0/r1 that will be used to perform the query.
    global_events_with_properties,
    -- Sharding key: randomly choose a shard to write to for every insertion.
    rand()
);
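Underneath the Distributed table, each node carries two local ReplicatedMergeTree tables, one in r0 and one in r1, parameterized by the macros above. A minimal sketch of the r0 copy; the columns, ZooKeeper path, and keys are illustrative assumptions, not Segment's actual schema:
-- {r0shard} / {r0replica} expand from macros.xml via the per-ordinal env vars
CREATE TABLE r0.global_events_with_properties
(
    ts         DateTime,
    source_id  String,
    event      String,
    properties String
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{r0shard}/global_events_with_properties',
    '{r0replica}'
)
PARTITION BY toYYYYMMDD(ts)
ORDER BY (source_id, event, ts);

-- The r1 table is identical apart from using {r1shard} and {r1replica}.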
<remote_servers>
    <counters_cluster>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <default_database>r0</default_database>
                <host>counters-0.clickhouse.default.svc.cluster.local</host>
                <port>9000</port>
            </replica>
            <replica>
                <default_database>r1</default_database>
                <host>counters-1.clickhouse.default.svc.cluster.local</host>
                <port>9000</port>
            </replica>
        </shard>
        <!-- ... -->
    </counters_cluster>
</remote_servers>
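Once this config is loaded, the shard/replica layout can be sanity-checked from any node with something like:
SELECT cluster, shard_num, replica_num, host_name, port, default_database
FROM system.clusters
WHERE cluster = 'counters_cluster';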
Insertion Performance
Hit a wall inserting 100k-row batches into the Distributed table (over the raw data)
Each shard got batches of only ~11k rows (100k / 9 nodes)
Active part count climbed
Distributed writes are asynchronous, so there's no backpressure
Solution: insert directly into the shards
Drawback: the client doesn't fall back to the shard's other replica
If a node fails, its shard stops receiving data
Merges were now able to keep up
Active part count went down (see the query below)
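The "active part count" we watched is exposed in system.parts; a query along these lines is one way to see whether merges are keeping up:
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;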
Convert more existing systems to ClickHouse
Better Observability features for Segment Customers
Open Source
Kubernetes Chart & "clickhouse-mon"
Alan Braithwaite
@caust1c