MOPCON 2023

My journey of the time series database

Peter

2023/11/12

Slide

Outlines

  • Who am I?

  • What's time series data? The fundamental of time series data

  • Related works about time series database benchmark

  • Why we choose the ClickHouse?

  • Introducing the ClickHouse

  • ClickHouse setup, configuring and troubleshooting

  • More references about the ClickHouse

Who am I?

  • Peter, GitHub

  • 在GitHub活躍的開源專案貢獻者

  • Speaker

    • COSCUP、MOPCON......

  • Engineer

    • DevOps

    • Back-end

    • System Architecture Researching

    • Web Application Security

    • PHP, Python and JavaScript

  • Industrial Technology Research Institute

    • Smart Grid Technology (2017~2021)

  • Institute for Information Industry

    • Database, Data platform architecture (2021~)

What's time series data?

  • 索引與timestamp時間戳記有關

  • 資料量大小會持續的增大

  • 在一個範圍中,通常會進行

    • 聚合(aggregated)

    • 降採樣(down-sampled)

    • 查詢(queries)

  • 有密集寫入的效能需求

The fundamental of time series data

Time series data examples

  • IoT sensor datasets

  • Daily/Weekly/Monthly/Yearly temperature datasets

  • Smart Meter datasets

  • Water Meter datasets

  • And so on.

Related work

  • Closed Loop Benchmark for Timeseries Databases

  • SciTS: A Benchmark for Time-Series Database in Scientific
    Experiments and Industrial Internet of Things

  • TSM-Bench: Benchmarking Time Series Database Systems for Monitoring Applications

It's about the time series database benchmark

Choosing the correct time series database

Related work

Why papers are lacking? From 2019 to 2020

Comparing the four databases

  • ClickHouse

    • OLAP, Online analytical processing

    • column-oriented database

  • InfluxDB

  • TimescaleDB

    • Time series PostgreSQL

  • PostgreSQL

    • Relational Database System

Experiment setup

  • 利用科學實驗(KATRIN)作為使用者案例來評估TSDB

    • 卡爾斯魯厄氚中微子實驗

  • 此Benchmark是針對此實驗設計的

    • 但是也可以適用在IIoT環境中

    • 感測器測量到的數值

    • 離子監控

    • 電力、電流與電壓

    • 磁場相關的數值

    • 監控的溫度

Experiment setup

  • M1硬體規格

    • Intel Xeon CPU E5-1620 v2 @ 3.70GHz of 8 logical cores

    • 32GB DDR3 RAM

    • 24 physical drives formatted with XFS and configured with

    • RAID60 and connected using a 6 Gbit/sec SAS-2 controller

    • 用來建置DB server

  • M2硬體規格

    • Intel Xeon CPU E5-2680 v3 @ 2.50GHz over 2 sockets of 48 logical cores in total

    • 512GB DDR4 RAM

    • 扮演client machine

Experiment setup

  • 測試的候選資料庫皆使用預設設定來運行

  • 測試的資料表綱要

    • 主要欄位有:timestamp, sensor_id, value

    • 8 bytes for timestamp

    • 8 bytes long integer as the sensor ID

    • 8 bytes double-precision float for value.

    • 將timestamp欄位建立索引

Experiment results

  • Data Ingestion資料寫入

  • Scaling Workload

    • 每個目標資料庫同使併發使用48個clients

    • 每個batch大小為20,000筆資料並總計2.8 billion(28億)筆資料

    • 每個量測效能的資料庫皆有28億筆資料

    • ClickHouse表現出有較好的寫入效能,每秒可以寫入120萬筆資料

Experiment results

Experiment results

  • Queries Latency查詢延遲時間

    • 填入了28億資料中進行查詢

    • 總計為15天測量的資料

    • 由100,000個感測器完成

  • 每個查詢執行20次,每個查詢之間,清除資料表和系統與快取

    • 確保查詢的結果都是從硬碟中查到的

  • Q1、以10分鐘區間找到10個不同感測器的資料出來

Experiment results

Experiment results

  • Queries Latency查詢延遲時間

    • Q2、從180分鐘區間找到範圍外(異常)的資料

    • 較為複雜查詢在PostgreSQL中有瓶頸(B-Tree Index瓶頸)

Experiment results

Experiment results

  • Queries Latency查詢延遲時間

    • Q3、從15天中,10個不同感測器,60分鐘區間撈出資料,計算標準差

Experiment results

Experiment results

  • Queries Latency查詢延遲時間

    • Q4、從15天中,10個不同感測器,以小時間隔撈出一天的資料

Experiment results

Experiment results

  • Queries Latency查詢延遲時間

    • Q5、從15天中,2個不同感測器,以小時間隔撈出一天的資料

Experiment results

Conclusion

  • ClickHouse

    • 有效率的儲存引擎

    • 支援非常高的寫入速度(1.3 million/秒)

    • 48個client數量同時併發連線之下,維持低CPU、低記憶體和硬碟I/O使用

    • 查詢上有低延遲的時間,是因為有幾個原因如下

      • partitioned data management

      • sparse indexing algorithm

      • mix of compiled vectorized query execution

Conclusion

  • InfluxDB

    • LSM-Tree設計所以在寫入資料可高達每秒790,000筆資料

    • Concurrent表現出弱勢,CPU資源因textual protocol耗盡

    • Memory與disk I/O效能有效率

    • 查詢延遲時間沒與ClickHouse一樣低 (LSM-Tree 犧牲read效能)

      • ​​​​​​​Log-structured merge-tree

Conclusion

  • TimescaleDB

    • 與PosgreSQL比較,有很大的改良,寫入資料可每秒490,000筆資料

    • 改善了系統資源使用量,與PostgreSQL比較來說也較佳

    • 改良同時也影響了查詢延遲時間,查詢延遲時間也比PostgreSQL好

  • PostgreSQL

    • 傳統的ACID設計的資料庫

      • 原子性(Atomicity)​​​​​​​

      • 一致性(Consistency)

      • 事務隔離(Isolation)

      • 持久性(Durability)

    • 系統資源使用量,不論是CPU資源、記憶體與disk I/O都沒有效率

Conclusion

  • Column-format資料有效的做到

    • 資料分區並分成多個部分並擁有低延遲的查詢速率

    • 對於系統資源使用量來說

    • 有效率的RAM管理與使用

    • 有效率的硬碟I/O使用與管理

Row-oriented V.S. Column-oriented

That's the reason why we choose the ClickHouse

ClickHouse installation

  • Hardware requirements

    • CPU指令集

      • SSE 4.2 support

      • grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

    • 4 CPU cores at least

    • 8 GB RAM at least

ClickHouse installation

  • Ubuntu 20.04為例

sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754
echo "deb [arch=amd64] https://packages.clickhouse.com/deb stable main" | \ 
    sudo tee /etc/apt/sources.list.d/clickhouse.list

sudo apt-get update
sudo apt-get install -y clickhouse-client clickhouse-server

ClickHouse configuring

  • Ubuntu 20.04為例

sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754
echo "deb [arch=amd64] https://packages.clickhouse.com/deb stable main" | \ 
    sudo tee /etc/apt/sources.list.d/clickhouse.list

sudo apt-get update
sudo apt-get install -y clickhouse-client clickhouse-server


......
 useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse-bridge clickhouse-bridge
 chown -R clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-odbc-bridge'
 chown -R clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-library-bridge'
Enter password for default user:
Password for default user is saved in file /etc/clickhouse-server/users.d/default-password.xml.
Setting capabilities for clickhouse binary. This is optional.
 chown -R clickhouse:clickhouse '/etc/clickhouse-server'

ClickHouse has been successfully installed.

$ sudo systemctl is-enabled clickhouse-server.service
enabled
$ sudo systemctl status clickhouse-server.service
● clickhouse-server.service - ClickHouse Server (analytic DBMS for big data)
     Loaded: loaded (/lib/systemd/system/clickhouse-server.service; enabled; vendor preset: enabled)
     Active: inactive (dead)
$ sudo systemctl start clickhouse-server.service
$ sudo systemctl status clickhouse-server.service

ClickHouse client usage

  • Ubuntu 20.04為例

$ clickhouse-client --password
ClickHouse client version 23.9.3.12 (official build).
Password for user (default):
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 23.9.3 revision 54466.

Warnings:
 * Maximum number of threads is lower than 30000. There could be problems with handling a lot of simultaneous queries.

peterli-VirtualBox :) select version();

SELECT version()

Query id: f261aa63-39ff-4e61-959e-978f2f09dfda

┌─version()─┐
│ 23.9.3.12 │
└───────────┘

1 row in set. Elapsed: 0.004 sec.

peterli-VirtualBox :)

Acces ClickHouse database

import clickhouse_connect
clickhouse_client = clickhouse_connect.get_client(
    host=host,
    username=username,
    password=password,
    port=port,
    database=db_name
)
clickhouse_client.query('CREATE DATABASE IF NOT EXISTS 3e_green_ems')
table_lists = clickhouse_client.query('SHOW TABLES')
for device in table_lists.result_rows:
	......

ClickHouse troubleshooting

  • Ubuntu 20.04為例

  • Warnings:

    • * Maximum number of threads is lower than 30000. There could be problems with handling a lot of simultaneous queries.

ClickHouse troubleshooting

  • Ubuntu 20.04為例

  • Warnings:

    • * Maximum number of threads is lower than 30000. There could be problems with handling a lot of simultaneous queries.

      • LimitNPROC=64000

[Service]
Type=notify
......
LimitCORE=infinity
LimitNOFILE=500000
LimitNPROC=64000

$ sudo systemctl daemon-reload
$ sudo systemctl restart clickhouse-server.service
$ clickhouse-client --password
ClickHouse client version 23.9.3.12 (official build).
Password for user (default):
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 23.9.3 revision 54466.

peterli-VirtualBox :) Bye.

The ClickHouse prefers dedicated server

Contributing the ClickHouse!

System architecture with ClickHouse

System architecture with ClickHouse (2)

References

Any Questions?