COSCUP 2024
Peter
2024/08/04
My best practices for interacting between PostgreSQL and ClickHouse
About me
- Peter, GitHub
- Active open source contributor
- Speaker
  - COSCUP, MOPCON, ...
- An engineer
  - DevOps
  - Back-end
  - System architecture research
  - Web application security
  - PHP, Python and JavaScript
- Industrial Technology Research Institute
  - Smart Grid Technology (2017~2021)
- Institute for Information Industry
  - Database, data platform architecture (2021~)
Outlines
- What's ClickHouse?
- Why do I need to interact between PostgreSQL and ClickHouse?
- Introducing multiple approaches to interacting between PostgreSQL and ClickHouse
  - Case 1: From PostgreSQL to ClickHouse
  - Case 2: From ClickHouse to PostgreSQL
- Conclusion
What's ClickHouse?
- https://clickhouse.com
- An OLAP (online analytical processing) database
  - Datasets can be massive: billions or trillions of rows
  - Data is organized in tables that contain many columns
  - Only a few columns are selected to answer any particular query
  - Results must be returned in milliseconds or seconds
- A column-oriented database
  - What's the difference between column-oriented and row-oriented databases?
  - Row-oriented database vs. column-oriented database (illustrated in the sketch below)
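To make the difference concrete, here is a toy Python sketch (an illustration only, not ClickHouse internals) of the same sensor records stored row-wise versus column-wise.
# Hypothetical illustration: the same sensor records in both layouts.
rows = [  # row-oriented: each record is stored together
    {'sensor': '1.1.1', 'value': 0.39, 'ts': '2024-08-04 10:00:00'},
    {'sensor': '1.1.1', 'value': 0.375, 'ts': '2024-08-04 10:00:15'},
    {'sensor': '1.1.2', 'value': 1.02, 'ts': '2024-08-04 10:00:00'},
]
columns = {  # column-oriented: each column is stored together
    'sensor': ['1.1.1', '1.1.1', '1.1.2'],
    'value': [0.39, 0.375, 1.02],
    'ts': ['2024-08-04 10:00:00', '2024-08-04 10:00:15', '2024-08-04 10:00:00'],
}
# An analytical query such as AVG(value) must scan every full record in
# the row-oriented layout, but only the 'value' array in the column-
# oriented one; that is why only a few columns being selected per query
# favours column storage.
avg_from_rows = sum(r['value'] for r in rows) / len(rows)
avg_from_columns = sum(columns['value']) / len(columns['value'])
assert avg_from_rows == avg_from_columns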
Outlines
- What's ClickHouse?
- Why do I need to interact between PostgreSQL and ClickHouse?
- Introducing multiple approaches to interacting between PostgreSQL and ClickHouse
  - Case 1: From PostgreSQL to ClickHouse
  - Case 2: From ClickHouse to PostgreSQL
- Conclusion
Why do I need to interact between PostgreSQL and ClickHouse?
- Collecting the sensor datasets must be quick
  - Sensor datasets are collected from the EMS
  - Sensor datasets should be collected every 15~30 seconds
  - Sensor datasets frequently contain duplicated records
- PostgreSQL cannot keep up with collection at this frequency
- But PostgreSQL can store the processed datasets
  - Aggregate the 15~30-second samples into 1-minute averages (see the sketch below)
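As a concrete example of that last point, here is a minimal sketch of the 1-minute averaging done inside ClickHouse with clickhouse_connect. It assumes the device.`1.1.1` table created later in these slides; the host and credentials are placeholders for your own setup.
import clickhouse_connect

# A sketch: connect to a local ClickHouse server (placeholder credentials).
client = clickhouse_connect.get_client(
    host='localhost', username='default', password='password')
# Average the 15~30-second raw samples into 1-minute buckets; only this
# much smaller result set needs to be stored in PostgreSQL.
result = client.query('''
    SELECT
        toStartOfMinute(measured_timestamp) AS minute,
        avg(value) AS avg_value
    FROM device.`1.1.1`
    GROUP BY minute
    ORDER BY minute
''')
for minute, avg_value in result.result_rows:
    print(minute, avg_value)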
Outlines
- What's ClickHouse?
- Why do I need to interact between PostgreSQL and ClickHouse?
- Introducing multiple approaches to interacting between PostgreSQL and ClickHouse
  - Case 1: From PostgreSQL to ClickHouse
  - Case 2: From ClickHouse to PostgreSQL
- Conclusion
Introducing multiple approaches to interacting between PostgreSQL and ClickHouse
- Prepare the experimental virtual machine
- From ClickHouse to PostgreSQL
- From PostgreSQL to ClickHouse
Prepare the experimental virtual machine
- Ubuntu 22.04
  - CPU: 4 cores
  - RAM: 6 GB
  - SSD: 120 GB
- Install PostgreSQL 14
- Install the latest ClickHouse server version
$ cat /proc/cpuinfo | grep -c CPU
4
$ free -mh
total used free shared buff/cache available
Mem: 5.8Gi 180Mi 4.8Gi 14Mi 847Mi 5.4Gi
Swap: 0B 0B 0B
$ df -lh /
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 118G 4.7G 108G 5% /
$ sudo apt-get update
$ sudo apt-get install -y postgresql postgresql-contrib
$ sudo su - postgres
postgres@server1:~$ psql
postgres=# CREATE USER lee CREATEDB password 'password';
CREATE ROLE
postgres=#
postgres=# exit
postgres@server1:~$ exit
$ vim /etc/postgresql/14/main/pg_hba.conf
$ cat /etc/postgresql/14/main/pg_hba.conf
# Database administrative login by Unix domain socket
local all postgres peer
local all lee md5
$ sudo systemctl reload postgresql.service
$ psql -U lee --dbname postgres
Password for user lee:
psql (14.12 (Ubuntu 14.12-0ubuntu0.22.04.1))
Type "help" for help.
postgres=> create database lee;
CREATE DATABASE
postgres=> exit
Prepare the experimental virtual machine
Install PostgreSQL 14
$ sudo apt-get update
$ sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
$ curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' \
| sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
$ echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" \
| sudo tee /etc/apt/sources.list.d/clickhouse.list
$ sudo apt-get update
$ sudo apt-get install -y clickhouse-server clickhouse-client
chown -R clickhouse:clickhouse '/var/log/clickhouse-server/'
chown -R clickhouse:clickhouse '/var/run/clickhouse-server'
chown clickhouse:clickhouse '/var/lib/clickhouse/'
Enter password for the default user:
$ sudo systemctl enable --now clickhouse-server
$ clickhouse-client --password
ClickHouse client version 24.6.2.17 (official build).
Password for user (default):
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 24.6.2.
server1.peterli.website :) select version();
SELECT version()
Query id: 468925e2-65ec-4416-9aca-19778b6d8fc3
┌─version()─┐
1. │ 24.6.2.17 │
└───────────┘
1 row in set. Elapsed: 0.004 sec.
server1.peterli.website :) exit
Bye.
Prepare the experimental virtual machine
Install the latest ClickHouse version
Introducing multiple approaches to interacting between PostgreSQL and ClickHouse
- Prepare the experimental virtual machine
- From ClickHouse to PostgreSQL
- From PostgreSQL to ClickHouse
From ClickHouse to PostgreSQL
- Write some code (Python, for example)
  - Connect to ClickHouse first.
  - Query the required sensor datasets.
  - Process those sensor datasets.
  - Connect to PostgreSQL.
  - Insert the processed sensor datasets into PostgreSQL (a sketch of this half follows the ClickHouse code below).
From ClickHouse to PostgreSQL
import datetime
import os

import clickhouse_connect

# Connection settings are read from environment variables.
host = os.getenv('clickhouse_db_host')
username = os.getenv('clickhouse_db_username')
password = os.getenv('clickhouse_db_password')
port = int(os.getenv('clickhouse_db_port', '8123'))
db_name = os.getenv('clickhouse_db_name')
clickhouse_client = clickhouse_connect.get_client(
    host=host,
    username=username,
    password=password,
    port=port,
    database=db_name
)
clickhouse_client.query('CREATE DATABASE IF NOT EXISTS device')
# One table per device; the table name is the device identity.
device_identity = '1.1.1'
clickhouse_client.query(
    '''
    CREATE TABLE IF NOT EXISTS device.`%s`
    (
        `value` Float64,
        `measured_timestamp` DateTime('Asia/Taipei')
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(measured_timestamp)
    ORDER BY measured_timestamp
    SETTINGS index_granularity = 8192
    ''' % device_identity
)
# Sensor data often arrives duplicated; deduplicate the stored rows.
clickhouse_client.query('''
    OPTIMIZE TABLE device.`%s` FINAL DEDUPLICATE BY %s, %s;
    ''' % (device_identity, 'value', 'measured_timestamp',)
)
columns = ['value', 'measured_timestamp']
sensor_records = [
    [0.39, datetime.datetime.now()],
    [0.375, datetime.datetime.now()],
]
clickhouse_client.insert(device_identity, sensor_records,
                         column_names=columns, database='device')
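The code above covers only the ClickHouse half of the steps. Below is a minimal sketch of the remaining two steps (connect to PostgreSQL, insert the processed rows), assuming psycopg2 and a hypothetical sensor_minute_avg table; neither is part of the original code.
import datetime
import psycopg2

# Hypothetical processed output of the ClickHouse query step:
# (minute bucket, 1-minute average value) pairs.
minute_averages = [
    (datetime.datetime(2024, 8, 4, 10, 0), 0.3825),
]
# Connection values match the PostgreSQL user created earlier in these slides.
pg_conn = psycopg2.connect(host='localhost', port=5432, dbname='lee',
                           user='lee', password='password')
with pg_conn, pg_conn.cursor() as cur:
    # sensor_minute_avg is an assumed table name for this sketch.
    cur.execute('''
        CREATE TABLE IF NOT EXISTS sensor_minute_avg (
            measured_minute timestamptz,
            avg_value double precision
        )
    ''')
    cur.executemany(
        'INSERT INTO sensor_minute_avg (measured_minute, avg_value) '
        'VALUES (%s, %s)',
        minute_averages,
    )
pg_conn.close()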
From ClickHouse to PostgreSQL
- Python library and other references
  - Python integrations
  - clickhouse_connect
From ClickHouse to PostgreSQL
- Run Airbyte
  - It's an open source data integration platform
  - An ELT (extract, load and transform) tool
  - An ETL (extract, transform and load) tool
- Use Airbyte to load data from ClickHouse to PostgreSQL
$ sudo snap install docker --classic
$ git clone --depth 1 https://github.com/airbytehq/airbyte.git
$ cd airbyte
$ ./run-ab-platform.sh -b
......
[+] Running 11/11
✔ Container airbyte-docker-proxy-1 Started 0.0s
✔ Container airbyte-temporal Started 0.1s
✔ Container airbyte-db Started 0.3s
✔ Container init Exited 0.1s
✔ Container airbyte-bootloader Exited 0.1s
✔ Container airbyte-connector-builder-server Started 0.6s
......
Airbyte containers are running!
root@server1:~/airbyte#
From ClickHouse to PostgreSQL
- Browse the Airbyte website
  - Browse http://{IP_ADDRESS}:8000
  - Default username: airbyte
  - Default password: password
From ClickHouse to PostgreSQL
-
Initializing the Airbyte website
From ClickHouse to PostgreSQL
- Search for ClickHouse
From ClickHouse to PostgreSQL
- Create ClickHouse as the source
From ClickHouse to PostgreSQL
- Create PostgreSQL as the destination
From ClickHouse to PostgreSQL
- The source cannot use customized SQL queries.
- To process data, you need to develop a custom connector.
From ClickHouse to PostgreSQL
- Using customized code
  - Pros
    - You have complete freedom to implement the data-syncing logic.
  - Cons
    - You must implement status monitoring yourself (see the sketch below).
- Using Airbyte
  - Pros
    - It has a GUI tool that makes synchronizing data easy.
    - Synchronization statuses are easy to monitor.
  - Cons
    - You need to develop a connector to process data.
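To make the "monitor it yourself" con concrete, here is a minimal sketch of a sync loop with self-built monitoring; sync_once() is a hypothetical placeholder for the steps shown in the earlier code slides.
import logging
import time

logging.basicConfig(
    level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('ch-to-pg-sync')

def sync_once():
    # Hypothetical placeholder: query ClickHouse, aggregate to 1-minute
    # averages, and insert into PostgreSQL as in the earlier code slides.
    pass

while True:
    try:
        sync_once()
        log.info('sync succeeded')
    except Exception:
        log.exception('sync failed')  # an alerting hook would go here
    time.sleep(60)  # matches the 1-minute averaged output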
Introducing multiple approaches to interacting between PostgreSQL and ClickHouse
- Prepare the experimental virtual machine
- From ClickHouse to PostgreSQL
- From PostgreSQL to ClickHouse
From PostgreSQL to ClickHouse
- Sometimes datasets are stored in the PostgreSQL database
  - To analyse the data quickly, it should be loaded from PostgreSQL into ClickHouse
- There are two approaches to achieve this goal
  - Using the official ClickHouse tool to integrate with the PostgreSQL database
  - Using Airbyte to load data from PostgreSQL to ClickHouse
From PostgreSQL to ClickHouse
- Airbyte
  - It's an open source data integration platform
From PostgreSQL to ClickHouse
From PostgreSQL to ClickHouse
- Official ClickHouse support for PostgreSQL integration
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
name1 type1 [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
name2 type2 [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
...
)
ENGINE = PostgreSQL({host:port, database, table, user, password[, schema, [, on_conflict]] | named_collection[, option=value [,..]]})
From PostgreSQL to ClickHouse
- Official ClickHouse support for PostgreSQL integration
$ clickhouse-client --password
ClickHouse client version 24.6.2.17 (official build).
Password for user (default):
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 24.6.2.
CREATE TABLE default.`1.1.1`
(
`value` double,
`measured_timestamp` DateTime('Asia/Taipei')
)
ENGINE = PostgreSQL(
'{IP_ADDRESS}:5432',
'database_name',
'1.1.1',
'user_name',
'password',
'schema_name'
)
Query id: 945344dc-07e8-479a-9eb5-2e54a7c326a4
Ok.
0 rows in set. Elapsed: 0.033 sec.
From PostgreSQL to ClickHouse
- Official ClickHouse support for PostgreSQL integration
  - With this approach, the data itself still lives in PostgreSQL: ClickHouse forwards each query to it
datapipeline03 :) select count(*) from '1.1.1';
SELECT count(*)
FROM `1.1.1`
Query id: 394af180-731a-45e0-aba4-ac2efbdab9a1
┌─count()─┐
│ 170703 │
└─────────┘
1 row in set. Elapsed: 0.645 sec. Processed 170.70 thousand rows, 682.81 KB (264.53 thousand rows/s., 1.06 MB/s.)
Peak memory usage: 837.77 KiB.
datapipeline03 :)
From PostgreSQL to ClickHouse
- Another approach loads data from PostgreSQL and stores it in ClickHouse
CREATE TABLE default.`1.1.1_copy`
(
`value` double,
`measured_timestamp` DateTime('Asia/Taipei')
)
ENGINE = MergeTree
ORDER BY measured_timestamp
Query id: 849a3f6e-5a79-4f46-bfc5-9f2515b80597
Ok.
0 rows in set. Elapsed: 0.035 sec.
From PostgreSQL to ClickHouse
- Another approach loads data from PostgreSQL and stores it in ClickHouse
datapipeline03 :) INSERT INTO default."1.1.1_copy"
SELECT * FROM postgresql('IP_ADDRESS:5432', 'database_name', 'table_name', 'user', 'password', 'schema');
INSERT INTO default.`1.1.1_copy` SELECT *
FROM postgresql('IP_ADDRESS:5432', 'database_name', 'table_name', 'user', 'password', 'schema')
Query id: 3feae7a8-bc34-4ade-811c-cf47ad5dae1a
Ok.
0 rows in set. Elapsed: 0.338 sec. Processed 170.70 thousand rows, 2.73 MB (505.36 thousand rows/s., 8.09 MB/s.)
Peak memory usage: 8.12 MiB.
datapipeline03 :) select count(*) from '1.1.1_copy';
SELECT count(*)
FROM `1.1.1_copy`
Query id: ed26a16e-3a37-4db9-8df2-b1d8846def2d
┌─count()─┐
│ 170703 │
└─────────┘
1 row in set. Elapsed: 0.004 sec.
From PostgreSQL to ClickHouse
- Comparing the two approaches: the local MergeTree copy (`1.1.1_copy`) answers the count in 0.004 sec, while the PostgreSQL-engine table (`1.1.1`) must forward the query to PostgreSQL and takes 0.172 sec
datapipeline03 :) select count(*) from '1.1.1_copy';
SELECT count(*)
FROM `1.1.1_copy`
Query id: ed26a16e-3a37-4db9-8df2-b1d8846def2d
┌─count()─┐
│ 170703 │
└─────────┘
1 row in set. Elapsed: 0.004 sec.
datapipeline03 :) select count(*) from '1.1.1';
SELECT count(*)
FROM `1.1.1`
Query id: 174f6669-4df6-4cbf-a05d-688c456a98fa
┌─count()─┐
│ 170703 │
└─────────┘
1 row in set. Elapsed: 0.172 sec. Processed 170.70 thousand rows, 682.81 KB (991.29 thousand rows/s., 3.97 MB/s.)
Peak memory usage: 774.29 KiB.
From PostgreSQL to ClickHouse
- Airbyte approach (Create PostgreSQL source)
From PostgreSQL to ClickHouse
- Airbyte approach (Create PostgreSQL source)
From PostgreSQL to ClickHouse
- Airbyte approach (Create ClickHouse destination)
From PostgreSQL to ClickHouse
- Airbyte approach (Create Connection)
From PostgreSQL to ClickHouse
- Airbyte approach (Create Connection)
Outlines
- What's ClickHouse?
- Why do I need to interact between PostgreSQL and ClickHouse?
- Introducing multiple approaches to interacting between PostgreSQL and ClickHouse
  - Case 1: From PostgreSQL to ClickHouse
  - Case 2: From ClickHouse to PostgreSQL
- Conclusion
Conclusion
- Loading data from ClickHouse to PostgreSQL
  - Honestly, it's not convenient.
  - It isn't supported natively by ClickHouse.
  - You need to write custom code to achieve this goal.
  - With Airbyte, you need to create a connector if you want to customize the data-loading logic.
- Loading data from PostgreSQL to ClickHouse
  - It's more convenient than the case above.
  - It's supported natively by ClickHouse.
  - With Airbyte, you need to create a connector if you want to customize the data-loading logic.
More references
- Official ClickHouse documentation
  - Load data from PostgreSQL to ClickHouse
- MOPCON 2023 unconf
  - My journey of the time series database
Any Questions?