AWS Redshift

Guillaume Simard

  • Coveo
  • Usage data Infrastructure
  • Machine learning infrastructure

Pierre-Alexandre St-Jean

Dynamic reporting

118M
226M
5.34M

THAT'S A LOT OF ROWS!

AWS Redshift

Petabyte scale data warehouse

Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes.

"Cheap"

Scalable

Fast

"Cheap"

1000 USD / TB / Year* (dense storage nodes, HDD)

5000 USD / TB / Year* (dense compute nodes, SSD)

*With 3 year reserved instances

Scalable

Linear horizontal scaling*

*Depending on use case

Fast*

select origin_country, query_expression, count(*)
from ua.searches
where datetime > '2015-01-01'
  and query_expression != ''
group by origin_country, query_expression
order by count desc
limit 100;

2.47s

select document_title, count(distinct click_id)
from ua.clicks
left join ua.searches on searches.search_id = clicks.search_id
where searches.query_expression like '%banana%'
group by document_title
order by count desc;

1.88s
select * from ua.searches where user_name = 'gsimard' limit 1;

26.14s

*As long as the query filters on the sort key columns: the 26 s query above filters on user_name, which is not part of the sort key, so Redshift has to scan far more blocks.

How does it work?

The technical stuff

SQL

select searches.user_id, query_expression, document_title 
from ua.searches
left join ua.clicks on searches.search_id = clicks.search_id
where searches.user_id = 'gsimard@coveo.com'
and length(query_expression) > 5
and searches.origin_country like 'C%'
order by query_expression;

SQL Compilation

select searches.user_id, query_expression, document_title 
from ua.searches
left join ua.clicks on searches.search_id = clicks.search_id
where searches.user_id = 'gsimard@coveo.com'
and length(query_expression) > 5
and searches.origin_country like 'C%'
order by query_expression;

[Cluster diagram: the leader node compiles the SQL and distributes work to four compute nodes]
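
Each compute node is further split into slices that work in parallel. A quick way to see that layout on a live cluster (stv_slices is a standard Redshift system view; the cluster itself is whatever you have handy):

-- One row per slice; each compute node owns several slices
select node, count(*) as slices
from stv_slices
group by node
order by node;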

Sharding

Dist Key

CREATE TABLE searches (
  account_id VARCHAR(128),
  search_id CHAR(36) ENCODE LZO,
  datetime TIMESTAMP,
  user_id VARCHAR(128) ENCODE LZO,
  user_name VARCHAR(128) ENCODE LZO,
  query_expression VARCHAR(500) ENCODE LZO,
  advanced_query_expression VARCHAR(2048) ENCODE LZO,
  response_time_ms INT ENCODE BYTEDICT,
  PRIMARY KEY (search_id))
DISTKEY(search_id)
SORTKEY(account_id, datetime);
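
A distribution key only pays off if it spreads rows evenly across slices. As a rough sanity check once the table above is loaded, svv_table_info reports the skew:

-- skew_rows close to 1.0 means the rows are spread evenly across slices
select "table", diststyle, skew_rows
from svv_table_info
where "table" = 'searches';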

Columnar Storage

Sort Key

CREATE TABLE searches (
  account_id VARCHAR(128),
  search_id CHAR(36) ENCODE LZO,
  datetime TIMESTAMP,
  user_id VARCHAR(128) ENCODE LZO,
  user_name VARCHAR(128) ENCODE LZO,
  query_expression VARCHAR(500) ENCODE LZO,
  advanced_query_expression VARCHAR(2048) ENCODE LZO,
  response_time_ms INT ENCODE BYTEDICT,
  PRIMARY KEY (search_id))
DISTKEY(search_id)
SORTKEY(account_id, datetime);

Compression

CREATE TABLE searches (
  account_id VARCHAR(128),
  search_id CHAR(36) ENCODE LZO,
  datetime TIMESTAMP,
  user_id VARCHAR(128) ENCODE LZO,
  user_name VARCHAR(128) ENCODE LZO,
  query_expression VARCHAR(500) ENCODE LZO,
  advanced_query_expression VARCHAR(2048) ENCODE LZO,
  response_time_ms INT ENCODE BYTEDICT,
  PRIMARY KEY (search_id))
DISTKEY(search_id)
SORTKEY(account_id, datetime);
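
Not sure which encoding to pick? Redshift can suggest one from a sample of the loaded data, and pg_table_def shows what each column currently uses (table name taken from the slide above; pg_table_def only lists tables in your search_path):

-- Ask Redshift to recommend a compression encoding per column
analyze compression searches;

-- Inspect the encodings currently applied
select "column", type, encoding
from pg_table_def
where tablename = 'searches';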

COPY

copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>' 
delimiter '|' region 'us-west-2';
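
The same load works with an IAM role instead of inline access keys, which is generally the safer option; the role ARN below is a placeholder:

copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
delimiter '|' region 'us-west-2';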

Vacuum

Analyze
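
Both are plain SQL commands, typically run after large loads or deletes; using the searches table from earlier:

-- Reclaim space from deleted rows and restore the sort order
vacuum ua.searches;

-- Refresh the statistics the query planner relies on
analyze ua.searches;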

Easy integration with AWS

  • S3
  • Kinesis
  • DynamoDB
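
COPY is also how most of these integrations surface. For example, it can read straight from a DynamoDB table; the table name and role ARN below are placeholders:

copy users from 'dynamodb://Users'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
readratio 50;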

Fully managed

TL;DR

Redshift is pretty cool

<Insert technical guy here>

Let's build something




CREATE TABLE searches (
  account_id VARCHAR(128),
  search_id CHAR(36) ENCODE LZO,
  datetime TIMESTAMP,
  query_expression VARCHAR(500) ENCODE LZO,
  visit_id CHAR(36) ENCODE LZO)
DISTKEY(search_id)
SORTKEY(datetime,account_id);


create table custom_events(
  account_id VARCHAR(128),
  visit_id CHAR(36) ENCODE LZO,
  search_id CHAR(36) ENCODE LZO,
  event_type VARCHAR(128) ENCODE LZO,
  event_value VARCHAR(128) ENCODE LZO,
  datetime TIMESTAMP)
DISTSTYLE ALL
SORTKEY(datetime,account_id);

Wait...

Gotcha #1 - Sorting

  • Are there columns that you can always filter on?
    • Sort key
    • Lowest cardinality first



CREATE TABLE searches (
  account_id VARCHAR(128),
  search_id CHAR(36) ENCODE LZO,
  datetime TIMESTAMP,
  query_expression VARCHAR(500) ENCODE LZO,
  visit_id CHAR(36) ENCODE LZO)
DISTKEY(search_id)
SORTKEY(datetime,account_id);



CREATE TABLE searches (
  account_id VARCHAR(128),
  search_id CHAR(36) ENCODE LZO,
  datetime TIMESTAMP,
  query_expression VARCHAR(500) ENCODE LZO,
  visit_id CHAR(36) ENCODE LZO)
DISTKEY(search_id)
SORTKEY(account_id,datetime);

Let's get all unique visits





SELECT count(DISTINCT coalesce("searches"."visit_id",
                               "custom_events"."visit_id")) AS "UniqueVisit"
FROM
  (SELECT "visit_id",
          "search_id"
   FROM "ua"."searches"
   WHERE ("datetime" BETWEEN TIMESTAMP '2015-05-22 05:00:00.0' 
                         AND TIMESTAMP '2016-01-21 04:59:59.999')
     AND "account_id" = 'charmander') AS "searches"
FULL OUTER JOIN
  (SELECT "search_id",
          "visit_id"
   FROM "ua"."custom_events"
   WHERE ("datetime" BETWEEN TIMESTAMP '2015-05-22 05:00:00.0'
                         AND TIMESTAMP '2016-01-21 04:59:59.999')
     AND "account_id" = 'charmander') AS "custom_events"
ON "custom_events"."search_id" = "searches"."search_id";

Wow... it's slow!

EXPLAIN is your best friend
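
Nothing to install or configure: prefix any statement with EXPLAIN and Redshift returns the plan without executing the query. The plan below was obtained by doing exactly that with the unique-visits query; as a minimal standalone example:

explain
select count(*)
from ua.searches
where account_id = 'charmander';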









XN Aggregate  (cost=1445266.96..1445266.96 rows=1 width=112)
  ->  XN Subquery Scan volt_dt_0  (cost=1445266.73..1445266.91 rows=18 width=112)
        ->  XN HashAggregate  (cost=1445266.73..1445266.73 rows=18 width=152)
              ->  XN Hash Full Join DS_DIST_INNER  (cost=1.40..1445264.28 rows=979 width=152)
                    Inner Dist Key: custom_events_dist.search_id
                    Hash Cond: ("outer".search_id = "inner".search_id)
                    ->  XN Seq Scan on searches  (cost=0.00..285250.63 rows=979 width=80)
                          Filter: (((account_id)::text = 'charmander'::text)
                                   AND ("datetime" >= '2015-05-22 05:00:00'::timestamp without time zone)
                                   AND ("datetime" <= '2016-01-21 04:59:59.999'::timestamp without time zone))
                    ->  XN Hash  (cost=1.40..1.40 rows=1 width=224)
                          ->  XN Seq Scan on custom_events  (cost=0.00..1.40 rows=1 width=224)
                                Filter: (((account_id)::text = 'charmander'::text)
                                         AND ("datetime" <= '2016-01-21 04:59:59.999'::timestamp without time zone)
                                         AND ("datetime" >= '2015-05-22 05:00:00'::timestamp without time zone))

Uh Oh!

Gotcha #2 - Data redistribution

Beware of:

  • DS_DIST_INNER (the inner table is redistributed)
  • DS_BCAST_INNER (a copy of the entire inner table is broadcast to all the compute nodes)
  • DS_DIST_ALL_INNER (the entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL)
  • DS_DIST_BOTH (both tables are redistributed)
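
You do not have to eyeball every plan by hand, either. Redshift records planner alerts (including large broadcasts and redistributions) in a system table, which makes a handy periodic check:

-- Recent planner alerts with Amazon's suggested fix
select query, trim(event) as event, trim(solution) as solution
from stl_alert_event_log
order by event_time desc
limit 10;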

Distribute it



create table custom_events(
  account_id VARCHAR(128),
  visit_id CHAR(36) ENCODE LZO,
  search_id CHAR(36) ENCODE LZO,
  event_type VARCHAR(128) ENCODE LZO,
  event_value VARCHAR(128) ENCODE LZO,
  datetime TIMESTAMP)
DISTSTYLE ALL
SORTKEY(account_id,datetime);


create table custom_events(
  account_id VARCHAR(128),
  visit_id CHAR(36) ENCODE LZO,
  search_id CHAR(36) ENCODE LZO,
  event_type VARCHAR(128) ENCODE LZO,
  event_value VARCHAR(128) ENCODE LZO,
  datetime TIMESTAMP)
DISTKEY(search_id)
SORTKEY(account_id,datetime);
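
With both tables distributed on search_id, the unique-visits join is co-located, so re-running EXPLAIN should show DS_DIST_NONE instead of DS_DIST_INNER. A quick way to confirm the distribution style after the rebuild:

select "table", diststyle
from svv_table_info
where "table" in ('searches', 'custom_events');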

careers.coveo.com

?
