Big Data @R1

Outline

  1. Our struture
  2. Our data model
  3. Some useful tables

  4. Dos and don'ts

  5. From Jira to SQL

Disclaimer

 

Don't try to remember everything, this presentation is also to store information you might need later.

Down to the rabbit hole

Prelude

Some links:

HUE:

https://dw.gwallet.com/notebook/editor?type=hive

Run your SQL queries

---

JobTracker: http://namenode3.dw.sc.gwallet.com:50030/jobtracker.jsp

To track your running SQL queries

Your new best friend

Some things that are always true:

  • A RadiumOne Cookie is named uuid.
  • There is always a partition per day:
    • Name: dt
    • Format: YYYYMMDD (Christmas is 20171225)
    • Timezone: UTC (without DST)
  • Sometimes, you have a sub-partition per hour:
    • Name: hr
    • Format: YYYYMMDDHH (Christmas 9am is 2017122509)
    • Because partitions are hierarchical, you have to refer a condition on dt when you do one on hr (Christmas 9am is: dt=20171225 AND hr=2017122509)
  • ​Timestamp:
    • ​Name: timestamp or ts
    • Timezone: PST (+ DST)

Our structure

99% of the data are in 2 databases:

  • radiumone
  • radiumone_json (90% there)

 

These databases have views

  • radiumone_views
  • radiumone_json_views

 

1) Aggregated, clearer tables

2) Views have the same name with the prefix "v_"

3) In theory, slower (but you don't really care)

 

How to look at a view construction:

SHOW CREATE TABLE radiumone_json_views.v_generic_bid

Link tables and DSP/DMP saves:

  • radiumone_master
  1. Link tables
    1. When you have a field like country_id, you have a table radiumone_master.countries where you can find infos related to a country ID (name, ISO2,...)
    2. A good habit: in the search bar, try to search with the name (advertiser,...)
  2. DSP and DMP data (campagne info, advertiser info, audience info,...) are stored in this database too (6<X<12 hours refresh)

Tip: radiumone_master.view_strategy_tree (link to everything)

Personal databases

  • ycarbonne
  • kjavalge
  • marnoldi
  • ...

 

Every user has a personal database where he can create, modify and delete tables

Tableau datasources

  • adsum


Before going to Tableau data source, the data is staging in this database.

There is a TON of table but sometimes, you can find some hidden gems here.


For examples, if you are looking for data based on latitude/longitude: 

DESCRIBE adsum.device_lat_lon

Our data model

Pixels

DESCRIBE radiumone_json.pixel_firings;

-- Most common fields:
timestamp      bigint	-- Exact time
pixel	       bigint	-- Pixel ID
uuid	       bigint	-- R1 Cookie
referingurl    string	-- URL
querystring    string	-- Smart Pixel Data
dt	       int	-- Day (PARTITION)
hr	       int	-- Hour (SUB-PARTITION)

1 month = ~6k mappers

Impressions

DESCRIBE radiumone_views.v_union_impression;
-- OR radiumone_json.union_impression_json

-- Most common fields:
advertiser	bigint	-- Advertiser ID
campaign        bigint	-- Campaign ID
line            bigint	-- Line ID
referingurl	string	-- URL
uuid            bigint  -- R1 Cookie ID
dt	        int	-- Day (PARTITION)
hr	        int	-- Hour (SUB-PARTITION)

-- + a lot of other fields

1 month = ~8k mappers

Click

DESCRIBE radiumone_views.v_clicks
-- OR radiumone_json.user_action_click;
-- OR radiumone_json_views.v_user_action_click (it's messy)

-- Most common fields:
banner.advertiser_banner_info.advertiser    bigint  -- Advertiser ID
banner.advertiser_banner_info.campaign      bigint  -- Line ID
banner.advertiser_banner_info.line_item     bigint  -- Strategy ID
user_referingurl                            string  -- URL
user.uuid                                   bigint  -- R1 Cookie ID
timestamp                                   bigint  -- Exact time
dt	                                    int	    -- Day (PARTITION)
hr	                                    int	    -- Hour (SUB-PARTITION)

-- + a lot of other fields

1 month = ~400 mappers

Bid Requests

DESCRIBE radiumone_enriched.generic_bid_request;

-- Most common fields:
request_ip_data.country                     bigint  -- Country ID
url                                         string  -- URL
uuid                                        bigint  -- R1 Cookie ID
timestamp                                   bigint  -- Exact time
dt	                                    int	    -- Day (PARTITION)
hr	                                    int	    -- Hour (SUB-PARTITION)

-- + a TON of other fields (publisher, exchanges,...)

1 hour = ~1k mappers

1 day = ~24k mappers

>1day = You don't

Generic Bid Request is a filtered and aggregated from:

radiumone_json.generic_raw_request

This one is the raw data from bid requests, if you don't find something into generic_bid_request, look there (but you shouldn't)

1 hour = ~1.3k mappers

PO.ST

In radiumone_json, tables with the prefix events_:

  • events_click
  • events_clickback
  • events_share
  • events_shorten
  • events_view

Do what you want with dates, except for Views:

1 month = ~4k mappers

Partners

You don't want to know.

 

Speak to Venatius Prasad if you have an issue.

 

In case of emergency:

  • The database of reference is: radiumone

  • Every partner (krux, liveramp,...) have a custom table in this database. Every table is kinda unique.

If you want to rebuild an audience (worst case scenario, only if you are REALLY desperate):

radiumone.partner_data_flat / radiumone.partner_offline_stat / radiumone.user_slice_update

Side note: Now, you know more about partner data than 99% of R1

Some useful tables

Mauro's URL Store

 

This table is an aggregated storage of URLs from everywhere (Bid, po.st, pixels...)

marnoldi. rfs_url_store_with_country

 

Partitions:

  • dt
  • hr
  • country: ISO2 (FR, US, UK,...)
  • source:
    • 'Enriched_Generic_Bid_Request'
    • 'Pixel_Firings'
    • 'Post_Events_*'

SPB's RFS

 

Research Feature Store: Data around cookies

pai_rnd.research_feature_store

 

Partitions:

  • feature:
    • user_profile_data: User Profile Store, some static features about uuids (country, browser,...)
    • user_history_store: User History Store, each day the number of domains visited on bid requests
    • ...

Dos and don'ts

mostly don'ts...

No partition condition

SELECT * FROM pixel_firings WHERE pixel = 45022

The first thing to do when you don't know a table is to look at its partitions:

SHOW PARTITIONS radiumone_json.pixels_firings

(@R1: Mail from Igor)

Check your mappers

 

This is the most important habit to get:

 

When you launch a query, wait the job to be created and check the mappers on Jobtracker or Job browser (HUE)

SELECT
  t1.uuid,
  t1.ts AS t1_ts,
  t2.ts AS t2_ts
FROM table1 t1
JOIN table2 t2
  ON t1.uuid = t2.uuid
WHERE t1.dt BETWEEN 20171201 AND 20171231
AND t2.dt BETWEEN 20171201 AND 20171231;
SELECT
  t1.uuid,
  t1.ts AS t1_ts,
  t2.ts AS t2_ts
FROM table1 t1
JOIN table2 t2
  ON t1.uuid = t2.uuid
  AND t2.dt BETWEEN 20171201 AND 20171231
  AND t1.dt BETWEEN 20171201 AND 20171231;

On the first example, the query does a full scan on table t1 and t2, join them. And after, partitions conditions occur.

 

On the second example, the query filter the tables t1 and t2 and then join them (filtered version of these tables.

From Jira to SQL

Nota Bene

Lately, EUDATA Jira tickets have history.

You can either find the SQL queries related to a specific Jira here:

Yann's sandbox

or

Kiran's sandbox

Samsung QBR

(part of EUDATA-123)

Number of uniques who looked for a Galaxy S8 during the first two weeks of September. 

Audience pixel: 43817

SELECT
  COUNT(DISTINCT uuid)
FROM radiumone_json.pixel_firings
WHERE dt BETWEEN 20170701 AND 20170715
AND pixel = 43817
AND referingurl LIKE '%www.samsung.com/fr/smartphones/galaxy-s8/%'

Samsung QBR

 

Number of uniques who looked for a Galaxy S8 during the first two weeks of September. 

Audience pixel: 43817

SELECT
  COUNT(*)
FROM radiumone_json.pixel_firings
WHERE dt BETWEEN 20170701 AND 20170715
AND pixel = 43817
AND referingurl LIKE '%www.samsung.com/fr/smartphones/galaxy-s8/%'
GROUP BY uuid

(better)

Samsung QBR

 

Number of uniques who looked for a Galaxy S8 during the first two weeks of September. 

Audience pixel: 43817

SELECT
  COUNT(*)
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
  ON pf.dt BETWEEN 20170501 AND 20170531
  AND pf.pixel = 43817
  AND vui.dt BETWEEN 20170501 AND 20170531
  AND vui.campaign IN (15890, 16754)
  AND pf.uuid = vui.uuid
WHERE pf.referingurl LIKE '%www.samsung.com/fr/smartphones/galaxy-s8/%'
AND pf.timestamp > vui.ts
GROUP BY pf.uuid

Samsung QBR V2.0

Too easy? Let's step up

Number of uniques who looked for a Galaxy S8 during the first two weeks of September after seeing a RadiumOne ad.

Audience pixel: 43817

Campaign IDs: 15890, 16754

EUDATA-220

EUDATA-220

SELECT
  pf.pixel,
  click.creative_id,
  SUM(CAST(REGEXP_EXTRACT(querystring, 'revenue=(.*?)&', 1) AS int)) AS revenue,
  COUNT(pf.uuid) AS nb_people,
  COUNT(*) AS nb_ventes
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_clicks click
  ON pf.uuid = click.uuid
  AND pf.dt BETWEEN 20180101 AND 20180119
  AND click.dt BETWEEN 20180101 AND 20180119
  AND click.creative_id IN (696687, 696686, 696691, 696676, 696678, 696677, 696690, 696688, 
696679, 696680, 696692, 696689, 696681, 696683, 696682, 696684, 696685)
  AND pf.pixel IN (45022, 45023)
WHERE click.ts < pf.timestamp
GROUP BY pf.pixel, click.creative_id

EUDATA-231

EUDATA-231

SELECT
  COUNT(DISTINCT uuid),
  COUNT(*)
FROM
(
	SELECT
	  uuid
	FROM radiumone_json.pixel_firings
	WHERE dt BETWEEN 20171013 AND 20171130
	GROUP BY uuid
	HAVING (SUM(CASE
	  WHEN pixel = 46989 THEN 1
	  ELSE 0
	END) > 0
	AND SUM(CASE
	  WHEN pixel IN (46988, 46990, 46991) THEN 1
	  ELSE 0
	END) > 1)
) tmp

EUDATA-135

EUDATA-135

-- 2
SELECT
  AVG(cpt)
FROM (
SELECT
  COUNT(*) AS cpt
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
  ON pf.dt BETWEEN 20170612 AND 20170631
  AND pf.pixel = 45234
  AND vui.dt BETWEEN 20170612 AND 20170631
  AND vui.uuid = pf.uuid
  AND vui.campaign = 16480
WHERE pf.timestamp > vui.ts
GROUP BY pf.uuid
) tmp
-- 3
SELECT
  AVG(time_diff)
FROM (
SELECT
  MAX(vui.ts) - MIN(pf.timestamp) AS time_diff
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
  ON pf.dt BETWEEN 20170612 AND 20170631
  AND vui.dt BETWEEN 20170612 AND 20170631
  AND pf.pixel = 45234
  AND vui.uuid = pf.uuid
  AND vui.campaign = 16480
WHERE pf.timestamp > vui.ts
GROUP BY pf.uuid
) tmp

EUDATA-135

-- 2
SELECT
  AVG(cpt)
FROM (
SELECT
  COUNT(*) AS cpt
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
  ON pf.dt BETWEEN 20170612 AND 20170631
  AND pf.pixel = 45234
  AND vui.dt BETWEEN 20170612 AND 20170631
  AND vui.uuid = pf.uuid
  AND vui.campaign = 16480
WHERE pf.timestamp > vui.ts
GROUP BY pf.uuid
) tmp
-- 3
SELECT
  AVG(time_diff)
FROM (
SELECT
  MAX(vui.ts) - MIN(pf.timestamp) AS time_diff
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
  ON pf.dt BETWEEN 20170612 AND 20170631
  AND vui.dt BETWEEN 20170612 AND 20170631
  AND pf.pixel = 45234
  AND vui.uuid = pf.uuid
  AND vui.campaign = 16480
WHERE pf.timestamp > vui.ts
GROUP BY pf.uuid
) tmp

Always the same? Too easy?

 

Good, it is the easiest things you would do.

Challenge yourself by understanding these queries:

Onsite Analysis

Sharing Analysis

Ok, what now?

Practice

Do your first queries with me

Hive official confluence

https://cwiki.apache.org/confluence/display/Hive/Home

Awesome ressource

There is a TON of other things to learn about our DW (event_id, the MySQL,...)

BigDataR1

By ycarbonne

BigDataR1

  • 632