Big Data @R1
Outline
- Our struture
- Our data model
-
Some useful tables
-
Dos and don'ts
- From Jira to SQL
Disclaimer
Don't try to remember everything, this presentation is also to store information you might need later.

Down to the rabbit hole
Prelude
Some links:
HUE:
https://dw.gwallet.com/notebook/editor?type=hive
Run your SQL queries
---
JobTracker: http://namenode3.dw.sc.gwallet.com:50030/jobtracker.jsp
To track your running SQL queries
Your new best friend

Some things that are always true:
- A RadiumOne Cookie is named uuid.
-
There is always a partition per day:
- Name: dt
- Format: YYYYMMDD (Christmas is 20171225)
- Timezone: UTC (without DST)
-
Sometimes, you have a sub-partition per hour:
- Name: hr
- Format: YYYYMMDDHH (Christmas 9am is 2017122509)
- Because partitions are hierarchical, you have to refer a condition on dt when you do one on hr (Christmas 9am is: dt=20171225 AND hr=2017122509)
-
Timestamp:
- Name: timestamp or ts
- Timezone: PST (+ DST)
Our structure
99% of the data are in 2 databases:
- radiumone
- radiumone_json (90% there)
These databases have views
- radiumone_views
- radiumone_json_views
1) Aggregated, clearer tables
2) Views have the same name with the prefix "v_"
3) In theory, slower (but you don't really care)
How to look at a view construction:
SHOW CREATE TABLE radiumone_json_views.v_generic_bidLink tables and DSP/DMP saves:
- radiumone_master
- Link tables
- When you have a field like country_id, you have a table radiumone_master.countries where you can find infos related to a country ID (name, ISO2,...)
- A good habit: in the search bar, try to search with the name (advertiser,...)
- DSP and DMP data (campagne info, advertiser info, audience info,...) are stored in this database too (6<X<12 hours refresh)
Tip: radiumone_master.view_strategy_tree (link to everything)
Personal databases
- ycarbonne
- kjavalge
- marnoldi
- ...
Every user has a personal database where he can create, modify and delete tables
Tableau datasources
- adsum
Before going to Tableau data source, the data is staging in this database.
There is a TON of table but sometimes, you can find some hidden gems here.
For examples, if you are looking for data based on latitude/longitude:
DESCRIBE adsum.device_lat_lonOur data model

Pixels
DESCRIBE radiumone_json.pixel_firings;
-- Most common fields:
timestamp bigint -- Exact time
pixel bigint -- Pixel ID
uuid bigint -- R1 Cookie
referingurl string -- URL
querystring string -- Smart Pixel Data
dt int -- Day (PARTITION)
hr int -- Hour (SUB-PARTITION)
1 month = ~6k mappers
Impressions
DESCRIBE radiumone_views.v_union_impression;
-- OR radiumone_json.union_impression_json
-- Most common fields:
advertiser bigint -- Advertiser ID
campaign bigint -- Campaign ID
line bigint -- Line ID
referingurl string -- URL
uuid bigint -- R1 Cookie ID
dt int -- Day (PARTITION)
hr int -- Hour (SUB-PARTITION)
-- + a lot of other fields1 month = ~8k mappers
Click
DESCRIBE radiumone_views.v_clicks
-- OR radiumone_json.user_action_click;
-- OR radiumone_json_views.v_user_action_click (it's messy)
-- Most common fields:
banner.advertiser_banner_info.advertiser bigint -- Advertiser ID
banner.advertiser_banner_info.campaign bigint -- Line ID
banner.advertiser_banner_info.line_item bigint -- Strategy ID
user_referingurl string -- URL
user.uuid bigint -- R1 Cookie ID
timestamp bigint -- Exact time
dt int -- Day (PARTITION)
hr int -- Hour (SUB-PARTITION)
-- + a lot of other fields1 month = ~400 mappers
Bid Requests
DESCRIBE radiumone_enriched.generic_bid_request;
-- Most common fields:
request_ip_data.country bigint -- Country ID
url string -- URL
uuid bigint -- R1 Cookie ID
timestamp bigint -- Exact time
dt int -- Day (PARTITION)
hr int -- Hour (SUB-PARTITION)
-- + a TON of other fields (publisher, exchanges,...)1 hour = ~1k mappers
1 day = ~24k mappers
>1day = You don't
Generic Bid Request is a filtered and aggregated from:
radiumone_json.generic_raw_request
This one is the raw data from bid requests, if you don't find something into generic_bid_request, look there (but you shouldn't)
1 hour = ~1.3k mappers
PO.ST
In radiumone_json, tables with the prefix events_:
- events_click
- events_clickback
- events_share
- events_shorten
- events_view
Do what you want with dates, except for Views:
1 month = ~4k mappers
Partners
You don't want to know.
Speak to Venatius Prasad if you have an issue.
In case of emergency:
-
The database of reference is: radiumone
- Every partner (krux, liveramp,...) have a custom table in this database. Every table is kinda unique.
If you want to rebuild an audience (worst case scenario, only if you are REALLY desperate):
radiumone.partner_data_flat / radiumone.partner_offline_stat / radiumone.user_slice_update
Side note: Now, you know more about partner data than 99% of R1
Some useful tables
Mauro's URL Store
This table is an aggregated storage of URLs from everywhere (Bid, po.st, pixels...)
marnoldi. rfs_url_store_with_country
Partitions:
- dt
- hr
- country: ISO2 (FR, US, UK,...)
-
source:
- 'Enriched_Generic_Bid_Request'
- 'Pixel_Firings'
- 'Post_Events_*'
SPB's RFS
Research Feature Store: Data around cookies
pai_rnd.research_feature_store
Partitions:
- feature:
- user_profile_data: User Profile Store, some static features about uuids (country, browser,...)
- user_history_store: User History Store, each day the number of domains visited on bid requests
- ...
Dos and don'ts
mostly don'ts...
No partition condition

SELECT * FROM pixel_firings WHERE pixel = 45022The first thing to do when you don't know a table is to look at its partitions:
SHOW PARTITIONS radiumone_json.pixels_firings(@R1: Mail from Igor)
Check your mappers
This is the most important habit to get:
When you launch a query, wait the job to be created and check the mappers on Jobtracker or Job browser (HUE)
Joins occurs before where clauses
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
SELECT
t1.uuid,
t1.ts AS t1_ts,
t2.ts AS t2_ts
FROM table1 t1
JOIN table2 t2
ON t1.uuid = t2.uuid
WHERE t1.dt BETWEEN 20171201 AND 20171231
AND t2.dt BETWEEN 20171201 AND 20171231;SELECT
t1.uuid,
t1.ts AS t1_ts,
t2.ts AS t2_ts
FROM table1 t1
JOIN table2 t2
ON t1.uuid = t2.uuid
AND t2.dt BETWEEN 20171201 AND 20171231
AND t1.dt BETWEEN 20171201 AND 20171231;On the first example, the query does a full scan on table t1 and t2, join them. And after, partitions conditions occur.
On the second example, the query filter the tables t1 and t2 and then join them (filtered version of these tables.
From Jira to SQL
Nota Bene
Lately, EUDATA Jira tickets have history.
You can either find the SQL queries related to a specific Jira here:
or
Samsung QBR
(part of EUDATA-123)
Number of uniques who looked for a Galaxy S8 during the first two weeks of September.
Audience pixel: 43817
SELECT
COUNT(DISTINCT uuid)
FROM radiumone_json.pixel_firings
WHERE dt BETWEEN 20170701 AND 20170715
AND pixel = 43817
AND referingurl LIKE '%www.samsung.com/fr/smartphones/galaxy-s8/%'Samsung QBR
Number of uniques who looked for a Galaxy S8 during the first two weeks of September.
Audience pixel: 43817
SELECT
COUNT(*)
FROM radiumone_json.pixel_firings
WHERE dt BETWEEN 20170701 AND 20170715
AND pixel = 43817
AND referingurl LIKE '%www.samsung.com/fr/smartphones/galaxy-s8/%'
GROUP BY uuid(better)
Samsung QBR
Number of uniques who looked for a Galaxy S8 during the first two weeks of September.
Audience pixel: 43817
SELECT
COUNT(*)
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
ON pf.dt BETWEEN 20170501 AND 20170531
AND pf.pixel = 43817
AND vui.dt BETWEEN 20170501 AND 20170531
AND vui.campaign IN (15890, 16754)
AND pf.uuid = vui.uuid
WHERE pf.referingurl LIKE '%www.samsung.com/fr/smartphones/galaxy-s8/%'
AND pf.timestamp > vui.ts
GROUP BY pf.uuidSamsung QBR V2.0
Too easy? Let's step up
Number of uniques who looked for a Galaxy S8 during the first two weeks of September after seeing a RadiumOne ad.
Audience pixel: 43817
Campaign IDs: 15890, 16754
EUDATA-220

EUDATA-220

SELECT
pf.pixel,
click.creative_id,
SUM(CAST(REGEXP_EXTRACT(querystring, 'revenue=(.*?)&', 1) AS int)) AS revenue,
COUNT(pf.uuid) AS nb_people,
COUNT(*) AS nb_ventes
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_clicks click
ON pf.uuid = click.uuid
AND pf.dt BETWEEN 20180101 AND 20180119
AND click.dt BETWEEN 20180101 AND 20180119
AND click.creative_id IN (696687, 696686, 696691, 696676, 696678, 696677, 696690, 696688,
696679, 696680, 696692, 696689, 696681, 696683, 696682, 696684, 696685)
AND pf.pixel IN (45022, 45023)
WHERE click.ts < pf.timestamp
GROUP BY pf.pixel, click.creative_idEUDATA-231

EUDATA-231

SELECT
COUNT(DISTINCT uuid),
COUNT(*)
FROM
(
SELECT
uuid
FROM radiumone_json.pixel_firings
WHERE dt BETWEEN 20171013 AND 20171130
GROUP BY uuid
HAVING (SUM(CASE
WHEN pixel = 46989 THEN 1
ELSE 0
END) > 0
AND SUM(CASE
WHEN pixel IN (46988, 46990, 46991) THEN 1
ELSE 0
END) > 1)
) tmpEUDATA-135

EUDATA-135

-- 2
SELECT
AVG(cpt)
FROM (
SELECT
COUNT(*) AS cpt
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
ON pf.dt BETWEEN 20170612 AND 20170631
AND pf.pixel = 45234
AND vui.dt BETWEEN 20170612 AND 20170631
AND vui.uuid = pf.uuid
AND vui.campaign = 16480
WHERE pf.timestamp > vui.ts
GROUP BY pf.uuid
) tmp
-- 3
SELECT
AVG(time_diff)
FROM (
SELECT
MAX(vui.ts) - MIN(pf.timestamp) AS time_diff
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
ON pf.dt BETWEEN 20170612 AND 20170631
AND vui.dt BETWEEN 20170612 AND 20170631
AND pf.pixel = 45234
AND vui.uuid = pf.uuid
AND vui.campaign = 16480
WHERE pf.timestamp > vui.ts
GROUP BY pf.uuid
) tmpEUDATA-135

-- 2
SELECT
AVG(cpt)
FROM (
SELECT
COUNT(*) AS cpt
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
ON pf.dt BETWEEN 20170612 AND 20170631
AND pf.pixel = 45234
AND vui.dt BETWEEN 20170612 AND 20170631
AND vui.uuid = pf.uuid
AND vui.campaign = 16480
WHERE pf.timestamp > vui.ts
GROUP BY pf.uuid
) tmp
-- 3
SELECT
AVG(time_diff)
FROM (
SELECT
MAX(vui.ts) - MIN(pf.timestamp) AS time_diff
FROM radiumone_json.pixel_firings pf
JOIN radiumone_views.v_union_impression vui
ON pf.dt BETWEEN 20170612 AND 20170631
AND vui.dt BETWEEN 20170612 AND 20170631
AND pf.pixel = 45234
AND vui.uuid = pf.uuid
AND vui.campaign = 16480
WHERE pf.timestamp > vui.ts
GROUP BY pf.uuid
) tmpAlways the same? Too easy?
Good, it is the easiest things you would do.
Challenge yourself by understanding these queries:
Ok, what now?
Practice

Do your first queries with me
Hive official confluence
https://cwiki.apache.org/confluence/display/Hive/Home
Awesome ressource
There is a TON of other things to learn about our DW (event_id, the MySQL,...)

BigDataR1
By ycarbonne
BigDataR1
- 632