Transformation and querying data

COMP63301 Data Engineering Concepts

 

Stian Soiland-Reyes

Intended Learning Outcomes

  1. Understanding the role of queries in data transformation
  2. Ability to combine data sources using table joins
  3. Reflect on optimisation challenges for queries
  4. Choose between streaming and batch querying
  5. Use of SQL beyond relational databases

SQL revisited

Jesse Lambertson, Christopher Erdmann, Kristin Lee, Julika Mimkes, Lise Doucette, Jordan Pedersen, Jacqueline Frisina, Jesse Lambertson, Sarah Lynn Fisher, Elizabeth McAulay, Siobhan Dunlop, Ian van der Linde, Jens Nieschulze, Jessica Simpson, RonHLEb, & Scott Carl Peterson (2023):
LibraryCarpentry/lc-sql: Library Carpentry: Introduction to SQL 2023-05 (release 2023.05).
Zenodo. https://doi.org/10.5281/zenodo.7886952

SELECT *
FROM Articles;


 

id Title Authors DOI URL Subjects ISSNs Citation LanguageId LicenseId Author_Count First_Author Citation_Count Day Month Year
0 The Fisher Thermodynamics of Quasi-Probabilities Flavia Pennini|Angelo Plastino 10.3390/e17127853 https://doaj.org/article/b75e8d5cca3f46cbbd63e... Fisher information|quasi-probabilities|complem... 1099-4300 Entropy, Vol 17, Iss 12, Pp 7848-7858 (2015) 1 1 2 Flavia Pennini 4 1 11 2015
1 Aflatoxin Contamination of the Milk Supply: A ... Naveed Aslam|Peter C. Wynn 10.3390/agriculture5041172 https://doaj.org/article/0edc5af6672641c0bd456... aflatoxins|AFM1|AFB1|milk marketing chains|hep... 2077-0472 Agriculture (Basel), Vol 5, Iss 4, Pp 1172-118... 1 1 2 Naveed Aslam 5 1 11 2015
2 Metagenomic Analysis of Upwelling-Affected Bra... Rafael R. C. Cuadrat|Juliano C. Cury|Alberto M... 10.3390/ijms161226101 https://doaj.org/article/d9fe469f75a0442382b84... PKS|NRPS|metagenomics|environmental genomics|u... 1422-0067 International Journal of Molecular Sciences, V... 1 1 3 Rafael R. C. Cuadrat 8 1 11 2015
3 Synthesis and Reactivity of a Cerium(III) Scor... Fabrizio Ortu|Hao Zhu|Marie-Emmanuelle Boulon|... 10.3390/inorganics3040534 https://doaj.org/article/95606ed39deb4f43b96f7... lanthanide|cerium|scorpionate|tris(pyrazolyl)b... 2304-6740 Inorganics (Basel), Vol 3, Iss 4, Pp 534-553 (... 1 1 4 Fabrizio Ortu 5 1 11 2015
4 Performance and Uncertainty Evaluation of Snow... Magali Troin|Richard Arsenault|François Brissette 10.3390/hydrology2040289 https://doaj.org/article/18b1d70730d44573ab5c2... snow models|hydrological models|snowmelt|uncer... 2306-5338 Hydrology, Vol 2, Iss 4, Pp 289-317 (2015) 1 1 3 Magali Troin 4 1 11 2015

SELECT  retrieves data from one (or more) tables

SELECT Title, Authors, ISSNs, Year
FROM Articles;


 

Title Authors ISSNs Year
The Fisher Thermodynamics of Quasi-Probabilities Flavia Pennini|Angelo Plastino 1099-4300 2015
Aflatoxin Contamination of the Milk Supply: A ... Naveed Aslam|Peter C. Wynn 2077-0472 2015
Metagenomic Analysis of Upwelling-Affected Bra... Rafael R. C. Cuadrat|Juliano C. Cury|Alberto M... 1422-0067 2015
Synthesis and Reactivity of a Cerium(III) Scor... Fabrizio Ortu|Hao Zhu|Marie-Emmanuelle Boulon|... 2304-6740 2015
Performance and Uncertainty Evaluation of Snow... Magali Troin|Richard Arsenault|François Brissette 2306-5338 2015

SELECT  can extract particular columns (data attributes)

SELECT Title, Authors, ISSNs, Year
FROM Articles
WHERE ISSNs = '2056-9890' AND Month=2 AND YEAR=2015 ;

SELECT  can filter results based on boolean expressions over columns

Crystal structure of (1S,2R,4R,9S,11S,12R)-9α-... Ahmed Benharref|Mohamed Akssira|Lahcen El Amma... 2056-9890 2015
Crystal structure of β-d,l-allose Tomohiko Ishii|Tatsuya Senoo|Taro Kozakai|Kazu... 2056-9890 2015
Crystal structure of 2-[4-(4-chlorophenyl)-1-(... Ísmail Çelik|Mehmet Akkurt|Aliasghar Jarrahpou... 2056-9890 2015
Crystal structure of (2S/2R,3S/3R)-3-hydroxy-2... Roumaissa Belguedj|Sofiane Bouacida|Hocine Mer... 2056-9890 2015
Crystal structure of 1-[(1-methyl-5-nitro-1H-i... Roumaissa Belguedj|Abdelmalek Bouraiou|Hocine ... 2056-9890 2015
... ... ... ...
Crystal structures of 2,6-bis[(1H-1,2,4-triazo... Marites A. Guino-o|Matthew J. Folstad|Daron E.... 2056-9890 2015
Crystal structure of (2-methyl-1-phenylsulfony... M. Umadevi|V. Saravanan|R. Yamuna|A. K. Mohana... 2056-9890 2015
Crystal structure of bis(2,2′-bipyridine)[N′-(... Asami Mori|Takayoshi Suzuki|Kiyohiko Nakajima 2056-9890 2015
Crystal structure and thermal behaviour of pyr... Selvarasu Muthulakshmi|Doraisamyraja Kalaivani 2056-9890 2015
Crystal structure of 3-(2,2-dibromoacetyl)-4-h... Ameni Brahmia|Afef Ghouili|Rached Ben Hassen 2056-9890 2015

96 rows × 4 columns

 

 

SELECT Title, Journal_Title
FROM articles
JOIN journals
ON articles.ISSNs = journals.ISSNs;

SELECT  can JOIN multiple tables ON an expression

SELECT Title, Journal_Title
FROM articles
JOIN journals
ON articles.ISSNs = journals.ISSNs;

Understanding the database schema is essential to write the correct query

 

The Fisher Thermodynamics of Quasi-Probabilities Entropy
Aflatoxin Contamination of the Milk Supply: A ... Agriculture
Metagenomic Analysis of Upwelling-Affected Bra... International Journal of Molecular Sciences
Synthesis and Reactivity of a Cerium(III) Scor... Inorganics
Performance and Uncertainty Evaluation of Snow... Hydrology
... ...
Crystal structure of [3-(1H-benzimidazol-2-yl)... Acta Crystallographica Section E Crystallograp...
Crystal structure of bis(3-bromopyridine-κN)bi... Acta Crystallographica Section E Crystallograp...
Crystal structure of 4,4′-(ethane-1,2-diyl)bis... Acta Crystallographica Section E Crystallograp...
Crystal structure of (Z)-4-[1-(4-acetylanilino... Acta Crystallographica Section E Crystallograp...
Metagenomic Analysis of Upwelling-Affected Bra... International Journal of Molecular Sciences

1001 rows × 2 columns

 

 

SELECT Title, Journal_Title
FROM articles
JOIN journals;

Careless use of database joins may lead to row explosion or inefficient database load

The Fisher Thermodynamics of Quasi-Probabilities Acta Crystallographica Section E Crystallograp...
The Fisher Thermodynamics of Quasi-Probabilities Agriculture
The Fisher Thermodynamics of Quasi-Probabilities Agronomy
The Fisher Thermodynamics of Quasi-Probabilities Animals
The Fisher Thermodynamics of Quasi-Probabilities Applied Sciences
... ...
Metagenomic Analysis of Upwelling-Affected Bra... Toxics
Metagenomic Analysis of Upwelling-Affected Bra... Toxins
Metagenomic Analysis of Upwelling-Affected Bra... Vaccines
Metagenomic Analysis of Upwelling-Affected Bra... Viruses
Metagenomic Analysis of Upwelling-Affected Bra... Water

51051 rows × 2 columns

Life of a query

SQL query
issued

Parsing and conversion to bytecode

Query planning and optimization

Query execution

Results returned

Adapted from Figure 8.2 in Fundamentals of Data engineering

🧑🏾‍💻

📃

🧐

👷🏽‍♀️

🧑🏿‍🍳

SQL is a declarative language

SQL instructions says what you want, not how

Relational databases are able to optimize the query

e.g. which order to join tables or which rows to examine first

Queries are (largely) independent of implementation details (e.g. file vs memory vs distributed)

Restructuring the relational table design can help/complicate query writing and/or query execution

 

Query optimization

How can we improve query performance?

  • Prejoin data: Denormalisation or materialised views
  • Reduce complexity of query
    • Multiple simpler queries may be faster (but use more data transfer)
    • But: Avoid full-table scan SELECT *
  • Add indexes on frequently queried columns
  • Use EXPLAIN to reveal the query plan
EXPLAIN SELECT Title, Journal_Title
FROM articles
JOIN journals 
ON (articles.ISSNs = journals.ISSNs)
ORDER BY Journal_Title;
0 Init 0 32 0 None 0 None
1 SorterOpen 2 4 0 k(1,B) 0 None
2 OpenRead 0 5 0 7 0 None
3 OpenRead 1 3 0 5 0 None
4 Rewind 0 24 0 None 0 None
5 Once 0 14 0 None 0 None
6 OpenAutoindex 3 3 0 k(3,B,,) 0 None
7 Rewind 1 14 0 None 0 None
8 Column 1 2 2 None 0 None
9 Column 1 4 3 None 0 None
10 Rowid 1 4 0 None 0 None
11 MakeRecord 2 3 1 None 0 None
12 IdxInsert 3 1 0 None 16 None
13 Next 1 8 0 None 3 None
14 Column 0 6 5 None 0 None
15 IsNull 5 23 0 None 0 None
16 SeekGE 3 23 5 1 0 None
17 IdxGT 3 23 5 1 0 None
18 Column 0 1 7 None 0 None
19 Column 3 1 6 None 0 None
20 MakeRecord 6 2 9 None 0 None
21 SorterInsert 2 9 6 2 0 None
22 Next 3 17 0 None 0 None
23 Next 0 5 0 None 1 None
24 OpenPseudo 4 10 4 None 0 None
25 SorterSort 2 31 0 None 0 None
26 SorterData 2 10 4 None 0 None
27 Column 4 0 8 None 0 None
28 Column 4 1 7 None 0 None
29 ResultRow 7 2 0 None 0 None
30 SorterNext 2 26 0 None 0 None
31 Halt 0 0 0 None 0 None
32 Transaction 0 0 38 0 1 None
33 Goto 0 1 0 None 0 None

 

 

0 Init 0 23 0 None 0 None
1 OpenRead 0 5 0 7 0 None
2 OpenRead 1 3 0 5 0 None
3 Rewind 0 22 0 None 0 None
4 Once 0 13 0 None 0 None
5 OpenAutoindex 2 3 0 k(3,B,,) 0 None
6 Rewind 1 13 0 None 0 None
7 Column 1 2 2 None 0 None
8 Column 1 4 3 None 0 None
9 Rowid 1 4 0 None 0 None
10 MakeRecord 2 3 1 None 0 None
11 IdxInsert 2 1 0 None 16 None
12 Next 1 7 0 None 3 None
13 Column 0 6 5 None 0 None
14 IsNull 5 21 0 None 0 None
15 SeekGE 2 21 5 1 0 None
16 IdxGT 2 21 5 1 0 None
17 Column 0 1 6 None 0 None
18 Column 2 1 7 None 0 None
19 ResultRow 6 2 0 None 0 None
20 Next 2 16 0 None 0 None
21 Next 0 4 0 None 1 None
22 Halt 0 0 0 None 0 None
23 Transaction 0 0 38 0 1 None
24 Goto 0 1 0 None 0 None

 

 

EXPLAIN SELECT Title, Journal_Title
FROM articles
JOIN journals 
ON (articles.ISSNs = journals.ISSNs);

Using EXPLAIN to reveal the query plan and compare equivalent queries

How can we improve query performance? #2

  • Consider impact of transactions and concurrency
  • Vacuum dead records
  • Ensure query results can be cached
    • Avoid many almost-same queries
    • Use prepared statements with parameters, don't embed values 
SELECT Title, Authors, ISSNs, Year
FROM Articles
WHERE ISSNs = '2056-9890' AND Month=2 AND YEAR=2015 ;
pd.read_sql_query("""
SELECT Title, Authors, ISSNs, Year
FROM Articles
WHERE ISSNs=? AND Month=? AND YEAR=? ;
""", db, 
params=['2056-9890', 2, 2015]).head()

Querying data streams

Streaming architectures

  • Kappa architecture treat all data as continuous streams without a separate batch layer

Figure 3.4, 3.5 from Fundamentals of Data Engineering

  • Lambda architecture has separate stream and batch processing, serving separate query methods

Querying in Lambda architecture

Stream query

Batch query

Data processed in real-time

high speed, but incomplete data

Immediate response to events

"Top up" batch views with newest data

Examples: Hadoop, Snowflake, BigQuery

Operates on complete data (in batches)

high latency, with complete data

Historical accuracy

Each batch update pre-compute full views

Examples: Spark, Kafka, Kinesis

Named after λ calculus, a functional mathematical system

Smart Energy example

Batch query

Calculate cost for invoicing by 30 minute readings

Predict which periods are busiest

Latency: 60 minutes (or month!)

Accuracy: Gaps aligned by next readings

Stream query

Immediate feedback, what is using power?

Power generators can react quickly to sudden spikes (e.g. electric car charging)

Latency: 1 minute

Accuracy: Gaps ignored e.g. WiFi is down

Querying in Kappa architecture

Real-time data processed directly, transforming on the fly

Data streams treated like immutable logs

New and historical data handled uniformly (replay)

Simplifies processing, avoiding duality of batch/stream (ensure business logic is consistent)

Low latency, but full reprocessing can be expensive

Examples: Kafka, Flink, Storm

Greek letter κ supersedes λ

Uber event processing

Challenge: backfilling of streaming workloads using a unified codebase.

Goal: Replay Uber rider sessions for accurate monthly business analysis

Example: Rider waits to rate a driver until their next Uber app session

Solution: Use Kappa architecture and process in Windows to allow late-arriving and out-of-order events

Benefit: Switching between streaming and batch job is just a switch of data source (live vs replay)

Querying streams with “SQL”

HiveQL: SQL-like queries on streams

Text

CREATE EXTERNAL TABLE raw_stream_events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_timestamp TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/hdfs/streaming/events';

Ingest streaming data from Kafka into Hive

CREATE TABLE processed_events AS
SELECT 
  e.event_id,
  e.user_id,
  e.event_type,
  e.event_timestamp,
  u.region
FROM raw_stream_events e
JOIN user_dim u ON e.user_id = u.user_id
WHERE e.event_timestamp IS NOT NULL;

Process streaming events (~ materialised view)

event_id user_id event_type event_timestamp region
3133 alice81 purchase 17:52 Europe
not here yet!
{ "event_id": 3133, "user_id": "alice81", "event_type": "purchase", … }
event_id user_id event_data
3133 alice81 { … }
not here yet!

Structured Streaming with Spark SQL and DataFrames

spark.read.json("s3n://...")
  .registerTempTable("json")
results = spark.sql(
  """SELECT *
     FROM people
     JOIN json ...""")
results = spark.sql(
  "SELECT * FROM people")
names = results.map(lambda p: p.name)
df.select("device").where("signal > 10")

Any Spark data source can be queried

SQL can be combined with data frame operations

DataFrames have n SQL-like methods

Key points on transformation

  • SQL can generate derived data, joining existing sources
  • Heavy queries may need to be optimised
  • Batch queries pre-cook reliable transformations
  • Stream queries for immediate (but partial) transformation
  • Historical events can be replayed as a stream
  • Streaming frameworks support SQL-like transformation  from non-relational sources

COMP63301 Integration #2

By Stian Soiland-Reyes

COMP63301 Integration #2

Lecture in COMP63301 Data Engineering Concepts at The University of Manchester.

  • 111