COMP63301 Data Engineering Concepts
Stian Soiland-Reyes
This work is licensed under a
Creative Commons Attribution 4.0 International License.
Following examples are adopted from Library Carpentry SQL tutorial
https://librarycarpentry.github.io/lc-sql/
Jesse Lambertson, Christopher Erdmann, Kristin Lee, Julika Mimkes, Lise Doucette, Jordan Pedersen, Jacqueline Frisina, Jesse Lambertson, Sarah Lynn Fisher, Elizabeth McAulay, Siobhan Dunlop, Ian van der Linde, Jens Nieschulze, Jessica Simpson, RonHLEb, & Scott Carl Peterson (2023):
LibraryCarpentry/lc-sql: Library Carpentry: Introduction to SQL 2023-05 (release 2023.05).
Zenodo. https://doi.org/10.5281/zenodo.7886952
SELECT *
FROM Articles;
| id | Title | Authors | DOI | URL | Subjects | ISSNs | Citation | LanguageId | LicenseId | Author_Count | First_Author | Citation_Count | Day | Month | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Fisher Thermodynamics of Quasi-Probabilities | Flavia Pennini|Angelo Plastino | 10.3390/e17127853 | https://doaj.org/article/b75e8d5cca3f46cbbd63e... | Fisher information|quasi-probabilities|complem... | 1099-4300 | Entropy, Vol 17, Iss 12, Pp 7848-7858 (2015) | 1 | 1 | 2 | Flavia Pennini | 4 | 1 | 11 | 2015 |
| 1 | Aflatoxin Contamination of the Milk Supply: A ... | Naveed Aslam|Peter C. Wynn | 10.3390/agriculture5041172 | https://doaj.org/article/0edc5af6672641c0bd456... | aflatoxins|AFM1|AFB1|milk marketing chains|hep... | 2077-0472 | Agriculture (Basel), Vol 5, Iss 4, Pp 1172-118... | 1 | 1 | 2 | Naveed Aslam | 5 | 1 | 11 | 2015 |
| 2 | Metagenomic Analysis of Upwelling-Affected Bra... | Rafael R. C. Cuadrat|Juliano C. Cury|Alberto M... | 10.3390/ijms161226101 | https://doaj.org/article/d9fe469f75a0442382b84... | PKS|NRPS|metagenomics|environmental genomics|u... | 1422-0067 | International Journal of Molecular Sciences, V... | 1 | 1 | 3 | Rafael R. C. Cuadrat | 8 | 1 | 11 | 2015 |
| 3 | Synthesis and Reactivity of a Cerium(III) Scor... | Fabrizio Ortu|Hao Zhu|Marie-Emmanuelle Boulon|... | 10.3390/inorganics3040534 | https://doaj.org/article/95606ed39deb4f43b96f7... | lanthanide|cerium|scorpionate|tris(pyrazolyl)b... | 2304-6740 | Inorganics (Basel), Vol 3, Iss 4, Pp 534-553 (... | 1 | 1 | 4 | Fabrizio Ortu | 5 | 1 | 11 | 2015 |
| 4 | Performance and Uncertainty Evaluation of Snow... | Magali Troin|Richard Arsenault|François Brissette | 10.3390/hydrology2040289 | https://doaj.org/article/18b1d70730d44573ab5c2... | snow models|hydrological models|snowmelt|uncer... | 2306-5338 | Hydrology, Vol 2, Iss 4, Pp 289-317 (2015) | 1 | 1 | 3 | Magali Troin | 4 | 1 | 11 | 2015 |
SELECT retrieves data from one (or more) tables
SELECT Title, Authors, ISSNs, Year
FROM Articles;
| Title | Authors | ISSNs | Year |
|---|---|---|---|
| The Fisher Thermodynamics of Quasi-Probabilities | Flavia Pennini|Angelo Plastino | 1099-4300 | 2015 |
| Aflatoxin Contamination of the Milk Supply: A ... | Naveed Aslam|Peter C. Wynn | 2077-0472 | 2015 |
| Metagenomic Analysis of Upwelling-Affected Bra... | Rafael R. C. Cuadrat|Juliano C. Cury|Alberto M... | 1422-0067 | 2015 |
| Synthesis and Reactivity of a Cerium(III) Scor... | Fabrizio Ortu|Hao Zhu|Marie-Emmanuelle Boulon|... | 2304-6740 | 2015 |
| Performance and Uncertainty Evaluation of Snow... | Magali Troin|Richard Arsenault|François Brissette | 2306-5338 | 2015 |
SELECT can extract particular columns (data attributes)
SELECT Title, Authors, ISSNs, Year
FROM Articles
WHERE ISSNs = '2056-9890' AND Month=2 AND YEAR=2015 ;SELECT can filter results based on boolean expressions over columns
| Crystal structure of (1S,2R,4R,9S,11S,12R)-9α-... | Ahmed Benharref|Mohamed Akssira|Lahcen El Amma... | 2056-9890 | 2015 |
| Crystal structure of β-d,l-allose | Tomohiko Ishii|Tatsuya Senoo|Taro Kozakai|Kazu... | 2056-9890 | 2015 |
| Crystal structure of 2-[4-(4-chlorophenyl)-1-(... | Ísmail Çelik|Mehmet Akkurt|Aliasghar Jarrahpou... | 2056-9890 | 2015 |
| Crystal structure of (2S/2R,3S/3R)-3-hydroxy-2... | Roumaissa Belguedj|Sofiane Bouacida|Hocine Mer... | 2056-9890 | 2015 |
| Crystal structure of 1-[(1-methyl-5-nitro-1H-i... | Roumaissa Belguedj|Abdelmalek Bouraiou|Hocine ... | 2056-9890 | 2015 |
| ... | ... | ... | ... |
| Crystal structures of 2,6-bis[(1H-1,2,4-triazo... | Marites A. Guino-o|Matthew J. Folstad|Daron E.... | 2056-9890 | 2015 |
| Crystal structure of (2-methyl-1-phenylsulfony... | M. Umadevi|V. Saravanan|R. Yamuna|A. K. Mohana... | 2056-9890 | 2015 |
| Crystal structure of bis(2,2′-bipyridine)[N′-(... | Asami Mori|Takayoshi Suzuki|Kiyohiko Nakajima | 2056-9890 | 2015 |
| Crystal structure and thermal behaviour of pyr... | Selvarasu Muthulakshmi|Doraisamyraja Kalaivani | 2056-9890 | 2015 |
| Crystal structure of 3-(2,2-dibromoacetyl)-4-h... | Ameni Brahmia|Afef Ghouili|Rached Ben Hassen | 2056-9890 | 2015 |
96 rows × 4 columns
SELECT Title, Journal_Title
FROM articles
JOIN journals
ON articles.ISSNs = journals.ISSNs;SELECT can JOIN multiple tables ON an expression
SELECT Title, Journal_Title
FROM articles
JOIN journals
ON articles.ISSNs = journals.ISSNs;Understanding the database schema is essential to write the correct query
| The Fisher Thermodynamics of Quasi-Probabilities | Entropy |
| Aflatoxin Contamination of the Milk Supply: A ... | Agriculture |
| Metagenomic Analysis of Upwelling-Affected Bra... | International Journal of Molecular Sciences |
| Synthesis and Reactivity of a Cerium(III) Scor... | Inorganics |
| Performance and Uncertainty Evaluation of Snow... | Hydrology |
| ... | ... |
| Crystal structure of [3-(1H-benzimidazol-2-yl)... | Acta Crystallographica Section E Crystallograp... |
| Crystal structure of bis(3-bromopyridine-κN)bi... | Acta Crystallographica Section E Crystallograp... |
| Crystal structure of 4,4′-(ethane-1,2-diyl)bis... | Acta Crystallographica Section E Crystallograp... |
| Crystal structure of (Z)-4-[1-(4-acetylanilino... | Acta Crystallographica Section E Crystallograp... |
| Metagenomic Analysis of Upwelling-Affected Bra... | International Journal of Molecular Sciences |
1001 rows × 2 columns
SELECT Title, Journal_Title
FROM articles
JOIN journals;Careless use of database joins may lead to row explosion or inefficient database load
| The Fisher Thermodynamics of Quasi-Probabilities | Acta Crystallographica Section E Crystallograp... |
| The Fisher Thermodynamics of Quasi-Probabilities | Agriculture |
| The Fisher Thermodynamics of Quasi-Probabilities | Agronomy |
| The Fisher Thermodynamics of Quasi-Probabilities | Animals |
| The Fisher Thermodynamics of Quasi-Probabilities | Applied Sciences |
| ... | ... |
| Metagenomic Analysis of Upwelling-Affected Bra... | Toxics |
| Metagenomic Analysis of Upwelling-Affected Bra... | Toxins |
| Metagenomic Analysis of Upwelling-Affected Bra... | Vaccines |
| Metagenomic Analysis of Upwelling-Affected Bra... | Viruses |
| Metagenomic Analysis of Upwelling-Affected Bra... | Water |
51051 rows × 2 columns
SQL query
issued
Parsing and conversion to bytecode
Query planning and optimization
Query execution
Results returned
Adapted from Figure 8.2 in Fundamentals of Data engineering
🧑🏾💻
📃
🧐
👷🏽♀️
🧑🏿🍳
SQL instructions says what you want, not how
Relational databases are able to optimize the query
e.g. which order to join tables or which rows to examine first
Queries are (largely) independent of implementation details (e.g. file vs memory vs distributed)
Restructuring the relational table design can help/complicate query writing and/or query execution
SELECT *EXPLAIN to reveal the query plan
EXPLAIN SELECT Title, Journal_Title
FROM articles
JOIN journals
ON (articles.ISSNs = journals.ISSNs)
ORDER BY Journal_Title;| 0 | Init | 0 | 32 | 0 | None | 0 | None |
| 1 | SorterOpen | 2 | 4 | 0 | k(1,B) | 0 | None |
| 2 | OpenRead | 0 | 5 | 0 | 7 | 0 | None |
| 3 | OpenRead | 1 | 3 | 0 | 5 | 0 | None |
| 4 | Rewind | 0 | 24 | 0 | None | 0 | None |
| 5 | Once | 0 | 14 | 0 | None | 0 | None |
| 6 | OpenAutoindex | 3 | 3 | 0 | k(3,B,,) | 0 | None |
| 7 | Rewind | 1 | 14 | 0 | None | 0 | None |
| 8 | Column | 1 | 2 | 2 | None | 0 | None |
| 9 | Column | 1 | 4 | 3 | None | 0 | None |
| 10 | Rowid | 1 | 4 | 0 | None | 0 | None |
| 11 | MakeRecord | 2 | 3 | 1 | None | 0 | None |
| 12 | IdxInsert | 3 | 1 | 0 | None | 16 | None |
| 13 | Next | 1 | 8 | 0 | None | 3 | None |
| 14 | Column | 0 | 6 | 5 | None | 0 | None |
| 15 | IsNull | 5 | 23 | 0 | None | 0 | None |
| 16 | SeekGE | 3 | 23 | 5 | 1 | 0 | None |
| 17 | IdxGT | 3 | 23 | 5 | 1 | 0 | None |
| 18 | Column | 0 | 1 | 7 | None | 0 | None |
| 19 | Column | 3 | 1 | 6 | None | 0 | None |
| 20 | MakeRecord | 6 | 2 | 9 | None | 0 | None |
| 21 | SorterInsert | 2 | 9 | 6 | 2 | 0 | None |
| 22 | Next | 3 | 17 | 0 | None | 0 | None |
| 23 | Next | 0 | 5 | 0 | None | 1 | None |
| 24 | OpenPseudo | 4 | 10 | 4 | None | 0 | None |
| 25 | SorterSort | 2 | 31 | 0 | None | 0 | None |
| 26 | SorterData | 2 | 10 | 4 | None | 0 | None |
| 27 | Column | 4 | 0 | 8 | None | 0 | None |
| 28 | Column | 4 | 1 | 7 | None | 0 | None |
| 29 | ResultRow | 7 | 2 | 0 | None | 0 | None |
| 30 | SorterNext | 2 | 26 | 0 | None | 0 | None |
| 31 | Halt | 0 | 0 | 0 | None | 0 | None |
| 32 | Transaction | 0 | 0 | 38 | 0 | 1 | None |
| 33 | Goto | 0 | 1 | 0 | None | 0 | None |
| 0 | Init | 0 | 23 | 0 | None | 0 | None |
| 1 | OpenRead | 0 | 5 | 0 | 7 | 0 | None |
| 2 | OpenRead | 1 | 3 | 0 | 5 | 0 | None |
| 3 | Rewind | 0 | 22 | 0 | None | 0 | None |
| 4 | Once | 0 | 13 | 0 | None | 0 | None |
| 5 | OpenAutoindex | 2 | 3 | 0 | k(3,B,,) | 0 | None |
| 6 | Rewind | 1 | 13 | 0 | None | 0 | None |
| 7 | Column | 1 | 2 | 2 | None | 0 | None |
| 8 | Column | 1 | 4 | 3 | None | 0 | None |
| 9 | Rowid | 1 | 4 | 0 | None | 0 | None |
| 10 | MakeRecord | 2 | 3 | 1 | None | 0 | None |
| 11 | IdxInsert | 2 | 1 | 0 | None | 16 | None |
| 12 | Next | 1 | 7 | 0 | None | 3 | None |
| 13 | Column | 0 | 6 | 5 | None | 0 | None |
| 14 | IsNull | 5 | 21 | 0 | None | 0 | None |
| 15 | SeekGE | 2 | 21 | 5 | 1 | 0 | None |
| 16 | IdxGT | 2 | 21 | 5 | 1 | 0 | None |
| 17 | Column | 0 | 1 | 6 | None | 0 | None |
| 18 | Column | 2 | 1 | 7 | None | 0 | None |
| 19 | ResultRow | 6 | 2 | 0 | None | 0 | None |
| 20 | Next | 2 | 16 | 0 | None | 0 | None |
| 21 | Next | 0 | 4 | 0 | None | 1 | None |
| 22 | Halt | 0 | 0 | 0 | None | 0 | None |
| 23 | Transaction | 0 | 0 | 38 | 0 | 1 | None |
| 24 | Goto | 0 | 1 | 0 | None | 0 | None |
EXPLAIN SELECT Title, Journal_Title
FROM articles
JOIN journals
ON (articles.ISSNs = journals.ISSNs);Using EXPLAIN to reveal the query plan and compare equivalent queries
SELECT Title, Authors, ISSNs, Year
FROM Articles
WHERE ISSNs = '2056-9890' AND Month=2 AND YEAR=2015 ;pd.read_sql_query("""
SELECT Title, Authors, ISSNs, Year
FROM Articles
WHERE ISSNs=? AND Month=? AND YEAR=? ;
""", db,
params=['2056-9890', 2, 2015]).head()Figure 3.4, 3.5 from Fundamentals of Data Engineering
Stream query
Batch query
Named after λ calculus, a functional mathematical system
Calculate cost for invoicing by 30 minute readings
Predict which periods are busiest
Latency: 60 minutes (or month!)
Accuracy: Gaps aligned by next readings
Immediate feedback, what is using power?
Power generators can react quickly to sudden spikes (e.g. electric car charging)
Latency: 1 minute
Accuracy: Gaps ignored e.g. WiFi is down
Real-time data processed directly, transforming on the fly
Data streams treated like immutable logs
New and historical data handled uniformly (replay)
Simplifies processing, avoiding duality of batch/stream (ensure business logic is consistent)
Low latency, but full reprocessing can be expensive
Examples: Kafka, Flink, Storm
Greek letter κ supersedes λ
Challenge: backfilling of streaming workloads using a unified codebase.
Goal: Replay Uber rider sessions for accurate monthly business analysis
Example: Rider waits to rate a driver until their next Uber app session
Solution: Use Kappa architecture and process in Windows to allow late-arriving and out-of-order events
Benefit: Switching between streaming and batch job is just a switch of data source (live vs replay)
Text
CREATE EXTERNAL TABLE raw_stream_events (
event_id STRING,
user_id STRING,
event_type STRING,
event_timestamp TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/hdfs/streaming/events';Ingest streaming data from Kafka into Hive
CREATE TABLE processed_events AS
SELECT
e.event_id,
e.user_id,
e.event_type,
e.event_timestamp,
u.region
FROM raw_stream_events e
JOIN user_dim u ON e.user_id = u.user_id
WHERE e.event_timestamp IS NOT NULL;Process streaming events (~ materialised view)
| event_id | user_id | event_type | event_timestamp | region |
|---|---|---|---|---|
| 3133 | alice81 | purchase | 17:52 | Europe |
| … | not here yet! | |||
{ "event_id": 3133, "user_id": "alice81", "event_type": "purchase", … }| event_id | user_id | event_data |
|---|---|---|
| 3133 | alice81 | { … } |
| … | not here yet! | |
spark.read.json("s3n://...")
.registerTempTable("json")
results = spark.sql(
"""SELECT *
FROM people
JOIN json ...""")results = spark.sql(
"SELECT * FROM people")
names = results.map(lambda p: p.name)df.select("device").where("signal > 10")
Any Spark data source can be queried
SQL can be combined with data frame operations
DataFrames have n SQL-like methods