COMP63301 Data Engineering Concepts
Stian Soiland-Reyes
This work is licensed under a
Creative Commons Attribution 4.0 International License.
Data generation
🛒📲
Ingestion
🧺
👩🏻💻🛠️
Transformation
Serving
🍲🥘🍛
Analytics
👨🏻💻📈📊
Traditional data processing uses batch computing and relational databases
Database
⛃
Database
Database
⛃
Database
⛃
"Stakeholders"
👩🏻🦰👱🏻♀️👨🏻💼
👧🏽👦🏻👧🏾
Consumers and users
🧑🏾💼💰🤵🏼♂️
Data generation
🛒📲
Ingestion
🧺
Serving
🍲🥘🍛
Analytics
👨🏻💻📈📊
“Embarrassingly parallel”
Database
⛃
Database
Database
⛃
👩🏻🦰👱🏻♀️👨🏻💼
👧🏽👦🏻👧🏾
🧑🏾💼💰🤵🏼♂️
👩🏻💻🛠️
Transformation
Serving
🍲🥘🍛
More production lines
Can we scale up by processing each "file" separately and in parallel?
Database
⛃
Database
⛃
Database
⛃
Database
⛃
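The question of processing each "file" separately and in parallel can be explored with a minimal sketch. It assumes every file can be handled without knowing anything about the others, which is what makes the task embarrassingly parallel. The file names, their contents and `count_words` are hypothetical stand-ins, and threads are used only for simplicity; CPU-bound work would more likely be spread over separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "files": name -> content, standing in for files on disk.
FILES = {
    "orders-mon.txt": "apple pear apple",
    "orders-tue.txt": "pear",
    "orders-wed.txt": "apple banana pear pear",
}


def count_words(content: str) -> int:
    """Process one file independently; no state is shared with other files."""
    return len(content.split())


def process_all(files: dict[str, str], workers: int = 3) -> dict[str, int]:
    """Run count_words on every file in parallel, one task per file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as its input
        counts = pool.map(count_words, files.values())
        return dict(zip(files.keys(), counts))


results = process_all(FILES)
print(results)
```

Because no task depends on another, adding more workers (or more "production lines") needs no change to `count_words` itself.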
How did we ever manage before cloud? Case: Scaling Web servers in 2002
/home/a
/web
www.ntnu.no
webmail.ntnu.no
SAN
Storage Area Network
/home/b
vm5241
16 GB
vm5719
16 GB
vm6532
32 GB
host8121
machine images
machine images
machine images
www.example.com
Public IP
15.197.159.59
IaaS APIs
dataset
results
dataset
dataset
dataset
dataset
dataset
Horizontal scaling
Splitting tasks for cloud computing, 2002 style!
ingest1.example.com
ingest2
transform1
transform2
transform3
transform4
aggregate1
aggregate2
aggregate3
serve1
db.example.com
Below is a clear, structured overview of the most typical cloud services across AWS, Azure, Google Cloud (GCP), and other major cloud providers. They are grouped into common cloud service categories, with the closest equivalents placed side-by-side for easy comparison.
| Service category | AWS | Azure | Google Cloud (GCP) | Other providers |
|---|---|---|---|---|
| Virtual Machines (IaaS) | EC2 | Virtual Machines | Compute Engine | Oracle OCI Compute, IBM Cloud Virtual Servers |
| Autoscaling | Auto Scaling | VM Scale Sets | Instance Groups | All major providers have autoscaling variants |
| Serverless Functions | Lambda | Azure Functions | Cloud Functions | Cloudflare Workers, IBM Cloud Functions |
| Container Orchestration (Managed Kubernetes) | EKS | AKS | GKE | IBM Cloud Kubernetes Service, Oracle OKE |
| Container Execution / Serverless Containers | Fargate | Container Apps | Cloud Run | Cloudflare Workers, DigitalOcean App Platform |
| Bare Metal | EC2 Bare Metal | Azure BareMetal | Bare Metal Solution | IBM Bare Metal Servers, OCI Bare Metal |
| Object Storage | S3 | Blob Storage | Cloud Storage | DigitalOcean Spaces, Backblaze B2 |
| Block Storage | EBS | Managed Disks | Persistent Disks | OCI Block Storage |
| File Storage (NFS/SMB) | EFS, FSx | Azure Files | Filestore | IBM File Storage |
| Archival Storage | Glacier | Archive Storage | Coldline/Archive | Backblaze B2, Wasabi |
| Managed Relational DBs | RDS (MySQL, PostgreSQL, etc.) | Azure SQL, PostgreSQL/MySQL Flexible Server | Cloud SQL | Oracle Autonomous DB, IBM Db2 |
| Cloud-Native Distributed SQL | Aurora | Azure Cosmos DB (multi-model) | Spanner | CockroachCloud |
| NoSQL Key-Value / Document | DynamoDB | Cosmos DB | Firestore, Cloud Bigtable | MongoDB Atlas |
| Data Warehousing | Redshift | Azure Synapse Analytics | BigQuery | Snowflake (cloud-agnostic) |
| In-Memory Caches | ElastiCache | Azure Cache for Redis | Memorystore | Redis Enterprise Cloud |
| Virtual Networks | VPC | Virtual Network | VPC | OCI Virtual Cloud Network |
| Load Balancers | ALB, NLB, CLB | Azure Load Balancer, App Gateway | Cloud Load Balancing | F5 Cloud, HAProxy Cloud |
| CDN | CloudFront | Azure CDN | Cloud CDN | Cloudflare, Akamai |
| DNS | Route 53 | Azure DNS | Cloud DNS | Cloudflare DNS |
| VPN / Direct Connect | Site-to-Site VPN, Direct Connect | VPN Gateway, ExpressRoute | Cloud VPN, Interconnect | Oracle FastConnect |
| ML Platforms | SageMaker | Azure Machine Learning | Vertex AI | IBM Watson Studio |
| Generative AI APIs | Bedrock | Azure OpenAI | Gemini | Anthropic, OpenAI APIs |
| Vision / Speech APIs | Rekognition, Polly | Cognitive Services | Cloud Vision, Speech-to-Text | Clarifai |
| Conversational AI | Lex | Bot Service | Dialogflow | Rasa (cloud-hosted) |
| CI/CD | CodePipeline, CodeBuild | Azure DevOps, GitHub Actions (Microsoft) | Cloud Build | GitLab CI, CircleCI |
| API Gateways | API Gateway | API Management | API Gateway | Kong Cloud, Apigee |
| Event Bus / Messaging | EventBridge, SNS, SQS | Event Grid, Service Bus | Pub/Sub | RabbitMQ Cloud, Kafka Confluent Cloud |
| Identity & Access | IAM | Azure AD / Entra ID | Cloud IAM | Okta |
| Secrets Management | Secrets Manager | Key Vault | Secret Manager | HashiCorp Vault |
| Key Management | KMS | Key Vault | Cloud KMS | Thales CipherTrust |
| Metrics & Monitoring | CloudWatch | Azure Monitor | Cloud Monitoring | Datadog, New Relic |
| Logging | CloudWatch Logs | Azure Log Analytics | Cloud Logging | Splunk, ELK Stack |
| Trace & APM | X-Ray | Application Insights | Cloud Trace | Datadog APM |
| Infrastructure as Code | CloudFormation | ARM/Bicep | Deployment Manager | Terraform (all clouds) |
| Config Management | Systems Manager | Automation | Config Connector | Puppet, Ansible, Chef |
| Container Registry | ECR | ACR | Artifact Registry | Docker Hub, GitHub Container Registry |
| Directory Services | Directory Service | Active Directory Domain Services | Managed AD | IBM Cloud Directory |
| ERP / SaaS Platforms | — | Dynamics 365 | — | Salesforce, SAP Cloud |
Can you list and categorise the most typical cloud services used from AWS, Azure, GCP, and other cloud computing providers?
dataset
results
dataset
dataset
dataset
dataset
dataset
scalability
Handle parallelization:
automation
vm5241
16 GB
vm6532
32 GB
Queued jobs
Dependent jobs
Container images
https://www.ibm.com/history/time-sharing
Scheduling batch jobs to virtual machines via mainframe APIs
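Queued and dependent jobs can be illustrated with a small sketch that orders jobs so that each one runs only after the jobs it depends on. This is not how any particular scheduler works; it assumes a hypothetical job graph (the `report` job and the dependencies are invented for illustration) and uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it depends on.
JOBS = {
    "ingest": set(),
    "transform": {"ingest"},
    "aggregate": {"transform"},
    "serve": {"aggregate"},
    "report": {"aggregate"},
}


def schedule(jobs: dict[str, set[str]]) -> list[list[str]]:
    """Return batches of jobs; jobs within one batch could run in parallel."""
    sorter = TopologicalSorter(jobs)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)                 # mark finished, unblocking dependents
    return batches


batches = schedule(JOBS)
print(batches)
```

Jobs in the same inner list have no dependency on each other, so a scheduler would be free to place them on different virtual machines.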
Adapted from Figure 7.3, Fundamentals of Data Engineering
Data sources are mostly unbounded: new items arrive semi-continuously and must be ingested into a computer system
What kind of data items?
Sale transaction, temperature measurement, login attempt, ad view, parcel tracking, ...
Irregularity: Data will arrive depending on external factors, e.g. when customers choose to visit the store
Adapted from Figure 7.3, Fundamentals of Data Engineering
Forming bounded data allows processing in manageable chunks (batch processing)
by frequency: Weekly, daily, hourly, every 5 minutes
by volume: 100, 1000, 10k, 100k items
by size: kilobytes, megabytes, gigabytes
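Forming bounded data by volume can be sketched as follows: items are read from a possibly unbounded source and grouped into batches of a fixed maximum size. The function name `batches` and the sample measurements are assumptions made for illustration; batching by frequency or by size would follow the same pattern with a different cut-off rule.

```python
from itertools import islice
from typing import Iterable, Iterator


def batches(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items from the source."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))  # take up to `size` items
        if not batch:                         # source is exhausted
            return
        yield batch


# Hypothetical temperature measurements standing in for arriving data items
measurements = [21.5, 21.7, 22.0, 21.9, 22.3, 22.1, 21.8]
chunks = list(batches(measurements, 3))
print(chunks)
```

The last batch may be smaller than the others, which a real pipeline has to decide whether to process immediately or hold back until it fills up.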
Streaming data reduces latency
Processing in micro-batches (e.g. 10 items) or in real time (e.g. 1 item) means data is integrated immediately and can inform business action without delay
Cloud computing is needed to scale for varying data volumes (e.g. more processing nodes during "rush hour")
Adapted from Figure 5.9, Fundamentals of Data Engineering
Asynchronous sending of messages
Publish–Subscribe model
Messages are buffered in queue, until processed by subscriber
Queue system can be distributed to handle many concurrent producers and subscribers
Adapted from Figure 5.10, Fundamentals of Data Engineering
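The buffering behaviour can be sketched in-process with Python's standard `queue` and `threading` modules. This is only an analogy for a real message queue, under the assumption of a single producer and a single subscriber; the message names and the `STOP` sentinel are invented for illustration.

```python
import queue
import threading


def run_demo(n_messages: int = 5) -> list[str]:
    """Send messages asynchronously through a buffered queue to one subscriber."""
    buffer: queue.Queue = queue.Queue()  # messages wait here until processed
    processed: list[str] = []
    STOP = object()                      # sentinel that ends the subscriber

    def subscriber() -> None:
        while True:
            message = buffer.get()       # blocks until a message is available
            if message is STOP:
                break
            processed.append(f"handled {message}")

    worker = threading.Thread(target=subscriber)
    worker.start()
    for i in range(n_messages):          # the producer never waits for processing
        buffer.put(f"order-{i}")
    buffer.put(STOP)
    worker.join()
    return processed


handled = run_demo(3)
print(handled)
```

The producer returns from `put` immediately, so a slow subscriber only makes the buffer grow rather than blocking the sender.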
Topic-based queues with multiple subscribers
Can dispatch to many consumers, e.g. analytics and operations
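A topic with multiple subscribers can be sketched as a toy in-process broker in which every message published to a topic is delivered to all handlers subscribed to it. `TopicBroker`, the topic name `web-orders` and the two subscriber lists are hypothetical; a real broker would also buffer messages and run on separate machines.

```python
from collections import defaultdict
from typing import Callable


class TopicBroker:
    """Toy broker: each topic delivers every message to all its subscribers."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        """Register a handler that receives every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message to every subscriber; return the delivery count."""
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(message)
        return len(handlers)


analytics_seen: list[dict] = []
operations_seen: list[dict] = []

broker = TopicBroker()
broker.subscribe("web-orders", analytics_seen.append)
broker.subscribe("web-orders", operations_seen.append)
delivered = broker.publish("web-orders", {"Type": "Web Order", "Key": "Order #12345"})
print(delivered, analytics_seen, operations_seen)
```

Both subscribers receive the same event, so analytics and operations can each act on it without knowing about the other.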
{
  "Type": "Web Order",
  "Key": "Order #12345",
  "Value": "SKU 123, purchase price of $100",
  "Timestamp": "2023-01-02 06:01:00"
}
Adapted from Figure 5.11, Fundamentals of Data Engineering
Streams can be partitioned by the partition key of each event
e.g. using hashing or modulo: 11 mod 3 = 2
Simple way to divide the work across compute nodes
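The partitioning rule can be written as a short function. This is a sketch that assumes integer keys are assigned by modulo and other keys by a stable CRC32 hash; the function names are hypothetical, and real streaming platforms choose their own hash functions.

```python
import zlib


def partition_for(key, num_partitions: int) -> int:
    """Map a partition key to a partition index in range(num_partitions)."""
    if isinstance(key, int):
        return key % num_partitions  # modulo for numeric keys
    # Stable hash for text keys (Python's built-in hash() of str varies per process)
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def split_into_partitions(keys, num_partitions: int) -> list[list]:
    """Group keys by their partition so each group can go to one compute node."""
    partitions: list[list] = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_for(key, num_partitions)].append(key)
    return partitions


example = partition_for(11, 3)  # 11 mod 3 = 2
partitions = split_into_partitions(range(1, 8), 3)
print(example, partitions)
```

Events with the same key always land in the same partition, which keeps related events together on one node.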
Many data items (e.g. streaming)
Splitting into partitions
Mapping: λ(item) = value, once per data item
def λ(order):
    return (order.amount *
            order.item.price)
Shuffling
Reducing: Σ(a, b) = combined value
def Σ(a,b):
    return a+b
Result
The map-reduce computational model splits work across distributed (cloud) nodes. Each mapping node runs the same algorithm over its share of the data items, the mapped values are shuffled to reducing nodes, and the reducing nodes combine them (e.g. by summing) into the final result.
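The three stages can be traced in a single-process sketch. It assumes a simplified `Order` record with a hypothetical `region` field as the shuffle key and a flat `price` field, and it runs everything in one process; in a real deployment each partition and each key would be handled by a different node.

```python
from collections import defaultdict
from dataclasses import dataclass
from functools import reduce


@dataclass
class Order:
    region: str    # hypothetical shuffle key
    amount: int    # number of items ordered
    price: float   # price per item


def map_order(order: Order) -> tuple[str, float]:
    """Map step (λ on the slide): emit (key, amount * price) for one order."""
    return order.region, order.amount * order.price


def add(a: float, b: float) -> float:
    """Reduce step (Σ on the slide): combine two mapped values."""
    return a + b


def map_reduce(partitions: list[list[Order]]) -> dict[str, float]:
    # Map: every partition is processed independently with the same function
    mapped = [[map_order(order) for order in part] for part in partitions]
    # Shuffle: group mapped values by key so each key ends up at one reducer
    shuffled: dict[str, list[float]] = defaultdict(list)
    for part in mapped:
        for key, value in part:
            shuffled[key].append(value)
    # Reduce: combine the values of each key into one result
    return {key: reduce(add, values) for key, values in shuffled.items()}


partitions = [
    [Order("north", 2, 10.0), Order("south", 1, 5.0)],
    [Order("north", 1, 10.0), Order("south", 3, 5.0)],
]
totals = map_reduce(partitions)
print(totals)
```

Because `add` is associative, the reducers can combine values in any grouping, which is what allows the reduce stage to be distributed as well.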
Scheduled batches may be delayed or missed because of concurrent system load or connectivity issues (don't run everything at 00:00!)
A batch may take too long to process, and block or conflict with other batch jobs.
Errors may cause the rest of the batch to fail, but use of transactions can keep the databases in a clean state
Batches are clearly defined and logged, and can be rerun if anything fails.
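How a transaction keeps the database clean when a batch fails, and why the batch can then be rerun, can be shown with Python's standard `sqlite3` module. The `sales` table, its `CHECK` constraint and the sample rows are assumptions made for this sketch.

```python
import sqlite3


def load_batch(conn: sqlite3.Connection, rows: list[tuple[str, float]]) -> bool:
    """Insert a whole batch in one transaction; roll back everything on error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO sales (sku, price) VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sku TEXT NOT NULL, price REAL NOT NULL CHECK (price >= 0))"
)

ok = load_batch(conn, [("SKU 123", 100.0), ("SKU 456", 25.0)])
# The second row violates the CHECK constraint, so the whole batch is rolled back
failed = load_batch(conn, [("SKU 789", 10.0), ("SKU 000", -1.0)])
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(ok, failed, row_count)
```

After the failed batch the table holds only the rows of the successful one, so the faulty batch can be corrected and rerun without leaving partial data behind.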
Streaming can dynamically scale (e.g. add more nodes) for increased compute loads.
Events are processed in near real-time, but may arrive out of order.
Queues are ephemeral: events may be lost if errors occur (e.g. cloud node unavailable)
Errors are isolated to individual events or micro-batches, but are not as easily rolled back.
Harder to rerun (but the Kappa architecture uses a longer retention period to allow replay)
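Replay from a retained log, together with handling of out-of-order arrival, can be sketched with a toy append-only log. `RetainedLog`, the `Price` field and the sample events are hypothetical, and the sketch ignores retention limits and distribution entirely.

```python
class RetainedLog:
    """Toy append-only event log that keeps events so consumers can replay them."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def append(self, event: dict) -> int:
        """Store an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int = 0) -> list[dict]:
        """Return all retained events starting at the given offset."""
        return list(self._events[offset:])


def summarise(events: list[dict]) -> tuple[list[str], int]:
    """Order events by their own timestamp, then total the prices."""
    ordered = sorted(events, key=lambda e: e["Timestamp"])
    return [e["Key"] for e in ordered], sum(e["Price"] for e in ordered)


log = RetainedLog()
# The later order arrives first: arrival order differs from event time
log.append({"Key": "Order #2", "Timestamp": "2023-01-02 06:02:00", "Price": 50})
log.append({"Key": "Order #1", "Timestamp": "2023-01-02 06:01:00", "Price": 100})

first_run = summarise(log.read_from(0))
replayed = summarise(log.read_from(0))  # rerun after a failure: same input, same result
print(first_run, replayed)
```

Since the log is not emptied when events are read, a consumer that crashed can start again from an earlier offset and reach the same result.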
Figure 3.4, 3.5 from Fundamentals of Data Engineering
Batch processing can stay close to relational database thinking
→ Online analytical processing (OLAP)
Stream processing may at first seem to complicate serving for end users
→ Need to write Spark code rather than SQL queries
But streams can be queried with SQL, streams can populate OLAP systems, and streams can co-exist with batch.
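The idea that a stream can populate a store which is then queried with ordinary SQL can be sketched with an in-memory SQLite database. SQLite here is only a stand-in for an OLAP system, and the `orders` table and the sample events are assumptions made for illustration.

```python
import sqlite3
from typing import Iterable


def consume(events: Iterable[dict], conn: sqlite3.Connection) -> None:
    """Insert each event as it arrives; the table stays queryable throughout."""
    for event in events:
        with conn:  # one small transaction per event
            conn.execute(
                "INSERT INTO orders (sku, price) VALUES (?, ?)",
                (event["sku"], event["price"]),
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (sku TEXT, price REAL)")

# Hypothetical stream of order events
stream = iter([
    {"sku": "SKU 123", "price": 100.0},
    {"sku": "SKU 456", "price": 25.0},
    {"sku": "SKU 123", "price": 100.0},
])
consume(stream, conn)

revenue = conn.execute(
    "SELECT sku, SUM(price) FROM orders GROUP BY sku ORDER BY sku"
).fetchall()
print(revenue)
```

End users keep writing SQL against the table, while the streaming side is limited to the small `consume` loop.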