Introduction to Data Engineering

About me

  • ML Engineering @ NCU
  • Data Engineering @ Bridgewell

Outline

  • Differences between a data scientist and a data engineer

  • What a Data Engineer does

  • Skills
  • Data Management roles
  • Use case
  • Workflow management system
  • Data Engineer interview

Differences between

a data scientist and a data engineer

At a typical company in Taiwan

Data Scientist + Data Analyst + Data Engineer = Data Scientist

Job descriptions (JDs) from Taiwanese companies

What a Data Engineer does

Data pipeline

A series of processes that extract, process, and load data between different systems.

  • Batch-driven: data is processed on a schedule.
    • Schedulers include Airflow, Oozie, Jenkins, and cron.
  • Real-time: new data is processed as soon as it is available (a sketch follows this list).
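
A minimal sketch of the real-time style, assuming the kafka-python client and a hypothetical topic named events on a local broker:

import json

from kafka import KafkaConsumer

# Consume from a hypothetical 'events' topic; the broker address is a placeholder.
consumer = KafkaConsumer(
    'events',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

# Each record is processed as soon as it arrives.
for message in consumer:
    print(message.value)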

How do a data pipeline and ETL differ?

ETL stands for Extract, Transform, and Load; it is one part of a data pipeline (a minimal sketch follows the list below).

  • Extract
    • Collect data from upstream API services.
    • Consume data from Kafka.
    • Receive HTTP requests from client web pages.
  • Transform
    • Clean the data, convert formats, etc.
  • Load
    • Store the data in a data warehouse, or simply a database.
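
A minimal sketch of one ETL run, assuming a hypothetical JSON API at https://api.example.com/users and using a local SQLite file as a stand-in for the warehouse:

import sqlite3

import requests

# Extract: pull raw records from a hypothetical upstream API.
resp = requests.get('https://api.example.com/users', timeout=10)
resp.raise_for_status()
raw_records = resp.json()

# Transform: keep only the needed fields and normalize formats.
rows = [(r['id'], r['name'].strip(), r['email'].lower()) for r in raw_records]

# Load: store the cleaned rows (SQLite stands in for the warehouse).
conn = sqlite3.connect('warehouse.db')
conn.execute('CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT, email TEXT)')
conn.executemany('INSERT INTO users VALUES (?, ?, ?)', rows)
conn.commit()
conn.close()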

Data warehouse

A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data.

  • Such as Apache Hive, BigQuery (GCP), and Redshift (AWS)
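
As a sketch, querying a warehouse from Python with the BigQuery client library; the project, dataset, and table names below are placeholders:

from google.cloud import bigquery

client = bigquery.Client()  # uses default GCP credentials

# Hypothetical events table; aggregate activity per user.
query = '''
    SELECT user_id, COUNT(*) AS events
    FROM `my_project.my_dataset.events`
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
'''

for row in client.query(query).result():
    print(row.user_id, row.events)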

Database vs Data warehouse

A database is optimized for transactional workloads (OLTP: many small reads and writes); a data warehouse is optimized for analytical queries over large volumes of historical data (OLAP).

Data Report

Google BigQuery + Google Data Studio

Data infra

Data infrastructure of a data platform:

The distributed systems that everything else runs on top of.

Data Application

Building internal data tools and APIs.
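
A minimal sketch of an internal data API, here using Flask; the endpoint and the hard-coded numbers are illustrative:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/metrics/daily-active-users')
def daily_active_users():
    # In practice this would query the warehouse; hard-coded for illustration.
    return jsonify({'date': '2020-01-01', 'dau': 12345})

if __name__ == '__main__':
    app.run(port=5000)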

Skills

  • General programming concepts
    • OOP, data structures, and algorithms.
  • Databases (a caching sketch follows this list)
    • Relational databases
    • Key-value stores like Redis; wide-column stores like Cassandra
    • Document stores like MongoDB or Elasticsearch
    • Graph databases like Neo4j
  • Distributed systems and cloud engineering
    • Hadoop
    • Kafka
    • AWS, GCP, Azure
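
As one example of a key-value store in practice, a small caching sketch with the redis-py client; the host and key name are placeholders:

import redis

r = redis.Redis(host='localhost', port=6379)

# Cache a computed value with a 60-second TTL.
r.set('user:42:score', 3.14, ex=60)

print(r.get('user:42:score'))  # returns bytes, or None after expiry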

Skills (cont.)

  • Monitoring (a metrics sketch follows this list)
    • Monitoring tools like Grafana/Datadog
    • Time-series databases (TSDBs) like InfluxDB/Prometheus
  • Infrastructure
    • Ansible
    • Docker
    • Kubernetes (K8s)
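
As a monitoring sketch, exposing a custom metric with the official Prometheus Python client; the metric name and port are illustrative:

import time

from prometheus_client import Counter, start_http_server

records_processed = Counter(
    'records_processed_total',
    'Number of records processed by the pipeline',
)

start_http_server(8000)  # metrics served at http://localhost:8000/metrics

while True:
    records_processed.inc()  # pretend one record was processed
    time.sleep(1)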

Data Management Roles

Use case: Batch

Use case: Streaming

Workflow Management tool

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows.

When workflows are defined as code, they become maintainable, versionable, testable and collaborative.

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

# Default arguments applied to every task (illustrative values;
# the DAG below expects a default_args dict to exist).
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2020, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# One DAG run per day.
dag = DAG('tutorial', default_args=default_args, schedule_interval=timedelta(days=1))

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
    {% for i in range(5) %}
        echo "{{ ds }}"
        echo "{{ macros.ds_add(ds, 7)}}"
        echo "{{ params.my_param }}"
    {% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

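# t2 and t3 both depend on t1, so t1 runs first.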
t2.set_upstream(t1)
t3.set_upstream(t1)
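
Equivalently, task dependencies can be declared with Airflow's bit-shift operators, e.g. t1 >> [t2, t3], which reads in the direction the data flows.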

Airflow with AWS tools

More complex workflow

Data Engineer Interview

Facebook data engineer

  • Phone interview with a recruiter
  • Technical screening interview
    • SQL questions
    • Online assessment (OA), LeetCode-style
  • Team interview
    • Problem-solving questions
    • SQL programming
    • Database design and system design
    • Behavioral questions

Walmart data engineer

  • Phone interview with a recruiter
  • Team interview
    • Coding
    • Big data system questions
    • Data modeling and database design
    • Problem-solving questions and case studies
    • ETL and data pipelines
    • Math and analytics

Overview
