Data Engineering

Basics and getting started

Written by: Igor Korotach

What is Data Engineering? How is it different?

Designing and building systems that organize data for future analysis.

Distinctions

Before Big Data

  • Life was simple-ish
  • Data volumes were small (mostly limited to a single organization)
  • There wasn't a clear distinction between Data Scientist and Data Analyst (both worked in Excel spreadsheets)
  • Data was mostly processed with Data Marts and OLAP cubes
  • A single architect could be responsible for:
    • Data Schema
    • Star schema
    • Data Security
    • Data Management
  • SQL was the King of the Land

After Big Data

  • Life got more interesting :)
  • Now we have clear pipelines between specialists: Data Engineering -> Data Science -> Data Analytics
  • Data sources couldn't store the volume
  • Data analytics tools weren't fast enough
  • Data Analysts no longer have Excel and SQL :( (This is due to NoSQL and MapReduce patterns)
  • Now we have Data Engineering Architect, Data Security Architect, Data Science Architect, Data ....... Architect

ETL (Extract, Transform, Load)

Data classification

  • Raw Data
    • Unprocessed data in arbitrary form (e.g. JSON, CSV)
  • Processed data 
    • Raw data with schema applied
    • Stored in pipelines
  • Cooked data
    • Processed data that has been summarized and is ready to base decisions on
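
The three stages above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample CSV, the `schema` mapping, and the function names `apply_schema` and `summarize` are hypothetical and not part of the original slides.

```python
import csv
import io

# Raw data: unprocessed text in an arbitrary form (CSV here); every value is a string.
RAW_CSV = "user,amount\nalice,10.5\nbob,4\nalice,2.5\n"


def apply_schema(raw_text):
    """Raw -> processed: parse the rows and cast each field to its declared type."""
    schema = {"user": str, "amount": float}  # hypothetical schema for this sample
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{name: cast(row[name]) for name, cast in schema.items()} for row in reader]


def summarize(rows):
    """Processed -> cooked: aggregate typed rows into a total per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


processed = apply_schema(RAW_CSV)
cooked = summarize(processed)
print(processed[0])  # {'user': 'alice', 'amount': 10.5}
print(cooked)        # {'alice': 13.0, 'bob': 4.0}
```

The raw text has no types, the processed rows do, and the cooked result is small enough to read at a glance.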

Data Properties

  • Volume
    • How much data you have
  • Velocity
    • How fast data reaches you
  • Variety
    • How varied your data is
  • Veracity
    • How reliable and clean your data is

Data processing methods

Batch processing

Under the batch processing model, a set of data is collected over time, then fed into an analytics system. In other words, you collect a batch of information, then send it in for processing.
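
A minimal sketch of the batch model, assuming a small list of numeric readings stands in for data collected over time. The readings and the name `process_batch` are hypothetical, chosen only for illustration.

```python
def process_batch(events):
    """Analyze a whole collected batch at once; here, compute its average."""
    return sum(events) / len(events)


collected = []
for reading in [3, 5, 7, 9]:   # hypothetical readings arriving over time
    collected.append(reading)  # collect only; nothing is analyzed yet

# Processing starts once the whole batch is available.
batch_average = process_batch(collected)
print(batch_average)  # 6.0
```

No result exists until the whole batch has been collected and sent for processing.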

Stream processing

Under the streaming model, data is fed into analytics tools piece-by-piece. The processing is usually done in real time.
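
A minimal sketch of the streaming model, assuming a Python generator stands in for a live data source. The names `sensor_readings` and `running_average` and the sample values are hypothetical.

```python
def sensor_readings():
    """Hypothetical source that hands over one reading at a time."""
    yield from [3, 5, 7, 9]


def running_average(stream):
    """Update the result as each item arrives, without storing the stream."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # a fresh result after every single item


averages = list(running_average(sensor_readings()))
print(averages)  # [3.0, 4.0, 5.0, 6.0]
```

A result is available after every item, and only a running count and total are kept in memory.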

Processing Tools

Data storage

Thanks for your attention. You've been awesome!

Questions?

Presentation link: https://slides.com/emulebest/data-engineering-introduction/

Mail: igorkorotach@gmail.com

Telegram: @emulebest
