What is Data Engineering?

And how can you get started?

Written by: Igor Korotach

Who am I?

  • Programmer since Middle School
  • Student of NURE
  • Team/Tech Lead at Quantum
  • Speaker at multiple (mostly Python) conferences
  • Lecturer at NTU "KPI"

What is Data Engineering? How is it special?

Designing and building systems that organize data for future analysis.

What is the aim of Data Engineering?

Distinctions

Before Big Data

  • Life was simple-ish
  • Data volumes were small (mostly limited to a single organization)
  • There wasn't a clear distinction between Data Scientist and Data Analyst (both did Excel spreadsheets)
  • Data was mostly processed with Data Marts and OLAP cubes
  • A single architect could be responsible for:
    • Data Schema
    • Star schema
    • Data Security
    • Data Management
  • SQL was the King of the Land

After Big Data

  • Life got more interesting :)
  • Now we have a clear pipeline between specialists: Data Engineering -> Data Science -> Data Analytics
  • Data sources couldn't store the volume
  • Data analytics tools weren't fast enough
  • Data Analysts no longer have Excel and SQL :( (This is due to NoSQL and MapReduce patterns)
  • Now we have Data Engineering Architect, Data Security Architect, Data Science Architect, Data ....... Architect

What is the main process of Data Engineering?

ETL (Extract, Transform, Load)
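To make the three steps concrete, here is a minimal sketch of an ETL run using only the Python standard library. The CSV content, the `payments` table, and the function names are all made up for illustration; a real pipeline would read from files, APIs, or queues and load into a proper warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this could be a file or an API response.
RAW_CSV = """customer,amount
 Alice ,10.5
bob,3
alice,4.5
"""

def extract(text):
    """Extract: read rows from the raw CSV text."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize customer names and cast amounts to float."""
    return [(row["customer"].strip().lower(), float(row["amount"])) for row in rows]

def load(records, conn):
    """Load: write the cleaned records into a SQL table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
load(transform(extract(RAW_CSV)), conn)

totals = conn.execute(
    "SELECT customer, SUM(amount) FROM payments GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)  # [('alice', 15.0), ('bob', 3.0)]
```

Keeping extract, transform, and load as separate functions means each step can be tested or swapped out on its own.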

Data classification

  • Raw Data
    • Unprocessed data in arbitrary form (e.g. JSON, CSV)
  • Processed data 
    • Raw data with schema applied
    • Stored in pipelines
  • Cooked data
    • Processed data that has been summarized so it can be decided upon
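The three stages above can be sketched in a few lines of Python. The sensor readings, the `Reading` schema, and the helper names are hypothetical and only show how one piece of data moves from raw, to processed, to cooked.

```python
import json
from collections import defaultdict
from dataclasses import dataclass

# Raw data: unprocessed JSON lines; note the temperature arrives as a string.
RAW_LINES = [
    '{"sensor": "a", "temp": "21.5"}',
    '{"sensor": "b", "temp": "19.0"}',
    '{"sensor": "a", "temp": "22.5"}',
]

@dataclass
class Reading:
    """Hypothetical schema applied to each raw record."""
    sensor: str
    temp: float

def to_processed(line):
    """Processed data: parse the raw line and enforce the schema types."""
    obj = json.loads(line)
    return Reading(sensor=str(obj["sensor"]), temp=float(obj["temp"]))

def to_cooked(readings):
    """Cooked data: summarize processed records as mean temperature per sensor."""
    grouped = defaultdict(list)
    for reading in readings:
        grouped[reading.sensor].append(reading.temp)
    return {sensor: sum(temps) / len(temps) for sensor, temps in grouped.items()}

processed = [to_processed(line) for line in RAW_LINES]
cooked = to_cooked(processed)
print(cooked)  # {'a': 22.0, 'b': 19.0}
```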

Data Properties

  • Volume
    • How much data you have
  • Velocity
    • How fast data is getting to you
  • Variety
    • How varied your data is
  • Veracity
    • How reliable and clean your data is

Data distinctions

Data processing methods

Batch processing

Under the batch processing model, a set of data is collected over time, then fed into an analytics system. In other words, you collect a batch of information, then send it in for processing.
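A minimal sketch of the batch model, assuming a made-up list of numeric measurements: everything is collected first, and the analysis runs once over the whole batch.

```python
def process_batch(batch):
    """Analyze the whole batch at once: count and mean of the collected values."""
    return {"count": len(batch), "mean": sum(batch) / len(batch)}

# Step 1: collect data over time (hypothetical measurements).
collected = []
for value in [3, 5, 7, 9]:
    collected.append(value)

# Step 2: only after collection is finished, send the batch for processing.
result = process_batch(collected)
print(result)  # {'count': 4, 'mean': 6.0}
```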

Stream processing

Under the streaming model, data is fed into analytics tools piece-by-piece. The processing is usually done in real time.
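A minimal sketch of the streaming model, using a Python generator as a stand-in for a real event source (the `sensor_events` feed is made up): each value is processed as soon as it arrives, and an updated result is available after every event.

```python
def running_mean(events):
    """Consume events one at a time and yield the mean seen so far."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count  # a fresh result after every single event

def sensor_events():
    """Hypothetical event source; a real one might be a socket or a message queue."""
    for value in [3, 5, 7, 9]:
        yield value

results = list(running_mean(sensor_events()))
print(results)  # [3.0, 4.0, 5.0, 6.0]
```

Unlike a batch job, the stream processor never needs the full data set in memory; it only keeps a running count and total.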

Processing Tools

MapReduce pattern
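The slide only names the pattern, so here is a single-machine sketch of how MapReduce is commonly described, using the classic word-count example. The documents and function names are made up; real frameworks distribute the map and reduce phases across many machines.

```python
from collections import defaultdict
from functools import reduce

documents = ["big data is big", "data is everywhere"]

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```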

Data storages

NoSQL vs SQL

Am I a Data Engineer?

If you like...

  • Optimizing for speed and efficiency
  • Working with a lot of complex tools
  • Analyzing the value the data can provide
  • Clean and pragmatic design
  • Python/Java Ecosystem

Congrats, you are a Data Engineer in the making!

Is this tough?

Well... yes

Take an example...

Will you manage it?

Yes!!!

(And I am here to help)

Thanks for your attention. You've been awesome!

Questions?

Presentation link: https://slides.com/emulebest/data-engineering-introduction/

Mail: igorkorotach@gmail.com

Telegram: @emulebest
