Data engineering basics

Background

What's your experience with big data?
What does big data mean to you?
Which role?
Which technologies?
Do you have questions about the decks before we start?
Experience with Hadoop clusters?
Experience with MapReduce, Spark, Pig, HDFS?

Structured vs Unstructured

Do we have more structured or unstructured data?
- > 80% unstructured
Defined vs Undefined Data
Qualitative vs Quantitative Data
Where is it usually stored?
- Storage in Data Warehouses vs Data Lakes
- Any experience with products?
Ease of Analysis
Predefined Format vs Variety of Formats

Conclusion: Structured data is easier to analyze, but 80% of data is unstructured, so the business is flying blind if we leave it out

Data Format

Why is choosing a data format one of the key decisions in a big data project?
- Analysis is faster
- Size of the storage
- Specialized for write or read
- Compression can go faster or slower
- Saves money
- Security
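
As a rough illustration of the storage-size and compression points above, the sketch below serializes the same records as CSV and as JSON Lines and compares the byte counts before and after gzip. The sample records and the helper names (`to_csv_bytes`, `to_json_lines_bytes`) are made up for this example, and only the Python standard library is used; real numbers depend heavily on the data.

```python
import csv
import gzip
import io
import json

# Hypothetical sample records; any tabular data would do.
records = [
    {"id": i, "state": "CA" if i % 2 else "NY", "amount": i * 1.5}
    for i in range(1000)
]


def to_csv_bytes(rows):
    """Serialize the rows as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "state", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def to_json_lines_bytes(rows):
    """Serialize the rows as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(r) for r in rows).encode("utf-8")


csv_bytes = to_csv_bytes(records)
json_bytes = to_json_lines_bytes(records)

sizes = {
    "csv": len(csv_bytes),
    "json_lines": len(json_bytes),
    "csv_gzip": len(gzip.compress(csv_bytes)),
    "json_lines_gzip": len(gzip.compress(json_bytes)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

JSON Lines repeats every field name on every record, so it comes out larger than CSV here, and both shrink considerably once compressed.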

	JSON	CSV	Avro	Protocol Buffers	Parquet	ORC
Row or Column			Row		Column	Column
Schema			Best		Good	Better
Read or Write			Write		Read	Read
Compression			Good		Better	Best
Splittability			Good		Good	Best
Human Readable	Good	Best	meh	meh	meh	meh

Types	CSV	JSON	XML	Avro	Protocol Buffers	Parquet	ORC
text versus binary	text	text	text	metadata in JSON, data in binary	binary	binary	binary
Data type	no	yes	no	yes	yes	yes	yes
Schema enforcement	no (minimal with header)	external for validation	external for validation	yes	yes	yes	yes
Schema evolution	no	yes	yes	yes	yes	yes	limited
Storage type	row	row	row	row	row	column	column
OLAP/OLTP	OLTP	OLTP	OLTP	OLTP	OLTP	OLAP	OLAP
Splittable	yes in its simplest form	yes with JSON lines	no	yes	no	yes	yes
Compression	yes	yes	yes	yes	yes	yes	yes
Batch	yes	yes	yes	yes	yes	yes	yes
Stream	yes	yes	no	yes	yes	no	no
Typed data	no	no	no	no	yes	no	no
Ecosystems	popular everywhere for its simplicity	API and web	enterprise	Big Data and Streaming	RPC and Kubernetes	Big Data and BI	Big Data and BI

Row or Column
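
The tables above mark Avro as row-oriented and write-friendly, and Parquet and ORC as column-oriented and read-friendly. A minimal in-memory sketch of the two layouts is shown here; the three-row table and the variable names are hypothetical, and real formats add encoding, compression, and metadata on top of this idea.

```python
# Hypothetical three-row table used only for illustration.
rows = [
    {"id": 1, "state": "CA", "amount": 10.0},
    {"id": 2, "state": "NY", "amount": 20.0},
    {"id": 3, "state": "CA", "amount": 30.0},
]

# Row-oriented layout: all fields of one record sit together.
row_layout = [(r["id"], r["state"], r["amount"]) for r in rows]

# Column-oriented layout: all values of one field sit together.
column_layout = {
    "id": [r["id"] for r in rows],
    "state": [r["state"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# Writing one record touches a single place in the row layout...
new_record = (4, "TX", 40.0)
row_layout.append(new_record)
# ...but one place per column in the column layout.
for name, value in zip(("id", "state", "amount"), new_record):
    column_layout[name].append(value)

# An aggregate over one field walks every record in the row layout...
total_from_rows = sum(record[2] for record in row_layout)
# ...but only one contiguous list in the column layout.
total_from_columns = sum(column_layout["amount"])

print(total_from_rows, total_from_columns)
```

This is why row layouts tend to suit record-at-a-time writes, while column layouts tend to suit analytical reads that touch only a few fields.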

Row

Column

Serialization and Deserialization

Compression

Coding the data into more compact forms

Let’s first examine an encoding technique that can be used by either row-oriented or column-oriented databases. A column holding the states of the United States serves as the example for dictionary and bitmap encodings.

There are 50 states, so each value could be encoded in 6 bits, since 6 bits give 64 unique patterns.
Storing the two-letter abbreviation instead would take 16 bits: 8 bits for each of the two ASCII characters, with 256 possible patterns per character.
Worst of all, storing the full name would make the length variable and require many more bits.
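
The arithmetic above can be checked with a small sketch of dictionary and bitmap encoding. The sample column and the function names (`dictionary_encode`, `bitmap_encode`, `bits_needed`) are assumptions made for this example, not part of any particular database.

```python
# Hypothetical column of state abbreviations.
states = ["CA", "NY", "CA", "TX", "NY", "CA"]


def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original values from the codes."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[code] for code in codes]


def bitmap_encode(values):
    """Build one bit vector per distinct value (1 where the value occurs)."""
    bitmaps = {}
    for position, value in enumerate(values):
        bitmaps.setdefault(value, [0] * len(values))[position] = 1
    return bitmaps


def bits_needed(distinct_count):
    """Bits per value needed to tell `distinct_count` values apart."""
    return max(1, (distinct_count - 1).bit_length())


dictionary, codes = dictionary_encode(states)
print(codes)                       # [0, 1, 0, 2, 1, 0]
print(bitmap_encode(states)["CA"])  # [1, 0, 1, 0, 0, 1]
print(bits_needed(50))             # 6
print(bits_needed(256) * 2)        # 16 (two ASCII characters)
```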

Now let’s take a look at run-length encoding. It replaces any run of the same value with a count and the value. For instance, aaaab becomes 4a1b. This becomes even more powerful for projections whose columns are sorted, since identical values then sit next to each other.
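
A minimal sketch of run-length encoding, reproducing the aaaab example and showing the effect of sorting. The function names and the sample column are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal values into a (count, value) pair."""
    return [(len(list(group)), value) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (count, value) pairs back into a string."""
    return "".join(value * count for count, value in runs)


def rle_to_string(runs):
    """Render runs in the compact count-then-value notation."""
    return "".join(f"{count}{value}" for count, value in runs)


print(rle_to_string(rle_encode("aaaab")))  # 4a1b

# Sorting a column puts equal values next to each other, so fewer runs remain.
column = ["NY", "CA", "NY", "CA", "CA", "NY"]
print(len(rle_encode(column)))          # 5 runs
print(len(rle_encode(sorted(column))))  # 2 runs
```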