Conclusion: structured data is better for data analysis, but about 80% of data is unstructured, so our business will go blind if we leave it out.
| | JSON | CSV | Avro | Protocol Buffers | Parquet | ORC |
|---|---|---|---|---|---|---|
| Row or column | | | Row | | Column | Column |
| Schema | | | Best | | Good | Better |
| Read or write | | | Write | | Read | Read |
| Compression | | | Good | | Better | Best |
| Splittability | | | Good | | Good | Best |
| Human readable | Good | Best | Poor | Poor | Poor | Poor |

| Property | CSV | JSON | XML | Avro | Protocol Buffers | Parquet | ORC |
|---|---|---|---|---|---|---|---|
| Text versus binary | text | text | text | metadata in JSON, data in binary | binary | binary | binary |
| Data type | no | yes | no | yes | yes | yes | yes |
| Schema enforcement | no (minimal with header) | external for validation | external for validation | yes | yes | yes | yes |
| Schema evolution | no | yes | yes | yes | yes | yes | no |
| Storage type | row | row | row | row | row | column | column |
| OLAP/OLTP | OLTP | OLTP | OLTP | OLTP | OLTP | OLAP | OLAP |
| Splittable | yes in its simplest form | yes with JSON Lines | no | yes | no | yes | yes |
| Compression | yes | yes | yes | yes | yes | yes | yes |
| Batch | yes | yes | yes | yes | yes | yes | yes |
| Stream | yes | yes | no | yes | yes | no | no |
| Typed data | no | no | no | no | yes | no | no |
| Ecosystems | popular everywhere for its simplicity | API and web | enterprise | Big Data and streaming | RPC and Kubernetes | Big Data and BI | Big Data and BI |
Row
Column
Encoding the data into more compact forms
Let’s first examine encoding techniques that can be used by row- or column-oriented databases. Taking a column that holds states of the United States as the example, we will look at dictionary and bitmap encodings.
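As a rough illustration of the two encodings on a state column, here is a minimal Python sketch. The function names `dictionary_encode` and `bitmap_encode` and the sample values are assumptions made for this example, not something a particular database exposes. Dictionary encoding swaps each repeated string for a small integer code; bitmap encoding keeps one bit vector per distinct value.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = {}
    codes = []
    for value in values:
        # The first time a value is seen it gets the next free code.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def bitmap_encode(values):
    """Build one bit list per distinct value; bit i is 1 when row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmap = bitmaps.setdefault(value, [0] * len(values))
        bitmap[row] = 1
    return bitmaps


# Hypothetical sample column of US states.
states = ["Texas", "Ohio", "Texas", "Utah", "Ohio", "Texas"]

dictionary, codes = dictionary_encode(states)
bitmaps = bitmap_encode(states)

print(dictionary)        # {'Texas': 0, 'Ohio': 1, 'Utah': 2}
print(codes)             # [0, 1, 0, 2, 1, 0]
print(bitmaps["Texas"])  # [1, 0, 1, 0, 0, 1]
```

With few distinct values and many rows, the codes and bitmaps take far less space than the repeated strings, and a filter such as "state is Texas" can be answered from the bitmap alone.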
Now let’s take a look at run-length encoding. It replaces any run of the same value with a count and a value indicator. For instance, we can replace aaaab with 4a1b. This becomes even more powerful when you create projections with sorted columns, since all identical values then sit next to each other.
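A minimal Python sketch of run-length encoding, using the same count-then-value notation as the aaaab example. The helper names `rle_encode` and `rle_decode` are made up for illustration, and the decoder assumes the values themselves are never digits.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of identical characters into '<count><char>'."""
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))


def rle_decode(encoded):
    """Expand '<count><char>' pairs back into the original text."""
    decoded = []
    count = ""
    for char in encoded:
        if char.isdigit():
            count += char  # counts may span several digits
        else:
            decoded.append(char * int(count))
            count = ""
    return "".join(decoded)


print(rle_encode("aaaab"))  # 4a1b

# Sorting groups equal values together, so the runs get longer.
unsorted_column = "abcabcaab"
sorted_column = "".join(sorted(unsorted_column))
print(rle_encode(unsorted_column))  # 1a1b1c1a1b1c2a1b
print(rle_encode(sorted_column))    # 4a3b2c
```

The unsorted column actually grows when encoded, while the sorted one shrinks, which is why sorted projections pair so well with this encoding.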
Who knows the difference?
What's the most common?
What's the ideal world?
Common Structure of ETL or Streaming?
Why are we migrating from Hadoop?
Tools?
Things to think about?
Stateless streaming vs. stateful? Stateful scenario: I'm receiving the sales data of my car company and I want the aggregated data in near real time.
Different workers need to store the state
Imagine a streaming process on a huge cluster
Checkpoints?
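To make the stateful scenario and the checkpoint question concrete, here is a small single-process Python sketch. It is only an assumption-laden toy, not how Spark or Flink implement checkpointing: the `SalesAggregator` class, the `run` helper, the `stop_after` crash switch, the event tuples, and the JSON checkpoint file are all hypothetical. The idea it shows is that the running totals (the state) and the read position are saved together, so a restarted worker resumes from the last checkpoint and replays only the events that came after it.

```python
import json
import os
import tempfile


class SalesAggregator:
    """Keeps running sales totals per car model (the state) plus a read offset."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.totals = {}
        self.offset = 0  # number of events already folded into totals
        if os.path.exists(checkpoint_path):
            # Recover the state written by a previous run.
            with open(checkpoint_path) as f:
                saved = json.load(f)
            self.totals = saved["totals"]
            self.offset = saved["offset"]

    def process(self, event):
        model, amount = event
        self.totals[model] = self.totals.get(model, 0) + amount
        self.offset += 1

    def checkpoint(self):
        # Write to a temporary file, then rename, so a crash never leaves
        # a half-written checkpoint behind.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"totals": self.totals, "offset": self.offset}, f)
        os.replace(tmp_path, self.checkpoint_path)


def run(events, aggregator, checkpoint_every=2, stop_after=None):
    """Feed events to the aggregator; stop_after simulates a crash."""
    for index, event in enumerate(events):
        if index < aggregator.offset:
            continue  # already covered by the restored checkpoint
        if stop_after is not None and index >= stop_after:
            return  # simulated crash: state since the last checkpoint is lost
        aggregator.process(event)
        if aggregator.offset % checkpoint_every == 0:
            aggregator.checkpoint()


# Hypothetical sales events: (car model, sale amount).
events = [("sedan", 100), ("suv", 200), ("sedan", 50), ("suv", 25), ("sedan", 10)]

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "sales.checkpoint.json")

    first = SalesAggregator(path)
    run(events, first, stop_after=3)  # crashes after three events

    second = SalesAggregator(path)    # restarts from the checkpoint
    restored_offset = second.offset
    run(events, second)
    final_totals = dict(second.totals)

print(restored_offset)  # 2
print(final_totals)     # {'sedan': 160, 'suv': 225}
```

A stateless job would need none of this, since each event could be handled on its own. On a real cluster the same idea has to work across many workers, which is where the storage location and frequency of checkpoints start to matter.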
Where is the dataset being processed? Be careful with terminology
NiFi, Airflow, Informatica
Spark, Flink, Cloudflow
Suggestion: check Kappa architecture vs. Lambda architecture
Databricks