Where do I get and store my data?
Fanilo ANDRIANASOLO
@andfanilo
Big Data engineer @Worldline
1880 United States Census
Enumeration sheets
Hard to access on machines
50 million people
8 years to analyze
Obsolete results
1890 United States Census
Punched cards
Readable by tabulating system
63 million people
6 years instead of the estimated 11
What we learned
Small time frame analysis
Actionable results
Tabulated data
Easier to process
Good data storage
Good data analysis
This is data
274041730000999
Data represents facts about entities in our world
This is information
Information is data in context
Metadata describes the structure and context of data
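As a minimal sketch of how metadata turns data into information: the field layout below is purely an assumption for illustration (the slides do not say what the number encodes), loosely modeled on a national identification number. The names `LAYOUT` and `interpret` are hypothetical.

```python
# Hypothetical metadata: field name -> (start, end) offsets in the raw string.
# This layout is an assumption for illustration, not taken from the slides.
LAYOUT = {
    "sex": (0, 1),
    "birth_year": (1, 3),
    "birth_month": (3, 5),
    "department": (5, 7),
    "commune": (7, 10),
    "order": (10, 13),
    "check_key": (13, 15),
}

def interpret(raw, layout):
    """Apply a layout (metadata) to raw data and return named fields."""
    return {name: raw[start:end] for name, (start, end) in layout.items()}

record = interpret("274041730000999", LAYOUT)
print(record)
```

Without `LAYOUT` the string is only a run of digits; with it, each slice gets a name and can be read as a fact about an entity.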
This is an observation
Observation is the active acquisition of information from a primary source
Within a specific time frame, all observations have the same context
When storing observations, we should ensure:
Metadata for data definition
Data content quality
Data accessibility
A file is a durable container for data
The filename extension generally indicates how to interpret the contained data into information
Structured file formats
CSV - Comma Separated Value
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
Easy to read
Popular exchange format
Very compact
Not "standardized"
No hierarchical data
Not versatile if the schema changes
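A minimal sketch of reading the CSV sample with Python's standard `csv` module. The helper name `read_rows` is hypothetical; the header row is what gives each column its meaning.

```python
import csv
import io

CSV_TEXT = "Year,Make,Model\n1997,Ford,E350\n2000,Mercury,Cougar\n"

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(CSV_TEXT)
for row in rows:
    # Every value is a string: CSV carries no type information.
    print(row["Year"], row["Make"], row["Model"])
```

Because the format has no types and no nesting, any conversion (for example `int(row["Year"])`) is left to the reader of the file.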
Structured file formats
XML - eXtensible Markup Language
<person position="1">
  <name> Eric </name>
  <age> 26 </age>
</person>
<person>
  <name> Clara </name>
</person>
Complex, hierarchical data
Standardized API for parsing
Schema & validation features
3x larger than CSV
Tree-like model may be hard to manipulate
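A minimal sketch of walking the tree with Python's standard `xml.etree.ElementTree`. The `<people>` root element and the helper `read_people` are assumptions added so the two `<person>` fragments form one well-formed document.

```python
import xml.etree.ElementTree as ET

# A wrapping <people> root is added here (an assumption) so that the
# fragment from the slide becomes a single well-formed document.
XML_TEXT = """
<people>
  <person position="1"> <name> Eric </name> <age> 26 </age> </person>
  <person> <name> Clara </name> </person>
</people>
"""

def read_people(text):
    """Return one dict per <person>, with None for a missing <age>."""
    people = []
    for person in ET.fromstring(text).findall("person"):
        age = person.findtext("age")
        people.append({
            "name": person.findtext("name").strip(),
            "age": int(age) if age is not None else None,
            "position": person.get("position"),
        })
    return people

people = read_people(XML_TEXT)
print(people)
```

The second person has no `<age>` and no `position` attribute, which the tree model tolerates but the calling code has to handle explicitly.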
Structured file formats
JSON - JavaScript Object Notation
{
  "person": {
    "name": "Eric",
    "age": 26,
    "city": "France",
    "family": {...}
  }
}
Supports lots of data objects
Standardized parsing
Only 2x larger than CSV
A bit less support than XML
Not as robust as XML for now
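A minimal sketch of parsing and writing JSON with Python's standard `json` module. The elided `family` object is left out rather than guessed.

```python
import json

# The elided "family" object from the slide is left out; the other values
# are the ones shown on the slide.
JSON_TEXT = '{"person": {"name": "Eric", "age": 26, "city": "France"}}'

doc = json.loads(JSON_TEXT)           # text -> nested dicts
person = doc["person"]
print(person["name"], person["age"])  # age is already an int, not a string

round_trip = json.dumps(doc, sort_keys=True)  # nested dicts -> text
```

Unlike CSV, numbers and nested objects survive the round trip without extra conversion code.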
CSV, XML, JSON are popular data formats
But they are not efficient for large data transfers
This is even more important in a distributed data system for storing web-scale data
Deserialization
Serialization
0100101010...
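A rough sketch of serialization and deserialization using Python's standard `struct` module. The fixed binary layout in `RECORD` is a hypothetical choice for illustration; real frameworks use schemas and more flexible encodings.

```python
import json
import struct

# Hypothetical fixed binary layout: little-endian unsigned 16-bit year,
# then two 8-byte, NUL-padded ASCII strings for make and model.
RECORD = struct.Struct("<H8s8s")

def serialize(year, make, model):
    """Object fields -> bytes."""
    return RECORD.pack(year, make.encode("ascii"), model.encode("ascii"))

def deserialize(blob):
    """Bytes -> object fields."""
    year, make, model = RECORD.unpack(blob)
    return (year,
            make.rstrip(b"\0").decode("ascii"),
            model.rstrip(b"\0").decode("ascii"))

blob = serialize(1997, "Ford", "E350")
as_json = json.dumps({"Year": 1997, "Make": "Ford", "Model": "E350"})
print(len(blob), "bytes in binary vs", len(as_json.encode("utf-8")), "bytes as JSON")
```

The binary record drops the field names and punctuation that a text format repeats for every row, which is where most of the saving comes from in this example.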
Protocol buffers, Thrift, Avro
message Person {
  required string name = 1;
  optional string email = 2;
}
Person john = Person.newBuilder()
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
FileOutputStream output = new FileOutputStream(args[0]);
john.writeTo(output);
output.close();
Person john;
fstream input(argv[1], ios::in | ios::binary);
john.ParseFromIstream(&input);
string name = john.name();
string email = john.email();
Interface definition
Code generation for serialization/deserialization
A feature is an attribute used to represent an observation
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
Set of fields = basis of features
Hard to reason using internal structure
Feature extraction
Feature engineering
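A minimal sketch of feature extraction on the CSV sample. The chosen features (car age relative to a reference year, one-hot encoding of the make) and the function `extract_features` are hypothetical; the slides do not prescribe specific features.

```python
import csv
import io

CSV_TEXT = "Year,Make,Model\n1997,Ford,E350\n2000,Mercury,Cougar\n"

def extract_features(row, makes, reference_year):
    """Turn one raw row into a numeric feature vector.

    Hypothetical features: the car's age relative to reference_year,
    followed by a one-hot encoding of the make.
    """
    age = reference_year - int(row["Year"])
    one_hot = [1 if row["Make"] == make else 0 for make in makes]
    return [age] + one_hot

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
makes = sorted({row["Make"] for row in rows})
features = [extract_features(row, makes, reference_year=2020) for row in rows]
print(features)
```

The raw fields are only the basis: the numeric vectors are what an algorithm can actually reason on.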
Apache Hadoop ecosystem
=
HDFS
Distributed data storage
YARN
Distributed data processing
Batch
MapReduce
Script
Pig
SQL
Hive
NoSQL
HBase
In memory
Spark
and more...
"Moving computation is cheaper than moving data"
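A toy, single-process sketch of the MapReduce idea behind that quote. The two partitions stand in for data blocks stored on different nodes; the function names and sample records are hypothetical, and nothing here uses Hadoop itself.

```python
from collections import defaultdict

def map_phase(partition):
    """Runs next to the data: emit (make, 1) for each record in a partition."""
    return [(record["Make"], 1) for record in partition]

def reduce_phase(pairs):
    """Sum the counts for each key after the shuffle."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Two hypothetical partitions, standing in for blocks stored on two nodes.
partitions = [
    [{"Make": "Ford"}, {"Make": "Mercury"}],
    [{"Make": "Ford"}],
]
# Only the small (key, count) pairs travel between phases, not the records.
mapped = [pair for partition in partitions for pair in map_phase(partition)]
counts = reduce_phase(mapped)
print(counts)
```

In a real cluster the map function is shipped to the node holding each block, so only the compact intermediate pairs cross the network.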
Bar(bar:string, addr: string)
Sells(bar:string, beer: string, price: real)
bar | addr |
---|---|
Joe's | 1 rue Gambetta |
Sue's | 2 rue Arnold |
bar | beer | price |
---|---|---|
Joe's | Bud's | 2.50 |
Joe's | Miller | 2.80 |
Sue's | Bud's | 2.50 |
Sue's | Miller | 3.20 |
SELECT Bar.addr, Sells.beer
FROM Bar JOIN Sells
ON Bar.bar = Sells.bar
WHERE Sells.price < 2.8
AND Sells.beer = 'Miller'
Where can I find Miller beers under 2.8 €?
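A minimal sketch that loads the two tables into an in-memory SQLite database and runs the join. SQLite is used only because it ships with Python; the helpers `build_db` and `find_beer` are hypothetical names.

```python
import sqlite3

def build_db():
    """Create the Bar and Sells tables and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Bar (bar TEXT, addr TEXT)")
    conn.execute("CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL)")
    conn.executemany("INSERT INTO Bar VALUES (?, ?)", [
        ("Joe's", "1 rue Gambetta"), ("Sue's", "2 rue Arnold")])
    conn.executemany("INSERT INTO Sells VALUES (?, ?, ?)", [
        ("Joe's", "Bud's", 2.50), ("Joe's", "Miller", 2.80),
        ("Sue's", "Bud's", 2.50), ("Sue's", "Miller", 3.20)])
    return conn

def find_beer(conn, beer, max_price):
    """Addresses of bars selling `beer` strictly below `max_price`."""
    query = """
        SELECT Bar.addr, Sells.beer
        FROM Bar JOIN Sells ON Bar.bar = Sells.bar
        WHERE Sells.price < ? AND Sells.beer = ?
    """
    return conn.execute(query, (max_price, beer)).fetchall()

conn = build_db()
print(find_beer(conn, "Miller", 3.0))
```

With the sample rows, the strict `<` comparison excludes Joe's Miller at exactly 2.80, so `find_beer(conn, "Miller", 2.8)` returns no rows; a limit of 3.0 returns Joe's address.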
MySQL, Oracle, PostgreSQL...
Data definition, manipulation and control language
SparkSQL, Apache Drill, "R/Python Dataframes", Presto, Apache Hive...
Databases that don't rely on a tabulated model
Key value store
Examples: Redis, Riak
Client
Client
key1: value1
key2: [value2, value3]
counter: 200
set key1 value1
get key1
Client
counter increment
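A toy in-memory sketch of the set, get and increment operations shown above. The class `KeyValueStore` is hypothetical and says nothing about how Redis or Riak are implemented.

```python
class KeyValueStore:
    """Toy in-memory key-value store; not how Redis or Riak are implemented."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def increment(self, key, amount=1):
        # A missing counter starts at 0.
        self._data[key] = self._data.get(key, 0) + amount
        return self._data[key]

store = KeyValueStore()
store.set("key1", "value1")
store.set("key2", ["value2", "value3"])
store.set("counter", 200)
print(store.get("key1"), store.increment("counter"))
```

The store only knows keys: there is no way to ask for "all values equal to X" without scanning everything, which is the trade-off for very simple, fast lookups.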
Column oriented
Examples: Cassandra, HBase
RowKey | Date | Brand | Product | Paid |
---|---|---|---|---|
fanilo1 | 26/03/2012 | Pear | Phone | 10 |
fanilo2 | 27/03/2012 | Pear | 12 | |
fanilo3 | 12/04/2013 | PH | PC | |
kiwi1 | Fruit | 12 |
Client
put(fanilo4,...)
Client
Scan(fanilo1,fanilo3, filter(Brand=Pear))
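A toy sketch of `put` and a filtered range `Scan` over sorted row keys. The class `WideColumnTable` is hypothetical, the scan range is inclusive on both ends by assumption, and only the two unambiguous rows from the table are loaded.

```python
class WideColumnTable:
    """Toy table: rows keyed by a row key, each holding its own set of columns."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, **columns):
        # Rows are sparse: only the columns that are given get stored.
        self._rows.setdefault(row_key, {}).update(columns)

    def scan(self, start_key, stop_key, **filters):
        """Yield (row_key, columns) for start_key <= key <= stop_key, in key order."""
        for key in sorted(self._rows):
            if start_key <= key <= stop_key:
                columns = self._rows[key]
                if all(columns.get(c) == v for c, v in filters.items()):
                    yield key, columns

table = WideColumnTable()
table.put("fanilo1", Date="26/03/2012", Brand="Pear", Product="Phone", Paid=10)
table.put("fanilo3", Date="12/04/2013", Brand="PH", Product="PC")
matches = list(table.scan("fanilo1", "fanilo3", Brand="Pear"))
print(matches)
```

Because rows are kept in key order, a range scan touches only the keys between the two bounds; the row key design therefore decides which queries are cheap.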
Document oriented
Examples: MongoDB, Couchbase
Client
Client
{
surname: Sue,
age: 34
}
{
name: Fanilo,
houses: [{...}, {...}]
}
insert {name: Sue,age: 34}
query({name: Fanilo})
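A toy sketch of `insert` and `query` on schema-less documents. The class `DocumentCollection` is hypothetical and is not the MongoDB or Couchbase API; the elided house objects are left empty rather than filled in.

```python
class DocumentCollection:
    """Toy schema-less collection of dict documents."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))

    def query(self, criteria):
        """Return documents whose fields equal every key/value in criteria."""
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

people = DocumentCollection()
people.insert({"name": "Sue", "age": 34})
# Documents need not share fields; the house objects are elided on the
# slide, so they are left empty here.
people.insert({"name": "Fanilo", "houses": [{}, {}]})
print(people.query({"name": "Fanilo"}))
```

Each document carries its own structure, so adding a field to one record requires no schema change for the others.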
Graph based store
Examples: Neo4j
MATCH (sally:Person { name: 'Sally' })
MATCH (john:Person { name: 'John' })
MATCH (sally)-[r:FRIEND_OF]-(john)
RETURN r.since as friends_since
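A rough plain-Python equivalent of the query above, to show what the pattern match does. The node and edge structures and the `since` value are made up for illustration; a graph database would index and traverse these relationships natively.

```python
# Nodes keyed by name, and undirected FRIEND_OF edges carrying properties.
# The "since" value is a made-up example, not data from the slides.
nodes = {"Sally": {"label": "Person"}, "John": {"label": "Person"}}
edges = [
    {"type": "FRIEND_OF", "between": frozenset({"Sally", "John"}), "since": 2012},
]

def friends_since(a, b):
    """Return the 'since' property of each FRIEND_OF edge linking a and b."""
    pair = frozenset({a, b})
    return [e["since"] for e in edges
            if e["type"] == "FRIEND_OF" and e["between"] == pair]

print(friends_since("Sally", "John"))
```

As in the undirected pattern `(sally)-[r:FRIEND_OF]-(john)`, the order of the two people does not matter.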
Some datasets
REST APIs
Client
Client
Client
Web component
GET /data/{id}
POST /user/{id}
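A minimal client-side sketch of the two calls using Python's standard `urllib.request`. The host in `BASE_URL`, the helper names and the JSON payload are hypothetical; the requests are only prepared here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://api.example.com"  # hypothetical host, for illustration only

def build_get_data(item_id):
    """Prepare (without sending) a GET request for /data/{id}."""
    return urllib.request.Request(f"{BASE_URL}/data/{item_id}", method="GET")

def build_post_user(user_id, payload):
    """Prepare (without sending) a POST request for /user/{id} with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/user/{user_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

get_request = build_get_data(42)
post_request = build_post_user(42, {"name": "Sue"})
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

Passing either object to `urllib.request.urlopen` would send it; the resource path identifies the entity and the HTTP verb identifies the action.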
Some REST APIs
...so many libraries to manipulate data
...and storage/processing platforms
Data Science
=
Data structures
+
Algorithms