Data stores

 

Where do I get and store my data?

Fanilo ANDRIANASOLO

@andfanilo

Big Data engineer @Worldline

Data analysis in 1880

1880 United States Census

Enumeration sheets

 

Hard to access on machines

 

50 million people

 

8 years to analyze

 

Obsolete results      

1890 United States Census

Punched cards

 

Readable by tabulating system

 

63 million people

 

6 years instead of estimated 11

What we learned

Small time frame analysis 

Actionable results

Tabulated data

Easier to process

Good data storage

Good data analysis

Data analysis is only as good as the data provided

Agenda

  • Some terminology
  • File formats & serialization
  • Distributed filesystems
  • Databases
  • Data sources & APIs

Data

This is data

 

274041730000999

 

Data represents facts about entities in our world

This is information

 

 

 

 

 

 

 

Information is data in context

 

Metadata describes the structure and context of data

This is an observation

 

 

 

 

 

Observation is the active acquisition of information from a primary source

 

Within a specific time frame, all observations have the same context

When storing observations, we should ensure:

Metadata for data definition

 

Data content quality

 

Data accessibility
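A minimal sketch of the first two points, assuming a made-up sensor reading as the observation. The field names, the `METADATA` dictionary and the `validate` helper are all hypothetical and only illustrate how metadata can drive a basic quality check.

```python
from datetime import datetime, timezone

# Hypothetical metadata: what each field of an observation means and its type.
METADATA = {
    "sensor_id": {"type": str, "description": "identifier of the primary source"},
    "temperature_c": {"type": float, "description": "air temperature in Celsius"},
    "observed_at": {"type": datetime, "description": "time of the observation (UTC)"},
}

def validate(observation):
    """Return a list of quality problems; an empty list means the record is usable."""
    problems = []
    for name, spec in METADATA.items():
        if name not in observation:
            problems.append(f"missing field: {name}")
        elif not isinstance(observation[name], spec["type"]):
            problems.append(f"wrong type for {name}")
    return problems

good = {
    "sensor_id": "s-1",
    "temperature_c": 21.5,
    "observed_at": datetime(2016, 1, 1, tzinfo=timezone.utc),
}
bad = {"sensor_id": "s-1", "temperature_c": "21.5"}  # string instead of float, no timestamp

print(validate(good))
print(validate(bad))
```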

File formats & serialization

A file is a durable container for data

The filename extension generally indicates how to interpret the contained data as information

Structured file formats

CSV - Comma-Separated Values

Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar

Easy to read

Popular exchange format

Very compact

Not "standardized"

No hierarchical data

Not versatile if the schema changes
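As a rough illustration, the sample above can be read with Python's standard `csv` module. The sketch also shows one of the listed weaknesses: every value arrives as a string, so the types have to be restored by hand.

```python
import csv
import io

CSV_TEXT = """Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
"""

# DictReader uses the first line as the field names.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
cars = [dict(row) for row in reader]

# CSV carries no type information, so the year must be converted explicitly.
for car in cars:
    car["Year"] = int(car["Year"])

print(cars)
```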

Structured file formats

XML - eXtensible Markup Language

<person position="1">
      <name> Eric </name>
      <age> 26 </age>
</person>
<person>
      <name> Clara </name>
</person>

Complex, hierarchical data

Standardized API for parsing

Schema & validation features

3x larger than CSV

Tree-like model may be hard to manipulate
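A small sketch of tree-style parsing with Python's standard `xml.etree.ElementTree`. The `<people>` root element is an assumption added here, because a well-formed XML document needs a single root; the rest mirrors the sample above, including the optional `age` element and `position` attribute.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<people>
  <person position="1">
    <name> Eric </name>
    <age> 26 </age>
  </person>
  <person>
    <name> Clara </name>
  </person>
</people>
"""

root = ET.fromstring(XML_TEXT)
people = []
for person in root.findall("person"):
    age = person.findtext("age")  # None when the element is absent
    people.append({
        "name": person.findtext("name").strip(),
        "age": int(age) if age is not None else None,
        "position": person.get("position"),  # attributes are always strings
    })

print(people)
```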

Structured file formats

JSON - JavaScript Object Notation

{
  "person": {
    "name": "Eric",
    "age": 26,
    "country": "France",
    "family": {...}
  }
}

Supports objects, arrays, strings, numbers, booleans and null

Standardized parsing

Only 2x larger than CSV

A bit less support than XML

Not as robust as XML for now
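A short sketch of a JSON round trip with Python's standard `json` module. Only the `name` and `age` fields of the sample are used; the elided `family` object is left out rather than guessed.

```python
import json

JSON_TEXT = '{"person": {"name": "Eric", "age": 26}}'

# Parsing yields plain dicts, lists, strings and numbers with their types intact.
doc = json.loads(JSON_TEXT)
name = doc["person"]["name"]
age = doc["person"]["age"]

# Modify the nested value and serialize it back to text.
doc["person"]["age"] = age + 1
round_trip = json.dumps(doc, sort_keys=True)

print(name, age, round_trip)
```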

CSV, XML, JSON are popular data formats

 

But they are not efficient for large data transfers

 

This is even more important in a distributed data system for storing web-scale data

Deserialization

Serialization

0100101010...

  • Description, schema of serialized data
  • Standard interface for serializing/deserializing in multiple languages

Protocol buffers, Thrift, Avro

Schema (.proto):

message Person {
  required string name = 1;
  optional string email = 2;
}

Writing in Java:

Person john = Person.newBuilder()
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
FileOutputStream output = new FileOutputStream(args[0]);
john.writeTo(output);
output.close();

Reading in C++:

Person john;
std::fstream input(argv[1],
    std::ios::in | std::ios::binary);
john.ParseFromIstream(&input);
std::string name = john.name();
std::string email = john.email();

Interface definition

Code generation for serialization/deserialization
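To make the idea concrete without any generated code, here is a hand-rolled sketch using Python's standard `struct` module. It is not the Protocol Buffers, Thrift or Avro wire format: the `SCHEMA` list and the length-prefixed layout are assumptions chosen for illustration. The point is that writer and reader share a schema, so field names never travel with the data.

```python
import json
import struct

# Hypothetical shared schema: ordered string fields known to both sides.
SCHEMA = ["name", "email"]

def serialize(record):
    """Encode each field as a 2-byte big-endian length followed by UTF-8 bytes."""
    out = bytearray()
    for field in SCHEMA:
        data = record[field].encode("utf-8")
        out += struct.pack(">H", len(data)) + data
    return bytes(out)

def deserialize(payload):
    """Rebuild the record by walking the payload in schema order."""
    record, offset = {}, 0
    for field in SCHEMA:
        (length,) = struct.unpack_from(">H", payload, offset)
        offset += 2
        record[field] = payload[offset:offset + length].decode("utf-8")
        offset += length
    return record

john = {"name": "John Doe", "email": "jdoe@example.com"}
payload = serialize(john)
as_json = json.dumps(john).encode("utf-8")

print(len(payload), "bytes in binary,", len(as_json), "bytes as JSON")
print(deserialize(payload))
```

Real serialization frameworks add what this sketch lacks: typed fields, optional fields, schema evolution and generated classes for several languages.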

On features and structure

A feature is an attribute used to represent an observation

Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar

Set of fields = basis of features

Hard to reason using internal structure

Feature extraction

Feature engineering
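A possible sketch of both steps on the car sample. The reference year and the choice of features (vehicle age, one-hot encoded make) are assumptions made for illustration, not part of the original slides.

```python
import csv
import io

CSV_TEXT = """Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
"""

REFERENCE_YEAR = 2016  # assumed "current" year used to derive the age

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Feature extraction: the fields themselves give the basis of the features.
makes = sorted({row["Make"] for row in rows})

def to_features(row):
    """Feature engineering: derive a numeric age and a one-hot encoded make."""
    age = REFERENCE_YEAR - int(row["Year"])
    one_hot = [1 if row["Make"] == make else 0 for make in makes]
    return [age] + one_hot

features = [to_features(row) for row in rows]
print(makes, features)
```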

Distributed filesystems

Apache Hadoop ecosystem

=

HDFS

Distributed data storage

YARN

Distributed data processing

Batch

 

MapReduce

Script

 

Pig

SQL

 

Hive

NoSQL

 

HBase

In memory

 

Spark

and more...
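The batch model can be pictured with a single-process sketch: each block of a file is counted where it is stored, and only the small partial results are merged. The node names and block contents are hypothetical, and nothing here uses an actual Hadoop API.

```python
from collections import Counter
from functools import reduce

# Hypothetical blocks of one file, each stored on a different node.
BLOCKS = {
    "node-1": ["ford mercury ford"],
    "node-2": ["mercury cougar"],
    "node-3": ["ford"],
}

def map_on_node(lines):
    """Runs where the block lives; only the small Counter leaves the node."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Map phase: one partial count per node.
partials = [map_on_node(lines) for lines in BLOCKS.values()]

# Reduce phase: merge the partial counts into the final result.
totals = reduce(lambda left, right: left + right, partials, Counter())
print(dict(totals))
```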

"Moving computation is cheaper than moving data"

Databases

An organized collection of data, schemas and queries

RDBMS - SQL

Bar(bar: string, addr: string)

Sells(bar: string, beer: string, price: real)

bar    | addr
Joe's  | 1 rue Gambetta
Sue's  | 2 rue Arnold

bar    | beer   | price
Joe's  | Bud's  | 2.50
Joe's  | Miller | 2.80
Sue's  | Bud's  | 2.50
Sue's  | Miller | 3.20

SELECT Bar.addr, Sells.beer
FROM Bar JOIN Sells
ON Bar.bar = Sells.bar
WHERE Sells.price < 2.8
AND Sells.beer = 'Miller'

Where can I find Miller beers under 2.8 €?

MySQL, Oracle, PostgreSQL...
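The example can be tried with Python's built-in `sqlite3` module, used here as a stand-in for the servers listed above. One thing the run makes visible: with the sample rows, the strict comparison returns nothing because Joe's Miller costs exactly 2.80, so the sketch also runs an inclusive variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bar (bar TEXT PRIMARY KEY, addr TEXT);
CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL);
""")
conn.executemany("INSERT INTO Bar VALUES (?, ?)", [
    ("Joe's", "1 rue Gambetta"),
    ("Sue's", "2 rue Arnold"),
])
conn.executemany("INSERT INTO Sells VALUES (?, ?, ?)", [
    ("Joe's", "Bud's", 2.50),
    ("Joe's", "Miller", 2.80),
    ("Sue's", "Bud's", 2.50),
    ("Sue's", "Miller", 3.20),
])

# The operator is a fixed constant; the values are bound as parameters.
QUERY = """
SELECT Bar.addr, Sells.beer
FROM Bar JOIN Sells ON Bar.bar = Sells.bar
WHERE Sells.price {op} ? AND Sells.beer = ?
"""
strict = conn.execute(QUERY.format(op="<"), (2.8, "Miller")).fetchall()
inclusive = conn.execute(QUERY.format(op="<="), (2.8, "Miller")).fetchall()

print(strict)
print(inclusive)
```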

Know your SQL

Data definition, manipulation and control language

SparkSQL, Apache Drill, "R/Python Dataframes", Presto, Apache Hive...

NoSQL

Databases that don't rely on a tabular (relational) model

Key value store

Examples: Redis, Riak

Store contents:

key1: value1
key2: [value2, value3]
counter: 200

Client operations:

set key1 value1
get key1
counter increment
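The three operations can be sketched with a dictionary-backed class. `KeyValueStore` and its method names are hypothetical and do not mirror the Redis or Riak client APIs; a real store adds persistence, networking and atomic increments across clients.

```python
class KeyValueStore:
    """Toy in-memory store: values are opaque and addressed only by key."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def increment(self, key, amount=1):
        # A missing counter starts at 0, a common convention in such stores.
        self._data[key] = self._data.get(key, 0) + amount
        return self._data[key]

store = KeyValueStore()
store.set("key1", "value1")
store.set("key2", ["value2", "value3"])
store.set("counter", 200)

new_counter = store.increment("counter")
print(store.get("key1"), store.get("key2"), new_counter)
```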

Column oriented

Examples: Cassandra, HBase

RowKey Date Brand Product Paid
fanilo1 26/03/2012 Pear Phone 10
fanilo2 27/03/2012 Pear 12
fanilo3 12/04/2013 PH PC
kiwi1 Fruit 12

Client

put(fanilo4,...)

Client

Scan(fanilo1,fanilo3, filter(Brand=Pear))
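A toy model of the put and scan calls, assuming rows are kept sorted by row key and that a scan includes the start key and excludes the stop key. `ColumnTable` is hypothetical and is not the Cassandra or HBase client API. Only the three `fanilo` rows are reproduced, and each row stores only the columns it actually has.

```python
class ColumnTable:
    """Sparse rows addressed by row key; each row holds its own set of columns."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, **columns):
        self._rows.setdefault(row_key, {}).update(columns)

    def scan(self, start, stop, **equals):
        """Yield rows with start <= key < stop whose columns match the filter."""
        for key in sorted(self._rows):
            if start <= key < stop:
                row = self._rows[key]
                if all(row.get(col) == val for col, val in equals.items()):
                    yield key, row

table = ColumnTable()
table.put("fanilo1", Date="26/03/2012", Brand="Pear", Product="Phone", Paid=10)
table.put("fanilo2", Date="27/03/2012", Brand="Pear", Paid=12)
table.put("fanilo3", Date="12/04/2013", Brand="PH", Product="PC")

matches = list(table.scan("fanilo1", "fanilo3", Brand="Pear"))
print([key for key, _ in matches])
```

Because rows are ordered by key, a range scan only touches the keys between the two bounds, which is why row key design matters so much in this family of stores.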

Document oriented

Examples: MongoDB, Couchbase

Stored documents:

{
  surname: Sue,
  age: 34
}

{
  name: Fanilo,
  houses: [{...}, {...}]
}

Client operations:

insert({name: Sue, age: 34})
query({name: Fanilo})
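A minimal sketch of insert and query-by-example over schemaless documents. The `Collection` class is hypothetical and is not the MongoDB or Couchbase API; the contents of `houses` are elided in the original, so empty placeholders are used.

```python
class Collection:
    """Toy document collection: documents are dicts with no fixed schema."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))

    def query(self, criteria):
        """Return documents whose fields equal every key/value in criteria."""
        return [
            doc for doc in self._docs
            if all(doc.get(key) == value for key, value in criteria.items())
        ]

people = Collection()
people.insert({"name": "Sue", "age": 34})
# Placeholder house documents: their contents are not given in the slides.
people.insert({"name": "Fanilo", "houses": [{}, {}]})

found = people.query({"name": "Fanilo"})
print(found)
```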

Graph based store

Examples: Neo4j

MATCH (sally:Person { name: 'Sally' })
MATCH (john:Person { name: 'John' })
MATCH (sally)-[r:FRIEND_OF]-(john)
RETURN r.since AS friends_since
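What the query asks for can be mimicked in plain Python, assuming relationships are stored as edges that carry their own properties. The edge list and the `since` values are invented for illustration; a graph database indexes these relationships instead of scanning a list.

```python
# Each relationship is an edge with a type, two endpoints and properties.
# The years are hypothetical sample values.
EDGES = [
    {"type": "FRIEND_OF", "between": frozenset({"Sally", "John"}), "since": 2013},
    {"type": "FRIEND_OF", "between": frozenset({"John", "Clara"}), "since": 2015},
]

def friends_since(a, b):
    """Mimic MATCH (a)-[r:FRIEND_OF]-(b) RETURN r.since, ignoring direction."""
    return [
        edge["since"] for edge in EDGES
        if edge["type"] == "FRIEND_OF" and edge["between"] == frozenset({a, b})
    ]

print(friends_since("Sally", "John"))
print(friends_since("Sally", "Clara"))
```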

Data sources & APIs

Some datasets

REST APIs

Client

Client

Client

Web component

GET /data/{id}

POST /user/{id}
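A sketch of how a web component might dispatch these two routes, written as a pure function so it runs without a server. The `handle` function, the in-memory `DATA` and `USERS` dictionaries and the status codes chosen are assumptions for illustration, not the behaviour of any specific API.

```python
import json
import re

DATA = {"42": {"value": "hello"}}  # hypothetical stored resources
USERS = {}

def handle(method, path, body=None):
    """Return (status code, payload) for a request, REST style."""
    data_match = re.fullmatch(r"/data/([^/]+)", path)
    if method == "GET" and data_match:
        item = DATA.get(data_match.group(1))
        if item is None:
            return 404, {"error": "not found"}
        return 200, item

    user_match = re.fullmatch(r"/user/([^/]+)", path)
    if method == "POST" and user_match:
        # The request body is JSON text describing the user.
        USERS[user_match.group(1)] = json.loads(body or "{}")
        return 201, USERS[user_match.group(1)]

    return 404, {"error": "no such route"}

print(handle("GET", "/data/42"))
print(handle("POST", "/user/7", '{"name": "Sue"}'))
```

The resource is named by the URL and the action by the HTTP verb, which is what lets very different clients share the same interface.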

Some REST APIs

  • Twitter API
  • Meetup API
  • Facebook Graph API
  • Grand Lyon API
  • Wikidata
  • ...

...so many libraries to manipulate data

  • Python pandas / R data frames
  • Talend
  • Weka
  • Apache Spark
  • OpenRefine
  • ...

...and storage/processing platforms

  • Amazon Web Services (S3, EC2...)
  • Google Cloud Platform
  • IBM Bluemix
  • Microsoft Azure
  • Heroku
  • ...

Conclusion

Data Science

=

Data structures

+

Algorithms
