Delivering A
'Big Data ReadY'
MVP
Gregory Chomatas
Dublin Google Developers Group - 2013 July 30th
http://linkedin.com/in/gchomatas
CHOMATAS
GregorY
t: @gchomatas
http://www.astroboa.org
Entrepreneur
SW Engineer
Betaconcept / Astroboa: Founder
Aquinetix: Co-founder / CTO
7 yearS ago I realized...
LOTS of Object-relational Mismatch
DB is not the center of MY application
Domain Driven Design / Behaviour Driven Design
vs
Database Driven Design
At That time not Many alternatives existed
so we decided to roll our own data store solution...
Astroboa to the rescue
Hybrid Document-Graph Store focused on data semantics
Similar to Google Datastore & OrientDB
- External 'app independent' Semantic Data Model
- Model as you go
- Security per Entity instance / property
- Versioned Entities
- Automated REST APIs encapsulating the data layer
- Hyperlinked Resources
- Polyglot Persistence (Experimental) *
*Not available in the public version
The "Bigness' in big data
Two main paths to the realization of 'BIGNESS'
Luckily both paths converge to common principles & tools that can manage BIG Complexity & BIG Volume
Big 'Data Problems' (complexity)
-
single point of failure / resilience
-
cross data center
-
human fault tolerance
-
store / search unstructured or semi-structured data
-
flexible data modeling (e.g. traverse relationships)
- data versioning
- polyglot programming
-
multitenancy
-
share / data as a service
-
semantic web / multiple formats - endpoints
Flexible Options / Ease of operations
'BIG DATA' Problems (VOLUME)
- high volume
- high velocity
- real-time APIs / act in real time
- data as others service / dirty data from open sources
- log collection / aggregation
linear / Horizontal scaling
I am not a Big data Start-up!
- Start-up = Growth (5% - 10%) / week
- 1000 writes per aquaculture farm per day
- 120 farms on public beta = 120000 writes / day
- 1st month: 176 farms = 176000 writes / day
- 6th month: 1181 farms = 1.2M writes / day
- 1st year: 17045 farms = 17M writes / day (200/sec)
- 2nd year: 2421143 = 2.4B writes / day (27777 / sec)
ARE YOU SURE ?
a sucessful SaaS is a big data service
it's JUST an MVP - we will ADD all these big data stuff later
- A Big Data architecture can be simpler than a traditional one
- The right data store can increase productivity
- Keep it simple but not compromise the architectural concepts
- Balance between technical debt & technical equity
- An enterprise business system will usually win on underlying technological innovation, robustness and enterprise readiness
- "In business there is nothing more valuable than a technical advantage your competitors don't understand" - Paul Graham
Key BIG DATA Architecture features
Distributed Storage
- APPLICATION database vs INTEGRATION database
- Mix several data models / polyglot persistence
- External Data Schema / Common Data Structures
- Data Store encapsulated by an API (Data Services)
- Append only / save changes vs state (event sourcing)
KEY BIG DATA ARCHITECTURE FEATURES
Distributed Computing
- Asynchronous processing
- Real Time Event Processing / Streaming
- Simple decoupled services exposed through REST or RPC APIs (business services)
- Thick web clients / mob. apps using the REST or Streaming APIs
- Client-level multivariate data analysis & complex visualization
The Lambda architecture
by Nathan Marz and James Warren
store raw, immutable, perpetual data
query = function(all data)
combine batch & real time stream processing to compute arbitrary functions on arbitrary data
Ultimate Design Rule
KEEP it SIMPLE
The conventional ARCHITECTURE
new data store criteria
Distributed
Easy to change schema & queries
Simple to install, configure, operate
one component
auto-shard
peer-to-peer
Minimize impedance mismatch
Boost productivity
Directly store my aggregates
{
"date": "2013-02-28",
"allocated_worker": "swp4jhi4Tm6VxY1nueX2yw",
"cage": "1GuuHWTaQc-kpPcRV5uBGA",
"feed": "7IWmy2FATcS9Vh0RB1onXQ",
"quantity_approved": 12.5,
"farm": "__uBZUr3RWOqOSkszfbRLw",
"species": "KDU-2LCjRRynby9HLifc3g",
"batch": "i6MgxixnSCGwGWb0037wlQ",
"execution": {
"feeder": "swp4jhi4Tm6VxY1nueX2yw",
"quantity_fed": 12.5,
"species_position_start": "top",
"species_position_end": "middle",
"start": "2013-02-28T07:59:57.668Z",
"end": "2013-02-28T08:00:03.216Z",
"feeder_position_end": {
"lat_lon": {"lat": 37.7066959, "lon": 23.16831896},
"altitude": 40,
"accuracy": 12
}
}
}
The candidates
Key-Value |
Document |
Column |
Graph |
Riak
|
MongoDB |
Cassandra
|
Neo4J |
Redis |
CouchBase |
HBase |
Infinite Graph |
Pr. Voldemort
|
OrientDB
|
Hypertable |
OrientDB |
MemcacheDB |
ElasticSearch
|
Accumulo
|
Titan
|
DynamoDB
|
Google Datastore |
SimpleDB
|
Virtuoso |
My COOL data store tip
elasticsearch document store
No other NoSQL store comes close to the out of the box utility and usability of Elastic Search
schema less, multi tenant, replicating & sharding document store that implements extensible & advanced search features (geo spatial, faceting, filtering, etc.)
REST API to CREATE / UPDATE (partially) / DELETE / READ aggregates / entities
REST Search API with full text search out of the box
MULTI-TENANT friendly with REST API for creating / updating DBs & entity types
Dynamic / Semi-Dynamic / Fixed schema
Elastic search power
index over 95GB/h/node
8-node cluster: sub-200ms response for complex searches on 10B+ records
(oracle OR mysql) AND replication
apple AND ip*d
john AND city:Dublin
species:"Sea Bream" AND execution.date:[20130701 TO 20130730]
taxicub AND ("Dublin"^2 OR "Cork")
"facets" : {
"locations" : { "terms" : {"field" : "city"} }
}
"terms" : [ {
"term" : "Dublin",
"count" : 130
}, {
"term" : "Cork",
"count" : 20
}, {
"term" : "Galway",
"count" : 1
} ]
Histograms / GEO Distance
"facets" : {
"Feed_Histogram" : {
"date_histogram" : {
"key_field" : "date",
"value_field" : "execution.quantity_fed",
"interval" : "month"
}
}
}
"filter" : {
"geo_distance_range" : {
"from" : "200km",
"to" : "400km"
"pin.location" : {
"lat" : 40,
"lon" : -70
}
}
}
"filter" : {
"geo_polygon" : {
"person.location" : {
"points" : [
{"lat" : 40, "lon" : -70},
{"lat" : 30, "lon" : -80},
{"lat" : 20, "lon" : -90}
]
}
}
}
"filter" : {
"geo_distance" : {
"distance" : "200km",
"pin.location" : {
"lat" : 40,
"lon" : -70
}
}
}
RDBMS out - Document store in
The TITAN GRAPH DB
- Distributed
- Pluggable storage (Cassandra, HBase, Berkeley DB)
- Indexing with Elastic Search & Lucene
- Blueprints Interface
- Gremlin Query Language
- Rexter Server adds JSON-based REST interface
Easy Graph Traversal with gremlin
// calculate basic collaborative filtering for user 'Gregory'
m = [:]
g.v('name','Gregory').out('likes').in('likes').out('likes').groupCount(m)
m.sort{-it.value}
Start on a single machine
Data store selection tips (1)
-
Use polyglot persistence with multiple data models
-
Start with a Document Store as your system of record
-
Mix it with a key-value Store for keeping sessions, shopping cart, user prefs, counters, caching
- Mix it with a Graph store to keep and traverse entity relationships
- Use a Column Store as your system of record if you need performance rather than flexibility and you know well your data model & queries
- Keep a relational db for queries on transient data (reporting on inter-aggregate relationships)
DATA STORE SELECTION TIPS (2)
- Prefer one-component stores rather than many moving parts
-
Choose a store that makes it easy to experiment with schema and query changes & supports easy data migrations
- Prefer stores that can work with both dynamic & fixed schemas (there is always an implicit schema)
-
In early prototypes avoid Column stores as they have a high cost on schema and query changes
Data Store Selection Tips (3)
- Choose stores that support auto-sharding
-
Prefer peer-to-peer replication rather than master-slave
-
Replication factor N = 3 is a good standard choice
-
Consistency Adjustment Quorum: W > N/2 , W+R > N
All that Said...
APP CONTEXT is always the determining factor for selecting your store
as well as...
Safety / Stability
Productivity
Community
Performance
Tooling / Operation easeness
Data Modeling tips
- Remember that you fit your model to the data store and not Vice Versa (APPLICATION vs INTEGRATION DB)
- Use a Schema
- Build your aggregates or column families according to your use cases, i.e. DENORMALIZE per your query requirements
- Aggregates form the boundaries for ACID operations (transactions)
-
Pre-compute Question Focused Datasets (materialized views) to provide data organized differently from their primary aggregates
Are We finished YET?
NOT QUITE!
Do something with our monolithic app
Split THE Monolithic Application
-
Wrap data stores into DATA SERVICES
-
Create BUSINESS SERVICES on top of Data Services
-
Prefer RESTful APIs for services (ROA)
- Use a Binary Serialization Framework to create RPC APIs if performance is a concern (ROA / SOA)
- Move MVC* to fat mobile / web client apps that consume the APIs
JavaScript in the browser is one of the world's most widely distributed execution environments & Deployment is trivial !
decoupled serviceS
Fat Client
Single Page App
Api framework / DSL
class API < Grape::API
version 'v1', :using => :header, :vendor => 'aquinetix.com'
default_format :json
content_type :json, "application/json"
content_type :tsv, "text/tab-separated-values"
formatter :tsv, Aquinetix::TsvFormatter
content_type :kml, "text/xml"
formatter :kml, Aquinetix::KmlFormatter
mount CageAPI
mount CageEventsAPI
mount DeviceAPI
mount FeedAPI
mount FeedingAPI
mount LossCountEventAPI
mount OxygenSamplingEventAPI
mount SigninAPI
mount TemperatureSamplingEventAPI
mount UserAPI
add_swagger_documentation markdown: true, base_path: "http://..."
end
API FRAMEWORK / DSL
class FeedingAPI < Grape::API
resource :feedings do
desc 'Create a new feeding'
post do
execute_farm_obj_create_request 'Feeding'
end
desc 'Perform a FULL or PARTIAL update of an existing feeding'
params do
requires :id, type: String, desc: "The id (UUID) of ..."
optional :fields, type: String, desc: "Which fields ..."
end
put '/:id' do
execute_farm_obj_update_request 'Feeding'
end
desc 'Get a feeding by its id (UUID)'
params do
requires :id, :type => String, :desc => "Feeding id."
end
get '/:id' do
execute_farm_obj_instance_get_request 'Feeding'
end
end
end
MVC*at the client
-
Mobile app with backbone.js & phonegap
-
Management / BI Console with AngularJS
-
Visualization with D3.js
-
Multivariate Dataset Analysis at the browser with crossfilter.js
-
App workflow & build with yeoman, grunt, bower
* MVP, MVVM, MVC, MVW
Asynchronous / Real Time processing & Streaming API
RabbitMQ + RabbitMQ Web-Stomp Plugin at the server
SockJS, Stomp js libs at the client
Real-time event stream processing with ESPER
Alternative message brokers:
node.js + zeromq
kestrel
pusher
kafka (> 100k msg/sec)
Alternative Real-time stream processing: Storm
Use cases
count ratings, votes, click-throughs
block abusive crawlers
rate-limit apis
detect spamming attempts
track performance and trigger alerts
batch process logs
subscribe to Stomp topics from js
ws = new SockJS('http://node1.aquinetix.com:15674/stomp')
@client = Stomp.over(ws)
@client.connect('aquinetix', 'password', (x) =>
@on_connect(x)
@on_error, "/")
on_connect: (x) ->
console.log "Connected to message broker"
@feeding_subscr_id = @client.subscribe '/topic/feeding', (message) =>
feeding = JSON.parse(message.body)
Aq_Manager.events.trigger 'feeding_execution:arrived', feeding
@position_subscr_id = @client.subscribe '/topic/position', (message) =>
position = JSON.parse(message.body);
Aq_Manager.events.trigger 'worker_position:arrived', position
@client.send('/topic/feeding', {}, JSON.stringify(feeding_obj))
REAL time event processing with esper
select count(*) as tps, max(retweetCount) as maxRetweets from TwitterEvent.win:time_batch(1 sec)
select fraud.accountNumber as accntNum, fraud.warning as warn, withdraw.amount as amount,
MAX(fraud.timestamp, withdraw.timestamp) as timestamp, 'withdrawlFraud' as desc
from FraudWarningEvent.win:time(30 min) as fraud,
WithdrawalEvent.win:time(30 sec) as withdraw
where fraud.accountNumber = withdraw.accountNumber
LOG activity and operational data
Today a critical part of the production features of websites
Logstash + ElasticSearch + Kibana 3
WRAPUP
Should availability, robustness & scalability be added to your hypotheses & value proposition ?
if YES then :
Adopt an architecture with decoupled and distributed components at early stages. Build your team around it & balance technical debt / equity to get:
Increased team productivity, Increased readiness and agility, Sustainability
Build your data models around your use cases rather than around your database and experiment with a polyglot persistence strategy
Start with the most easy to install, configure & operate technologies.
Keep it SIMPLE & SUSTAINABLE
LINKS / REFERENCES
- https://github.com/sockjs/sockjs-client
-
https://github.com/robey/kestrel
- https://github.com/JustinTulloss/zeromq.node
- http://kafka.apache.org/index.html
- https://github.com/nathanmarz/storm
- https://developers.helloreverb.com/swagger/
- https://github.com/wordnik/swagger-ui