Let's Talk Big-Data

First of all...

This presentation is

a general overview

It's not a technical drill-down

If something is not clear - ask

BUT - technical questions - later

What is "Big-Data" ?

Questions...

When is data considered "BIG" ?

Answer:

"Big Data" is a term for data sets or flows that are so large or complex that traditional "handling" is inadequate

"handling" ?! 

Data

Capture

Curate

Analysis

Search

Transfer

Querying

and so much more

Data is "Big Data" if you need to hire a team of smart engineers just to handle it (distribute...)

Technologies

And these are just the ones I know of...

And platforms

How do you choose ?!

Intuition ?

Premonition ?

Experience ?

We have none of those

How do you choose ?!

Sit with experienced people, developers, architects

Listen

Do homework

Research

Understand your expectations & limitations

How do you choose ?!

And eventually ?

Listen to Amir

How do you choose ?!

Seriously ?

NO!!!

Understand your "Expectations"

Understand your "Limitations"

What do you need to support for the next year (or 2) ?
Capture "rate"
SLA
SLA
SLA
SLA
Curation "period"
Processing time
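
The expectations above can be turned into a rough back-of-the-envelope number. A minimal sketch, assuming hypothetical inputs (events per second, average event size, curation period in days) that are not taken from this presentation:

```python
def estimate_storage_gb(events_per_second: float,
                        avg_event_bytes: int,
                        curation_days: int) -> float:
    """Rough raw-storage estimate: capture rate x event size x curation period."""
    seconds_per_day = 24 * 60 * 60
    total_bytes = events_per_second * avg_event_bytes * seconds_per_day * curation_days
    return total_bytes / 1e9  # decimal gigabytes


# Hypothetical inputs, for illustration only.
estimate = estimate_storage_gb(events_per_second=500,
                               avg_event_bytes=2_000,
                               curation_days=365)
print(f"{estimate:.0f} GB of raw events over the curation period")
```

The estimate ignores compression, replication and indexes, so it is only a starting point for the conversation about limits.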

OR

Do we have the required knowledge ?
Dev Support
DevOps support
Ops support

+

Troubleshooting
SECURITY
SECURITY

How do you choose ?!

Understand the principles of Scalability & Big-Data

There are a lot of good options

Choices we make now might (and should) be invalidated in the future.

Why ?

  • Product ?
  • Pricing ?
  • Better options ?

Let's think about where our "BIG" is

"Event Handling"

"Processing"

"Persistence"

What did we choose ?

=
==

Also...

Over

"Event Handling"

"Processing"

"Persistence"

+

Why did we choose ?

It's the devil we know ;)

S3 became the de facto standard for scaling data curation: it is cheap, highly available, easy to use, and has extensions in many processing frameworks

Spark over EMR is currently one of the best contenders as a "Big Data Processing FW" - it remains so thanks to a large community of users and feature developers relentlessly making it better

High security requirements - in all aspects.

AWS maintains security standards and has a built-in encryption and key management solution that we're currently researching.

Why did we choose ?

Perhaps its biggest advantage over other tools is

It reduces the need for DevOps, as "scaling" is handled internally

Architecture @ High-Level

Server

DB

Utils

Panaya Server

Panaya DB

Architecture @ High-Level

Server

DB

Lambda

Kinesis

Firehose

S3

encrypted

"Raw Event Handling"

Other Events

RAW

Supervised by

IAM + KMS

API Gateway

Architecture @ High-Level

Lambda

Kinesis

Firehose

S3

encrypted

API Gateway

API Gateway: "front door" for applications to access data, business logic (BL), and functionality in the back end
Lambda: event-driven function, code runs in response to events from API Gateway
Kinesis Firehose: auto-magically buffers, then "dumps" to S3 (every MB / seconds)
S3: it's not file storage, it's Key-Value storage
encrypted: this is a requirement - "Key Per Customer"

IAM + KMS

Identity / Auth Management including Encrypt/Decrypt Key Management
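
The event path described above (API Gateway hands the request to an event-driven function, which passes it on to the buffering stream) could look roughly like this. A minimal sketch, not the actual implementation: `make_handler` and `put_record` are hypothetical names, and the event shape assumes an API Gateway proxy-style request where the body arrives as a string.

```python
import json


def make_handler(put_record):
    """Build a Lambda-style handler.

    `put_record` is an injected callable (hypothetical) that forwards bytes
    to the delivery stream; injecting it keeps the sketch testable offline.
    """

    def handler(event, context=None):
        # Assumption: the request body arrives as a JSON string under "body".
        try:
            payload = json.loads(event.get("body") or "")
        except json.JSONDecodeError:
            return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
        if not isinstance(payload, dict):
            return {"statusCode": 400, "body": json.dumps({"error": "expected an object"})}
        # Newline-delimit records so buffered events stay separable in storage.
        put_record((json.dumps(payload) + "\n").encode("utf-8"))
        return {"statusCode": 202, "body": json.dumps({"status": "accepted"})}

    return handler


# Local usage with an in-memory stand-in for the stream.
sent = []
handler = make_handler(sent.append)
print(handler({"body": '{"type": "click"}'})["statusCode"])
```

In a deployment, `put_record` would presumably wrap the Firehose client's `put_record(DeliveryStreamName=..., Record={"Data": data})` call from boto3; the stream name is left out here because the slides do not give one.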

Again - why use these ?

Lambda

Kinesis

Firehose

S3

encrypted

API Gateway

API Gateway: "Single point of entry" - allows us not to bind the "monitor" code to AWS (via the SDK). Good practice for controlling traffic and "Versioning"
Lambda: handling of incoming data for use cases such as "License Validation" && / || "BlackList", as well as JSON validity and more.
Kinesis Firehose: as S3 is Key-Value storage (and not an FS), there's no support for ops like "Append", so a buffer is required to generate a large file
S3: it's Key-Value storage - sensitive data should be encrypted
encrypted: this is a requirement - "Key Per Customer"
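
The "no Append, so buffer" point can be illustrated in a few lines. This is only a sketch of the idea that Firehose handles for us, not how Firehose itself is implemented: `ObjectBuffer`, `write_object`, the thresholds and the `raw/part-...` key pattern are all hypothetical.

```python
import time


class ObjectBuffer:
    """Accumulate records in memory and write them out as one object.

    `write_object(key, data)` is a hypothetical stand-in for an upload to a
    key-value store; size and age thresholds are illustrative defaults.
    """

    def __init__(self, write_object, max_bytes=5 * 1024 * 1024,
                 max_age_s=60.0, clock=time.monotonic):
        self._write_object = write_object
        self._max_bytes = max_bytes
        self._max_age_s = max_age_s
        self._clock = clock
        self._chunks = []
        self._size = 0
        self._started = None
        self._seq = 0

    def add(self, record: bytes) -> None:
        if self._started is None:
            self._started = self._clock()
        self._chunks.append(record)
        self._size += len(record)
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if not self._chunks:
            return
        key = f"raw/part-{self._seq:06d}"  # hypothetical key pattern
        self._write_object(key, b"".join(self._chunks))
        self._seq += 1
        self._chunks, self._size, self._started = [], 0, None
```

Note that in this sketch the age check only runs when a new record arrives; a managed service also flushes on a timer, which is one more thing we do not have to build.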

IAM + KMS

Identity / Auth Management including Encrypt/Decrypt Key Management
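
For the "Key Per Customer" requirement, one possible shape is to look up the customer's KMS key and ask S3 to encrypt server-side with it. A hedged sketch only: the helper name, the bucket, the customer-prefixed object key and the key mapping are assumptions; `ServerSideEncryption` and `SSEKMSKeyId` are the parameter names used by the S3 `put_object` call in boto3.

```python
def encrypted_put_args(bucket: str, customer_id: str, object_name: str,
                       body: bytes, kms_key_ids: dict) -> dict:
    """Build keyword arguments for an S3 put_object call using SSE-KMS.

    `kms_key_ids` maps customer id -> KMS key id or ARN (hypothetical
    mapping; where it lives is out of scope for this sketch).
    """
    if customer_id not in kms_key_ids:
        # Refuse to store data we cannot encrypt with the customer's own key.
        raise KeyError(f"no KMS key registered for customer {customer_id!r}")
    return {
        "Bucket": bucket,
        "Key": f"{customer_id}/{object_name}",  # assumed layout: one prefix per customer
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_ids[customer_id],
    }


# Placeholder values, for illustration only.
args = encrypted_put_args("example-bucket", "customer-a", "events.json",
                          b"{}", {"customer-a": "example-key-id"})
print(args["Key"])
```

With boto3 the result would be passed on as `s3_client.put_object(**args)`; IAM policies would then decide who may use which key.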

Elephant in the room...

Server-Less Arch

Plus/Minus :)

Architecture @ High-Level

S3

encrypted

"Processing"

Over

"Timed Batch"

We will have 2 types

"On Demand Batch"

Prepare data for "On Demand" Batch-Processing
This is the thing we were waiting for - Run the Main Algorithm
Scheduler
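
The split between the two batch types can be sketched in plain Python. This is an illustration under assumptions, not the Spark job itself: the slides do not say what "prepare" means, so grouping events per customer is a guess, and `customer_id` and `main_algorithm` are hypothetical placeholders.

```python
from collections import defaultdict


def timed_batch(raw_events):
    """Timed batch: pre-group raw events so the on-demand step reads less data.

    Assumption: "preparing" means grouping by a `customer_id` field.
    """
    prepared = defaultdict(list)
    for event in raw_events:
        prepared[event["customer_id"]].append(event)
    return dict(prepared)


def on_demand_batch(prepared, customer_id, main_algorithm):
    """On-demand batch: run the (placeholder) main algorithm for one customer."""
    return main_algorithm(prepared.get(customer_id, []))


# Hypothetical events and a trivial stand-in for the main algorithm.
events = [
    {"customer_id": "a", "value": 1},
    {"customer_id": "b", "value": 2},
    {"customer_id": "a", "value": 3},
]
prepared = timed_batch(events)
print(on_demand_batch(prepared, "a", lambda evs: sum(e["value"] for e in evs)))
```

The scheduler would trigger `timed_batch` on a fixed interval, while `on_demand_batch` runs only when someone asks for a result.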

Architecture @ High-Level

S3

encrypted

"Rollout to Panaya"

?

Consider

Scenarios still contain "sensitive" data

"pain" to think about

How do we develop "to cloud" ?

How do we "deploy" in this Arch ?

Connection to Panaya ?

How do we QA ? Test ?

Do we need on premise ?

Troubleshooting...

THANK U

Questions ?

Let's talk Big-Data

By Amir Gal-Or

Scaling "Up/Down/Left/Right" presentation for Panaya
