Let's Talk Big-Data
First of all...
This presentation is
a general overview
It's not a technical drill-down
If something is not clear - ask
BUT - technical questions - later
What is "Big-Data"
Questions...
When is data considered "BIG"
Answer:
"Big Data" is a term for data sets or flows so large or complex that traditional "handling" methods are inadequate
"handling" ?!
Data
Capture
Curate
Analysis
Search
Transfer
Querying
and so much more
Data is "Big Data" if you need to hire a team of smart engineers just to handle it (distribute...)
Technologies
And these are just the ones I know of...
And platforms
How do you choose ?!
Intuition ?
Premonition ?
Experience ?
We have none of those
How do you choose ?!
Sit with experienced people, developers, architects
Listen
Do homework
Research
Understand your expectations & limitations
How do you choose ?!
And eventually ?
Listen to Amir
How do you choose ?!
Seriously ?
NO!!!
Understand your "Expectations"
Understand your "Limitations"
What do you need to support for the next year (or 2)
Capture "rate"
SLA
SLA
SLA
SLA
Curation "period"
Processing time
OR
Do we have the required knowledge ?
Dev Support
DevOps support
Ops support
+
Troubleshooting
SECURITY
SECURITY
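The "expectations" above (capture rate, curation period) can be turned into a rough storage figure with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: the `Expectations` class, the `stored_bytes` helper and every number in it are assumptions made for illustration, not measurements from our system.

```python
from dataclasses import dataclass


@dataclass
class Expectations:
    events_per_second: float  # capture "rate"
    avg_event_bytes: int      # average size of a single event
    retention_days: int       # curation "period"


def stored_bytes(e: Expectations) -> float:
    """Raw bytes held at steady state, before compression or replication."""
    seconds_per_day = 24 * 60 * 60
    return e.events_per_second * e.avg_event_bytes * seconds_per_day * e.retention_days


# Hypothetical numbers, for illustration only.
example = Expectations(events_per_second=500, avg_event_bytes=2_000, retention_days=365)
print(f"{stored_bytes(example) / 1e12:.1f} TB")  # prints "31.5 TB"
```

Changing any one of the three inputs shows quickly which expectation drives the size of the problem.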
How do you choose ?!
Understand principles in Scalability & Big-Data
There are a lot of good options
Choices we make now might (and should) be invalidated in the future.
Why ?
- Product ?
- Pricing ?
- Better options ?
Let's Think where our "BIG" is
"Event Handling"
"Processing"
"Persistence"
What did we choose ?
Also...
Over
"Event Handling"
"Processing"
"Persistence"
+
Why did we choose ?
It's the devil we know ;)
S3 became the "de facto" standard for scaling data curation: it is cheap, highly available, easy to use, and has extensions in many processing frameworks
Spark over EMR is currently one of the best contenders as a "Big Data Processing FW". It stays that way thanks to a large community of users and feature developers relentlessly making it better
High security requirements - in all aspects.
AWS maintains security standards and has a built-in encryption and key management solution, which we are currently researching.
Why did we choose ?
Perhaps its biggest advantage over other tools is
Reduces the need for DevOps, as "scaling" is handled internally
Architecture @ High-Level
Server
DB
Utils
Panaya Server
Panaya DB
Architecture @ High-Level
Server
DB
Lambda
Kinesis
Firehose
S3
encrypted
"Raw Event Handling"
Other Events
RAW
Supervised by
IAM + KMS
API Gateway
Architecture @ High-Level
Lambda
Kinesis
Firehose
S3
encrypted
API Gateway
"front door" for applications to access data, BL, functionality in the BackEnd
event-driven function, code runs in response to events from API Gateway
auto-magically buffers, then "dumps" to S3 (by size in MB or by interval in seconds)
It's not file storage, it's Key-Value storage
This is a requirement - "Key Per Customer"
IAM + KMS
Identity / Auth Management including Encrypt/Decrypt Key Management
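The "buffer, then dump" behaviour described for Firehose can be pictured with a small in-memory sketch. This is not how Firehose is implemented; `FlushBuffer`, its thresholds and the injected `clock` are hypothetical names and values chosen only to show the idea of flushing by size or by age.

```python
import time
from typing import Callable, List, Optional


class FlushBuffer:
    """Collects records and flushes them as one blob, by size or by age."""

    def __init__(self, flush: Callable[[bytes], None],
                 max_bytes: int, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush            # e.g. "write one object to S3"
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._clock = clock
        self._records: List[bytes] = []
        self._size = 0
        self._first_at: Optional[float] = None

    def add(self, record: bytes) -> None:
        if self._first_at is None:
            self._first_at = self._clock()  # age is counted from the first record
        self._records.append(record)
        self._size += len(record)
        self.flush_if_due()

    def flush_if_due(self) -> None:
        if not self._records:
            return
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._first_at >= self._max_age
        if too_big or too_old:
            self._flush(b"".join(self._records))
            self._records, self._size, self._first_at = [], 0, None


# Usage with tiny, made-up thresholds: a list stands in for S3.
objects: List[bytes] = []
buf = FlushBuffer(objects.append, max_bytes=10, max_age_seconds=60)
buf.add(b"12345\n")  # 6 bytes, stays in the buffer
buf.add(b"67890\n")  # 12 bytes >= 10, both records go out as one object
```

In this sketch the age check only runs when `add` or `flush_if_due` is called, so a real service would also need a timer.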
Again - why use these ?
Lambda
Kinesis
Firehose
S3
encrypted
API Gateway
"Single point of entry" - lets us avoid binding the "monitor" code to AWS (via the SDK). Good practice for controlling traffic and "Versioning"
Handling of incoming data for use cases such as "License Validation" && / || "BlackList", as well as JSON validity and more.
As S3 is Key-Value storage (and not an FS), there's no support for ops like "Append", so a buffer is required to generate a large file
It's Key-Value storage - sensitive data should be encrypted
This is a requirement - "Key Per Customer"
IAM + KMS
Identity / Auth Management including Encrypt/Decrypt Key Management
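To make the Lambda role more concrete, here is a minimal sketch of a handler that checks JSON validity, license and blacklist before passing the event on. It assumes the API Gateway proxy integration, where the request body arrives as a string under `"body"`. The field names `license` and `customer`, the lookup sets and the `forward` callable are hypothetical; in AWS, `forward` would typically wrap a call such as boto3's Firehose `put_record`.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical lookup tables; a real deployment would load these from a store.
VALID_LICENSES = {"lic-001", "lic-002"}
BLACKLISTED_CUSTOMERS = {"customer-x"}


def _response(status: int, message: str) -> Dict[str, Any]:
    return {"statusCode": status, "body": json.dumps({"message": message})}


def make_handler(forward: Callable[[bytes], None]):
    """Builds a Lambda-style handler; `forward` receives each accepted event."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        # 1. JSON validity
        try:
            payload = json.loads(event.get("body") or "")
        except json.JSONDecodeError:
            return _response(400, "invalid JSON")
        if not isinstance(payload, dict):
            return _response(400, "expected a JSON object")

        # 2. License validation
        license_id = payload.get("license")
        if not isinstance(license_id, str) or license_id not in VALID_LICENSES:
            return _response(403, "unknown license")

        # 3. Blacklist
        customer = payload.get("customer")
        if not isinstance(customer, str) or customer in BLACKLISTED_CUSTOMERS:
            return _response(403, "customer missing or blacklisted")

        # 4. Hand over to the buffering stage, one JSON document per line
        forward((json.dumps(payload) + "\n").encode("utf-8"))
        return _response(202, "accepted")

    return handler
```

Injecting `forward` keeps the validation logic testable on a laptop, without any AWS dependency.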
Elephant in the room...
Server-Less Arch
Plus/Minus :)
Architecture @ High-Level
S3
encrypted
"Processing"
Over
"Timed Batch"
We will have 2 types
"On Demand Batch"
Prepare data for "On Demand" Batch-Processing
This is the thing we were waiting for - Run the Main Algorithm
Scheduler
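One way to picture what the scheduler hands to a "Timed Batch" is a list of S3 key prefixes covering a time window. The sketch below assumes a hypothetical key layout of `raw/<customer>/YYYY/MM/DD/HH/`; the layout, the `hourly_prefixes` helper and the customer name are illustrations only, not the layout we have chosen.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def hourly_prefixes(customer: str, start: datetime, end: datetime) -> List[str]:
    """Key prefixes covering [start, end), one per hour (hypothetical layout)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        prefixes.append(f"raw/{customer}/{current:%Y/%m/%d/%H}/")
        current += timedelta(hours=1)
    return prefixes


# A window that crosses midnight yields three hourly prefixes.
window = hourly_prefixes(
    "acme",
    datetime(2016, 1, 1, 22, 30, tzinfo=timezone.utc),
    datetime(2016, 1, 2, 1, 0, tzinfo=timezone.utc),
)
```

Because S3 is key-value storage, a prefix per customer and hour is what lets a batch read only the slice it needs.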
Architecture @ High-Level
S3
encrypted
"Rollout to Panaya"
?
Consider
Scenarios still contain "sensitive" data
"pain" to think about
How do we develop "to cloud" ?
How do we "deploy" in this Arch ?
Connection to Panaya ?
How do we QA ? Test ?
Do we need on premise ?
Troubleshooting...
THANK U
Questions ?
Let's talk Big-Data
By Amir Gal-Or
Scaling "Up/Down/Left/Right" presentation for Panaya