Title Text

BIG DATA
on AWS
Title Text

Contact
Javi Moreno
@ciberado
javi.moreno@capside.com

Title Text

Agenda: first part
Quick and fast intro
Datalakes with S3
Format transformation and ETL with Glue
Ingestion with Kinesis Streams y Firehose
Preparing raw data with Lambda
Realtime processing with Kinesis Analytics
DynamoDB for processed dataon
Title Text

Querying the datalake with Athena
Transforming massive amounts of data with EMR/Hive
Interactive queries with Redshift
Spectrum for integrating Redshift and S3
Data visualization with QuickSight
Realtime dashboards with Kinesis Client Library
Cleanup and conclusion
Agenda: second part
Title Text

Quick and fast intro

Title Text


Title Text

Datalakes with S3
Unlimited and affordable object-based storage
99.999999999% data durability
Easy world-wide replication
Data encryption at rest and on the fly
Versioning and life cycle management

Title Text

ETL with Glue
Extract, transform and load managed service
Data catalog for partition management
Optional schema discovery with Crawlers
Advanced job schedule
Code generation

Title Text

Ingestion with Kinesis
Managed near real-time queue
Capacity measured in Shards (1000 rps or 1MBs)
Unlimited ingestion scalability
Up to seven days retention period
Easy processing
Long-term storage integration with Firehose
Title Text

Preparing with Lambda
Serverless star product
Integrated with most services
$0.20 per 1.000.000 invocations
Nodejs, Python, Java, C# and Java
C9 editor integrated with the console
Seamless massive autoscaling
Title Text

Kinesis analytics
Codeless near real-time processing
Powered by a SQL-like language
Input transformation available
Results can be processed with Lambda
Text
Text

Title Text

DynamoDB
Semi-structured database
Very fast and predictable performance
Integrated autoscaling
Global tables
Easy backup procedure
SDK based support for JSON
Title Text

Querying with Athena
Query S3 without ETL
Serverless Presto-based infrastructure
Standard SQL variant supported
Pay per scanned amount of data

Title Text

Elastic Map Reduce
Managed infrastructure for Yarn
Supports Hadoop MapReduce, Presto, Spark...
Integrated with S3
Master node, core nodes and task nodes
Scale up without problems
Cost effective with spot instances support

Title Text

Redshift datawarehouse
Columnar based storage
Interactive query over 1.5PB of data
Postgresql SQL dialect and connectivity
Included snapshot for backup
Change cluster size at any time
Extreme performance for aggregation

Title Text

Spectrum tables
Massive managed cluster
Integrated with Redshift nodes
Join with tables stored directly on S3
Pay for amount of scanned data
SQL dialect compatible with Redshift

Title Text

Quicksight visualizations
Software as a Service data explorer
Powered by S.P.I.C.E on-memory engine
Flexible visualization library
Easy-to-use and attractive user interface
Integrated with most data sources
Title Text

Dashboards with KCL
Powerful java wrapper over Kinesis API
Automatic recovery from failure based on DynamoDB
Supports batch processing
Makes easy to create scalable consumers
Designed to be run on EC2 instances
Title Text

Cleanup
S3 buckets
EC2 instances ¡
Glue databases
DynamoDB tables
Athena catalogs
Redshift clusters
Quicksight datasets

Title Text

@ciberado

Title Text

Bigdata on AWS
By Javier Moreno
Bigdata on AWS
- 699