Fundamentals
Learning Objectives
Big Data
Definition of Big Data
An estimated 1.5 terabytes of compressed data are produced daily.
Twitter messages are up to 140 characters each, generating 8 TB of data per day.
A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
Facebook has a data warehouse of around 50 PB, and it is constantly growing.
Why invest time and money into Big Data?
IDC predicts the big data market will grow about 32% a year, reaching $23.8 billion in 2016
The market for analytics software is predicted to reach $51 billion by 2016
Mind Commerce estimates global spending on big data will grow 48% between 2014 and 2019
Big data revenue will reach $135 billion by the end of 2019
Most Common Types of Data
Big Data Technologies
Big Data Short Video
Top ten Big Data Insights at Rackspace
10 Minute Video
What is Big Data to Rackspace?
7 Minutes
Hadoop
What is Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
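As a rough illustration of the "simple programming models" idea, here is a minimal word-count example written for Hadoop Streaming in Python. The file names, input/output paths, and the streaming jar location are illustrative assumptions, not part of the course material.

```python
# --- mapper.py --- emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# --- reducer.py --- sums the counts per word (Hadoop delivers the mapper
# output sorted by key, so all counts for a given word arrive together).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))
```

A submission would look something like `hadoop jar /path/to/hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out`, with the cluster distributing the map and reduce tasks across the Data Nodes.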
Evolution of Hadoop
Traditional Systems vs. Hadoop
Facts about Hadoop
Apache Hadoop Framework
Hadoop Configuration
Name Node - the centerpiece of an HDFS file system. It keeps a directory tree of all files in the file system and tracks where the data lives within the cluster. It does not store the data itself; it only keeps track of where it lives.
Gateway Node - Gateway Nodes are the interface between the cluster and the outside network. They are also used to run client applications and cluster administration services, as well as serving as a staging area for data ingestion (a client-side sketch follows this list).
Data Nodes - These are the nodes where the data lives and where the processing occurs. They are also referred to as "slave nodes".
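To make the division of labor concrete, here is a minimal sketch of a client application of the kind that might run on a Gateway Node, using the third-party `hdfs` Python package (a WebHDFS client). The hostname, port, user, and paths are illustrative assumptions: the Name Node answers the namespace calls, while the file contents themselves move to and from the Data Nodes.

```python
# Minimal sketch using the third-party "hdfs" package (WebHDFS client).
# Hostname, port, user, and paths are illustrative assumptions.
from hdfs import InsecureClient

# Namespace operations go to the Name Node's WebHDFS endpoint; the file
# bytes are streamed to/from the Data Nodes that hold the blocks.
client = InsecureClient('http://namenode.example.com:50070', user='hdfs')

print(client.list('/'))                               # directory tree kept by the Name Node
client.upload('/data/raw/events.log', 'events.log')   # blocks land on Data Nodes
with client.read('/data/raw/events.log') as reader:
    print(reader.read(100))
```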
Hadoop Configuration
Hadoop Cluster
Master servers manage the infrastructure
Hadoop Cluster
Hadoop at Rackspace
Cloud Big Data Architecture
Managed/Dedicated Big Data Architecture
Test your Knowledge
Is Hadoop a data store?
No. While you can technically store data in a Hadoop Cluster, it is really more of a data processing system than a true data store.
Then what is a data store?
A data store is a data repository comprising a set of integrated objects. These objects are modeled using classes defined in database schemas. Examples of typical data stores include MySQL, PostgreSQL, Redis, and MongoDB.
Think of data processing as the collection and manipulation of items of data to produce meaningful information. In the case of Hadoop, this means taking unstructured, unusable data and processing it in a way that makes it useful or meaningful.
Hadoop Ecosystem
V1
Hadoop Ecosystem
V2
Data Architecture
Hadoop
Hadoop Ecosystem
YARN Architecture
YARN Use Case
Source: Hortonworks
HDFS Architecture
MapReduce Architecture
Spark is an alternative to the traditional batch MapReduce model that can be used for real-time stream processing and fast interactive queries that finish within seconds. The points below, and the sketch that follows them, contrast the two.
MapReduce writes results to disk.
Spark holds intermediate results in memory.
MapReduce supports map and reduce functions.
Spark supports more than just map and reduce functions.
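A minimal PySpark sketch of those last two points; the input path, word-length filter, and application name are illustrative assumptions.

```python
# Hedged PySpark sketch: file path and thresholds are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-vs-mapreduce").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/raw/events.log")

# Intermediate results can be cached in memory instead of being written to
# disk between stages, which is the key difference from classic MapReduce.
words = lines.flatMap(lambda line: line.split()).cache()

# Beyond map and reduce: filter, distinct, joins, and so on.
counts = (words.filter(lambda w: len(w) > 3)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
spark.stop()
```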
Pig allows you to write complex MapReduce transformations using a simple scripting language called "Pig Latin". It is made up of two components: the Pig Latin language itself and the runtime environment in which Pig Latin scripts are executed.
Pig
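A hedged sketch of what that looks like in practice: a small Pig Latin word-count script, submitted here from Python via the standard `pig` command-line client. The script name and HDFS paths are illustrative assumptions.

```python
# Hedged sketch: writes a small Pig Latin word-count script and submits it
# with the "pig" CLI. File names and HDFS paths are illustrative assumptions.
import subprocess

PIG_SCRIPT = """
lines  = LOAD '/data/raw/events.log' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/data/out/word_counts';
"""

with open("wordcount.pig", "w") as f:
    f.write(PIG_SCRIPT)

# Runs the script on the cluster; Pig compiles it into MapReduce (or Tez) jobs.
subprocess.run(["pig", "-f", "wordcount.pig"], check=True)
```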
Hive Architecture
Hive
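As a hedged illustration of how a client talks to Hive, here is a sketch using the third-party PyHive package against HiveServer2; the host, table, and column names are assumptions for the example.

```python
# Hedged sketch using the third-party PyHive client against HiveServer2.
# Host, port, table, and column names are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="hiveserver.example.com", port=10000, username="hadoop")
cursor = conn.cursor()

# Hive translates this SQL-like query into jobs that run on the cluster.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM clickstream
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```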
Spark SQL
Spark SQL Use Cases
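A minimal Spark SQL sketch, assuming a JSON clickstream file on HDFS; the path, schema, and view name are illustrative.

```python
# Hedged Spark SQL sketch: file path, schema, and view name are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Load semi-structured data and query it with SQL, without a separate warehouse.
clicks = spark.read.json("hdfs:///data/raw/clickstream.json")
clicks.createOrReplaceTempView("clickstream")

top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM clickstream
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
spark.stop()
```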
TEZ
Using Tez to create Hadoop applications that integrate with YARN
TEZ
By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can process data in a single Tez job that previously required multiple MapReduce jobs, as shown above.
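As a hedged example of how this is switched on in practice, a Hive session can be pointed at Tez with a single setting; the PyHive connection details and query below are illustrative assumptions.

```python
# Hedged sketch: switching Hive's execution engine to Tez for a session so a
# multi-stage query runs as one Tez DAG instead of chained MapReduce jobs.
# Uses the third-party PyHive client; host and query are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="hiveserver.example.com", port=10000, username="hadoop")
cursor = conn.cursor()

cursor.execute("SET hive.execution.engine=tez")  # "mr" is the classic engine
cursor.execute("""
    SELECT c.page, COUNT(*) AS hits
    FROM clickstream c
    JOIN users u ON c.user_id = u.user_id
    GROUP BY c.page
""")
print(cursor.fetchall())
```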
Column-oriented database that allows data stored on HDFS to be read and written in real time.
When to use HBase?
HBase 101 Architecture
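A minimal sketch of those real-time reads and writes, using the third-party happybase client, which talks to the HBase Thrift server; the host, table, and column family are illustrative assumptions.

```python
# Hedged sketch using the third-party happybase client (HBase Thrift server).
# Assumes a table "user_events" with column family "cf" already exists;
# host, table, and column family names are illustrative assumptions.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("user_events")

# Random, real-time writes and reads keyed by row key.
table.put(b"user-42", {b"cf:last_page": b"/checkout", b"cf:last_seen": b"2016-01-15"})
row = table.row(b"user-42")
print(row[b"cf:last_page"])

connection.close()
```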
Storm Characteristics
Storm
Enterprises use Storm to prevent certain outcomes or to optimize their objectives. Here are some “prevent” and “optimize” use cases.
Storm
Spark Characteristics
Spark
Spark
Command-line interface application used to transfer data between Hadoop and any RDBMS.
Sqoop: Importing to Hive
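A hedged sketch of a Hive import, invoking the sqoop command-line client from Python; the JDBC connection string, credentials, and table names are illustrative assumptions.

```python
# Hedged sketch: invoking the sqoop CLI from Python to import a MySQL table
# into Hive. Connection string, credentials, and table names are assumptions.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",
    "--username", "etl",
    "--password", "secret",      # prefer --password-file outside of examples
    "--table", "orders",
    "--hive-import",             # load the imported data into a Hive table
    "--hive-table", "orders",
    "--num-mappers", "4",        # parallel map tasks performing the transfer
], check=True)
```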
Application that collects, aggregates, and moves large amounts of streaming data into the Hadoop Distributed File System (HDFS).
Flume
Flume
Agents consist of three pluggable components: sources, sinks, and channels. An agent must have at least one of each in order to run.
Operational framework for provisioning, managing and monitoring Apache Hadoop clusters.
Ambari Dashboard
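Ambari exposes a REST API alongside the web dashboard. A hedged sketch of querying it from Python follows; the host, port, credentials, and cluster name are illustrative assumptions.

```python
# Hedged sketch: querying the Ambari REST API for clusters and their services.
# Host, port, credentials, and cluster name are illustrative assumptions.
import requests

AMBARI = "http://ambari.example.com:8080/api/v1"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}

clusters = requests.get(AMBARI + "/clusters", auth=AUTH, headers=HEADERS).json()
print([c["Clusters"]["cluster_name"] for c in clusters["items"]])

services = requests.get(AMBARI + "/clusters/hadoop01/services",
                        auth=AUTH, headers=HEADERS).json()
for svc in services["items"]:
    print(svc["ServiceInfo"]["service_name"])
```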
Provides configuration service, synchronization service, and naming registry for software in the Hadoop ecosystem.
Zookeeper Benefits
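A minimal sketch of those services in use, via the third-party kazoo client; the ensemble addresses and znode paths are illustrative assumptions.

```python
# Hedged sketch using the third-party kazoo client. Ensemble addresses and
# znode paths are illustrative assumptions.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# Configuration / naming registry: small pieces of coordination data in znodes.
zk.ensure_path("/app/config")
if not zk.exists("/app/config/batch_size"):
    zk.create("/app/config/batch_size", b"500")

value, stat = zk.get("/app/config/batch_size")
print(value.decode(), stat.version)

zk.stop()
```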
Solr Features
An open-source Web interface that supports Apache Hadoop and its ecosystem.
Hue
Oozie Architecture
Framework that simplifies data management by allowing users to easily configure and manage data migration, disaster recovery and data retention workflows.
Falcon Architecture
Knox Advantages
Knox Architecture
Let's take another look at the ecosystem
Find a use case for your group's assigned component
Hadoop Ecosystem
Hadoop Use Cases
Sentiment Data
Data collected from social media platforms ranging from Twitter and Facebook to blogs and a never-ending array of websites with comment sections and other ways to interact socially on the Internet.
Sentiment Data
Sensor/Machine Data
Data collected from sensors that monitor and track very specific things, such as temperature, speed, and location.
Sensor Data
HVAC
Geographic Data
Geolocation data gives organizations the ability to track every moving aspect of their business, be they objects or individuals.
Geographic Data
Clickstream Data
The stream of clicks that a user generates as they navigate through a website.
Clickstream Data
Server Log Data
Data collected from server logs, commonly used for security analysis of activity on a network.
Server Log Data
Unstructured Video
Data from non-text sources, including both images and video.
Unstructured Video
Unstructured Text
Data collected from free-flowing text sources such as documents, emails, and tweets.
Unstructured Text
Rackspace Use Case
Analytics Compute Grid
Popular Companies Leveraging Hadoop
Data Refinement: distilling large quantities of structured and unstructured data for use in a traditional data warehouse (DW), e.g., "sessionization" of web logs.
Data Exploration: exploring unstructured content across millions of customer satisfaction surveys to determine sentiment.
Application Enrichment: providing recommendations and a personalized website experience for each unique visitor.
Use Case by Industry
Financial Services
Telecom
Retail
Manufacturing
Healthcare
Oil and Gas
Pharmaceutical
Rackspace Big Data Platform
Managed and Cloud Big Data
Big Data Platform
Managed Big Data Platform
Cloud Big Data Platform
Cloud Big Data
Supporting Hosting Environments
Hybrid Cloud
Hortonworks Data Platform
Rackspace Cloud Control