Definition of Big Data
Volume: Data coming in from new sources as well as increased regulation in multiple areas means storing more data for longer periods of time.
Variety: Unstructured and semi-structured data is becoming as strategic as traditional structured data and is growing at faster rates.
Velocity: Social media, RFID, machine data, etc. need to be ingested at speeds not even imagined a few years ago.
Estimated 1.5 terabytes of compressed data is produced daily.
With over 6 million paying subscribers, 1.5 billion playlists and more than 20 million songs, an estimated 1.5 terabytes of compressed data is
produced daily.
Terabytes of data are analyzed daily to get statistics about Spotify and its users, learn about artist trends, target advertisements, and make better music recommendations.
The goal is to use data to make a better music service.
Interesting Fact: The company used its collection of data last year to see whether the decisions of the Recording Academy (the people who vote on the winners at the Grammy Awards)
reflected the habits of the public on the streaming service. Out of 8 categories, Spotify attempted to predict the winners by looking at its users’ data (listening habits, subscribers to a playlist, popularity of an artist, etc.). The accuracy? 67%. That figure may seem low, but you should consider how many nominees had to be considered.
Twitter messages are 140 bytes each, generating 8 TB of data per day.
A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US
Facebook has around 50 PB warehouse and it’s constantly growing.
Why invest time and money into Big Data?
Source for following slides: http://www.researchmoz.us/big-data-market-business-case-market-analysis-and-forecasts-2014-2019-report.html
IDC predicts the big data market will grow about 32% a year, reaching $23.8 billion in 2016
The market for analytics software is predicted to reach $51 billion by 2016
Mind Commerce estimates global spending on big data will grow 48% between 2014 and 2019
Big data revenue will reach $135 billion by the end of 2019
Structured data is any data that is captured in keys, records, attributes, and indexes in a standard DBMS.
Unstructured data is any data that is not managed by a standard database management system (DBMS). Such data comes in many formats, including text,
document, image, video, and more.
Most Common Types of Data
Sentiment: The most commonly cited source; analyzing language usage, text, and computational linguistics in an attempt to better analyze subjective information. Many companies are trying to leverage this data to provide sentiment trackers, identify influencers, etc.
Clickstream: The trail a user leaves behind as he navigates your website. Analyze the trail to optimize website design.
Sensor/Machine: These are everywhere: cars, health equipment, smartphones, etc. Nike put one in shoes. Someone also put one in baby diapers! They call it ‘proactive maintenance’.
Geographic: Location based data – a common use being location based targeting. This data has much wider application in supply chain optimization across the manufacturing industry, allowing organizations to optimize routes, predict inventory levels, etc.
Server logs: This one is not new to the IT world. You often lose precious trails and information when you simply roll over log files. Today, you should not have to lose this data; you just save the data in Hadoop!
Text: Text is everywhere. We all love to express ourselves - on every blog, article, news site, or ecommerce site you visit these days, you will find people putting out their thoughts. And this is on top of the already existing text sources like surveys and the Web content itself. How do you store, search and analyze all this text data to glean key insights? Hadoop!
Big Data Technologies
Big Data Short Video
Top ten Big Data Insights at Rackspace
10 Minute Video
What is Big Data to Rackspace?
7 Minutes
Hadoop
Lots of things can be "Big Data". Today we will focus on Hadoop.
What is Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop is all about processing and storage. Hadoop is a software framework that provides a parallel processing environment on a distributed
file system using commodity hardware. A Hadoop cluster is made up of master processes and slave processes spread out across different x86 servers.
This framework allows someone to build a Hadoop cluster that offers high-performance, supercomputer-like capability.
WHITEBOARD four master nodes and 10 data nodes and explain how Hadoop does parallel processing across data nodes on a highly available distributed file system.
Evolution of Hadoop
Traditional Systems vs. Hadoop
Traditional Systems vs. Hadoop
Hadoop is not designed to replace existing relational databases or data warehouses. Relational databases are designed to manage transactions. They contain a lot of
feature/functionality designed around managing transactions. They are based upon schema-on-write. Organizations have spent years building Enterprise Data Warehouses (EDW)
and reporting systems for their traditional data. The traditional EDWs are not going anywhere either. EDWs are also based on schema-on-write.
Apache Hadoop Framework
Hadoop Configuration
Name Node - the centerpiece of an HDFS file system. It keeps a directory tree of all files in the file system and tracks where the data lives within the cluster. It does not store the data itself; it only keeps track of where it lives.
Gateway Node - Gateway Nodes are the interface between the cluster and the outside network. They are also used to run client applications and cluster administration services, as well as serving as a staging area for data ingestion.
Data Nodes - These are the nodes where the data lives, and where the processing occurs. They are also referred to as “slave nodes”.
One of the benefits of Hadoop is that it is highly scalable, and it was designed to run on commodity hardware, as opposed to high-end, purpose built
hardware.
Hadoop Configuration
Hadoop Cluster
Master Servers manage the infrastructure
Slave Servers contain the distributed data and perform processing
A Hadoop cluster is made up of master and slave servers.
Hadoop Cluster
Overview of a Hadoop Cluster
Hadoop cluster consists of the following components:
- NameNode: a master server that manages the namespace of HDFS.
- DataNodes: slave servers that store blocks of data.
- ResourceManager: the master server of the YARN processing framework.
- NodeManagers: slave servers of the YARN processing framework.
- HBase components: HBase also has a master server and slave servers called RegionServers.
Some of the components working in the background of a cluster include ZooKeeper, Ambari, Ganglia, Nagios, JobHistory, HiveServer2, and WebHCat.
Cloud Big Data Architecture
Managed/Dedicated Big Data Architecture
Test your Knowledge
this is a transition slide
Is Hadoop a data store?
No. While you can technically store data in a Hadoop Cluster, it is really more of a data processing system, than a true data store.
Ask the group this question. If someone says yes, ask the larger group if that's the correct answer. When someone says no, ask the group why it isn't. The goal here is to build an open discussion.
Then what is a data store?
A data store is a data repository including a set of integrated objects. These objects are modeled using classes defined in database schemas. Examples of typical data stores: MySQL, PostgreSQL, Redis, MongoDB
Ask the group this question. The goal here is to build an open discussion.
What is data processing?
Think of Data Processing as the collection and manipulation of items of data to produce meaningful information. In the case of Hadoop, this means taking unstructured, unusable data, and processing it in a way that makes it useful or meaningful
Ask the group this question. The goal here is to build an open discussion.
this is a transition slide
YARN Architecture
generate a naming convention for the nodes
YARN Use Case
At one point Yahoo! had 40k+ nodes spanning multiple datacenters – 365PB+ of data
YARN provided a compute framework that allowed higher utilization of nodes
100% utilization (while being efficient) is always a good thing
Yahoo! was able to bring down an entire datacenter of about 10k Nodes
Source: Hortonworks
HDFS
Java-based distributed file system.
Stores large volumes of unstructured data and spans across large clusters of commodity servers.
Works closely with MapReduce.
Hadoop is flexible. It lets you use other filesystems like GlusterFS, and it has plugins
for a number of other data storage layers. MapR wrote its own distributed filesystem that uses an NFS interface and relies on
local filesystems rather than NameNodes, and they claim some impressive performance numbers with their system.
You can also plug it in to object stores like OpenStack Swift, as well as distributed databases like MongoDB and Cassandra.
HDFS will be your primary concern, but you should be aware that these other options exist.
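Day to day you typically interact with HDFS through the hdfs dfs shell or a client library. As a hedged sketch only (it assumes pyarrow, a local Hadoop client install, and a NameNode reachable at the illustrative address namenode:8020), here is a file being written to HDFS and read back from Python:

    # Minimal HDFS read/write sketch using pyarrow's libhdfs binding.
    # Assumes Hadoop client libraries are installed and a NameNode is
    # reachable at namenode:8020 (hypothetical host and path).
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

    # Write a small file into HDFS
    with hdfs.open_output_stream("/user/demo/hello.txt") as f:
        f.write(b"hello from hdfs\n")

    # Read it back
    with hdfs.open_input_stream("/user/demo/hello.txt") as f:
        print(f.read().decode())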
HDFS Architecture
MapReduce
Framework for writing applications that process large amounts of structured and unstructured data.
Designed to run batch jobs that address every file in the system.
Splits a large data set into independent chunks and organizes them into key, value pairs for parallel processing.
The primary data processing framework for big data on Hadoop is MapReduce, which is based on the paper
published by Google in 2004. The basic idea is that you process data in parallel by passing the processing code along to
the location where the data lives (since the data is larger than the code in these cases).
You map your function across all the data in parallel, and then reduce those results into a singular resultset.
This is the predominant way to process data in Hadoop, and for a long time was the only way to process data.
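To make the map and reduce steps concrete, here is a hedged word-count sketch written for Hadoop Streaming, which lets the map and reduce functions be ordinary scripts (Python here). The two pieces below would be saved as separate files (names illustrative) and submitted with the hadoop-streaming jar's -mapper, -reducer, -input, and -output options; the exact invocation depends on your distribution.

    # mapper.py (illustrative name) -- emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py (illustrative name) -- input arrives grouped and sorted by key;
    # sum the counts for each word and emit "word<TAB>total"
    import sys

    current, total = None, 0
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")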
MapReduce Architecture
Spark Data Processing
Alternative to the traditional batch map/reduce model that can be used for real-time stream data processing and fast interactive queries that finish within seconds.
Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering
and classification of datasets.
Spark is an alternative to Tez in many respects, as it also provides a way to process data using a DAG (directed acyclic graph), but it is also an alternative to Pig, as
it can be run natively within Java, Scala, or Python programs to execute functions across the data in your cluster with ease.
Spark can offer performance up to 100 times faster than Hadoop MapReduce for certain applications.
MapReduce writes results to disk.
Spark holds intermediate results in memory.
MapReduce supports map and reduce functions.
Spark supports more than just map and reduce functions.
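The same word count as a hedged PySpark sketch; the cache() call is what keeps the intermediate RDD in memory for reuse, which is where much of Spark's speed advantage over disk-based MapReduce comes from. The input path and application name are illustrative.

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")

    # Read, tokenize, and keep the tokenized RDD in memory for reuse
    words = sc.textFile("hdfs:///user/demo/input").flatMap(lambda line: line.split()).cache()

    # Classic map -> reduce: emit (word, 1) pairs, then sum per key
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    # Because 'words' is cached, a second pass (distinct word count) avoids re-reading HDFS
    print(counts.take(10))
    print(words.distinct().count())

    sc.stop()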
Pig
Allows you to write complex MapReduce transformations using a simple scripting
language called "Pig Latin". It's made of two components:
Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort.
Compiler to translate Pig Latin to MapReduce so it can be executed within Hadoop.
Pig is a Hadoop-based language developed by Yahoo.
It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
Pig
Hive
Used to explore, structure and analyze large datasets stored in Hadoop's HDFS.
Provides an SQL-like language called HiveQL with schema on read and converts queries to map/reduce jobs.
Hive is one of the oldest and most popular projects that provide an interface to Hadoop.
Allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers
with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools.
Great website for Hive: https://cwiki.apache.org/confluence/display/Hive/Home
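HiveQL is normally submitted through the Hive CLI or Beeline. As a hedged sketch only (the table and columns are hypothetical), the same kind of query can also be run from Python via PySpark's Hive support, assuming a Hive metastore is configured:

    from pyspark.sql import SparkSession

    # Requires a configured Hive metastore; table and column names are hypothetical
    spark = SparkSession.builder.appName("hiveql-sketch").enableHiveSupport().getOrCreate()

    # Schema-on-read: the table definition is applied when the underlying files are queried
    spark.sql("""
        SELECT country, COUNT(*) AS visits
        FROM web_logs
        GROUP BY country
        ORDER BY visits DESC
        LIMIT 10
    """).show()

    spark.stop()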
Hive Architecture
This diagram shows the major components of Hive and its interactions with Hadoop. As shown in this diagram, the main components of Hive are:
1. UI - The user interface for users to submit queries and other operations to the system. Currently the system has a command line interface and a web based GUI is being developed.
2. Driver - The component which receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
3. Compiler - The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
4. Metastore - The component that stores all the structure information of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored.
5. Execution Engine - The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components.
Hive
Spark SQL
Part of the Spark computing framework.
Allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark.
Used for real-time, in-memory, parallelized processing of Hadoop data.
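A hedged Spark SQL sketch: load a file into a DataFrame, register it as a temporary view, and query it with ordinary SQL. The file path and column names are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()

    # Load a JSON file into a DataFrame (path and fields are illustrative)
    events = spark.read.json("hdfs:///user/demo/events.json")

    # Register the DataFrame so it can be queried with plain SQL
    events.createOrReplaceTempView("events")

    top_pages = spark.sql(
        "SELECT page, COUNT(*) AS hits FROM events GROUP BY page ORDER BY hits DESC LIMIT 5"
    )
    top_pages.show()

    spark.stop()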
Spark SQL
Spark SQL Use Cases
Standalone deployment: With the standalone deployment, one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark
side by side with Hadoop MR. The user can then run arbitrary Spark jobs on their HDFS data. Its simplicity makes this the deployment of choice for many Hadoop 1.x users.
Spark over YARN: Hadoop users who have already deployed or are planning to deploy Hadoop YARN can simply run Spark on YARN without any
pre-installation or administrative access required. This allows users to easily integrate Spark into their Hadoop stack and take advantage of the full power of Spark,
as well as of other components running on top of Spark.
Spark In MapReduce (SIMR): For the Hadoop users that are not running YARN yet, another option, in addition to the standalone deployment, is to use SIMR to launch Spark
jobs inside MapReduce. With SIMR, users can start experimenting with Spark and use its shell within a couple of minutes after downloading it! This tremendously lowers
the barrier of deployment, and lets virtually everyone play with Spark.
TEZ
Extensible framework for building YARN based, high performance batch and interactive data processing applications.
It allows applications to span the scalability dimension from GBs to PBs of data and from tens to thousands of nodes.
Apache Tez component library allows developers to use Tez to create Hadoop applications that integrate with
YARN and perform well within mixed workload Hadoop clusters.
TEZ
Using Tez to create Hadoop applications that integrate with YARN
The Apache Tez component library allows developers to use Tez to create Hadoop applications that integrate with YARN and perform well within mixed workload Hadoop clusters.
website with more information: http://hortonworks.com/hadoop/tez/
TEZ
By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can process data in a single Tez job that previously required multiple MapReduce jobs, as shown above.
HBase
Column oriented database that allows reading and writing of data to HDFS on a real-time basis.
Designed to run on a cluster of dozens to possibly thousands or more servers.
Modeled after Google's Bigtable
EBay and Facebook use HBase heavily.
The active link on this slide is for a 1 minute and 42 second video that covers HBase. It's slides with music in the background.
When to use HBase?
When you need random, realtime read/write access to your Big Data.
When hosting very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
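As a hedged sketch of what random, real-time reads and writes look like from a client, here is the happybase Python library talking to HBase through its Thrift gateway; the host, table, and column family names are hypothetical.

    import happybase

    # Connect through the HBase Thrift gateway (host is hypothetical)
    connection = happybase.Connection("thrift-gateway.example.com")
    table = connection.table("user_profiles")

    # Random write: one row keyed by user id, columns grouped under the 'info' column family
    table.put(b"user-42", {b"info:name": b"Ada", b"info:plays": b"1337"})

    # Random read of the same row key
    row = table.row(b"user-42")
    print(row[b"info:name"].decode())

    connection.close()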
HBase 101 Architecture
Data streaming provides the ability to manipulate data as it enters or exits the system.
Think of it as the big data version of an RSS reader, except you can programmatically manipulate the feed as it’s generated.
Storm
Distributed real-time computation system for processing fast, large streams of data.
With Storm and MapReduce running together in Hadoop on YARN, a Hadoop cluster can efficiently process a full range of workloads from real-time to interactive to batch.
Storm Characteristics
Storm
Enterprises use Storm to prevent certain outcomes or to optimize their objectives. Here are some “prevent” and “optimize” use cases.
Storm
How Storm Works
A Storm cluster has three sets of nodes:
Nimbus node (master node, similar to the Hadoop JobTracker):
Uploads computations for execution
Distributes code across the cluster
Launches workers across the cluster
Monitors computation and reallocates workers as needed
ZooKeeper nodes – coordinate the Storm cluster
Supervisor nodes – communicate with Nimbus through ZooKeeper, and start and stop workers according to signals from Nimbus
Spark Streaming
Part of the Spark computing framework.
Easy and fast real-time stream processing.
Provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
Integrates with Spark's other higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, and GraphX for graph processing.
Spark Characteristics
Spark
Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets and be processed using complex algorithms
expressed with high-level functions like map, reduce, join and window.
Processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s built-in machine learning and graph processing
algorithms to data streams.
Spark
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by
the Spark engine to generate the final stream of results in batches.
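A hedged sketch of the classic DStream API: read lines from a TCP socket, count words in each micro-batch, and print the result. The host, port, and batch interval are illustrative.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, 5)  # 5-second micro-batches

    # DStream of lines from a TCP source (host/port illustrative, e.g. fed by `nc -lk 9999`)
    lines = ssc.socketTextStream("localhost", 9999)

    # Word count per batch using the usual map/reduce-style operators
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()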
this is a transition slide
Sqoop
Command-line interface application used to transfer data between Hadoop and any RDBMS.
Sqoop: Importing to Hive
Flume
Application that collects, aggregates, and moves large amounts of streaming data into the Hadoop Distributed File System (HDFS).
Flume was used to collect the stream of tweets about Iron Man 3.
Flume
Flume
Agents consist of three pluggable components: sources, sinks, and channels. An agent must have at least one of each in order to run.
Sources
Flume sources listen for and consume events. Events can range from newline-terminated strings in stdout to HTTP POSTs and RPC calls — it all depends on what
sources the agent is configured to use. Flume agents may have more than one source, but must have at least one. Sources require a name and a type; the type then dictates
additional configuration parameters. On consuming an event, Flume sources write the event to a channel. Importantly, sources write to their channels as transactions. By
dealing in events and transactions, Flume agents maintain end-to-end flow reliability. Events are not dropped inside a Flume agent unless the channel is explicitly allowed
to discard them due to a full queue.
Channels
Channels are the mechanism by which Flume agents transfer events from their sources to their sinks. Events written to the channel by a source are not removed from the
channel until a sink removes that event in a transaction. This allows Flume sinks to retry writes in the event of a failure in the external repository (such as HDFS or an
outgoing network connection). For example, if the network between a Flume agent and a Hadoop cluster goes down, the channel will keep all events queued until the sink can
correctly write to the cluster and close its transactions with the channel. Channels are typically of two types: in-memory queues and durable disk-backed queues. In-memory
channels provide high throughput but no recovery if an agent fails. File or database-backed channels, on the other hand, are durable. They support full recovery and event
replay in the case of agent failure.
Sinks
Sinks provide Flume agents pluggable output capability — if you need to write to a new type of storage, just write a Java class that implements the necessary interfaces.
Like sources, sinks correspond to a type of output: writes to HDFS or HBase, remote procedure calls to other agents, or any number of other external repositories.
Sinks remove events from the channel in transactions and write them to output. Transactions close when the event is successfully written, ensuring that all events are
committed to their final destination.
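Agents are wired together in a simple properties file. A hedged sketch of a single agent (named a1 here; all names, ports, and paths are illustrative) with a netcat source, an in-memory channel, and an HDFS sink might look like this:

    # One source, one channel, one sink, all belonging to agent "a1"
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Source: listen for newline-terminated events on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    # Channel: in-memory queue (fast, but not recoverable if the agent dies)
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Sink: write events into HDFS (path illustrative)
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
    a1.sinks.k1.channel = c1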
Ambari
Operational framework for provisioning, managing and monitoring Apache Hadoop clusters.
Ambari is a 100% Apache open source operations framework for provisioning, managing, and monitoring Hadoop clusters.
It provides these features through a web frontend and an extensive REST API.
With Ambari, clusters can be built from the ground up on clean operating system instances. It handles propagating binaries to all the hosts in a cluster, configuring services,
launching them, and monitoring them.
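As a hedged sketch of that REST API (the host, port, and credentials below are illustrative defaults), a client can list the clusters Ambari manages with a single HTTP call:

    import requests

    # Query the Ambari REST API for the clusters it manages
    # (host, port, and admin/admin credentials are illustrative)
    resp = requests.get(
        "http://ambari.example.com:8080/api/v1/clusters",
        auth=("admin", "admin"),
    )
    resp.raise_for_status()
    for item in resp.json()["items"]:
        print(item["Clusters"]["cluster_name"])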
Ambari Dashboard
Zookeeper
Provides configuration service, synchronization service, and naming registry for software in the Hadoop ecosystem.
Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
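A hedged sketch of what storing and reading shared configuration looks like from a client, using the kazoo Python library; the ensemble address, znode path, and value are illustrative.

    from kazoo.client import KazooClient

    # Connect to the ZooKeeper ensemble (host:port illustrative)
    zk = KazooClient(hosts="zk1.example.com:2181")
    zk.start()

    # Store a piece of configuration under a znode (path and value illustrative)
    zk.ensure_path("/app/config")
    zk.set("/app/config", b"feature_x=enabled")

    # Any client can read it back and coordinate on the same value
    data, stat = zk.get("/app/config")
    print(data.decode(), "version:", stat.version)

    zk.stop()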
Zookeeper Benefits
Zookeeper
Click on the title's hyperlink to learn how ZooKeeper works in the diagram
Solr
Searches data stored in HDFS in Hadoop.
Powers the search and navigation features of many of the world’s largest Internet sites, enabling powerful full-text search and near real-time indexing.
Solr Features
Near real-time indexing
Advanced full-text search
Comprehensive HTML administration interfaces
Flexible and adaptable, with XML configuration
Server statistics exposed over JMX for monitoring
Standards-based open interfaces like XML, JSON and HTTP
Linearly scalable, auto index replication, auto failover and recovery
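A hedged sketch of indexing and searching from Python with the pysolr client; the URL, core name, and fields are illustrative and assume a matching schema.

    import pysolr

    # Point at a Solr core over its HTTP API (URL and core name illustrative)
    solr = pysolr.Solr("http://solr.example.com:8983/solr/documents", timeout=10)

    # Index a couple of documents; near-real-time indexing makes them searchable quickly
    solr.add([
        {"id": "doc-1", "title": "Hadoop overview", "body": "HDFS and MapReduce basics"},
        {"id": "doc-2", "title": "Hive tutorial", "body": "schema on read with HiveQL"},
    ], commit=True)

    # Full-text query against the indexed fields
    for result in solr.search("body:MapReduce"):
        print(result["id"], result["title"])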
Hue
An open-source Web interface that supports Apache Hadoop and its ecosystem.
Hue is one of the many out-of-the-box methods available for clients to interact with the HDFS file system.
DEMO
Hue
Oozie
Workflow scheduler system to manage Apache Hadoop jobs.
There are two basic types of Oozie jobs: Workflow jobs (DAGs of actions to execute) and Coordinator jobs (recurrent Workflow jobs triggered by time and data availability).
Serves the same purpose as Tez, but predates Tez and runs on top of other Hadoop constructs rather than below them.
Oozie Architecture
Falcon
Framework that simplifies data management by allowing users to easily configure and manage data migration, disaster recovery and data retention workflows.
Data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to orchestrate data motion,
pipeline processing, disaster recovery, and data retention workflows.
Falcon provides the key services data processing applications need. Falcon manages workflow and replication.
Falcon’s goal is to simplify data management on Hadoop. It achieves this by providing important data lifecycle management services that any Hadoop application can rely on.
Instead of hard-coding complex data lifecycle capabilities, apps can now rely on a proven, well-tested and extremely scalable data management system built specifically for
the unique capabilities that Hadoop offers.
Falcon also supports multi-cluster failover.
Falcon Architecture
KNOX
Provides a single access point for Hadoop services in a cluster.
Knox runs as a server (or a cluster of servers) that serves one or more Hadoop clusters.
Knox is intended to provide perimeter security. It aims to provide a single point of entry into a Hadoop cluster for a user to access different services
such as HDFS, YARN, Hive, Oozie. Knox can be installed in HDP 2 as an add-on. A user authenticates once with the Knox service via Kerberos, while Knox itself handles serving
requests for that user inside the cluster.
Knox Advantages
Knox Architecture
Let's take another look at the ecosystem
Find a use case for your group's assigned component
1. Form 2-3 person groups
2. Assign each group to one of the columns...Data Processing | SQL Engines | NoSQL Engines | Data Ingest/Management | Stream Processing
3. Have the groups research a Hadoop use case for one or both of the Hadoop components in their assigned column
4. Have groups report their findings
Hadoop Ecosystem
Hadoop Use Cases
Lots of things can be "Big Data". Today we will focus on Hadoop.
Major types of Hadoop data that organizations are using to increase business value, from decreasing costs and optimizing security processes to building new value:
Sentiment: The most commonly cited source; analyzing language usage, text, and computational linguistics in an attempt to better analyze subjective information. Many companies are trying to leverage this data to provide sentiment trackers, identify influencers, etc.
Clickstream: The trail a user leaves behind as he navigates your website. Analyze the trail to optimize website design.
Sensor/Machine: These are everywhere: cars, health equipment, smartphones, etc. Nike put one in shoes. Someone also put one in baby diapers! They call it ‘proactive maintenance’.
Geographic: Location based data – a common use being location based targeting. This data has much wider application in supply chain optimization across the manufacturing industry, allowing organizations to optimize routes, predict inventory levels, etc.
Server logs: This one is not new to the IT world. You often lose precious trails and information when you simply roll over log files. Today, you should not have to lose this data; you just save the data in Hadoop!
Text: Text is everywhere. We all love to express ourselves - on every blog, article, news site, or ecommerce site you visit these days, you will find people putting out their thoughts. And this is on top of the already existing text sources like surveys and the Web content itself. How do you store, search and analyze all this text data to glean key insights? Hadoop!
Sentiment Data
Data collected from social media platforms ranging from Twitter, to Facebook, to blogs, to a never-ending array of websites with comment sections and other ways to socially interact on the Internet.
While clickstreams will give a sort of Boolean understanding of customers and their actions, this is the era of social media where organizations can get a more subjective
appreciation of what their customers are thinking about them (and their competitors), and take actions to respond to these sentiments.
From Twitter, to Facebook, to blogs, to a never-ending array of websites with comment sections and other ways to socially interact on the Internet, there is potentially
a massive amount of sentiment information available about any given company. That unstructured data creates problems for traditional databases; however, unstructured
data is where Hadoop shines.
An organization can collect all of these data streams and track how their customers and prospects feel about its products, the company itself, issues important to the
company, competitors and more. Correlating information about prevailing sentiments and their locations, a company can custom target their marketing and ad campaigns to
either capitalize on or combat the sentiments.
Sentiment Data
This use case covers sentiment and geographic data
Hadoop was used to track the volume of tweets around the movie’s launch. Three clear spikes in global tweet volume were identified, including the Friday premiere,
the Saturday matinee time, and the Saturday evening showing. Correlating location data with sentiment analysis, the studio was able to see that in Ireland over half the
tweets expressed positive sentiments about the movie, whereas in Mexico negative sentiments about the movie outweighed the positive ones.
These sentiments were then tracked during the launch for real-time marketing activities (i.e., increasing spend and extending campaigns in a particular market), as well
as for planning future movie launches.
Sensor/Machine Data
Data collected from sensors that monitor and track very specific things, such as temperature, speed, location
Sensor data is among the fastest growing data types, with data collectors being put on everything under the sun. These sensors monitor and track very specific things,
such as temperature, speed, location – if it can be tracked, there’s a sensor to track it. People are carrying sensors on them regularly (smart phones), and more come
online every day.
Hadoop is a very attractive store for sensor data due to the ability to dump so much of the raw data into the framework and then use analysis tools to extract correlations
that give insights on operations, and how things change when conditions change.
Sensor data promises to be huge, as the desire to monitor the moment-to-moment status of things increases as businesses look for ways to increase efficiency and cut costs
in their operations. With every second, another data point needs to be stored, creating a challenge for existing traditional database paradigms. The relative cheapness of
Hadoop makes it a very attractive candidate for sensor data storage.
Sensor Data
HVAC
Used Hadoop to monitor the heating, ventilation, and air conditioning (HVAC) systems in 20 large buildings around the world.
Sensors on each HVAC system wirelessly transmitted temperature data that was combined with other data to maintain comfortable indoor environments and minimize heating
and cooling expense. They loaded, refined, and visualized the sensor data: they loaded sensor data on the target and actual temperatures in the buildings, and it flowed
into the Hadoop Distributed File System (HDFS) via Apache Flume. They also used Apache Sqoop to import structured data on the HVAC units into HDFS. Think of this as a
data lake that contains years of data from their heating and cooling systems.
Geographic Data
Geolocation data gives organizations the ability to track every moving aspect of their business, be they objects or individuals.
In the age of smart phones, companies now have the ability to track their customers in ways that have previously been unimaginable.
Geographic Data
Geolocation
A trucking company used Hadoop to analyze geolocation data, reduce fuel cost, and improve driver safety. Geolocation identifies the location of an object or individual
at any point in time. This data might take the form of a coordinate or an actual street address. Geolocation data is used to locate people or assets. For people, enterprises can
learn where and when customers congregate.
For assets, transportation and logistics companies can better maintain their vehicles and control risks.
This company placed sensors on 100 long haul trucks. The sensors communicate the position and speed of each vehicle. They also sense unsafe events such as speeding or
swerving. The company loads, refines and visualizes this geolocation data. Geolocation data from the trucks along with other types of data is loaded into HDFS.
Apache Flume streams geolocation data into HDFS, and Apache Sqoop is used to import structured truck data from a database.
Clickstream Data
Stream of clicks that a user generates as they navigate through a website.
Path Optimization
Basket Analysis
Next Product to Buy Analysis
Allocation of Website Resources
Granular Customer Segmentation
Path Optimization – Path optimization aims at reducing bounce rates and improving conversions.
Basket Analysis – This aims at understanding aggregate customer purchasing behavior by examining such things as customer interests, and paths to purchase – when customers
bought Product X, what common paths did they take to get there.
Next Product to Buy Analysis – Related to basket analysis, this type of analysis looks at correlation in purchases, and what can be offered next to help provide more
immediate value to the customer, and increase the likelihood of another sale.
Allocation of Website Resources – Having clickstream data on hand, a company will know what their hottest and coldest paths on the site are and can assign development
resources accordingly, optimizing resource allocation.
Granular Customer Segmentation – With clickstream and correlated user data, a company can discover and gain insight on how particular segments and micro-segments of
customers are using the site, and how to best cater to them.
Clickstream analysis is all about measuring users and their behavior. With the data collected, the hot and cold spots on a website can be located and actions taken to
ensure that users are seeing the most valuable offers.
Clickstream Data
Clickstream: Path Optimization.
A company wanted to change their website to reduce bounce rates and improve conversion. They used Hadoop to transform the site's data in three steps: load, refine, and
visualize the clickstream.
First they loaded the raw web logs with customer and product data into HDFS. Then they moved clickstream files into HDFS. Hadoop allowed them to mount the HDFS file
system directly to their computer so the data could be loaded with a simple drag and drop. The clickstream file contained 5 days of clickstream data, about four million
(4,000,000) rows.
After sorting the bounce pages, they could see that there were four top candidates for improvement. They used Hadoop to derive this insight from a huge clickstream web
log combined with customer and product records, all in the data lake. This was done in just a matter of minutes.
Server Log Data
Server log data is for security analysis on a network.
Some estimates say that global internet traffic will reach 1 zettabyte (the equivalent of 1 billion terabytes) by 2015. All of this data, as well as all assorted
flavors of network traffic, winds up in server logs and is considered “exhaust data.”
One of the chief uses for server log data is for security analysis on a network. Admins are able to load their server logs into Hadoop using applications like Apache
Flume, building a repository that they can use for analysis in order to identify and repair vulnerabilities.
Server log data can be used for an array of purposes to give organizations insights on everything from network usage, security threats, and compliance. Hadoop will be a
central player in staging and analyzing this type of data for some time to come.
Server Log Data
A system administrator's systems looked good when he went to lunch. When he returned, he noticed a tenfold spike in the number of support tickets.
And then his phone began to ring off the hook. Members of his enterprise were unable to log in to the VPN.
He captured and stored the server logs in HDFS. This allowed him to see the logs from the VPN server flowing through Flume to HDFS. He noticed a distributed denial
of service (DDoS) attack.
He could tell when the network traffic increased and could see that most of the traffic came from unauthorized machines. This helped him determine that this was a DDoS
attack. He was also able to see that the traffic was coming from several different countries where his company doesn't have a presence.
He used this data to update the firewall, deny requests from unauthorized IP addresses, and in a matter of minutes VPN access returns to his coworkers all over the globe.
Unstructured Video
Data from non-text-based things; that includes both images and also video.
If you were to store a bunch of images in a database, what you would really see is that they're just blobs. You can store them,
but you’ve basically turned the database into plain storage. You can’t really express a SQL function that does anything meaningful with those images.
In Hadoop, yes, you can store them, but you can also process and analyze them. This is one of the nice bits of flexibility that you get with the MapReduce framework:
when you do MapReduce in Hadoop, what you do in your Reduce function can be quite varied. That Reduce step could, for example, study an image.
Unstructured Video
Sky Box Imaging Use Case
Sky Box Imaging is a satellite company. They launch a bunch of commodity satellites and take all kinds of overhead images of things going on in the world, and then run
that through image processing, which is scaled out using MapReduce. They're basically grabbing lots of pictures of the world and using Hadoop to refine them
into valuable information. Then they're selling that information as a service to large corporate customers.
For example, in the oil and gas industry, a customer might want to count how many refining pads there are at a given site in the world, because they're trying to
understand the aggregate capacity of their industry. That’s something that’s actually possible to do. If you want to count how many cars are in a retailer’s parking lot,
that’s an image-processing task you can also do. These are the types of things they're doing.
A more complex but interesting use case...
Astronomy department at the University of Washington processing astronomical images in Hadoop.
Astronomy in the Cloud: Using MapReduce for Image Coaddition!!!
Astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night.
link to whitepaper: http://arxiv.org/pdf/1010.1015v1.pdf
Link to slides: http://www.slideshare.net/ydn/8-image-stackinghadoopsummit2010
Unstructured Text
Data collected from free-flowing text sources such as documents, emails, and tweets.
There is unstructured data, then there is unstructured text/content. There is a difference.
Unstructured text/content refers to documents, emails, and other objects that are made up of free-flowing text. The contents of a Tweet, for example, are a type of
unstructured text (while Tweet metadata – when it was posted and by who – is unstructured data).
Unstructured text lacks just about any form or structure we commonly associate with traditional corporate data. The text can be in any language, follow (or not follow)
accepted grammatical rules, and/or mix words and numbers. But the big difference between unstructured data and unstructured text is not what it is but what you do with it.
Unstructured text can be mined for the same type of patterns, but it also holds a treasure trove of insight that neither structured nor unstructured data offer:
human sentiment. Unlike most unstructured data, humans, not machines, generate unstructured text. Text written in Tweets, project proposals, and personal emails can reveal
what people are thinking and feeling.
Three steps for processing data from free flowing text:
1. Extraction: each document is split up into its individual words, and each word is counted by the map function
2. Summarization: the framework groups all identical words with the same key and feeds them to the same call to reduce
3. Analysis: for a given word, the function sums all of its input values to find the total appearances of that word
Unstructured Text
The process of legal reasoning and argument is largely based on information extracted from a
variety of documents. Lawyers are paid large hourly sums to analyze documents to build their
cases. Hadoop can help make this manual review process more efficient. Firms can store the
documentation in Hadoop, and then analyze that text en masse using processes like natural
language processing or text mining. This allows legal researchers to search documents for
important phrases and then use Hadoop ecosystem solutions to analyze relationships between
those and other phrases. This preliminary analysis optimizes the lawyer’s time reviewing the text,
so she can read the parts that really matter.
Source: http://hortonworks.com/wp-content/uploads/2014/05/Hortonworks.BusinessValueofHadoop.v1.0.pdf
Analytics Compute Grid
Analytics Compute Grid is a data analysis project
Rackspace uses Hadoop along with other big data technologies to understand customer patterns and identify trends.
This allows us to be better at delivering fanatical support to our customers using multiple data sources and systems.
Rackspace Analytics Compute Grid
The old model had a bunch of piecemeal parts trying to integrate
Big Data: After Re-Architecting for Private Cloud
We used Openstack and new technologies like Hadoop and Cassandra to aggregate and make sense of all the data sources. This gives us a single view of customer info.
Popular Companies Leveraging Hadoop
Data Refinement... Distilling large quantities of structured and unstructured data for use in a traditional DW, e.g., "sessionization" of weblogs
Data Exploration across unstructured content on millions of customer satisfaction surveys to determine sentiment.
Application Enrichment provides recommendations and a personalized website experience for each unique visitor.
Financial Services
Telecom
Retail
Manufacturing
Healthcare
Oil and Gas
Pharmaceutical