Hadoop Operational

By Rackspace University

Assess your Knowledge

Learning Objectives

  • Install a Hadoop Cluster with Apache Ambari
  • Describe the various services that comprise HDFS
  • Run Hadoop's resource manager - YARN
  • Perform basic JVM administration
  • Interact with Hive
  • Use Swift as additional data storage for Hadoop


 

Data Architecture

Hadoop

Hadoop Ecosystem
 

Getting Started with the Cloud Big Data Platform

 

Hadoop Architecture Overview

 

What is the Cloud Big Data Platform?

 

Cluster Types

 

Lab: Spin up a CBD Cluster

  1. Create a cluster in the Reach control panel
  2. SSH into the gateway node
  3. Load the status pages via an SSH SOCKS proxy

Step 1: Create a cluster in the Reach control panel

Log in to https://mycloud.rackspace.com


 

Step 2: SSH into the gateway node

Use the command below to SSH into the gateway node, using its public IP and the username/password you created:

ssh -D 12345 username@gatewayip


 

Step 3: Load the status pages via an SSH SOCKS proxy
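
With the tunnel from Step 2 still open, point a browser at the SOCKS proxy on localhost:12345 (the port passed to -D above) and load the status pages by their cluster-side hostnames. The hostnames below are placeholders for your cluster's node names, the ports are the stock Hadoop 2.x defaults, and the browser command is just one possible example that assumes Chromium is installed on your workstation.

$ chromium-browser --proxy-server="socks5://localhost:12345" http://namenode:50070         # NameNode UI
$ chromium-browser --proxy-server="socks5://localhost:12345" http://resourcemanager:8088    # ResourceManager UI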

 

Distributed Filesystem - HDFS

 

NameNode and DataNodes


 

File System

     

 

Configuration
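
A quick way to confirm what the cluster is actually using for these settings is hdfs getconf on the gateway node. The property names are standard Hadoop 2.x keys; the values noted in the comments are upstream defaults, not necessarily what CBD ships.

$ hdfs getconf -confKey dfs.blocksize      # default block size in bytes (upstream default 134217728 = 128 MB)
$ hdfs getconf -confKey dfs.replication    # default replication factor (upstream default 3)
$ hdfs getconf -confKey fs.defaultFS       # the hdfs:// URI clients use by default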


 

 

Data ingest and replication

How data is saved to HDFS


 

Lab: HDFS ingest and replication

  1. Copy a file into HDFS with a 1 MB (1,048,576 byte) block size
  2. Copy a file into HDFS with a 5x replication factor
  3. Look up all the information that’s available on the webpages for the services

NOTE: Complete this work on the Gateway Node

 

Lab Solution


hadoop fs -D dfs.blocksize=1048576 -put somefile somelocation    ←- 1 MB block size
hadoop fs -D dfs.replication=5 -put someotherfile somelocation   ←- 5x replication factor
hdfs fsck filelocation -files -blocks -locations
hdfs dfs -mkdir test
hdfs dfs -ls
    
 

Namenode functionality

  • Keeps track of the HDFS namespace metadata
  • Controls opening, closing, and renaming files/directories by clients
 

Datanode functionality

  • Handles read/write requests for blocks
  • Reports blocks back to the Namenode
  • Replicates blocks to other datanodes
 

File Permission

Supports POSIX-style permissions

  • user:group ownership
  • rwx permissions on user, group, and other

$ hdfs dfs -ls myfile
-rw-r--r--   1 swanftw swanftw          1 2014-07-24 20:14 myfile
$ hdfs dfs -chmod 755 myfile
$ hdfs dfs -ls myfile
-rwxr-xr-x   1 swanftw swanftw          1 2014-07-24 20:14 myfile
 
 

File System Shell

 


 

File system shell (HDFS client)
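
The client shell mirrors familiar Unix file commands. A few representative examples (the paths are placeholders; all of these are standard hdfs dfs sub-commands in Hadoop 2.x):

$ hdfs dfs -ls /user                   # list a directory
$ hdfs dfs -put localfile /user/me/    # copy a local file into HDFS
$ hdfs dfs -get /user/me/file .        # copy a file out of HDFS
$ hdfs dfs -cat /user/me/file          # print a file's contents
$ hdfs dfs -du -h /user/me             # show space used, human-readable
$ hdfs dfs -rm -r /user/me/old         # delete a file or directory tree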

 
 

File system shell (HDFS client) cont...

 


 

File system shell (HDFS admin)
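
Administrative operations go through hdfs dfsadmin and hdfs fsck rather than hdfs dfs. A few representative examples (standard Hadoop 2.x commands; most require HDFS superuser privileges):

$ hdfs dfsadmin -report           # capacity, usage, and live/dead datanodes
$ hdfs dfsadmin -safemode get     # check whether the NameNode is in safe mode
$ hdfs dfsadmin -refreshNodes     # re-read the datanode include/exclude files
$ hdfs fsck / -files -blocks      # check filesystem health and block placement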

  


 

YARN

 
 

YARN

The JobTracker and TaskTracker have been replaced with the YARN ResourceManager and NodeManager.

The ApplicationMaster negotiates resources with the ResourceManager and works with the NodeManagers to start containers.

 

YARN


 

Node Manager

  • Typically runs on each datanode
  • Responsible for monitoring and allocating resources to containers
  • Reports resources back to Resource Manager
 

Node Manager Components


 

Resource Manager

  • Typically runs alongside the Namenode service
  • Coordinates all resources among the given nodes
  • Works together with the Node Managers and Application Masters
 

Resource Manager Components


 

MapReduce

  • Splits a large data set into independent chunks and organizes them into key/value pairs for parallel processing
  • Uses parallel processing to improve the speed and reliability of the cluster
  • Runs as a YARN-based job that takes advantage of the cluster's ability to process in parallel
  • Three-step process: Map, Shuffle/Sort, then Reduce
 

Map

The ETL or Projection piece of MapReduce

  • Extract > Transform > Load
  • Map phase

Divides the input into ranges and creates a map task for each range. Tasks are distributed to the worker nodes, and the output of each map task is partitioned into a group of key/value pairs for each reducer.

 

Shuffle/Sort

This refers to the action of shuffling and sorting the map output across the network between nodes

 

Reduce

Reduce Phase

  • Collects the various results and combines them to answer the larger problem the master node was trying to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS.
 

MapReduce


 

YARN Applications


 

YARN Commands
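
A few representative yarn CLI commands for inspecting running work (all standard in Hadoop 2.x; the application ID is a placeholder taken from the -list output):

$ yarn application -list                       # running applications and their state
$ yarn application -status <application_id>    # details for a single application
$ yarn node -list                              # NodeManagers known to the ResourceManager
$ yarn logs -applicationId <application_id>    # aggregated container logs (after the application finishes)
$ yarn application -kill <application_id>      # stop a misbehaving application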

 


 

rmadmin Commands
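
yarn rmadmin talks to the ResourceManager's admin interface. Representative examples (standard Hadoop 2.x sub-commands; they generally require admin privileges):

$ yarn rmadmin -refreshQueues       # re-read the scheduler/queue configuration
$ yarn rmadmin -refreshNodes        # re-read the NodeManager include/exclude lists
$ yarn rmadmin -getGroups <user>    # show the groups the ResourceManager resolves for a user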


 

rmadmin Commands


 

Lab: MapReduce

  • Submit a MapReduce job
  • Inspect the output and observe the status pages while your job is running and after it completes
  • Copy a book in txt format into HDFS and use it as input to the wordcount function in hadoop-mapreduce-examples.jar
 

Lab Solution


$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/<username>/in /user/<username>/out
    


 

DEMO: Configuration

 

Lab: Commands

  • Submit a MapReduce job of your choice
  • Use the commands learned in this section to view the status of the job
 

Basic JVM Administration

 

HEAP

Memory allocated to a JVM; this is where referenced objects are stored.


 

HEAP

During garbage collection, unreferenced objects are cleared out of memory and surviving objects are moved to the first survivor space.

As an object survives successive collections, it is promoted into the old generation space.

 

Java Commands
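
These come from the JDK itself (on this cluster, the java-1.7.0-openjdk-devel package installed in the lab below); <pid> is a placeholder taken from jps output, and depending on which user owns the JVM you may need sudo or to run them as that user.

$ jps -l                      # list running JVMs with their full main-class names
$ jstat -gcutil <pid> 2000    # heap-occupancy and GC statistics, sampled every 2000 ms
$ jmap -heap <pid>            # heap configuration and current usage
$ jstack <pid>                # thread dump, useful for hung or busy processes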

 


 

JMX

By default, Hadoop exposes information about its running processes over JMX so it can be consumed by Nagios or other custom monitoring.
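
Each daemon's embedded web server also publishes the same MBeans as JSON at /jmx, which is often the easiest way to inspect them without a JMX client. A minimal sketch, assuming the default NameNode HTTP port of 50070 (a qry= parameter can narrow the output to a single MBean):

$ curl http://namenode-host:50070/jmx    # dump all NameNode MBeans as JSON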

 

jstat -gcutil

Fields in the jstat -gcutil output
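
For the JDK 7 jstat used in the lab below, -gcutil reports utilization as a percentage of each space's capacity:

  • S0 / S1 - survivor space 0 and 1 utilization
  • E - eden space utilization
  • O - old generation utilization
  • P - permanent generation utilization
  • YGC / YGCT - young-generation collection count and total time
  • FGC / FGCT - full collection count and total time
  • GCT - total garbage collection time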


 

Lab: Java Monitoring

  1. From the primary NameNode, run "sudo jps -l" and identify the NameNode process
  2. Using jstat, iterate over the statistics until you observe the garbage collection process move objects
  3. What statistics changed?
 

Lab Solution


jps -l 

25760 org.apache.zookeeper.server.quorum.QuorumPeerMain
1615 org.apache.hadoop.hdfs.server.namenode.NameNode
2279 org.apache.hadoop.yarn.server.nodemanager.NodeManager


jstat -gcutil 1615 2000   ←- 1615 = PID, 2000 = sampling interval in milliseconds

-observe changes in the utilization columns as collections occur

sudo yum -y install java-1.7.0-openjdk-devel.x86_64   ←- provides jps and jstat if they are not already installed

 

Hive and Pig

 

What is Pig?

Apache Pig is a high-level language for expressing data analysis programs over large data sets, built on top of the Hadoop platform.

 

Why Pig?

  • Ease of programming
  • Optimization opportunities
  • Extensibility
 

Using Pig

  • Grunt Shell - For interactive pig commands
  • Script file
  • Modes: Local Mode and Hadoop Mode
 

Sample Pig Script


A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
store B into 'id.out';
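
Run against a copy of /etc/passwd (in Local Mode: pig -x local), this splits each line on ':', keeps only the first field (the username), and writes the result to the id.out output directory. The same script runs unchanged in Hadoop Mode once the passwd file has been copied into HDFS.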
    
 

Hive

  • Open-sourced by Facebook
  • Data warehouse for querying large sets of data on HDFS
  • Supports table structuring of data in HDFS
  • SQL like query language - HiveQL
 

Hive continued...

  • Compiles HiveQL queries into MapReduce jobs
  • Data stored in HDFS
  • Doesn’t support transactions
  • Supports UDFs (user-defined functions)
 

Hive Components

  • Shell - Interactive queries
  • Driver - session handle, fetch and execute
  • Compiler - Parse, plan and optimize
  • Execution Engine - runs the query plan as a DAG of stages (Tez by default, or converted to MapReduce)
  • Metastore - Schema, data location, SerDe
 

Hive Metastore

  • Database - Namespace containing a set of tables
  • Table - Contains list of columns, their types and SerDe info
  • Partition - Mapping to HDFS directories
  • Statistics - Info about the database
  • Supports MySQL, Derby and others
 

Query Processing

 

Hive/MR vs Hive/Tez


 

Interacting with Hive

  • Hive CLI to connect to HiveServer
  • Beeline CLI to connect to HiveServer2 (example below)

$ hive
hive> create table test(id string);
hive> describe test; 
hive> drop table test;
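
Beeline connects over JDBC rather than talking to the metastore directly. A minimal sketch, assuming HiveServer2 is listening on its default port 10000 on the local node and no authentication is configured:

$ beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> show tables;
0: jdbc:hive2://localhost:10000> !quit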
    
 

Hive Tables

 

HCatalog

  • Metadata and table management for Hadoop
  • Shared schema for Hive, Pig and MapReduce
 

HCatalog

 

Apache Tez

  • Execution framework to improve performance of Hive
  • In the future Pig will also be able to use the Tez engine as its back end
 

Pig Walkthrough


Log onto the gateway node

wget https://s3.amazonaws.com/hw-sandbox/tutorial1/infochimps_dataset_4778_download_16677-csv.zip 

unzip infochimps_dataset_4778_download_16677-csv.zip

cd infochimps_dataset_4778_download_16677/

hdfs dfs -put NYSE

pig

STOCK_A = LOAD 'NYSE/NYSE_daily_prices_A.csv' USING PigStorage(',') AS (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);

DESCRIBE STOCK_A;

B = LIMIT STOCK_A 100;

DESCRIBE B;

C = FOREACH B GENERATE symbol, date, close;

DESCRIBE C;

STORE C INTO 'output/C';

quit
hdfs dfs -cat output/C/part-r-00000

 

Lab: Working with Hive

  1. Download data from http://seanlahman.com/files/database/lahman591-csv.zip
  2. This lab will use Master.csv and Batting.csv from the unzipped files
  3. Load those files into HDFS, load them through Hive, cleanse the data, and run queries
  4. Get the max runs per year
  5. Next, get the player ID with the highest runs for each year
 

Lab: Working with Hive

  1. Now load Master.csv similar to how we loaded Batting.csv. The column names should be available in the readme.txt file from the unzipped folder.
  2. Combine data from both tables to list the player name instead of the player ID.
 

Hive Walkthrough


SSH into the Gateway node

wget http://seanlahman.com/files/database/lahman591-csv.zip

sudo yum install unzip

unzip lahman591-csv.zip

hdfs dfs -mkdir /user/<username>/baseball

hdfs dfs -copyFromLocal Master.csv Batting.csv /user/<username>/baseball

hive (This gets you into the hive shell)

hive> CREATE TABLE batting(player_id STRING, year INT, stint INT, team STRING, league STRING, games INT, games_batter INT, at_bats INT, runs INT, hits INT, doubles INT, triples INT, homeruns INT, runs_batted_in INT, stolen_bases INT, caught_stealing INT, base_on_balls INT, strikeouts INT, walks INT, hit_by_pitch INT, sh INT, sf INT, gidp INT, g_old INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

hive> LOAD DATA INPATH '/user/<username>/baseball/Batting.csv' OVERWRITE INTO TABLE batting;

hive> SELECT year, max(runs) FROM batting GROUP BY year;

hive> SELECT b.year, b.player_id, b.runs from batting b JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) y ON (b.year = y.year and b.runs = y.runs);




 

Ingest Data via Swift

SwiftFS for Hadoop


  • New Hadoop filesystem URL, swift://
  • Read from and write to Swift object stores
  • Supports local and remote stores
  • Included as part of Hadoop
  • Can use anywhere you can use hdfs:// URLs
  • URI format: swift://[container].rack-[region]/[path] (example below)
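
Because swift:// is just another Hadoop filesystem, the usual tools accept it directly. A minimal sketch, assuming a Cloud Files container named baseball in the DFW region and Swift credentials already configured in core-site.xml:

$ hdfs dfs -ls swift://baseball.rack-dfw/                                # list the container
$ hadoop distcp /user/<username>/baseball swift://baseball.rack-dfw/     # copy a directory from HDFS into Swift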
 

SwiftFS for Hadoop cont...

 

SwiftFS for Hadoop (cont.)


 

Advantages

  • Store data 24/7 without a running cluster
  • “Unlimited” storage capacity
  • Easily share data between clusters
  • Cross-datacenter storage
  • Compatible with Hive, Pig, and other tools
  • Other considerations: expect a ~30% performance hit compared to HDFS
 

Disadvantages

  • Content must be retrieved from and sent back to Swift for storage, since processing still occurs within the cluster
  • Network throughput for this transfer can be a bottleneck
  • Slower read/write, but still well worth it
 

Wiki Resource

http://www.rackspace.com/knowledge_center/article/swift-filesystem-for-hadoop

 

Lab: Working with SwiftFS | Part 1

  1. Working on your gateway node, unzip files from http://seanlahman.com/files/database/lahman591-csv.zip
  2. Using the same DC as your cluster, create a container in Cloud Files called baseball
  3. Load Master.csv and Batting.csv into the container
  4. Load the data from the .csv files in Cloud Files into a Hive table as in the previous chapter, but using a SwiftFS URI instead
 

Lab: Working with SwiftFS | Part 2

  1. Run the same queries
  2. Create an external table backed by Cloud Files using Hive and load some results there (a sketch follows below).
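
A minimal sketch of the external-table step, assuming the same baseball container in the DFW region and the batting table from the earlier walkthrough (the column list here is abbreviated for illustration):

hive> CREATE EXTERNAL TABLE batting_swift(player_id STRING, year INT, runs INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 'swift://baseball.rack-dfw/results/';

hive> INSERT OVERWRITE TABLE batting_swift SELECT player_id, year, runs FROM batting;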
 

Assess your Knowledge

 

A massive thanks to the content dev team...

Casey Gotcher

Chris Old

David Grier

Joe Silva

Mark Lessel

Nirmal Ranganathan

Sean Anderson

Chris Caillouet


 

http://hortonworks.com





