Operational
whoami
Assess your Knowledge
Learning Objectives
Data Architecture
Data Architecture
Data Architecture
Data Architecture
Hadoop
Hadoop Ecosystem
Getting started
w/ Cloud Big Data Platform
Hadoop Architecture Overview
What is the Cloud Big Data Platform
Cluster Types
Lab: Spin up a CBD Cluster
Step 1: Create a cluster in the Reach control panel
Log in to https://mycloud.rackspace.com
Step 2: SSH into the gateway node
Use the command below to SSH into the gateway node using its public IP and the username/password you created
ssh -D 12345 username@gatewayip
Step 3: Load the status pages via an SSH SOCKS proxy
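As a sketch of using the tunnel, any SOCKS5-aware client can be pointed at the local port opened by the -D flag. The host name and port below are assumptions (50070 is the Hadoop 2.x NameNode web UI default) and may differ on your cluster.

```shell
# With `ssh -D 12345 username@gatewayip` still running, requests can be
# routed through the SOCKS proxy it opened on localhost:12345.
# <namenode-host> is a placeholder for your cluster's NameNode.
curl --socks5-hostname localhost:12345 http://<namenode-host>:50070/
```

Alternatively, configure your browser to use a SOCKS proxy at localhost:12345 and browse the status pages directly.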
Distributed Filesystem - HDFS
NameNode and DataNodes
File System
Configuration
Data ingest and replication
How data is saved to HDFS
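As a sketch of exercising this, the standard HDFS client can write a file with an explicit replication factor and then report where its blocks landed. File names and paths here are placeholders.

```shell
# Write with a non-default replication factor, then inspect placement.
hadoop fs -D dfs.replication=2 -put somefile /user/$USER/somefile
hdfs dfs -setrep -w 3 /user/$USER/somefile    # raise replication after the fact
hdfs fsck /user/$USER/somefile -files -blocks -locations
```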
Lab: HDFS ingest and replication
NOTE: Complete this work on the Gateway Node
Lab Solution
hadoop fs -D dfs.blocksize=1048576 -put somefile somelocation (1 MB blocks; HDFS rejects block sizes below the configured minimum, 1 MB by default)
hdfs fsck filelocation -files -blocks -locations
hdfs dfs -mkdir test
hdfs dfs -ls
Namenode functionality
Datanode functionality
File Permissions
Supports POSIX-style permissions
$ hdfs dfs -ls myfile
-rw-r--r-- 1 swanftw swanftw 1 2014-07-24 20:14 myfile
$ hdfs dfs -chmod 755 myfile
$ hdfs dfs -ls myfile
-rwxr-xr-x 1 swanftw swanftw 1 2014-07-24 20:14 myfile
File System Shell
File system shell (HDFS client)
File system shell (HDFS client) cont...
File system shell (HDFS admin)
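A few commonly used admin-shell commands, as a sketch; these are standard in Hadoop 2.x but require HDFS superuser privileges on the cluster.

```shell
# hdfs dfsadmin talks to the NameNode's admin interface.
hdfs dfsadmin -report            # cluster capacity and DataNode status
hdfs dfsadmin -safemode get      # check whether the NameNode is in safe mode
hdfs dfsadmin -refreshNodes      # re-read the DataNode include/exclude files
```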
YARN
YARN
The JobTracker and TaskTracker have been replaced with the YARN ResourceManager and NodeManager.
The ApplicationMaster negotiates resources with the ResourceManager and works with the NodeManagers to launch the containers.
YARN
Node Manager
Resource Manager
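As a sketch of how these daemons fit together from the client side, the yarn CLI can query the ResourceManager. These subcommands are standard in Hadoop 2.x; `<application-id>` is a placeholder.

```shell
# Inspect the cluster through the ResourceManager.
yarn node -list                            # NodeManagers currently registered
yarn application -list                     # running applications
yarn application -status <application-id>  # details for one application
```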
Map
The ETL or Projection piece of MapReduce
Divides the input into ranges and creates a map task for each range in the input. Tasks are distributed to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reduce.
Shuffle/Sort
This refers to the action of shuffling & sorting data across the network nodes
Reduce
Reduce Phase
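The map → shuffle/sort → reduce pipeline described above can be sketched end to end with Hadoop Streaming, which lets ordinary shell commands act as the mapper and reducer. The jar path and HDFS paths below are assumptions; adjust them for your distribution.

```shell
# Word count via Hadoop Streaming: the mapper splits lines into words
# (one per line), the framework shuffles/sorts them by key, and the
# reducer counts consecutive duplicates.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /user/$USER/in \
  -output /user/$USER/out-streaming \
  -mapper 'tr -s " " "\n"' \
  -reducer 'uniq -c'
```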
MapReduce
YARN Applications
rmadmin Commands
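Typical rmadmin invocations, as a sketch; both subcommands are standard in Hadoop 2.x and require admin privileges.

```shell
# yarn rmadmin talks to the ResourceManager's admin interface.
yarn rmadmin -refreshQueues   # reload the scheduler's queue configuration
yarn rmadmin -refreshNodes    # re-read the NodeManager include/exclude lists
```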
Lab: MapReduce
Lab Solution
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user//in /user//out
Lab: Commands
Basic JVM Administration
HEAP
The memory allocated to a JVM; referenced objects are stored here.
HEAP
During garbage collection, unreferenced objects are cleared out of memory, and surviving objects are moved to the first survivor space.
As an object survives longer it is promoted into older generation space.
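The generational behavior above is tuned with standard JVM flags. The sketch below shows common ones; the sizes and the jar name myapp.jar are illustrative, not prescriptive.

```shell
# Common heap flags for any JVM process:
#   -Xms / -Xmx      initial / maximum heap size
#   -XX:NewSize      initial young-generation size
#   -verbose:gc      log each garbage collection
java -Xms512m -Xmx2g -XX:NewSize=256m -verbose:gc -jar myapp.jar
```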
Java Commands
JMX
Hadoop by default exposes information about its running processes via JMX, which can be consumed by Nagios or other custom monitoring tools.
Fields in jstat -gcutil
Lab: Java Monitoring
Lab Solution
jps -l
25760 org.apache.zookeeper.server.quorum.QuorumPeerMain
1615 org.apache.hadoop.hdfs.server.namenode.NameNode
2279 org.apache.hadoop.yarn.server.nodemanager.NodeManager
jstat -gcutil 1615 2000 ←- 1615=PID, 2000=sampling interval in milliseconds
- Observe changes in allocations
sudo yum -y install java-1.7.0-openjdk-devel.x86_64
Hive and Pig
What is Pig?
Apache Pig is a high-level language for expressing data analysis programs, built on top of the Hadoop platform for processing large data sets.
Why Pig?
Using Pig
Sample Pig Script
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
store B into 'id.out';
Hive
Hive continued...
Hive Components
Hive Metastore
Query Processing
Hive/MR vs Hive/Tez
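As a sketch of switching between the two engines, Hive lets you set the execution engine per session. hive.execution.engine is the standard property, but Tez must actually be installed on the cluster, and the table name below is a placeholder.

```
hive> SET hive.execution.engine=tez;       -- run subsequent queries on Tez
hive> SET hive.execution.engine=mr;        -- fall back to classic MapReduce
hive> EXPLAIN SELECT count(*) FROM some_table;  -- inspect the generated plan
```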
Interacting with Hive
$ hive
hive> create table test(id string);
hive> describe test;
hive> drop table test;
Hive Tables
HCatalog
HCatalog
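A minimal sketch of using the HCatalog CLI, assuming the hcat command is on the path and Hive tables already exist. Because hcat runs DDL against the shared metastore, tables it sees are also visible to Hive and to Pig (via HCatLoader).

```shell
hcat -e "SHOW TABLES;"
hcat -e "DESCRIBE some_table;"   # some_table is a placeholder
```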
Apache Tez
Pig Walkthrough
Log onto the gateway node
wget https://s3.amazonaws.com/hw-sandbox/tutorial1/infochimps_dataset_4778_download_16677-csv.zip
unzip infochimps_dataset_4778_download_16677-csv.zip
cd infochimps_dataset_4778_download_16677/
hdfs dfs -put NYSE
pig
STOCK_A = LOAD 'NYSE/NYSE_daily_prices_A.csv' using PigStorage(',') AS (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
DESCRIBE STOCK_A;
B = LIMIT STOCK_A 100;
DESCRIBE B;
C = FOREACH B GENERATE symbol, date, close;
DESCRIBE C;
STORE C INTO 'output/C';
quit
hdfs dfs -cat output/C/part-r-00000
Lab: Working with Hive
Hive Walkthrough
SSH into the Gateway node
wget http://seanlahman.com/files/database/lahman591-csv.zip
sudo yum install unzip
unzip lahman591-csv.zip
hdfs dfs -mkdir /user//baseball
hdfs dfs -copyFromLocal Master.csv Batting.csv /user//baseball
hive (This gets you into the hive shell)
hive> CREATE TABLE batting(player_id STRING, year INT, stint INT, team STRING, league STRING, games INT, games_batter
INT, at_bats INT, runs INT, hits INT, doubles INT, triples INT, homeruns INT, runs_batted_in INT, stolen_bases INT,
caught_stealing INT, base_on_balls INT, strikeouts INT, walks INT, hit_by_pitch INT, sh INT, sf INT, gidp INT, g_old INT) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user//baseball/Batting.csv' OVERWRITE INTO TABLE batting;
hive> SELECT year, max(runs) FROM batting GROUP BY year;
hive> SELECT b.year, b.player_id, b.runs from batting b JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) y ON (b.year = y.year and b.runs = y.runs);
Ingest Data via Swift
SwiftFS for Hadoop
SwiftFS for Hadoop cont...
SwiftFS for Hadoop (cont.)
Advantages
Disadvantages
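As a hedged sketch of what access looks like: once SwiftFS credentials are configured (in core-site.xml), Swift containers are addressable as swift:// URLs. The container name "mycontainer" and provider name "rack" below are placeholders.

```shell
hdfs dfs -ls swift://mycontainer.rack/                       # list a container
hdfs dfs -put somefile swift://mycontainer.rack/somefile     # upload into Swift
hadoop distcp swift://mycontainer.rack/data /user/$USER/data # bulk copy to HDFS
```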
Wiki Resource
http://www.rackspace.com/knowledge_center/article/swift-filesystem-for-hadoop
Lab: Working with SwiftFS | Part 1
Lab: Working with SwiftFS | Part 2
Assess your Knowledge
A massive thanks to the content dev team...
Casey Gotcher
Chris Old
David Grier
Joe Silva
Mark Lessel
Nirmal Ranganathan
Sean Anderson
Chris Caillouet