Hadoop Hands-on
Installing Hadoop
can be done in 3 modes:
- Standalone mode
  - everything runs in a single JVM on a single node (no separate daemons are started)
- Pseudo-Distributed mode (Single-Node Cluster)
  - all daemons run on a single node, each in its own JVM
- Fully Distributed mode (Multi-Node Cluster)
  - daemons run on separate nodes, in separate JVMs
  - minimum nodes required (for testing): 2
  - recommended minimum nodes (for testing): 3
Installing Hadoop in Pseudo-Distributed mode
(Single-Node Cluster)
You will need:
GNU/Linux (actual install or on a VM), Internet access, storage space (min. 1 GB) and RAM (min. 1 GB)
Installing Hadoop in Fully Distributed mode
(Multi-Node Cluster)
You will need:
GNU/Linux (3 actual installs or 3 VMs), Internet access, storage space (min. 1 GB) and RAM (min. 2 GB) on each node
see:
Test Your Single/Multi-Node Install
(on small data first)
$ #get some data
$ wget "http://www.gutenberg.org/cache/epub/2600/pg2600.txt"
$ #start hadoop daemons
$ start-dfs.sh;start-yarn.sh;jps
$ #make an input directory on HDFS to hold your downloaded data file
$ hadoop fs -mkdir /inputsmall
$ #put your data on the HDFS
$ hadoop fs -put pg2600.txt /inputsmall
$ #make sure that the file got loaded
$ hadoop fs -ls /inputsmall
$ #run the example wordcount application
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop*examples*.jar wordcount /inputsmall /outputsmall
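Once the job finishes, the wordcount example writes its results to the output directory as tab-separated `word<TAB>count` lines. The following is a minimal sketch for ranking those lines, assuming they were first copied to a local file (for instance with `hadoop fs -cat` on a part file under /outputsmall). The helper name `top_words` and the sample data are hypothetical and only illustrate the format.

```python
import heapq


def top_words(lines, n=10):
    """Return the n most frequent (word, count) pairs from 'word<TAB>count' lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so the count is always the final field.
        word, sep, count = line.rpartition("\t")
        if not sep:
            continue  # no tab found: not a wordcount record
        try:
            pairs.append((word, int(count)))
        except ValueError:
            continue  # count was not a number, skip the line
    # nlargest returns the pairs sorted by count, highest first.
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


# Made-up sample lines in the same format as the job output.
sample = ["the\t5\n", "and\t3\n", "not a record\n", "war\t7\n"]
for word, count in top_words(sample, n=2):
    print("%s\t%d" % (word, count))
```

To use it on real results, pass an open file object (for example `open("counts.txt")`, where `counts.txt` is whatever local file name you chose) instead of `sample`.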
Test Your Single/Multi-Node Install
(on relatively big data)
$ #populate a file with a lot of words
$ for i in {1..76};do cat pg2600.txt >> foo.txt; done
$ #make an input directory on HDFS to hold the generated data file
$ hadoop fs -mkdir /inputbig
$ #put your data on the HDFS
$ hadoop fs -put foo.txt /inputbig
$ #make sure that the file got loaded
$ hadoop fs -ls /inputbig
$ #run the example wordcount application
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop*examples*.jar wordcount /inputbig /outputbig
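One plausible reason to repeat the file so many times is that HDFS stores files in fixed-size blocks, and with the default input format a file spanning several blocks is usually processed by several map tasks. The arithmetic can be sketched as below, assuming the Hadoop 2.x default block size of 128 MiB (`dfs.blocksize`). The function name and the 3,200,000-byte example size are illustrative assumptions, not measurements of pg2600.txt.

```python
import math

# Assumed block size: the Hadoop 2.x default for dfs.blocksize (128 MiB).
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def estimate_blocks(file_size_bytes, block_size=DEFAULT_BLOCK_SIZE):
    """Estimate how many HDFS blocks a file of the given size occupies."""
    if file_size_bytes <= 0:
        return 0
    return int(math.ceil(file_size_bytes / float(block_size)))


# Illustrative size only; use os.path.getsize("foo.txt") for the real figure.
single_copy_bytes = 3200000
print(estimate_blocks(single_copy_bytes))       # a single copy
print(estimate_blocks(single_copy_bytes * 76))  # 76 concatenated copies
```

If your cluster uses a different `dfs.blocksize`, pass it as the second argument.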
For comparison we will use a high-level language (Python)
(on small and relatively big data)
$ #The mapper file
$ cat << 'EOT' > mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit one record per word
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
EOT
$ #The reducer file
$ cat << 'EOT' > reducer.py
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because the input is sorted
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
EOT
$ #Give execute permission
$ chmod +x *.py
$ #execute on small data
$ cat pg2600.txt | python mapper.py | sort -k1,1 | python reducer.py
$ #execute on *relatively* big data
$ cat foo.txt | python mapper.py | sort -k1,1 | python reducer.py
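The shell pipeline above mimics the map, sort, reduce flow: the mapper emits `(word, 1)` records, `sort` groups equal keys together, and the reducer sums each run. The same flow can be sketched inside a single Python process, which may be handy for checking the logic on a few lines before timing the full files. The function names below are hypothetical and are not part of mapper.py or reducer.py.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of each run of equal words.

    Like reducer.py, this relies on the pairs being sorted by word.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    """Run map, sort by key, then reduce, all in memory."""
    pairs = sorted(map_words(lines), key=itemgetter(0))
    return list(reduce_counts(pairs))


for word, count in word_count(["war and peace", "war and war"]):
    print("%s\t%d" % (word, count))
```

Because everything is held in memory, this is only suitable for small inputs; the shell pipeline and the Hadoop job are the ones to time on foo.txt.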
By Ashesh Kumar