Hadoop Hands-on

https://user501254.github.io/BD_STTP_2016/

Installing Hadoop

Hadoop can be installed in one of 3 modes:

Standalone mode
- everything runs in a single JVM on a single node; no separate daemons are started
Pseudo-Distributed mode (Single-Node Cluster)
- all daemons run on a single node, each in its own JVM
Fully Distributed mode (Multi-Node Cluster)
- daemons run on separate nodes, in separate JVMs
- minimum nodes required (for testing): 2
- recommended minimum nodes (for testing): 3

Installing Hadoop in Pseudo-Distributed mode
(Single-Node Cluster)

You will need:

GNU/Linux (actual install or on a VM), Internet access, disk space (min. 1GB) and RAM (min. 1GB)

see: https://youtu.be/gWkbPVNER5k

Installing Hadoop in Fully Distributed mode
(Multi-Node Cluster)

You will need:

GNU/Linux (3 actual installs or VMs), Internet access, disk space (min. 1GB) and RAM (min. 2GB) on each node

see:

Test Your Single/Multi-Node Install

(on small data first)

$ #get some data
$ wget "http://www.gutenberg.org/cache/epub/2600/pg2600.txt"

$ #start hadoop daemons
$ start-dfs.sh;start-yarn.sh;jps

$ #make an input directory on HDFS for your downloaded data file
$ hadoop fs -mkdir /inputsmall
$ #put your data on the HDFS
$ hadoop fs -put pg2600.txt /inputsmall
$ #make sure that the file got loaded
$ hadoop fs -ls /inputsmall

$ #run the example wordcount application
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop*examples*.jar wordcount /inputsmall /outputsmall
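
As a quick sanity check on the job's result, the same kind of count can be sketched locally in Python. This is a minimal sketch, not the implementation of the Hadoop example: `count_words` is a hypothetical helper, and it assumes words are separated by whitespace only and counted case-sensitively, so a few tokens may be counted differently from the job's output.

```python
from collections import Counter


def count_words(text):
    """Count whitespace-separated tokens, case-sensitively."""
    return Counter(text.split())


# Small illustrative input; for real data, pass the contents of the text file.
sample = "the quick fox jumps over the lazy dog the end"
for word, count in sorted(count_words(sample).items()):
    # Tab-separated "word<TAB>count", one pair per line
    print("%s\t%d" % (word, count))
```

To compare against the job, the contents of pg2600.txt could be passed to `count_words` and a handful of words spot-checked against the files under /outputsmall.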

Test Your Single/Multi-Node Install

(on relatively big data)

$ #populate a file with a lot of words
$ for i in {1..76};do cat pg2600.txt >> foo.txt; done

$ #make an input directory on HDFS for the generated data file
$ hadoop fs -mkdir /inputbig
$ #put your data on the HDFS
$ hadoop fs -put foo.txt /inputbig
$ #make sure that the file got loaded
$ hadoop fs -ls /inputbig

$ #run the example wordcount application
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop*examples*.jar wordcount /inputbig /outputbig
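
One effect of the larger file is that it may span more than one HDFS block, which usually means more than one map task. A rough sketch of the arithmetic follows, assuming a block size of 128 MB (the usual `dfs.blocksize` default in Hadoop 2.x and later; a cluster may be configured differently). `hdfs_block_count` is a hypothetical helper, and the 3 MB figure is illustrative, not the measured size of pg2600.txt.

```python
import os

DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # assumed dfs.blocksize, in bytes


def hdfs_block_count(size_bytes, block_size=DEFAULT_BLOCK_SIZE):
    """Number of blocks needed to hold a file of size_bytes (ceiling division)."""
    if size_bytes <= 0:
        return 0
    return (size_bytes + block_size - 1) // block_size


def local_file_block_count(path, block_size=DEFAULT_BLOCK_SIZE):
    """Same estimate, based on the size of a local file such as foo.txt."""
    return hdfs_block_count(os.path.getsize(path), block_size)


# Illustrative only: 76 copies of a hypothetical 3 MB text file
illustrative_size = 76 * 3 * 1024 * 1024
print(hdfs_block_count(illustrative_size))
```

Running `local_file_block_count("foo.txt")` after the loop above gives the estimate for the actual file.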

For comparison we will use a high-level language (Python)

(on small and relatively big data)

$ #The mapper file
$ cat << 'EOT' > mapper.py
#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
EOT


$ #The reducer file
$ cat << 'EOT' > reducer.py
#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word, if there was any input
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
EOT

$ #Give execute permission
$ chmod +x mapper.py reducer.py

$ #execute on small data
$ cat pg2600.txt | python mapper.py | sort -k1,1 | python reducer.py

$ #execute on *relatively* big data
$ cat foo.txt | python mapper.py | sort -k1,1 | python reducer.py
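
The shell pipeline above mirrors the three MapReduce phases: mapper.py emits pairs, sort stands in for the shuffle, and reducer.py sums runs of equal keys. Below is a minimal in-process sketch of the same flow, for illustration only. The names `map_words`, `reduce_counts` and `run_pipeline` are hypothetical; they are not part of Hadoop or of the scripts above.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of consecutive pairs that share a word.

    groupby only merges adjacent equal keys, so the input must already be
    sorted by word, which is the same assumption reducer.py makes.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_pipeline(lines):
    """Map, sort by key, then reduce; the sort plays the role of `sort -k1,1`."""
    pairs = sorted(map_words(lines), key=itemgetter(0))
    return list(reduce_counts(pairs))


for word, total in run_pipeline(["to be or not", "to be"]):
    print("%s\t%d" % (word, total))
```

If the sort is skipped, `reduce_counts` emits the same word more than once. That is why the `sort` stage is needed in the shell pipeline, and why Hadoop sorts map output by key before the reduce step.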