Pivotal Hadoop & Python Map-Reduce Tutorial

First of all, thanks to Michael G. Noll for his blog which can be found here: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python

The following is a slightly modified version of that tutorial, adjusted so it runs on the Pivotal HD single-node VM.

Step 1.

Download the Pivotal Hadoop VM from here:

http://www.gopivotal.com/big-data/pivotal-hd

Find the downloads button. Download the Pivotal HD Single Node VM. At this point it may ask you to create an account with Pivotal.

The file & version I downloaded was: PIVHDSNE_VMWARE_VM-2.0.0-52.7z

Step 2.

Extract the archive. I used 7-Zip: http://www.7-zip.org

Step 3.

You should now have a folder matching the original filename. Open that folder and launch the VM file (VMware Workstation or Fusion should open it automatically).

Step 4.

Start up the VM. Run the start_all.sh script on the desktop to start the Hadoop services.

In Firefox, go to the Pivotal Command Center page (there should also be a bookmark):

https://pivhdsne:5443/login

The user is gpadmin and the password is Gpadmin1 (note that the password appears to change between versions, so check the README file on the desktop if you have trouble logging in).

Click on the Pivotal HD instance.

You now have the dashboard. If the services are still down, they should come up soon. No need to refresh the page – it refreshes automatically.
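Once the services show as up, you can also confirm from a terminal on the VM that HDFS is reachable (optional). A simple listing of the HDFS root is enough – if it returns without a “connection refused” error, HDFS is running:

hdfs dfs -ls /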

Step 5 – Set up mapper.py

Save the following code in a file called /home/gpadmin/mapper.py (I used vi – you can just use a text editor, but vi is more hardcore).

#!/usr/bin/env python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

Save the file (in vi: press Esc, then type :wq!).

Step 6.

We need the file to be executable, so add execute permission by running:

chmod +x /home/gpadmin/mapper.py

Step 7 – Set up reducer.py

Save the following code in a file called /home/gpadmin/reducer.py

#!/usr/bin/env python
from operator import itemgetter
import sys
current_word = None
current_count = 0
word = None
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word
# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

Step 8.

Chmod again:

chmod +x /home/gpadmin/reducer.py
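Before moving on, you can sanity-check the reducer on its own by feeding it a few hand-typed, pre-sorted lines (optional – the input below is made up purely for illustration):

printf 'bar\t1\nfoo\t1\nfoo\t2\n' | /home/gpadmin/reducer.py

This should print bar with a count of 1 and foo with a count of 3, which shows that the reducer simply sums whatever counts it receives for each word.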

Step 9 – Do some basic tests.

Run this on the command line:

echo foo foo quux labs foo bar quux | /home/gpadmin/mapper.py

You should see:

foo     1
foo     1
quux    1
labs    1
foo     1
bar     1
quux    1

Next test, run:

echo foo foo quux labs foo bar quux | /home/gpadmin/mapper.py | sort -k1,1 | /home/gpadmin/reducer.py

It should produce:

bar     1
foo     3
labs    1
quux    2
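The sort -k1,1 in the middle of that pipeline stands in for the sort/shuffle phase that Hadoop performs between the map and reduce stages: the reducer's if-switch only works when all lines for a given word arrive grouped together. To see the intermediate, sorted mapper output, just drop the reducer from the pipeline:

echo foo foo quux labs foo bar quux | /home/gpadmin/mapper.py | sort -k1,1

which should give:

bar     1
foo     1
foo     1
foo     1
labs    1
quux    1
quux    1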

Step 10 – Let’s get some books to run the code against!

Here are the three books that the original blog uses. Download the Plain Text UTF-8 version of each and put the files in /tmp/gutenberg (you will need to create this directory first – see the example commands after the links).

http://www.gutenberg.org/etext/20417

http://www.gutenberg.org/etext/5000

http://www.gutenberg.org/etext/4300
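For example, something like the following should work from a terminal on the VM (the direct download URLs are an assumption on my part – if they have changed, just save the Plain Text UTF-8 files from the pages above under the same names):

mkdir -p /tmp/gutenberg
cd /tmp/gutenberg
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt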

Step 11 – Do another test

Run the following command:

cat /tmp/gutenberg/pg20417.txt | /home/gpadmin/mapper.py

The result should be one line per word occurrence in the book, each in the form word<TAB>1 – the actual counting happens later, in the reducer.

Step 12 – Upload the books to Hadoop

Run this command to copy the books into HDFS:

hdfs dfs -copyFromLocal /tmp/gutenberg /user/gpadmin/gutenberg

If at any point an hdfs command fails with a “connection refused” error, it means Hadoop and the relevant services are not running – run the start_all.sh script on the desktop. You may need to stop the services or even reboot the VM if Pivotal Command Center doesn’t look happy.

Check that your three files were written to HDFS by running:

hdfs dfs -ls /user/gpadmin/gutenberg

You should see:

Found 3 items
-rw-r--r--   1 gpadmin gpadmin    674570 2014-04-09 17:27 /user/gpadmin/gutenberg/pg20417.txt
-rw-r--r--   1 gpadmin gpadmin   1573150 2014-04-09 17:27 /user/gpadmin/gutenberg/pg4300.txt
-rw-r--r--   1 gpadmin gpadmin   1423803 2014-04-09 17:27 /user/gpadmin/gutenberg/pg5000.txt

Step 13 – Run the map-reduce job!

Use the following command:

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-streaming.jar -file /home/gpadmin/mapper.py -mapper /home/gpadmin/mapper.py -file /home/gpadmin/reducer.py -reducer /home/gpadmin/reducer.py -input /user/gpadmin/gutenberg/* -output /user/gpadmin/gutenberg-output

You will see a bunch of warnings, but if all is well the job will progress to 100% for both the map and reduce phases. Scroll back up through the output and you should see a message saying the job completed successfully.

If you see errors, check that every directory and file path in the command above was typed correctly. Also note that the output directory (/user/gpadmin/gutenberg-output) must not already exist – if you are re-running the job, delete it first (see the Clean up section below).

Step 14 – Check out your output

You should see two files (a _SUCCESS marker and the part-00000 results file) if you run:

hdfs dfs -ls /user/gpadmin/gutenberg-output

Then run this to see the contents of the output:

hdfs dfs -cat /user/gpadmin/gutenberg-output/part-00000

You should see a list of how often every word appears in the books. You are done!
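If you would rather see the most common words than the full alphabetical list, you can (optionally) sort the output numerically on the count column, for example:

hdfs dfs -cat /user/gpadmin/gutenberg-output/part-00000 | sort -k2 -nr | head -20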

Clean up.

If you need to delete data or directories, these commands are useful:

hdfs dfs -rm /user/gpadmin/gutenberg-output/*

hdfs dfs -rmdir /user/gpadmin/gutenberg-output
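Alternatively, the output directory and everything in it can be removed in one go (the -r flag should be available on the Hadoop 2.x version that ships with the VM):

hdfs dfs -rm -r /user/gpadmin/gutenberg-output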

Next Steps

Now that you have seen the basics of Hadoop, move on to the tutorial supplied with the Pivotal HD VM. The link to the tutorial is here:

http://pivotalhd.cfapps.io/tutorial/getting-started/dataset.html

 

Further info

Again, thanks to the original author for the tutorial that was modified to produce the above. For further info and some more code: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
