First of all, thanks to Michael G. Noll for his blog, which can be found here: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
The following is a slightly modified version of that post, adapted to run on the Pivotal Hadoop (Pivotal HD) single-node VM.
Step 1.
Download the Pivotal Hadoop VM from here:
http://www.gopivotal.com/big-data/pivotal-hd
Find the downloads button. Download the Pivotal HD Single Node VM. At this point it may ask you to create an account with Pivotal.
The file & version I downloaded was: PIVHDSNE_VMWARE_VM-2.0.0-52.7z
Step 2.
Unzip the file. I used this: http://www.7-zip.org
Step 3.
You should now have a folder matching the original filename. From that directory, launch the VM (VMware Workstation or Fusion should pick it up and start it).
Step 4.
Start up the VM. Run the start_all.sh script on the desktop to start the Hadoop services.
In Firefox, go to the Pivotal Command Center page (there should also be a bookmark):
User is gpadmin and password is Gpadmin1 (note they appear to change this password between versions, so view the README file on the desktop if you have an issue).
Click on the Pivotal HD instance.
You now have the dashboard. If the services are still down, they should come up soon. No need to refresh the page – it refreshes automatically.
Step 5 – Set up mapper.py
Save the following code in a file called /home/gpadmin/mapper.py (I used vi – you can use any text editor, but vi is more hardcore):
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
Save the file and quit vi (Esc, then :wq!).
Step 6.
We need to be able to execute the file, so add execute permission by running this:
chmod +x /home/gpadmin/mapper.py
Step 7. Set up reducer.py
Save the following code in a file called /home/gpadmin/reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Step 8.
Chmod again:
chmod +x /home/gpadmin/reducer.py
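As an aside, the original blog linked above also shows a more Pythonic reducer built with itertools.groupby. A minimal sketch along those lines (same word<TAB>count input contract as above, already sorted by word, and not required for this tutorial) would look something like this:

#!/usr/bin/env python
# Alternative reducer sketch using itertools.groupby (optional).
# Assumes the same tab-delimited word<TAB>count input as above,
# already sorted by word, exactly as Hadoop streaming delivers it.
from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(stream):
    # yield (word, count) pairs from the sorted mapper output
    for line in stream:
        yield line.rstrip('\n').split('\t', 1)

data = read_mapper_output(sys.stdin)
for word, group in groupby(data, itemgetter(0)):
    try:
        total = sum(int(count) for _, count in group)
        print '%s\t%d' % (word, total)
    except ValueError:
        # a count was not a number, so skip this word
        pass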
Step 9 – Do some basic tests.
Run this on the command line:
echo foo foo quux labs foo bar quux | /home/gpadmin/mapper.py
You should see:
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
Next test, run:
echo foo foo quux labs foo bar quux | /home/gpadmin/mapper.py | sort -k1,1 | /home/gpadmin/reducer.py
It should produce:
bar 1
foo 3
labs 1
quux 2
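If you want to double-check those numbers without the shell pipeline, this optional pure-Python snippet applies the same split-and-count logic to the test string:

# optional pure-Python check of the expected word counts
counts = {}
for word in "foo foo quux labs foo bar quux".split():
    counts[word] = counts.get(word, 0) + 1
for word in sorted(counts):
    print '%s\t%s' % (word, counts[word])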
Step 10 – let’s get some books to crunch code against!
Here are the 3 books that the original blog uses. Download these and put them in /tmp/gutenberg (you will need to create this directory first), and make sure you grab the Plain Text UTF-8 versions. An optional check script follows the list of links below.
http://www.gutenberg.org/etext/20417
http://www.gutenberg.org/etext/5000
http://www.gutenberg.org/etext/4300
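Once the books are downloaded, a small optional check like the one below confirms they are where the later steps expect. The filenames (pg20417.txt, pg4300.txt, pg5000.txt) match the listing shown in Step 12; adjust them if your downloads are named differently:

import os

# optional check that the downloaded books are where later steps expect them
book_dir = '/tmp/gutenberg'
expected = ['pg20417.txt', 'pg4300.txt', 'pg5000.txt']  # adjust if your filenames differ
for name in expected:
    path = os.path.join(book_dir, name)
    if os.path.isfile(path):
        print '%s: %d bytes' % (path, os.path.getsize(path))
    else:
        print '%s is missing - download it again' % path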
Step 11 – do another test
Run the following command:
cat /tmp/gutenberg/pg20417.txt | /home/gpadmin/mapper.py
The result should show a count for each occurrence of a word in the book.
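If you would rather see a summary than scroll through every word, this optional sketch applies the same counting logic to the whole file and prints the ten most frequent words (it assumes the book is at /tmp/gutenberg/pg20417.txt, as above):

# optional: print the ten most frequent words in the first book
counts = {}
for line in open('/tmp/gutenberg/pg20417.txt'):
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1
top = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:10]
for word, count in top:
    print '%s\t%s' % (word, count)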
Step 12 – Upload the books to hadoop
Run this command to copy the books.
hdfs dfs -copyFromLocal /tmp/gutenberg /user/gpadmin/gutenberg
If at any point you receive a “connection refused” error when running hdfs commands, you need to start Hadoop and the relevant services: run the start_all.sh script on the desktop. You may need to stop services or even reboot if the Pivotal Command Center doesn’t look happy.
Check your 3 files were written to Hadoop by running:
hdfs dfs -ls /user/gpadmin/gutenberg
You should see:
Found 3 items
-rw-r--r-- 1 gpadmin gpadmin 674570 2014-04-09 17:27 /user/gpadmin/gutenberg/pg20417.txt
-rw-r--r-- 1 gpadmin gpadmin 1573150 2014-04-09 17:27 /user/gpadmin/gutenberg/pg4300.txt
-rw-r--r-- 1 gpadmin gpadmin 1423803 2014-04-09 17:27 /user/gpadmin/gutenberg/pg5000.txt
Step 13 – Run the map-reduce job!
Use the following command:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-streaming.jar -file /home/gpadmin/mapper.py -mapper /home/gpadmin/mapper.py -file /home/gpadmin/reducer.py -reducer /home/gpadmin/reducer.py -input /user/gpadmin/gutenberg/* -output /user/gpadmin/gutenberg-output
You will see a bunch of warnings, but if all is well the job will progress to 100% for both the map and reduce phases. Scroll back up through the output and you should see “completed successfully”.
If you see errors, check that every directory and file path in the command above was typed correctly.
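If you expect to rerun the job against other inputs, it can help to assemble the same command from variables. This optional sketch simply wraps the streaming invocation above in a Python subprocess call, using the same paths as this tutorial (change them if yours differ):

import subprocess

# optional: the same streaming job, assembled from variables so it is
# easy to rerun against other inputs
streaming_jar = '/usr/lib/gphd/hadoop-mapreduce/hadoop-streaming.jar'
mapper = '/home/gpadmin/mapper.py'
reducer = '/home/gpadmin/reducer.py'
input_path = '/user/gpadmin/gutenberg/*'
output_path = '/user/gpadmin/gutenberg-output'

cmd = ['hadoop', 'jar', streaming_jar,
       '-file', mapper, '-mapper', mapper,
       '-file', reducer, '-reducer', reducer,
       '-input', input_path, '-output', output_path]
print 'Running: %s' % ' '.join(cmd)
subprocess.call(cmd)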
Step 14 – Check out your output
You should see 2 files if you run:
hdfs dfs -ls /user/gpadmin/gutenberg-output
Then run this to see the contents of the output:
hdfs dfs -cat /user/gpadmin/gutenberg-output/part-00000
You should see a list of how often every word appears in the books. You are done!
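The part-00000 file is large, so if you only want the highlights, this optional sketch streams it back out of HDFS (via hdfs dfs -cat, assuming the output path used above) and prints the twenty most frequent words:

import subprocess

# optional: stream the job output back out of HDFS and show the
# twenty most frequent words
cat = subprocess.Popen(['hdfs', 'dfs', '-cat',
                        '/user/gpadmin/gutenberg-output/part-00000'],
                       stdout=subprocess.PIPE)
counts = []
for line in cat.stdout:
    word, count = line.rstrip('\n').split('\t', 1)
    counts.append((int(count), word))
cat.wait()
counts.sort(reverse=True)
for count, word in counts[:20]:
    print '%s\t%s' % (word, count)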
Clean up.
If you need to delete data or directories, these commands are useful:
hdfs dfs -rm /user/gpadmin/gutenberg-output/*
hdfs dfs -rmdir /user/gpadmin/gutenberg-output
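If your Hadoop version supports it, hdfs dfs -rm -r /user/gpadmin/gutenberg-output does both in one step. Note that you will need to remove the output directory before rerunning the streaming job, as Hadoop refuses to write to an existing output directory.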
Next Steps
Now that you have seen the basics of Hadoop, move on to the tutorial supplied with the Pivotal HD VM. The link to the tutorial is here:
http://pivotalhd.cfapps.io/tutorial/getting-started/dataset.html
Further info
Thanks again to the original author for the tutorial on which the above is based. For further info and some more code: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/