James Murphy

Hadoop Part I: Configuring Hadoop (for Complete Beginners)

When starting out with new technology, I often find that one of the most challenging bits is getting things into a place where I can start to write code.  I usually find this process quite frustrating and it isn’t a part that I particularly enjoy.  Now, this may say something about my relative sysadmin/coding ability, but this stuff is often difficult, at least for me (I’m looking at you, Python libraries on Windows – you know who you are).  This guide is as much a set of notes for myself, so that I can repeat the process, as it is a real introduction.  I wouldn’t claim to actually know the right way to do this; it is just a commentary on what I have done.

Installing Hadoop

I was originally planning on installing the virtual machine supplied with the Yahoo! Developer Network tutorial on Hadoop, but this is for a very old version (0.18.0) which I think dates from around 2008.  Still, the idea of an out-of-the-box sandbox sounded nice, so I decided to look around for a more up-to-date alternative.  I settled on Cloudera’s Quickstart VM (running on VMware Player), but I imagine that the HortonWorks virtual machine provides very similar facilities.  A nice feature of the Cloudera VM image is that it provides a desktop when you first start it.  Not essential, but a nice bonus.

This installs a single-node version of Hadoop, which should be a good environment for getting to grips with the basics of Hadoop MapReduce programming without too much start-up cost.  The next step would be to move to a multi-node cluster, which is (obviously) where the power of Hadoop actually lies, but that is a somewhat more involved setup project that I will tackle in the future.

Once the virtual machine is running with the appropriate Cloudera Quickstart VM image, you can SSH into it, e.g. using PuTTY from Windows.  Files can be transferred from your host machine via SCP (e.g. with WinSCP).
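
For example, from a terminal with an SSH client, the connection looks something like the following, where {vm-address} is the VM’s IP address and cloudera is, I believe, the Quickstart VM’s default user:

    ssh cloudera@{vm-address}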

Putting Files into HDFS

A lot has been written elsewhere about the Hadoop Distributed File System (HDFS), and it is a very interesting beast, designed for use with ‘big’ datasets spread across multiple physical machines.  For my purposes, all I want to do is put some (smallish) files into it, so that I can process the data in them with Hadoop.  To do this, transfer the files from your PC to your Hadoop VM using SCP, so that they appear on the VM’s filesystem.  Then, they can be put into HDFS using the following command:
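
    hadoop fs -put {source} {destination}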

where {source} and {destination} have the obvious meaning, e.g. ./test.txt for both cases.  You can see if this works by listing the HDFS directory with
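
    hadoop fs -ls {dir}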

where {dir} is the (HDFS) directory of interest, e.g. ./data.  In order to make a directory, use
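
    hadoop fs -mkdir {dirname}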

where {dirname} is the name and path of the directory you want to make.  A couple of other useful commands are one to delete things called {name} (recursively using the -r switch, so that you can empty directories easily)
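
    hadoop fs -rm -r {name}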

and to get files back from the HDFS to the computer’s own filesystem
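
    hadoop fs -get {source} {destination}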

which does what you would expect.

There are, of course, lots of other Hadoop filesystem commands available from the command line that let you do all the usual stuff, and there are also visual tools for managing the HDFS.  But my needs are simple at the moment, so that will do for now.

Configuring a Coding Environment

I’m doing my Hadoop development in Java.  Other languages are available (via the streaming interface, which I haven’t yet explored at all), but Java is the one that is most ‘natively’ supported, so that seems like a good idea.  You need the Java JDK installed on your host PC. I’m going to use Eclipse as my IDE for Java development.

To compile stuff that will work with Hadoop, it is necessary to get the Hadoop libraries (which can be added as dependencies to projects).  You can do this by downloading the Hadoop binaries from here (I used the 2.7.1 binary), and unzipping them to somewhere on your PC, which I’ll call the [hadoop] directory.

Now, create a new Eclipse Java project called WordCount.  Then go and find the ubiquitous WordCount example code (e.g. here) and paste it into your WordCount.java file.  If your setup is like mine, everything will immediately be underlined in red and show errors everywhere.  This is simply because the Hadoop libraries have not been imported into the project.  Go to Project -> Properties -> Java Build Path, then click on the Libraries tab and then the Add External JARs… button.  Now, point it to the file [hadoop]/share/hadoop/common/hadoop-common-[version#].jar from your Hadoop binaries and click Open.  Then do the same for the file [hadoop]/share/hadoop/mapreduce/hadoop-mapreduce-client-core-[version#].jar.  When you click OK to close the properties dialogue box, nothing should be red anymore.
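
For reference, here is a minimal sketch of the sort of WordCount code I mean, along the lines of the standard Apache Hadoop MapReduce example – the version you paste in may well differ in the details:

// A minimal WordCount sketch, following the standard Apache Hadoop
// MapReduce example; your pasted version may differ slightly.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in each input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires up the mapper and reducer and sets input/output paths
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}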

Compiling and Running

In order to run this first MapReduce program we need to compile it to a .jar file and put it on the VM.  In Eclipse you can create a .jar file from a project via File->Export, then select Java->JAR file and choose the location where you want to save it.  You can then move this .jar file to the VM via your favourite SCP program e.g. WinSCP.
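
For example, using command-line scp (WinSCP does the same job graphically), the copy looks something like this, with {vm-address} again standing in for the VM’s IP address:

    scp wordcount.jar cloudera@{vm-address}:~/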

Then, you can apply the WordCount program (compiled to wordcount.jar) to the contents of a folder (./data) in the HDFS as follows:
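
    hadoop jar wordcount.jar WordCount ./data ./output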

assuming that ./output is a folder that does not exist in the HDFS.  If this runs, the output of the processing will be placed into the ./output folder.  You can then copy this from the HDFS to the local filesystem using the get command (see above) if you want to inspect it.  The part-r-00000 file will then contain the word counts for all the words in the files in the ./data directory.
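
If you just want a quick peek at the results without copying anything out of HDFS, something like the following should also work:

    hadoop fs -cat ./output/part-r-00000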

And that’s it, I’ve managed to run a Hadoop program!

There is a small gotcha here (that got me, at least).  For the version of Hadoop on the Cloudera VM, the .jar file must be compiled for Java 1.7.  The Eclipse setup that I had installed was automatically compiling for Java 1.8, so when I first tried to run the .jar file, I got an exception of the form java.lang.UnsupportedClassVersionError: WordCount: Unsupported major.minor version 52.0.  I therefore had to change the Eclipse project to target Java 1.7 via Project->Properties, Java Compiler, and then selecting 1.7 as the Compiler Compliance Level (after unticking “Use compliance from execution environment”).  Depending on which versions of everything you have installed, these details might be different for you.

An Alternative Coding Environment

Doing all of this every time you want to run a program seems a bit clumsy and tiresome, to say the least.  So, you might think that there must be a better way to do all of this.  And indeed there is.  Or, at least, a different way.  (And, no doubt, there are also a million better ways to get this configured so that everything integrates beautifully, deploys remotely, sends back debugging information, and does everything you could ever want.  But I’m no configuration genius, so I’m just going to have to settle for something that will run some code, which is hard enough!).

This alternative version uses the version of Eclipse that handily comes pre-configured in the Cloudera VM.  This allows you to run jobs in local mode (LocalJobRunner mode), in which everything runs in a single JVM and the local file system can be used in place of HDFS.  Since everything is running on the same (virtual) machine, and the local file system can be read directly, both the remote deployment and HDFS bits of the above procedure can be avoided, which should make for a much speedier development cycle.  More importantly, it is possible to add breakpoints and inspect things whilst the code is running, which is going to be enormously helpful in developing anything serious.  So this is probably going to be a much better approach for prototyping and development.  There is a great screencast video about how to get these projects up and running here, so I won’t repeat what is in it, but I will mention some of the difficulties I had in setting this up ready to write and run code.

I tried to set this up for my own new WordCount project, which I created from scratch in Eclipse.  I pasted the WordCount code into a single source file, WordCount.java.  As above, everything lit up with red underlining, so I knew I had to add some libraries in Project -> Properties -> Java Build Path.  But which ones?  I added everything in /usr/lib/hadoop/client, and that allowed the WordCount program to run, but gave some warnings about logging not being properly configured.  No output from the Hadoop console showed up in Eclipse in this configuration.

Alternatively, I tried adding everything in /usr/lib/hadoop/client-0.20 plus the commons-http-client-3.1.jar file from /usr/lib/hadoop/lib (as instructed in the video linked above; I would never have guessed this on my own), and that also allowed the program to run.  This time I didn’t get any warnings about logging, but I did get one saying that it was unable to find the native-hadoop library for my platform and was using builtin-java classes instead.  Is this bad?  I don’t know, to be honest.  This time, however, the console output from Hadoop did show up in Eclipse.

Which of these is best?  I can’t say.  You pays your money and you takes your choice.

Conclusion

This is where I am with the Hadoop development platform.  The alternative approach above is clearly the easier one to get started developing with, even if the configuration has the little wrinkle mentioned.  So, I’m going to push on with this latter approach, but it’s nice to know that I can ‘deploy’ what I make to the single-node platform and make use of HDFS in the first way if I want to.

Getting set up to the point where I can start to write Hadoop code has not been super-easy (even though it sounds it when I read this back), but it isn’t impossibly difficult either.  Now for the fun bit – writing some code.  Hopefully that will make it all seem worthwhile.