Thursday, August 8, 2013

Starting Small: Set Up a Hadoop Compute Cluster Using Raspberry Pis

What is Hadoop?


Hadoop is a big data computing framework that generally refers to the main components: the core, HDFS, and MapReduce. There are several other projects under the umbrella as well. For more information, see this interview with Cloudera CSO Mike Olson.

What is a Raspberry Pi?


The Pi is a small, inexpensive ($39) ARM-based computer. It is meant primarily as an educational tool.

Is Hadoop on the Pi practical?


Nope! Compute performance is horrendous.

Then why?


It's a great learning opportunity to work with Hadoop and multiple nodes. It's also cool to be able to put your "compute cluster" in a lunch box.
In reality though, this article is much more about setting up your first Hadoop compute cluster than it is about the Pi.

Could I use this guide to set up a non-Raspberry Pi based (Real) Hadoop Cluster?


Absolutely, please do. I'll make notes to that effect throughout.

I've been wanting to do this anyhow, let's get started!


Yeah, that's what I said.


Hardware Used


  • 3x Raspberry Pis
  • 1x 110v Power Splitter
  • 1x 4 port 10/100 Switch
  • 1x Powered 7 port USB hub
  • 3x 2' Cat 5 cables
  • 3x 2' USB "power" cables
  • 3x Class 10 16GB SDHC Cards

Total cost: about $170. I picked it all up at my local Microcenter, including the Pis!

Pi Hadoop Cluster; caution: may be slow.

If you would like a high-performance Hadoop cluster just pick one of these up on the way home:

This will cost at least $59.95. Maybe more. Probably more.


Raspberry Pi Preparation


We'll do the master node by itself first so that this guide can be used for a single or multi-node setup. Note: On a non-Pi installation skip to Networking Preparation (though you may want to update your OS manually).

Initial Config

  1. Download the "Raspbian wheezy" image from here and follow these directions to set up your first SD card. Soft-float isn't necessary.
  2. Hook up a monitor and keyboard and start up the device by connecting up the USB power.
  3. When the Raspberry Pi config tool launches, change the keyboard layout first. If you set passwords, etc. with the wrong KB layout you may have a hard time with that later.
  4. Change the timezone, then the locale. (Language, etc.)
  5. Change the default user password
  6. Configure the Pi to disable the X environment on boot. (boot_behavior->straight to desktop->no)
  7. Enable the SSH Daemon (Advanced-> A4 SSH) then exit the raspi-config tool
  8. Reboot the Pi. (sudo reboot)

Update The OS

Note: From here on it is assumed you are connecting to the Pi or your OS via SSH. You can continue on the direct terminal if you like until we get to multiple nodes.
  1. Log on to your Pi as the default user (pi on Raspbian) with the password you set earlier.
  2. To pull the newest sources, execute sudo apt-get update
  3. To pull upgrades for the OS and packages, execute sudo apt-get upgrade
  4. Reboot the pi. (sudo reboot)

Split the Memory/Overclock

  1. After logging back into the Pi, execute sudo raspi-config to bring up the Raspberry Pi config tool.
  2. Select Option 8, "Advanced Options"
  3. Select Option A3, "Memory Split"
  4. When prompted for how much memory the GPU should have, enter the minimum, "16" and hit enter.
  5. (If you would like to overclock) Select Option 7, "Overclock", hit "OK" to the warning, and select what speed you would like.
  6. On the main menu, select "finish" and hit "Yes" to reboot.

Networking Preparation

Each Hadoop node must have a unique name and static IP. I'll be following the Debian instructions since the Raspbian build is based on Debian. If you're running another distro follow the instructions for that. (Redhat, CentOS, Ubuntu)

First we need to change the machine hostname by editing /etc/hostname
sudo nano /etc/hostname
Note: I'll be using nano throughout the article; I'm sure some of you will be substituting vi instead. :)
Change the hostname to what you would like and save the file. Hostnames can't contain spaces, so although I refer to this machine as "Node 1" throughout this document, enter it as node1. If you're setting up the second node make sure to use a different hostname.

Now to change to a static IP by editing /etc/network/interfaces
sudo nano /etc/network/interfaces
Change eth0 to a static IP by replacing the "iface eth0 inet dhcp" line with the following. If your file has an "auto eth0" line, leave it in place so the interface still comes up at boot:
iface eth0 inet static
        address 192.168.1.40
        netmask 255.255.255.0
        gateway 192.168.1.1
Note: Substitute the correct address (IP), netmask, and gateway for your environment.

Make sure you can resolve DNS queries correctly by editing your /etc/resolv.conf:
sudo nano /etc/resolv.conf
Change the content to match below, substituting the correct information for your environment:
domain company.com
search company.com
nameserver 192.168.1.20
nameserver 192.168.1.30
domain=domain suffix for this machine
search=appends when FQDN not specified
nameserver=list DNS servers in order of precedence

We could restart networking etc, but for the sake of simplicity we'll bounce the box:
sudo reboot

Install Java

If you're going for performance you can install the SunJDK, but getting that to work requires a bit of extra effort. Since we're on the Pi and performance isn't our goal, I'll be using OpenJDK which installs easily. If you're installing on a "real" machine/VM, you may want to diverge a bit here and go for the real deal (Probably Java ver 6).
After connecting to the new IP via SSH again, execute:
sudo apt-get install openjdk-7-jdk
Ensure that 7 is the version you want. If not substitute the package accordingly.




Create and Config Hadoop User and Groups

Create the hadoop group:
sudo addgroup hadoop
Create hadoop user "hduser" and place it in the hadoop group; make sure you remember the password!
sudo adduser --ingroup hadoop hduser
Give hduser the ability to sudo:
sudo adduser hduser sudo

Now logout:
logout
Log back in as the newly created hduser. From here on everything will be executed as that user.

Set Up SSH Keys

SSH is used between Hadoop nodes and services (even on the same node) to coordinate work. These SSH sessions will be run as the Hadoop user, "hduser" that we created earlier. We need to generate a keypair to use for EACH node. Let's do this one first:

Create the key pair for use with ssh:
ssh-keygen -t rsa -P ""
When prompted, save the key to the default directory (will be /home/USERNAME/.ssh/id_rsa). If done correctly you will be shown the fingerprint and randomart image.


Copy the public key into the user's authorized keys store (~ represents the home directory of the current user, >> appends to that file to preserve any keys already there) 
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
SSH to localhost and add to the list of known hosts.
ssh localhost
When prompted, type "yes" to add to the list of known hosts.

Download/Install Hadoop!

Note: As of this writing, the newest version of the 1.x branch (which we're covering) is 1.2. (Ahem, 1.2.1, they're quick) You should check to see what the newest stable release is and change the download link below accordingly. Versions and mirrors can be found here.

Download to home dir:
cd ~
wget http://mirror.catn.com/pub/apache/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
Unzip to the /usr/local dir:
sudo tar vxzf hadoop-1.2.0.tar.gz -C /usr/local
Change the dir name of the unzipped hadoop dir. Note that if your version # is different you'll need to adjust the command
cd /usr/local
sudo mv hadoop-1.2.0/ hadoop
Set the hadoop dir owner to be the hadoop user (hduser):
sudo chown -R hduser:hadoop hadoop

Configure the User Environment

Now we'll config the environment for the user (hduser) to run Hadoop. This assumes you're logged on as that user.
Change to the home dir and open the .bashrc file for editing:
cd ~
nano .bashrc
Add the following to the .bashrc file. It doesn't matter where: (note if you're using Sun Java on a real box you need to put in the right dir for JAVA_HOME)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
Bounce the box (you could actually just log off/on, but what the hell eh?):
sudo reboot
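
If you're not sure which directory JAVA_HOME should point to on your machine, one way to find it is to resolve the javac binary and strip the trailing /bin/javac. Here's a minimal sketch; the function name java_home_from_javac is my own invention for illustration, not part of Java or Hadoop:

```shell
# Hypothetical helper: derive JAVA_HOME from the full path of a javac binary.
# Example: /usr/lib/jvm/java-7-openjdk-armhf/bin/javac
#       -> /usr/lib/jvm/java-7-openjdk-armhf
java_home_from_javac() {
    # ${1%/bin/javac} removes the trailing "/bin/javac" from the path.
    printf '%s\n' "${1%/bin/javac}"
}

# Usage on a real box (readlink -f follows symlinks such as /usr/bin/javac):
#   java_home_from_javac "$(readlink -f "$(command -v javac)")"
```

Whatever it prints is a candidate for the JAVA_HOME line in .bashrc; double check that the directory actually contains bin/java before relying on it.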

Configure Hadoop!

(Finally!) It's time to set up Hadoop on the machine. With these steps we'll set up a single node and we'll discuss multiple node configuration afterward.
First let's ensure your Hadoop install is in place correctly. After logging in as hduser, execute:
hadoop version
you should see something like: (differs depending on version)
Hadoop 1.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473
Compiled by hortonfo on Mon May  6 06:59:37 UTC 2013
From source with checksum 2e0dac51ede113c1f2ca8e7d82fb3405
This command was run using /usr/local/hadoop/hadoop-core-1.2.0.jar
Good. Now let's configure. First up we'll set up the environment file that defines some overall runtime parameters:
nano /usr/local/hadoop/conf/hadoop-env.sh
Add the following lines to set the Java Home parameter (specifies where Java is) and how much memory the Hadoop processes are allowed to use. There should be sample lines in your existing hadoop-env.sh that you can just un-comment and modify.
Note: If you're using a "real" machine rather than a Pi, change the heapsize to a number appropriate to your machine (physical mem - (OS + caching overhead)) and ensure you have the right directory for JAVA_HOME because it will likely be different. If you have a Raspberry Pi version "A" rather than "B", your heapsize will need to be much lower since the RAM is halved.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf
export HADOOP_HEAPSIZE=272
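
The heapsize rule of thumb in the note above (physical memory minus what the OS and caching need) is simple arithmetic; HADOOP_HEAPSIZE is given in MB. Here's a minimal sketch, where the function name and the 224 MB reserve are assumptions for illustration. That reserve happens to reproduce the 272 used here on a 512 MB Pi with a 16 MB GPU split, but treat it as an illustrative figure rather than a measured one:

```shell
# Hypothetical helper: suggest a HADOOP_HEAPSIZE value in MB.
#   $1 = physical RAM in MB
#   $2 = RAM given to the GPU in MB (the memory split set earlier)
#   $3 = RAM to leave for the OS and file caching in MB (an assumed figure)
suggest_heap_mb() {
    echo $(( $1 - $2 - $3 ))
}

# A 512 MB Pi with a 16 MB GPU split, leaving 224 MB for everything else:
suggest_heap_mb 512 16 224   # prints 272
```

On a machine with more RAM, plug in your own numbers and tune from there.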
Now let's configure core-site.xml. This file defines critical operational parameters for a Hadoop site.
Note: You'll notice we're using "localhost" in the configuration. That will change when we go to multi-node, but we'll leave it as localhost for a demonstration of an exportable local only configuration.
nano /usr/local/hadoop/conf/core-site.xml
Add the following lines. If there is already information in the file make sure you respect XML format rules. The "description" field is not required, but I've included it to help this tutorial make sense.
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/fs/hadoop/tmp</value>
    <description>Sets the operating directory for Hadoop data.
    </description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system.  A URI whose
    scheme and authority determine the FileSystem implementation.
    The URI's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class.  The URI's authority is used to
    determine the host, port, etc. for a filesystem.  
    </description>
  </property>
</configuration>
It's time to configure the mapred-site.xml. This file determines where the mapreduce job tracker(s) run. Again, we're using localhost for now and we'll change it when we go to multi-node in a bit.
nano /usr/local/hadoop/conf/mapred-site.xml
Add the following lines:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at.  If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
  </property>
</configuration>
Now on to hdfs-site.xml. This file configures the HDFS parameters.
nano /usr/local/hadoop/conf/hdfs-site.xml
Add the following lines:
Note: The dfs.replication value sets how many copies of any given file should exist in the configured HDFS site. We'll do 1 for now since there is only one node, but again when we configure multiple nodes we'll change this.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
</configuration>
Create the working directory and set permissions
sudo mkdir -p /fs/hadoop/tmp
sudo chown hduser:hadoop /fs/hadoop/tmp
sudo chmod 750 /fs/hadoop/tmp/
Format the HDFS filesystem (this initializes the NameNode storage under the working directory)
/usr/local/hadoop/bin/hadoop namenode -format
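
If you'd rather not hand-edit the three site files on every node, they can be generated from a script. The sketch below writes the same properties shown above (minus the optional descriptions); the function name write_site_files and its argument order are my own, and it overwrites any existing files of the same name, so try it on a scratch directory first:

```shell
# Hypothetical helper: write the three Hadoop 1.x site files into a directory.
#   $1 = conf directory, e.g. /usr/local/hadoop/conf
#   $2 = host running the NameNode and JobTracker (localhost for one node)
#   $3 = dfs.replication value
# Existing files with the same names are overwritten.
write_site_files() {
    conf_dir=$1
    master=$2
    replication=$3
    mkdir -p "$conf_dir" || return 1

    cat > "$conf_dir/core-site.xml" <<EOF
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/fs/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${master}:54310</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/mapred-site.xml" <<EOF
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>${master}:54311</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/hdfs-site.xml" <<EOF
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>${replication}</value>
  </property>
</configuration>
EOF
}

# Usage (single node, matching the values above):
#   write_site_files /usr/local/hadoop/conf localhost 1
```

The same function covers a multi-node setup by passing the master's hostname and a higher replication value.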

Start Hadoop and Run Your First Job

Let's roll, this will be fun.
Start it up:
cd /usr/local/hadoop
bin/start-all.sh
Now let's make sure it's running successfully. First we'll use the jps command (Java PS) to determine what Java processes are running:
jps
You should see the following processes: (ignore the number in front, that's a process ID)
4863 Jps
4003 SecondaryNameNode
4192 TaskTracker
3893 DataNode
3787 NameNode
4079 JobTracker
If there are any missing processes, you'll need to review logs. By default, logs are located in:
/usr/local/hadoop/logs
There are two types of log files: .out and .log. .out files detail process information while .log files are the logging output from the process. Generally .log files are used for troubleshooting. The logfile name standard is:
hadoop-(username)-(processtype)-(machinename).log
e.g.
hadoop-hduser-jobtracker-node1.log
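
When a daemon is missing, the quickest check is usually to pull the ERROR and FATAL lines out of its .log file. Here's a minimal sketch built on the naming standard above; both function names are my own, and the pattern assumes the level appears as a separate word on each log line:

```shell
# Hypothetical helper: build a Hadoop 1.x daemon log file name.
#   $1 = username, $2 = process type, $3 = machine name
hadoop_log_name() {
    printf 'hadoop-%s-%s-%s.log\n' "$1" "$2" "$3"
}

# Hypothetical helper: print only the ERROR and FATAL lines of a log file.
log_problems() {
    grep -E ' (ERROR|FATAL) ' "$1"
}

# Usage:
#   log_problems "/usr/local/hadoop/logs/$(hadoop_log_name hduser jobtracker node1)"
```

If nothing is printed, the daemon probably never got far enough to log; check the matching .out file instead.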
In this, like many other articles, we'll use the included wordcount example. For the wordcount example you will need plain text large enough to be interesting. For this purpose, you can download plain text books from Project Gutenberg.

After downloading 1 or more books (make sure you selected "Plain Text UTF-8" format) copy them to the local filesystem on your Hadoop node using SSH (via SCP or similar). In my example I've copied the books to /tmp/books.

Now let's copy the books onto the HDFS filesystem by using the Hadoop dfs command.
cd /usr/local/hadoop
bin/hadoop dfs -copyFromLocal /tmp/books /fs/hduser/books
After copying in the book(s), execute the wordcount example:
bin/hadoop jar hadoop*examples*.jar wordcount /fs/hduser/books /fs/hduser/books-output
You'll see the job run; it may take awhile... remember these Pis don't perform all that well. Upon completion you should see something like this:
13/06/17 15:46:54 INFO mapred.JobClient: Job complete: job_201306170244_0001
13/06/17 15:46:55 INFO mapred.JobClient: Counters: 29
13/06/17 15:46:55 INFO mapred.JobClient:   Job Counters
13/06/17 15:46:55 INFO mapred.JobClient:     Launched reduce tasks=1
13/06/17 15:46:55 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=566320
13/06/17 15:46:55 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/17 15:46:55 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/06/17 15:46:55 INFO mapred.JobClient:     Launched map tasks=1
13/06/17 15:46:55 INFO mapred.JobClient:     Data-local map tasks=1
13/06/17 15:46:55 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=112626
13/06/17 15:46:55 INFO mapred.JobClient:   File Output Format Counters
13/06/17 15:46:55 INFO mapred.JobClient:     Bytes Written=229278
...
The results from the run, if you're curious, are in the HDFS output directory you gave as the last argument of the command (/fs/hduser/books-output in our case). Congrats! You just ran your first Hadoop job, welcome to the world of (tiny) big data!
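
To actually look at the results, you can stream the output files out of HDFS and sort them by count. The wordcount example writes one word and its count per line, separated by a tab. Here's a minimal sketch; the function name top_words is my own, and the glob is used so no particular output file name has to be assumed:

```shell
# Hypothetical helper: read "word<TAB>count" lines on stdin and print the
# N most frequent words (default 10), highest count first.
top_words() {
    sort -k2,2nr | head -n "${1:-10}"
}

# Usage against the job output in HDFS:
#   cd /usr/local/hadoop
#   bin/hadoop dfs -cat '/fs/hduser/books-output/part-*' | top_words 20
```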

Multiple Nodes


Hadoop has been successfully run on thousands of nodes; this is where we derive the power of the platform. The first step to getting to many nodes is getting to 2, so let's do that.

Set up the Secondary Pi

I'll cover cloning a node for the third Pi, so for this one we'll assume you've set up another Pi the same as the first using the instructions above. After that, we just need to make appropriate changes to the configuration files. Important Note: I'll be referring to the nodes as Node 1 and Node 2 and making the assumption that you have named them that. Feel free to use different names, just make sure you substitute them in below. Name resolution will be an issue; we'll touch on that below.

Node 1 will run:
  • NameNode
  • Secondary NameNode (In a large production cluster this would be somewhere else)
  • DataNode
  • JobTracker (In a large production cluster this would be somewhere else)
  • TaskTracker
Node 2 will run:
  • DataNode
  • TaskTracker
Normally this is where I would write up an overview of the different pieces of Hadoop, but while researching I found an article that so far exceeded what I had planned to write that I figured why bother; I'll just link it here. Excellent post by Brad Hedlund, be sure to check it out.
On Node 1, edit masters
nano /usr/local/hadoop/conf/masters
remove "localhost" and add the first node FQDN and save the file:
Node1.domain.ext
Note that conf/masters does NOT determine which node holds the "master" roles, i.e. NameNode, JobTracker, etc.; those run on whichever node you launch the start scripts from. In Hadoop 1.x it only lists the host(s) on which the start scripts launch the SecondaryNameNode.
On Node 1, edit slaves
nano /usr/local/hadoop/conf/slaves
remove "localhost", add the internal FQDN of all nodes in the cluster, and save the file:
Node1.domain.ext
Node2.domain.ext
On ALL (both) cluster members edit core-site.xml
nano /usr/local/hadoop/conf/core-site.xml
and change the fs.default.name value to reflect the location of the NameNode:
<property>
 <name>fs.default.name</name>
 <value>hdfs://node1:54310</value>
</property>
On ALL (both) cluster members edit mapred-site.xml
nano /usr/local/hadoop/conf/mapred-site.xml
and change the mapred.job.tracker value to reflect the location of the JobTracker:
<property>
 <name>mapred.job.tracker</name>
 <value>node1:54311</value>
</property>
On ALL (both) cluster members edit hdfs-site.xml
nano /usr/local/hadoop/conf/hdfs-site.xml
and change the dfs.replication value to the number of copies of each block you want the cluster to keep:
<property>
 <name>dfs.replication</name>
 <value>2</value>
</property>
The dfs.replication value determines how many copies of any given piece of data will exist in the HDFS site. By default this is set to 3, which is optimal for many small to medium clusters. In this case we set it to the number of nodes we'll have right now, 2.

Now we need to ensure the master node can talk to the slave node(s) by adding the master key to the authorized keys file on each node (just 1 for now)
On Node 1, cat your public key and copy it to the clipboard or some other medium for transfer to the slave node
cat ~/.ssh/id_rsa.pub
On Node 2
nano ~/.ssh/authorized_keys
and paste the master's key on a new line below the existing entry, then save the file. Node 2 will now accept commands from Node 1 over SSH.
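
If copy and paste through a terminal worries you (a wrapped or truncated key will silently fail), the append can be scripted. Here's a minimal sketch that adds a public key to an authorized_keys file only if it isn't already there; the function name authorize_key and the /tmp/node1.pub path in the usage line are hypothetical:

```shell
# Hypothetical helper: append a public key to an authorized_keys file unless
# that exact line is already present.
#   $1 = public key file, $2 = authorized_keys file
authorize_key() {
    key=$(cat "$1") || return 1
    # Create the file if needed and keep it private, as sshd expects.
    touch "$2" && chmod 600 "$2" || return 1
    # -x matches the whole line, -F treats the key as a fixed string.
    grep -qxF -e "$key" "$2" || printf '%s\n' "$key" >> "$2"
}

# Usage on Node 2, after copying Node 1's id_rsa.pub over (e.g. with scp):
#   authorize_key /tmp/node1.pub ~/.ssh/authorized_keys
```

OpenSSH also ships ssh-copy-id, which does the copy and the append in one step when run on Node 1 (ssh-copy-id hduser@node2), provided password logins are still enabled on the target.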

Name Lookup

The Hadoop nodes should be able to resolve the names of the nodes in the cluster. To accomplish this we have two options: the right way and the quick way.

The right way: DNS

The best way to ensure name lookup works correctly is to add the nodes and their IPs to your internal DNS. Not only does this facilitate standard Hadoop operation, but it makes it possible to browse around the Hadoop status web pages (more on that below) from a non-node member, which I suspect you'll want to do. Those sites are automatically created by Hadoop and all the hyperlinks use the node DNS names, so you need to be able to resolve those names from a "client" machine. This can scale too; with the right tool set it's very easy to automate DNS updates when spinning up a node.

To do this option, add your Hadoop hostnames to the DNS zone they reside in.

The quick way: HOSTS

If you don't have access to update your internal DNS zone, you'll need to use the hosts files on the nodes. This will allow the nodes to talk to each other via name. There are scenarios where you may want to automate the updating of node based HOSTS files for performance or fault tolerance reasons as well, but I won't go into that here.

To use this option do the following:

On ALL (both) cluster members edit /etc/hosts
sudo nano /etc/hosts
and add each of your nodes with the appropriate IP addresses (mine are only examples)
192.168.1.40    node1
192.168.1.41    node2
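
Since the same entries have to land on every node, it can help to script the edit so that running it twice doesn't create duplicates. Here's a minimal sketch; the function name add_host_entry is my own, and it takes the hosts file as an argument so you can try it on a copy before touching /etc/hosts:

```shell
# Hypothetical helper: add "IP<TAB>name" to a hosts file unless the name is
# already listed on a non-comment line.
#   $1 = hosts file, $2 = IP address, $3 = hostname
add_host_entry() {
    # awk exits 0 if the name already appears as a field after the IP.
    if awk -v name="$3" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    ' "$1"; then
        return 0
    fi
    printf '%s\t%s\n' "$2" "$3" >> "$1"
}

# Usage (writing to /etc/hosts itself needs root):
#   add_host_entry /etc/hosts 192.168.1.40 node1
#   add_host_entry /etc/hosts 192.168.1.41 node2
```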
Now that the configuration is done we need to format the HDFS filesystem. Note: THIS IS DESTRUCTIVE. Anything currently on your single node Hadoop system will be deleted.
On Node 1
/usr/local/hadoop/bin/hadoop namenode -format
When that is done, let's start the cluster:
cd /usr/local/hadoop
bin/start-all.sh
Just as before, use the jps command to check processes, but this time on both nodes.
On Node 1
jps
You should see the following processes: (ignore the number in front, that's a process ID)
4863 Jps
4003 SecondaryNameNode
4192 TaskTracker
3893 DataNode
3787 NameNode
4079 JobTracker
On Node 2
jps
You should see the following processes: (ignore the number in front, that's a process ID)
6365 TaskTracker
7248 Jps
6279 DataNode
If there are problems review the logs. (for detailed location see reference above) Note that by default each node contains its own logs.
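
Eyeballing jps output gets old once there is more than one node. Here's a minimal sketch that reads jps output and reports any expected daemon that is missing; the function name check_daemons is my own, and the ssh usage line assumes jps is on the PATH for non-interactive logins on the remote node:

```shell
# Hypothetical helper: read `jps` output on stdin and report any expected
# daemon (given as arguments) that is not listed.
# Returns non-zero if something is missing.
check_daemons() {
    # The second column of jps output is the process name.
    running=$(awk '{ print $2 }')
    missing=0
    for daemon in "$@"; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$running" | grep -qxF -e "$daemon"; then
            echo "missing: $daemon"
            missing=1
        fi
    done
    return "$missing"
}

# Usage on Node 1:
#   jps | check_daemons NameNode SecondaryNameNode DataNode JobTracker TaskTracker
# Usage for Node 2, run from Node 1:
#   ssh node2 jps | check_daemons DataNode TaskTracker
```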

Now let's run the wordcount example in a cluster. To facilitate this you'll need enough data to split, so when downloading UTF-8 books from Project Gutenberg get at least 6 very long books (that should do with default settings).

Copy them to the local filesystem on your master Hadoop node using SSH (via SCP or similar). In my example I've copied the books to /tmp/books.

Now let's copy the books onto the HDFS filesystem by using the Hadoop dfs command. This will automatically replicate data to all nodes.
cd /usr/local/hadoop
bin/hadoop dfs -copyFromLocal /tmp/books /fs/hduser/books
After copying in the book(s) wait a couple minutes for all blocks to replicate (more on monitoring below) then execute the wordcount example:
bin/hadoop jar hadoop*examples*.jar wordcount /fs/hduser/books /fs/hduser/books-output

Monitoring HDFS/Jobs


Hadoop includes a great set of management web sites that will allow you to monitor jobs, check logfiles, browse the filesystem, etc. Let's take a moment to examine the three sites.

JobTracker
By default, the job tracker site can be found at http://(jobtrackernodename.domain.ext):50030/jobtracker.jsp ; e.g. http://node1.company.com:50030/jobtracker.jsp



On the Job Tracker you can see information regarding the cluster, running map and reduce tasks, node status, running and completed jobs, and you can also drill into specific node and task status.

DFSHealth
By default, the DFSHealth site can be found at http://(NameNodename.domain.ext):50070/dfshealth.jsp ; e.g. http://node1.company.com:50070/dfshealth.jsp


The DFSHealth site allows you to browse the HDFS system, view nodes, look at space consumption, and check NameNode logs.

TaskTracker
By default, the task tracker site can be found at http://(nodename.domain.ext):50060/tasktracker.jsp ; e.g. http://node2.company.com:50060/tasktracker.jsp



The Task Tracker site runs on each node that can run a task to provide more detailed information about what task is running on that node at the time.


Cloning a Node and Adding it to the Cluster


Now we'll explore how to get our third node into the cluster the quickest way possible: cloning the drive. This will cover, at a high level, the SD card cloning process for the Pi. If you're using a real machine replace the drive cloning steps with your favorite (*cough*Clonezilla*cough*) cloning software, or copy the virtual drive if you are using VMs.

You should always clone a slave node unless you intend on making a new cluster. To this end, do the following:

  1. Shut down Node 2 (sudo init 0) and remove the SD card.
  2. Insert the SD card into a PC so we can clone it; I'm using a Windows 7 machine, so if you're using Linux we'll diverge here. (I'll meet you at the pass!)
  3. Capture an image to your hdd using your favorite SDcard imaging software. I use either Roadkill's Disk Image or dd for windows
  4. Remove the Node 2 SD card and place it back into Node 2. Do not power up yet.
  5. Insert a new SD card of the same size to your imaging PC.
  6. Delete any partitions on the card to ensure error-free imaging. I use Diskpart , Select Disk x , Clean , exit where "x" is the disk number. Use list disk to find it... don't screw up or you'll wipe out your HDD!
  7. Image the new SD card with the image from Node 2 and remove it from the PC when complete.


  8. Insert the newly cloned SD card to Node 3. Power up that node and leave Node 2 off for now or the IPs will conflict.
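
For the Linux folks who diverged at step 2, the capture and imaging steps can both be done with dd. Here's a minimal sketch, assuming GNU dd; the function name clone_image is my own and /dev/sdX is a placeholder, so check the real device name with lsblk first, because pointing dd at the wrong disk will wipe it:

```shell
# Hypothetical helper: copy a source (device or image file) to a destination
# in 4 MB blocks, then flush the write cache.
#   $1 = source, $2 = destination
clone_image() {
    dd if="$1" of="$2" bs=4M && sync
}

# Usage (as root; /dev/sdX is a placeholder for the SD card device):
#   clone_image /dev/sdX node2.img    # capture the Node 2 card
#   clone_image node2.img /dev/sdX    # write the image to the new card
```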

Log into the new node (same IP as the cloned node) via ssh with the hduser (Hadoop user) account and perform the following tasks:

Change the hostname:
sudo nano /etc/hostname
For this example change "Node2" to "Node3" and save the file.

Change the IP address:
sudo nano /etc/network/interfaces
For this example change "192.168.1.41" to "192.168.1.42" and save the file. Switch to your own IP address range if necessary.

If you're using HOSTS for name resolution, add Node 3 with the proper IP address to the hosts file
sudo nano /etc/hosts
Repeat this on the master node and for the sake of completeness the other node(s) should probably be updated as well. In a large scale deployment this would all be automated.
If you are using DNS, make sure to add the Node3 entry to your zone.

Reboot Node 3 to enact the IP and hostname changes
sudo reboot
Power up Node 2 again at this time if you had it powered down. Log back into Node 3 (now with the new IP!) with the hduser account via SSH.

Generate a new key:
ssh-keygen -t rsa -P ""

Copy the key to its own trusted store:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

If you aren't already, log into Node1 (Master) as hduser via SSH and do the following:
Add Node 3 to the slaves list
nano /usr/local/hadoop/conf/slaves

SSH to Node 3 to add it to known hosts
ssh node3
When prompted re: continue connecting type yes

On ALL nodes edit hdfs-site.xml
nano /usr/local/hadoop/conf/hdfs-site.xml
Change it to replicate to 3 nodes. Note if you add more nodes to the cluster after this you would most likely NOT increase this above 3. It's not 1:1 that we're going for, it's the number of replicas we want in the entire cluster. Once you're running a cluster with > 3 nodes you'll generally know if you want dfs.replication set to more than 3.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
</configuration>
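
The rule of thumb used so far (one replica per node while the cluster is tiny, then hold at 3) can be written down in a couple of lines. Here's a minimal sketch; the function name pick_replication is my own, and the cap of 3 is just the default discussed above, not a hard limit:

```shell
# Hypothetical helper: pick a dfs.replication value for a cluster.
#   $1 = number of DataNodes; the result is capped at 3.
pick_replication() {
    if [ "$1" -lt 3 ]; then
        echo "$1"
    else
        echo 3
    fi
}

pick_replication 2    # prints 2
pick_replication 10   # prints 3
```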

Now we need to format the filesystem. It is possible, but slightly more complicated, to clone/add a node without re-formatting the whole HDFS cluster, but that's a topic for another blog post. Since we have virtually no data here, we'll address all concerns by wiping clean:
On Node 1:
cd /usr/local/hadoop
bin/hadoop namenode -format

Now we should be able to start up all three nodes. Let's give it a shot!
On Node 1:
bin/start-all.sh

Note: If you have issues starting up the Datanodes you may need to delete the sub-directories under the directory listed in core-site.xml as "hadoop.tmp.dir" and then re-format again (see above). 

As for startup troubleshooting and running the wordcount example, follow the instructions above from the initial cluster setup. There should be nothing different from running with two nodes, save the fact that you may need more books to get it working on all 3 nodes.


Postmortem/Additional Reading

You did it! You've now got the experience of setting up a Hadoop cluster under your belt. There's a ton of directions to go from here; this relatively new tech is changing the way a lot of companies view data and business intelligence.

There is so much great community content out there; here's just a small list of the references I used and some additional reading.

References:

Raspberry Pi forums: Hadoop + HDFS + MR on Pi cluster - works!
Michael G. Noll: Running Hadoop on Ubuntu Linux Single-Node Cluster
Michael G. Noll: Running Hadoop on Ubuntu Linux Multi-Node Cluster
Hadoop Wiki: Getting Started with Hadoop
Hadoop Wiki: Wordcount Example
University of Glasgow's Raspberry Pi Hadoop Project
Hadoop Wiki: Cluster Setup
Hadoop Wiki: Java Versions
All Things Hadoop: Tips, Tricks and Pointers When Setting Up..
Hadoop Wiki: FAQ
Code/Google Hadoop Toolkit: Hadoop Performance Monitoring
Jeremy Morgan: How to Overclock the Raspberry Pi

62 comments:

Jagan said...

Very good information which you have shared,it is very helpful for Hadoop online training

Toby Meyer said...

Thanks Jagan, that was the idea exactly. I'm glad you enjoyed it!

Chris said...

Hey Toby,

thank you for that great how to - I passed it to my old Dual PII 350 running debian since 2003; working fine as name & data node.

By the way, I had to add my user called hadoop to /etc/sudoers.

Keep such stuff up :-)
Chris

simon said...

Nice work Toby - I now have my 3 node Raspberry Pi Hadoop cluster up and running - slow but fun! Cheers Simon

Toby Meyer said...

Hey Chris & Simon, thanks for the feedback, I'm glad you found it useful!

Simon said...

Toby,
I'm thinking of setting up my own Raspberry Pi hadoop cluster. What is the 110v power splitter for?
Thanks for the awesome article!

Toby Meyer said...

Hey Simon! The 110v splitter is just so that I can plug in all the components to one outlet. Have fun & good luck!

Simon said...

Gotcha gotcha. Thanks!

Pk's Blog said...

Can you tell me what is the part number or name of Raspberry PI model that you bought from Microcenter. There are quite a few options. I would like to go with what you had used.

Thank you,

Toby Meyer said...

Hey PK, sure thing! You're looking for a standard model B; here is a link to the exact model. Thanks for reading and good luck!

Ken Fox said...

Toby: I'm running into a problem where the job tracker and task tracker start and then abort with the following error:
FATAL org.apache.hadoop.mapred.JobTracker: java.lang.IllegalArgumentException: Does not contain a valid host:port authority: local
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:164)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:130)
at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:2131)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1869)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1689)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1683)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:320)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:311)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:306)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:4710)

Task tracker is similar.

I've searched several places but haven't found anything that resolves the issue. I'm still at the single node phase of the basic setup.

Have you seen this before and do you have any suggestions?

Toby Meyer said...

Hi Ken!

I suspect you have an issue with your /usr/local/hadoop/conf/mapred-site.xml file; you may want to take another look at it. Feel free to post it here if you like and we'll have a look!

J_L said...

Hey Toby,

thank you very much for sharing this stuff.
We encounter an error message when starting all .

server vm is only supported on armv7+

This is because of the java we use?

Cheers

/Jan

Toby Meyer said...

Hi @Jan!

Yes, that is because of the version of Java you are using. I'm guessing this applies to you. Here is a huge discussion on the topic... there are a few options out there Java wise. For a proof of concept, you could try switching over to OpenJDK. Good luck!

Ken Fox said...

Toby - Thanks -- I fixed it, but I'm not sure how.
Best guess is that it was a dos2unix type issue since I had cut and pasted the config file information directly from this site, and the functional version looks identical to what's posted here.
I have two nodes to work with at this time. I had both run the wordcount example successfully, and now I am working on getting the master-slave version working.

J_L said...

Good day Toby,

thanks for your advice. Hadoop started using OpenJDK.

Cheers Jan

Toby Meyer said...

@Ken, @Jan

Excellent! I'm glad you both got it working. I hadn't thought of that issue Ken, that makes sense. I suppose there is the potential to introduce a text encoding issue with copy/paste depending on what SSH client you use. Good job getting around that... I'm sure that information will be useful to other folks setting it up as well.

Thanks for reading!
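For anyone who suspects the same copy/paste problem, here is a rough sketch of how to spot and strip Windows-style line endings in a config file. The helper names has_crlf and strip_cr are made up for illustration; only grep, tr and printf are assumed to be available.

```shell
# Carriage return character, written portably.
CR=$(printf '\r')

# Succeeds if the file contains at least one carriage return.
has_crlf() {
    grep -q "$CR" "$1"
}

# Writes a copy of the file to stdout with carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1"
}

# Usage sketch (path follows the article's layout):
#   has_crlf /usr/local/hadoop/conf/mapred-site.xml && echo "CRLF found"
#   strip_cr mapred-site.xml > mapred-site.xml.unix
```

Writing the cleaned copy to a new file keeps the original around for comparison; the dos2unix tool does the same job in place if it is installed.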

lucj said...

Hi Toby,
Thanks for this great tutorial !

I have an error message, though, when trying to ssh on localhost:

hduser@node1 ~ $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hduser@node1 ~ $ ssh localhost
socket: Address family not supported by protocol
ssh: connect to host localhost port 22: Address family not supported by protocol
hduser@node1 ~ $

Any idea ?
Thanks a lot,
Luc

Toby Meyer said...

Hey @Luc!

It's possible that you've got IPv6 enabled and SSH isn't set up for it. What happens if you telnet 127.0.0.1 22? If that works, you could try to disable IPv6 to get the lab up and running.

Here is additional information. I hope that works for you!
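One way to narrow this down is to see which addresses the hosts file gives for localhost, then force SSH onto IPv4 with its -4 option. The function below is a sketch under that assumption; localhost_addrs is a hypothetical name.

```shell
# Print every address that a hosts file maps to the name "localhost".
# Defaults to /etc/hosts; anything after a "#" is ignored.
localhost_addrs() {
    awk '
        {
            sub(/#.*/, "")
            for (i = 2; i <= NF; i++)
                if ($i == "localhost") print $1
        }
    ' "${1:-/etc/hosts}"
}

# If "::1" is listed and IPv6 is unavailable, forcing IPv4 is a quick test:
#   ssh -4 localhost
```

If ssh -4 localhost connects, the key setup is fine and only the address family needs sorting out.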

LFC said...

I just happen to have 3 raspberry Pi around, and haven't found anything interesting enough to do with them. I also happen to have just signed up for udacity's hadoop class. This is what I am going to do with my 3 Pis. Thanks!

Toby Meyer said...

Excellent @LFC, good luck & have fun!

Bala Thandapani said...

How to run the jsp program in hadoop, provide me the list of steps

Toby Meyer said...

Hi @Bala,

Assuming you want to use Hadoop to serve up a JSP I'd direct you to this answer at Stack Overflow. Unless you really need to I would advise against using the built in Jetty to serve up non-Hadoop standard content as it goes against the way Hadoop attempts to load balance across nodes. That said, it could be done. If you do want to go down that path you'll most likely need to do it in code and this would get you started.

mark amos said...

Thanks Toby for the useful tutorial.

I just ran the word count example on an Hadoop 2.2.0 cluster that I built using your notes and Rasesh Mori's tutorial.

It consists of two mini-ITX Atom PC's running Centos and 3 Raspberry Pi's running Debian. I like how you can mix-and-match different devices/OSs for a working cluster.

We just built a couple of big clusters at work (a 5 node VM cluster and a 9 node hardware cluster) using Ambari and that was pretty quick.

I wanted to see if I could do it at home using "raw" hadoop.

Thanks again - fun stuff!

Toby Meyer said...

Hi @Mark!

I'm glad I could help in any way. That sounds like a blast. Ambari is a really cool project too; I'd love to write something up on that... perhaps I'll add it to the list. :)

Thanks for sharing your success!

Brendon Unland said...

Hi,

I really wanted this tutorial to work. I've gone through it 3 times and continue to get this error after the command to copy the books onto the HDFS filesystem by using the Hadoop dfs command.

13/12/10 12:45:21 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /fs/hduser/books/kjv.txt could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1920)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
at ... stack trace is much bigger but comments on blog is limited.

Any ideas?

Thanks,
Brendon

Toby Meyer said...

Hi Brendon!

A couple things to try:

- If you have a firewall enabled, disable it on each node. This can be done temporarily with the command service iptables stop (Debian, Raspbian if applicable). If that works, then you'll need to add firewall exceptions if you wish to leave it enabled.
- Make sure the fs.default.name entry in core-site.xml is correct on your slave nodes; they must be pointing at the (HDFS)master.
- I'm sure you've done this, but make sure you format HDFS after all nodes are added. Try /usr/local/hadoop/bin/hadoop namenode -format
- Try starting all nodes separately and watching for errors. See this Stack post for more info.

Good luck; let us know if you're still having issues!

Brendon Unland said...

Toby,

I put this on the shelf for a while to come back to it fresh. I re-installed the os on the SD card to start from scratch. There is no firewall. I've run into the exact same issue every time. I'm using Hadoop 1.2.1.

I feel like everything is going great until I execute this command:

/usr/local/hadoop/bin/hadoop namenode -format

...
14/01/23 20:58:04 INFO common.Storage: Storage directory /fs/hadoop/tmp/dfs/name has been successfully formatted.
14/01/23 20:58:04 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: node1: node1

Then when I go to copy the books...

I get a long error message that starts with...

14/01/23 21:09:06 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /fs/hduser/books/kjv.txt could only be replicated to 0 nodes, instead of 1
...

Any ideas would be appreciated.

Toby Meyer said...

Hey Brendon! Good question; that may be due to a DNS lookup error for node1. Try adding the node1 record to your preferred dns servers (listed in /etc/resolv.conf) or put the correct IP address into /etc/hosts on each node. Make sure you add entries for all nodes in the cluster.

The second error is most likely related to the first so hopefully that takes care of both.

Perhaps that will get it going, let us know if not; thanks!
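To check the /etc/hosts route quickly, something like the following can be run on each node. It is a sketch only: missing_hosts is a made-up helper, and node2 and node3 are placeholder names for whatever the other nodes are called.

```shell
# Print each node name that has no entry in the given hosts file.
# Usage: missing_hosts FILE NODE...
missing_hosts() {
    hosts_file=$1
    shift
    for node in "$@"; do
        awk -v n="$node" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$hosts_file" || echo "$node"
    done
}

# Example: missing_hosts /etc/hosts node1 node2 node3
```

No output means every listed node has an entry; any name printed still needs a line of the form "IP-address name" in that file.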

J_L said...

Hi,

maybe we are just too st...

We've installed & configured hadoop, we've formatted the namenode and started hadoop.

Then we try to copy a text-file to hadoop using hdfs -copyFromLocal, but it keeps complaining that it cannot find the provided folder (the target folder, in your example /fs/hduser/books). How would you check if the folder exists (it should be in hdfs, I think)? We couldn't find it, but maybe our search was incorrect?

How do I traverse the hdfs? How can I make sure that the folder is created?

best regards

Jan

Brendon Unland said...

Toby,

Yes, I added the correct IP address into /etc/hosts and processed my first job!

Thanks so much for the reply, I am really enjoying this project.
Brendon

Toby Meyer said...

@Jan

Stupendous? I don't know if you can be *too* stupendous. :) Keep in mind this stuff can be difficult, but eventually we'll get it.

To traverse/manipulate the filesystem, you need the "hadoop dfs" (or just "hadoop fs") command (assuming at least the data node(s) are running). The best syntax guide I've found is here. So if, for example, you wanted to check the /fs/hduser path for the books directory, you would type:

hadoop dfs -ls /fs/hduser

Then if you need to create it the other commands are generally just like the *nix equivalents. Note, however, that hadoop dfs on the Raspberry Pis can seem very slow sometimes.
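As a small sketch of the "list it, create it if missing" routine described above, the function below combines the -ls and -mkdir subcommands of hadoop dfs. The function name hdfs_ls_or_create is an assumption for illustration, and it expects the hadoop command to be on the PATH.

```shell
# List an HDFS directory; if the listing fails, create it and list again.
hdfs_ls_or_create() {
    hadoop dfs -ls "$1" || {
        hadoop dfs -mkdir "$1" && hadoop dfs -ls "$1"
    }
}

# Example (run as hduser with the Hadoop daemons up):
#   hdfs_ls_or_create /fs/hduser/books
#   hadoop dfs -copyFromLocal kjv.txt /fs/hduser/books
```

On a Pi each hadoop dfs call starts a fresh JVM, so even this short sequence can take a while.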

@Brendon:
Excellent! Glad to hear it; I must confess this project has been one of my favorites.

J_L said...

Dear Toby,

thank you very much for your answer. We've been trying this several times, but we always get: 'No such file or directory'.
Even if i try to make a new directory:
e.g.,
hadoop dfs -mkdir /user/hadoop/dir1
it keeps reporting the no-such-file error :/

best regards
jan

Toby Meyer said...

@Jan

Hmm, that's tough. I can't reproduce it... can you paste in the output from the command:

hadoop fsck /

?

Anto Z. said...

Thanks Toby for taking the time to put that great tutorial together. Unfortunately it doesn't work for me.

I followed all the steps and tried reinstalling the whole thing but I can't get it to work.

Everything starts as it should and the 5 processes are shown by the jps command

hduser@myraspberrypi /usr/local/hadoop/logs $ jps
2778 DataNode
3041 TaskTracker
2876 SecondaryNameNode
3465 Jps
2687 NameNode
2941 JobTracker

However when I look at the logs, I have 1 error in the namenode.log

2014-02-09 20:18:03,518 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hduser cause:java.io.IOException: File /fs/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1

And another one in the jobtracker.log

2014-02-09 20:17:26,950 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hduser cause:java.net.ConnectException: Call to localhost/127.0.0.1:54310 failed on connection exception: java.net.ConnectException: Connection refused

which seems to be linked.

I am not able to connect to http://localhost:54310.

I checked all the permissions and they look fine. My research hasn't been successful.
If you can advise a direction to investigate, that would be much appreciated!

Toby Meyer said...

Hi @Anto,

Here are a couple things to try if you haven't already:

- Ensure your local firewall is either disabled or configured correctly. On Debian/Raspbian you can temporarily disable it with service iptables stop
- Shut everything down and startup one at a time, checking for errors after each one. The order should be:

/usr/local/hadoop/bin/hadoop namenode
check /usr/local/hadoop/logs/hadoop-username-namenode-nodename.log
/usr/local/hadoop/bin/hadoop datanode
check /usr/local/hadoop/logs/hadoop-username-datanode-nodename.log

After that you can start the trackers with /usr/local/hadoop/bin/start-mapred.sh, but I would bet you'll find the error prior to that.

(If you're running multiple nodes, which it looks like you aren't but just in case)

- Try replacing "localhost" in most cases with the actual node name. For example, fs.default.name in /usr/local/hadoop/conf/core-site.xml should point to the hostname of the name node.
- Check your DNS/name lookup configuration; if using the /etc/hosts file each node must have all nodes in the cluster (including itself) in the file.

After your name/data node(s) are coming up right you will have to make sure you format HDFS with: /usr/local/hadoop/bin/hadoop namenode -format

Good luck!
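When checking the logs after each start, it can help to pull out only the error lines. The helper below is a sketch; log_errors is a hypothetical name, and the example path simply fills in the log naming pattern quoted above with hduser and node1.

```shell
# Print the last ERROR and FATAL lines from a Hadoop daemon log.
# Usage: log_errors FILE [MAX_LINES]   (MAX_LINES defaults to 20)
log_errors() {
    grep -E '(^|[[:space:]])(ERROR|FATAL)[[:space:]]' "$1" | tail -n "${2:-20}"
}

# Example:
#   log_errors /usr/local/hadoop/logs/hadoop-hduser-namenode-node1.log
```

An empty result after starting the name node and then the data node is a good sign that the trackers are worth starting next.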

@Jan: I hope you got your problem fixed!

J_L said...

Hi,

this is the output:

hadoop fsck /

...

Connecting to namenode via http://localhost:50070
FSCK started by hduser (auth:SIMPLE) from /127.0.0.1 for path / at Sat Jun 22 03:...
Status: HEALTHY
Total size: 0 B
Total dirs: 1
Total files: 0
Total symlinks: 0
Total blocks (validated): 0
Minimally replicated blocks: 0
Over-replicated blocks: 0
...
Default replication factor: 1
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 0
Number of racks: 0
FSCK ended ... in 479 milliseconds

The filesystem under path '/' is HEALTHY

So the data node ain't running. Is that it?

I've checked the datanode log and i've noticed that it can't connect to hdfs://localhost:54310, so i changed it to localhost:54310.

What i did now is stop it all and start it again. But also with localhost:54310 i receive the error.

i've then changed it back again, but i still do not get the datanode running.

my best regards
/jan

Anto Z. said...

Hi Toby,

Thx for your help, but it still doesn't work.

1/ I disabled iptables (uninstall it really).

2/ successfully launched namenode using "/usr/local/hadoop/bin/hadoop namenode". I can telnet to 54310 ok.

Namenode logs end with
14/02/16 00:08:54 INFO ipc.Server: IPC Server handler 8 on 54310: starting

3/ datanode starts ok using "/usr/local/hadoop/bin/hadoop datanode"

Datanode logs end with
14/02/16 00:14:19 INFO datanode.DataNode: Starting Periodic block scanner
14/02/16 00:14:19 INFO datanode.DataNode: Generated rough (lockless) block report in 0 ms

Datanode correctly registered to namenode. In namenode logs:

14/02/16 00:14:19 INFO hdfs.StateChange: BLOCK* registerDatanode: node registration from 127.0.0.1:50010 storage DS-1253013518-127.0.1.1-50010-1391965156080
14/02/16 00:14:19 INFO net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:50010
14/02/16 00:14:19 INFO hdfs.StateChange: *BLOCK* NameNode.blocksBeingWrittenReport: from 127.0.0.1:50010 0 blocks
14/02/16 00:14:19 INFO hdfs.StateChange: *BLOCK* processReport: from 127.0.0.1:50010, blocks: 0, processing time: 0 msecs


4/ I reformat the hdfs partition using "/usr/local/hadoop/bin/hadoop namenode -format" - no issue

5/ I launched "/usr/local/hadoop/bin/start-mapred.sh". IT FAILS

In the "hadoop-hduser-jobtracker-el-server.log" I got:

2014-02-16 00:17:08,123 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /app/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1920)

2014-02-16 00:17:08,123 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for null bad datanode[0] nodes == null
2014-02-16 00:17:08,123 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/app/hadoop/tmp/mapred/system/jobtracker.info" - Aborting...
2014-02-16 00:17:08,123 WARN org.apache.hadoop.mapred.JobTracker: Writing to file hdfs://vmware:54310/app/hadoop/tmp/mapred/system/jobtracker.info failed!
2014-02-16 00:17:08,123 WARN org.apache.hadoop.mapred.JobTracker: FileSystem is not ready yet!
2014-02-16 00:17:08,126 WARN org.apache.hadoop.mapred.JobTracker: Failed to initialize recovery manager.
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /app/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1


in the namenode logs, I get an infinite loop that displays

14/02/16 00:17:08 INFO namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
14/02/16 00:17:08 WARN namenode.FSNamesystem: Not able to place enough replicas, still in need of 1 to reach 1
Not able to place enough replicas
14/02/16 00:17:08 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:java.io.IOException: File /app/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
14/02/16 00:17:08 INFO ipc.Server: IPC Server handler 5 on 54310, call addBlock(/app/hadoop/tmp/mapred/system/jobtracker.info, DFSClient_NONMAPREDUCE_-502449137_1, null) from 127.0.0.1:36898: error: java.io.IOException: File /app/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /app/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1


The datanode and the tasktracker don't show any error.

-> it seems my hdfs filesystem is not ready. I don't understand why. There is space on my hd. The hdfs repository (/app/hadoop/tmp) shows a tree structure (2 folders, dfs & mapred). Permissions are 750 hduser:hadoop.

Any help would be greatly welcome!

Toby Meyer said...

@Jan,

A couple pointers...
- hdfs://localhost:54310 is correct; the core-site configuration file needs a protocol definition preceding the host/port.
- The hadoop fsck / output didn't look too bad! If your namenode wasn't running that command would return an error rather than any results, so that's a good sign.
- This error message: INFO org.apache.hadoop.ipc.RPC: Server at rasdoop1/192.168.1.40:54310 not available yet, Zzzzz... is "normal", especially in the Raspberry Pi setup since the machines are so slow. As long as you see something like: INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Registered FSDatasetStatusMBean afterward it's no problem.
- To confirm everything is running, what do you see when you execute jps? You should see:

4863 Jps
4003 SecondaryNameNode
4192 TaskTracker
3893 DataNode
3787 NameNode
4079 JobTracker

(with different process IDs in front of each running process)

- If you re-format HDFS, do you have any more luck? Based on what I'm seeing it almost looks like it is running fine but doesn't have a FS structure. What happens if you do the following:

Ensure your core-site.xml contains at least:
Temp dir pointed to: /fs/hadoop/tmp
Default name set to: hdfs://localhost:54310

Then:


sudo mkdir -p /fs/hadoop/tmp
sudo chown hduser:hadoop /fs/hadoop/tmp
sudo chmod 750 /fs/hadoop/tmp/
/usr/local/hadoop/bin/hadoop namenode -format

and then try to create your directories? Do any of those result in an error message?
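For reference, a core-site.xml with just those two settings could look like the file written by the sketch below. The property names hadoop.tmp.dir and fs.default.name are the ones quoted elsewhere in this thread; the function name write_core_site and the scratch path are assumptions for illustration.

```shell
# Write a minimal core-site.xml holding the temp dir and default name.
# Usage: write_core_site DEST_FILE
write_core_site() {
    cat > "$1" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/fs/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF
}

# Example: write to a scratch file first, then compare with the real config.
#   write_core_site /tmp/core-site.xml
#   diff /tmp/core-site.xml /usr/local/hadoop/conf/core-site.xml
```

Comparing against a scratch copy rather than overwriting the live file makes it easier to spot a stray character or a missing tag.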

Toby Meyer said...

Hey There @Anto,

Looks like you made some progress; the job tracker is connecting now. I'm a little curious about the hdfs://vmware reference; could you post the contents of your conf/core-site.xml? Is the only item in your hdfs-site.xml the dfs.replication value or are there additional settings? Also, what does your /etc/hosts look like? There have been issues where the localhost entry isn't simply:

127.0.0.1 localhost

Oh, just so I'm certain too.. you are running just one node right now correct?

Anto Z. said...

Hi Toby,

Thanks for your quick answer.

conf/core-site.xml

hadoop.tmp.dir
/app/hadoop/tmp
A base for other temporary directories.

fs.default.name
hdfs://vmware:54310
The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.

the dfs.replication is the only info in the hdfs-site.xml and is set to 1.

hdfs-site.xml

dfs.replication
1
Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.

my /etc/hosts file looks like

127.0.0.1 localhost
192.168.11.119 vmware


I also made sure that my hostname is correctly set to 'vmware' in /etc/hostname.

And I do confirm I am running just one node (with the 'vmware' machine hosting all the processes for the time being).

Thanks,
Anto

J_L said...

Hi Toby,

actually in the process list, there is no data node running, only namenode, resource manager, secondary namenode and node manager.
No Task Tracker, no job tracker.

I've formatted several times, but in the datanode log I always find the connection error. I tried pinging hdfs://localhost..., but that won't work either.

I never mentioned since i read someone was using hadoop 2.2 and it was running fine - we do use hadoop 2.2, too.

my highest regards

Jan

Toby Meyer said...

@Anto: If you haven't already, try changing the

fs.default.name
hdfs://vmware:54310

to

fs.default.name
hdfs://localhost:54310

and see if that does the trick.

@Jan

2.2 will require some additional and different settings. I'm hoping to create a 2.2 version of this article soon, but for the time being this article is wonderful. Most of it (the Hadoop specific portions) still applies to the Pi. There are additional needed environment variables (.bashrc) and some of the default ports have changed, so pay special attention to those parts. If you want some info with a little more "meat", give Hortonworks a try; theirs is always great.

If you need to check which ports are open to be certain, you can always use netstat -tlnp (the -l flag limits the output to listening sockets, and -p needs root to show every process).
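To boil the netstat output down to just port numbers, a small filter like this can help. It is a sketch that assumes the usual net-tools column layout (local address in column 4, state in column 6); listening_ports is a made-up name.

```shell
# Read "netstat -tln" style output on stdin and print the listening TCP
# ports, sorted numerically with duplicates removed.
listening_ports() {
    awk '$1 ~ /^tcp/ && $6 == "LISTEN" { n = split($4, a, ":"); print a[n] }' |
        sort -un
}

# Example:
#   netstat -tln | listening_ports
```

On a Hadoop 1.x single node set up as in this thread, 54310 should appear in the list once the name node is up.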

Anto Z. said...

Hello Toby,

Thanks for your answer. I tried that earlier and unfortunately it showed the same error...

I am kind of stuck...

Sorry for my previous message; I didn't realise the tags didn't show up.

Thanks,
Anto

Paul said...

Toby,

Many thanks for the great tutorial. I have just been working through it and hit a snag.

I've installed Hadoop 2.3.0 and am at the point of configuring core-site.xml and hdfs-site.xml, and there doesn't appear to be a /conf directory under /usr/local/hadoop.

Just wondering what you think I may have done wrong.

Kind Regards

Paul

Paul said...

Toby,

Scratch my last question just found them all under /usr/local/hadoop/etc/hadoop/

Kind Regards

Paul

Toby Meyer said...

Hi @Paul,

Glad you found it & I appreciate you following up... I'm sure others running 2.3 will appreciate the info.

Thanks for reading!

Rob Snow said...

Hi Toby.
What a great post, far more clear and concise than some I have been trying to follow!
I have everything in place with 2 Pis, jobs start on both, and when I load in books, all seems good.
However when I run the word count example, the HealthCheck page only ever shows 1 node, and the time it takes to run increases with more books... so I am assuming node2 never kicks in. I am loading 20 books (26.6MB); should that be enough to kick the 2nd node into action?
all config files seem ok, and both nodes can talk to each other.
Any advice would be greatly appreciated.
Cheers
Rob

Paul said...

Toby,

I have hit a snag at the last hurdle and was wondering if you had any thoughts on how to solve the problem. I have installed and configured everything and all seems ok until I go to run the Map Reduce example. I get the following :

Number of Maps = 2
Samples per Map = 5
14/03/28 06:41:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/03/28 06:43:03 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/03/28 06:43:37 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); maxRetries=45
14/03/28 06:43:58 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); maxRetries=45
14/03/28 06:44:18 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); maxRetries=45
14/03/28 06:44:38 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); maxRetries=45
...
14/03/28 06:59:00 WARN security.UserGroupInformation: PriviledgedActionException as:hduser (auth:SIMPLE) cause:org.apache.hadoop.net.ConnectTimeoutException: Call From Node1/199.101.28.130 to 0.0.0.0:8032 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=0.0.0.0/0.0.0.0:8032]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
org.apache.hadoop.net.ConnectTimeoutException: Call From Node1/199.101.28.130 to 0.0.0.0:8032 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=0.0.0.0/0.0.0.0:8032]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

Toby Meyer said...

Hi @Rob Snow!

Thanks for the kind words! Both nodes won't kick in unless the data set is large enough that it can be split between the two. To get all nodes working in my case I had to start jobs using several of the largest books I can find, so load up some "War and Peace" (and a few others) and give it another shot!

P.S. Winter is coming. :)

@Paul

Looks like there may be a couple problems, but first and foremost name resolution (0.0.0.0). I believe the resource manager settings are covered under etc/hadoop/yarn-site.xml in 2.3; can you make sure all the ".address" values are correct in that file? If using hostnames, make sure you can resolve the listed names correctly. If not using DNS you may need to add them to your /etc/hosts file on all nodes. Good luck!
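As a rough illustration of the ".address" values in question, the sketch below writes a yarn-site.xml fragment that points three ResourceManager addresses at a named host instead of the 0.0.0.0 default. The function name write_yarn_site and the host name node1 are assumptions; the ports shown are the stock Hadoop 2.x defaults.

```shell
# Write a yarn-site.xml that sets the main ResourceManager addresses.
# Usage: write_yarn_site DEST_FILE RM_HOST
write_yarn_site() {
    rm_host=$2
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${rm_host}:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${rm_host}:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>${rm_host}:8031</value>
  </property>
</configuration>
EOF
}

# Example (scratch file; compare it with the real config before copying):
#   write_yarn_site /tmp/yarn-site.xml node1
```

Whatever host name goes in here has to resolve on every node, which is where the /etc/hosts entries come back in.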

Paul said...

Toby,

Many thanks. Got waylaid but am back on it now.

I was just wondering about resolv.conf.

How do I find the Domain suffix for the machine and is the Search parameter required?

I'm confused about what nameserver ip's to use. Can I simply do as you have done?

domain company.com
search company.com
nameserver 192.168.1.20
nameserver 192.168.1.30

Paul said...

Toby,

Please ignore last comment. Not sure what I was thinking.

Kind Regards

Paul

Toby Meyer said...

Hey @Paul! I'm glad you got it taken care of. I have received that question from at least one other person as well, so if anyone else has a similar question I'd like to address it here.

If you're in an environment where you don't control those settings, you can find them by using the following commands on a different machine that is already set up:

-Windows:
--ipconfig /all (connection-specific DNS Suffix for search and DNS servers for nameserver)
-*nix:
--cat /etc/resolv.conf (essentially copy values)

Hopefully if anyone else has the same concerns your question will help. Thanks!

Nicholas Herriot said...

Hi Toby,

what a fantastic blog! I had already purchased the "Raspberry Pi Super Cluster" - but your blog has far better descriptions for the Hadoop section. :-)

I followed your blog and have a question. Everything 'seems' to work for me in that when i run the startup script I can see the correct jps processes working on each of my nodes. I have one master and 2 workers. So on the workers I have:
DataNode; and
TaskTracker;

On my master I have:
JobTracker;
NameNode;
TaskTracker;
DataNode; and
SecondaryNameNode.

However on multiple books being processed I had an error with java not having enough 'Heap' - I have mine set as 272 as your tutorial suggests.

When I do a wordcount on just a small book it takes maybe 10 minutes to complete! Do you get this?
What Raspberry Pi's did you use? Revision A or B? I'm using 'A'.

I also noticed on my setup that the NameNode JSP page only ever shows 1 live node! And the %cpu for my java processes on the slave machine never goes above 5% - do you get this?

Final question! Honest! :-), I get warnings about a 'snappy' java library not loaded, so the system uses a built in library, which i guess is down to me using the OpenJDK. Does this make much difference?

Kind regards, Nicholas.

Toby Meyer said...

Hi @Nicholas,

Thank you very much, I appreciate the compliment. First issue is the heapsize with the A model. I am using B, and a heapsize of 272 will be too much for an A. I would bring it all the way down to 142 or so in your situation, and I've updated the article to clarify (Thanks, I didn't consider that!). On my Pis it takes a few minutes to complete; I'm guessing with less RAM it will be even slower. Also, I think you may have a cluster issue that we'll discuss below. That said, this definitely is not for performance but rather learning Hadoop easily and cheaply. Now that you've done it here you can carry it forward into a production environment with 16-way servers with gigs and gigs of RAM and SSDs as far as the eye can see. :)

As for the NameNode JSP page(assuming you're talking 50070:dfshealth.jsp), it does seem like something is wrong there. Double check your config files under the "Multiple Nodes" section; something may be wrong there. They should be reporting and working harder than that. Additionally, check the logs on your worker nodes after performing a start-all.sh. Feel free to post any errors here so we can get to the bottom of it.

The snappy warning is not a big deal; there are faster libs than OpenJDK but as stated earlier these aren't going to be fast either way.

donald fossouo said...

Thanks for this article, and sorry for my English

I would like to perform a testing architecture for Hadoop

i only have 3 VMs with CentOS each could you help me to design it?


I was thinking about:

1 VMs with the namenode & Jobtracker
1 VMs as a client
1 VMs as the datanode

But i am new to this concept and don't know exactly where to start

Can you share the link to buy the necessary materials too?

Toby Meyer said...

Hey Donald! You're in luck, I wrote this article with that in mind. :) There is no reason you can't follow the article top to bottom with VMs as well with a couple minor differences:

- Some of the OS/tool syntax will differ depending on the distro. In your case, for example, APT will be replaced with YUM since you're running CentOS.
- The JVM you will probably end up using should be Oracle. (see this)
- Obviously the hardware/firmware guidance will differ; for example, export HADOOP_HEAPSIZE=272 should be changed to more closely reflect how much memory your machine has. If your VM has around 1GB of RAM allocated, you could set the heapsize to 800 or so.

Give it a go, good luck!
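The heapsize figures mentioned in this thread (142 for a 256MB Pi model A, 272 for a 512MB model B, about 800 for a 1GB VM) can be turned into a small lookup. This is only a sketch: the function name pick_heapsize and the exact cut-off points are assumptions, and larger machines will want a larger value than 800.

```shell
# Suggest a HADOOP_HEAPSIZE (in MB) from total RAM (in MB), using the
# three figures quoted in this thread. Thresholds are illustrative.
pick_heapsize() {
    if [ "$1" -le 256 ]; then
        echo 142
    elif [ "$1" -le 512 ]; then
        echo 272
    else
        echo 800
    fi
}

# Example for conf/hadoop-env.sh, reading total memory from /proc/meminfo:
#   total_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
#   export HADOOP_HEAPSIZE=$(pick_heapsize "$total_mb")
```

Treat the result as a starting point and adjust it if jobs fail with Java heap errors or the node starts swapping.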

lukas said...

Hi Toby,

First of all, thank you for this great blog!
I have a question. I installed Hadoop 2.2.0 on my Raspberry Pi model B. I also completed the configuration including the yarn-site.xml.
When I finally run the command start-dfs.sh the following line is shown:
"localhost: Server VM is only supported on ARMv7+ VFP".

Toby, do you know what next step I can take to fix this?

Thanks,
Lukas