Saturday, February 21, 2009

Mini Local HBase Cluster

I am trying to get a local setup going where I have everything I need on my PC - heck, even on my MacBook Pro. I use Eclipse and Java to develop, so that is easy. I also use Memcached, and there is a nice MacPorts version of it available. But what I also need is a working Hadoop/HBase cluster!

At work we have a few of these, large and small, but they are either in production or simply too complex to use for day-to-day testing - especially when you try to debug a MapReduce job or code talking directly to HBase. I found that the excellent HBase team already had a class in place that is used to set up the JUnit tests they run, and the same goes for Hadoop. So I set out to extract the bare essentials, if you will, to create a tiny HBase cluster running on a tiny Hadoop distributed filesystem.

After resolving a couple of issues, the class below is my "culmination" of sweet cluster goodness ;)

/* File:    MiniLocalHBase.java
 * Created: Feb 21, 2009
 * Author:  Lars George
 *
 * Copyright (c) 2009 larsgeorge.com
 */

package com.larsgeorge.hadoop.hbase;

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.MiniHBaseCluster;
import org.apache.hadoop.hbase.util.FSUtils;
import org.apache.hadoop.hdfs.MiniDFSCluster;

/**
 * Starts a small local DFS and HBase cluster.
 *
 * @author Lars George
 */
public class MiniLocalHBase {

  static HBaseConfiguration conf = null;
  static MiniDFSCluster dfs = null;
  static MiniHBaseCluster hbase = null;

  /**
   * Main entry point to this class.
   *
   * @param args The command line arguments.
   */
  public static void main(String[] args) {
    try {
      int n = args.length > 0 && args[0] != null ?
        Integer.parseInt(args[0]) : 4;
      conf = new HBaseConfiguration();
      dfs = new MiniDFSCluster(conf, 2, true, (String[]) null);
      // set file system to the mini dfs just started up
      FileSystem fs = dfs.getFileSystem();
      conf.set("fs.default.name", fs.getUri().toString());
      Path parentdir = fs.getHomeDirectory();
      conf.set(HConstants.HBASE_DIR, parentdir.toString());
      fs.mkdirs(parentdir);
      FSUtils.setVersion(fs, parentdir);
      conf.set(HConstants.REGIONSERVER_ADDRESS, HConstants.DEFAULT_HOST + ":0");
      // disable UI or it clashes for more than one RegionServer
      conf.set("hbase.regionserver.info.port", "-1");
      hbase = new MiniHBaseCluster(conf, n);
      // add close hook
      Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
          hbase.shutdown();
          if (dfs != null) {
            try {
              FileSystem fs = dfs.getFileSystem();
              if (fs != null) fs.close();
            } catch (IOException e) {
              System.err.println("error closing file system: " + e);
            }
            try {
              dfs.shutdown();
            } catch (Exception e) { /* ignore */ }
          }
        }
      });
    } catch (Exception e) {
      e.printStackTrace();
    }
  } // main

} // MiniLocalHBase

The critical part for me was that if you want to start more than one region server, you have to disable the UI of each of them, or they will all fail trying to bind to the same info port, usually 60030.

I also added a small shutdown hook so that when you quit the process it shuts down nicely and leaves the data in a state that lets you restart the local cluster later on for further testing. Otherwise you may end up having to redo the file system - no biggie I guess, but hey, why not? You can specify the number of RegionServers to be started on the command line; it defaults to 4 in my sample code above. Also, you do not need an hbase-site.xml or hadoop-site.xml to set anything else - all required settings are hardcoded to start the different servers in separate threads. You can of course add one and tweak further settings, just keep in mind that the settings hardcoded in the code cannot be overridden by the external XML files; to change those you would have to edit the code directly.

To start this mini cluster you can either run it from within Eclipse, for example, which makes it really easy since all the required libraries are already in place, or you can start it from the command line. That could look like so:

hadoop$ java -Xms512m -Xmx512m -cp bin:lib/hadoop-0.19.0-core.jar:lib/hadoop-0.19.0-test.jar:lib/hbase-0.19.0.jar:lib/hbase-0.19.0-test.jar:lib/commons-logging-1.0.4.jar:lib/jetty-5.1.4.jar:lib/servlet-api.jar:lib/jetty-ext/jasper-runtime.jar:lib/jetty-ext/jsp-api.jar:lib/jetty-ext/jasper-compiler.jar:lib/jetty-ext/commons-el.jar com.larsgeorge.hadoop.hbase.MiniLocalHBase

What I did was create a small project, have the class compile into the "bin" directory, and throw all Hadoop and HBase libraries into the "lib" directory. This was only for the sake of keeping the command line short. I suggest you either have the classpath set already or have it point to the original locations where you untar'ed the respective packages.

Running it from within Eclipse of course lets you use the integrated debugging tools at hand. The next step is to follow through with what the test classes already have implemented and be able to start MapReduce jobs with debugging enabled. Mind you, the local cluster is not very powerful - even if you give it more memory than I did above. But fill it with a few hundred rows, use it to debug your code, and once it runs fine, run it happily ever after on your production site.
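
If you want to fill the mini cluster with a few test rows right away, a small helper like the one below does the trick. Mind you, this is only a rough sketch against the 0.19-era client API as I remember it - the table name "testtable" and the column "data:value" are simply made up for this example - and it expects to be called with the same configuration object the mini cluster was started with, for instance at the end of main() above. A client running in a separate JVM would first have to be pointed at the mini cluster's master, so double-check the class and method names against the HBase release you actually have on the classpath.

/* File: FillTestTable.java - example only, verify against your HBase version */

package com.larsgeorge.hadoop.hbase;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

/**
 * Creates a test table in the mini cluster and inserts a few rows.
 */
public class FillTestTable {

  /**
   * Fills the table. Pass in the configuration the mini cluster was started with.
   *
   * @param conf The mini cluster configuration.
   * @throws Exception When talking to the cluster fails.
   */
  public static void fillTable(HBaseConfiguration conf) throws Exception {
    // create the table with a single column family
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("testtable");
    desc.addFamily(new HColumnDescriptor("data:"));
    admin.createTable(desc);
    // insert a few hundred rows to play with
    HTable table = new HTable(conf, "testtable");
    for (int i = 0; i < 300; i++) {
      BatchUpdate update = new BatchUpdate("row-" + i);
      update.put("data:value", ("value " + i).getBytes());
      table.commit(update);
    }
  }

} // FillTestTable

With a few rows like these in place you can scan the table from your own code or a MapReduce job and step through it in the debugger as described above.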

All the credit goes to the Hadoop and HBase teams, of course - I simply gathered their code from various places.

Thursday, February 5, 2009

Apache fails on Semaphores

Twice in the last few years I had an issue with our Apache web servers where all of a sudden they would crash and not start again. While there are obvious reasons when the configuration is screwed up, there are also cases where you simply do not know why it will not restart: there is enough drive space and RAM, and no other process is holding the port (I even checked with lsof).

All you get is an error message in the log saying:

[Fri May 21 15:34:22 2008] [crit] (28)No space left on device: mod_rewrite: could not create rewrite_log_lock
Configuration Failed


After some digging, the issue turned out to be that all semaphores were used up and had to be deleted first. Here is a script I use to do that:
echo "Semaphores found: "
ipcs -s | awk '{ print $2 }' | wc -l
ipcs -s | awk '{ print $2 }' | xargs -n 1 ipcrm sem
echo "Semaphores found after removal: "
ipcs -s | awk '{ print $2 }' | wc -l

Sometimes you really wonder what else could go wrong.

Wednesday, February 4, 2009

String starts with a number in XSL

I needed a way to test if a particular value in an XML file started with a letter-based prefix. If not, the value would start with a number and needed to be prefixed first before being output. While I found a great post on how to remove leading zeros, I could not find how to check whether the first character is of a particular type, for example a letter or a digit. In Java you can do that easily, like so (using BeanShell here):

bsh % String s1 = "1234";
bsh % String s2 = "A1234";
bsh % print(Character.isLetter(s1.charAt(0)));
false
bsh % print(Character.isLetter(s2.charAt(0)));
true

This is of course Unicode safe. With XSL though I could not find a similar feature, but for my purposes it was sufficient to reverse the check and test for a decimal digit first. Here is how:
<xsl:template match="/person/personnumber">
  <reference>
    <xsl:variable name="num">
      <xsl:value-of select="."/>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="contains('0123456789', substring($num, 1, 1))">
        <xsl:variable name="snum">
          <xsl:call-template name="removeLeadingZeros">
            <xsl:with-param name="originalString" select="$num"/>
          </xsl:call-template>
        </xsl:variable>
        <xsl:value-of select="concat('PE', $snum)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </reference>
</xsl:template>

Monday, February 2, 2009

Hadoop Scripts

If you work with HBase and Hadoop in particular, you start off doing most things on the command line. After a while this gets tedious and - in the end - becomes a nuisance. And error prone! While I wish there were an existing and established solution out there that helps manage a Hadoop cluster, I find that there are few you can use right now. One of the few that come to mind is the "Hadoop on Demand" (HOD) package residing in the contribution folder of the Hadoop releases. The other is ZooKeeper.

Interesting things are in the pipeline though, for example Simon from Yahoo!.

What I often need are small helpers that allow me to clean up behind me or help me deploy new servers. There are different solutions, usually involving some combination of SSH and rsync. Tools I found, and in some cases even tried, are SmartFrog, Puppet, and Tentakel.

Especially in the beginning you often find that you do not know these tools well enough, or that they do one thing but not another. Of course, you can combine them and make that work somehow. I usually resort to a set of well-known and proven scripts that I created over time to simplify working with a particular system. With Hadoop, most of these scripts run on the master, and since it is already set up to use SSH to talk to all slaves, it is easy to use the same mechanism.

The first one shows all Java processes across the machines, to verify that they are all up - or all shut down before attempting a new start:
#!/bin/bash
# $Revision: 1.0 $
#
# Shows all Java processes on the Hadoop cluster.
#
# Created 2008/01/07 by Lars George
#

servers="$(cat /usr/local/hadoop/conf/masters /usr/local/hadoop/conf/slaves)"

for srv in $servers; do
  echo "Sending command to $srv...";
  ssh $srv "ps aux | grep -v grep | grep java"
done

echo "done."

The next one is a poor man's deployment thingamajig. It helps copy a new release across the machines and set up the symbolic link I use for the current version in production. Of course this all varies with your setup.
#!/bin/bash
# $Revision: 1.0 $
#
# Rsync's Hadoop files across all slaves. Must run on namenode.
#
# Created 2008/01/03 by Lars George
#

if [ "$#" != "2" ]; then
echo "usage: $(basename $0) <dir-name> <ln-name>"
echo " example: $(basename $0) hbase-0.1 hbase"
exit 1
fi

for srv in $(cat /usr/local/hadoop/conf/slaves); do
echo "Sending command to $srv...";
rsync -vaz --exclude='logs/*' /usr/local/$1 $srv:/usr/local/
ssh $srv "rm -fR /usr/local/$2 ; ln -s /usr/local/$1 /usr/local/$2"
done

echo "done."

I basically download a new version on the master (or build one) and issue a

$ rsyncnewhadoop /usr/local/hadoop-0.19.0 hadoop

It copies the directory across and changes the "/usr/local/hadoop" symbolic link to point to this new release.

Another helper I use quite often is to diff an existing and a new version before I actually copy them across the cluster. It can be used like so:

$ diffnewversion /usr/local/hbase-0.19.0 /usr/local/hadoop-0.19.0

Again, I assume that the current version is symlinked as explained above; otherwise you would obviously have to make adjustments.
#!/bin/bash
#
# Diff's the configuration files between the current symlinked versions and the given one.
#
# Created 2009/01/23 by Lars George
#

if [[ $# == 0 ]]; then
  echo "usage: $(basename $0) <new_dir> [<new_dir>]"
  exit 1;
fi

DIRS="conf bin"

for path in $*; do
  if [[ "$1" == *hadoop* ]]; then
    kind="hadoop"
  else
    kind="hbase"
  fi
  for dir in $DIRS; do
    echo
    echo "Comparing $kind $dir directory..."
    echo
    for f in /usr/local/$kind/$dir/*; do
      echo
      echo
      echo "Checking $(basename $f)"
      diff -w $f $1/$dir/$(basename $f)
      if [[ $? == 0 ]]; then
        echo "Files are the same..."
      fi
      echo
      echo "================================================================"
    done
  done
  shift 1
done

echo "done."

The last one I am posting here helps remove the Distributed File System (DFS) data after, for example, a complete corruption (I didn't say those happen) or when you want a clean start.

Note: It assumes that the data is stored under "/data1/hadoop" and "/data2/hadoop" - that is where I have my data. If yours is different, then adjust the paths or - if you like - grep/awk the hadoop-site.xml and parse the paths out of "dfs.name.dir" and "dfs.data.dir" respectively.
#!/bin/bash
# $Revision: 1.0 $
#
# Deletes all files and directories pertaining to the Hadoop DFS.
#
# Created 2008/12/12 by Lars George
#

servers="$(cat /usr/local/hadoop/conf/masters /usr/local/hadoop/conf/slaves)"
# optionally allow single server use
if [[ $# -gt 0 ]]; then
  servers="$*"
fi
first="$(echo $servers | head -n 1 | awk -F. '{ print $1 }')"
dirs="/tmp/hbase* /tmp/hsperfdata* /tmp/task* /tmp/Jetty* /data1/hadoop/* /data2/hadoop/*"

echo "IMPORTANT: Are you sure you want to delete the DFS starting with $first?"
echo "Type \"yes\" to continue:"
read yes
if [ "$yes" == "yes" ]; then
for srv in $servers; do
echo "Sending command to $srv...";
for dir in $dirs; do
pa=$(dirname $dir)
fn=$(basename $dir)
echo "removing $pa/$fn...";
ssh $srv "find $pa -name \"$fn\" -type f -delete ; rm -fR $pa/$fn"
done
done
else
echo "aborted."
fi

echo "done."

I have a few others that, for example, let me kill runaway Java processes, sync only config changes across the cluster machines, start and stop the cluster safely, and so on. I won't post them here as they are either pretty trivial, like the ones above, or do not differ much. Let me know if you have similar scripts or better ones!