Monday, February 2, 2009

Hadoop Scripts

If you work with HBase and Hadoop in particular, you start off doing most things on the command line. After a while this gets tedious and eventually becomes a nuisance - and error prone! While I wish there were an existing and established solution out there that helps manage a Hadoop cluster, I find that there are few you can use right now. One of the few that come to mind is the "Hadoop on Demand" (HOD) package residing in the contribution folder of the Hadoop releases. The other is ZooKeeper.

Interesting things are in the pipeline though, for example Simon from Yahoo!.

What I often need are small helpers that allow me to clean up behind me or help me deploy new servers. There are different solutions, usually involving some combination of SSH and rsync. Tools I found - and some of which I even tried - are SmartFrog, Puppet, and Tentakel.

Especially in the beginning you often find that you do not know these tools well enough, or that they do one thing but not another. Of course, you can combine them and somehow make that work. I usually resort to a set of well-known and proven scripts that I created over time to simplify working with a particular system. With Hadoop, most of these scripts run on the master, and since it is already set up to use SSH to talk to all slaves, it is easy to reuse the same mechanism.

The first one shows all Java processes across the machines, to verify that they are up - or that all are shut down before attempting a new start:
#!/bin/bash
# $Revision: 1.0 $
#
# Shows all Java processes on the Hadoop cluster.
#
# Created 2008/01/07 by Lars George
#

servers="$(cat /usr/local/hadoop/conf/masters /usr/local/hadoop/conf/slaves)"

for srv in $servers; do
  echo "Sending command to $srv..."
  ssh $srv "ps aux | grep -v grep | grep java"
done

echo "done."

The next one is a poor man's deployment thingamajig. It helps copy a new release across the machines and set up the symbolic link I use for the current version in production. Of course this all varies with your setup.
#!/bin/bash
# $Revision: 1.0 $
#
# Rsync's Hadoop files across all slaves. Must run on namenode.
#
# Created 2008/01/03 by Lars George
#

if [ "$#" != "2" ]; then
echo "usage: $(basename $0) <dir-name> <ln-name>"
echo " example: $(basename $0) hbase-0.1 hbase"
exit 1
fi

for srv in $(cat /usr/local/hadoop/conf/slaves); do
  echo "Sending command to $srv..."
  rsync -vaz --exclude='logs/*' /usr/local/$1 $srv:/usr/local/
  ssh $srv "rm -fR /usr/local/$2 ; ln -s /usr/local/$1 /usr/local/$2"
done

echo "done."

I basically download a new version on the master (or build one) and issue a command like this:

$ rsyncnewhadoop /usr/local/hadoop-0.19.0 hadoop

It copies the directory across and changes the "/usr/local/hadoop" symbolic link to point to this new release.

Another helper I use quite often diffs an existing and a new version before I actually copy the new one across the cluster. It can be used like so:

$ diffnewversion /usr/local/hbase-0.19.0 /usr/local/hadoop-0.19.0

Again I assume that the current version is symlinked as explained above. Otherwise you would obviously have to make adjustments.
#!/bin/bash
#
# Diff's the configuration files between the current symlinked versions and the given one.
#
# Created 2009/01/23 by Lars George
#

if [[ $# == 0 ]]; then
  echo "usage: $(basename $0) <new_dir> [<new_dir>]"
  exit 1
fi

DIRS="conf bin"

for path in "$@"; do
  if [[ "$path" == *hadoop* ]]; then
    kind="hadoop"
  else
    kind="hbase"
  fi
  for dir in $DIRS; do
    echo
    echo "Comparing $kind $dir directory..."
    echo
    for f in /usr/local/$kind/$dir/*; do
      echo
      echo
      echo "Checking $(basename $f)"
      diff -w $f $path/$dir/$(basename $f)
      if [[ $? == 0 ]]; then
        echo "Files are the same..."
      fi
      echo
      echo "================================================================"
    done
  done
done

echo "done."

The last one I am posting here helps remove the Distributed File System (DFS) data after, for example, a complete corruption (I didn't say those happen) or when you want a clean start.

Note: It assumes that the data is stored under "/data1/hadoop" and "/data2/hadoop" - that is where I keep my data. If yours is different, adjust the paths or - if you like - grep/awk the hadoop-site.xml and parse the paths out of "dfs.name.dir" and "dfs.data.dir" respectively (a rough sketch of that follows after the script).
#!/bin/bash
# $Revision: 1.0 $
#
# Deletes all files and directories pertaining to the Hadoop DFS.
#
# Created 2008/12/12 by Lars George
#

servers="$(cat /usr/local/hadoop/conf/masters /usr/local/hadoop/conf/slaves)"
# optionally allow single server use
if [[ $# -gt 0 ]]; then
  servers="$*"
fi
first="$(echo $servers | head -n 1 | awk -F. '{ print $1 }')"
dirs="/tmp/hbase* /tmp/hsperfdata* /tmp/task* /tmp/Jetty* /data1/hadoop/* /data2/hadoop/*"

echo "IMPORTANT: Are you sure you want to delete the DFS starting with $first?"
echo "Type \"yes\" to continue:"
read yes
if [ "$yes" == "yes" ]; then
for srv in $servers; do
echo "Sending command to $srv...";
for dir in $dirs; do
pa=$(dirname $dir)
fn=$(basename $dir)
echo "removing $pa/$fn...";
ssh $srv "find $pa -name \"$fn\" -type f -delete ; rm -fR $pa/$fn"
done
done
else
echo "aborted."
fi

echo "done."

I have a few others that, for example, let me kill runaway Java processes, sync only config changes across the cluster machines, start and stop the cluster safely, and so on. I won't post them here as they are pretty trivial like the ones above, or do not differ much - though a rough sketch of the config-only sync follows below. Let me know if you have similar scripts or better ones!
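
As promised, here is that sketch of the config-only sync. It is just an assumption pieced together from the setup above (the slaves file and the /usr/local/hadoop paths), not my exact helper:
#!/bin/bash
#
# Sketch only: push just the conf/ directory to all slaves. Uses the same
# slaves file and paths as the scripts above; adjust to your setup.
#

for srv in $(cat /usr/local/hadoop/conf/slaves); do
  echo "Syncing configuration to $srv..."
  rsync -vaz /usr/local/hadoop/conf/ $srv:/usr/local/hadoop/conf/
done

echo "done."

Run it on the master after changing anything under conf/ and the change reaches all slaves in one go.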

3 comments:

  1. Simon is monitoring; handy, but not enough. What I hope to see from it is better scalability; Ganglia gets tricky to navigate with big/multiple clusters.


    Y! use bcfg2 as their configuration management tool with LDAP, somehow.

    The way we use SmartFrog for our hadoop clusters is to build/push out RPMs containing the binaries - any machine can be any type of node; they run SmartFrog as a daemon.

    Then we tell them what to be using SmartFrog, and they stay that way until rebooted; we can get a view of a whole cluster. This is good for tweaking configuration options; fun for testing. But as I'm fiddling with the Hadoop internals as well as SmartFrog, I have to push the RPMs out regularly, which means scp/ssh work. Killing all Java processes on a machine is always handy.

    Have you thought of contributing your scripts back to apache?

  2. Not sure if they fit anywhere. Usually the stuff that gets committed back is somewhat more complex. But I am happy to contribute - I just do not know where :(

    Thank you for your great insight though, this stuff is pure learning by doing. No matter how good the docs are. Appreciated!

  3. Hey, can you provide a single shell script for the entire provisioning of Hadoop on Ubuntu?
    I followed this tutorial http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ and I have installed Hadoop. I just need a shell script which does the entire job.
