Tuesday, October 20, 2009

HBase on Cloudera Training Virtual Machine (0.3.1)

You might want to run HBase on Cloudera's Virtual Machine to get a quick start on a prototyping setup. In theory you download the VM, start it, and you are ready to go. There are a few issues though, the worst being that the current Hadoop Training VM does not include HBase at all. Also, Cloudera ships a specific version of Hadoop that it deems stable and maintains its own release cycle, so Cloudera's version of Hadoop is 0.18.3. HBase, however, needs Hadoop 0.20. We are in luck, though: Andrew Purtell of TrendMicro maintains a special branch of HBase 0.20 that works with Cloudera's release.

Here are the steps to get HBase running on Cloudera's VM:
  1. Download VM

    Get it from Cloudera's website.
  2. Start VM

    As the above page states: "To launch the VMWare image, you will either need VMware Player for windows and linux, or VMware Fusion for Mac."

    Note: I have Parallels for Mac and wanted to use that. I used Parallels Transporter to convert the "cloudera-training-0.3.1.vmx" to a new "cloudera-training-0.2-cl3-000002.hdd", then created a new VM in Parallels, selecting Ubuntu Linux as the OS and the newly created .hdd as the disk image. Boot up the VM and you are up and running. I also gave it a bit more memory for the graphics so that I could switch the VM to 1440x900, the native resolution of my MacBook Pro.

    Finally follow the steps explained on the page above, i.e. open a Terminal and issue:
    $ cd ~/git
    $ ./update-exercises --workspace
  3. Pull HBase branch

    Open a new Terminal (or issue a $ cd .. in the open one), then:
    $ sudo -u hadoop git clone http://git.apache.org/hbase.git /home/hadoop/hbase
    $ sudo -u hadoop sh -c "cd /home/hadoop/hbase ; git checkout origin/0.20_on_hadoop-0.18.3"
    HEAD is now at c050f68... pull up to release

    First we clone the repository, then switch to the actual branch. You will notice that I am using sudo -u hadoop: Hadoop itself is started under that account, so I wanted HBase to match. Also, the default "training" account does not have SSH set up as explained in Hadoop's quick-start guide. When sudo asks for a password, use the default, which is "training".
  4. Build Branch

    Continue in Terminal:
    $ sudo -u hadoop sh -c "cd /home/hadoop/hbase/ ; export PATH=$PATH:/usr/share/apache-ant-1.7.1/bin ; ant package"
  5. Configure HBase

    There are a few edits to be made to get HBase running.
    $ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-site.xml
    $ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-env.sh 
    # The java implementation to use.  Java 1.6 required.
    # export JAVA_HOME=/usr/java/jdk1.6.0/
    export JAVA_HOME=/usr/lib/jvm/java-6-sun

    Note: There is a small glitch in revision 826669 of that Cloudera-specific HBase branch. The master UI (on port 60010 on localhost) will not start because the start-up script scans a path that differs from the one used in this build, so the Jetty packages are not picked up. You can fix it by editing the start-up script and changing the path scanned:
    $ sudo -u hadoop vim /home/hadoop/hbase/build/bin/hbase

    Change this line:
    for f in $HBASE_HOME/lib/jsp-2.1/*.jar; do
    to:
    for f in $HBASE_HOME/lib/jetty-ext/*.jar; do

    This is only needed until the developers fix it in the branch (compare the revision I used, r813052, with what you get). If you do not need the UI you can ignore this, and the resulting error in the logs, too. HBase will still run, just without its web-based interface.
  6. Rev up the Engine!

    The final thing is to start HBase:
    $ sudo -u hadoop /home/hadoop/hbase/build/bin/start-hbase.sh
    $ sudo -u hadoop /home/hadoop/hbase/build/bin/hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Version: 0.20.0-0.18.3, r813052, Mon Oct 19 06:51:57 PDT 2009
    hbase(main):001:0> list
    0 row(s) in 0.2320 seconds
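
If you would rather not edit the start-up script by hand, the path change from step 5 can be scripted. The following is a minimal sketch and not part of the branch: patch_hbase_script is a hypothetical helper name, and it assumes the script still contains the lib/jsp-2.1 loop shown in that step. It writes the result back in place so the file keeps its owner and permissions.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with HBase): point the jar-scanning loop in
# an HBase start-up script at lib/jetty-ext instead of lib/jsp-2.1.
patch_hbase_script() {
    script="$1"
    tmp="$(mktemp)" || return 1
    # Rewrite only the directory name; everything else is left untouched.
    # The contents are copied back (not moved) so the original file mode,
    # including the executable bit, is preserved.
    sed 's|/lib/jsp-2\.1/|/lib/jetty-ext/|' "$script" > "$tmp" && cat "$tmp" > "$script"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Run it as the hadoop user against /home/hadoop/hbase/build/bin/hbase before starting HBase.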


This sums it up. I hope you give HBase on the Cloudera Training VM a whirl as it also has Eclipse installed and therefore provides a quick start into Hadoop and HBase.

Just keep in mind that this is for prototyping only! With such a setup you will only be able to insert a handful of rows. If you overdo it you will bring it to its knees very quickly. But you can safely use it to play around with the shell to create tables or use the API to get used to it and test changes in your code etc.
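
To play around with the shell as suggested, a short smoke test could look like the sketch below. The table name testtable and the column family colfam1 are made-up examples, hbase_smoke_commands is a hypothetical helper, and feeding the commands to the shell on standard input is an assumption about how this build's shell behaves. You can also simply type the same five lines at the hbase(main) prompt.

```shell
#!/bin/sh
# Hypothetical helper: print a few HBase shell commands that create a table,
# store one cell, read it back, and clean up again.
hbase_smoke_commands() {
    table="$1"
    family="$2"
    cat <<EOF
create '${table}', '${family}'
put '${table}', 'row1', '${family}:greeting', 'hello'
scan '${table}'
disable '${table}'
drop '${table}'
EOF
}

# Example (assumes the paths used in the steps above):
# hbase_smoke_commands testtable colfam1 | \
#     sudo -u hadoop /home/hadoop/hbase/build/bin/hbase shell
```

The table is disabled before it is dropped because the shell refuses to drop an enabled table.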

Update: Updated title to include version number, fixed XML


  1. Great post, Lars. Hope to see more HBase posts as well. :)

    One thing though:
    training@training-vm:~$ hadoop version
    Hadoop 0.20.1+133
    Subversion -r cf888d18fcd414b839d23b7e61208e3f0118f15b
    Compiled by root on Sun Sep 27 00:27:10 UTC 2009

    Does this mean we don't need the special HBase branch anymore?

  2. Hi sf, it seems, as you say, that cloudera-training-0.3.2 is updated to the latest release. Very nice. I checked and, by the looks of it, it does not have HBase installed. I will try it out and post my findings here. Thanks!

  3. Hbase is the Hadoop database. Think of it as a distributed, scalable, big data store.