Tuesday, November 24, 2009

HBase vs. BigTable Comparison

HBase is an open-source implementation of the Google BigTable architecture. That part is fairly easy to understand and grasp. What I personally feel is a bit more difficult is to understand how much HBase covers and where there are differences (still) compared to the BigTable specification. This post is an attempt to compare the two systems.

Before we embark onto the dark technology side of things I would like to point out one thing upfront: HBase is very close to what the BigTable paper describes. Putting aside minor differences, as of HBase 0.20, which is using ZooKeeper as its lock distributed coordination service, it has all the means to be nearly an exact implementation of BigTable's functionality. What I will be looking into below are mainly subtle variations or differences. Where possible I will try to point out how the HBase team is working on improving the situation given there is a need to do so.


The comparison in this post is based on the OSDI'06 paper that describes the system Google implemented in about seven person-years and which is in operation since 2005. The paper was published 2006 while the HBase sub-project of Hadoop was established only around the end of that same year to early 2007. Back then the current version of Hadoop was 0.15.0. Given we are now about 2 years in, with Hadoop 0.20.1 and HBase 0.20.2 available, you can hopefully understand that indeed much has happened since. Please also note that I am comparing a 14 page high level technical paper with an open-source project that can be examined freely from top to bottom. It usually means that there is more to tell about how HBase does things because the information is available.

Towards the end I will also address a few newer features that BigTable has nowadays and how HBase is comparing to those. We start though with naming conventions.


There are a few different terms used in either system describing the same thing. The most prominent being what HBase calls "regions" while Google refers to it as "tablet". These are the partitions of subsequent rows spread across many "region servers" - or "tablet server" respectively. Apart from that most differences are minor or caused by usage of related technologies since Google's code is obviously closed-source and therefore only mirrored by open-source projects. The open-source projects are free to use other terms and most importantly names for the projects themselves.


The following table lists various "features" of BigTable and compares them with what HBase has to offer. Some are actual implementation details, some are configurable option and so on. This may be confusing but it would be difficult to sort them into categories and not ending up with one entry only in each of them.

 Google BigTable 
 Apache HBase 
Atomic Read/Write/Modify
Yes, per row
Yes, per row
Since BigTable does not strive to be a relational database it does not have transactions. The closest to such a mechanism is the atomic access to each row in the table. HBase also implements a row lock API which allows the user to lock more than one row at a time.
Lexicographic Row Order
All rows are sorted lexicographically in one order and that one order only. Again, this is no SQL database where you can have different sorting orders.
Block Support
Within each storage file data is written as smaller blocks of data. This enables faster loading of data from large storage files. The size is configurable in either system. The typical size is 64K.
Block Compression
Yes, per column family
Yes, per column family
Google uses BMDiff and Zippy in a two step process. BMDiff works really well because neighboring key-value pairs in the store files are often very similar. This can be achieved by using versioning so that all modifications to a value are stored next to each other but still have a lot in common. Or by designing the row keys in such a way that for example web pages from the same site are all bundled. Zippy then is a modified LZW algorithm. HBase on the other hand uses the standard Java supplied GZip or with a little effort the GPL licensed LZO format. There are indications though that Hadoop also may want to have BMDiff (HADOOP-5793) and possibly Zippy as well.
Number of Column Families
Hundreds at Most
Less than 100
While the number of rows and columns is theoretically unbound the number of column families is not. This is a design trade-off but does not impose too much restrictions if the tables and key are designed accordingly.
Column Family Name Format
The main reason for HBase here is that column family names are used as directories in the file system.
Qualifier Format
Any arbitrary byte[] array can be used.
Key/Value Format
Like above, any arbitrary byte[] array can be used.
Access Control
BigTable enforces access control on a column family level. HBase does not have yet have that feature (see HBASE-1697).
Cell Versions
Versioning is done using timestamps. See next feature below too. The number of versions that should be kept are freely configurable on a column family level.
Custom Timestamps
Yes (micro)
Yes (milli)
With both systems you can either set the timestamp of a value that is stored yourself or leave the default "now". There are "known" restrictions in HBase that the outcome is indeterminate when adding older timestamps after already having stored newer ones beforehand.
Data Time-To-Live
Besides having versions of data cells the user can also set a time-to-live on the stored data that allows to discard data after a specific amount of time.
Batch Writes
Both systems allow to batch table operations.
Value based Counters
BigTable and HBase can use a specific column as atomic counters. HBase does this by acquiring a row lock before the value is incremented.
Row Filters
Again both system allow to apply filters when scanning rows.
Client Script Execution
BigTable uses Sawzall to enable users to process the stored data.
MapReduce Support
Both systems have convenience classes that allow scanning a table in MapReduce jobs.
Storage Systems
While BigTable works on Google's GFS, HBase has the option to use any file system as long as there is a proxy or driver class for it.
File Format
Block Index
At end of file
At end of file
Both storage file formats have a similar block oriented structure with the block index stored at the end of the file.
Memory Mapping
BigTable can memory map storage files directly into memory.
Lock Service
There is a difference in where ZooKeeper is used to coordinate tasks in HBase as opposed to provide locking services. Overall though ZooKeeper does for HBase pretty much what Chubby does for BigTable with slightly different semantics.
Single Master
HBase recently added support for multiple masters. These are on "hot" standby and monitor the master's ZooKeeper node.
Tablet/Region Count
Both systems recommend about the same amount of regions per region server. Of course this depends on many things but given a similar setup as far as "commodity" machines are concerned it seems to result in the same amount of load on each server.
Tablet/Region Size
The maximum region size can be configured for HBase and BigTable. HBase used 256MB as the default value.
Root Location
1st META / Chubby
-ROOT- / ZooKeeper
HBase handles the Root table slightly different from BigTable, where it is the first region in the Meta table. HBase uses its own table with a single region to store the Root table. Once either system starts the address of the server hosting the Root region is stored in ZooKeeper or Chubby so that the clients can resolve its location without hitting the master.
Client Region Cache
The clients in either system caches the location of regions and has appropriate mechanisms to detect stale information and update the local cache respectively
Meta Prefetch
No (?)
A design feature of BigTable is to fetch more than one Meta region information. This proactively fills the client cache for future lookups.
The history of region related events (such as splits, assignment, reassignment) is recorded in the Meta table.
Locality Groups
It is not entirely clear but it seems everything in BigTable is defined by Locality Groups. The group multiple column families into one so that they get stored together and also share the same configuration parameters. A single column family is probably a Locality Group with one member. HBase does not have this option and handles each column family separately.
In-Memory Column Families
These are for relatively small tables that need very fast access times.
KeyValue (Cell) Cache
This is a cache that servers hot cells.
Block Cache
Blocks read from the storage files are cached internally in configurable caches.
Bloom Filters
These filters allow - at a cost of using memory on the region server - to quickly check if a specific cell exists or maybe not.
Write-Ahead Log (WAL)
Each region server in either system stores one modification log for all regions it hosts.
Secondary Log
In addition to the Write-Ahead log mentioned above BigTable has a second log that it can use when the first is going slow. This is a performance optimization.
Skip Write-Ahead Log
For bulk imports the client in HBase can opt to skip writing into the WAL.
Fast Table/Region Split
Splitting a region or tablet is fast as the daughter regions first read the original storage file until a compaction finally rewrites the data into the region's local store.

New Features

As mentioned above, a few years have passed since the original OSDI'06 BigTable paper. Jeff Dean - a fellow at Google - has mentioned a few new BigTable features during speeches and presentations he gave recently. We will have a look at some of them here.

 Google BigTable 
 Apache HBase 
Client Isolation
BigTable is internally used to server many separate clients and can therefore keep the data between isolated.
BigTable can host code that resides with the regions and splits with them as well. See HBASE-2000 for progress on this feature within HBase.
Corruption Safety
This is an interesting topic. BigTable uses CRC checksums to verify if data has been written safely. While HBase does not have this, the question is if that is build into Hadoop's HDFS?
HBase is working on the same topic in HBASE-1295

Note: the color codes indicate what features have a direct match or where it is missing (yet). Weaker features are colored yellow, as I am not sure if they are immediately necessary or even applicable given HBase's implementation.

Variations and Differences

Some of the above features need a bit more looking into as they are difficult to be narrowed down to simple "Yay or Nay" questions. I am addressing them below separately.

Lock Service

This is from the BigTable paper:
Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.

There is a lot of overlap compared to how HBase does use ZooKeeper. What is different though is that schema information is not stored in ZooKeeper (yet, see http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases) for details. What is important here though is the same reliance on the lock service being available. From my own experience and reading the threads on the HBase mailing list it is often underestimated what can happen when ZooKeeper does not get the resources it needs to react timely. It is better to have a small ZooKeeper cluster on older machines not doing anything else as opposed to having ZooKeeper nodes running next to the already heavy Hadoop or HBase processes. Once you starve ZooKeeper you will see a domino effect of HBase nodes going down with it - including the master(s).

Update: After talking to a few guys of the ZooKeeper team I would like to point out that this is indeed not a ZooKeeper issue. It has to do with the fact that if you have an already heavily loaded node trying to also respond in time to ZooKeeper resources then you may face a timeout situation where the HBase RegionServers and even the Master may think that their coordination service is gone and shut themselves down. Patrick Hunt has responded to this by mail and by post. Please read both to see that ZooKeeper is able to handle the load. I personally recommend to set up ZooKeeper in combination with HBase on a separate cluster, maybe a set of spare machines you have from a recent update to the cluster and which are slightly outdated (no 2xquad core CPU with 16GB of memory) but are otherwise perfectly fine. This also allows you to monitor the machines separately and not having to see a combined CPU load of 100% on the servers and not really knowing where it comes from and what effect it may have.

Another important difference is that ZooKeeper is no lock service like Chubby - and I do think it does not have to be as far as HBase is concerned. ZooKeeper is a distributed coordination service enabling HBase to do Master node elections etc. It also allows using semaphores to indicate state or actions required. So where Chubby creates a lock file to indicate a tablet server is up and running HBase in turn uses ephemeral nodes that exist as long as the session between the RegionServer which creates that node and ZooKeeper is active. This also causes the differences in semantics where in BigTable can delete a tablet servers lock file to indicate that it has lost its lease on tablets. In HBase this has to be handled differently because of the slightly less restrictive architecture of ZooKeeper. These are only semantics as mentioned and do not mean one is better than the other - just different.

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially - it is never split - to ensure that the tablet location hierarchy has no more than three levels.

As mentioned above in HBase the root region is its own table with a single region. If that makes a difference to having it as the first (non-splittable) region of the meta table I doubt strongly. It is just the same feature but implemented differently.

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.

HBase does have a different layout here. It stores the start and end row with each region where the end row is exclusive and denotes the first (or start) row of the next region. Again, these are minor differences and I am not sure if there is a better or worse solution. It is just done differently.

Master Operation

To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server's file. If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. Once a server's file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires.

This is quite different even up to the current HBase 0.20.2. Here the master uses a heartbeat protocol that is used by the region servers to report for duty and that they are still alive subsequently. I am not sure if this topic is covered by the master rewrite umbrella issue HBASE-1816 - and if it needs to be addressed at all. It could well be that what we have in HBase now is sufficient and does its job just fine. It was created when there was no lock service yet and therefore could be considered legacy code too.

Master Startup

The master executes the following steps at startup. (1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. (2) The master scans the servers directory in Chubby to find the live servers. (3) The master communicates with every live tablet server to discover what tablets are already assigned to each server. (4) The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.

Along what I mentioned above, this part of the code was created before ZooKeeper was available. So HBase actually waits for the region servers to report for duty. It also scans the .META. table to learn what is there and which server is assigned to it. ZooKeeper is (yet) only used to publish the server hosting the -ROOT- region.

Tablet/Region Splits

In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.

The master node in HBase uses the .META. solely to detect when a region was split but the message was lost. For that reason it scans the .META. on a regular basis to see when a region appears that is not yet assigned. It will then assign that region as per its default strategy.


The following are more terminology differences than anything else.

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

HBase has a similar operation but it is referred to as a "flush". Opposed to that "minor compactions" in HBase rewrite the last N used store files, i.e. those with the most recent mutations as they are probably much smaller than previously created files that have more data in them.

... we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

This again refers to what is called "minor compaction" in HBase.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.

Here we have an exact match though, a "major compaction" in HBase also rewrites all files into one.

Immutable Files

Knowing that files are fixed once written BigTable makes the following assumption:

The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.

I do believe this is done similar in HBase but am not sure. It certainly has the same architecture as HDFS files for example are also immutable once written.

I can only recommend that you read the BigTable too and make up your own mind. This post was inspired by the idea to learn what BigTable really has to offer and how much HBase has already covered. The difficult part is of course that there is not too much information available on BigTable. But the numbers even the 2006 paper lists are more than impressive. If HBase as on open-source project with just a handful of committers of whom most have a full-time day jobs can achieve something even remotely comparable I think this is a huge success. And looking at the 0.21 and 0.22 road map, the already small gap is going to shrink even further!

Friday, November 20, 2009

HBase on Cloudera Training Virtual Machine (0.3.2)

Note: This is a follow up to my earlier post. Since then Cloudera released a new VM that includes the current 0.20 branch of Hadoop. Below I have the same post adjusted to work with that new release. Please note that there are subtle changes, for example the NameNode port has changed. So if you in any way still have the older post please make sure you forget about it and follow this one here instead for the new VM version.

You might want to run HBase on Cloudera's Virtual Machine to get a quick start to a prototyping setup. In theory you download the VM, start it and you are ready to go. The main issue though is that the current Hadoop Training VM does not include HBase at all (yet?). Apart from that the install of a local HBase instance is a straight forward process.

Here are the steps to get HBase running on Cloudera's VM:
  1. Download VM

    Get it from Cloudera's website.

  2. Start VM

    As the above page states: "To launch the VMWare image, you will either need VMware Player for windows and linux, or VMware Fusion for Mac."

    Note: I have Parallels for Mac and wanted to use that. I used Parallels Transporter to convert the "cloudera-training-0.3.2.vmx" to a new "cloudera-training-0.2-cl4-000001.hdd", create a new VM in Parallels selecting Ubuntu Linux as the OS and the newly created .hdd as the disk image. Boot up the VM and you are up and running. I gave it a bit more memory for the graphics to be able to switch the VM to 1440x900 which is the native screen resolution on my MacBook Pro I am using.

    Finally follow the steps explained on the page above, i.e. open a Terminal and issue:
    $ cd ~/git
    $ ./update-exercises --workspace

  3. Pull HBase branch

    We are using the brand new HBase 0.20.2 release. Open a new Terminal (or issue a $ cd .. in the open one), then:
    $ sudo -u hadoop git clone http://git.apache.org/hbase.git /home/hadoop/hbase
    $ sudo -u hadoop sh -c "cd /home/hadoop/hbase ; git checkout origin/tags/0.20.2"
    Note: moving to "origin/tags/0.20.2" which isn't a local branch
    If you want to create a new branch from this checkout, you may do so
    (now or later) by using -b with the checkout command again. Example:
      git checkout -b <new_branch_name>
    HEAD is now at 777fb63... HBase release 0.20.2

    First we clone the repository, then switch to the actual branch. You will notice that I am using sudo -u hadoop because Hadoop itself is started under that account and so I wanted it to match. Also, the default "training" account does not have SSH set up as explained in Hadoop's quick-start guide. When sudo is asking for a password use the default, which is set to "training".

    You can ignore the messages git prints out while performing the checkout.

  4. Build Branch

    Continue in Terminal:
    $ sudo -u hadoop sh -c "cd /home/hadoop/hbase/ ; export PATH=$PATH:/usr/share/apache-ant-1.7.1/bin ; ant package"

  5. Configure HBase

    There are a few edits to be made to get HBase running.
    $ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-site.xml
    $ sudo -u hadoop vim /home/hadoop/hbase/build/conf/hbase-env.sh 
    # The java implementation to use.  Java 1.6 required.
    # export JAVA_HOME=/usr/java/jdk1.6.0/
    export JAVA_HOME=/usr/lib/jvm/java-6-sun

  6. Rev up the Engine!

    The final thing is to start HBase:
    $ sudo -u hadoop /home/hadoop/hbase/build/bin/start-hbase.sh
    $ sudo -u hadoop /home/hadoop/hbase/build/bin/hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Version: 0.20.2, r777fb63ff0c73369abc4d799388a45b8bda9e5fd, Thu Nov 19 15:32:17 PST 2009


    Let's create a table and check if it was created OK.
    hbase(main):001:0> list
    0 row(s) in 0.0910 seconds
    hbase(main):002:0> create 't1', 'f1', 'f2', 'f3'
    0 row(s) in 6.1260 seconds
    hbase(main):003:0> list                         
    1 row(s) in 0.0470 seconds
    hbase(main):004:0> describe 't1'                
    DESCRIPTION                                                             ENABLED                               
     {NAME => 't1', FAMILIES => [{NAME => 'f1', COMPRESSION => 'NONE', VERS true                                  
     IONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => '                                       
     false', BLOCKCACHE => 'true'}, {NAME => 'f2', COMPRESSION => 'NONE', V                                       
     ERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =                                       
     > 'false', BLOCKCACHE => 'true'}, {NAME => 'f3', COMPRESSION => 'NONE'                                       
     , VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMOR                                       
     Y => 'false', BLOCKCACHE => 'true'}]}                                                                        
    1 row(s) in 0.0750 seconds
This sums it up. I hope you give HBase on the Cloudera Training VM a whirl as it also has Eclipse installed and therefore provides a quick start into Hadoop and HBase.

Just keep in mind that this is for prototyping only! With such a setup you will only be able to insert a handful of rows. If you overdo it you will bring it to its knees very quickly. But you can safely use it to play around with the shell to create tables or use the API to get used to it and test changes in your code etc.

Finally a screenshot of the running HBase UI: