Tuesday, November 24, 2009

HBase vs. BigTable Comparison

HBase is an open-source implementation of the Google BigTable architecture. That part is fairly easy to grasp. What I personally find a bit more difficult is understanding how much of BigTable HBase covers and where there are (still) differences compared to the BigTable specification. This post is an attempt to compare the two systems.

Before we embark on the dark technology side of things I would like to point out one thing upfront: HBase is very close to what the BigTable paper describes. Putting aside minor differences, as of HBase 0.20, which uses ZooKeeper as its distributed coordination service, it has all the means to be nearly an exact implementation of BigTable's functionality. What I will be looking into below are mainly subtle variations or differences. Where possible I will try to point out how the HBase team is working on improving the situation where there is a need to do so.

Scope

The comparison in this post is based on the OSDI'06 paper that describes the system Google implemented in about seven person-years and which has been in operation since 2005. The paper was published in 2006, while the HBase sub-project of Hadoop was established only around the end of that same year to early 2007. Back then the current version of Hadoop was 0.15.0. Given that we are now about two years in, with Hadoop 0.20.1 and HBase 0.20.2 available, you can hopefully understand that indeed much has happened since. Please also note that I am comparing a 14-page high-level technical paper with an open-source project that can be examined freely from top to bottom. It usually means that there is more to tell about how HBase does things, simply because the information is available.

Towards the end I will also address a few newer features that BigTable has nowadays and how HBase compares to them. We start though with naming conventions.

Terminology

There are a few different terms used in either system to describe the same thing. The most prominent is what HBase calls "regions" while Google refers to them as "tablets". These are the partitions of consecutive rows spread across many "region servers" - or "tablet servers" respectively. Apart from that, most differences are minor or caused by the usage of related technologies, since Google's code is obviously closed-source and therefore only mirrored by open-source projects. The open-source projects are free to use other terms and, most importantly, other names for the projects themselves.

Features

The following table lists various "features" of BigTable and compares them with what HBase has to offer. Some are actual implementation details, some are configurable options, and so on. This may be confusing, but it would be difficult to sort them into categories without ending up with only one entry in each of them.

Atomic Read/Write/Modify
BigTable: Yes, per row | HBase: Yes, per row
Since BigTable does not strive to be a relational database, it does not have transactions. The closest it comes to such a mechanism is the atomic access to each row in the table. HBase also implements a row lock API which allows the user to lock more than one row at a time (see the client API sketch after this table).

Lexicographic Row Order
BigTable: Yes | HBase: Yes
All rows are sorted lexicographically in one order and that one order only. Again, this is no SQL database where you can have different sort orders.

Block Support
BigTable: Yes | HBase: Yes
Within each storage file, data is written as smaller blocks of data. This enables faster loading of data from large storage files. The block size is configurable in either system; the typical size is 64K.

Block Compression
BigTable: Yes, per column family | HBase: Yes, per column family
Google uses BMDiff and Zippy in a two-step process. BMDiff works really well because neighboring key-value pairs in the store files are often very similar. This can be achieved by using versioning, so that all modifications to a value are stored next to each other but still have a lot in common, or by designing the row keys in such a way that, for example, web pages from the same site are all bundled together. Zippy then is a modified LZW algorithm. HBase on the other hand uses the standard Java-supplied GZip or, with a little effort, the GPL-licensed LZO format. There are indications though that Hadoop may also want to have BMDiff (HADOOP-5793) and possibly Zippy as well.
Number of Column Families
BigTable: Hundreds at most | HBase: Less than 100
While the number of rows and columns is theoretically unbounded, the number of column families is not. This is a design trade-off, but it does not impose too many restrictions if the tables and keys are designed accordingly.

Column Family Name Format
BigTable: Printable | HBase: Printable
The main reason for HBase here is that column family names are used as directories in the file system.

Qualifier Format
BigTable: Arbitrary | HBase: Arbitrary
Any arbitrary byte[] array can be used.

Key/Value Format
BigTable: Arbitrary | HBase: Arbitrary
Like above, any arbitrary byte[] array can be used.

Access Control
BigTable: Yes | HBase: No
BigTable enforces access control on a column family level. HBase does not yet have that feature (see HBASE-1697).

Cell Versions
BigTable: Yes | HBase: Yes
Versioning is done using timestamps. See the next feature below too. The number of versions that should be kept is freely configurable on a column family level (see the schema sketch after this table).

Custom Timestamps
BigTable: Yes (microseconds) | HBase: Yes (milliseconds)
With both systems you can either set the timestamp of a stored value yourself or leave the default of "now". There are "known" restrictions in HBase: the outcome is indeterminate when adding older timestamps after newer ones have already been stored.
Data Time-To-Live
BigTable: Yes | HBase: Yes
Besides keeping versions of data cells, the user can also set a time-to-live on the stored data, after which it is discarded.

Batch Writes
BigTable: Yes | HBase: Yes
Both systems allow batching of table operations.

Value-based Counters
BigTable: Yes | HBase: Yes
BigTable and HBase can use a specific column as an atomic counter. HBase does this by acquiring a row lock before the value is incremented.

Row Filters
BigTable: Yes | HBase: Yes
Again, both systems allow filters to be applied when scanning rows (see the scan and MapReduce sketch after this table).

Client Script Execution
BigTable: Yes | HBase: No
BigTable uses Sawzall to enable users to process the stored data.

MapReduce Support
BigTable: Yes | HBase: Yes
Both systems have convenience classes that allow scanning a table in MapReduce jobs.

Storage Systems
BigTable: GFS | HBase: HDFS, S3, S3N, EBS
While BigTable works on Google's GFS, HBase has the option to use any file system as long as there is a proxy or driver class for it.
File Format
BigTable: SSTable | HBase: HFile

Block Index
BigTable: At end of file | HBase: At end of file
Both storage file formats have a similar block-oriented structure with the block index stored at the end of the file.

Memory Mapping
BigTable: Yes | HBase: No
BigTable can memory-map storage files directly into memory.

Lock Service
BigTable: Chubby | HBase: ZooKeeper
There is a difference in that ZooKeeper is used in HBase to coordinate tasks as opposed to providing locking services. Overall though, ZooKeeper does for HBase pretty much what Chubby does for BigTable, with slightly different semantics.

Single Master
BigTable: Yes | HBase: No
HBase recently added support for multiple masters. These are on "hot" standby and monitor the master's ZooKeeper node.

Tablet/Region Count
BigTable: 10-1000 | HBase: 10-1000
Both systems recommend about the same number of regions per region server. Of course this depends on many things, but given a similar setup as far as "commodity" machines are concerned, it seems to result in the same amount of load on each server.

Tablet/Region Size
BigTable: 100-200MB | HBase: 256MB
The maximum region size can be configured for HBase and BigTable. HBase uses 256MB as the default value.

Root Location
BigTable: 1st META / Chubby | HBase: -ROOT- / ZooKeeper
HBase handles the Root table slightly differently from BigTable, where it is the first region in the Meta table. HBase uses its own table with a single region to store the Root table. Once either system starts, the address of the server hosting the Root region is stored in ZooKeeper or Chubby so that the clients can resolve its location without hitting the master.
Client Region Cache
BigTable: Yes | HBase: Yes
The clients in either system cache the locations of regions and have appropriate mechanisms to detect stale information and update the local cache accordingly.

Meta Prefetch
BigTable: Yes | HBase: No (?)
A design feature of BigTable is to fetch the information of more than one Meta region at a time. This proactively fills the client cache for future lookups.

Historian
BigTable: Yes | HBase: Yes
The history of region-related events (such as splits, assignments, reassignments) is recorded in the Meta table.

Locality Groups
BigTable: Yes | HBase: No
It is not entirely clear, but it seems everything in BigTable is defined in terms of locality groups. They group multiple column families into one so that they get stored together and also share the same configuration parameters. A single column family is probably a locality group with one member. HBase does not have this option and handles each column family separately.

In-Memory Column Families
BigTable: Yes | HBase: Yes
These are for relatively small tables that need very fast access times.

KeyValue (Cell) Cache
BigTable: Yes | HBase: No
This is a cache that serves hot cells.

Block Cache
BigTable: Yes | HBase: Yes
Blocks read from the storage files are cached internally in configurable caches.

Bloom Filters
BigTable: Yes | HBase: Yes
These filters allow you, at the cost of extra memory on the region server, to quickly check whether a specific cell possibly exists or definitely does not.
Write-Ahead Log (WAL)
BigTable: Yes | HBase: Yes
Each region server in either system stores one modification log for all regions it hosts.

Secondary Log
BigTable: Yes | HBase: No
In addition to the write-ahead log mentioned above, BigTable has a second log that it can use when the first one is going slow. This is a performance optimization.

Skip Write-Ahead Log
BigTable: ? | HBase: Yes
For bulk imports, the client in HBase can opt to skip writing into the WAL (see the client API sketch after this table).

Fast Table/Region Split
BigTable: Yes | HBase: Yes
Splitting a region or tablet is fast, as the daughter regions first read the original storage file until a compaction finally rewrites the data into the region's own store.
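
To make the column family related features above (cell versions, time-to-live, compression, in-memory serving, the block cache and bloom filters) a bit more tangible, here is a minimal schema sketch using the Java client API. This is my own illustration rather than something from either system's documentation: the table and family names are made up, and the exact HColumnDescriptor setter names may differ slightly between HBase releases.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.io.hfile.Compression;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CreateSchemaSketch {
    public static void main(String[] args) throws Exception {
      HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

      // Hypothetical table "webtable" with a single column family "content".
      HTableDescriptor table = new HTableDescriptor("webtable");
      HColumnDescriptor family = new HColumnDescriptor(Bytes.toBytes("content"));

      family.setMaxVersions(3);                             // keep up to three cell versions
      family.setTimeToLive(30 * 86400);                     // discard cells older than ~30 days (seconds)
      family.setCompressionType(Compression.Algorithm.GZ);  // per column family block compression
      family.setInMemory(false);                            // true only for small, hot families
      family.setBlockCacheEnabled(true);                    // cache blocks read from the storage files
      family.setBloomfilter(true);                          // trade region server memory for faster lookups

      table.addFamily(family);
      admin.createTable(table);
    }
  }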
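
The row locks, custom timestamps, batch writes, counters, and the option to skip the WAL all surface in the client API as well. The following client API sketch shows how they could be used together; again, the table, family, and row names are invented, and the setWriteToWAL flag is an assumption on my part, so it is commented out.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.RowLock;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ClientApiSketch {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "webtable");

      byte[] row = Bytes.toBytes("com.example/index.html");
      byte[] family = Bytes.toBytes("content");

      // Atomic per-row modification, here while explicitly holding the row lock.
      RowLock lock = table.lockRow(row);
      try {
        Put put = new Put(row, lock);
        // Explicit timestamp: milliseconds in HBase, microseconds in BigTable.
        put.add(family, Bytes.toBytes("html"), System.currentTimeMillis(),
            Bytes.toBytes("<html>...</html>"));
        table.put(put);
      } finally {
        table.unlockRow(lock);
      }

      // Batch writes: buffer puts on the client and send them in fewer round trips.
      table.setAutoFlush(false);
      for (int i = 0; i < 1000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.add(family, Bytes.toBytes("html"), Bytes.toBytes("..."));
        // put.setWriteToWAL(false);  // assumed flag to skip the WAL for bulk imports
        table.put(put);
      }
      table.flushCommits();

      // Value based counter, incremented under a row lock by the region server.
      long views = table.incrementColumnValue(row, family, Bytes.toBytes("views"), 1);
      System.out.println("views = " + views);
    }
  }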
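
Finally, here is a scan and MapReduce sketch that combines a server-side row filter with the MapReduce convenience classes. The PrefixFilter, the mapper body, and all names are purely illustrative.

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.PrefixFilter;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class FilteredScanJob {

    // Emits one count per matching row; purely illustrative.
    static class RowCountMapper extends TableMapper<Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
          throws IOException, InterruptedException {
        context.write(new Text(Bytes.toString(rowKey.get())), ONE);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new HBaseConfiguration(), "filtered-scan");
      job.setJarByClass(FilteredScanJob.class);

      // Server-side row filter: only rows for one (hypothetical) site are returned.
      Scan scan = new Scan();
      scan.setFilter(new PrefixFilter(Bytes.toBytes("com.example/")));

      // Convenience setup for scanning an HBase table from a MapReduce job.
      TableMapReduceUtil.initTableMapperJob("webtable", scan, RowCountMapper.class,
          Text.class, IntWritable.class, job);
      job.setNumReduceTasks(0);  // map only, results go straight to the output files
      FileOutputFormat.setOutputPath(job, new Path(args[0]));
      job.waitForCompletion(true);
    }
  }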

New Features

As mentioned above, a few years have passed since the original OSDI'06 BigTable paper. Jeff Dean - a fellow at Google - has mentioned a few new BigTable features during speeches and presentations he gave recently. We will have a look at some of them here.

Client Isolation
BigTable: Yes | HBase: No
BigTable is used internally to serve many separate clients and can therefore keep their data isolated from each other.

Coprocessors
BigTable: Yes | HBase: No
BigTable can host code that resides with the regions and splits with them as well. See HBASE-2000 for progress on this feature within HBase.

Corruption Safety
BigTable: Yes | HBase: No
This is an interesting topic. BigTable uses CRC checksums to verify that data has been written safely. While HBase does not have this itself, the question is whether that is built into Hadoop's HDFS.

Replication
BigTable: Yes | HBase: No
HBase is working on the same topic in HBASE-1295.

Note: the color codes indicate which features have a direct match and which are (still) missing. Weaker features are colored yellow, as I am not sure if they are immediately necessary or even applicable given HBase's implementation.

Variations and Differences

Some of the above features need a bit more looking into, as they are difficult to narrow down to simple "yes or no" answers. I am addressing them separately below.

Lock Service

This is from the BigTable paper:
Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.

There is a lot of overlap with how HBase uses ZooKeeper. What is different though is that schema information is not stored in ZooKeeper (yet; see http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases for details). What is important here is the same reliance on the lock service being available. From my own experience and from reading the threads on the HBase mailing list, it is often underestimated what can happen when ZooKeeper does not get the resources it needs to respond in a timely manner. It is better to have a small ZooKeeper cluster on older machines not doing anything else, as opposed to having ZooKeeper nodes running next to the already heavy Hadoop or HBase processes. Once you starve ZooKeeper you will see a domino effect of HBase nodes going down with it - including the master(s).

Update: After talking to a few members of the ZooKeeper team I would like to point out that this is indeed not a ZooKeeper issue. It has to do with the fact that if an already heavily loaded node also has to respond to ZooKeeper in time, you may face a timeout situation where the HBase RegionServers, and even the Master, think that their coordination service is gone and shut themselves down. Patrick Hunt has responded to this by mail and by post. Please read both to see that ZooKeeper is able to handle the load. I personally recommend setting up ZooKeeper for HBase on a separate cluster, maybe a set of spare machines left over from a recent cluster upgrade that are slightly outdated (no 2x quad-core CPU with 16GB of memory) but otherwise perfectly fine. This also allows you to monitor the machines separately, instead of seeing a combined CPU load of 100% on the servers and not really knowing where it comes from and what effect it may have.
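
To illustrate the separate-cluster setup, this is roughly how a client side configuration would point HBase at a dedicated ZooKeeper ensemble; the host names are made up, and normally you would set these properties in hbase-site.xml rather than in code.

  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class ZooKeeperQuorumSketch {
    public static void main(String[] args) {
      HBaseConfiguration conf = new HBaseConfiguration();
      // Dedicated ZooKeeper ensemble on three older, otherwise idle machines.
      conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
      conf.set("hbase.zookeeper.property.clientPort", "2181");
      // With HBASE_MANAGES_ZK=false in hbase-env.sh, HBase will not start its own ZooKeeper.
      System.out.println("Using quorum: " + conf.get("hbase.zookeeper.quorum"));
    }
  }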

Another important difference is that ZooKeeper is not a lock service like Chubby - and I do think it does not have to be as far as HBase is concerned. ZooKeeper is a distributed coordination service enabling HBase to do master node elections and the like. It also allows using semaphores to indicate state or required actions. So where Chubby creates a lock file to indicate that a tablet server is up and running, HBase uses ephemeral nodes that exist as long as the session between the RegionServer that created the node and ZooKeeper is active. This also causes differences in semantics: in BigTable the master can delete a tablet server's lock file to indicate that the server has lost its lease on its tablets, while in HBase this has to be handled differently because of the slightly less restrictive architecture of ZooKeeper. These are only semantics, as mentioned, and do not mean one is better than the other - just different.
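
As a small illustration of the ephemeral node mechanism, here is a stand-alone ZooKeeper sketch; the connect string and znode path are invented and are not the paths HBase uses internally.

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class EphemeralNodeSketch {
    public static void main(String[] args) throws Exception {
      ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, new Watcher() {
        public void process(WatchedEvent event) { /* session events arrive here */ }
      });

      // The znode exists only as long as this client's session is alive; if the process
      // dies, or is starved long enough for the session to expire, the node disappears
      // and any watchers (for example a master) are notified.
      zk.create("/regionserver-1", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

      Thread.sleep(60000);  // pretend to serve regions for a while
      zk.close();           // closing the session removes the ephemeral node
    }
  }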

Back to the BigTable paper, which describes the tablet location hierarchy like so:

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially - it is never split - to ensure that the tablet location hierarchy has no more than three levels.

As mentioned above, in HBase the root region is its own table with a single region. I strongly doubt that it makes a difference compared to having it as the first (non-splittable) region of the meta table. It is just the same feature implemented differently.

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.

HBase does have a different layout here. It stores the start and end row with each region, where the end row is exclusive and denotes the first (or start) row of the next region. Again, these are minor differences and I am not sure if there is a better or worse solution - it is just done differently.
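
If you are curious, you can inspect this layout yourself since .META. is just another HBase table: its row keys encode the user table name, the start row, and a region id. The following sketch assumes the "info:regioninfo" column and the Writables helper as found in the 0.20 code base.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HRegionInfo;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.hbase.util.Writables;

  public class MetaLayoutSketch {
    public static void main(String[] args) throws Exception {
      HTable meta = new HTable(new HBaseConfiguration(), ".META.");
      ResultScanner scanner = meta.getScanner(new Scan());
      for (Result result : scanner) {
        // The row key looks like "<tablename>,<startrow>,<regionid>".
        String rowKey = Bytes.toString(result.getRow());
        HRegionInfo info = Writables.getHRegionInfo(
            result.getValue(Bytes.toBytes("info"), Bytes.toBytes("regioninfo")));
        System.out.println(rowKey + " -> start='" + Bytes.toString(info.getStartKey())
            + "' end='" + Bytes.toString(info.getEndKey()) + "'");
      }
      scanner.close();
    }
  }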

Master Operation

To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server's file. If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. Once a server's file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires.

This is quite different, even up to the current HBase 0.20.2. Here the region servers use a heartbeat protocol to report for duty to the master and, subsequently, to signal that they are still alive. I am not sure if this topic is covered by the master rewrite umbrella issue HBASE-1816 - and if it needs to be addressed at all. It could well be that what we have in HBase now is sufficient and does its job just fine. It was created when there was no lock service yet and could therefore be considered legacy code too.

Master Startup

The master executes the following steps at startup. (1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. (2) The master scans the servers directory in Chubby to find the live servers. (3) The master communicates with every live tablet server to discover what tablets are already assigned to each server. (4) The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.

In line with what I mentioned above, this part of the code was created before ZooKeeper was available. So HBase actually waits for the region servers to report for duty. It also scans the .META. table to learn what regions exist and which server each is assigned to. ZooKeeper is (as of yet) only used to publish the server hosting the -ROOT- region.

Tablet/Region Splits

In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.

The master node in HBase uses the .META. table solely to detect when a region was split but the notification was lost. For that reason it scans the .META. on a regular basis to see when a region appears that is not yet assigned. It will then assign that region as per its default strategy.

Compactions

The following are more terminology differences than anything else.

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

HBase has a similar operation, but it is referred to as a "flush". As opposed to that, "minor compactions" in HBase rewrite the last N store files, i.e. those with the most recent mutations, as they are probably much smaller than previously created files that hold more data.

... we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

This again refers to what is called "minor compaction" in HBase.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.

Here we have an exact match though: a "major compaction" in HBase also rewrites all store files into one.

Immutable Files

Knowing that files are fixed once written, BigTable makes the following assumption:

The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.

I do believe this is done similarly in HBase, but I am not sure. It certainly shares the same architecture, as HDFS files, for example, are also immutable once written.

I can only recommend that you read the BigTable paper too and make up your own mind. This post was inspired by the idea of learning what BigTable really has to offer and how much of it HBase already covers. The difficult part is of course that there is not too much information available on BigTable. But the numbers that even the 2006 paper lists are more than impressive. If HBase, as an open-source project with just a handful of committers, most of whom have full-time day jobs, can achieve something even remotely comparable, I think that is a huge success. And looking at the 0.21 and 0.22 road maps, the already small gap is going to shrink even further!

11 comments:

  1. Another great post Lars! My only complaint would be that you don't post daily :) Keep up the good work my friend.

    By the way, perhaps the Single Master entry for Bigtable should be yellow since I came across this piece (http://queue.acm.org/detail.cfm?id=1594206):

    "For these and other reasons, engineers at Google have been working for much of the past two years on a new distributed master system designed to take full advantage of BigTable to attack some of those problems that have proved particularly difficult for GFS."

  2. > BigTable uses CRC checksums to verify if data
    > has been written safely. While HBase does not
    > have this, the question is if that is build
    > into Hadoop's HDFS?
    Yes, HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.

    @fuentesjr Thanks Sal, I am trying :) You are right, I read the note too that they are redesigning the single master architecture. That post is mainly about GFS though, which is Hadoop in our case. Reading it, it does not seem to indicate what BigTable does nowadays. Right?

    @Igor Thanks for clarifying this. What was not really clear to me is how Jeff Dean speaks about corruption issues and what they mean for the Hadoop stack. I am aware of what can go wrong and that, given a large enough cluster, you always have something failing. But the added effort they put in to guard against this - is it because of using low-level programming languages such as C or C++ and having to deal with issues directly on that level? Thinking about memory failures, disk corruption, etc., does Hadoop have enough in place to cover all of the "possible" types of failures? Or should more effort be spent on finding out if there is more work to be done? Given the large Hadoop clusters out there and the lack of this discussion, I am personally assuming this is already taken care of.

  4. You should also consider covering Hypertable here as well, since it is also a Bigtable implementation (however, in C++ for significant performance improvements)

  5. Lars this is an awesome post, keep up the good work! I also appreciate you posting the update section clarifying some issues wrt ZooKeeper integration and the work we (ZK team) have been doing with the HBase team.

    http://wiki.apache.org/hadoop/ZooKeeper/HBaseAndZooKeeper

    Regards!

    Hi Lars, great post, very informative. But I created an HBase table with more than 1200 column families, while in your comparison you said the maximum allowed number of column families is less than 100. Is there any reason for that?

  7. great comparison, thanks a lot. This benchmark is also very helpful: http://www.slideshare.net/schubertzhang/hbase-0200-performance-evaluation

  8. Great comparison... Really helpful to consider various parameters.
    I am still not clear about the parent child relationship between tables that Bigtable claims to support. Is HBase planning to support that too?

    Great post Lars! Your blog is the most informative place to learn HBase apart from the official HBase site.
