Friday, February 12, 2010

FOSDEM 2010 NoSQL Talk

Let me take a minute to wrap up my FOSDEM 2010 experience. I was part of the NoSQL DevRoom organized by @stevenn from Outerthought, who I had the pleasure to visit before a few months back as an HBase Ambassador.

First things first, the NoSQL DevRoom was just an absolute success and I had a blast attending it. I also made sure to not walk around and see other talks outside the NoSQL track while there were many and plenty good ones. I did so deliberately to see the other projects and what they have to offer. I thought it was great, a good vibe was felt throughout the whole day as the audience got a whirlwind tour through the NoSQL landscape. The room was full to the brim for most presentations and some folks had to miss out as we could not have had more enter. This did prove the great interest in this fairly new kind of technology. Exciting!

The focus of my talk was about the history I have with HBase starting with it in late 2007. At this point I would like to take to the opportunity to thank Michael Stack, the lead of HBase, as he has helped me many times back then to sort out the problems I ran into. I would also like to say that if you start with HBase today you will not have these problems as HBase has matured tremendously since then. It is an overall stable product and can solve scalability issues you may face with regular RDBMS's today - and that with great ease.

So the talk I gave did not really sell all the features nor did it explain everything fully. I felt this could be left to the reader to look up on the project's website (or here on my blog) and hence I focused on my use case only. First up, here are the slides.


My Life with HBase - FOSDEM 2010 NoSQL -

After my talk and throughout the rest of the day I also had great conversations with the attendees who had many and great questions.

Having listened to the other talks though I felt I probably could have done a better job selling HBase to the audience. I could have reported about use-cases in well known companies, gave better performance numbers and so on. I have learned a lesson and am making sure I will be doing a better job next time around. I guess this is also another facet of what this is about, i.e. learning to achieve a higher level of professionalism.

But as I said above, my intend was to report about my life with HBase. I am grateful though that it was accepted as that and please let me cite Todd Hoff (see Hot Scalability Links for February 12, 2010) who put it in such nice words:
"The hardscabble tale of HBase's growth from infancy to maturity. A very good introduction and overview of HBase."
Thank you!

Finally here is the video of the talk:



I am looking forward to more NoSQL events in Europe in the near future and will attempt to represent HBase once more (including those adjustments I mentioned above). My hope is that we as Europeans are able to adopt these new technologies and stay abreast with the rest of the world. We sure have smart people to do so.

Friday, February 5, 2010

IvyDE and HBase 0.21 (trunk)

If you are staying on top of HBase development and frequently update to the HBase trunk (0.21.0-dev at the time of this post) you may have noticed that we now have support for Apache Ivy (see HBASE-1433). This is good because it allows to better control dependencies of the required jar files. It does have a few drawbacks though. One issue that you must be online to get your initial set of jars. You can also set up a local mirror or reuse the one you need for Hadoop anyways to share some of them.

Another issue is that it pulls in many more libs as part of the dependency resolving process. This reminds me bit of aptitude and when you try to install Java, for example on Debian. It often wants to pull in a litany of "required" packages but upon closer look many are only recommended and need not to be installed.

Finally you need to get the jar files somehow into your development environment. I am using Eclipse 3.5 on my Windows 7 PC as well as on my MacOS machines. If you have not run ant from the command line yet you have no jars downloaded and opening the project in Eclipse yields an endless amount of errors. You have two choices, you can run ant and get all jars and then add them to the project in Eclipse. But that is rather static and does not work well with future changes. It also is not the "ivy" way to resolve the libraries automatically.

The other option you have is adding a plugin to Eclipse that can handle Ivy for you, right within the IDE. Luckily for Eclipse there is IvyDE. You install it according to its documentation and then add a "Classpath Container" as described here. That part works quite well and after a restart IvyDE is ready to go.

A few more steps have to be done to get HBase working now - as in compiling without errors. The crucial one is editing the Ivy library and setting the HBase specific Ivy files. In particular the "Ivy settings path" and the properties file. The latter especially is specifying all the various version numbers that the ivy.xml is using. Without it the Ivy resolve process will fail with many errors all over the place. Please note that in the screen shot I added you see how it looks like on my Windows PC. The paths will be slightly different for your setup and probably even using another format if you are on a Mac or Linux machine. As long as you specify both you should be fine.

The other important issue is that you have to repeat that same step adding the Classpath Container two more times: each of the two larger contrib packages "contrib/stargate" and "contrib/transactional" have their own ivy.xml! For both you have to go into the respective directory and right click on the ivy.xml and follow the steps described in the Ivy documentation. Enter the same information as mentioned above to make the resolve work, leave everything else the way it is. You may notice that the contrib packages have a few more targets unticked. That is OK and can be used as-is.

As a temporary step you have to add two more static libraries that are in the $HBASE_HOME/lib directory: libthrift-0.2.0.jar and zookeeper-3.2.2.jar. Those will eventually be published on the Ivy repositories and then this step is obsolete (see INFRA-2461).

Eventually you end up with three containers as shown in the second and third screen shot. The Eclipse toolbar now also has an Ivy "Resolve All Dependencies" button which you can use to trigger the download process. Personally I had to do this a few times as the mirrors with the jars seem to be flaky at times. I ended up with for example "hadoop-mapred.jar" missing. Another resolve run fixed the problem.

The last screen shot shows the three Ivy related containers once more in the tree view of the Package Explorer in the Java perspective. What you also see is the Ivy console, which also is installed with the plugin. You have to open it as usual using the "Window - Show View - Console" menu (if you do not have the Console View open already) and then use the drop down menu next to the "Open Console" button in that view to open the Ivy console. It gives you access to all the details when resolving the dependencies and can hint when you have done something wrong. Please note though that it also lists a lot of connection errors, one for every mirror or repository that does not respond or yet has the required package available. One of them should respond though or as mentioned above you will have to try later again.

Eclipse automatically compiles the project and if everything worked out it does so now without a hitch. Good luck!

Update: Added info about the yet still static thrift and zookeeper jars. See Kay Kay's comment below.

Saturday, January 30, 2010

HBase Architecture 101 - Write-ahead-Log

What is the Write-ahead-Log you ask? In my previous post we had a look at the general storage architecture of HBase. One thing that was mentioned is the Write-ahead-Log, or WAL. This post explains how the log works in detail, but bear in mind that it describes the current version, which is 0.20.3. I will address the various plans to improve the log for 0.21 at the end of this article. For the term itself please read here.

Big Picture
The WAL is the lifeline that is needed when disaster strikes. Similar to a BIN log in MySQL it records all changes to the data. This is important in case something happens to the primary storage. So if the server crashes it can effectively replay that log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails the whole operation must be considered a failure.

Let"s look at the high level view of how this is done in HBase. First the client initiates an action that modifies data. This is currently a call to put(Put), delete(Delete) and incrementColumnValue() (abbreviated as "incr" here at times). Each of these modifications is wrapped into a KeyValue object instance and sent over the wire using RPC calls. The calls are (ideally batched) to the HRegionServer that serves the affected regions. Once it arrives the payload, the said KeyValue, is routed to the HRegion that is responsible for the affected row. The data is written to the WAL and then put into the MemStore of the actual Store that holds the record. And that also pretty much describes the write-path of HBase.

Eventually when the MemStore gets to a certain size or after a specific time the data is asynchronously persisted to the file system. In between that timeframe data is stored volatile in memory. And if the HRegionServer  hosting that memory crashes the data is lost... but for the existence of what is the topic of this post, the WAL!

We have a look now at the various classes or "wheels" working the magic of the WAL. First up is one of the main classes of this contraption.

HLog

The class which implements the WAL is called HLog. What you may have read in my previous post and is also illustrated above is that there is only one instance of the HLog class, which is one per HRegionServer. When a HRegion is instantiated the single HLog is passed on as a parameter to the constructor of HRegion.

Central part to HLog's functionality is the append() method, which internally eventually calls doWrite(). It is what is called when the above mentioned modification methods are invoked... or is it? One thing to note here is that for performance reasons there is an option for put(), delete(), and incrementColumnValue() to be called with an extra parameter set: setWriteToWAL(boolean). If you invoke this method while setting up for example a Put() instance then the writing to WAL is forfeited! That is also why the downward arrow in the big picture above is done with a dashed line to indicate the optional step. By default you certainly want the WAL, no doubt about that. But say you run a large bulk import MapReduce job that you can rerun at any time. You gain extra performance but need to take extra care that no data was lost during the import. The choice is yours.

Another important feature of the HLog is keeping track of the changes. This is done by using a "sequence number". It uses an AtomicLong internally to be thread-safe and is either starting out at zero - or at that last known number persisted to the file system. So as the region is opening its storage file, it reads the highest sequence number which is stored as a meta field in each HFile and sets the HLog sequence number to that value if it is higher than what has been recorded before. So at the end of opening all storage files the HLog is initialized to reflect where persisting has ended and where to continue. You will see in a minute where this is used.

The image to the right shows three different regions. Each of them covering a different row key range. As mentioned above each of these regions shares the the same single instance of HLog. What that means in this context is that the data as it arrives at each region it is written to the WAL in an unpredictable order. We will address this further below.

Finally the HLog has the facilities to recover and split a log left by a crashed HRegionServer. These are invoked by the HMaster before regions are deployed again.

HLogKey

Currently the WAL is using a Hadoop SequenceFile, which stores record as sets of key/values. For the WAL the value is simply the KeyValue sent from the client. The key is represented by an HLogKey instance. If you may recall from my first post in this series the KeyValue does only represent the row, column family, qualifier, timestamp, and value as well as the "Key Type". Last time I did not address that field since there was no context. Now we have one because the Key Type is what identifies what the KeyValue represents, a "put" or a "delete" (where there are a few more variations of the latter to express what is to be deleted, value, column family or a specific column).

What we are missing though is where the KeyValue belongs to, i.e. the region and the table name. That is stored in the HLogKey. What is also stored is the above sequence number. With each record that number is incremented to be able to keep a sequential order of edits. Finally it records the "Write Time", a time stamp to record when the edit was written to the log.

LogFlusher

As mentioned above as data arrives at a HRegionServer in form of KeyValue instances it is written (optionally) to the WAL. And as mentioned as well it is then written to a SequenceFile. While this seems trivial, it is not. One of the base classes in Java IO is the Stream. Especially streams writing to a file system are often buffered to improve performance as the OS is much faster writing data in batches, or blocks. If you write records separately IO throughput would be really bad. But in the context of the WAL this is causing a gap where data is supposedly written to disk but in reality it is in limbo. To mitigate the issue the underlaying stream needs to be flushed on a regular basis. This functionality is provided by the LogFlusher class and thread. It simply calls HLog.optionalSync(), which checks if the  hbase.regionserver.optionallogflushinterval, set to 10 seconds by default, has been exceeded and if that is the case invokes HLog.sync(). The other place invoking the sync method is HLog.doWrite(). Once it has written the current edit to the stream it checks if the hbase.regionserver.flushlogentries parameter, set to 100 by default, has been exceeded and calls sync as well.

Sync itself invokes HLog.Writer.sync() and is implemented in SequenceFileLogWriter. For now we assume it flushes the stream to disk and all is well. That in reality this is all a bit more complicated is discussed below.

LogRoller

Obviously it makes sense to have some size restrictions related to the logs written. Also we want to make sure a log is persisted on a regular basis. This is done by the LogRoller class and thread. It is controlled by the hbase.regionserver.logroll.period parameter in the $HBASE_HOME/conf/hbase-site.xml file. By default this is set to 1 hour. So every 60 minutes the log is closed and a new one started. Over time we are gathering that way a bunch of log files that need to be maintained as well. The HLog.rollWriter() method, which is called by the LogRoller to do the above rolling of the current log file, is taking care of that as well by calling HLog.cleanOldLogs() subsequently. It checks what the highest sequence number written to a storage file is, because up to that number all edits are persisted. It then checks if there is a log left that has edits all less than that number. If that is the case it deletes said logs and leaves just those that are still needed.

This is a good place to talk about the following obscure message you may see in your logs:

2009-12-15 01:45:48,427 INFO org.apache.hadoop.hbase.regionserver.HLog: Too
many hlogs: logs=130, maxlogs=96; forcing flush of region with oldest edits:
foobar,1b2dc5f3b5d4,1260083783909


It is printed because the configured maximum number of log files to keep exceeds the number of log files that are required to be kept because they still contain outstanding edits that have not yet been persisted. The main reason I saw this being the case is when you stress out the file system so much that it cannot keep up persisting the data at the rate new data is added. Otherwise log flushes should take care of this. Note though that when this message is printed the server goes into a special mode trying to force flushing out edits to reduce the number of logs required to be kept.

The other parameters controlling the log rolling are hbase.regionserver.hlog.blocksize and hbase.regionserver.logroll.multiplier, which are set by default to rotate logs when they are at 95% of the blocksize of the SequenceFile, typically 64M. So either the logs are considered full or when a certain amount of time has passed causes the logs to be switched out, whatever comes first.

Replay

Once a HRegionServer starts and is opening the regions it hosts it checks if there are some left over log files and applies those all the way down in Store.doReconstructionLog(). Replaying a log is simply done by reading the log and adding the contained edits to the current MemStore. At the end an explicit flush of the MemStore (note, this is not the flush of the log!) helps writing those changes out to disk.

The old logs usually come from a previous region server crash. When the HMaster is started or detects that region server has crashed it splits the log files belonging to that server into separate files and stores those in the region directories on the file system they belong to. After that the above mechanism takes care of replaying the logs. One thing to note is that regions from a crashed server can only be redeployed if the logs have been split and copied. Splitting itself is done in HLog.splitLog(). The old log is read into memory in the main thread (means single threaded) and then using a pool of threads written to all region directories, one thread for each region.

Issues

As mentioned above all edits are written to one HLog per HRegionServer. You would ask why that is the case? Why not write all edits for a specific region into its own log file? Let's quote the BigTable paper once more:

If we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS. Depending on the underlying file system implementation on each GFS server, these writes could cause a large number of disk seeks to write to the different physical log files.

HBase followed that principle for pretty much the same reasons. As explained above you end up with many files since logs are rolled and kept until they are safe to be deleted. If you do this for every region separately this would not scale well - or at least be an itch that sooner or later is causing pain.

So far that seems to be no issue. But again, it causes problems when things go wrong. As long as you have applied all edits in time and persisted the data safely, all is well. But if you have to split the log because of a server crash then you need to divide into suitable pieces, as described above in the "replay" paragraph. But as you have seen above as well all edits are intermingled in the log and there is no index of what is stored at all. For that reason the HMaster cannot redeploy any region from a crashed server until it has split the logs for that very server. And that can be quite a number if the server was behind applying the edits.

Another problem is data safety. You want to be able to rely on the system to save all your data, no matter what newfangled algorithms are employed behind the scenes. As far as HBase and the log is concerned you can turn down the log flush times to as low as you want - you are still dependent on the underlaying file system as mentioned above; the stream used to store the data is flushed but is it written to disk yet? We are talking about fsync style issues. Now for HBase we are most likely talking Hadoop's HDFS as being the file system that is persisted to.

Up to this point it should be abundantly clear that the log is what keeps data safe. For that reason a log could be kept open for up to an hour (or more if configured so). As data arrives a new key/value pair is written to the SequenceFile and occasionally flushed to disk. But that is not how Hadoop was set out to work. It was meant to provide an API that allows to open a file, write data into it (preferably a lot) and closed right away, leaving an immutable file for everyone else to read many times. Only after a file is closed it is visible and readable to others. If a process dies while writing the data the file is pretty much considered lost. What is required is a feature that allows to read the log up to the point where the crashed server has written it (or as close as possible).

Interlude: HDFS append, hflush, hsync, sync... wth?

It all started with HADOOP-1700 reported by HBase lead Michael Stack. It was committed in Hadoop 0.19.0 and meant to solve the problem. But that was not the case. So the issue was tackled again in HADOOP-4379 aka HDFS-200 and implemented syncFs() that was meant to help syncing changes to a file to be more reliable. For a while we had custom code (see HBASE-1470) that detected a patched Hadoop that exposed that API. But again this did not solve the issue entirely.

Then came HDFS-265, which revisits the append idea in general. It also introduces a Syncable interface that exposes hsync() and hflush().

Lastly SequenceFile.Writer.sync() is not the same as the above, it simply writes a synchronization marker into the file that helps reading it later - or recover data if broken.

While append for HDFS in general is useful it is not used in HBase, but the hflush() is. What it does is writing out everything to disk as the log is written. In case of a server crash we can safely read that "dirty" file up to the last edits. The append in Hadoop 0.19.0 was so badly suited that a hadoop fsck / would report the DFS being corrupt because of the open log files HBase kept.

Bottom line is, without Hadoop 0.21.0 you can very well face data loss. With Hadoop 0.21.0 you have a state-of-the-art system.

Planned Improvements

For HBase 0.21.0 there are quite a few things lined up that affect the WAL architecture. Here are some of the noteworthy ones.

SequenceFile Replacement

One of the central building blocks around the WAL is the actual storage file format. The used SequenceFile has quite a few shortcomings that need to be addressed. One for example is the suboptimal performance as all writing in SequenceFile is synchronized, as documented in HBASE-2105.

As with HFile replacing MapFile in HBase 0.20.0 it makes sense to think about a complete replacement. A first step was done to make the HBase classes independent of the underlaying file format. HBASE-2059 made the class implementing the log configurable.

Another idea is to change to a different serialization altogether. HBASE-2055 proposes such a format using Hadoop's Avro as the low level system. Avro is also slated to be the new RPC format for Hadoop, which does help as more people are familiar with it.

Append/Sync

Even with hflush() we have a problem that calling it too often may cause the system to slow down. Previous tests using the older syncFs() call did show that calling it for every record slows down the system considerably. One step to help is to implement a "Group Commit", done in HBASE-1939. It flushes out records in batches. In addition HBASE-1944 adds the notion of a "deferred log flush" as a parameter of a Column Family. If set to true it leaves the syncing of changes to the log to the newly added LogSyncer class and thread. Finally HBASE-2041 sets the flushlogentries to 1 and optionallogflushinterval to 1000 msecs. The .META. is always synced for every change, user tables can be configured as needed.

Distributed Log Splitting

As remarked splitting the log is an issue when regions need to be redeployed. One idea is to keep a list of regions with edits in Zookeeper. That way at least all "clean" regions can be deployed instantly. Only those with edits need to wait then until the logs are split.

What is left is to improve how the logs are split to make the process faster. Here is how is the BigTable addresses the issue:
One approach would be for each new tablet server to read this full commit log file and apply just the entries needed for the tablets it needs to recover. However, under such a scheme, if 100 machines were each assigned a single tablet from a failed tablet server, then the log file would be read 100 times (once by each server).
and further
We avoid duplicating log reads by first sorting the commit log entries in order of the keys (table, row name, log sequence number). In the sorted output, all mutations for a particular tablet are contiguous and can therefore be read efficiently with one disk seek followed by a sequential read. To parallelize the sorting, we partition the log file into 64 MB segments, and sort each segment in parallel on different tablet servers. This sorting process is coordinated by the master and is initiated when a tablet server indicates that it needs to recover mutations from some commit log file.
This is where its at. As part of the HMaster rewrite (see HBASE-1816) the log splitting will be addressed as well. HBASE-1364 wraps the splitting of logs into one issue. But I am sure that will evolve in more sub tasks as the details get discussed.

Thursday, January 28, 2010

2nd Munich OpenHUG Meeting

At the end of last year we had a first meeting in Munich to allow everyone interested in all things Hadoop and related technologies to gather, get to know each other and exchange their knowledge, experiences and information. We would like to invite for the second Munich OpenHUG Meeting and hope to see you all again and meet even more new enthusiasts there. We would also be thrilled if those attending would be willing to share their story so that we can learn about other projects and how people are using the exciting NoSQL related technologies. No pressure though, come along and simply listen if you prefer, we welcome anyone and everybody!

When: Thursday February 25, 2010 at 5:30pm open end
Where: eCircle AG, Nymphenburger Straße 86, 80636 München ["Bruckmann" Building, "U1 Mailinger Str", map (in German) and look for the signs]

Thanks again to Bob Schulze from eCircle to provide the location and projector. So far we have a talk scheduled by Christoph Rupp about HamsterDB. We are still looking for volunteers who would like to present on any related topic (please contact me)! Otherwise we will have an open discussion about whatever is brought up by the attendees.

Last but not least there will be something to drink and we will get pizzas in. Since we do not know how many of you will come we simply stay at the events location and continue our chats over food.

Looking forward to seeing you there!

Please RSVP at Upcoming or Xing.

Sunday, January 10, 2010

First Munich OpenHUG Meeting - Summary

On December 17th we had the first Munich OpenHUG meeting. The location was kindly provided by eCircle's Bob Schulze. This was the first in a series of meetings in Munich based on everything Hadoop and related technologies. Let's use the term NoSQL with care as this is not about blame or finger pointing (I feel that it was used like that in the past and therefore making this express point). We are trying to get together the brightest local and remote talent to report and present on new age and evolved existing technologies.

The first talk of the night was given by Bob himself presenting his findings of evaluating HBase and Hadoop for their internal use. He went into detail explaining how HBase is structuring its data and how it can be used for their needs. One thing that I noted in particular was his HBase Explorer, which he subsequently published on SourceForge as an Open Source project. The talk was concluded by an open discussion about HBase.

The second part of the meeting was my own presentation about how we at WorldLingo use HBase and Hadoop (as well as Lucene etc.)

We continued our discussion on HBase with the developers of eCircle present. This was very interesting and fruitful and we had the chance to exchange experiences made along our similar paths.

I would have wished for the overall attendance to be a little higher, but it was a great start. Talking to other hosts of similar events it seems that this is normal and therefore my hopes are up for the next meetings throughout this year. We have planned the next meeting for February 25th, 2010 at the same location. If you have interest in presenting a talk on any related topic, please contact me!

I am looking forward to meeting you all there!

Lars

Monday, December 14, 2009

First Munich OpenHUG Meeting

First Munich OpenHUG Meeting

We are trying to gauge the interest in a south Germany Hadoop User Group Meeting. After seeing quite a big interest in the Berlin meetings a few of us got together and decided to test the waters for another meeting at the other end of the country. We are therefore happy to announce the first Munich OpenHUG Meeting.

When: Thursday December 17, 2009 at 5:30pm open end
Where: eCircle AG, Nymphenburger Straße 86, 80636 München ("Bruckmann" Building, "U1 Mailinger Str", map in German and look for the signs)

Thanks to Bob Schulze from eCircle to provide the location, projector and also giving a first presentation on how eCircle is planning to use the Hadoop stack.

We also have Dave Butlerdi giving an overview of his usage of Hadoop.
Finally I will give a state of affairs of the HBase project. What is it, what does it do and how am I using it (since early 2008).

We are also open for everyone who wants to talk about anything related to these new technologies often combined under the rather new term "NoSQL". Take the opportunity to talk about what you are working on and find like minded people to bounce ideas off. This is also why we chose the title OpenHUG for the meeting. While we mostly work with Hadoop and its subprojects we also like to learn about related projects and technologies.

Last but not least there will be something to drink and we will get pizzas in. Since we do not know how many of you will come on such short notice we simply stay at Bob's place and continue or chats over food.

As this is a first meeting in Munich on this topic we called it in a day after the Berlin meeting. Given there is interest we will in the future settle on dates that fit nicely between the Berlin dates so that we have no overlap and you can attend both meetings.

Please RSVP at Yahoo's Upcoming or Xing.

Tuesday, November 24, 2009

HBase vs. BigTable Comparison

HBase is an open-source implementation of the Google BigTable architecture. That part is fairly easy to understand and grasp. What I personally feel is a bit more difficult is to understand how much HBase covers and where there are differences (still) compared to the BigTable specification. This post is an attempt to compare the two systems.

Before we embark onto the dark technology side of things I would like to point out one thing upfront: HBase is very close to what the BigTable paper describes. Putting aside minor differences, as of HBase 0.20, which is using ZooKeeper as its lock distributed coordination service, it has all the means to be nearly an exact implementation of BigTable's functionality. What I will be looking into below are mainly subtle variations or differences. Where possible I will try to point out how the HBase team is working on improving the situation given there is a need to do so.

Scope

The comparison in this post is based on the OSDI'06 paper that describes the system Google implemented in about seven person-years and which is in operation since 2005. The paper was published 2006 while the HBase sub-project of Hadoop was established only around the end of that same year to early 2007. Back then the current version of Hadoop was 0.15.0. Given we are now about 2 years in, with Hadoop 0.20.1 and HBase 0.20.2 available, you can hopefully understand that indeed much has happened since. Please also note that I am comparing a 14 page high level technical paper with an open-source project that can be examined freely from top to bottom. It usually means that there is more to tell about how HBase does things because the information is available.

Towards the end I will also address a few newer features that BigTable has nowadays and how HBase is comparing to those. We start though with naming conventions.

Terminology

There are a few different terms used in either system describing the same thing. The most prominent being what HBase calls "regions" while Google refers to it as "tablet". These are the partitions of subsequent rows spread across many "region servers" - or "tablet server" respectively. Apart from that most differences are minor or caused by usage of related technologies since Google's code is obviously closed-source and therefore only mirrored by open-source projects. The open-source projects are free to use other terms and most importantly names for the projects themselves.

Features

The following table lists various "features" of BigTable and compares them with what HBase has to offer. Some are actual implementation details, some are configurable option and so on. This may be confusing but it would be difficult to sort them into categories and not ending up with one entry only in each of them.

Feature
 Google BigTable 
 Apache HBase 
Notes
Atomic Read/Write/Modify
Yes, per row
Yes, per row
Since BigTable does not strive to be a relational database it does not have transactions. The closest to such a mechanism is the atomic access to each row in the table. HBase also implements a row lock API which allows the user to lock more than one row at a time.
Lexicographic Row Order
Yes
Yes
All rows are sorted lexicographically in one order and that one order only. Again, this is no SQL database where you can have different sorting orders.
Block Support
Yes
Yes
Within each storage file data is written as smaller blocks of data. This enables faster loading of data from large storage files. The size is configurable in either system. The typical size is 64K.
Block Compression
Yes, per column family
Yes, per column family
Google uses BMDiff and Zippy in a two step process. BMDiff works really well because neighboring key-value pairs in the store files are often very similar. This can be achieved by using versioning so that all modifications to a value are stored next to each other but still have a lot in common. Or by designing the row keys in such a way that for example web pages from the same site are all bundled. Zippy then is a modified LZW algorithm. HBase on the other hand uses the standard Java supplied GZip or with a little effort the GPL licensed LZO format. There are indications though that Hadoop also may want to have BMDiff (HADOOP-5793) and possibly Zippy as well.
Number of Column Families
Hundreds at Most
Less than 100
While the number of rows and columns is theoretically unbound the number of column families is not. This is a design trade-off but does not impose too much restrictions if the tables and key are designed accordingly.
Column Family Name Format
Printable
Printable
The main reason for HBase here is that column family names are used as directories in the file system.
Qualifier Format
Arbitrary
Arbitrary
Any arbitrary byte[] array can be used.
Key/Value Format
Arbitrary
Arbitrary
Like above, any arbitrary byte[] array can be used.
Access Control
Yes
No
BigTable enforces access control on a column family level. HBase does not have yet have that feature (see HBASE-1697).
Cell Versions
Yes
Yes
Versioning is done using timestamps. See next feature below too. The number of versions that should be kept are freely configurable on a column family level.
Custom Timestamps
Yes (micro)
Yes (milli)
With both systems you can either set the timestamp of a value that is stored yourself or leave the default "now". There are "known" restrictions in HBase that the outcome is indeterminate when adding older timestamps after already having stored newer ones beforehand.
Data Time-To-Live
Yes
Yes
Besides having versions of data cells the user can also set a time-to-live on the stored data that allows to discard data after a specific amount of time.
Batch Writes
Yes
Yes
Both systems allow to batch table operations.
Value based Counters
Yes
Yes
BigTable and HBase can use a specific column as atomic counters. HBase does this by acquiring a row lock before the value is incremented.
Row Filters
Yes
Yes
Again both system allow to apply filters when scanning rows.
Client Script Execution
Yes
No
BigTable uses Sawzall to enable users to process the stored data.
MapReduce Support
Yes
Yes
Both systems have convenience classes that allow scanning a table in MapReduce jobs.
Storage Systems
GFS
HDFS, S3, S3N, EBS
While BigTable works on Google's GFS, HBase has the option to use any file system as long as there is a proxy or driver class for it.
File Format
SSTable
HFile
 
Block Index
At end of file
At end of file
Both storage file formats have a similar block oriented structure with the block index stored at the end of the file.
Memory Mapping
Yes
No
BigTable can memory map storage files directly into memory.
Lock Service
Chubby
ZooKeeper
There is a difference in where ZooKeeper is used to coordinate tasks in HBase as opposed to provide locking services. Overall though ZooKeeper does for HBase pretty much what Chubby does for BigTable with slightly different semantics.
Single Master
Yes
No
HBase recently added support for multiple masters. These are on "hot" standby and monitor the master's ZooKeeper node.
Tablet/Region Count
10-1000
10-1000
Both systems recommend about the same amount of regions per region server. Of course this depends on many things but given a similar setup as far as "commodity" machines are concerned it seems to result in the same amount of load on each server.
Tablet/Region Size
100-200MB
256MB
The maximum region size can be configured for HBase and BigTable. HBase used 256MB as the default value.
Root Location
1st META / Chubby
-ROOT- / ZooKeeper
HBase handles the Root table slightly different from BigTable, where it is the first region in the Meta table. HBase uses its own table with a single region to store the Root table. Once either system starts the address of the server hosting the Root region is stored in ZooKeeper or Chubby so that the clients can resolve its location without hitting the master.
Client Region Cache
Yes
Yes
The clients in either system caches the location of regions and has appropriate mechanisms to detect stale information and update the local cache respectively
Meta Prefetch
Yes
No (?)
A design feature of BigTable is to fetch more than one Meta region information. This proactively fills the client cache for future lookups.
Historian
Yes
Yes
The history of region related events (such as splits, assignment, reassignment) is recorded in the Meta table.
Locality Groups
Yes
No
It is not entirely clear but it seems everything in BigTable is defined by Locality Groups. The group multiple column families into one so that they get stored together and also share the same configuration parameters. A single column family is probably a Locality Group with one member. HBase does not have this option and handles each column family separately.
In-Memory Column Families
Yes
Yes
These are for relatively small tables that need very fast access times.
KeyValue (Cell) Cache
Yes
No
This is a cache that servers hot cells.
Block Cache
Yes
Yes
Blocks read from the storage files are cached internally in configurable caches.
Bloom Filters
Yes
Yes
These filters allow - at a cost of using memory on the region server - to quickly check if a specific cell exists or maybe not.
Write-Ahead Log (WAL)
Yes
Yes
Each region server in either system stores one modification log for all regions it hosts.
Secondary Log
Yes
No
In addition to the Write-Ahead log mentioned above BigTable has a second log that it can use when the first is going slow. This is a performance optimization.
Skip Write-Ahead Log
?
Yes
For bulk imports the client in HBase can opt to skip writing into the WAL.
Fast Table/Region Split
Yes
Yes
Splitting a region or tablet is fast as the daughter regions first read the original storage file until a compaction finally rewrites the data into the region's local store.

New Features

As mentioned above, a few years have passed since the original OSDI'06 BigTable paper. Jeff Dean - a fellow at Google - has mentioned a few new BigTable features during speeches and presentations he gave recently. We will have a look at some of them here.

Feature
 Google BigTable 
 Apache HBase 
Notes
Client Isolation
Yes
No
BigTable is internally used to server many separate clients and can therefore keep the data between isolated.
Coprocessors
Yes
No
BigTable can host code that resides with the regions and splits with them as well. See HBASE-2000 for progress on this feature within HBase.
Corruption Safety
Yes
No
This is an interesting topic. BigTable uses CRC checksums to verify if data has been written safely. While HBase does not have this, the question is if that is build into Hadoop's HDFS?
Replication
Yes
No
HBase is working on the same topic in HBASE-1295

Note: the color codes indicate what features have a direct match or where it is missing (yet). Weaker features are colored yellow, as I am not sure if they are immediately necessary or even applicable given HBase's implementation.

Variations and Differences

Some of the above features need a bit more looking into as they are difficult to be narrowed down to simple "Yay or Nay" questions. I am addressing them below separately.

Lock Service

This is from the BigTable paper:
Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.

There is a lot of overlap compared to how HBase does use ZooKeeper. What is different though is that schema information is not stored in ZooKeeper (yet, see http://wiki.apache.org/hadoop/ZooKeeper/HBaseUseCases) for details. What is important here though is the same reliance on the lock service being available. From my own experience and reading the threads on the HBase mailing list it is often underestimated what can happen when ZooKeeper does not get the resources it needs to react timely. It is better to have a small ZooKeeper cluster on older machines not doing anything else as opposed to having ZooKeeper nodes running next to the already heavy Hadoop or HBase processes. Once you starve ZooKeeper you will see a domino effect of HBase nodes going down with it - including the master(s).

Update: After talking to a few guys of the ZooKeeper team I would like to point out that this is indeed not a ZooKeeper issue. It has to do with the fact that if you have an already heavily loaded node trying to also respond in time to ZooKeeper resources then you may face a timeout situation where the HBase RegionServers and even the Master may think that their coordination service is gone and shut themselves down. Patrick Hunt has responded to this by mail and by post. Please read both to see that ZooKeeper is able to handle the load. I personally recommend to set up ZooKeeper in combination with HBase on a separate cluster, maybe a set of spare machines you have from a recent update to the cluster and which are slightly outdated (no 2xquad core CPU with 16GB of memory) but are otherwise perfectly fine. This also allows you to monitor the machines separately and not having to see a combined CPU load of 100% on the servers and not really knowing where it comes from and what effect it may have.

Another important difference is that ZooKeeper is no lock service like Chubby - and I do think it does not have to be as far as HBase is concerned. ZooKeeper is a distributed coordination service enabling HBase to do Master node elections etc. It also allows using semaphores to indicate state or actions required. So where Chubby creates a lock file to indicate a tablet server is up and running HBase in turn uses ephemeral nodes that exist as long as the session between the RegionServer which creates that node and ZooKeeper is active. This also causes the differences in semantics where in BigTable can delete a tablet servers lock file to indicate that it has lost its lease on tablets. In HBase this has to be handled differently because of the slightly less restrictive architecture of ZooKeeper. These are only semantics as mentioned and do not mean one is better than the other - just different.

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially - it is never split - to ensure that the tablet location hierarchy has no more than three levels.

As mentioned above in HBase the root region is its own table with a single region. If that makes a difference to having it as the first (non-splittable) region of the meta table I doubt strongly. It is just the same feature but implemented differently.

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row.

HBase does have a different layout here. It stores the start and end row with each region where the end row is exclusive and denotes the first (or start) row of the next region. Again, these are minor differences and I am not sure if there is a better or worse solution. It is just done differently.

Master Operation

To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server's file. If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. Once a server's file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires.

This is quite different even up to the current HBase 0.20.2. Here the master uses a heartbeat protocol that is used by the region servers to report for duty and that they are still alive subsequently. I am not sure if this topic is covered by the master rewrite umbrella issue HBASE-1816 - and if it needs to be addressed at all. It could well be that what we have in HBase now is sufficient and does its job just fine. It was created when there was no lock service yet and therefore could be considered legacy code too.

Master Startup

The master executes the following steps at startup. (1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. (2) The master scans the servers directory in Chubby to find the live servers. (3) The master communicates with every live tablet server to discover what tablets are already assigned to each server. (4) The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.

Along what I mentioned above, this part of the code was created before ZooKeeper was available. So HBase actually waits for the region servers to report for duty. It also scans the .META. table to learn what is there and which server is assigned to it. ZooKeeper is (yet) only used to publish the server hosting the -ROOT- region.

Tablet/Region Splits

In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.

The master node in HBase uses the .META. solely to detect when a region was split but the message was lost. For that reason it scans the .META. on a regular basis to see when a region appears that is not yet assigned. It will then assign that region as per its default strategy.

Compactions

The following are more terminology differences than anything else.

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

HBase has a similar operation but it is referred to as a "flush". Opposed to that "minor compactions" in HBase rewrite the last N used store files, i.e. those with the most recent mutations as they are probably much smaller than previously created files that have more data in them.

... we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

This again refers to what is called "minor compaction" in HBase.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.

Here we have an exact match though, a "major compaction" in HBase also rewrites all files into one.

Immutable Files

Knowing that files are fixed once written BigTable makes the following assumption:

The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.

I do believe this is done similar in HBase but am not sure. It certainly has the same architecture as HDFS files for example are also immutable once written.

I can only recommend that you read the BigTable too and make up your own mind. This post was inspired by the idea to learn what BigTable really has to offer and how much HBase has already covered. The difficult part is of course that there is not too much information available on BigTable. But the numbers even the 2006 paper lists are more than impressive. If HBase as on open-source project with just a handful of committers of whom most have a full-time day jobs can achieve something even remotely comparable I think this is a huge success. And looking at the 0.21 and 0.22 road map, the already small gap is going to shrink even further!