HBase @ WorldLingo
The second talk was given by Jan Lehnardt, a CouchDB team member. I have been looking into Erlang for the last few months to see how we could use it for our own efforts. CouchDB is one of the projects you come across when reading articles about Erlang. So it was really great to have Jan present too.
At the end of both our talks it was great to see how the questions from the audience at times tried to compare the two. So is HBase better or worse than CouchDB? Of course, you cannot compare them directly. While they share common features (well, they store data, right?) they are made to solve different problems. CouchDB offers schema-free storage with built-in replication, which can even be used to create offline clients that sync their changes with another site when they have connectivity again. One of the features puzzling me most is the ability to use it to serve your own applications to the world. You create the pages and scripts you need and push them into the database using CouchApp. Since the database already has a built-in web server it can handle your application's requirements implicitly. Nice!
I asked Jan if he had concerns about scaling this, or if it wouldn't be better to use Apache or Nginx to serve the static content. His argument was that Erlang can handle many more concurrent requests than Apache can, for example. I read up on Yaws and saw his point. So I guess it is then a question of memory and CPU requirements. The former is apparently another strength of CouchDB, which has been shown to serve thousands of concurrent requests while needing only about 10MB of RAM - how awesome is that?! I am not sure about CPU - but I would guess that it is equally sane.
Another Erlang project I am interested in is Dynomite, an Erlang-based Amazon Dynamo "clone" (or rather implementation). Talking to Cliff, it seems it is just as awesome, leveraging the Erlang OTP abilities to create something that a normal Java developer and their JRE are just not used to.
And that brings me to HBase. I told the crowd in Berlin that as of version 0.18.0 HBase is ready for anyone to get started with - given they read the Wiki to set the file handles right and a few other bits and pieces.
Note: I was actually thinking about suggesting an improvement to the HBase team: a command-line check, invoked separately or run whenever "start-hbase.sh" is called, that checks a few of these common parameters and prints out warnings to the user. I know that the file handle count is printed out in the log files, but for a newbie this is a bit too deep down. What could be checked? First off, the file handle limit, which should be, say, 32K. The next thing is the newer resource limits, for example those introduced with Hadoop, that now need tweaking. An example is the "xciever" (sic) value. This again is documented in the Wiki, but who reads it, right? Another common issue is RAM. If the master knows the number of regions (or while it is scanning the META to determine it) it could warn if the JRE is not given enough memory. Sure, there are no hard boundaries, but it is better to see a "Warning: Found x regions. Your configured memory for the JRE seems too low for the system to run stable."
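To make the idea a bit more concrete, here is a rough sketch of what such a check could look like, written in Python purely for illustration. Nothing like this exists in HBase as far as I know: the function name `preflight_warnings`, the constants, and in particular the threshold for the "xciever" value and the memory-per-region rule of thumb are assumptions of mine, not official guidance. Only the 32K file handle figure and the wording of the region warning come from the note above.

```python
# Hypothetical pre-flight check; every threshold here is an illustrative assumption.
MIN_FILE_HANDLES = 32 * 1024   # "say 32K", as suggested in the note above
MIN_XCIEVERS = 2047            # assumed lower bound for the "xciever" value
HEAP_MB_PER_REGION = 2         # assumed rule of thumb, not a hard boundary


def preflight_warnings(file_handles, xcievers, regions, heap_mb):
    """Return a list of warning strings for a few common misconfigurations."""
    warnings = []
    if file_handles < MIN_FILE_HANDLES:
        warnings.append(
            "Warning: File handle limit is %d, expected at least %d."
            % (file_handles, MIN_FILE_HANDLES)
        )
    if xcievers < MIN_XCIEVERS:
        warnings.append(
            "Warning: The xciever value is %d, expected at least %d."
            % (xcievers, MIN_XCIEVERS)
        )
    if regions * HEAP_MB_PER_REGION > heap_mb:
        warnings.append(
            "Warning: Found %d regions. Your configured memory for the JRE "
            "seems too low for the system to run stable." % regions
        )
    return warnings


if __name__ == "__main__":
    import resource  # Unix only; reads the limits of the current process
    import sys

    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        soft_limit = sys.maxsize  # treat "unlimited" as large enough
    # The remaining values are made up; a real check would read them
    # from the configuration and from the master.
    for line in preflight_warnings(soft_limit, xcievers=256, regions=3000, heap_mb=1000):
        print(line)
```

Keeping the checks in one plain function that only takes numbers means the same logic could be called from a start script or on its own, with the actual lookups (ulimit, configuration files, region count) done by whoever calls it.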
Back to HBase. I also told the audience that as of HBase 0.19.0 scanning is much improved speed-wise and that I am happy with where we are nowadays in terms of stability and speed. Sure, it could be faster for random reads, so that I might be able to drop my MemCached layer. And the team is working on that. So, here's hoping that we will see the best HBase ever in the upcoming version. I for one am 100% sure that the HBase guys can deliver - they have done so in the past and will now as well. All I can say - give it a shot!
So, CouchDB is lean and mean while HBase is a resource hog, in my experience. But it is also built to scale to petabyte-size data. With CouchDB, you would have to add sharding on top of it, including all the issues that come with it, for example rebalancing, fail-over, recovery, adding more servers and so on. For me HBase is the system of choice - for this particular problem. That does not mean I could not use CouchDB, or even Erlang for that matter, in a separate area. Until then I will keep a very close eye on this exciting (though in the case of Erlang not new!) technology. May the open-source projects rule, live long and prosper!

