Tuesday, May 26, 2009

HBase Schema Manager

As already mentioned in one of my previous posts, HBase at times makes it difficult to maintain or even create a new table structure. Imagine you have a running cluster and quite an elaborate table setup. Now you want to create a backup cluster for load balancing and general tasks like reporting. How do you get all the values from one system into the other?

While there are various examples that help you back up the data and eventually restore it, how do you "clone" the table schemas?

Or imagine you have an existing system like the one we talked about above and you simply want to change a few things around. With an RDBMS you can save the required steps in a DDL statement and execute it on the server - or the backup server etc. But with HBase there is no DDL, nor any way to execute pre-built scripts against a running cluster.

What I described in my previous post was a way to store the table schemas in an XML configuration file and run that against a cluster. The code handles adding new tables and, more importantly, the addition, removal and modification of column families for any named table.
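In essence this boils down to calls against the HBaseAdmin client API. The following is a minimal, hypothetical sketch of the create path only - it is not the actual tool source, and it is based on the 0.19-era API, so signatures may differ in your version:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // taken from the <hbase_master> element of the configuration file
    conf.set("hbase.master", "foo.bar.com:60000");
    HBaseAdmin admin = new HBaseAdmin(conf);
    // build the column descriptor the way the XML describes it: maxVersions,
    // compression, inMemory, blockCacheEnabled, maxValueLength, timeToLive,
    // bloomFilter (family names carry a trailing colon in this API version)
    HColumnDescriptor sample = new HColumnDescriptor(Bytes.toBytes("sample:"),
        1, HColumnDescriptor.CompressionType.NONE, false, false,
        Integer.MAX_VALUE, -1, false);
    HTableDescriptor desc = new HTableDescriptor("test");
    desc.addFamily(sample);
    if (!admin.tableExists("test")) {
      admin.createTable(desc);
    }
  }
}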

I have put this all into a separate Java application that may be useful to you. You can get it from my GitHub repository. It is really simple to use: you create an XML-based configuration file, for example:
<?xml version="1.0" encoding="UTF-8"?>
<configurations>
  <configuration>
    <name>foo</name>
    <description>Configuration for the FooBar HBase cluster.</description>
    <hbase_master>foo.bar.com:60000</hbase_master>
    <schema>
      <table>
        <name>test</name>
        <description>Test table.</description>
        <column_family>
          <name>sample</name>
          <description>Sample column.</description>
          <!-- Default: 3 -->
          <max_versions>1</max_versions>
          <!-- Default: DEFAULT_COMPRESSION_TYPE -->
          <compression_type/>
          <!-- Default: false -->
          <in_memory/>
          <!-- Default: false -->
          <block_cache_enabled/>
          <!-- Default: -1 (forever) -->
          <time_to_live/>
          <!-- Default: 2147483647 -->
          <max_value_length/>
          <!-- Default: DEFAULT_BLOOM_FILTER_DESCRIPTOR -->
          <bloom_filter/>
        </column_family>
      </table>
    </schema>
  </configuration>
</configurations>

Then all you have to do is run the application like so:

java -jar hbase-manager-1.0.jar schema.xml

The "schema.xml" is the above XML configuration saved on your local machine. The output shows the steps performed:
$ java -jar hbase-manager-1.0.jar schema.xml
creating table test...
table created
done.

You can also specify more options on the command line:
usage: HbaseManager [<options>] <schema-xml-filename> [<config-name>]
 -l,--list      lists all tables but performs no further action.
 -n,--nocreate  do not create non-existent tables.
 -v,--verbose   print verbose output.
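
For example, to make sure a run only alters existing tables and never creates any missing ones, you would invoke it like so:

$ java -jar hbase-manager-1.0.jar -n schema.xml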

If you use the "verbose" option you get more details:
$ java -jar hbase-manager-1.0.jar -v schema.xml
schema filename: schema.xml
configuration used: null
using config number: default
table schemas read from config:
[name -> test
description -> Test table.
columns -> {sample=name -> sample
description -> Sample column.
maxVersions -> 1
compressionType -> NONE
inMemory -> false
blockCacheEnabled -> false
maxValueLength -> 2147483647
timeToLive -> -1
bloomFilter -> false}]
hbase.master -> foo.bar.com:60000
authoritative -> true
name -> test
tableExists -> true
changing table test...
no changes detected!
done.

Finally, you can use the "list" option to check initial connectivity and to verify that the changes were applied:
$ java -jar hbase-manager-1.0.jar -l schema.xml
tables found: 1
test
done.

A few notes: First and most importantly, if you change a large table, i.e. one with thousands of regions, this process can take quite a long time. This is caused by the enableTable() call having to scan the complete .META. table to assign the regions to their respective region servers. There is possibly room for improvement in my little application to handle this better - suggestions welcome!
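
For completeness, the modify path is essentially the following sequence - continuing the hypothetical sketch from above, with "admin" and "sample" as defined there - and it is the final enableTable() call that hurts on large tables:

// a live table must be disabled before it can be altered
admin.disableTable("test");
// apply the differences computed from the XML schema
admin.modifyColumn("test", "sample:", sample);
// re-enabling scans all of .META. to reassign the regions - the slow part
admin.enableTable("test");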

Also, I have not implemented the Bloom Filter settings yet, as they are still changing from 0.19 to 0.20. Once they have been finalized I will add support for them.

If you do not specify a configuration name then the first one in the file is used. Having more than one configuration allows you to define multiple clusters in one schema file and, by specifying the name, run against only a specific one when you need to.
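For example, given the configuration named "foo" above, the following would process only that configuration:

$ java -jar hbase-manager-1.0.jar schema.xml foo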
