Comments on Lineland, Lars George's blog (feed last updated 2023-06-22).

Unknown (2015-12-22):
Hi Lars,
Are you still doing HBase consulting? Just curious to know: how successful were you in your endeavor?

Anonymous (2014-03-27):
Very informative. I searched the net for this level of detail but found only sparse data about the internals. This article helped me connect the various dots. Thanks, Lars, for such an amazing article.

Pi (2014-01-08):
Yes, merge is using B+tree
in its own RegionServer.
Yes.

Pi (2014-01-08):
In the HMaster.
From ZooKeeper.

Pi (2014-01-08):
HFile.
You mean the memstore? Yes.
HBase will split it into another region on the same RegionServer.

Pi (2014-01-08):
Per RegionServer.

hussain jamali (2013-09-21):
Nice comparison/discussion.
Regarding metadata, HCatalog (merged with Hive as of v11) can be used to access the metadata defined in the Hive metastore from Pig. It helps manage your metadata across Pig, Hive, and MapReduce.

Wallis Dudhnath (2013-08-14):
Very good analysis and summary. I look at Hive as being an open-sourced DWS for the Hadoop framework. Pig/Pig Latin is a data-flow language geared toward big-data information streams.
VBR/ Wallis Dudhnath

Anonymous (2013-06-05):
Yes, you are right.

Anonymous (2013-06-05):
This article describes HFile v1. In v2 this has changed to multi-level indexing with a B+tree-like data structure for efficient processing, and the index has moved to the block level.

Abhishek (2013-04-30):
Lars, great write-up. It's quite informative. I have a problem that is actually counter-intuitive. I have an HBase/Hadoop cluster with 8 nodes. All the tables I store fit into a single region, so any time I scan a table it hits a single node. The cluster is absolutely idle and the tables get no writes at all. If I run an HBase shell on one of the nodes and scan one of the tables, the performance varies 6x depending on where I scan it. If I scan from any machine in the cluster that is not the RegionServer hosting the region for the table, it runs much faster than if I scan from the machine that actually hosts that region. All other nodes are fine, and this happens regardless of which table I choose to scan. These are table scans from the HBase shell, not even a MapReduce job. Any ideas on why this might be?
I have HBase 0.92 installed. Thanks.

Anonymous (2013-03-29):
One question: in the first big picture, shouldn't the HLog be per RegionServer, not per region?

Oak (2013-03-08):
Hi Lars,
Thank you for this article. I am building a Hadoop system that will include HBase. We plan to serve multiple purposes with the same data: there will be HBase, but MapReduce as well. You mention the use of HFiles — does that mean I have to plan space for both HFiles and MapFiles for the same data (i.e. doubling the storage volume) if I use HBase and MapReduce on the same Hadoop cluster?
Thank you,
Oner

Anonymous (2012-12-07):
I am working on HBase, and I have a question about how HBase stores data in sorted order via an LSM tree. As I understand it, HBase uses an LSM tree for large-scale data processing: when data comes from a client, it is first stored sequentially in memory, then sorted and written out as a B-tree-like store file, which is later merged with the on-disk B-tree (of keys). Is that correct? Am I missing something? If yes, then in a cluster environment there are multiple RegionServers taking client requests. In that case, how do all the HLogs (one per RegionServer) get merged with the on-disk B-tree? Or does an HLog only merge its data with HFiles of the same RegionServer?

Anonymous (2012-11-27):
Excellent and informative post.

א. פ. (2012-11-12):
Great article, thanks! I was actually struggling to understand how HBase deals with data locality. This has been helpful.

tsuna (2012-09-25):
I find it ridiculous that we need to configure Hadoop to allow this many threads. And I would also argue that HBase shouldn't die just because a DataNode was in a bad mood and decided to run out of "xcievers" (sic).

Anonymous (2012-07-12):
Hi,
I have a few questions.

Q1: In an RDBMS we have multiple DB schemas / Oracle user instances. Similarly, can we have multiple DB schemas in HBase? If yes, can we have multiple schemas on one Hadoop/HBase cluster? If multiple schemas are possible, how do we define them — through configuration or programmatically?

Q2: Can we use the same column-family name in multiple tables? If yes, does having the same column-family name in multiple tables affect performance?

Q3: Sequential keys improve read performance and random keys improve write performance. Which way should one go?

Q4: What are the best practices for improving Hadoop + HBase performance?

Q5: If one program is deleting a table while another program is accessing a row of that table, what is the impact? Can we take some sort of lock while reading or while deleting a table?

Q6: Since everything in the application is in byte form, what happens if the HBase database and the application use different character sets? Can we sync both to a particular character set through configuration or programmatically?

Regards,
Rashmi

Anonymous (2012-06-26):
Thanks a lot! It's a really great blog! Any details on the magic part of the header?

Nantacoben (2012-06-14):
@anti neutrino: the default block size for HFiles is 64 KB. As George mentioned, smaller block sizes are better for random read access and larger block sizes are better for sequential reads such as scans. I've seen some people set it as high as 1 GB for their purposes, but your configuration may vary.

anti neutrino (2012-06-07):
Hi,
This is really insightful. I have one query though: what is the ideal or recommended HDFS block size for storing HBase files? My guess is that the smaller the HDFS block size, the better for HBase performance.

Yash Ganthe (2012-05-18):
If a record is like {
Name: "ABC"
Address: "XYZ"
Number: "123"
}
does HBase store the name of each column with every row, or does it store it once for the table and reference it from the individual rows?
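The comment above about HFile v2 moving to multi-level, B+tree-like indexing can be illustrated with a toy two-level block index: a root index of first keys points to leaf index blocks, which in turn point to data blocks, and lookup does one binary search per level. This is a sketch of the general idea only, not HBase's actual index encoding; all names here are made up.

```python
from bisect import bisect_right

def build_index(data_blocks, fanout):
    """data_blocks: sorted list of (first_key, block) pairs."""
    # Group data-block entries into leaf index blocks of `fanout` entries each.
    leaves = [data_blocks[i:i + fanout] for i in range(0, len(data_blocks), fanout)]
    # The root index holds the first key of each leaf index block.
    root = [leaf[0][0] for leaf in leaves]
    return root, leaves

def locate(root, leaves, key):
    """Return the data block that may contain `key` (binary search per level)."""
    li = max(bisect_right(root, key) - 1, 0)        # pick the leaf index block
    leaf = leaves[li]
    first_keys = [fk for fk, _ in leaf]
    bi = max(bisect_right(first_keys, key) - 1, 0)  # pick the data block
    return leaf[bi][1]

# Ten data blocks with first keys k000, k010, ..., k090.
blocks = [(f"k{i:03d}", f"block-{i // 10}") for i in range(0, 100, 10)]
root, leaves = build_index(blocks, fanout=4)
print(locate(root, leaves, "k037"))  # → block-3 (the block starting at k030)
```

The win over v1's flat index is that only the small root index must stay in memory; leaf index blocks can be loaded on demand.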
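The LSM-tree question above (in-memory store, flush, merge) can be sketched with a toy write path: writes go to a sorted in-memory buffer (the "memstore"), which is flushed to immutable sorted files, and a compaction merge-sorts those files back into one, with the newest value winning per key. This is an illustration of the mechanism only, not HBase's code; the class and its parameters are invented for the example.

```python
import heapq

class ToyLSM:
    def __init__(self, flush_threshold=3):
        self.memstore = {}        # key -> (seq, value), kept small and in memory
        self.files = []           # newest-first list of immutable sorted files
        self.seq = 0              # monotonically increasing write sequence number
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.seq += 1
        self.memstore[key] = (self.seq, value)
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memstore out as a new immutable, sorted file.
        self.files.insert(0, sorted(self.memstore.items()))
        self.memstore = {}

    def compact(self):
        # Merge-sort all files into one; the highest seq wins per key.
        merged = {}
        for key, (seq, value) in heapq.merge(*self.files):
            if key not in merged or seq > merged[key][0]:
                merged[key] = (seq, value)
        self.files = [sorted(merged.items())]

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key][1]
        for f in self.files:      # newest file first
            d = dict(f)
            if key in d:
                return d[key][1]
        return None

lsm = ToyLSM(flush_threshold=2)
for k, v in [("row1", "a"), ("row2", "b"), ("row1", "A"), ("row3", "c")]:
    lsm.put(k, v)
lsm.compact()
print(lsm.get("row1"))  # → "A", the newer value survives compaction
```

This also answers the second half of the question in spirit: each RegionServer only ever merges its own flushed files; there is no cross-server merge of logs into one global tree.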
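The block-size exchange above boils down to a simple trade-off, which can be put in back-of-the-envelope numbers (illustrative figures, not measured HBase behavior): a random get must read and decode one whole block, while the block index needs roughly one entry per block, so smaller blocks cost more index memory. The 50-byte index-entry size below is an assumption for the example.

```python
def block_size_tradeoff(file_size, block_size, index_entry_bytes=50):
    # One index entry per block; one full block read per random get.
    n_blocks = file_size // block_size
    return {
        "bytes_read_per_random_get": block_size,
        "index_size_bytes": n_blocks * index_entry_bytes,
    }

gb = 1024 ** 3
for bs in (8 * 1024, 64 * 1024, 1024 * 1024):
    print(bs, block_size_tradeoff(10 * gb, bs))
```

For a 10 GB file, 8 KB blocks mean small random reads but a much larger index than 64 KB blocks, while 1 MB blocks shrink the index and favor sequential scans at the cost of each random get reading more data. (Note this HFile block size is a per-column-family setting and is distinct from the HDFS block size asked about above.)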
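On the last question above: HBase stores each cell as a full KeyValue — row key, column family, qualifier, timestamp, and value — so the column name is repeated with every row rather than stored once per table (which is why short qualifier names are commonly advised). The field layout in this sketch is illustrative, not the exact on-disk HFile encoding.

```python
def encode_cell(row, family, qualifier, ts, value):
    # Toy serialization: every cell carries all of its coordinates.
    parts = [row, family, qualifier, str(ts), value]
    return b"\x00".join(p.encode() for p in parts)

cells = [
    encode_cell("row1", "cf", "Name", 1, "ABC"),
    encode_cell("row2", "cf", "Name", 1, "DEF"),
]
# The qualifier "Name" appears in every encoded cell, once per row:
print(all(b"Name" in c for c in cells))  # → True
```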