Discussion:
Pros and Cons of using an HBase table as an external table in Hive
Ramasubramanian Narayanan
2017-06-07 10:13:35 UTC
Permalink
Hi,

Can you please let us know the pros and cons of using an HBase table as an
external table in Hive?

Will there be any performance degradation when using Hive over HBase instead
of a native Hive table?

The table that I am planning to keep in HBase is a master table, such as
account or customer, and I want to implement Slowly Changing Dimensions.
Please throw some light on that too if you have done any such
implementations.

Thanks and Regards,
Rams
Mich Talebzadeh
2017-06-07 15:46:53 UTC
Permalink
As far as I know, accessing an HBase table from Hive can only be done
through an external table that uses the HBase storage handler.

Example

hive> CREATE EXTERNAL TABLE MARKETDATAHBASE (key STRING, TICKER STRING,
TIMECREATED STRING, PRICE STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,PRICE_INFO:TICKER,PRICE_INFO:TIMECREATED,PRICE_INFO:PRICE")
TBLPROPERTIES ("hbase.table.name" = "MARKETDATAHBASE");


The problem here is that, like most Hive external tables, you are creating a
pointer to the HBase table through the Hive storage handler, and there is
very little optimization that can be done.
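For illustration, compare two hypothetical queries against the mapped table
(the key and price values below are made up). Only a filter on the row key
can be turned into an HBase get/range scan; a filter on a mapped column
still walks the whole table through the HBase scanner:

SELECT TICKER, PRICE
FROM MARKETDATAHBASE
WHERE key = 'ORCL-20170607';   -- rowkey filter: can be served by a range scan

SELECT TICKER, PRICE
FROM MARKETDATAHBASE
WHERE PRICE = '51.20';         -- non-key filter: full table scan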


In all probability you would be better off using Apache Phoenix on top of
HBase, with Phoenix secondary indexes. Granted, the SQL capability in
Phoenix may not be as rich as Hive's, but it should do for most purposes.


In Phoenix you can do:



CREATE TABLE MARKETDATAHBASE (PK VARCHAR PRIMARY KEY, PRICE_INFO.TICKER
VARCHAR, PRICE_INFO.TIMECREATED VARCHAR, PRICE_INFO.PRICE VARCHAR);
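A secondary index is then a one-liner (a sketch; the index name and the
INCLUDE list are made up):

CREATE INDEX IDX_TICKER ON MARKETDATAHBASE (PRICE_INFO.TICKER)
    INCLUDE (PRICE_INFO.PRICE);

With that in place, a query filtering on TICKER can be answered from the
index table instead of a full scan of the base table.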



HTH,

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



Uli Bethke
2017-06-07 16:40:33 UTC
Permalink
Why are you thinking of using HBase?

Just store the SCD versions in a normal Hive dimension table. If you are
worried about updates to columns such as 'valid to' and 'latest record
indicator', you can calculate these on the fly using window functions (see
the sketch below); there is no need to create and update them physically.
You can read more about it here:
https://sonra.io/2017/05/15/dimensional-modeling-and-kimball-data-marts-in-the-age-of-big-data-and-hadoop/
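As a minimal sketch of that idea, assume a hypothetical customer_dim table
holding one row per customer version with a valid_from timestamp. The
derived columns can then be computed at query time:

SELECT customer_id,
       customer_name,
       valid_from,
       -- 'valid to' is simply the start of the next version
       LEAD(valid_from) OVER (PARTITION BY customer_id
                              ORDER BY valid_from) AS valid_to,
       -- latest-record indicator: no later version exists
       CASE WHEN LEAD(valid_from) OVER (PARTITION BY customer_id
                                        ORDER BY valid_from) IS NULL
            THEN 'Y' ELSE 'N' END AS latest_record
FROM customer_dim;

The base table stays append-only: a new version is just a new row, and no
UPDATE is ever needed.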
--
___________________________
Uli Bethke
CEO Sonra
p: +353 86 32 83 040
w: www.sonra.io
l: linkedin.com/in/ulibethke
t: twitter.com/ubethke
s: uli.bethke

Chair Hadoop User Group Ireland
www.hugireland.org
Associate President DAMA Ireland
Michael Segel
2017-06-09 09:38:10 UTC
Permalink
The pro is that you have the ability to update a table without having to worry about duplicate rows. Tez does a form of compaction for you that already exists natively in HBase.

The cons:

1) It's slower. Reads from HBase carry more overhead than just reading a file. Read Lars George's book on what takes place when you do a read.

2) HBase is not a relational store. (You have to think about what that implies)

3) You need to query against your row key for best performance, otherwise it will always be a complete table scan.

HBase was designed to give you fast access via direct get() calls and limited range scans; otherwise you have to perform full table scans. This means that unless you're able to do a range scan, your full table scan will be slower than the same scan over a flat file set. Again, the main reason you would want to use HBase is if your data set is mutable.

You also have to trigger a range scan when you write your Hive query, and you have to make sure that you're querying off your row key.

HBase was designed as a <key,value> store, plain and simple. If you don't use the key, you have to do a full table scan. So even though the data is partitioned on row key, you never use your partitions. However, in Hive or Spark you can create an alternative partition pattern (e.g. your key is the transaction_id, yet you partition on the month/year portion of the transaction_date).

You can speed things up a little by using an inverted table as a secondary index (a sketch follows below). However, this assumes that you are willing to use joins. If you have a single base table with no joins, then you can limit your range scans by making sure you are querying against the row key. Note: this will mean that you have limited querying capabilities.
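To make the inverted-table idea concrete, here is a rough sketch reusing
the MARKETDATAHBASE example from earlier in the thread (the index table,
its column family and the ticker value are all made up; in practice the
index rowkey would need the base key appended to stay unique). The index
table's rowkey is the attribute you search on and its value is the base
table's rowkey, so both sides of the join are rowkey-driven:

CREATE EXTERNAL TABLE MARKETDATA_BY_TICKER (ticker STRING, base_key STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,IDX:BASE_KEY")
TBLPROPERTIES ("hbase.table.name" = "MARKETDATA_BY_TICKER");

SELECT m.*
FROM MARKETDATA_BY_TICKER i
JOIN MARKETDATAHBASE m ON m.key = i.base_key
WHERE i.ticker = 'ORCL';

Keeping the index table in sync with the base table on every write is your
responsibility.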

And yes, I’ve done this before but can’t share it with you.

HTH

P.S.
I haven't tried Hive queries where you have what would be the equivalent of a get().

In earlier versions of Hive, the issue was that "SELECT * FROM foo WHERE
rowkey = 'BAR'" would still do a full table scan because of the lack of
predicate pushdown. This may have been fixed in later releases of Hive;
that would be your test case. If there is predicate pushdown, then you will
be faster, assuming that the query triggers an implied range scan.
This would be a simple thing to test. However, keep in mind that you're
going to generate a map/reduce job (unless you're using a query engine like
Tez) where you wouldn't if you just wrote the same lookup directly in Java.
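One way to run that test case (table and column names are made up) is to
look at the query plan:

EXPLAIN SELECT * FROM foo WHERE rowkey = 'BAR';

If the key predicate shows up as a filterExpr on the TableScan operator, it
is being pushed down to the storage handler; if it only appears in a
separate Filter operator above the scan, Hive is filtering after a full
scan.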
Amey Barve
2017-06-09 09:50:03 UTC
Permalink
Hi Michael,

"If there is predicate pushdown, then you will be faster, assuming that the
query triggers an implied range scan"
---> Does this bring results faster than plain hive querying over ORC /
Text file formats

In other words Is querying over plain hive (ORC or Text) *always* faster
than through HiveStorageHandler?

Regards,
Amey
Edward Capriolo
2017-06-09 12:14:22 UTC
Permalink
Think about it like this: one system is scanning a local ORC file, while the
other is using an HBase scanner (over the network) and scanning the data in
HBase's SSTable-like file format.
Michael Segel
2017-06-09 12:35:30 UTC
Permalink
No.

First, I apologize for my first response. I guess it's never a good idea to check email at 4:00 in the morning before your first cup of coffee. ;-)
I went into a bit more detail than necessary, which may have confused the issue.

To answer your question:

In other words, is querying over plain Hive (ORC or Text) always faster than through HiveStorageHandler?
No.

Not always. It will depend on the data, the schema and the query.

HBase is a <KEY, VALUE> store, where the KEY is the rowkey.
HBase partitions its data based on the rowkey.

The rows are stored sorted lexicographically by rowkey, which in effect gives you a physical index.

With respect to Hive, if the query uses a filter against the rowkey, the resulting query will perform a range scan.

So

SELECT *
FROM someTable
WHERE rowkey > aValue
AND rowkey < bValue

This query will result in a range scan filter, and depending on the values of aValue and bValue, you should be able to exclude a portion of your table.

If you were to store the data in an HDFS table, the odds are that your partition plan would not be on the rowkey, so your Hive query against this table would not be able to exclude data (see the sketch below). Depending on how much data could have been excluded, you could be slower than a query against HBase.
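As a sketch of that contrast (hypothetical schema): an ORC table partitioned
by month cannot prune anything for a filter on the transaction id, while the
same filter against an HBase table keyed on txn_id is a single rowkey
lookup.

CREATE TABLE transactions_orc (txn_id STRING, amount DOUBLE)
PARTITIONED BY (txn_month STRING)
STORED AS ORC;

-- No partition can be pruned; every partition is read:
SELECT * FROM transactions_orc WHERE txn_id = 'T12345';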

There are some performance tuning tips that would help reduce the cost of the query.


Does this make sense?

Even though HBase could be slower, there are some reasons why you may want to use HBase over ORC or Parquet.

The problem is that there isn't a straightforward and simple answer. There are a lot of factors that go into deciding which tools to use.
(e.g. if you're a Cloudera Fanboi, you'd use Impala as your query engine; Hortonworks would push Tez; and MapR is a bit more agnostic.)

And then you have to decide if you want to use Hive, or a query engine or something different altogether.

————————————
====================
————————————

HAVING SAID THAT

Depending on your query, you may want to consider secondary indexing.

When you do that, HBase could become faster. Again, it depends on the query and the data.

HBase takes care of deduping data; Hive does not. So unless your data set is
immutable (new rows = new data … no updates …), you will have to figure out
how to dedupe your data, which is outside of Hive.
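If you do keep such data in Hive as an append-only table, one common
workaround (a sketch with made-up names) is to resolve duplicates at read
time with a window function, approximating HBase's keep-the-latest-version
behaviour:

SELECT key, col1, col2
FROM (
    SELECT key, col1, col2,
           ROW_NUMBER() OVER (PARTITION BY key ORDER BY load_ts DESC) AS rn
    FROM raw_updates
) t
WHERE rn = 1;

The difference is that every query pays this windowing cost, whereas HBase
absorbs it during compaction.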

Note that this falls outside the scope of this discussion. The OP’s question is about a base table and doesn’t seem to involve any joins.

HTH

-Mike




Michael Segel
2017-06-09 12:46:12 UTC
Permalink
Sorry, I need to send this via the right email address.

Begin forwarded message:

From: Michael Segel <***@segel.com>
Subject: Re: Pros and Cons of using an HBase table as an external table in Hive
Date: June 9, 2017 at 7:37:22 AM CDT
To: ***@hive.apache.org

Hey Edward,

Yes, that's the gist of it.
However … if you can exclude data … your query in HBase could be faster.
Having said that …

I should have included hardware in the equation … and data locality could
come into play … but that really would confuse the issue and the OP even
more. ;-)


-Mike
