Discussion:
ROW_NUMBER() equivalent in Hive
kumar mr
2013-02-21 07:33:16 UTC
Permalink
Hi,


This is Kumar, and this is my first question in this group.


I have a requirement to implement ROW_NUMBER() from Teradata in Hive where partitioning happens on multiple columns along with multiple column ordering.
It can be easily implemented in Hadoop MR, but I have to do in Hive. By doing in UDF can assign same rank to grouping key considering dataset is small, but ordering need to be done in prior step.
Can we do this in lot simpler way?


Thanks in advance.


Regards,
Kumar
Owen O'Malley
2013-02-21 16:08:01 UTC
Permalink
What are the semantics for ROW_NUMBER? Is it a global row number? Per a
partition? Per a bucket?

-- Owen
Post by kumar mr
Hi,
This is Kumar, and this is my first question in this group.
I have a requirement to implement ROW_NUMBER() from Teradata in Hive
where partitioning happens on multiple columns along with multiple column
ordering.
It can be easily implemented in Hadoop MR, but I have to do in Hive. By
doing in UDF can assign same rank to grouping key considering dataset is
small, but ordering need to be done in prior step.
Can we do this in lot simpler way?
Thanks in advance.
Regards,
Kumar
Ashutosh Chauhan
2013-02-21 16:17:35 UTC
Permalink
Kumar,

If you are willing to be on bleeding edge, this and many other partitioning
and windowing functionality some of us are developing in a branch over at:
https://svn.apache.org/repos/asf/hive/branches/ptf-windowing
Check out this branch, build hive and than you can have row_number()
functionality. Look in
ql/src/test/queries/clientpositive/ptf_general_queries.q which has about 60
or so example queries demonstrating various capabilities which we have
already working (including row_number).
We hope to have this branch merged in trunk soon.

Hope it helps,
Ashutosh
Post by kumar mr
Hi,
This is Kumar, and this is my first question in this group.
I have a requirement to implement ROW_NUMBER() from Teradata in Hive
where partitioning happens on multiple columns along with multiple column
ordering.
It can be easily implemented in Hadoop MR, but I have to do in Hive. By
doing in UDF can assign same rank to grouping key considering dataset is
small, but ordering need to be done in prior step.
Can we do this in lot simpler way?
Thanks in advance.
Regards,
Kumar
Stephen Boesch
2013-02-21 20:17:23 UTC
Permalink
Hi Ashutosh,
I am interested / reviewing your windowing feature. Can you be more
specific about which (a) tests and (b) src files constitute your additions
(there are lots of files there ;) )

thanks

stephen boesch
Post by kumar mr
Kumar,
If you are willing to be on bleeding edge, this and many other
partitioning and windowing functionality some of us are developing in a
https://svn.apache.org/repos/asf/hive/branches/ptf-windowing
Check out this branch, build hive and than you can have row_number()
functionality. Look in
ql/src/test/queries/clientpositive/ptf_general_queries.q which has about 60
or so example queries demonstrating various capabilities which we have
already working (including row_number).
We hope to have this branch merged in trunk soon.
Hope it helps,
Ashutosh
Post by kumar mr
Hi,
This is Kumar, and this is my first question in this group.
I have a requirement to implement ROW_NUMBER() from Teradata in Hive
where partitioning happens on multiple columns along with multiple column
ordering.
It can be easily implemented in Hadoop MR, but I have to do in Hive. By
doing in UDF can assign same rank to grouping key considering dataset is
small, but ordering need to be done in prior step.
Can we do this in lot simpler way?
Thanks in advance.
Regards,
Kumar
Ashutosh Chauhan
2013-02-22 01:44:10 UTC
Permalink
Hi Stephen,

As I indicated in my previous email, check out file ql/src/test/queries/
clientpositive/ptf_general_queries.q it has plenty of example queries
demonstrating the functionality which is available. If you are interested
in hive src changes which has enabled this feature.. you may want to start
by looking at a patch attached on HIVE-896 which was the starting point for
this work. That jira also has links with other jira which we did /are doing
on top of that patch.

Hope it helps,
Ashutosh
Post by Stephen Boesch
Hi Ashutosh,
I am interested / reviewing your windowing feature. Can you be more
specific about which (a) tests and (b) src files constitute your additions
(there are lots of files there ;) )
thanks
stephen boesch
Post by kumar mr
Kumar,
If you are willing to be on bleeding edge, this and many other
partitioning and windowing functionality some of us are developing in a
https://svn.apache.org/repos/asf/hive/branches/ptf-windowing
Check out this branch, build hive and than you can have row_number()
functionality. Look in
ql/src/test/queries/clientpositive/ptf_general_queries.q which has about 60
or so example queries demonstrating various capabilities which we have
already working (including row_number).
We hope to have this branch merged in trunk soon.
Hope it helps,
Ashutosh
Post by kumar mr
Hi,
This is Kumar, and this is my first question in this group.
I have a requirement to implement ROW_NUMBER() from Teradata in Hive
where partitioning happens on multiple columns along with multiple column
ordering.
It can be easily implemented in Hadoop MR, but I have to do in Hive. By
doing in UDF can assign same rank to grouping key considering dataset is
small, but ordering need to be done in prior step.
Can we do this in lot simpler way?
Thanks in advance.
Regards,
Kumar
kumar mr
2013-02-21 19:12:45 UTC
Permalink
Owen,


it's for entire table. the sample TD query looks like below,



SELECT
columnA
,columnB
, columnC
, columnD
, columnX
,ROW_NUMBER() OVER (PARTITION BY columnA, columnB, columnC ORDER BY columnX DESC, columnY DESC) AS rank
FROM table a


Regards,
Kumar





-----Original Message-----
From: Owen O'Malley <omalley-1oDqGaOF3Lkdnm+***@public.gmane.org>
To: user <user-***@public.gmane.org>
Sent: Thu, Feb 21, 2013 8:08 am
Subject: Re: ROW_NUMBER() equivalent in Hive


What are the semantics for ROW_NUMBER? Is it a global row number? Per a partition? Per a bucket?


-- Owen




On Wed, Feb 20, 2013 at 11:33 PM, kumar mr <kumar.mr-***@public.gmane.org> wrote:

Hi,


This is Kumar, and this is my first question in this group.


I have a requirement to implement ROW_NUMBER() from Teradata in Hive where partitioning happens on multiple columns along with multiple column ordering.
It can be easily implemented in Hadoop MR, but I have to do in Hive. By doing in UDF can assign same rank to grouping key considering dataset is small, but ordering need to be done in prior step.
Can we do this in lot simpler way?


Thanks in advance.


Regards,
Kumar
Loading...