Discussion:
How to determine the memory usage of select and join in Hive on Spark?
诺铁
2015-01-23 07:29:51 UTC
hi,

When I join several tables and write the result to another table, the job
runs very slowly. From the worker log and the Spark UI, I can see that a lot
of time is spent in GC.

The input tables are not very big. Their sizes are:
84M
705M
2.7G
2.4M
573M

The resulting output is about 1.5 GB.
The worker is given 70 GB of memory (there is only 1 worker), and I set Spark
to use Kryo. I don't understand why there is so much GC activity, which makes
the job very slow.

When using the Spark core API, I can call RDD.cache() and then watch how much
memory the RDD uses. In Hive on Spark, is there any way to profile memory
usage?
Xuefu Zhang
2015-01-24 15:03:24 UTC
Hi,

Since you have only one worker, you should be able to use jmap to get a
dump of the worker process. In Hive, you can configure the memory usage for
join.
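
The reply does not name the settings it has in mind. As an assumption only, the map-join thresholds below are examples of Hive properties that influence how much data a join tries to hold in memory; the values shown are illustrative, not recommendations.

```sql
-- Assumed examples: the reply above does not say which settings it means.
-- Allow Hive to convert joins against small tables into map joins.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
-- Combined size in bytes of the small tables that may be held in memory
-- for a single map join (illustrative value).
SET hive.auto.convert.join.noconditionaltask.size=10000000;
```

Lowering the size threshold makes Hive fall back to a shuffle join sooner, which trades speed for a smaller in-memory footprint.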

As to the slowness and hive GC you observed, I'm thinking this might have
to do with your query. Could you share it?

Thanks,
Xuefu
Sanjay Subramanian
2015-01-25 16:02:17 UTC
hey guys 

This is the Hive table definition I created based on the JSON. I am using this version of the Hive JSON SerDe: https://github.com/rcongiu/Hive-JSON-Serde

ADD JAR /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
DROP TABLE IF EXISTS datafeed_json;
CREATE EXTERNAL TABLE IF NOT EXISTS datafeed_json (
  object STRING,
  entry array
    <struct
      <id:STRING,
       time:BIGINT,
       changes:array
         <struct
           <field:STRING,
            value:struct
              <item:STRING,
               verb:STRING,
               parent_id:STRING,
               sender_id:BIGINT,
               created_time:BIGINT>>>>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE
LOCATION '/data/sanjay/datafeed';

QUERY 1
=======
ADD JAR /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
SELECT
  object,
  entry[0].id,
  entry[0].time,
  entry[0].changes[0].field,
  entry[0].changes[0].value.item,
  entry[0].changes[0].value.verb,
  entry[0].changes[0].value.parent_id,
  entry[0].changes[0].value.sender_id,
  entry[0].changes[0].value.created_time
FROM
  datafeed_json;

RESULT1
======
foo123  113621765320467 1418608223 leads song1 rock 113621765320467_1107142375968396 100004748082019 1418608223

QUERY2
======
ADD JAR /home/sanjay/mycode/jar/jsonserde/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
SELECT
  object,
  entry.id,
  entry.time,
  ntry
FROM
  datafeed_json
LATERAL VIEW EXPLODE
  (datafeed_json.entry.changes) oc1 AS ntry;

RESULT2
=======
This gives 4 rows but I was not able to iteratively do the LATERAL VIEW EXPLODE

I tried various combinations of LATERAL VIEW, LATERAL VIEW EXPLODE, and json_tuple to extract all fields from the JSON as an exploded view in tab-separated format, but had no luck.
Any thoughts?


Thanks
sanjay  
Edward Capriolo
2015-01-25 16:11:32 UTC
Nested lists require nested lateral views.
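
As a rough sketch of what nested lateral views could look like for the datafeed_json table from the question: the outer array (entry) is exploded first, and the inner array (changes) is exploded from each resulting element. The aliases d, e, c, entry_item and change_item are made up for illustration, and the query has not been run against the poster's data.

```sql
-- Sketch only: assumes the datafeed_json table defined in the question.
-- Aliases d, e, c, entry_item and change_item are hypothetical.
SELECT
  d.object,
  entry_item.id,
  entry_item.time,
  change_item.field,
  change_item.value.item,
  change_item.value.verb,
  change_item.value.parent_id,
  change_item.value.sender_id,
  change_item.value.created_time
FROM datafeed_json d
-- First level: one row per element of the entry array.
LATERAL VIEW EXPLODE(d.entry) e AS entry_item
-- Second level: one row per element of that entry's changes array.
LATERAL VIEW EXPLODE(entry_item.changes) c AS change_item;
```

Each output row then corresponds to one (entry, change) pair, so the struct fields can be selected directly instead of through fixed indexes such as entry[0].changes[0].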

Sanjay Subramanian
2015-01-25 16:25:06 UTC
Thanks Ed. Let me try a few more iterations. Somehow I am not doing this correctly :-) 
regards
sanjay

From: Edward Capriolo <***@gmail.com>
To: "***@hive.apache.org" <***@hive.apache.org>; Sanjay Subramanian <***@yahoo.com>
Sent: Sunday, January 25, 2015 8:11 AM
Subject: Re: Hive JSON Serde question

Nested lists require nested lateral views.



丁桂涛(桂花)
2015-01-26 00:45:47 UTC
Try the get_json_object UDF. No iterations needed. :)
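
A minimal sketch of how get_json_object might be used here. It takes a JSON string and a path, so this assumes a hypothetical table datafeed_raw with a single STRING column named json holding each raw record; the datafeed_json table from the question is already parsed by the SerDe and would not be used this way.

```sql
-- Sketch only: datafeed_raw and its STRING column json are hypothetical.
SELECT
  get_json_object(r.json, '$.object')                         AS object,
  get_json_object(r.json, '$.entry[0].id')                    AS entry_id,
  get_json_object(r.json, '$.entry[0].changes[0].field')      AS field,
  get_json_object(r.json, '$.entry[0].changes[0].value.item') AS item
FROM datafeed_raw r;
```

Note that fixed indexes like [0] pull out a single element per row; this does not by itself produce one row per array element the way LATERAL VIEW EXPLODE does.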

Sanjay Subramanian
2015-01-26 02:54:08 UTC
sure will try get_json_object
thank u
regards
sanjay
Ari Flink
2015-01-26 18:29:56 UTC
unsubscribe

Martin, Nick
2015-01-26 18:31:57 UTC
Hi Ari,

Please send an email to user-***@hive.apache.org

Thanks!
Nick
