Discussion:
Issue using Hive dynamic partitions on larger tables
Bejoy Ks
2011-06-16 16:34:39 UTC
Hi Hive Experts,
I'm facing an issue while using Hive dynamic partitions on larger tables. I
tried out dynamic partitions on smaller tables and they worked fine, but
unfortunately when I tried the same on a larger table, the MapReduce job
terminates with the following error:

2011-06-16 12:14:28,592 Stage-1 map = 74%, reduce = 0%
[Fatal Error] total number of created files exceeds 100000. Killing the job.
Ended Job = job_201106061630_0536 with errors
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask

I tried setting the parameter hive.max.created.files to a larger value, but I
still get the same error:
hive> set hive.max.created.files=500000;
The same error, 'total number of created files exceeds 100000', was thrown even
after I changed the value to 500000. I suspect that either the value I set for
the config parameter is not taking effect, or I am setting the wrong parameter
for this issue. Please advise.

The other parameters I set in the Hive CLI for dynamic partitions are:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=300;

The HiveQL query I used for the dynamic partition insert is:
INSERT OVERWRITE TABLE parameter_part PARTITION(location)
SELECT p.seq_id,p.lead_id,p.arr_datetime,p.computed_value,
p.del_date,p.location FROM parameter_def p;

Please help me resolve this.

Thank You.

Regards
Bejoy.K.S
Steven Wong
2011-06-18 01:24:34 UTC
The name of the parameter is actually hive.exec.max.created.files. The wiki has a typo, which I'll fix.
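For example, with the corrected name, your earlier setting should look like
this (reusing the 500000 target from your mail):

hive> set hive.exec.max.created.files=500000;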


Bejoy Ks
2011-06-20 14:57:16 UTC
Thanks Steven. That got me past the first error, but another one pops up when
I try dynamic partitions on larger tables. I have implemented the same on
smaller tables using the approach mentioned below, but somehow it fails for
larger tables.

My larger source table (parameter_def) contains 5 billion rows, which I
Sqooped into Hive from a DWH. When I try implementing the dynamic partition
on it with the query
INSERT OVERWRITE TABLE parameter_part PARTITION(location)
SELECT p.seq_id,p.lead_id,p.arr_datetime,p.computed_value,
p.del_date,p.location FROM parameter_def p;
There are two MapReduce jobs triggered, and the first one now runs to
completion after setting
hive.exec.max.created.files=150000;
But the second job fails outright, without even running. The error log is
given below.

From the PuTTY console:
2011-06-20 10:40:13,348 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201106061630_0937
Ended Job = 1659539584, job is filtered out (removed at runtime).
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201106061630_0938, Tracking URL =
http://********.com:50030/jobdetails.jsp?jobid=job_201106061630_0938
Kill Command = /usr/lib/hadoop/bin/hadoop job
-Dmapred.job.tracker=********.com:8021 -kill job_201106061630_0938
2011-06-20 10:42:51,914 Stage-3 map = 100%, reduce = 100%
Ended Job = job_201106061630_0938 with errors
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask

From the Hive log file:
2011-06-20 10:41:02,293 WARN mapred.JobClient
(JobClient.java:copyAndConfigureFiles(649)) - Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
2011-06-20 10:42:51,917 ERROR exec.MapRedTask
(SessionState.java:printError(343)) - Ended Job = job_201106061630_0938 with
errors
2011-06-20 10:42:51,938 ERROR ql.Driver (SessionState.java:printError(343)) -
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask


The Hadoop and Hive versions I'm using are as follows:
Hadoop version - Hadoop 0.20.2-cdh3u0
Hive version - Hive 0.7 (lib/hive-hwi-0.7.0-cdh3u0.war)

Please help me figure out what is going wrong with my implementation.

Thank You

Regards
Bejoy.K.S
Bejoy Ks
2011-06-21 12:26:55 UTC
Hey Guys,
I was able to resolve this by grouping and distributing records across
reducers using DISTRIBUTE BY. My modified query is as follows:

FROM parameter_def p
INSERT OVERWRITE TABLE parameter_part PARTITION(location)
SELECT p.seq_id,p.lead_id,p.arr_datetime,p.computed_value,p.del_date,p.location
DISTRIBUTE BY location;

With this query the entire job worked like a charm. If there are any better
implementations for similar scenarios, please do share.
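
For the archives, my understanding of why this works: without DISTRIBUTE BY,
the insert runs as a map-only job, so every mapper can open a file for every
partition value it sees, and the file count grows roughly as mappers times
partitions. With DISTRIBUTE BY location, all rows for a given location go to
the same reducer, so each partition is written by a single reducer and the
total file count stays near the number of reducers. A rough sketch of the full
session, reusing the settings from earlier in this thread:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=300;
set hive.exec.max.created.files=150000;

FROM parameter_def p
INSERT OVERWRITE TABLE parameter_part PARTITION(location)
SELECT p.seq_id,p.lead_id,p.arr_datetime,p.computed_value,p.del_date,p.location
DISTRIBUTE BY location;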

Thank You

Regards
Bejoy.KS