Discussion:
SCRATCH directory renames on S3 for a Hive table taking too much time
Rachit Chauhan
2018-11-05 09:43:22 UTC
Hi everyone,

I am running a Pig STORE command to write to a Hive table (an external table whose location is on S3) using "HCatStorer" on AWS EMR.

The Pig command looks like this:
STORE return_data INTO 'TABLE_NAME' USING
org.apache.hive.hcatalog.pig.HCatStorer('part_column=20180910');
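
For context, the script is essentially just a load followed by this store. A minimal sketch of what it looks like (the input path, loader, and schema below are simplified placeholders, not my real ones; the script is launched with "pig -useHCatalog" so the HCatalog jars are on the classpath):

-- placeholder input; the real script uses a different loader and schema
return_data = LOAD 's3://my_bucket/input/' USING PigStorage('\t')
              AS (id:chararray, value:chararray);

-- write into the Hive partition part_column=20180910 through HCatalog
STORE return_data INTO 'TABLE_NAME' USING
org.apache.hive.hcatalog.pig.HCatStorer('part_column=20180910');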


I have already set the following in hive-site.xml:

"hive.exec.stagingdir" -> /tmp/hive/
"hive.exec.scratchdir" -> /tmp/hive/
"hive.blobstore.optimizations.enabled" -> false

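Spelled out as standard Hadoop-style property entries (same values as above), the relevant part of hive-site.xml is:

<property>
  <name>hive.exec.stagingdir</name>
  <value>/tmp/hive/</value>
</property>
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive/</value>
</property>
<property>
  <name>hive.blobstore.optimizations.enabled</name>
  <value>false</value>
</property>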


But despite these settings, _SCRATCH directories are still being created on S3, and a rename pass runs at the end, causing huge latency before a simple STORE command finishes.
Example rename operations from the logs:
2018-11-05 06:02:50,447 INFO [CommitterEvent Processor #2]
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem: rename
s3://my_bucket/test_1/mytable/_SCRATCH0.17310537143764304/part_column=20180910/part-m-00102
s3://my_bucket/test_1/mytable/part_column=20180910/part-m-00102
2018-11-05 06:02:52,512 INFO [CommitterEvent Processor #2]
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem: rename
s3://my_bucket/test_1/mytable/_SCRATCH0.17310537143764304/part_column=20180910/part-m-00103
s3://my_bucket/test_1/mytable/part_column=20180910/part-m-00103
2018-11-05 06:02:54,495 INFO [CommitterEvent Processor #2]
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem: rename
s3://my_bucket/test_1/mytable/_SCRATCH0.17310537143764304/part_column=20180910/part-m-00104
s3://my_bucket/test_1/mytable/part_column=20180910/part-m-00104


This "rename" copies each file from the temporary _SCRATCH directory to its final location: S3 has no real rename, so S3NativeFileSystem implements it as a copy followed by a delete of the source.
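
To put numbers on it: the log timestamps above show roughly 2 seconds per part file (06:02:50 -> 06:02:52 -> 06:02:54), and assuming the part numbering starts at part-m-00000 there are at least 105 files, so this commit phase alone costs on the order of 105 x 2 s, about 3.5 minutes, apparently running sequentially on a single CommitterEvent thread.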


My setup versions are:
Hadoop -> Hadoop 2.7.3-amzn-3
HIVE -> Hive 2.3.0-amzn-0
PIG -> Apache Pig version 0.16.0-amzn-1 (r: unknown)


Other properties I am setting in the Pig script are:

set mapred.output.direct.NativeS3FileSystem false;
set mapred.output.direct.EmrFileSystem false;
set pig.SplitCombination true;
set pig.maxCombinedSplitSize 128000000;
set mapred.output.compress true;
set mapred.output.compression.codec com.hadoop.compression.lzo.LzoCodec;
set io.sort.mb 800;
set mapreduce.task.io.sort.mb 800;
set io.sort.factor 200;
set io.sort.record.percent .05;
set mapreduce.job.counters.max 1000;
set pig.exec.reducers.bytes.per.reducer 256000000;
set pig.exec.reducers.max 300;
set mapred.job.map.memory.mb 1536;
set mapred.job.reduce.memory.mb 1536;
set mapreduce.map.java.opts -Xmx1228m;
set mapreduce.reduce.java.opts -Xmx1228m;
set mapreduce.fileoutputcommitter.algorithm.version 2;
set hive.blobstore.use.blobstore.as.scratchdir false;
set hive.mv.files.thread 100;
set hive.exec.stagingdir /tmp/hive/;
set hive.exec.scratchdir /tmp/hive/;
set pig.temp.dir /tmp/pig;

Thanks
Rachit
