Discussion:
Hive LLAP with Parquet format
Nar Kumar Chhantyal
2017-05-04 08:55:07 UTC
Permalink
Hi everyone,

I posted a question on SO:
http://stackoverflow.com/questions/43771050/hive-llap-doesnt-work-with-parquet-format
but it didn't get any love, so I am posting here.

Basically, I have large IoT data stored in Parquet format. I want to enable
faster access to this data. I started Azure HDInsight with LLAP enabled.
After trying different settings, it doesn't seem to work. I now
suspect it's probably because of the underlying file format.

Does Hive LLAP work with Parquet format as well?
--
Nar-Kumar Chhantyal
Gopal Vijayaraghavan
2017-05-04 19:57:36 UTC
Permalink
Hi,
Post by Nar Kumar Chhantyal
Does Hive LLAP work with Parquet format as well?
LLAP does work with the Parquet format, but it is not very fast, because the Java Parquet reader is slow.

https://issues.apache.org/jira/browse/PARQUET-131
+

https://issues.apache.org/jira/browse/HIVE-14826
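
(As a side note for anyone debugging this: before blaming the file format, it's worth confirming that queries are actually running inside the LLAP daemons at all. The settings below are standard Hive properties; the table name is illustrative, not from this thread:)

```sql
-- Route query work to the LLAP daemons
-- (hive.llap.execution.mode values: none, map, all, only, auto)
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;

-- The EXPLAIN plan should show "llap" as the vertex execution mode;
-- "container" means the work is running outside the daemons.
EXPLAIN SELECT count(*) FROM iot_events WHERE device_id = 42;
```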

In particular to your question, Parquet's columnar data reads haven't been optimized for Azure/S3/GCS.

There was a comparison of ORC vs Parquet for NYC taxi data and it found that for simple queries Parquet read ~4x more data over the network - your problem might be bandwidth related.

You might want to convert a small amount to ORC and see whether the BYTES_READ drops or not.
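
A sketch of that conversion, assuming a date-partitioned source table (table and column names here are illustrative):

```sql
-- Copy a small slice of the Parquet table into an ORC table for comparison
CREATE TABLE iot_events_orc STORED AS ORC
AS SELECT * FROM iot_events_parquet WHERE dt = '2017-05-01';

-- Run the same query against both tables and compare the
-- BYTES_READ counter reported in the job summary after each run
SELECT count(*) FROM iot_events_orc WHERE device_id = 42;
```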

In my tests with a recent LLAP, Text data was faster on LLAP on S3 & Azure than Parquet, because Text has a vectorized reader & cache support.


Cheers,

Gopal
Edward Capriolo
2017-05-04 23:28:38 UTC
Permalink
The Parquet vs. ORC thing has to be the biggest detractor. You're forced to choose
between a format good for Impala or good for Hive.
Post by Gopal Vijayaraghavan
Hi,
Post by Nar Kumar Chhantyal
Does Hive LLAP work with Parquet format as well?
LLAP does work with the Parquet format, but it does not work very fast,
because the java Parquet reader is slow.
https://issues.apache.org/jira/browse/PARQUET-131
+
https://issues.apache.org/jira/browse/HIVE-14826
In particular to your question, Parquet's columnar data reads haven't been
optimized for Azure/S3/GCS.
There was a comparison of ORC vs Parquet for NYC taxi data and it found
that for simple queries Parquet read ~4x more data over the network - your
problem might be bandwidth related.
You might want to convert a small amount to ORC and see whether the
BYTES_READ drops or not.
In my tests with a recent LLAP, Text data was faster on LLAP on S3 & Azure
than Parquet, because Text has a vectorized reader & cache support.
Cheers,
Gopal