Discussion:
ORC vs TEXT file
pandees waran
2013-08-12 10:39:27 UTC
Hi,

Currently, we use the TEXTFILE format in Hive 0.8 when creating
external tables for intermediate processing.
I have read about ORC in 0.11 and have created the same table in 0.11
with the ORC format.
Without any compression, the ORC table (3 files in total) occupied
twice the space of the TEXTFILE (a single file).
Also, when I query the data from ORC:

Select count(*) from orc_table

it took more time than the same query against the text file,
although the cumulative CPU time is lower for ORC than for the text file.

What sort of queries will benefit if we use ORC?
In which cases would TEXTFILE be preferred over ORC?

Thanks.
Edward Capriolo
2013-08-12 13:30:35 UTC
Columnar formats do not always beat row-wise storage. Many times gzip plus
block-level storage will compress data better than columnar storage,
especially when you have repeated data across different columns.

Based on what you are saying, it is possible that you missed a setting
and the ORC files are not compressed.
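[Editor's note: a minimal sketch of what such a setting looks like in Hive 0.11, using the `orc.compress` table property; the table and column names here are hypothetical, and ZLIB is the documented default when the property is not overridden:]

```sql
-- Hypothetical table: request ZLIB compression explicitly through the
-- orc.compress table property (valid values in 0.11: NONE, ZLIB, SNAPPY).
CREATE TABLE orc_table_zlib (
  id BIGINT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```

If the property was set to NONE at creation time, the files carry the format's structural overhead without any compression benefit.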
pandees waran
2013-08-12 13:52:45 UTC
Thanks Edward. I shall try compression with ORC and let you know. Also,
it looks like CPU usage is lower when querying ORC rather than the
text file.
But the total time taken by the query is slightly higher for ORC than for
the text file. Could you please explain the difference between cumulative
CPU time and the total time taken (usually shown in the last line, in
seconds)? Which one should we give preference to?
Owen O'Malley
2013-08-12 15:20:27 UTC
Pandees,
I've never seen a table that was larger with ORC than with text. Can you
share your text file's schema with us? Is the table very small? How many
rows and GB are the tables? The overhead for ORC is typically small, but as
Ed says, in rare cases it is possible for the overhead to dominate the data
size itself.

-- Owen
pandees waran
2013-08-12 16:27:03 UTC
Hi Owen,

Thanks for your response.

My structure is as follows:

a) Textfile:
CREATE EXTERNAL TABLE test_textfile (
COL1 BIGINT,
COL2 STRING,
COL3 BIGINT,
COL4 STRING,
COL5 STRING,
COL6 BIGINT,
COL7 BIGINT,
COL8 BIGINT,
COL9 BIGINT,
COl10 BIGINT,
COl11 BIGINT,
COL12 STRING,
COl13 STRING,
COl14 STRING,
COl15 BIGINT,
COl16 STRING,
COL17 DOUBLE,
COl18 DOUBLE,
COl19 DOUBLE,
COl20 DOUBLE,
COl21 DOUBLE,
COL22 DOUBLE,
COl23 DOUBLE,
COL24 DOUBLE,
COl25 DOUBLE,
COL26 DOUBLE,
COl27 DOUBLE,
COL28 DOUBLE,
COL29 DOUBLE,
COl30 DOUBLE,
COl31 DOUBLE,
COL32 DOUBLE,
COL33 STRING,
COl34 STRING,
COl35 DOUBLE,
COL36 DOUBLE,
COl37 DOUBLE,
COL38 DOUBLE,
COl39 DOUBLE,
COL40 DOUBLE,
COl41 DOUBLE,
COL42 DOUBLE,
COL43 DOUBLE,
COl44 DOUBLE,
COl45 DOUBLE,
COL46 DOUBLE,
COL47 DOUBLE,
COl48 DOUBLE,
COl49 DOUBLE,
COL50 DOUBLE,
COL51 DOUBLE,
COl52 DOUBLE,
COl53 DOUBLE,
COl54 DOUBLE,
COL55 DOUBLE,
COL56 STRING,
COL57 DOUBLE,
COL58 DOUBLE,
COL59 DOUBLE,
COl60 DOUBLE,
COl61 STRING,
COL62 STRING,
COL63 STRING,
COL64 STRING,
COl65 STRING,
COl66 STRING,
COl67 STRING,
COL68 STRING,
Col69 STRING,
COL70 STRING,
COL71 STRING,
COl72 STRING,
COl73 STRING,
COL74 STRING
) PARTITIONED BY (
COL75 STRING,
COL76 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 's3://test/textfile/';
Using block-level compression and the bzip2 codec for the output.

b) With the same set of columns, I just changed the storage clause to
STORED AS ORC for creating the ORC table, without any compression option.

c) Inserted 7,256,852 records into both tables.

d) Space occupied in S3:

Stored as ORC (3 files): 153.4 MB * 3 = 460.2 MB
TEXT (single file in bz2 format): 306 MB

I still need to check ORC with compression enabled.

Please let me know if I missed anything.
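[Editor's note: the block-level bzip2 output mentioned above is typically enabled with session settings along these lines; this is a sketch of the usual setup with the standard Hadoop codec class, not a verbatim copy of the session used:]

```sql
-- Session settings for block-level bzip2-compressed text output
-- (Hadoop 1.x-era property names, as used with Hive 0.8/0.11).
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
```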

Thanks,
--
Thanks,
Pandeeswaran
Owen O'Malley
2013-08-12 21:03:26 UTC
Ok, given the large number of doubles in the schema and bzip2 compression,
I can see why the text would be smaller.

ORC doesn't do compression on floats or doubles, although there is a jira
to do so. (https://issues.apache.org/jira/browse/HIVE-3889)

Bzip is a very aggressive compressor. We should probably add it as an
option with ORC for long-term storage files. (
https://issues.apache.org/jira/browse/HIVE-5067)

ORC will give you significant performance gains when you select a subset of
the columns.

select COL70, COL74 from test;

will run much faster on ORC than text files.
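[Editor's note: one way to re-run the size comparison with ORC compression enabled is a CTAS into a compressed ORC table; a sketch only, with a hypothetical target table name, assuming CTAS with STORED AS ORC is available in this Hive version:]

```sql
-- Hypothetical re-test: materialize the same rows as ZLIB-compressed ORC,
-- then compare on-disk sizes with the bzip2-compressed text table.
CREATE TABLE test_orc_zlib
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB")
AS
SELECT * FROM test_textfile;
```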
hongzhan li
2013-08-13 00:58:35 UTC
If you select all the columns, ORC will not be faster than a text file. But
if you select only some of the columns, not all of them, ORC will run faster.
pandees waran
2013-08-13 03:53:28 UTC
Thanks Owen and Hongzhan.
I understand the behavior now.
--
Thanks,
Pandeeswaran