Analyze table compute statistics on wide table taking too long

Discussion:

Roger Marin

2015-04-08 00:58:50 UTC

Hi,

I have a hive table with 300 columns that are all strings with around 180k
rows, when I run analyze table compute statistics it seems to be taking
about 40 minutes to complete regardless of the execution engine.

The table also has a large Regex serde.

I an running hive 0.13.0.

Any help would be greatly appreciated.

Regards,
Roger

Gopal Vijayaraghavan

2015-04-08 01:44:27 UTC

Permalink

Post by Roger Marin
The table also has a large Regex serde.

There are no stats fast paths for Regex SerDe.

The statistics computation is lifting each row into memory, parsing it and
throwing it away.

Most of your time would be spent in GC (check the GC time millis), due to
the huge expense of the Regex Serde.

For a direct comparison you could compute stats while turning it into
another format

set hive.stats.autogather=true;
create table tmp1 stored as orc as select * from oldtable;

Due to the nature of the columnar SerDes, that ETL would happen in
parallel to the compute stats off the same stream (i.e autogather).

That said, I have noticed performance issues with the RegexSerde, but
haven¹t bothered to fix it yet - maybe you¹d want to take a shot at fixing
it?

Cheers,
Gopal

Roger Marin

2015-04-08 02:58:20 UTC

Permalink

Hi Gopal,

Thanks for that.

I'm happy to look into improving the Regex serde performance, any tips on
where I should start looking?.

Regards,
Roger

Post by Gopal Vijayaraghavan

Post by Roger Marin
The table also has a large Regex serde.

There are no stats fast paths for Regex SerDe.
The statistics computation is lifting each row into memory, parsing it and
throwing it away.
Most of your time would be spent in GC (check the GC time millis), due to
the huge expense of the Regex Serde.
For a direct comparison you could compute stats while turning it into
another format
set hive.stats.autogather=true;
create table tmp1 stored as orc as select * from oldtable;
Due to the nature of the columnar SerDes, that ETL would happen in
parallel to the compute stats off the same stream (i.e autogather).
That said, I have noticed performance issues with the RegexSerde, but
havenÂ¹t bothered to fix it yet - maybe youÂ¹d want to take a shot at fixing
it?
Cheers,
Gopal

Gopal Vijayaraghavan

2015-04-08 21:59:15 UTC

Permalink

Post by Roger Marin
I'm happy to look into improving the Regex serde performance, any tips
on where I should start looking?.

There are three things off the top of my head.

First up, the matcher needs to be reused within a single scan. You can
also check the groupCount exactly once for a given pattern.

matcher.reset() offers performance benefits in the inner loop.

Second, Text does not implement CharSequence, which would be ideal to run
regex (zero-copy) over ASCII text (tblproperties, I guess).

Converting byte sequence to unicode points is mostly wasted CPU, I would
guess - Text::toString() is actually expensive.

This is not something I¹m entirely certain of, since java Regex might have
fast-paths for String classes - to be experimented with before fixing it.

A ByteCharSequence could technically be implemented for utf-8 as well
(using ByteBuffer::getChar() instead) - but a really fast path for 7 bit
ASCII is mostly where RegexSerde needs help.

Finally, column projection and SerDe StatsProvidingRecordReader.

There is no reason to deserialize all columns that show up in the original
DDL - compute stats only cares about row-count, but which is effectively
skipping ALL of what a RegexSerde does.

You can find out which columns are being read and only extract those
groups.

That is a combination of ColumnProjectionUtils.isReadAllColumns(conf) and
ColumnProjectionUtils.getReadColumnIDs() from the operator conf.

And in case no columns are being read (like in ANALYZE or count(1)), skip
the regex checker entirely, generating merely how many Text instances were
encountered in total.

Does all of that make sense?

I haven¹t seen too much use of the RegexSerde btw, which is why these were
generally left on the backburner (the perf problem is limited to a single
³create table² off it into ORC and use the vectorized filters for
performance).

Cheers,
Gopal

Continue reading on narkive:

Search results for 'Analyze table compute statistics on wide table taking too long' (Questions and Answers)

replies

What made Math most people's most hated subject? How do you feel about your Math classes?

started 2007-10-18 21:47:08 UTC

education & reference