Post by Roger Marin
I'm happy to look into improving the Regex serde performance, any tips
on where I should start looking?
There are three things off the top of my head.
First up, the Matcher needs to be reused within a single scan, instead of
creating a new one for every row. Calling matcher.reset() on the new input
offers performance benefits in the inner loop.
You can also check the groupCount exactly once for a given pattern.
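
A rough sketch of the reuse pattern, in case it helps. ReusableRegexRow
and parse() are made-up names for illustration, not the actual RegexSerDe
code; only the java.util.regex calls are real.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the actual RegexSerDe code.
final class ReusableRegexRow {
    private final Matcher matcher;   // allocated once, reused for every row
    private final int groupCount;    // looked up once per pattern

    ReusableRegexRow(String regex) {
        Pattern pattern = Pattern.compile(regex);
        this.matcher = pattern.matcher("");
        this.groupCount = matcher.groupCount();
    }

    /** Returns the captured groups for one row, or null if the row does not match. */
    String[] parse(CharSequence row) {
        matcher.reset(row);          // rebind the existing matcher to the new input
        if (!matcher.matches()) {
            return null;
        }
        String[] out = new String[groupCount];
        for (int i = 0; i < groupCount; i++) {
            out[i] = matcher.group(i + 1);   // group 0 is the whole match
        }
        return out;
    }
}
```

A Matcher is not thread-safe, so this assumes one instance per reader.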
Second, Text does not implement CharSequence. If it did, that would be
ideal for running the regex (zero-copy) over ASCII text (tblproperties, I
guess).
Converting the byte sequence to unicode code points is mostly wasted CPU, I
would guess - Text::toString() is actually expensive.
This is not something I'm entirely certain of, since java Regex might have
fast-paths for String classes - to be experimented with before fixing it.
A ByteCharSequence could technically be implemented for utf-8 as well
(using ByteBuffer::getChar() instead) - but a really fast path for 7 bit
ASCII is mostly where RegexSerde needs help.
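
Something along these lines is what I have in mind for the 7 bit ASCII
case. This is an untested-for-speed sketch; AsciiByteCharSequence is a
made-up name, and it assumes the caller already knows the bytes are ASCII
(each byte is widened to one char, so multi-byte utf-8 would come out
wrong).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical zero-copy view over a byte range; assumes single-byte (ASCII) data.
final class AsciiByteCharSequence implements CharSequence {
    private final byte[] bytes;
    private final int offset;
    private final int length;

    AsciiByteCharSequence(byte[] bytes, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > bytes.length) {
            throw new IndexOutOfBoundsException("bad range");
        }
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Widen one byte to one char; correct only for single-byte text.
        return (char) (bytes[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("bad subsequence");
        }
        // Shares the backing array; no bytes are copied here.
        return new AsciiByteCharSequence(bytes, offset + start, end - start);
    }

    @Override
    public String toString() {
        // A copy happens only when a String is actually requested (e.g. by group()).
        return new String(bytes, offset, length, StandardCharsets.ISO_8859_1);
    }
}
```

With a Text value this would wrap text.getBytes() from 0 to
text.getLength() and go straight into matcher.reset(). Whether it beats
the String path needs measuring, as noted above.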
Finally, column projection and SerDe StatsProvidingRecordReader.
There is no reason to deserialize all columns that show up in the original
DDL - compute stats only cares about row-count, which means it can
effectively skip ALL of what a RegexSerde does.
You can find out which columns are being read and only extract those
groups.
That is a combination of ColumnProjectionUtils.isReadAllColumns(conf) and
ColumnProjectionUtils.getReadColumnIDs(conf) from the operator conf.
And in case no columns are being read (like in ANALYZE or count(1)), skip
the regex matching entirely and merely count how many Text instances were
encountered in total.
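
Roughly, the projection part could look like the sketch below.
ProjectedRegexRow is a made-up name, and the readAllColumns flag and
readColumnIds list are plain parameters here so the example stands alone;
in the SerDe they would be filled in from the two ColumnProjectionUtils
calls above.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of projection-aware parsing; not the actual RegexSerDe code.
final class ProjectedRegexRow {
    private final Matcher matcher;
    private final int numColumns;
    private final int[] neededColumns;  // zero-based ids of the columns to extract
    private final boolean skipRegex;    // true when the query reads no columns

    ProjectedRegexRow(String regex, boolean readAllColumns, List<Integer> readColumnIds) {
        this.matcher = Pattern.compile(regex).matcher("");
        int n = matcher.groupCount();
        this.numColumns = n;
        int[] needed;
        if (readAllColumns) {
            needed = new int[n];
            for (int i = 0; i < n; i++) {
                needed[i] = i;
            }
        } else {
            // Keep only ids that map to a capture group.
            int count = 0;
            int[] tmp = new int[readColumnIds.size()];
            for (int id : readColumnIds) {
                if (id >= 0 && id < n) {
                    tmp[count++] = id;
                }
            }
            needed = java.util.Arrays.copyOf(tmp, count);
        }
        this.neededColumns = needed;
        this.skipRegex = needed.length == 0;
    }

    /** Returns one slot per column; columns that were not requested stay null. */
    String[] parse(CharSequence row) {
        String[] out = new String[numColumns];
        if (skipRegex) {
            return out;              // ANALYZE / count(1): the row is only counted
        }
        matcher.reset(row);
        if (!matcher.matches()) {
            return out;              // unmatched row: all columns null
        }
        for (int col : neededColumns) {
            out[col] = matcher.group(col + 1);  // group 0 is the whole match
        }
        return out;
    }
}
```

This assumes an unmatched row still counts as a row of nulls, so skipping
the match does not change the row-count.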
Does all of that make sense?
I haven't seen too much use of the RegexSerde btw, which is why these were
generally left on the backburner (the perf problem is limited to a single
"create table" off it into ORC, after which you use the vectorized filters
for performance).
Cheers,
Gopal