Quick Links

Re: speed up text_position() for utf-8

Lists:	pgsql-hackers

From:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject:	speed up text_position() for utf-8
Date:	2021-12-13 16:02:47
Message-ID:	CAFBsxsH1Yutrmu+6LLHKK8iXY+vG--Do6zN+2900spHXQNNQKQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Commit 9556aa01c69 (Use single-byte Boyer-Moore-Horspool search even
with multibyte encodings), was a speed improvement for the majority of
cases, but when the match was toward the end of the string, the
slowdown in text_position_get_match_pos() was noticeable. It was found
that there was a lot of overhead in pg_mblen(). [1]

The attached exploratory PoC improves this for utf-8. It applies on
top v25 of my utf-8 verification patch in [2], since one approach
relies on the DFA from it. The other three approaches are:
- a version of pg_utf_mblen() that uses a lookup table [3]
- an inlined copy of pg_utf_mblen()
- an ascii fast path with a fallback to the inlined copy of pg_utf_mblen()

The test is attached and the test function is part of the patch. It's
based on the test used in the commit above. The test searches for a
string that's at the end of a ~1 million byte string. This is on gcc
11 with 2-3 runs to ensure repeatability, but I didn't bother with
statistics because the differences are pretty big:

patch | no match | ascii | mulitbyte
-----------------------------------------+----------+-------+-----------
PG11 | 1120 | 1100 | 900
master | 381 | 2350 | 1900
DFA | 386 | 1640 | 1640
branchless utf mblen | 387 | 4100 | 2600
inline pg_utf_mblen() | 380 | 1080 | 920
inline pg_utf_mblen() + ascii fast path | 382 | 470 | 918

Neither of the branchless approaches worked well. The DFA can't work
as well here as in verification because it must do additional work.
Inlining pg_utf_mblen() restores worst-case performance to PG11
levels. The ascii fast path is a nice improvement on top of that. A
similar approach could work for pg_mbstrlen() as well, but I haven't
looked into that yet. There are other callers of pg_mblen(), but I
haven't looked into whether they are performance-sensitive. A more
general application would be preferable to a targeted one.

[1] /message-id/b65df3d8-1f59-3bd7-ebbe-68b81d5a76a4%40iki.fi
[2] /message-id/CAFBsxsHG%3Dg6W8Mie%2B_NO8dV6O0pO2stxrnS%3Dme5ZmGqk--fd5g%40mail.gmail.com
[3] https://github.com/skeeto/branchless-utf8/blob/master/utf8.h

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment	Content-Type	Size
prototype-textpos-utf8.patch	application/x-patch	10.4 KB
textpos-benchmark-utf8.sql	application/sql	613 bytes

From:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject:	Re: speed up text_position() for utf-8
Date:	2021-12-15 16:43:37
Message-ID:	CAFBsxsGAcmsHFsiNh=TywjkwOgebbT87hsrPRkgf-b=HLO=vjA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:

> The test is attached and the test function is part of the patch. It's
> based on the test used in the commit above. The test searches for a
> string that's at the end of a ~1 million byte string. This is on gcc
> 11 with 2-3 runs to ensure repeatability, but I didn't bother with
> statistics because the differences are pretty big:
>
> patch | no match | ascii | mulitbyte
> -----------------------------------------+----------+-------+-----------
> PG11 | 1120 | 1100 | 900
> master | 381 | 2350 | 1900
> DFA | 386 | 1640 | 1640
> branchless utf mblen | 387 | 4100 | 2600
> inline pg_utf_mblen() | 380 | 1080 | 920
> inline pg_utf_mblen() + ascii fast path | 382 | 470 | 918

I failed to mention that the above numbers are milliseconds, so
smaller is better.

--
John Naylor
EDB: http://www.enterprisedb.com

From:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: speed up text_position() for utf-8
Date:	2021-12-17 21:01:37
Message-ID:	CAFBsxsFUoxgYQ22xjFTVtq7UAoZbHFTigEeQn3bV=2PCyqgSpw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Attached is a short patch series to develop some ideas of inlining
pg_utf_mblen().

0001 puts the main implementation of pg_utf_mblen() into an inline
function and uses this in pg_mblen(). This is somewhat faster in the
strpos tests, so that gives some measure of the speedup expected for
other callers. Text search seems to call this a lot, so this might
have noticeable benefit.

0002 refactors text_position_get_match_pos() to use
pg_mbstrlen_with_len(). This itself is significantly faster when
combined with 0001, likely because the latter can inline the call to
pg_mblen(). The intention is to speed up more than just text_position.

0003 explicitly specializes for the inline version of pg_utf_mblen()
into pg_mbstrlen_with_len(), but turns out to be almost as slow as
master for ascii. It doesn't help if I undo the previous change in
pg_mblen(), and I haven't investigated why yet.

0002 looks good now, but the experience with 0003 makes me hesitant to
propose this seriously until I can figure out what's going on there.

The test is as earlier, a worst-case substring search, times in milliseconds.

patch | no match | ascii | multibyte
--------+----------+-------+-----------
PG11 | 1220 | 1220 | 1150
master | 385 | 2420 | 1980
0001 | 390 | 2180 | 1670
0002 | 389 | 1330 | 1100
0003 | 391 | 2100 | 1360

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment	Content-Type	Size
v2-0002-Refactor-text_position_get_match_pos-to-use-pg_mb.patch	application/octet-stream	2.1 KB
v2-0003-Specialize-pg_mbstrlen_with_len-for-utf-8.patch	application/octet-stream	897 bytes
v2-0001-Move-the-implementation-of-pg_utf_mblen-to-an-inl.patch	application/octet-stream	4.3 KB

From:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: speed up text_position() for utf-8
Date:	2022-01-19 19:24:31
Message-ID:	CAFBsxsFbLU0mzGetXUhZoPQ9VuV=Eu60OAeLDt=UZQZ9-Has7Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:

> 0001 puts the main implementation of pg_utf_mblen() into an inline
> function and uses this in pg_mblen(). This is somewhat faster in the
> strpos tests, so that gives some measure of the speedup expected for
> other callers. Text search seems to call this a lot, so this might
> have noticeable benefit.
>
> 0002 refactors text_position_get_match_pos() to use
> pg_mbstrlen_with_len(). This itself is significantly faster when
> combined with 0001, likely because the latter can inline the call to
> pg_mblen(). The intention is to speed up more than just text_position.
>
> 0003 explicitly specializes for the inline version of pg_utf_mblen()
> into pg_mbstrlen_with_len(), but turns out to be almost as slow as
> master for ascii. It doesn't help if I undo the previous change in
> pg_mblen(), and I haven't investigated why yet.
>
> 0002 looks good now, but the experience with 0003 makes me hesitant to
> propose this seriously until I can figure out what's going on there.
>
> The test is as earlier, a worst-case substring search, times in milliseconds.
>
> patch | no match | ascii | multibyte
> --------+----------+-------+-----------
> PG11 | 1220 | 1220 | 1150
> master | 385 | 2420 | 1980
> 0001 | 390 | 2180 | 1670
> 0002 | 389 | 1330 | 1100
> 0003 | 391 | 2100 | 1360

I tried this test on a newer CPU, and 0003 had no regression. Both
systems used gcc 11.2. Rather than try to figure out why an experiment
had unexpected behavior, I plan to test 0001 and 0002 on a couple
different compilers/architectures and call it a day. It's also worth
noting that 0002 by itself seemed to be decently faster on the newer
machine, but not as fast as 0001 and 0002 together.

Looking at the assembly, pg_mblen is inlined into
pg_mbstrlen_[with_len] and pg_mbcliplen, so the specialization for
utf-8 in 0001 would be inlined in the other 3 as well. That's only a
few bytes, so I think it's fine.

--
John Naylor
EDB: http://www.enterprisedb.com

From:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: speed up text_position() for utf-8
Date:	2022-02-02 20:20:56
Message-ID:	CAFBsxsH3Yw61WUfARYo518-ABNYbA6AYsdGpCScHAbenBBh4ng@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:

> I tried this test on a newer CPU, and 0003 had no regression. Both
> systems used gcc 11.2. Rather than try to figure out why an experiment
> had unexpected behavior, I plan to test 0001 and 0002 on a couple
> different compilers/architectures and call it a day. It's also worth
> noting that 0002 by itself seemed to be decently faster on the newer
> machine, but not as fast as 0001 and 0002 together.

I tested four machines with various combinations of patches, and it
seems the only thing they all agree on is that 0002 is a decent
improvement (full results attached). The others can be faster or
slower. 0002 also simplifies things, so it has that going for it. I
plan to commit that this week unless there are objections.
--
John Naylor
EDB: http://www.enterprisedb.com

Attachment	Content-Type	Size
test-textpos-20220202.txt	text/plain	1.8 KB