Re: BUG #17896: xmlagg exponentially slower than string_agg equivalent

Lists: pgsql-bugs
From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: sf_postgresql(at)informationsoftworks(dot)com
Subject: BUG #17896: xmlagg exponentially slower than string_agg equivalent
Date: 2023-04-14 15:12:59
Message-ID: 17896-f4bf54f99b6a3089@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 17896
Logged by: Paul Wehr
Email address: sf_postgresql(at)informationsoftworks(dot)com
PostgreSQL version: 14.6
Operating system: Debian GNU/Linux 11 (bullseye)
Description:

Not a bug in terms of functionality, but an interesting difference in
performance. If nothing else hopefully documenting this work-around might
help anyone else that may be aggregating xmlelements on this scale.

Using xmlagg(xml) on 60,000 rows of xml takes 16.3s on my hardware. If I
replace "xmlagg(xml)" with "string_agg(xml::text)::xml", I can get the same
result in 0.051s. (see script below). I have confirmed this behavior on
other hardware, and in versions 11.19, 13.7 and 14.6.

Also, while string_agg() seems to scale linearly (O(n)), xmlagg() seems to
scale somewhat steeper (O(n^2)?). Running the same script on various second
parameter values in the generate_series() function produced these results:

| | string_agg()
Rows | xmlagg(), ms | work-around, ms
-------+----------------+------------------
10000 | 314.44 | 14.207
20000 | 1111.104 | 23.754
30000 | 2558.706 | 37.565
40000 | 5306.631 | 51.021
50000 | 10178.496 | 61.928
60000 | 16306.651 | 75.146
70000 | 23853.331 | 93.529
80000 | 32243.908 | 105.318

-------- psql test script output ----------------

select version();
version

-----------------------------------------------------------------------------------------------------------------------------
PostgreSQL 14.6 (Debian 14.6-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled
by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)

create temporary table xmlsample as
select xmlelement(name test, 'test') as xml
from generate_series(1,60000) as i (i)
;
SELECT 60000
explain analyze select xmlagg(xml) as xml from xmlsample;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1034.10..1034.11 rows=1 width=32) (actual
time=16304.009..16304.010 rows=1 loops=1)
-> Seq Scan on xmlsample (cost=0.00..903.88 rows=52088 width=32)
(actual time=0.021..12.754 rows=60000 loops=1)
Planning Time: 0.351 ms
Execution Time: 16306.651 ms
(4 rows)

explain analyze select string_agg(xml::text, '')::xml as xml from
xmlsample;
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1034.10..1034.12 rows=1 width=32) (actual
time=63.076..63.078 rows=1 loops=1)
-> Seq Scan on xmlsample (cost=0.00..903.88 rows=52088 width=32)
(actual time=0.010..5.591 rows=60000 loops=1)
Planning Time: 0.098 ms
Execution Time: 75.146 ms
(4 rows)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: sf_postgresql(at)informationsoftworks(dot)com
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17896: xmlagg exponentially slower than string_agg equivalent
Date: 2023-04-15 14:45:42
Message-ID: 1701648.1681569942@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> Also, while string_agg() seems to scale linearly (O(n)), xmlagg() seems to
> scale somewhat steeper (O(n^2)?).

Yeah, string_agg was optimized long ago to avoid repeat data copying,
but xmlagg hasn't been.

It looks like xmlagg doesn't make any direct use of libxml, which suggests
that this wouldn't be terribly hard to fix if somebody were motivated
to try. The xml type is a bit of a development backwater though ...

regards, tom lane