UNION versus collations

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: UNION versus collations
Date: 2024-11-18 22:56:19
Message-ID: 3605568.1731970579@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

prepunion.c's plan_union_children(), which merges
identically-propertied UNION operations into one, has this comment:

* NOTE: currently, we ignore collations while determining if a child has
* the same properties. This is semantically sound only so long as all
* collations have the same notion of equality. It is valid from an
* implementation standpoint because we don't care about the ordering of
* a UNION child's result: UNION ALL results are always unordered, and
* generate_union_paths will force a fresh sort if the top level is a UNION.

This argument seems well past its sell-by date. In the first place,
now that we have nondeterministic collations we can't assume that
"all collations have the same notion of equality". In the second
place, since commit 66c0185a3 it's completely untrue that "we
don't care about the ordering of a UNION child's result", and
also untrue that "generate_union_paths will force a fresh sort".

As far as I can tell from some desultory testing, the implementation
issues don't lead to any observable bugs. Even though the code may
try to produce wrongly-ordered paths for the sub-SELECTs, the pathkey
logic will recognize that it's not really the same ordering, leading
to injection of per-child Sorts. Nonetheless, we're wasting cycles
producing useless ordered paths, and perhaps we might pick a less than
optimal plan in some cases (not entirely sure about that).

The point about nondeterministic collations is undeniable though.
I've not bothered to build a test case, but surely one can be
made wherein a sub-UNION acts differently than expected.

So I think we ought to apply the attached as far back as we have
nondeterministic collations.

regards, tom lane

Attachment Content-Type Size
check-collations-when-merging-UNIONs.patch text/x-diff 1.9 KB

From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: UNION versus collations
Date: 2024-11-19 05:56:34
Message-ID: aecbaa08cc9b21187daf9304346c4e845c157753.camel@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2024-11-18 at 17:56 -0500, Tom Lane wrote:
> prepunion.c's plan_union_children(), which merges
> identically-propertied UNION operations into one, has this comment:
>
> * NOTE: currently, we ignore collations while determining if a child has
> * the same properties. This is semantically sound only so long as all
> * collations have the same notion of equality. It is valid from an
> * implementation standpoint because we don't care about the ordering of
> * a UNION child's result: UNION ALL results are always unordered, and
> * generate_union_paths will force a fresh sort if the top level is a UNION.
>
> This argument seems well past its sell-by date. In the first place,
> now that we have nondeterministic collations we can't assume that
> "all collations have the same notion of equality".
>
> [...]
>
> So I think we ought to apply the attached as far back as we have
> nondeterministic collations.

+1

This also reminded me of [1], where I cannot think of a good fix.

Yours,
Laurenz Albe

[1]: https://postgr.es/m/8ef4899c4acfebca45cc6c042a6dc611d25ffab1.camel%40cybertec.at