Quick Links

Finding rows with text columns beginning with other text columns

Lists:	pgsql-general

From:	Christoph Zwerschke <cito(at)online(dot)de>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Finding rows with text columns beginning with other text columns
Date:	2010-05-09 22:01:43
Message-ID:	4BE730C7.7090700@online.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

Assume we have a table "a" with a text column "txt"
and an index on that column.

A query like the following will then be very perfomant
since it can use the index:

select * from a where txt like 'a%'

(Assume also that the server is using the C locale or the index
is set up with text_pattern_ops, so that this really works.)

Now take a second, similar table "b" (can be the same table).

We want to find all entries in b where txt begins with an
existing txt entry in a:

select * from b join a on b.txt like a.txt||'%'

On the first glance you would expect that this is performant
since it can use the index, but sadly it doesn't work.
The problem seems to be that Postgres can not guarantee that
column a.txt does not contain a '%', so it cannot optimize.

I feel there should be a performat way to query these entries,
but I can't come up with anything. Can anybody help me?

Thanks,
-- Christoph

From:	Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To:	Christoph Zwerschke <cito(at)online(dot)de>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Finding rows with text columns beginning with other text columns
Date:	2010-05-10 09:50:25
Message-ID:	F90AE329-82B8-42C6-931A-47A0337B7808@solfertje.student.utwente.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

On 10 May 2010, at 24:01, Christoph Zwerschke wrote:

> We want to find all entries in b where txt begins with an
> existing txt entry in a:
>
> select * from b join a on b.txt like a.txt||'%'
>
> On the first glance you would expect that this is performant
> since it can use the index, but sadly it doesn't work.
> The problem seems to be that Postgres can not guarantee that
> column a.txt does not contain a '%', so it cannot optimize.
>
> I feel there should be a performat way to query these entries,
> but I can't come up with anything. Can anybody help me?

Have you tried using substring instead of like?

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,4be7d6ec10411051620847!

From:	Christoph Zwerschke <cito(at)online(dot)de>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Finding rows with text columns beginning with other text columns
Date:	2010-05-10 19:24:38
Message-ID:	4BE85D76.2040900@online.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

Am 10.05.2010 11:50 schrieb Alban Hertroys:
> On 10 May 2010, at 24:01, Christoph Zwerschke wrote:
>
>> select * from b join a on b.txt like a.txt||'%'
>>
>> I feel there should be a performat way to query these entries,
>> but I can't come up with anything. Can anybody help me?
>
> Have you tried using substring instead of like?

How exactly? I tried this:

substr(b.txt, 1, length(a.txt)) = a.txt

but it cannot be optimized and results in a nested loop, too.

It only works with a fixed length:

substr(b.txt, 1, 3) = a.txt

So theoretically I could do something like

select * from b join a
on substr(b.txt, 1, 1) = a.txt and length(b.txt) = 1
union select * from b join a
on substr(b.txt, 1, 2) = a.txt and length(b.txt) = 2
union select * from b join a
on substr(b.txt, 1, 3) = a.txt and length(b.txt) = 3
union ...

... up to the maximum possible string length in a.txt. Not very elegant.

If the question is not finding text cols in b starting with text cols in
a, but text cols in b starting with text cols in a as their first word,
then the following join condition works very well:

split_part(b.txt, ' ', 1) = a.txt

But I'm still looking for a simple solution to the original problem.

-- Christoph

From:	Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To:	Christoph Zwerschke <cito(at)online(dot)de>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Finding rows with text columns beginning with other text columns
Date:	2010-05-10 21:34:58
Message-ID:	6BA3143F-C2FD-4A2E-9C42-B73970FD08A5@solfertje.student.utwente.nl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

On 10 May 2010, at 21:24, Christoph Zwerschke wrote:

> Am 10.05.2010 11:50 schrieb Alban Hertroys:
> > On 10 May 2010, at 24:01, Christoph Zwerschke wrote:
> >
> >> select * from b join a on b.txt like a.txt||'%'
> >>
> >> I feel there should be a performat way to query these entries,
> >> but I can't come up with anything. Can anybody help me?
> >
> > Have you tried using substring instead of like?
>
> How exactly? I tried this:
>
> substr(b.txt, 1, length(a.txt)) = a.txt
>
> but it cannot be optimized and results in a nested loop, too.

I feared as much, but it was worth a try.

Thinking more on the issue, I don't see a way to prevent the nested loop as there's no way to decide beforehand what part of the string to index for b.txt. It depends on a.txt after all.

You would basically need a cross-table index, those are not supported. If it were, you could create a functional index of substrings of b.txt with string lengths from a.txt (eeps, that'd be a table product!).

Your best solution is probably to add a column to b that contains the substring of b.txt that would match a.txt.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,4be87c0b10418212361837!

From:	Christoph Zwerschke <cito(at)online(dot)de>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Finding rows with text columns beginning with other text columns
Date:	2010-05-11 21:04:00
Message-ID:	4BE9C640.1020106@online.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general

Am 10.05.2010 23:34 schrieb Alban Hertroys:
> Thinking more on the issue, I don't see a way to prevent the nested
> loop as there's no way to decide beforehand what part of the string to
> index for b.txt. It depends on a.txt after all.

Yes, that seems to be the gist of the matter. I just felt I might have
missed something. Thanks for your answers.

-- Christoph