Lists: | pgsql-general pgsql-hackers |
From: | bombadil(at)wanadoo(dot)es |
---|---|
To: | Lista PostgreSql <pgsql-general(at)postgresql(dot)org> |
Subject: | Problem (bug?) with like |
Date: | 2001-12-03 07:37:36 |
Message-ID: | 20011203083736.A20115@fangorn.net |
I'd like to report an error I get when using LIKE on text
fields.
I have a table A and a view v_A of A. name is a text field (person
names). Look at these queries:
1)
select * from A where name like '%DAVID%'
It works pretty well and fast. Result:
DAVID
DAVID FOO
DAVID
.../...
2)
select * from v_A where name like '%DA%'
It works too, with a larger result (obviously) than the first query.
Result:
DAVID
DANIEL
DAVID FOO
.../...
3)
select * from v_A where name like '%DAVID%'
It freezes psql. Why? It seems a bug, doesn't it? Thanks for any
info.
David
From: | DaVinci <davinci(at)escomposlinux(dot)org> |
---|---|
To: | Lista PostgreSql <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-03 08:13:02 |
Message-ID: | 20011203081302.GA22120@fangorn.net |
Lists: | pgsql-general pgsql-hackers |
On Monday, December 3, bombadil(at)wanadoo(dot)es wrote:
> I'd like to report an error I get when using LIKE on text fields.
Sorry. Info about my system:
Postgresql 7.1.3
Debian Sid GNU/Linux
i386
Greets.
David
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | bombadil(at)wanadoo(dot)es |
Cc: | Lista PostgreSql <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-03 14:43:01 |
Message-ID: | 12824.1007390581@sss.pgh.pa.us |
Lists: | pgsql-general pgsql-hackers |
bombadil(at)wanadoo(dot)es writes:
> select * from v_A where name like '%DAVID%'
> It freezes psql.
I don't believe that it's really "frozen". Taking a long time, maybe.
> Why?
You tell us. What's the EXPLAIN query plan for these three queries?
regards, tom lane
From: | bombadil(at)wanadoo(dot)es |
---|---|
To: | Lista PostgreSql <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-03 15:08:59 |
Message-ID: | 20011203150859.GA823@fangorn.net |
Lists: | pgsql-general pgsql-hackers |
On Monday, December 3, Tom Lane wrote:
> bombadil(at)wanadoo(dot)es writes:
> > select * from v_A where name like '%DAVID%'
> >
> > It freezes psql.
>
> I don't believe that it's really "frozen". Taking a long time, maybe.
Perhaps. But a veeeeeery long time, anyway ;)
I waited more than 3 minutes and... et voilà, here it is
:)
> > Why?
>
> You tell us. What's the EXPLAIN query plan for these three queries?
Oops. Sorry for my laziness.
Here are my queries:
--------------------------------------------------------------------------
1) # explain SELECT * from cliente where nombre like '%DAVID%';
Result:
NOTICE: QUERY PLAN:
Seq Scan on cliente (cost=0.00..16139.44 rows=1 width=131)
--------------------------------------------------------------------------
2) # explain SELECT * from v_cliente where nombre like '%DA%';
Result:
NOTICE: QUERY PLAN:
Merge Join (cost=54763.50..62874.36 rows=413980 width=183)
-> Sort (cost=16238.44..16238.44 rows=54 width=131)
-> Seq Scan on cliente c (cost=0.00..16236.88 rows=54 width=131)
-> Sort (cost=38525.06..38525.06 rows=20097 width=74)
-> Subquery Scan d (cost=891.91..37088.66 rows=20097 width=74)
-> Hash Join (cost=891.91..37088.66 rows=20097 width=74)
-> Hash Join (cost=100.89..26377.49 rows=20097 width=58)
-> Merge Join (cost=78.96..17190.49 rows=20097 width=42)
-> Index Scan using dir_via_ndx on dirección d (cost=0.00..8951.65 rows=20097 width=26)
-> Sort (cost=78.96..78.96 rows=176 width=16)
-> Seq Scan on vía v (cost=0.00..72.40 rows=176 width=16)
-> Hash (cost=21.80..21.80 rows=52 width=16)
-> Seq Scan on provincia p (cost=0.00..21.80 rows=52 width=16)
-> Hash (cost=786.20..786.20 rows=1928 width=16)
-> Seq Scan on localidad l (cost=0.00..786.20 rows=1928 width=16)
------------------------------------------------------------------------------
3) # explain SELECT * from v_cliente where nombre like '%DAVID%';
Result:
NOTICE: QUERY PLAN:
Merge Join (cost=54763.50..62874.36 rows=413980 width=183)
-> Sort (cost=16238.44..16238.44 rows=54 width=131)
-> Seq Scan on cliente c (cost=0.00..16236.88 rows=54 width=131)
-> Sort (cost=38525.06..38525.06 rows=20097 width=74)
-> Subquery Scan d (cost=891.91..37088.66 rows=20097 width=74)
-> Hash Join (cost=891.91..37088.66 rows=20097 width=74)
-> Hash Join (cost=100.89..26377.49 rows=20097 width=58)
-> Merge Join (cost=78.96..17190.49 rows=20097 width=42)
-> Index Scan using dir_via_ndx on dirección d (cost=0.00..8951.65 rows=20097 width=26)
-> Sort (cost=78.96..78.96 rows=176 width=16)
-> Seq Scan on vía v (cost=0.00..72.40 rows=176 width=16)
-> Hash (cost=21.80..21.80 rows=52 width=16)
-> Seq Scan on provincia p (cost=0.00..21.80 rows=52 width=16)
-> Hash (cost=786.20..786.20 rows=1928 width=16)
-> Seq Scan on localidad l (cost=0.00..786.20 rows=1928 width=16)
--------------------------------------------------------------------------------
Greets.
David
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | bombadil(at)wanadoo(dot)es |
Cc: | Lista PostgreSql <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-03 15:40:45 |
Message-ID: | 13305.1007394045@sss.pgh.pa.us |
Lists: | pgsql-general pgsql-hackers |
bombadil(at)wanadoo(dot)es writes:
> Here are my queries:
You sure you didn't paste in the same result for #2 and #3? They're
the same plan with the same rows estimates --- but I'd expect the rows
estimates, at least, to change given the more-selective LIKE pattern.
Also, how many rows are there really that match '%DA%' and '%DAVID%'?
I suspect the planner is being overoptimistic about the selectivity of
'%DAVID%', and is choosing a plan that doesn't work well when there are
lots of DAVIDs :-(
regards, tom lane
From: | bombadil(at)wanadoo(dot)es |
---|---|
To: | Lista PostgreSql <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-03 18:30:33 |
Message-ID: | 20011203193033.A480@lorien.net |
Lists: | pgsql-general pgsql-hackers |
On Monday, December 3, Tom Lane wrote:
> > Here are my queries:
>
> You sure you didn't paste in the same result for #2 and #3? They're
> the same plan with the same rows estimates --- but I'd expect the rows
> estimates, at least, to change given the more-selective LIKE pattern.
I don't know for sure whether I sent the wrong data, but I think not.
Tomorrow I'll get more info.
> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
Very few, given that incredible difference in time:
'%DA%' -> 2 sec.
'%DAVID%' -> 3 min.
> I suspect the planner is being overoptimistic about the selectivity of
> '%DAVID%', and is choosing a plan that doesn't work well when there are
> lots of DAVIDs :-(
I had thought that it only happens when the string being searched for
appears in full in some number of rows, but that is only a rough
hypothesis :?
Thanks for all.
David
From: | bombadil(at)wanadoo(dot)es |
---|---|
To: | Lista PostgreSql <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-04 08:15:57 |
Message-ID: | 20011204081557.GA2832@fangorn.net |
Lists: | pgsql-general pgsql-hackers |
On Monday, December 3, Tom Lane wrote:
> You sure you didn't paste in the same result for #2 and #3? They're
> the same plan with the same rows estimates --- but I'd expect the rows
> estimates, at least, to change given the more-selective LIKE pattern.
Ahem... You are right. It seems I have a problem with Vim and inserting
command output :(
Here are the correct results:
-----------------------------------------------------------------------
1) # explain SELECT * from v_cliente where nombre like '%DA%';
Result:
NOTICE: QUERY PLAN:
Merge Join (cost=54763.50..62874.36 rows=413980 width=183)
-> Sort (cost=16238.44..16238.44 rows=54 width=131)
-> Seq Scan on cliente c (cost=0.00..16236.88 rows=54 width=131)
-> Sort (cost=38525.06..38525.06 rows=20097 width=74)
-> Subquery Scan d (cost=891.91..37088.66 rows=20097 width=74)
-> Hash Join (cost=891.91..37088.66 rows=20097 width=74)
-> Hash Join (cost=100.89..26377.49 rows=20097 width=58)
-> Merge Join (cost=78.96..17190.49 rows=20097 width=42)
-> Index Scan using dir_via_ndx on dirección d (cost=0.00..8951.65 rows=20097 width=26)
-> Sort (cost=78.96..78.96 rows=176 width=16)
-> Seq Scan on vía v (cost=0.00..72.40 rows=176 width=16)
-> Hash (cost=21.80..21.80 rows=52 width=16)
-> Seq Scan on provincia p (cost=0.00..21.80 rows=52 width=16)
-> Hash (cost=786.20..786.20 rows=1928 width=16)
-> Seq Scan on localidad l (cost=0.00..786.20 rows=1928 width=16)
EXPLAIN
---------------------------------------------------------------------------
2) explain SELECT * from v_cliente where nombre like '%DAVID%';
Result:
NOTICE: QUERY PLAN:
Nested Loop (cost=891.91..61414.58 rows=7638 width=183)
-> Seq Scan on cliente c (cost=0.00..16236.88 rows=1 width=131)
-> Subquery Scan d (cost=891.91..37088.66 rows=20097 width=74)
-> Hash Join (cost=891.91..37088.66 rows=20097 width=74)
-> Hash Join (cost=100.89..26377.49 rows=20097 width=58)
-> Merge Join (cost=78.96..17190.49 rows=20097 width=42)
-> Index Scan using dir_via_ndx on dirección d (cost=0.00..8951.65 rows=20097 width=26)
-> Sort (cost=78.96..78.96 rows=176 width=16)
-> Seq Scan on vía v (cost=0.00..72.40 rows=176 width=16)
-> Hash (cost=21.80..21.80 rows=52 width=16)
-> Seq Scan on provincia p (cost=0.00..21.80 rows=52 width=16)
-> Hash (cost=786.20..786.20 rows=1928 width=16)
-> Seq Scan on localidad l (cost=0.00..786.20 rows=1928 width=16)
EXPLAIN
----------------------------------------------------------------------------
> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
1) 2672 rows -> 3.59 sec.
2) 257 rows -> 364.69 sec.
v_cliente and cliente have the same number of rows: 38975
I hope this is enough. Greets.
David
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | bombadil(at)wanadoo(dot)es |
Cc: | Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-04 15:21:18 |
Message-ID: | 28804.1007479278@sss.pgh.pa.us |
Lists: | pgsql-general pgsql-hackers |
bombadil(at)wanadoo(dot)es writes:
> 1) # explain SELECT * from v_cliente where nombre like '%DA%';
> Merge Join (cost=54763.50..62874.36 rows=413980 width=183)
> -> Sort (cost=16238.44..16238.44 rows=54 width=131)
> -> Seq Scan on cliente c (cost=0.00..16236.88 rows=54 width=131)
> -> Sort (cost=38525.06..38525.06 rows=20097 width=74)
> -> Subquery Scan d (cost=891.91..37088.66 rows=20097 width=74)
> -> Hash Join (cost=891.91..37088.66 rows=20097 width=74)
> ...
> 2) explain SELECT * from v_cliente where nombre like '%DAVID%';
> Nested Loop (cost=891.91..61414.58 rows=7638 width=183)
> -> Seq Scan on cliente c (cost=0.00..16236.88 rows=1 width=131)
> -> Subquery Scan d (cost=891.91..37088.66 rows=20097 width=74)
> -> Hash Join (cost=891.91..37088.66 rows=20097 width=74)
> ... [same subplan as above]
The problem here is that the planner is being way too optimistic about
the selectivity of LIKE '%DAVID%' --- notice the estimate that only
one matching row will be found in cliente, rather than 54 as with '%DA%'.
So it chooses a plan that avoids the sort overhead needed for an
efficient merge join with the other tables. That would be a win if
there were only one matching row, but as soon as there are lots, it's
a big loss, because the subquery to join the other tables gets redone
for every matching row :-(
>> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
> 1) 2672 rows -> 3.59 sec.
> 2) 257 rows -> 364.69 sec.
I am thinking that the rules for selectivity of LIKE patterns probably
need to be modified. Presently the code assumes that a long constant
string has probability of occurrence proportional to the product of the
probabilities of the individual letters. That might be true in a random
world, but people don't search for random strings. I think we need to
back off the selectivity estimate by some large factor to account for
the fact that the pattern being searched for is probably not random.
Anyone have ideas how to do that?
regards, tom lane
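[Editor's note: the multiplicative rule Tom describes, where a constant string's probability is the product of per-character probabilities, can be sketched as below. This is a toy model in Python, not the actual selfuncs.c code; the function name `like_selectivity` and the wildcard handling are ours, and real PostgreSQL also treats anchored patterns and histogram data differently.]

```python
FIXED_CHAR_SEL = 0.04  # the 7.1-era per-character constant from selfuncs.c

def like_selectivity(pattern, char_sel=FIXED_CHAR_SEL):
    """Toy estimate for an unanchored LIKE pattern: every literal
    character multiplies the selectivity by a fixed probability."""
    sel = 1.0
    for c in pattern:
        if c not in ('%', '_'):  # skip wildcards
            sel *= char_sel
    return sel

# '%DA%' -> 0.04**2 = 0.0016, while '%DAVID%' -> 0.04**5:
# three extra letters shrink the estimate 15625x, hence the over-optimism.
```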
From: | David Walter <dwalter(at)ecs(dot)syr(dot)edu> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org, Lista PostgreSql <pgsql-general(at)postgresql(dot)org> |
Subject: | New planner for like was -- Problem (bug?) with like |
Date: | 2001-12-04 16:10:30 |
Message-ID: | Pine.SOL.4.10.10112041105540.6144-100000@boreas.ecs.syr.edu |
Lists: | pgsql-general pgsql-hackers |
On Tue, 4 Dec 2001, Tom Lane wrote:
> bombadil(at)wanadoo(dot)es writes:
> > 1) # explain SELECT * from v_cliente where nombre like '%DA%';
>
> > Merge Join (cost=54763.50..62874.36 rows=413980 width=183)
> > -> Sort (cost=16238.44..16238.44 rows=54 width=131)
> The problem here is that the planner is being way too optimistic about
> the selectivity of LIKE '%DAVID%' --- notice the estimate that only
> one matching row will be found in cliente, rather than 54 as with '%DA%'.
> So it chooses a plan that avoids the sort overhead needed for an
> efficient merge join with the other tables. That would be a win if
> there were only one matching row, but as soon as there are lots, it's
> a big loss, because the subquery to join the other tables gets redone
> for every matching row :-(
>
> >> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
>
> > 1) 2672 rows -> 3.59 sec.
> > 2) 257 rows -> 364.69 sec.
>
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified. Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters. That might be true in a random
> world, but people don't search for random strings. I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?
>
Is there any statistic being kept per partial index? For instance:
#occurrences of A
#occurrences of B
A fairly small table/index could track these, couldn't it?
If it were a btree itself, then statistics could be split appropriately
into sub-branches when the # of occurrences exceeds some threshold.
david
From: | Hannu Krosing <hannu(at)tm(dot)ee> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [HACKERS] Problem (bug?) with like |
Date: | 2001-12-04 17:30:33 |
Message-ID: | 3C0D0839.2010102@tm.ee |
Lists: | pgsql-general pgsql-hackers |
Tom Lane wrote:
>Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>
>>But what about '%A%' vs. '%AC%'. Seems the second is reasonably
>>different from the first the our optimizer may be fine with that. Is it
>>only when the strings get longer that we lose specificity?
>>
>
>Yeah, I don't think that the estimates are bad for one or two
>characters. But the estimate gets real small real fast as you
>increase the number of match characters in the LIKE pattern.
>We need to slow that down some.
>
Could we just assign weights to the first few characters and then
consider only those first few characters when determining the
probability of finding it?
If someone searches for '%New York City%' we have quite good reason to
believe that there are some of these in there, so we should factor in,
through said weights, the fact that one usually searches for strings
that do exist.
Another option would be to gather statistics not only on individual
letters but on bi- or trigraphs. Then as a next step we could implement
proper trigraph indexes ;)
----------------
Hannu
From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-04 19:00:32 |
Message-ID: | 200112041900.fB4J0W421336@candle.pha.pa.us |
Lists: | pgsql-general pgsql-hackers |
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified. Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters. That might be true in a random
> world, but people don't search for random strings. I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?
But what about '%A%' vs. '%AC%'? Seems the second is reasonably
different from the first, so our optimizer may be fine with that. Is it
only when the strings get longer that we lose specificity?
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-04 19:52:19 |
Message-ID: | 2874.1007495539@sss.pgh.pa.us |
Lists: | pgsql-general pgsql-hackers |
Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> But what about '%A%' vs. '%AC%'. Seems the second is reasonably
> different from the first the our optimizer may be fine with that. Is it
> only when the strings get longer that we lose specificity?
Yeah, I don't think that the estimates are bad for one or two
characters. But the estimate gets real small real fast as you
increase the number of match characters in the LIKE pattern.
We need to slow that down some.
regards, tom lane
From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-04 19:55:30 |
Message-ID: | 200112041955.fB4JtUA26642@candle.pha.pa.us |
Lists: | pgsql-general pgsql-hackers |
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > But what about '%A%' vs. '%AC%'. Seems the second is reasonably
> > different from the first the our optimizer may be fine with that. Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters. But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.
Yea, maybe a log base 2 decrease:
1 char 1x
2 char 2x
4 char 3x
8 char 4x
16 char 5x
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
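[Editor's note: Bruce's table reads as a log-base-2 backoff, where doubling the number of literal characters adds one more factor of restriction. A minimal sketch of that reading; the function name `log2_backoff` is ours, not from the thread.]

```python
import math

def log2_backoff(n_chars):
    """Backoff multiplier per the table above: one extra factor
    each time the count of literal characters doubles."""
    return int(math.log2(n_chars)) + 1

# 1 -> 1x, 2 -> 2x, 4 -> 3x, 8 -> 4x, 16 -> 5x, matching the table.
```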
From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [HACKERS] Problem (bug?) with like |
Date: | 2001-12-28 18:27:26 |
Message-ID: | 200112281827.fBSIRRk28377@candle.pha.pa.us |
Lists: | pgsql-general pgsql-hackers |
> > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > But what about '%A%' vs. '%AC%'. Seems the second is reasonably
> > > different from the first the our optimizer may be fine with that. Is it
> > > only when the strings get longer that we lose specificity?
> >
> > Yeah, I don't think that the estimates are bad for one or two
> > characters. But the estimate gets real small real fast as you
> > increase the number of match characters in the LIKE pattern.
> > We need to slow that down some.
>
> Yea, maybe a log base 2 decrease:
>
> 1 char 1x
> 2 char 2x
> 4 char 3x
> 8 char 4x
> 16 char 5x
I did a little research on this. I think the problem is that ordinary
characters are assumed to randomly appear in a character string, while
in practice, if the string has already been specified like 'DAV', there
are very few additional characters that can follow it and make sense.
Looking at backend/utils/adt/selfuncs.c, I see this:
#define FIXED_CHAR_SEL 0.04 /* about 1/25 */
...
sel *= FIXED_CHAR_SEL;
which means every additional character reduces the selectivity by 96%.
This seems much too restrictive to me. Because of the new optimizer
buckets, we do have good statistics on the leading character, but
additional characters drastically reduce selectivity. I think perhaps a
number like 0.50 or 50% may be correct.
That would be a table like this:
1 char 2x
2 char 4x
4 char 8x
8 char 16x
16 char 32x
which is more restrictive than I initially suggested above but less
restrictive than we have now.
Should we assume additional characters are indeed randomly appearing in
the string?
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-28 18:35:41 |
Message-ID: | 200112281835.fBSIZg129445@candle.pha.pa.us |
Lists: | pgsql-general pgsql-hackers |
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > But what about '%A%' vs. '%AC%'. Seems the second is reasonably
> > different from the first the our optimizer may be fine with that. Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters. But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.
See my earlier email about the 50% idea for LIKE. Do ordinary string
comparisons also have this problem?
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [GENERAL] Problem (bug?) with like |
Date: | 2001-12-28 18:54:05 |
Message-ID: | 14054.1009565645@sss.pgh.pa.us |
Lists: | pgsql-general pgsql-hackers |
Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Do ordinary string
> comparisons also have this problem?
No, only LIKE and regex matching.
regards, tom lane
From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-29 04:55:19 |
Message-ID: | 200112290455.fBT4tLm13381@candle.pha.pa.us |
Lists: | pgsql-general pgsql-hackers |
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > But what about '%A%' vs. '%AC%'. Seems the second is reasonably
> > different from the first the our optimizer may be fine with that. Is it
> > only when the strings get longer that we lose specificity?
>
> Yeah, I don't think that the estimates are bad for one or two
> characters. But the estimate gets real small real fast as you
> increase the number of match characters in the LIKE pattern.
> We need to slow that down some.
OK, I think I have the proper value for FIXED_CHAR_SEL. It is currently
0.04, or about 1/25, meaning the letters are assumed random, though this
is not usually the case.
If we assume our optimizer buckets have given us a reasonable value for
the first character (suppose it is an 'F'), there are only a few valid
characters after that, at least in English: vowels, and a few
consonants. And given that character, there are again only a few
characters that can validly follow. To my thinking, two such characters
represent the same distribution as one random character, leaving 0.20 as
the proper value for FIXED_CHAR_SEL, because 0.20 * 0.20 is the same as
0.04.
Added to TODO:
* Change FIXED_CHAR_SEL to 0.20 from 0.04 to give better selectivity (Bruce)
If people think there is a better value for this, please chime in.
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
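[Editor's note: Bruce's derivation above amounts to solving s * s = 0.04 for the per-character selectivity. A quick check of that arithmetic; the variable names are ours.]

```python
# If two constrained characters should match one random character's
# 1/25 probability, the per-character selectivity s satisfies s*s == 0.04.
s = 0.04 ** 0.5          # 0.20, the value added to the TODO

# Under the proposed constant, '%DAVID%' (5 literal characters) would
# be estimated at 0.20**5 = 0.00032, far above 0.04**5 ~= 1e-7.
sel_david = 0.20 ** 5
```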
From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | drheart(at)wanadoo(dot)es, Lista PostgreSql <pgsql-general(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Problem (bug?) with like |
Date: | 2001-12-29 05:08:12 |
Message-ID: | 200112290508.fBT58Cn14715@candle.pha.pa.us |
Lists: | pgsql-general pgsql-hackers |
> The problem here is that the planner is being way too optimistic about
> the selectivity of LIKE '%DAVID%' --- notice the estimate that only
> one matching row will be found in cliente, rather than 54 as with '%DA%'.
> So it chooses a plan that avoids the sort overhead needed for an
> efficient merge join with the other tables. That would be a win if
> there were only one matching row, but as soon as there are lots, it's
> a big loss, because the subquery to join the other tables gets redone
> for every matching row :-(
>
> >> Also, how many rows are there really that match '%DA%' and '%DAVID%'?
>
> > 1) 2672 rows -> 3.59 sec.
> > 2) 257 rows -> 364.69 sec.
>
> I am thinking that the rules for selectivity of LIKE patterns probably
> need to be modified. Presently the code assumes that a long constant
> string has probability of occurrence proportional to the product of the
> probabilities of the individual letters. That might be true in a random
> world, but people don't search for random strings. I think we need to
> back off the selectivity estimate by some large factor to account for
> the fact that the pattern being searched for is probably not random.
> Anyone have ideas how to do that?
Let's use the above example with the new FIXED_CHAR_SEL values:
With the new 0.20 value for FIXED_CHAR_SEL, we see for DA and DAVID
above:
DA 1) 0.20 ^ 2
.04
DAVID 2) 0.20 ^ 5
.00032
If we divide these two, we get:
> 0.04 / 0.00032
125
while looking at the total counts reported above, we get:
> 2672 / 257
~10.39688715953307392996
The 0.04 value gives a value of:
> 0.04 ^ 2
.0016
> 0.04 ^ 5
.0000001024
> .0016 / .0000001024
15625
Clearly the 0.20 value is roughly 12x too large, while the 0.04 value is
roughly 1500x too large. Because this was a contrived example, and
because some fields contain more random text than DAVID, I think 0.20 is
the proper value.
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
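[Editor's note: the arithmetic in this last message can be replayed directly. The row counts come from the thread; the variable names are ours.]

```python
actual = 2672 / 257             # ~10.4x: real row-count ratio, '%DA%' vs '%DAVID%'
est_020 = 0.20**2 / 0.20**5     # 125: ratio predicted with FIXED_CHAR_SEL = 0.20
est_004 = 0.04**2 / 0.04**5     # 15625: ratio predicted with FIXED_CHAR_SEL = 0.04

# 0.20 overshoots the true ratio by about one order of magnitude,
# while 0.04 overshoots it by about three.
```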