Re: 9/18 Visual Planner Meeting Wrapup

Lists: Postg스포츠 토토 베트맨SQL : Postg스포츠 토토 베트맨SQL 메일 링리스트 : 2008-09-19 이후 PDXPUG 23:49
From: "Mark Wong" <markwkm(at)gmail(dot)com>
To: "Postgresql PDX_Users" <pdxpug(at)postgresql(dot)org>
Subject: 9/18 Visual Planner Meeting Wrapup
Date: 2008-09-19 21:53:15
Message-ID: 70c01d1d0809191453l65bfb307k5c788f0052ad198c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

While Selena and Gabrielle were dining with the other speakers at the
Linux Plumbers Conference, this month's PDXPUG meeting featured Tom
Raney and his Google Summer of Code project at Portland State
University under the mentoring of Dr. Len Shapiro. Before getting
into the details of the Visual Planner, Tom gave a great high level
overview of PostgreSQL's decision making process for determining how
to execute a SQL statement. Then he gave brief description of changes
in the PostgreSQL code that are required in order for the Visual
Planner to capture all the plans that the database considered
executing. Using the tool, Tom demonstrated how to browse through all
the discarded plans to see the cost estimates of each plan indicating
why those plans were discarded. Hopefully people who will be attended
the PostgreSQL West Conference 2008 in October will have a chance to
see Tom repeat his performance.

Len also posed an interesting question that I will try to repeat accurately:

Can anyone come up with an example of a meaningful query where there
is a join that does not use an implicit or explicit foreign key?

There were a couple of new comers and we hope to see them again.
After the meeting some of us headed to the Lucky Lab for refreshments!


From: Joshua Drake <jd(at)commandprompt(dot)com>
To: "Mark Wong" <markwkm(at)gmail(dot)com>
Cc: "Postgresql PDX_Users" <pdxpug(at)postgresql(dot)org>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-09-19 22:02:25
Message-ID: 20080919150225.383a35f3@jd-laptop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

On Fri, 19 Sep 2008 14:53:15 -0700
"Mark Wong" <markwkm(at)gmail(dot)com> wrote:

> Can anyone come up with an example of a meaningful query where there
> is a join that does not use an implicit or explicit foreign key?

Only partial. A LEFT or RIGHT JOIN will do this but it needs a partial
key to work.

So: No.

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: jterwill(at)cecs(dot)pdx(dot)edu
To: "Mark Wong" <markwkm(at)gmail(dot)com>
Cc: "Postgresql PDX_Users" <pdxpug(at)postgresql(dot)org>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-09-19 22:55:36
Message-ID: 20080919155536.11086ekbgydouqjs@webmail.cecs.pdx.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

Perhaps I'm missing the point of the question, since I wasn't able to
attend, but I'll give it a shot:

Show me all faculty members (table Faculty) that have the same last
name as a student (table Student).

There would, of course, be no reason to have an FK between those
columns, implicit or explicit. I imagine that a number of
data-mining-type queries that are trying to discover new information
or correlations within a database might have this characteristic.

Miss you guys (and the food up here actually kinda sucks),
James

Quoting "Mark Wong" <markwkm(at)gmail(dot)com>:

> While Selena and Gabrielle were dining with the other speakers at the
> Linux Plumbers Conference, this month's PDXPUG meeting featured Tom
> Raney and his Google Summer of Code project at Portland State
> University under the mentoring of Dr. Len Shapiro. Before getting
> into the details of the Visual Planner, Tom gave a great high level
> overview of PostgreSQL's decision making process for determining how
> to execute a SQL statement. Then he gave brief description of changes
> in the PostgreSQL code that are required in order for the Visual
> Planner to capture all the plans that the database considered
> executing. Using the tool, Tom demonstrated how to browse through all
> the discarded plans to see the cost estimates of each plan indicating
> why those plans were discarded. Hopefully people who will be attended
> the PostgreSQL West Conference 2008 in October will have a chance to
> see Tom repeat his performance.
>
> Len also posed an interesting question that I will try to repeat accurately:
>
> Can anyone come up with an example of a meaningful query where there
> is a join that does not use an implicit or explicit foreign key?
>
> There were a couple of new comers and we hope to see them again.
> After the meeting some of us headed to the Lucky Lab for refreshments!
>
> --
> Sent via pdxpug mailing list (pdxpug(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pdxpug
>

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Mark Wong <markwkm(at)gmail(dot)com>
Cc: Postgresql PDX_Users <pdxpug(at)postgresql(dot)org>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-09-19 23:49:01
Message-ID: 1221868141.6194.351.camel@dell.linuxdev.us.dell.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: Postg스포츠 토토 베트맨SQL : Postg스포츠 토토 베트맨SQL 메일 링리스트 : 2008-09-19 이후 PDXPUG 23:49

On Fri, 2008-09-19 at 14:53 -0700, Mark Wong wrote:
> While Selena and Gabrielle were dining with the other speakers at the
> Linux Plumbers Conference, this month's PDXPUG meeting featured Tom
> Raney and his Google Summer of Code project at Portland State
> University under the mentoring of Dr. Len Shapiro. Before getting
> into the details of the Visual Planner, Tom gave a great high level
> overview of PostgreSQL's decision making process for determining how
> to execute a SQL statement. Then he gave brief description of changes
> in the PostgreSQL code that are required in order for the Visual
> Planner to capture all the plans that the database considered
> executing. Using the tool, Tom demonstrated how to browse through all
> the discarded plans to see the cost estimates of each plan indicating
> why those plans were discarded. Hopefully people who will be attended
> the PostgreSQL West Conference 2008 in October will have a chance to
> see Tom repeat his performance.

Looking forward to it!

> Len also posed an interesting question that I will try to repeat accurately:
>
> Can anyone come up with an example of a meaningful query where there
> is a join that does not use an implicit or explicit foreign key?
>

Arbitrary value association is one of the best strengths of the
relational model, and a primary advantage over any hierarchical data
model (like OO or XML data models).

A FK (like any constraint) can only exist from one relation variable to
another relation variable. A join, however, operates on relation
_values_. So, using relational expressions can easily defeat the
usefulness of a FK. For example, let's say you have two tables receiving
sensor data with high-resolution timestamps, "sensor1" and "sensor2".

A useful query might be something like:

SELECT date_trunc('hour', sensor1_ts) as ts, count(*) as sensor1_count
FROM sensor1
NATURAL JOIN
SELECT date_trunc('hour', sensor2_ts) as ts, count(*) as sensor2_count
FROM sensor2

No possible physical design (that is, choice of relation variables and
associated constraints like FKs) could possibly represent the
relationship you're establishing between the sensor readings in the
query. Notice that there's no requirement about the order or nature of
the arriving sensor data, and thus there can be no constraint.

Regards,
Jeff Davis


From: "Len Shapiro" <lenshap(at)gmail(dot)com>
To: "Mark Wong" <markwkm(at)gmail(dot)com>
Cc: "Postgresql PDX_Users" <pdxpug(at)postgresql(dot)org>, "David Maier" <maier(at)cs(dot)pdx(dot)edu>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-10-21 03:38:51
Message-ID: c5ee9b8a0810202038h1f3af5b8r30627f9cb601d1f6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

> Len also posed an interesting question that I will try to repeat accurately:
>
> Can anyone come up with an example of a meaningful query where there
> is a join that does not use an implicit or explicit foreign key?

Many thanks to Mark for repeating my question accurately and to the
people who provided answers to it. I didn't find the answers to be
satisfactory because I was looking for a query that was useful in a
real life situation.

My interest in the question is that if all useful joins are foreign
key joins then it makes things like optimizer estimates of join
selectivities much simpler. However, there are several useful
non-foreign key joins. They were provided to me by David Maier. I'll
provide the SQL for the first of them, and let you figure out the
rest:

"Who are the congressional candidates running in my district?"

SELECT candname
FROM candidate JOIN resident ON candidate.zip = resident.zip
WHERE resident.name = 'Len Shapiro'

"Which zip codes have both a Burgerville and a White Castle?"

"Who can TA each of the course offerings this year?"
[There can be more than one offering of the same course
number in a year, so course number is only a partial
key]

"Which warehouses stock any of the parts in this order?"

"Which undergrads graduated from the same high school
as which current applicants?"

"What color printers are in my building?"

On Fri, Sep 19, 2008 at 2:53 PM, Mark Wong <markwkm(at)gmail(dot)com> wrote:
> While Selena and Gabrielle were dining with the other speakers at the
> Linux Plumbers Conference, this month's PDXPUG meeting featured Tom
> Raney and his Google Summer of Code project at Portland State
> University under the mentoring of Dr. Len Shapiro. Before getting
> into the details of the Visual Planner, Tom gave a great high level
> overview of PostgreSQL's decision making process for determining how
> to execute a SQL statement. Then he gave brief description of changes
> in the PostgreSQL code that are required in order for the Visual
> Planner to capture all the plans that the database considered
> executing. Using the tool, Tom demonstrated how to browse through all
> the discarded plans to see the cost estimates of each plan indicating
> why those plans were discarded. Hopefully people who will be attended
> the PostgreSQL West Conference 2008 in October will have a chance to
> see Tom repeat his performance.
>
> Len also posed an interesting question that I will try to repeat accurately:
>
> Can anyone come up with an example of a meaningful query where there
> is a join that does not use an implicit or explicit foreign key?
>
> There were a couple of new comers and we hope to see them again.
> After the meeting some of us headed to the Lucky Lab for refreshments!
>
> --
> Sent via pdxpug mailing list (pdxpug(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pdxpug
>


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: len(at)pdx(dot)edu
Cc: "Mark Wong" <markwkm(at)gmail(dot)com>, "Postgresql PDX_Users" <pdxpug(at)postgresql(dot)org>, "David Maier" <maier(at)cs(dot)pdx(dot)edu>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-10-21 04:29:07
Message-ID: 2E3F62AE-7650-4494-896B-F2097D4D926D@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

On Oct 20, 2008, at 20:38, Len Shapiro wrote:

> "Who are the congressional candidates running in my district?"
>
> SELECT candname
> FROM candidate JOIN resident ON candidate.zip = resident.zip
> WHERE resident.name = 'Len Shapiro'

And would that be a join that's not on a foreign key because
candidate.zip and resident.zip are foreign keys to a zip_codes table,
and not otherwise related to one another?

Best,

David


From: "Len Shapiro" <lenshap(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: "Mark Wong" <markwkm(at)gmail(dot)com>, "Postgresql PDX_Users" <pdxpug(at)postgresql(dot)org>, "David Maier" <maier(at)cs(dot)pdx(dot)edu>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-10-21 14:40:58
Message-ID: c5ee9b8a0810210740s6a52c763ie703c3c331f6da0e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

David,

Thanks for the astute observation. By "foreign key join" I did not
mean just a join between foreign keys. It is difficult to compute the
cardinality of such a join. I meant a join between a foreign key and
the (unique) attribute to which it refers. It is simple to compute
the cardinality of this latter type of join.

All the best,

Len

On Mon, Oct 20, 2008 at 9:29 PM, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Oct 20, 2008, at 20:38, Len Shapiro wrote:
>
>> "Who are the congressional candidates running in my district?"
>>
>> SELECT candname
>> FROM candidate JOIN resident ON candidate.zip = resident.zip
>> WHERE resident.name = 'Len Shapiro'
>
> And would that be a join that's not on a foreign key because candidate.zip
> and resident.zip are foreign keys to a zip_codes table, and not otherwise
> related to one another?
>
> Best,
>
> David
>


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: len(at)pdx(dot)edu
Cc: Mark Wong <markwkm(at)gmail(dot)com>, Postgresql PDX_Users <pdxpug(at)postgresql(dot)org>, David Maier <maier(at)cs(dot)pdx(dot)edu>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-10-21 17:07:08
Message-ID: 1224608828.28882.51.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

On Mon, 2008-10-20 at 20:38 -0700, Len Shapiro wrote:
> My interest in the question is that if all useful joins are foreign
> key joins then it makes things like optimizer estimates of join
> selectivities much simpler. However, there are several useful
> non-foreign key joins. They were provided to me by David Maier. I'll
> provide the SQL for the first of them, and let you figure out the
> rest:

I might agree with this question (though I don't think so, I'd need more
details to be sure):

> "Who are the congressional candidates running in my district?"
>

But I disagree with the rest. The reason is that, for each of these,
there exists some very reasonable database design in which the join does
turn into a FK join (based on our offline conversation, you consider
self joins to be implied FKs as well). I will present a few examples of
designs below:

> SELECT candname
> FROM candidate JOIN resident ON candidate.zip = resident.zip
> WHERE resident.name = 'Len Shapiro'

3 tables: person, resident, and candidate. Resident and candidate both
have FKs to person, and so the 3-way join can be expressed using FK
relationships.

> "Which zip codes have both a Burgerville and a White Castle?"

This looks either like a single table with a self join, and if not, it
can be similarly decomposed as above.

> "Who can TA each of the course offerings this year?"
> [There can be more than one offering of the same course
> number in a year, so course number is only a partial
> key]

3 tables: course, course_offering, and potential_ta. I don't really know
what the conditions here are supposed to be, but I'm pretty sure I can
construct it using entirely FK joins.

And so on. If a question presupposes that they use some particular
database design, I think that disqualifies it from answering your
problem.

I think the only correct answer to your problem involves functional
relationships that cannot be (sanely) materialized (as in my example,
the functional relationship between a specific timestamp and the hour in
which it occurs). To be useful, the function must not be one-to-one (as
in my example), otherwise you could just join using the original value.

If it relies on a functional relationship, the FK clearly cannot
reference the result of a function (or if it can, you can simply ask a
second question which relies on a different functional relationship),
and thus it makes it impossible to simply alter the design to answer the
question with FK joins.

I happen to think time provides the most practical examples. For
instance, let's say you have a bunch of events with a timestamp and
timezone. You'd like to join based on the starting time of the events in
local time. And let's say you'd also like to join based on GMT to answer
a separate question. There's no design that can answer both those
questions without relying on a function that can't be materialized into
a relationship table. (To give these questions meaning, the first one
might be "how did the turnout for these events compare with others
happening at the same local time" and the second question might be "how
did international news coverage of these events compare with others that
happened at the same GMT time").

Time truncation (i.e. getting the hour when you have microsecond
resolution) provides more practical examples, although someone could
plausibly break the timestamp into it's various fields. That's why you
need to have a separate question that adds complexity (e.g. date math or
timezones) to prevent that from being a realistic design (to use
timezones you'd need to use the + operator, which can't be part of a FK
relationship).

Other practical examples exist, however. Salary ranges is a simple one:
join people who make within $5k of each other, for instance. There are a
million similar questions relating to demographics.

Regards,
Jeff Davis


From: "Len Shapiro" <lenshap(at)gmail(dot)com>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: "Mark Wong" <markwkm(at)gmail(dot)com>, "Postgresql PDX_Users" <pdxpug(at)postgresql(dot)org>, "David Maier" <maier(at)cs(dot)pdx(dot)edu>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-10-22 16:05:41
Message-ID: c5ee9b8a0810220905w20f1407fn8885da9d456de253@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

Jeff,

Thanks for your words of wisdom. However, I do not think they "cut
the mustard".

The difficulty with your solutions is that they do not help an
optimizer to determine the cardinality of a join. Consider the
candidate-zip code example. The optimizer needs catalog information
to determine cardinalities, and it is hoping that zip is a foreign
key. It may be the case that, in some imaginary schema, zip is a
foreign key. But that imaginary schema is not in the catalog for the
optimizer to investigate. So the optimizer is SOL. That's my $.02,
anyway.

All the best,

Len

On Tue, Oct 21, 2008 at 10:07 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Mon, 2008-10-20 at 20:38 -0700, Len Shapiro wrote:
>> My interest in the question is that if all useful joins are foreign
>> key joins then it makes things like optimizer estimates of join
>> selectivities much simpler. However, there are several useful
>> non-foreign key joins. They were provided to me by David Maier. I'll
>> provide the SQL for the first of them, and let you figure out the
>> rest:
>
> I might agree with this question (though I don't think so, I'd need more
> details to be sure):
>
>> "Who are the congressional candidates running in my district?"
>>
>
> But I disagree with the rest. The reason is that, for each of these,
> there exists some very reasonable database design in which the join does
> turn into a FK join (based on our offline conversation, you consider
> self joins to be implied FKs as well). I will present a few examples of
> designs below:
>
>> SELECT candname
>> FROM candidate JOIN resident ON candidate.zip = resident.zip
>> WHERE resident.name = 'Len Shapiro'
>
> 3 tables: person, resident, and candidate. Resident and candidate both
> have FKs to person, and so the 3-way join can be expressed using FK
> relationships.
>
>> "Which zip codes have both a Burgerville and a White Castle?"
>
> This looks either like a single table with a self join, and if not, it
> can be similarly decomposed as above.
>
>> "Who can TA each of the course offerings this year?"
>> [There can be more than one offering of the same course
>> number in a year, so course number is only a partial
>> key]
>
> 3 tables: course, course_offering, and potential_ta. I don't really know
> what the conditions here are supposed to be, but I'm pretty sure I can
> construct it using entirely FK joins.
>
> And so on. If a question presupposes that they use some particular
> database design, I think that disqualifies it from answering your
> problem.
>
> I think the only correct answer to your problem involves functional
> relationships that cannot be (sanely) materialized (as in my example,
> the functional relationship between a specific timestamp and the hour in
> which it occurs). To be useful, the function must not be one-to-one (as
> in my example), otherwise you could just join using the original value.
>
> If it relies on a functional relationship, the FK clearly cannot
> reference the result of a function (or if it can, you can simply ask a
> second question which relies on a different functional relationship),
> and thus it makes it impossible to simply alter the design to answer the
> question with FK joins.
>
> I happen to think time provides the most practical examples. For
> instance, let's say you have a bunch of events with a timestamp and
> timezone. You'd like to join based on the starting time of the events in
> local time. And let's say you'd also like to join based on GMT to answer
> a separate question. There's no design that can answer both those
> questions without relying on a function that can't be materialized into
> a relationship table. (To give these questions meaning, the first one
> might be "how did the turnout for these events compare with others
> happening at the same local time" and the second question might be "how
> did international news coverage of these events compare with others that
> happened at the same GMT time").
>
> Time truncation (i.e. getting the hour when you have microsecond
> resolution) provides more practical examples, although someone could
> plausibly break the timestamp into it's various fields. That's why you
> need to have a separate question that adds complexity (e.g. date math or
> timezones) to prevent that from being a realistic design (to use
> timezones you'd need to use the + operator, which can't be part of a FK
> relationship).
>
> Other practical examples exist, however. Salary ranges is a simple one:
> join people who make within $5k of each other, for instance. There are a
> million similar questions relating to demographics.
>
> Regards,
> Jeff Davis
>
>


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: len(at)pdx(dot)edu
Cc: Mark Wong <markwkm(at)gmail(dot)com>, Postgresql PDX_Users <pdxpug(at)postgresql(dot)org>, David Maier <maier(at)cs(dot)pdx(dot)edu>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-10-22 16:52:01
Message-ID: 1224694321.19961.37.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

On Wed, 2008-10-22 at 09:05 -0700, Len Shapiro wrote:
> The difficulty with your solutions is that they do not help an
> optimizer to determine the cardinality of a join. Consider the
> candidate-zip code example. The optimizer needs catalog information
> to determine cardinalities, and it is hoping that zip is a foreign
> key. It may be the case that, in some imaginary schema, zip is a
> foreign key. But that imaginary schema is not in the catalog for the
> optimizer to investigate. So the optimizer is SOL. That's my $.02,
> anyway.
>

I don't know what you mean by imaginary schema. You asked for a join,
which is an operation on value(s), so we'd have to find a join where
there is no reasonable design such that the join is a FK join.

And isn't it hard to determine the cardinality of any self join? That
would mean any self join would answer your problem. In fact, it seems
like all (or most) of the examples given can be expressed as a self join
in some reasonable database schema.

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: len(at)pdx(dot)edu
Cc: Mark Wong <markwkm(at)gmail(dot)com>, Postgresql PDX_Users <pdxpug(at)postgresql(dot)org>, David Maier <maier(at)cs(dot)pdx(dot)edu>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-10-22 16:52:55
Message-ID: 1224694375.19961.39.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

On Tue, 2008-10-21 at 10:07 -0700, Jeff Davis wrote:
> I think the only correct answer to your problem involves functional
> relationships that cannot be (sanely) materialized (as in my example,
> the functional relationship between a specific timestamp and the hour in
> which it occurs). To be useful, the function must not be one-to-one (as
> in my example), otherwise you could just join using the original value.

[ replying to myself ]

On second thought, there's no reason the function must not be
one-to-one, although many practical examples involve functional
relationships that are not one-to-one.

An example that uses a one-to-one function would be something like:
"give me all events that had a lower turnout than the same event at the
same time on the previous day". This is a self-join using date math
(subtracting one day), so there can't possibly be a FK relationship
there.

Also, I am not 100% sure that correct answers need to use a functional
relationship like that, but they are the only examples I can think of.
This is because a functional relationship is basically like a relation,
except that it can't be materialized within reason (e.g. a relation
representing the + operator would be huge).

The inability to materialize the relation means that it can't be stored
in a variable, and therefore there can be no FK or constraint of any
kind (you can only constrain variables, not values).

Regards,
Jeff Davis


From: "Len Shapiro" <lenshap(at)gmail(dot)com>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: "Postgresql PDX_Users" <pdxpug(at)postgresql(dot)org>, "David Maier" <maier(at)cs(dot)pdx(dot)edu>, "Mark Wong" <markwkm(at)gmail(dot)com>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-11-10 23:52:15
Message-ID: c5ee9b8a0811101552w293267ffo10a7be3cf6701646@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

Jeff and I have agreed that, in an attempt to avoid spam, this
will be the last email in our conversation that we will copy to
others. If you'd like to be cced on future emails in this
conversation, please contact me and I will include you.

I think it will help to clarify this conversation if I give some
more details of my motivation for wanting to be able to claim that
most joins are what I call foreign key (FK) joins. I agree now that
there are several meaningful joins that are not FK joins but I think I
can still say that almost all joins are FK joins.
Here's what I mean by "FK join": L join R on joining attributes
L(l) and R(r), where values of l normally exist in the column r and
the column r is unique. It is not necessary that every value of l
exist in R(r), in any particular instance. If every value of l exists
in R(r) in every instance, then an FK constraint holds.
Optimization is a difficult task, and estimating the size of
operator results is one of the most difficult parts of optimization.
Join size estimation is particularly difficult - existing formulas
make strong assumptions which are often violated. However, if a join
is an FK join as described above, it is possible to estimate the size
of the join quite accurately. The first case is when the constraint
holds. In the case the size of the join of L join R is exactly L. If
the constraint does not hold, the optimizer can sample the extent to
which it does not hold and apply the percentage to the size estimate.
Voila, a clean and accurate size estimate for the join.
If this all makes sense, I plan to ask a student to look into the
PG planner (optimizer) to see if it's possible to integrate this idea
into it.

Perhaps now it is clearer why I am not happy with an "imaginary"
schema. It does no good if there is another table K such that L join
K is an FK join but K is not in the database. Because then the
optimizer cannot check that the target join attribute in K is unique
and K cannot be sampled.

All the best,

Len

On Wed, Oct 22, 2008 at 8:52 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Wed, 2008-10-22 at 09:05 -0700, Len Shapiro wrote:
>> The difficulty with your solutions is that they do not help an
>> optimizer to determine the cardinality of a join. Consider the
>> candidate-zip code example. The optimizer needs catalog information
>> to determine cardinalities, and it is hoping that zip is a foreign
>> key. It may be the case that, in some imaginary schema, zip is a
>> foreign key. But that imaginary schema is not in the catalog for the
>> optimizer to investigate. So the optimizer is SOL. That's my $.02,
>> anyway.
>>
>
> I don't know what you mean by imaginary schema. You asked for a join,
> which is an operation on value(s), so we'd have to find a join where
> there is no reasonable design such that the join is a FK join.
>
> And isn't it hard to determine the cardinality of any self join? That
> would mean any self join would answer your problem. In fact, it seems
> like all (or most) of the examples given can be expressed as a self join
> in some reasonable database schema.
>
> Regards,
> Jeff Davis
>
>


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: len(at)pdx(dot)edu
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Postgresql PDX_Users" <pdxpug(at)postgresql(dot)org>, "David Maier" <maier(at)cs(dot)pdx(dot)edu>, "Mark Wong" <markwkm(at)gmail(dot)com>
Subject: Re: 9/18 Visual Planner Meeting Wrapup
Date: 2008-11-10 23:56:20
Message-ID: 81322F2A-A6B6-4413-A0D8-FC4CF880E99E@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pdxpug

On Nov 10, 2008, at 3:52 PM, Len Shapiro wrote:

> Perhaps now it is clearer why I am not happy with an "imaginary"
> schema. It does no good if there is another table K such that L join
> K is an FK join but K is not in the database. Because then the
> optimizer cannot check that the target join attribute in K is unique
> and K cannot be sampled.

Yes, having the context from which the question is asked helps a lot.
I much better understand what it was you were trying to ascertain.

Best,

David