Re: problem with subqueries

Lists: pgsql-sql
From: pete(at)phillipsfamily(dot)freeserve(dot)co(dot)uk
To: pgsql-sql(at)postgresql(dot)org
Subject: problem with subqueries
Date: 2002-10-05 22:19:31
Message-ID: 200210052219.g95MJVx01588@phillipsfamily.freeserve.co.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Hi

Any help appreciated - I have spent 2 days trying to get this query to
work!

I have an orders database and a customer database.

The orders database has a date field for each order.

Because I want to obtain a monthly breakdown, I created a view called
monthcustomer as this select:

select orders.ord_date, customer.cname,
date_part('month',orders.ord_date) AS "month",
date_part('year',orders.ord_date) AS "year",
orders.number_of_items;

Each month will have multiple numbers of items, so to get a monthly
breakdown I tried this:

select distinct year,month,
(select sum(monthcustomer.number_of_items) from monthcustomer where
monthcustomer.year=m.year and monthcustomer.month=m.month) as NumPots
from monthcustomer m;

This goes off and never comes back - CPU is hitting the top of the
chart! I have to ^C to interrupt it, as it runs for ages (I've left
this run for 10-20 minutes and it is still running).

I have indexes on the columns involved from the original tables.

Any help appreciated.

Pete

PS: Using pgsql 7.2-70 from Suse distribution.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pete(at)phillipsfamily(dot)freeserve(dot)co(dot)uk
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: problem with subqueries
Date: 2002-10-06 01:01:33
Message-ID: 6763.1033866093@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

pete(at)phillipsfamily(dot)freeserve(dot)co(dot)uk writes:
> select distinct year,month,
> (select sum(monthcustomer.number_of_items) from monthcustomer where
> monthcustomer.year=m.year and monthcustomer.month=m.month) as NumPots
> from monthcustomer m;

> This goes off and never comes back -

No surprise, considering the sub-select is going to be evaluated
separately for every row of monthcustomer --- and then most of those
evaluations will be thrown away by the DISTINCT :-(

A straightforward way of reducing the redundant computations would be
to do the DISTINCT first:

select year,month,
(select sum(monthcustomer.number_of_items) from monthcustomer where
monthcustomer.year=m.year and monthcustomer.month=m.month) as NumPots
from
(select distinct year, month from monthcustomer) as m;

But it appears to me that you are reinventing the wheel. Isn't this
query the equivalent of a grouped aggregation --- viz,

select year, month, sum(number_of_items) as NumPots
from monthcustomer
group by year, month

regards, tom lane


From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: <pete(at)phillipsfamily(dot)freeserve(dot)co(dot)uk>
Cc: <pgsql-sql(at)postgresql(dot)org>
Subject: Re: problem with subqueries
Date: 2002-10-06 01:02:09
Message-ID: 20021005175906.R58075-100000@megazone23.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql


On Sat, 5 Oct 2002 pete(at)phillipsfamily(dot)freeserve(dot)co(dot)uk wrote:

> Because I want to obtain a monthly breakdown, I created a view called
> monthcustomer as this select:
>
> select orders.ord_date, customer.cname,
> date_part('month',orders.ord_date) AS "month",
> date_part('year',orders.ord_date) AS "year",
> orders.number_of_items;
>
> Each month will have multiple numbers of items, so to get a monthly
> breakdown I tried this:
>
> select distinct year,month,
> (select sum(monthcustomer.number_of_items) from monthcustomer where
> monthcustomer.year=m.year and monthcustomer.month=m.month) as NumPots
> from monthcustomer m;
>
> This goes off and never comes back - CPU is hitting the top of the
> chart! I have to ^C to interrupt it, as it runs for ages (I've left

That's going to run that inner select once for every row in monthcustomer
probably.

Would
select year, month, sum(monthcustomer.number_of_items) as NumPots frmo
monthcustomer m group by year, month;
have the same effect, get the sum of the items for each year/month
combination along with which year and month?


From: pete(at)phillipsfamily(dot)freeserve(dot)co(dot)uk
To: pgsql-sql(at)postgresql(dot)org
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, kenzo(at)kennethambrose(dot)com
Subject: Re: problem with subqueries
Date: 2002-10-06 11:13:41
Message-ID: 200210061113.g96BDfG10468@phillipsfamily.freeserve.co.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql


Hi guys. Thanks for the rapid replies so far.

To answer some of the questions:

>you did not indicate an explicit join - or even a "from" clause for that
>matter- in the example of your create view statement.

My original post was a simplified version. Here is the actual view
creating statement:

create view monthord as select ord_date, extract (month from ord_date)
as month, extract (year from ord_date) as year,r_region,
number_of_items from orders,customer where ccode = codenum;

>But it appears to me that you are reinventing the wheel. Isn't this
>query the equivalent of a grouped aggregation

Yes - but again I was simplifying - I want to run a sub query for each
region, so I get output like this:

year month Reg1 Reg2 Reg3 Reg4
----- ----- ---- ---- ----- ----
1999 Jan 20 45 10 27
1999 Feb 30 43 18 37
...
2002 Oct 7 89 60 17

The subquery I have tried to run is actually this (there is probably a
way to do this all in SQL, but at present I would like to just
understand why my subqueries take so long).

-----start of script ---------------
#!/bin/sh
# prepare a query as shell variables

# set up timeframe
SDATE=`date --date '2 years ago' +'%Y-%m-01'`
EDATE=`date --date 'next month' +'%Y-%m-01'`

# use the QUERY env variable to build the full query
QUERY="select distinct year,month, "

for reg in `psql -U postgres -d product_db -c "select distinct r_code from regionlist order by r_code;"`
do

QUERY="$QUERY (select sum(monthcustomer.number_of_items) from monthcustomer where monthcustomer.year=m.year and monthcustomer.month=m.month and r_region like '$reg') as $reg,"

done

# now add a total column
QUERY="$QUERY (select sum(monthcustomer.number_of_items) from monthcustomer where ord_date >= '$SDATE' and ord_date < '$EDATE' and monthcustomer.year=m.year and monthcustomer.month=m.month) as TOTAL from monthcustomer m;"

# execute the query
psql -U postgres -d newmaggot -c "$QUERY"

-----end of script ---------------

As you can see, I get all the regions, and loop through them building
up a bunch of subqueries as I go along.

Hopefully this will answer some of your questions. I will try some of
the suggestions later on today (such as placing the distinct clause at
the end) but first I have to go and get some decorating done :-( (yes,
that sound you can hear is the crack of my wife's whip!).

Regards,
Pete

--
I do whatever the Voices tell me to do


From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: <pete(at)phillipsfamily(dot)freeserve(dot)co(dot)uk>
Cc: <pgsql-sql(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <kenzo(at)kennethambrose(dot)com>
Subject: Re: problem with subqueries
Date: 2002-10-06 16:18:19
Message-ID: 20021006090327.Q65664-100000@megazone23.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql


On Sun, 6 Oct 2002 pete(at)phillipsfamily(dot)freeserve(dot)co(dot)uk wrote:

> Hi guys. Thanks for the rapid replies so far.
>
> To answer some of the questions:
>
> >you did not indicate an explicit join - or even a "from" clause for that
> >matter- in the example of your create view statement.
>
> My original post was a simplified version. Here is the actual view
> creating statement:
>
> create view monthord as select ord_date, extract (month from ord_date)
> as month, extract (year from ord_date) as year,r_region,
> number_of_items from orders,customer where ccode = codenum;
>
>
> >But it appears to me that you are reinventing the wheel. Isn't this
> >query the equivalent of a grouped aggregation
>
> Yes - but again I was simplifying - I want to run a sub query for each
> region, so I get output like this:
>
> year month Reg1 Reg2 Reg3 Reg4
> ----- ----- ---- ---- ----- ----
> 1999 Jan 20 45 10 27
> 1999 Feb 30 43 18 37
> ...
> 2002 Oct 7 89 60 17
>
> The subquery I have tried to run is actually this (there is probably a
> way to do this all in SQL, but at present I would like to just
> understand why my subqueries take so long).

Well, you're running <n> subqueries for each row in monthcustomer
because the distinct happens afterwards in your query. So if you've
got 4 regions and 1 total and 100,000 rows in monthcustomer, you're
looking at something on the order of 500,000 subqueries. Doing the
distinct before that step should lower the number to
((#year/month combinations) * (#regions+1)).

In any case, you may be better off with one of:

a) Doing something programatic to turn a result set like:
year|month|region|value
1999|Jan |1 |20
1999|Jan |2 |45
...
into the form you want. The above can be gotten by group by
probably and would require no subqueries.

b) Keeping a summary table that you update via triggers. This
requires a bit of work to get the triggers, but it probably
makes the query faster.