From: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: gprof SELECT COUNT(*) results
Date: 2005-11-24 18:25:18
Message-ID: Pine.LNX.4.58.0511241312580.27330@eon.cs
Lists: pgsql-hackers
I did some gprof on a simple "SELECT COUNT(*) FROM test" query on cvs tip.
Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
gcc: 2.96
gprof: 2.13.90.0.2
./configure --without-readline
There are 260k or so records in table test(i int), about 1500 pages. I
set shared_buffers to 3000, which is enough to hold all data pages.
Other GUCs are left at their defaults. After some warmups (to make sure
these pages are in the file system buffers), I ran "SELECT COUNT(*)" 10
times in each round, and I tested 3 rounds. The results are:
- Round 1 -
% cumulative self self total
time seconds seconds calls s/call s/call name
16.67 0.27 0.27 2648542 0.00 0.00 LWLockAcquire
13.58 0.49 0.22 2648543 0.00 0.00 LWLockRelease
8.02 0.62 0.13 5266128 0.00 0.00 LockBuffer
8.02 0.75 0.13 2621456 0.00 0.00 heapgettup
- Round 2 -
% cumulative self self total
time seconds seconds calls s/call s/call name
19.14 0.31 0.31 2648542 0.00 0.00 LWLockAcquire
13.58 0.53 0.22 2648543 0.00 0.00 LWLockRelease
11.11 0.71 0.18 2621456 0.00 0.00 heapgettup
6.79 0.82 0.11 5266128 0.00 0.00 LockBuffer
- Round 3 -
% cumulative self self total
time seconds seconds calls s/call s/call name
17.12 0.25 0.25 2648542 0.00 0.00 LWLockAcquire
8.22 0.37 0.12 2648543 0.00 0.00 LWLockRelease
7.53 0.48 0.11 2621456 0.00 0.00 heapgettup
6.85 0.58 0.10 2621440 0.00 0.00 ExecEvalConst
There is some variance in the results, so my questions are:
(1) Are these results reliable?
(2) If so, do they indicate that LWLock needs some improvement?
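(For anyone who wants to reproduce the measurement, the setup and the
warm-up-plus-timing loop can be scripted roughly like the sketch below.
It is illustrative only: the connection parameters, the exact row count,
and the use of libpq instead of psql are assumptions, not what was
actually run under gprof.)

/*
 * count_bench.c - rough harness for the test described above: build a
 * ~260k-row table, warm the cache, then time 10 runs of SELECT COUNT(*).
 *
 * Build (library path may differ):  gcc -O2 count_bench.c -lpq
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK &&
        PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        exit(1);
    }
    PQclear(res);
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("");     /* default connection settings */
    struct timeval start, stop;
    int         i;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* ~260k single-int rows, roughly 1500 heap pages */
    run(conn, "DROP TABLE IF EXISTS test");
    run(conn, "CREATE TABLE test (i int)");
    run(conn, "INSERT INTO test SELECT generate_series(1, 262144)");

    /* warm-up pass so the data pages are cached */
    run(conn, "SELECT COUNT(*) FROM test");

    gettimeofday(&start, NULL);
    for (i = 0; i < 10; i++)                /* 10 executions per round */
        run(conn, "SELECT COUNT(*) FROM test");
    gettimeofday(&stop, NULL);

    printf("10 x SELECT COUNT(*): %.3f s\n",
           (stop.tv_sec - start.tv_sec) +
           (stop.tv_usec - start.tv_usec) / 1000000.0);

    PQfinish(conn);
    return 0;
}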
Regards,
Qingqing
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-24 20:59:19
Message-ID: 1132865959.4347.138.camel@localhost.localdomain
Lists: pgsql-hackers
On Thu, 2005-11-24 at 13:25 -0500, Qingqing Zhou wrote:
> I did some gprof on a simple "SELECT COUNT(*) FROM test" query on cvs tip.
>
> Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
> gcc: 2.96
> gprof: 2.13.90.0.2
> ./configure --without-readline
>
> There are 260k or so records in table test(i int), about 1500 pages. I
> set shared_buffers to 3000, which is enough to hold all data pages.
> Other GUCs are left at their defaults. After some warmups (to make sure
> these pages are in the file system buffers), I ran "SELECT COUNT(*)" 10
> times in each round, and I tested 3 rounds. The results are:
>
> - Round 1 -
> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 16.67 0.27 0.27 2648542 0.00 0.00 LWLockAcquire
> 13.58 0.49 0.22 2648543 0.00 0.00 LWLockRelease
> 8.02 0.62 0.13 5266128 0.00 0.00 LockBuffer
> 8.02 0.75 0.13 2621456 0.00 0.00 heapgettup
>
> - Round 2 -
> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 19.14 0.31 0.31 2648542 0.00 0.00 LWLockAcquire
> 13.58 0.53 0.22 2648543 0.00 0.00 LWLockRelease
> 11.11 0.71 0.18 2621456 0.00 0.00 heapgettup
> 6.79 0.82 0.11 5266128 0.00 0.00 LockBuffer
>
> - Round 3 -
> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 17.12 0.25 0.25 2648542 0.00 0.00 LWLockAcquire
> 8.22 0.37 0.12 2648543 0.00 0.00 LWLockRelease
> 7.53 0.48 0.11 2621456 0.00 0.00 heapgettup
> 6.85 0.58 0.10 2621440 0.00 0.00 ExecEvalConst
>
> There is some variance in the results, so my questions are:
> (1) Are these results reliable?
> (2) If so, do they indicate that LWLock needs some improvement?
Maybe, maybe not. The whole system is designed around high levels of
concurrent access. If you know for certain you don't ever need that then
other systems are probably the right choice. Concurrency has a cost and
a benefit. If you measure the cost, but not the benefit, it will seem
expensive.
Your results show you have 2.6m rows, not 260k rows. Yes?
Best Regards, Simon Riggs
From: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-24 21:27:05
Message-ID: Pine.LNX.4.58.0511241618290.16035@josh.db
Lists: pgsql-hackers
On Thu, 24 Nov 2005, Simon Riggs wrote:
>
> Maybe, maybe not. The whole system is designed around high levels of
> concurrent access. If you know for certain you don't ever need that then
> other systems are probably the right choice. Concurrency has a cost and
> a benefit. If you measure the cost, but not the benefit, it will seem
> expensive.
>
Yeah, understood. What I can't understand is why, in this case, it costs
so much -- without concurrency, the LWLock code path just involves
spinlock_lock/unlock and several simple instructions?
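(For reference, here is a rough, self-contained paraphrase of what the
uncontended acquire/release path amounts to. The names are invented and a
C11 atomic_flag stands in for the backend's TAS spinlock; the real
LWLockAcquire/LWLockRelease also deal with wait queues, semaphores and
interrupt holdoff, but none of that code runs when there is no contention.)

/*
 * Rough paraphrase of an uncontended lightweight-lock fast path.
 * Invented names; not the actual PostgreSQL source.
 */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_flag mutex;      /* per-lock spinlock */
    int         exclusive;  /* # of exclusive holders (0 or 1) */
    int         shared;     /* # of shared holders */
} LightLock;

static bool
light_lock_acquire_shared(LightLock *lock)
{
    bool        ok;

    /* spinlock acquire: one bus-locked test-and-set when uncontended */
    while (atomic_flag_test_and_set(&lock->mutex))
        ;                   /* spin loop; never entered without contention */

    ok = (lock->exclusive == 0);        /* a few plain instructions */
    if (ok)
        lock->shared++;

    atomic_flag_clear(&lock->mutex);    /* spinlock release */

    return ok;      /* the real code queues and sleeps instead of failing */
}

static void
light_lock_release_shared(LightLock *lock)
{
    while (atomic_flag_test_and_set(&lock->mutex))
        ;
    lock->shared--;
    atomic_flag_clear(&lock->mutex);
    /* the real code also wakes any queued waiters here */
}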
What's more, we can see that for each row, a LWLock pair is invoked. So on
a more aggressive thought, can we change it to page level? I know it is
terribly difficult since our query processor infrastructure is based on a
single-tuple interface ...
> Your results show you have 2.6m rows, not 260k rows. Yes?
It is 260k; the call counts are 10x that because each round runs "SELECT COUNT(*)" 10 times.
Regards,
Qingqing
From: Greg Stark <gsstark(at)mit(dot)edu>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 03:23:15
Message-ID: 87veyhjtr0.fsf@stark.xeocode.com
Lists: pgsql-hackers
Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> Yeah, understood. What I can't understand is why, in this case, it costs
> so much -- without concurrency, the LWLock code path just involves
> spinlock_lock/unlock and several simple instructions?
You executed LWLock 2.6 million times in just under 300ms. If my math is right
that's about 115 nanoseconds per lock or about 300 cycles on a 2.6Ghz
processor.
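Spelled out, that arithmetic is:

\[
\frac{\approx 0.30\ \mathrm{s}}{2.6 \times 10^{6}\ \text{calls}} \approx 115\ \mathrm{ns/call},
\qquad
115\ \mathrm{ns} \times 2.6\ \mathrm{GHz} \approx 300\ \text{cycles/call}.
\]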
That sounds like a lot but it's about the right order of magnitude. Was this
on a multiprocessor machine? In which case a big part of that time is probably
spent synchronizing between the processors.
--
greg
From: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 04:02:37
Message-ID: Pine.LNX.4.58.0511242249540.13062@eon.cs
Lists: pgsql-hackers
On Thu, 24 Nov 2005, Greg Stark wrote:
>
>
> You executed LWLock 2.6 million times in just under 300ms. If my math is right
> that's about 115 nanoseconds per lock or about 300 cycles on a 2.6Ghz
> processor.
>
> That sounds like a lot but it's about the right order of magnitude. Was this
> on a multiprocessor machine? In which case a big part of that time is probably
> spent synchronizing between the processors.
>
Your math is right iff my math is right :-) It is a 2.4G desktop computer.
I may need to write some separate tests to see if this is the price we
should pay for a bus lock instruction.
Regards,
Qingqing
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 04:48:33
Message-ID: 18712.1132894113@sss.pgh.pa.us
Lists: pgsql-hackers
Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> Yeah, understood. What I can't understand is why, in this case, it costs
> so much -- without concurrency, the LWLock code path just involves
> spinlock_lock/unlock and several simple instructions?
I don't see those costing nearly as much as your results show
... perhaps there's something platform-specific at work?
What I see, down to the 1% level, is
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
37.98 13.91 13.91 _mcount
6.53 16.30 2.39 5242900 0.00 0.00 heapgettup
3.33 17.52 1.22 10542596 0.00 0.00 LockBuffer
3.30 18.73 1.21 5242880 0.00 0.00 advance_transition_function
2.68 19.71 0.98 5242880 0.00 0.00 IncrBufferRefCount
2.46 20.61 0.90 5385174 0.00 0.00 LWLockRelease
2.38 21.48 0.87 5271273 0.00 0.00 ReleaseAndReadBuffer
2.35 22.34 0.86 5385174 0.00 0.00 LWLockAcquire
2.18 23.14 0.80 5242938 0.00 0.00 ReleaseBuffer
2.10 23.91 0.77 5242900 0.00 0.00 ExecStoreTuple
1.97 24.63 0.72 noshlibs
1.91 25.33 0.70 5242900 0.00 0.00 SeqNext
1.72 25.96 0.63 5271294 0.00 0.00 ResourceOwnerRememberBuffer
1.72 26.59 0.63 5242900 0.00 0.00 heap_getnext
1.72 27.22 0.63 5242880 0.00 0.00 advance_aggregates
1.69 27.84 0.62 5242940 0.00 0.00 ExecProcNode
1.64 28.44 0.60 $$dyncall
1.61 29.03 0.59 5242900 0.00 0.00 MemoryContextReset
1.53 29.59 0.56 5242880 0.00 0.00 HeapTupleSatisfiesSnapshot
1.45 30.12 0.53 5242880 0.00 0.00 int8inc
1.37 30.62 0.50 5243140 0.00 0.00 ExecClearTuple
1.17 31.05 0.43 5242880 0.00 0.00 ExecEvalExprSwitchContext
1.15 31.47 0.42 5271294 0.00 0.00 ResourceOwnerForgetBuffer
1.12 31.88 0.41 SeqNext
1.09 32.28 0.40 5271294 0.00 0.00 ResourceOwnerEnlargeBuffers
1.09 32.68 0.40 5242900 0.00 0.00 ExecScan
1.04 33.06 0.38 5242900 0.00 0.00 ExecSeqScan
(This is for 20, not 10, iterations of your example, but otherwise it's
the same test case.)
I've since gotten rid of the IncrBufferRefCount, ReleaseBuffer,
and ResourceOwnerXXX entries by eliminating some inefficiency in
ExecStoreTuple, so that puts the buffer lock stuff further up,
but it's still not all that critical by my numbers.
> What's more, we can see that for each row, a LWLock pair is invoked. So on
> a more aggressive thought, can we change it to page level?
Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
stuff so that we do the visibility tests for all the tuples on a given
page at once, taking the buffer content lock just once, and saving aside
the valid tuple IDs to return later. This should definitely be faster
when all the tuples actually get fetched. It might be a bit slower for
a LIMIT query, but I'm not sure if we care that much. The only other
objection I can think of is that if there are any broken tuples on a
page, this approach would likely make it impossible to fetch any of the
non-broken ones :-(
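(To make the idea concrete, here is a toy sketch of the per-page pattern --
invented names and types, with a pthread mutex standing in for the buffer
content lock and a boolean standing in for the real visibility test; it is
not the actual heap-access code.)

/*
 * Toy model of page-at-a-time visibility checking: lock the page once,
 * record which items are visible, unlock, then hand tuples back from the
 * saved list with no further lock traffic.  Illustrative only.
 */
#include <pthread.h>
#include <stdbool.h>

#define MAX_ITEMS_PER_PAGE 256

typedef struct
{
    pthread_mutex_t content_lock;           /* stand-in for the buffer content lock */
    int             nitems;
    bool            visible[MAX_ITEMS_PER_PAGE];    /* stand-in for per-tuple visibility */
} ToyPage;

typedef struct
{
    int         vis_items[MAX_ITEMS_PER_PAGE];      /* offsets that passed the check */
    int         nvis;
    int         next;                       /* next saved offset to return */
} ToyScan;

/* One lock/unlock per page instead of one per tuple. */
static void
scan_page(ToyScan *scan, ToyPage *page)
{
    int         i;

    scan->nvis = 0;
    scan->next = 0;

    pthread_mutex_lock(&page->content_lock);
    for (i = 0; i < page->nitems; i++)
        if (page->visible[i])               /* the snapshot test in the real thing */
            scan->vis_items[scan->nvis++] = i;
    pthread_mutex_unlock(&page->content_lock);
}

/* Subsequent fetches just walk the saved array. */
static int
scan_next(ToyScan *scan)
{
    if (scan->next >= scan->nvis)
        return -1;                          /* page exhausted; move to the next page */
    return scan->vis_items[scan->next++];
}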
regards, tom lane
From: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 04:58:57
Message-ID: Pine.LNX.4.58.0511242354580.13944@eon.cs
Lists: pgsql-hackers
On Thu, 24 Nov 2005, Qingqing Zhou wrote:
>
> I may need to write some separate tests to see if this is the price we
> should pay for a bus lock instruction.
>
Here is a test program I came up with to see how much a spinlock costs:
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 143.134 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 143.107 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 143.104 ms
So it seems the lock instruction really is costly ...
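Put as numbers, using the 2.4G clock mentioned earlier in the thread:

\[
\frac{143\ \mathrm{ms}}{2{,}648{,}542\ \text{pairs}} \approx 54\ \mathrm{ns\ per\ pair},
\qquad
54\ \mathrm{ns} \times 2.4\ \mathrm{GHz} \approx 130\ \text{cycles per acquire/release pair},
\]

which is a sizable fraction of the roughly 100 ns that the gprof runs above
attributed to each LWLockAcquire call.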
Regards,
Qingqing
---
/*
 * spintest.c -
 *     Test spinlock acquire/release without concurrency.
 *
 * To compile (the -pg is to match the gprof make I used):
 *     backend/storage/lmgr# gcc -O2 -pg -Wall -I ../../../include/ spintest.c
 */
#include "postgres.h"

#include "storage/lwlock.h"
#include "storage/spin.h"

#include <sys/time.h>

#define TIMES 2648542

int     NumLocks = 0;

void
s_lock(volatile slock_t *lock, const char *file, int line)
{
    fprintf(stderr, "should never be here\n");
    abort();
}

int
main(void)
{
    int     i;
    slock_t lock = 0;
    struct timeval start_t, stop_t;
    long    usecs;

    gettimeofday(&start_t, NULL);

    for (i = 0; i < TIMES; i++)
    {
        SpinLockAcquire_NoHoldoff(&lock);

        /* pretend to do something */
        NumLocks++;

        SpinLockRelease_NoHoldoff(&lock);
    }

    gettimeofday(&stop_t, NULL);

    if (stop_t.tv_usec < start_t.tv_usec)
    {
        stop_t.tv_sec--;
        stop_t.tv_usec += 1000000;
    }
    usecs = (long) (stop_t.tv_sec - start_t.tv_sec) * 1000000
        + (long) (stop_t.tv_usec - start_t.tv_usec);

    fprintf(stdout, "Spinlock pair(%u) duration: %ld.%03ld ms\n",
            TIMES,
            (long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 +
                    (stop_t.tv_usec - start_t.tv_usec) / 1000),
            (long) (stop_t.tv_usec - start_t.tv_usec) % 1000);

    return 0;
}
From: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 05:58:23
Message-ID: Pine.LNX.4.58.0511250044520.14187@eon.cs
Lists: pgsql-hackers
On Thu, 24 Nov 2005, Tom Lane wrote:
>
> I don't see those costing nearly as much as your results show
> ... perhaps there's something platform-specific at work?
> What I see, down to the 1% level, is
>
I can see your computer is really slow, so my theory is that since it is
easier to hold a slow-running horse than a fast one, my spinlock on a
2.4G modern machine should take relatively longer to take effect.
Just kidding. I am not sure what's happening, but my previous email contains
a program I wrote to test the spinlock performance. On my machine, the
profiling results match the single-spinlock test.
>
> The only other objection I can think of is that if there are any broken
> tuples on a page, this approach would likely make it impossible to fetch
> any of the non-broken ones :-(
>
What do you mean by "broken tuple"? A data-corrupted tuple? So you mean
that if the scan operator finds a broken tuple on a page, it will abort the
operation without returning any other good tuples? I think this is
acceptable.
Regards,
Qingqing
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 10:23:54
Message-ID: 1132914234.4347.169.camel@localhost.localdomain
Lists: pgsql-hackers
On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:
> > What's more, we can see that for each row, a LWLock pair is invoked. So on
> > a more aggressive thought, can we change it to page level?
>
> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
> stuff so that we do the visibility tests for all the tuples on a given
> page at once, taking the buffer content lock just once, and saving aside
> the valid tuple IDs to return later. This should definitely be faster
> when all the tuples actually get fetched.
And me also. Looks good from here. It would be a very localised
optimization, so fairly easy to implement.
Even on your lower gprof numbers this would be 15% faster stand-alone.
Take into account the significant reduction in LWLock requests and it
might go much faster still on a busy system. The gain for other users
would be very good also.
> It might be a bit slower for
> a LIMIT query, but I'm not sure if we care that much.
Agreed. It would be fairly easy to make this happen only for full
SeqScans, and the greater majority of those don't have a LIMIT of less
than one block.
> The only other
> objection I can think of is that if there are any broken tuples on a
> page, this approach would likely make it impossible to fetch any of the
> non-broken ones :-(
I was thinking of the brute force approach: take a complete copy of the
block when we read the first tuple off it. That way any weirdness
on-block is avoided until we logically attempt to read that tuple. It
also allows us to palloc the right amount of space first time.
This could be an interesting win: no other DBMS can take this approach.
I think it has further advantages also:
1. Copying the tuples locally will also mean that the data will probably
be in the L2 cache rather than main memory, so this could make things
faster still.
2. For large scans, the buffer for that block doesn't need to be pinned
after the first touch, so the buffer could be cleared and replaced with
another before we have logically finished reading the block. That will
improve fluidity in shared_buffers and improve efficiency.
Are you, or will you be implementing this? If not, I'll implement this -
it seems an important BI optimization.
Best Regards, Simon Riggs
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 14:54:38
Message-ID: 22075.1132930478@sss.pgh.pa.us
Lists: pgsql-hackers
Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:
>> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
>> stuff so that we do the visibility tests for all the tuples on a given
>> page at once, taking the buffer content lock just once, and saving aside
>> the valid tuple IDs to return later. This should definitely be faster
>> when all the tuples actually get fetched.
> I was thinking of the brute force approach: take a complete copy of the
> block when we read the first tuple off it. That way any weirdness
> on-block is avoided until we logically attempt to read that tuple. It
> also allows us to palloc the right amount of space first time.
That occurred to me too, but I rejected it on two grounds:
* All that data copying will be expensive.
* We couldn't update tuple commit hint bits on the real page.
Also, I'm not at all sure that it's a good idea to release the buffer
pin before we're genuinely done with the page. That sort of change
would impact such things as the VACUUM interlock algorithm. Maybe it'd
be OK but I'm disinclined to mess with it for a dubious speed gain.
Even with my version of the proposal, it'd be risky to use the check-
all-in-advance approach for non-MVCC snapshots such as SnapshotNow.
I'm not sure that there are any places that would get broken by missing
tuples that weren't yet committed when the page was first touched ...
but I'm not sure there aren't, either. We could avoid this risk by
having two paths through heapgettup, though.
> Are you, or will you be implementing this?
I plan to take a look at it soon.
regards, tom lane
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 15:20:11
Message-ID: 22258.1132932011@sss.pgh.pa.us
Lists: pgsql-hackers
Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> I can see your computer is really slow, so my theory is that since it is
> easier to hold a slow-running horse than a fast one, my spinlock on a
> 2.4G modern machine should take relatively longer to take effect.
> Just kidding.
Is that "modern machine" a Xeon by any chance? We know that Xeons have
fairly awful concurrent performance, and the long latency for bus lock
instructions may well be the reason why. FWIW, the numbers I showed
last night were for an HPPA machine, which I used just because I chanced
to have CVS tip already built for profiling on it. I've since
reproduced the test on a spiffy new dual Xeon that Red Hat just bought
me :-) ... and I get similar numbers to yours. It'd be interesting to
see the results from an SMP Opteron, if anyone's got one handy.
>> The only other objection I can think of is that if there are any broken
>> tuples on a page, this approach would likely make it impossible to fetch
>> any of the non-broken ones :-(
> What do you mean by "broken tuple"? A data-corrupted tuple?
The specific case that's worrying me is a tuple with a corrupted xmin,
such that tqual.c fails with a "can't access transaction status"
message.
> So you mean
> that if the scan operator finds a broken tuple on a page, it will abort the
> operation without returning any other good tuples? I think this is
> acceptable.
I think it would be totally unacceptable if it meant that there was no
way at all to look at the other data on the same page with a corrupt
tuple. Seqscans with "LIMIT n" have been the traditional tool for
getting as much info as you could out of a corrupted page. However,
it now occurs to me that you could still cherry-pick non-corrupt tuples
with TID-scan queries, so this objection shouldn't be considered fatal.
regards, tom lane
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 15:38:50
Message-ID: 1132933130.2906.7.camel@localhost.localdomain
Lists: pgsql-hackers
On Fri, 2005-11-25 at 09:54 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:
> >> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
> >> stuff so that we do the visibility tests for all the tuples on a given
> >> page at once, taking the buffer content lock just once, and saving aside
> >> the valid tuple IDs to return later. This should definitely be faster
> >> when all the tuples actually get fetched.
>
> > I was thinking of the brute force approach: take a complete copy of the
> > block when we read the first tuple off it. That way any wierdness
> > on-block is avoided until we logically attempt to read that tuple. It
> > also allows us to palloc the right amount of space first time.
>
> That occurred to me too, but I rejected it on two grounds:
> * All that data copying will be expensive.
> * We couldn't update tuple commit hint bits on the real page.
OK, the second one is the clincher.
> Also, I'm not at all sure that it's a good idea to release the buffer
> pin before we're genuinely done with the page. That sort of change
> would impact such things as the VACUUM interlock algorithm. Maybe it'd
> be OK but I'm disinclined to mess with it for a dubious speed gain.
I think it's safer not to try it all in one go. It's worth coming back to
later though: VACUUM will never remove rows the scanner can see anyway.
> Even with my version of the proposal, it'd be risky to use the check-
> all-in-advance approach for non-MVCC snapshots such as SnapshotNow.
> I'm not sure that there are any places that would get broken by missing
> tuples that weren't yet committed when the page was first touched ...
> but I'm not sure there aren't, either. We could avoid this risk by
> having two paths through heapgettup, though.
That seems prudent. The main performance optimization is required for
MVCC access anyway; we seldom scan catalog tables and hopefully they
never get too big.
> > Are you, or will you be implementing this?
>
> I plan to take a look at it soon.
Much appreciated and well done to Qingqing for spotting this.
Best Regards, Simon Riggs
From: Olivier Thauvin <olivier(dot)thauvin(at)aerov(dot)jussieu(dot)fr>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 15:59:27
Message-ID: 200511251659.31683.olivier.thauvin@aerov.jussieu.fr
Lists: pgsql-hackers
On Friday 25 November 2005 at 16:20, Tom Lane wrote:
> Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> > I can see your computer is really slow, so my theory is that since it is
> > easier to hold a slow-running horse than a fast one, my spinlock on a
> > 2.4G modern machine should take relatively longer to take effect.
> > Just kidding.
>
> Is that "modern machine" a Xeon by any chance? We know that Xeons have
> fairly awful concurrent performance, and the long latency for bus lock
> instructions may well be the reason why. FWIW, the numbers I showed
> last night were for an HPPA machine, which I used just because I chanced
> to have CVS tip already built for profiling on it. I've since
> reproduced the test on a spiffy new dual Xeon that Red Hat just bought
> me :-) ... and I get similar numbers to yours. It'd be interesting to
> see the results from an SMP Opteron, if anyone's got one handy.
Is that what you're looking for?
[thauvin(at)samsufi ~]$ egrep "(model name|MHz)" /proc/cpuinfo
model name : AMD Opteron(tm) Processor 250
cpu MHz : 2391.040
model name : AMD Opteron(tm) Processor 250
cpu MHz : 2391.040
$ cat /etc/mandrake-release
Mandrakelinux release 10.1 (Community) for x86_64
I can try to backport PostgreSQL 8.1.0 rpms from the development tree onto
Mandriva 10.1, install them, and run some tests if you're really interested.
From: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-25 17:57:20
Message-ID: Pine.LNX.4.58.0511251247030.26007@eon.cs
Lists: pgsql-hackers
On Fri, 25 Nov 2005, Tom Lane wrote:
>
> Is that "modern machine" a Xeon by any chance?
>
$#cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
I can find a 4-way Xeon (but it is shared by many users):
/h/164/zhouqq#cat /proc/cpuinfo |grep "model name"
model name : Intel(R) Xeon(TM) CPU 2.40GHz
model name : Intel(R) Xeon(TM) CPU 2.40GHz
model name : Intel(R) Xeon(TM) CPU 2.40GHz
model name : Intel(R) Xeon(TM) CPU 2.40GHz
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 161.785 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 160.661 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 155.505 ms
>
> it now occurs to me that you could still cherry-pick non-corrupt tuples
> with TID-scan queries, so this objection shouldn't be considered fatal.
>
It is great that it is not that difficult to do! What's more, a parallel
scan will be easier to implement on top of page-level scan operators.
Regards,
Qingqing
From: | "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: gprof SELECT COUNT(*) results |
Date: | 2005-11-26 08:13:33 |
Message-ID: | dm95dpd4kdm95dp$2d4k$1@news.hub.org@news.hub.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> I plan to take a look at it soon.
>
"Stand corrected"[Merlin]! In-memory "SELECT COUNT(*)" doubles the
performance due to my test.
- before -
1.56 s
- now -
0.72 s
Regards,
Qingqing
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-26 12:39:21
Message-ID: 1133008761.2906.121.camel@localhost.localdomain
Lists: pgsql-hackers
On Sat, 2005-11-26 at 03:13 -0500, Qingqing Zhou wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
> >
> > I plan to take a look at it soon.
> >
>
> "Stand corrected"[Merlin]! In-memory "SELECT COUNT(*)" doubles the
> performance due to my test.
>
> - before -
> 1.56 s
>
> - now -
> 0.72 s
>
...and for emphasis: this optimization of SeqScans is not possible with
any other database system, so it's a big win for PostgreSQL.
Best Regards, Simon Riggs
From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-26 16:21:00
Message-ID: 43888B6C.8050501@familyhealth.com.au
Lists: pgsql-hackers
> ...and for emphasis: this optimization of SeqScans is not possible with
> any other database system, so it's a big win for PostgreSQL.
With any other db system? That's a big call. Why? Not even other MVCC
systems?
Chris
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-26 16:53:21
Message-ID: 3903.1133024001@sss.pgh.pa.us
Lists: pgsql-hackers
Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
>> ...and for emphasis: this optimization of SeqScans is not possible with
>> any other database system, so it's a big win for PostgreSQL.
> With any other db system? That's a big call. Why?
One could equally well spin it negatively, as "this optimization is not
needed with any other database" ...
regards, tom lane
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-26 17:57:01
Message-ID: 1133027821.2906.156.camel@localhost.localdomain
Lists: pgsql-hackers
On Sat, 2005-11-26 at 11:53 -0500, Tom Lane wrote:
> Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> >> ...and for emphasis: this optimization of SeqScans is not possible with
> >> any other database system, so it's a big win for PostgreSQL.
>
> > With any other db system? That's a big call. Why?
>
> One could equally well spin it negatively, as "this optimization is not
> needed with any other database" ...
Why is it spin to call it a big win? You amaze me. After all the time
we've spent investigating tuning the buffer manager, to reduce the
locking required by a SeqScan to about 1% of what it was before is just
way cool. When did you last spend a few hours tuning an internals
feature (not the optimizer) and have something go 50% faster? Probably
at least a month :-)
Thinking more about other systems, ISTM that Oracle can do this, as can
any MVCC based system. OTOH DB2 and SQLServer take block level read
locks, so they can do this too, but at major loss of concurrency and
threat of deadlock. Having said that, *any* system that chose not to do
this would be severely sub-optimal with buffer manager contention. So
AFAICS they all need this optimization.
So, I think I'll take back the claim of uniqueness, but I'm happy with
the accuracy of the statement that this is a big win for PostgreSQL. And
a bigger win than most on the poor benighted Xeon, where reducing memory
contention/traffic seems to be crucial.
Best Regards, Simon Riggs
From: | "Luke Lonergan" <llonergan(at)greenplum(dot)com> |
---|---|
To: | "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org |
Cc: | "Ayush Parashar" <aparashar(at)greenplum(dot)com> |
Subject: | Re: gprof SELECT COUNT(*) results |
Date: | 2005-11-26 20:19:36 |
Message-ID: | BFAE0358.147EA%llonergan@greenplum.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Nice job Qingqing and Tom!
The improved executor / agg performance will likely help BI / data warehouse
customers a lot. I'll get some DBT-3 results to substantiate as soon as we
can.
- Luke
On 11/26/05 12:13 AM, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> wrote:
>
>
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>> >
>> > I plan to take a look at it soon.
>> >
>
> "Stand corrected"[Merlin]! In-memory "SELECT COUNT(*)" doubles the
> performance due to my test.
>
> - before -
> 1.56 s
>
> - now -
> 0.72 s
>
> Regards,
> Qingqing
>
>
>
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-26 23:13:31
Message-ID: 20108.1133046811@sss.pgh.pa.us
Lists: pgsql-hackers
Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>>> ...and for emphasis: this optimization of SeqScans is not possible with
>>> any other database system, so it's a big win for PostgreSQL.
> Why is it spin to call it a big win?
I didn't say it wasn't a big win; it was the first part of the sentence
that bothered me. Without a lot more knowledge of the internals of the
commercial DBMSes than I think is public, you can't say whether this is
possible/useful/relevant for them.
regards, tom lane
From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-27 09:59:09
Message-ID: 4389836D.8070501@familyhealth.com.au
Lists: pgsql-hackers
> Thinking more about other systems, ISTM that Oracle can do this, as can
> any MVCC based system. OTOH DB2 and SQLServer take block level read
> locks, so they can do this too, but at major loss of concurrency and
> threat of deadlock. Having said that, *any* system that chose not to do
> this would be severely sub-optimal with buffer manager contention. So
> AFAICS they all need this optimization.
MySQL/InnoDB would presumably do it also, I wonder...
Chris
From: | "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: gprof SELECT COUNT(*) results |
Date: | 2005-11-29 00:34:45 |
Message-ID: | 20051129003444.GQ78939@pervasive.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Nov 25, 2005 at 10:20:11AM -0500, Tom Lane wrote:
> Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> > I can see your computer is really slow, so my theory is that since it is
> > easier to hold a slow-running horse than a fast one, my spinlock on a
> > 2.4G modern machine should take relatively longer to take effect.
> > Just kidding.
>
> Is that "modern machine" a Xeon by any chance? We know that Xeons have
> fairly awful concurrent performance, and the long latency for bus lock
> instructions may well be the reason why. FWIW, the numbers I showed
> last night were for an HPPA machine, which I used just because I chanced
> to have CVS tip already built for profiling on it. I've since
> reproduced the test on a spiffy new dual Xeon that Red Hat just bought
> me :-) ... and I get similar numbers to yours. It'd be interesting to
> see the results from an SMP Opteron, if anyone's got one handy.
Is there still interest in this? I've got a dual Opteron running FBSD.
(What would be the profiler to use on FBSD?)
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gprof SELECT COUNT(*) results
Date: 2005-11-29 00:35:45
Message-ID: 17723.1133224545@sss.pgh.pa.us
Lists: pgsql-hackers
"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> (What would be the profiler to use on FBSD?)
gprof should work fine.
regards, tom lane