From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-04 16:39:29
Message-ID: CAFBsxsE35VLJ3hHkjJARB3QWqJ0zWeDw-jzqrfzkzMPuD_Ctvw@mail.gmail.com
Lists: pgsql-hackers
CPU caches have multiple levels, so I had an idea to use that concept in
the syscaches.
Imagine if catcache buckets were cacheline-sized. In that case, we would
store the most recently accessed hash values and pointers to catctup, in
addition to the dlist_head:
typedef struct cc_bucket
{
    uint32      hashes[4];
    catctup    *ct[4];
    dlist_head  chain;
} cc_bucket;
You can think of the bucket here as a 4-way set associative L1 and the list
walk as an inclusive L2. There might be an existing term for this scheme,
but I don't know what it is offhand.
Creating entries:
Instead of calling dlist_push_head(), we call dlist_push_tail() and then
stash the entry in the L1 so it's still quickly available if it's accessed
in the near future. This way, older entries evicted from L1 will tend to
remain toward the front of the list. Walking the list will always go from
oldest to newest, which might have better prefetch behavior (not sure).
Search:
Buckets with <= 4 entries would only need to access a single cacheline to
get the catctup pointer with the matching hash value. If we have to walk
the list to find a match, we stash it in the L1, which is faster than
calling dlist_move_head().
L1 Eviction:
When putting an entry here, we memmove() everything down in each array and
place the pointer and the hash value in the front, evicting the fourth
(possibly NULL) entry.
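The creation/search/eviction steps above can be sketched in C. This is an illustrative mock (simplified types, hypothetical helper names), not actual catcache code, and it omits the key comparison and the dlist walk that a real lookup would still need on an L1 miss:

```c
/* Sketch of the 4-way "L1" bucket described above.  The names cc_bucket and
 * catctup follow the thread; the helper functions are assumptions for
 * illustration only. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CC_L1_WAYS 4

typedef struct catctup
{
    uint32_t hash_value;
    /* ... keys and tuple data would live here ... */
} catctup;

typedef struct cc_bucket
{
    uint32_t hashes[CC_L1_WAYS];    /* recently accessed hash values */
    catctup *ct[CC_L1_WAYS];        /* corresponding entries (NULL = empty) */
    /* a real bucket would also carry the dlist_head for the full chain */
} cc_bucket;

/* Stash an entry at the front of the L1, evicting the last slot. */
static void
cc_l1_stash(cc_bucket *b, catctup *ct)
{
    memmove(&b->hashes[1], &b->hashes[0],
            (CC_L1_WAYS - 1) * sizeof(uint32_t));
    memmove(&b->ct[1], &b->ct[0], (CC_L1_WAYS - 1) * sizeof(catctup *));
    b->hashes[0] = ct->hash_value;
    b->ct[0] = ct;
}

/* Probe the L1 slots; NULL means a miss, and the caller would then walk
 * the collision dlist (and still key-compare any hash match). */
static catctup *
cc_l1_lookup(cc_bucket *b, uint32_t hash)
{
    for (int i = 0; i < CC_L1_WAYS; i++)
    {
        if (b->ct[i] != NULL && b->hashes[i] == hash)
            return b->ct[i];
    }
    return NULL;
}
```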
The buckets would now be 4 times larger on 64-bit machines, but that might
not be a problem if we just divide the initial bucket size by four as well.
If the minimum nbuckets is 2, then the smaller caches would waste a bit of
space, but it doesn't seem too bad. As far as when to rehash the bucket
array, the fill factor would logically jump from 2 to 8. The worst-case
list walk would be longer, but with a better overall memory access pattern,
it might be fine.
I think it would be straightforward to code this up -- the difficulty is
testing and accounting for random effects due to binary layout changes.
Thoughts?
--
John Naylor
EDB: http://www.enterprisedb.com
From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-04 19:44:44
Message-ID: 20210804194444.gneyztouva3fmngg@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2021-08-04 12:39:29 -0400, John Naylor wrote:
> CPU caches have multiple levels, so I had an idea to use that concept in
> the syscaches.
I do think we lose a good bit of efficiency to syscache lookups in real
workloads, so I think it's worth investing time into it.
However:
> Imagine if catcache buckets were cacheline-sized. In that case, we would
> store the most recently accessed hash values and pointers to catctup, in
> addition to the dlist_head:
>
> typedef struct cc_bucket
> {
> uint32 hashes[4];
> catctup *ct[4];
> dlist_head;
> };
I'm not convinced that the above is the right idea though. Even if the hash
matches, you're still going to need to fetch at least catctup->keys[0] from
a separate cacheline to be able to return the cache entry.
If the hashes we use are decent and we're resizing at reasonable thresholds,
we shouldn't constantly be looking up values that are later in a collision
chain - particularly because we move cache hits to the head of the bucket.
So I don't think a scheme like you described would actually result in a much
better cache hit ratio?
ISTM that something like
struct cc_bucket_1
{
uint32 hashes[3]; // 12
// 4 bytes alignment padding
Datum key0s[3]; // 24
catctup *ct[3]; // 24
// cacheline boundary
dlist_head conflicts; // 16
};
would be better for 1 key values?
It's obviously annoying to need different bucket types for different key
counts, but given how much 3 unused key Datums waste, it seems worth paying
for?
One issue with stuff like this is that aset.c won't return cacheline aligned
values...
It's possible that it'd be better to put the catctup pointers onto a
neighboring cacheline (so you still get TLB benefits, as well as making it
easy for prefetchers) and increase the number of hashes/keys that fit on one
cacheline.
If we stuffed four values into one bucket we could potentially SIMD the hash
and Datum comparisons ;)
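For what it's worth, the hash-comparison half of that SIMD idea can be sketched with baseline SSE2 intrinsics (present on every x86-64 CPU, so no x86-64-v2 requirement for this part). This is purely illustrative, not PostgreSQL code:

```c
/* Compare four bucket hashes against the probe hash in one instruction and
 * extract the index of the matching slot.  SSE2 only; __builtin_ctz assumes
 * GCC/Clang. */
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>

/* Returns the index of the first slot whose hash matches, or -1. */
static int
simd_hash_probe(const uint32_t hashes[4], uint32_t probe)
{
    __m128i haystack = _mm_loadu_si128((const __m128i *) hashes);
    __m128i needle = _mm_set1_epi32((int) probe);
    __m128i cmp = _mm_cmpeq_epi32(haystack, needle);
    int     mask = _mm_movemask_ps(_mm_castsi128_ps(cmp)); /* 1 bit per lane */

    if (mask == 0)
        return -1;
    return __builtin_ctz(mask);     /* lowest matching lane */
}
```

The caller would still need the per-type key comparison on the returned slot, which is the part the thread notes does not vectorize for types like text.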
Greetings,
Andres Freund
From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-05 16:27:49
Message-ID: CAFBsxsGkBtEVjjMLZcRQqKxUCZBauoiLBPmH3X-EDSSWd__Yug@mail.gmail.com
Lists: pgsql-hackers
On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2021-08-04 12:39:29 -0400, John Naylor wrote:
> > typedef struct cc_bucket
> > {
> > uint32 hashes[4];
> > catctup *ct[4];
> > dlist_head;
> > };
>
> I'm not convinced that the above is the right idea though. Even if the hash
> matches, you're still going to need to fetch at least catctup->keys[0] from
> a separate cacheline to be able to return the cache entry.
I see your point. It doesn't make sense to inline only part of the
information needed.
> struct cc_bucket_1
> {
> uint32 hashes[3]; // 12
> // 4 bytes alignment padding
> Datum key0s[3]; // 24
> catctup *ct[3]; // 24
> // cacheline boundary
> dlist_head conflicts; // 16
> };
>
> would be better for 1 key values?
>
> It's obviously annoying to need different bucket types for different key
> counts, but given how much 3 unused key Datums waste, it seems worth paying
> for?
Yeah, it's annoying, but it does make a big difference to keep out unused
Datums:
                cachelines
keys    3 values    4 values
  1      1 1/4       1 1/2
  2      1 5/8       2
  3      2           2 1/2
  4      2 3/8       3
Or, looking at it another way, limiting the bucket size to 2 cachelines, we
can fit:
keys    values
  1       5
  2       4
  3       3
  4       2
Although I'm guessing inlining just two values in the 4-key case wouldn't
buy much.
> If we stuffed four values into one bucket we could potentially SIMD the hash
> and Datum comparisons ;)
;-) That's an interesting future direction to consider when we support
building with x86-64-v2. It'd be pretty easy to compare a vector of hashes
and quickly get the array index for the key comparisons (ignoring for the
moment how to handle the rare case of multiple identical hashes).
However, we currently don't memcmp() the Datums and instead call an
"eqfast" function, so I don't see how that part would work in a vector
setting.
--
John Naylor
EDB: http://www.enterprisedb.com
From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-05 20:12:01
Message-ID: 20210805201201.cnc4hagkglxk4pos@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2021-08-05 12:27:49 -0400, John Naylor wrote:
> On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2021-08-04 12:39:29 -0400, John Naylor wrote:
> > > typedef struct cc_bucket
> > > {
> > > uint32 hashes[4];
> > > catctup *ct[4];
> > > dlist_head;
> > > };
> >
> > I'm not convinced that the above is the right idea though. Even if the
> > hash matches, you're still going to need to fetch at least
> > catctup->keys[0] from a separate cacheline to be able to return the
> > cache entry.
>
> I see your point. It doesn't make sense to inline only part of the
> information needed.
At least not for the unconditionally needed information.
> Although I'm guessing inlining just two values in the 4-key case wouldn't
> buy much.
Not so sure about that. I'd guess that two key comparisons take more cycles
than a cacheline fetch of the further keys (perhaps not if we had inlined
key comparisons). I.e. I'd expect out-of-order + speculative execution to
hide the latency of fetching the second cacheline for later key values.
> > If we stuffed four values into one bucket we could potentially SIMD the
> > hash and Datum comparisons ;)
>
> ;-) That's an interesting future direction to consider when we support
> building with x86-64-v2. It'd be pretty easy to compare a vector of hashes
> and quickly get the array index for the key comparisons (ignoring for the
> moment how to handle the rare case of multiple identical hashes).
> However, we currently don't memcmp() the Datums and instead call an
> "eqfast" function, so I don't see how that part would work in a vector
> setting.
It definitely couldn't work unconditionally - we have to deal with text and
oidvector comparisons, after all. But we could use it for the other types.
However, looking at the syscaches, I think it'd not often be applicable for
caches with enough columns.
I have wondered before whether we should have syscache definitions generate
code specific to each syscache definition. I do think that'd give a good bit
of performance boost. But I don't see a trivial way to get there without
notational overhead.
We could define syscaches in a header using a macro that's defined differently
in syscache.c than everywhere else. The header would declare a set of
functions for each syscache, syscache.c would define them to call an
always_inline function with the relevant constants.
Or perhaps we should move syscache definitions into the pg_*.h headers, and
generate the relevant code as part of their processing. That seems like it
could be nice from a modularity POV alone. And it could do better than the
current approach, because we could hardcode the types for columns in the
syscache definition without increasing notational overhead.
Greetings,
Andres Freund
From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-06 03:43:55
Message-ID: 95b6948cdd258c73a5f7ae4109b43706@postgrespro.ru
Lists: pgsql-hackers
Andres Freund wrote on 2021-08-05 23:12:
> Hi,
>
> On 2021-08-05 12:27:49 -0400, John Naylor wrote:
>> On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres(at)anarazel(dot)de>
>> wrote:
>> > On 2021-08-04 12:39:29 -0400, John Naylor wrote:
>> > > typedef struct cc_bucket
>> > > {
>> > > uint32 hashes[4];
>> > > catctup *ct[4];
>> > > dlist_head;
>> > > };
>> >
>> > I'm not convinced that the above is the right idea though. Even if the
>> > hash matches, you're still going to need to fetch at least
>> > catctup->keys[0] from a separate cacheline to be able to return the
>> > cache entry.
>>
>> I see your point. It doesn't make sense to inline only part of the
>> information needed.
>
> At least not for the unconditionally needed information.
>
>> Although I'm guessing inlining just two values in the 4-key case
>> wouldn't buy much.
>
> Not so sure about that. I'd guess that two key comparisons take more
> cycles than a cacheline fetch of the further keys (perhaps not if we had
> inlined key comparisons). I.e. I'd expect out-of-order + speculative
> execution to hide the latency of fetching the second cacheline for later
> key values.
>
>> > If we stuffed four values into one bucket we could potentially SIMD
>> > the hash and Datum comparisons ;)
>>
>> ;-) That's an interesting future direction to consider when we support
>> building with x86-64-v2. It'd be pretty easy to compare a vector of
>> hashes and quickly get the array index for the key comparisons (ignoring
>> for the moment how to handle the rare case of multiple identical
>> hashes). However, we currently don't memcmp() the Datums and instead
>> call an "eqfast" function, so I don't see how that part would work in a
>> vector setting.
>
> It definitely couldn't work unconditionally - we have to deal with text
> and oidvector comparisons, after all. But we could use it for the other
> types. However, looking at the syscaches, I think it'd not often be
> applicable for caches with enough columns.
>
> I have wondered before whether we should have syscache definitions
> generate code specific to each syscache definition. I do think that'd
> give a good bit of performance boost. But I don't see a trivial way to
> get there without notational overhead.
>
> We could define syscaches in a header using a macro that's defined
> differently in syscache.c than everywhere else. The header would declare
> a set of functions for each syscache, syscache.c would define them to
> call an always_inline function with the relevant constants.
>
> Or perhaps we should move syscache definitions into the pg_*.h headers,
> and generate the relevant code as part of their processing. That seems
> like it could be nice from a modularity POV alone. And it could do better
> than the current approach, because we could hardcode the types for
> columns in the syscache definition without increasing notational
> overhead.
Why not use simplehash or something like that? Open-addressing schemes
show superior cache locality.
It could be combined: use hashValue as the key in simplehash, with a dlist
for handling collisions on hashValue. 99.99% of the time there will be no
collisions on hashValue itself, so a lookup will almost always take two or
three memory accesses: one or two to find the dlist_head in simplehash by
hashValue, and then one to fetch the first element of the dlist.
And the code would remain almost the same; just the "bucket" search would
change a bit.
(And I'd recommend a lower fill factor for this simplehash: at most 0.85.)
A simplehash entry would be 24 bytes this way. If the simplehash template
supported external key/element storage, it could be shrunk to 16 bytes, and
syscache entries would not need a dlist_node. (But it doesn't support that
at the moment.)
A custom open-addressing table could get down to 8 bytes per element:
- each element is a pointer to an entry,
- a missing entry is encoded as NULL,
- with a fill factor as low as 0.66 there will be few collisions, so a
non-empty slot will usually be the matching entry, and the memory lookup
for the key comparison is amortized free.
Note that an 8-byte entry at fill factor 0.66 consumes an amortized 12.12
bytes, while a 16-byte entry at fill factor 0.99 consumes an amortized
16.16 bytes.
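As a sketch of that 8-bytes-per-element idea (names and sizing are illustrative assumptions; growth, rehashing, and deletion are omitted):

```c
/* Minimal open-addressing table where each slot is just a pointer: NULL
 * means empty, and linear probing resolves the (rare) collisions.  With a
 * low fill factor, the first non-empty slot is usually the match, so the
 * pointer chase fetches the entry we are about to use anyway. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16               /* power of two for cheap masking */

typedef struct catctup
{
    uint32_t hash_value;
    /* ... keys and tuple data ... */
} catctup;

typedef struct cc_table
{
    catctup *slots[TABLE_SIZE];     /* 8 bytes per slot on 64-bit */
} cc_table;

static void
cc_insert(cc_table *t, catctup *ct)
{
    uint32_t i = ct->hash_value & (TABLE_SIZE - 1);

    while (t->slots[i] != NULL)     /* linear probing */
        i = (i + 1) & (TABLE_SIZE - 1);
    t->slots[i] = ct;
}

static catctup *
cc_lookup(const cc_table *t, uint32_t hash)
{
    uint32_t i = hash & (TABLE_SIZE - 1);

    while (t->slots[i] != NULL)
    {
        if (t->slots[i]->hash_value == hash)
            return t->slots[i];     /* full key compare would follow */
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;
}
```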
regards,
Yura Sokolov
y(dot)sokolov(at)postgrespro(dot)ru
funny(dot)falcon(at)gmail(dot)com
From: Andres Freund <andres(at)anarazel(dot)de>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-06 03:49:45
Message-ID: 20210806034945.fbfrcqtjffzfuo7b@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:
> Why not use simplehash or something like that? Open-addressing schemes
> show superior cache locality.
I thought about that as well - but it doesn't really resolve the question of
what we want to store in-line in the hashtable and what not. We can't store
the tuples themselves in the hashtable for a myriad of reasons (need pointer
stability, they're variably sized, way too large to move around frequently).
> Well, simplehash entry will be 24 bytes this way. If simplehash template
> supports external key/element storage, then it could be shrunk to 16 bytes,
> and syscache entries will not need dlist_node. (But it doesn't at the
> moment).
I think storing keys outside of the hashtable entry defeats the purpose of the
open addressing, given that they are always checked and that our conflict
ratio should be fairly low.
Greetings,
Andres Freund
From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-06 05:20:30
Message-ID: 3638c9d9fed78e51b8aa301fa051cada@postgrespro.ru
Lists: pgsql-hackers
Andres Freund wrote on 2021-08-06 06:49:
> Hi,
>
> On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:
>> Why not use simplehash or something like that? Open-addressing schemes
>> show superior cache locality.
>
> I thought about that as well - but it doesn't really resolve the question
> of what we want to store in-line in the hashtable and what not. We can't
> store the tuples themselves in the hashtable for a myriad of reasons (need
> pointer stability, they're variably sized, way too large to move around
> frequently).
>
>> Well, simplehash entry will be 24 bytes this way. If simplehash template
>> supports external key/element storage, then it could be shrunk to 16
>> bytes, and syscache entries will not need dlist_node. (But it doesn't at
>> the moment).
>
> I think storing keys outside of the hashtable entry defeats the purpose of
> the open addressing, given that they are always checked and that our
> conflict ratio should be fairly low.
It's the opposite: if the conflict ratio were high, then keys outside the
hashtable would be expensive, since a lookup that lands on a non-matching
key would cost an extra memory access. But with a low conflict ratio we
will usually hit the matching entry on the first probe. And since we will
use the entry soon anyway, it doesn't matter whether it is pulled into the
CPU L1 cache during the lookup or during actual use.
regards,
Yura Sokolov
From: "Andres Freund" <andres(at)anarazel(dot)de>
To: "Yura Sokolov" <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: "John Naylor" <john(dot)naylor(at)enterprisedb(dot)com>, "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-06 06:11:53
Message-ID: c490b87d-6b59-4f44-a3e7-cb660e2940cb@www.fastmail.com
Lists: pgsql-hackers
Hi,
On Thu, Aug 5, 2021, at 22:20, Yura Sokolov wrote:
> Andres Freund wrote on 2021-08-06 06:49:
> > Hi,
> >
> > On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:
> >> Why not use simplehash or something like that? Open-addressing schemes
> >> show superior cache locality.
> >
> > I thought about that as well - but it doesn't really resolve the
> > question of what we want to store in-line in the hashtable and what
> > not. We can't store the tuples themselves in the hashtable for a myriad
> > of reasons (need pointer stability, they're variably sized, way too
> > large to move around frequently).
> >
> >> Well, simplehash entry will be 24 bytes this way. If simplehash
> >> template supports external key/element storage, then it could be
> >> shrunk to 16 bytes, and syscache entries will not need dlist_node.
> >> (But it doesn't at the moment).
> >
> > I think storing keys outside of the hashtable entry defeats the purpose
> > of the open addressing, given that they are always checked and that our
> > conflict ratio should be fairly low.
>
> It's the opposite: if the conflict ratio were high, then keys outside the
> hashtable would be expensive, since a lookup that lands on a non-matching
> key would cost an extra memory access. But with a low conflict ratio we
> will usually hit the matching entry on the first probe. And since we will
> use the entry soon anyway, it doesn't matter whether it is pulled into the
> CPU L1 cache during the lookup or during actual use.
Often enough it does matter, because there will be earlier dependencies on whether a lookup is a cache hit/miss than on the content of the cached tuple.
Regards,
Andres
From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-06 14:48:30
Message-ID: CAFBsxsEtCJkO8vOjfBzfHdsG0TDGjt5CGRE3RG_1+0WOpyw3sg@mail.gmail.com
Lists: pgsql-hackers
On Thu, Aug 5, 2021 at 4:12 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I have wondered before whether we should have syscache definitions generate
> code specific to each syscache definition. I do think that'd give a good
> bit of performance boost. But I don't see a trivial way to get there
> without notational overhead.
>
> We could define syscaches in a header using a macro that's defined
> differently in syscache.c than everywhere else. The header would declare a
> set of functions for each syscache, syscache.c would define them to call an
> always_inline function with the relevant constants.
>
> Or perhaps we should move syscache definitions into the pg_*.h headers, and
> generate the relevant code as part of their processing. That seems like it
> could be nice from a modularity POV alone. And it could do better than the
> current approach, because we could hardcode the types for columns in the
> syscache definition without increasing notational overhead.
I had code lying around to generate the info needed for initialization,
but I apparently deleted it and never emailed it. :-(
If we went that far, I wonder if we could specialize the cc_bucket for the
actual types, rather than just number of Datums, which would further save
on space. More complex, though.
Also, if comparisons got cheaper, maybe we could get away with not storing
the full hash in the bucket in favor of a tag of the 16 highest bits. (The
catctup would still need the full hash for invalidation, at least).
Another idea I came across while researching in-memory key-value stores is
"bulk chaining" -- see [1] for an example. In that case, our example
2-cacheline bucket would have a dlist_head pointing not to the catctups,
but to another bucket. So that throws away my L1/L2 analogy and uses a
chain of buckets, each of which contain multiple sets of
hashes/keys/pointers. That's closer to a normal collision chain, but with
better cache locality. If we did that, I imagine the catctup could dispense
with storing its Datum array and its dlist_node entirely.
[1]
https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-lim.pdf
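The bulk-chaining layout described above might be sketched like this; the struct and function names are made up for illustration, and the full key data that each slot would also carry is elided:

```c
/* Rough sketch of a bulk-chained bucket: each bucket holds several
 * hash/pointer sets, and overflow spills into another whole bucket rather
 * than a per-entry linked list, so a chain walk touches one cacheline-sized
 * block of slots at a time. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS_PER_BUCKET 4

typedef struct catctup
{
    uint32_t hash_value;
} catctup;

typedef struct cc_chained_bucket
{
    uint32_t hashes[SLOTS_PER_BUCKET];
    catctup *ct[SLOTS_PER_BUCKET];           /* NULL = empty slot */
    struct cc_chained_bucket *next;          /* overflow bucket, or NULL */
} cc_chained_bucket;

static catctup *
cc_chain_lookup(const cc_chained_bucket *b, uint32_t hash)
{
    for (; b != NULL; b = b->next)           /* walk buckets, not entries */
    {
        for (int i = 0; i < SLOTS_PER_BUCKET; i++)
        {
            if (b->ct[i] != NULL && b->hashes[i] == hash)
                return b->ct[i];
        }
    }
    return NULL;
}
```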
--
John Naylor
EDB: http://www.enterprisedb.com
From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-18 18:19:21
Message-ID: CAFBsxsEiaaeqtRezs+5gwZNTbJhE+O3ZLVfp1uOed2-ui4k9Lw@mail.gmail.com
Lists: pgsql-hackers
OK, here is a hackish WIP to see if we get anywhere with the L1 concept:
0001 is extracted from a patch from Horiguchi-san to remove the "dead" flag
0002 adds the large bucket, but doesn't use it for anything
0003 uses the new bucket for the L1 cache
0004 changes when to rehash
0005 is Horiguchi-san's v7 benchmark
0006 removes expiration stuff from the benchmark and prevents alloc errors
with my patches while running the benchmark
This doesn't align the bucket array to a cacheline boundary, nor does it
change the initial number of buckets.
make -C contrib/catcachebench/ install
psql -c 'create extension catcachebench'
# This is also from Horiguchi-san
perl gen_tbl.pl | psql > /dev/null
# warmup
psql -c 'select catcachebench(0)'
# measure median of 3
master:
psql -c 'select catcachebench(1)'
catcachebench
---------------
6084.992169
patch:
./inst/bin/psql -c 'select catcachebench(1)'
catcachebench
---------------
5508.072532
That's decent, but not exactly stellar. I get a huge slowdown in
catcachebench(2), however, so I'll have to dig into why before going any
further.
Some time I'll also try the function specialization Andres mentioned and
see how big of a speedup that gives.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v1-0002-Specialize-bucket-types-by-number-of-keys.patch | application/octet-stream | 7.0 KB |
v1-0001-Remove-dead-flag-from-CatCTup.patch | application/octet-stream | 6.2 KB |
v1-0004-Rationalize-rehashing-threshold.patch | application/octet-stream | 1.7 KB |
v1-0005-catcachebench.patch | application/octet-stream | 11.9 KB |
v1-0003-Use-hash-bucket-as-a-level-one-cache-to-avoid-wal.patch | application/octet-stream | 5.3 KB |
v1-0006-Some-adjustments-to-catcachebench-and-catcache.c-.patch | application/octet-stream | 4.4 KB |
gen_tbl.pl | text/x-perl-script | 301 bytes |
From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-19 23:10:37
Message-ID: CAFBsxsEj=9Og1-w81u2zRhh-eJTPLE-N83eDvbcSUF-SqocFgQ@mail.gmail.com
Lists: pgsql-hackers
On Thu, Aug 5, 2021 at 4:12 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I have wondered before whether we should have syscache definitions generate
> code specific to each syscache definition. I do think that'd give a good
> bit of performance boost. But I don't see a trivial way to get there
> without notational overhead.
I've made a small step in this direction in the attached. It uses a
template approach to generate type-specific SearchCatCache* functions, for
now only the 4-key ones. Since there's only a few of those, it's manageable
to invoke the template and change the call sites by hand. To get this to
scale up to all syscaches would require some scripting, but this much is
enough to get feedback on the approach. One possible concern in starting
down this path is that hundreds of call sites would have to be changed. (If
there's a good way around that, it hasn't occurred to me.) It might also be
possible to replace the per-key invocations with a single memcpy() for most
types, but that's a bit more work.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Specialize-SearchSysCache4-by-the-search-key-data.patch | application/octet-stream | 14.9 KB |
From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-27 19:42:04
Message-ID: 20210827194204.iun2hts7trrqynir@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2021-08-19 19:10:37 -0400, John Naylor wrote:
> I've made a small step in this direction in the attached. It uses a
> template approach to generate type-specific SearchCatCache* functions, for
> now only the 4-key ones. Since there's only a few of those, it's manageable
> to invoke the template and change the call sites by hand. To get this to
> scale up to all syscaches would require some scripting, but this much is
> enough to get feedback on the approach. One possible concern in starting
> down this path is that hundreds of call sites would have to be changed. (If
> there's a good way around that, it hasn't occurred to me.)
I was thinking of avoiding the need for that via a macro. I.e. have
SearchSysCache4(AMOPSTRATEGY, ...) be mapped to
SearchSysCache4AMOPSTRATEGY(...). That way callers wouldn't need to care, and
we could continue to evolve the internals without continually doing
large-scale code changes?
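The macro mapping described here could look something like the following sketch; the cache names and the specialized functions are dummies for illustration, not the real syscache API:

```c
/* Token-paste the cache name onto the function name so that existing
 * SearchSysCache1(FOO, key) call sites resolve to a specialized function
 * at compile time, with no caller changes needed. */
#include <assert.h>

typedef long HeapTuple;     /* stand-in for the real type */

static HeapTuple
SearchSysCache1AGGFNOID(long key1)
{
    return key1 + 1000;     /* pretend specialized lookup */
}

static HeapTuple
SearchSysCache1PROCOID(long key1)
{
    return key1 + 2000;     /* pretend specialized lookup */
}

/* Callers keep writing SearchSysCache1(AGGFNOID, key); the preprocessor
 * routes each call to its specialized implementation. */
#define SearchSysCache1(cacheId, key1) SearchSysCache1##cacheId(key1)
```

The internals behind each specialized function can then evolve freely, since the call sites never name them directly.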
> diff --git a/src/include/utils/catcache_search_template.h b/src/include/utils/catcache_search_template.h
> new file mode 100644
> index 0000000000..6f5dc2573c
> --- /dev/null
> +++ b/src/include/utils/catcache_search_template.h
> @@ -0,0 +1,176 @@
> +/*-------------------------------------------------------------------------
> + * catcache_search_template.h
> + *
> + * A template for type-specialized SearchCatCache functions
> + *
> + * Usage notes:
> + *
> + * To generate functions specialized for a set of catcache keys,
> + * the following parameter macros should be #define'd before this
> + * file is included.
> + *
> + * - CC_SEARCH - the name of the search function to be generated
> + * - CC_NKEYS - the number of search keys for the tuple
> + * - FASTEQFUNCx (x in 1,2,3,4) - type-specific equality function(s)
> + * - HASHFUNCx (x in 1,2,3,4) - type-specific hash function(s)
It's not clear to me we need such a complicated interface here. Can't we just
have a pg_attribute_always_inline function with a bunch more parameters?
Greetings,
Andres Freund
From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-31 19:06:32
Message-ID: CAFBsxsF7+s=_AeV8X8v3PWO2509kJurXs9z8wh=A0_sX6tmEAA@mail.gmail.com
Lists: pgsql-hackers
On Fri, Aug 27, 2021 at 3:42 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2021-08-19 19:10:37 -0400, John Naylor wrote:
> > I've made a small step in this direction in the attached. It uses a
> > template approach to generate type-specific SearchCatCache* functions,
> > for now only the 4-key ones. Since there's only a few of those, it's
> > manageable to invoke the template and change the call sites by hand. To
> > get this to scale up to all syscaches would require some scripting, but
> > this much is enough to get feedback on the approach. One possible
> > concern in starting down this path is that hundreds of call sites would
> > have to be changed. (If there's a good way around that, it hasn't
> > occurred to me.)
>
> I was thinking of avoiding the need for that via a macro. I.e. have
> SearchSysCache4(AMOPSTRATEGY, ...) be mapped to
> SearchSysCache4AMOPSTRATEGY(...). That way callers wouldn't need to care,
> and we could continue to evolve the internals without continually doing
> large-scale code changes?
Makes sense.
> > + * To generate functions specialized for a set of catcache keys,
> > + * the following parameter macros should be #define'd before this
> > + * file is included.
> > + *
> > + * - CC_SEARCH - the name of the search function to be generated
> > + * - CC_NKEYS - the number of search keys for the tuple
> > + * - FASTEQFUNCx (x in 1,2,3,4) - type-specific equality function(s)
> > + * - HASHFUNCx (x in 1,2,3,4) - type-specific hash function(s)
>
> It's not clear to me we need such a complicated interface here. Can't we
> just have a pg_attribute_always_inline function with a bunch more
> parameters?
Were you thinking in terms of passing the type oid in parameters, like this?
HeapTuple
SearchCatCache1(CatCache *cache, Datum v1, Oid t1)
{
return SearchCatCacheInternal(cache, 1, v1, t1, 0, 0, 0, 0, 0, 0);
}
And then CatalogCacheComputeHashValue() and CatalogCacheCompareTuple()
would likewise take types as parameters, which they could use to pick the
right functions at compile time?
--
John Naylor
EDB: http://www.enterprisedb.com
From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-31 20:59:06
Message-ID: 20210831205906.4wk3s4lvgzkdaqpi@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2021-08-31 15:06:32 -0400, John Naylor wrote:
> Were you thinking in terms of passing the type oid in parameters, like this?
>
> HeapTuple
> SearchCatCache1(CatCache *cache, Datum v1, Oid t1)
> {
> return SearchCatCacheInternal(cache, 1, v1, t1, 0, 0, 0, 0, 0, 0);
> }
>
> And then CatalogCacheComputeHashValue() and CatalogCacheCompareTuple()
> would likewise take types as parameters, which they could use to pick the
> right functions at compile time?
I was thinking that the script could emit something like
static const cachedesc syscachedesc_AGGFNOID = {
.reloid = AggregateRelationId,
.indoid = AggregateFnoidIndexId,
.nkeys = 1,
.keys = {
{
.varno = Anum_pg_aggregate_aggfnoid,
.type = Oid,
.comparator = ...
}
}
};
static CatCache syscache_AGGFNOID;
HeapTuple
SearchSysCacheAGGFNOID(Datum key1)
{
return SearchSysCacheInternal(&syscachedesc_AGGFNOID, syscache_AGGFNOID, key1);
}
and SearchSysCacheInternal would have a pg_attribute_always_inline()
fastpath. That fastpath would basically be SearchCatCacheInternal(), with an
extra const cachedesc * parameter. Then the compiler should be able to
generate direct function calls to the hash functions in
CatalogCacheComputeHashValue() and direct calls to the equal function in
CatalogCacheCompareTuple().
One difficulty would be that I don't see how to make this work with syscache.c
being compiled separately from catcache.c - but there's probably no need
to. The script could generate a syscache_impl.h that catcache.c includes.
Greetings,
Andres Freund