Quick Links

Re: define pg_structiszero(addr, s, r)

Lists:	pgsql-hackers

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	define pg_structiszero(addr, s, r)
Date:	2024-09-18 04:16:12
Message-ID:	ZupUDDyf1hHI4ibn@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi hackers,

There is some places where we check that a struct is full of zeroes:

pgstat_report_bgwriter()
pgstat_report_checkpointer()
pgstat_relation_flush_cb()

Indeed that's the way we check if there is pending statistics to flush/report.

The current code is like (taking pgstat_relation_flush_cb() as an example):

"
static const PgStat_TableCounts all_zeroes;
.
.
if (memcmp(&lstats->counts, &all_zeroes,
sizeof(PgStat_TableCounts)) == 0)
.
.
"

The static declaration is not "really" related to the purpose of the function
it is declared in. It's there "only" to initialize a memory area with zeroes
and to use it in the memcmp.

I think it would make sense to "hide" all of this in a new macro, so please find
attached a patch proposal doing so (Andres suggested something along those lines
in [1] IIUC).

The macro is created in pgstat_internal.h as it looks like that "only" the
statistics related code would benefit of it currently (could be moved to other
header file later on if needed).

[1]: /message-id/20230105002733.ealhzubjaiqis6ua%40awork3.anarazel.de

Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v1-0001-define-pg_structiszero-addr-s-r.patch	text/x-diff	4.4 KB

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-09-18 06:07:15
Message-ID:	ZupuE74kkcC_LhGy@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Sep 18, 2024 at 04:16:12AM +0000, Bertrand Drouvot wrote:
> The macro is created in pgstat_internal.h as it looks like that "only" the
> statistics related code would benefit of it currently (could be moved to other
> header file later on if needed).

I'm OK to add a helper macro in pgstat_internal.h as this is a pattern
used only for some stats kinds (the other one I'm aware of is the
allzero check for pages around bufmgr.c), cleaning up all these static
declarations to make the memcpy() calls cheaper. That can also be
useful for anybody doing a custom pgstats kind, fixed or
variable-numbered.

#define pg_structiszero(addr, s, r) \

Locating that at the top of pgstat_internal.h seems a bit out of order
to me. Perhaps it would be better to move it closer to the inline
functions?

Also, is this the best name to use here? Right, this is something
that may be quite generic. However, if we limit its scope in the
stats, perhaps this should be named pgstat_entry_all_zeros() or
something like that?
--
Michael

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-09-18 07:54:20
Message-ID:	ZuqHLCdZXtEsbyb/@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Wed, Sep 18, 2024 at 03:07:15PM +0900, Michael Paquier wrote:
> On Wed, Sep 18, 2024 at 04:16:12AM +0000, Bertrand Drouvot wrote:
> > The macro is created in pgstat_internal.h as it looks like that "only" the
> > statistics related code would benefit of it currently (could be moved to other
> > header file later on if needed).
>
> I'm OK to add a helper macro in pgstat_internal.h as this is a pattern
> used only for some stats kinds (the other one I'm aware of is the
> allzero check for pages around bufmgr.c), cleaning up all these static
> declarations to make the memcpy() calls cheaper. That can also be
> useful for anybody doing a custom pgstats kind, fixed or
> variable-numbered.

Thanks for looking at it!

>
> #define pg_structiszero(addr, s, r) \
>
> Locating that at the top of pgstat_internal.h seems a bit out of order
> to me. Perhaps it would be better to move it closer to the inline
> functions?

Makes sense, done that way in v2 attached.

>
> Also, is this the best name to use here? Right, this is something
> that may be quite generic. However, if we limit its scope in the
> stats, perhaps this should be named pgstat_entry_all_zeros() or
> something like that?

Agree, we could still rename it later on if there is a need outside of
the statistics code area. Done in v2.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v2-0001-define-pgstat_entry_all_zeros-addr-s-r.patch	text/x-diff	4.5 KB

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-09-18 08:03:21
Message-ID:	3023d29e-0fe7-4bec-9028-7f3c0102e437@eisentraut.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 18.09.24 06:16, Bertrand Drouvot wrote:
> +#define pg_structiszero(addr, s, r) \
> + do { \
> + /* We assume this initializes to zeroes */ \
> + static const s all_zeroes; \
> + r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0); \
> + } while (0)

This assumption is kind of the problem, isn't it? Because, you can't
assume that. And the existing code is arguably kind of wrong. But
moreover, this macro also assumes that the "addr" argument has no random
padding bits.

In the existing code, you can maybe make a local analysis that the code
is working correctly, although I'm not actually sure. But if you are
repackaging this as a general macro under a general-sounding name, then
the requirements should be more stringent.

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-09-18 18:57:39
Message-ID:	ZusioyiXTyKpOP/2@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
> On 18.09.24 06:16, Bertrand Drouvot wrote:
> > +#define pg_structiszero(addr, s, r) \
> > + do { \
> > + /* We assume this initializes to zeroes */ \
> > + static const s all_zeroes; \
> > + r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0); \
> > + } while (0)
>

Thanks for the feedback.

> This assumption is kind of the problem, isn't it? Because, you can't assume
> that. And the existing code is arguably kind of wrong. But moreover, this
> macro also assumes that the "addr" argument has no random padding bits.
>
> In the existing code, you can maybe make a local analysis that the code is
> working correctly, although I'm not actually sure.

I think it is but will give it a closer look.

> But if you are
> repackaging this as a general macro under a general-sounding name, then the
> requirements should be more stringent.

Agree. That said in v2 ([1]), it has been renamed to pgstat_entry_all_zeros().

I think that I will:

1/ take a closer look regarding the existing assumption
2/ if 1/ outcome is fine, then add more detailed comments around
pgstat_entry_all_zeros() to make sure it's not used outside of the existing
context

Sounds good to you?

[1]: /message-id/ZuqHLCdZXtEsbyb/%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-28 14:32:51
Message-ID:	f73f4287-aadd-4914-8052-76167fb4934c@iki.fi
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 18/09/2024 21:57, Bertrand Drouvot wrote:
> On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
>> On 18.09.24 06:16, Bertrand Drouvot wrote:
>>> +#define pg_structiszero(addr, s, r) \
>>> + do { \
>>> + /* We assume this initializes to zeroes */ \
>>> + static const s all_zeroes; \
>>> + r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0); \
>>> + } while (0)

Not new with this patch, but do we guarantee padding bytes to be zeros?

How about this instead:

static inline bool
pg_is_all_zeros(const char *p, size_t len)
{
for (size_t i = 0; i < len; i++)
{
if (p[i] != 0)
return false;
}
return true;
}

Is there's a de facto standard name for that function? I was surprised
that I couldn't find one with a quick google search. That seems like the
kind of small utility function that every C program needs.

How performance sensitive is this? If it's not, then the above seems
like the most straightforward way to do this, which is good. If it is
performance sensitive, it's still good, because the compiler can
optimize that well: https://godbolt.org/z/x9hPWjheq.

--
Heikki Linnakangas
Neon (https://neon.tech)

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-28 15:02:48
Message-ID:	CAEudQAq_D5edSnAy2y0Viuk9Y53LTR=FOu8240SSDWmSm_ToAA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em seg., 28 de out. de 2024 às 11:33, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
escreveu:

> On 18/09/2024 21:57, Bertrand Drouvot wrote:
> > On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
> >> On 18.09.24 06:16, Bertrand Drouvot wrote:
> >>> +#define pg_structiszero(addr, s, r)
> \
> >>> + do {
> \
> >>> + /* We assume this initializes to zeroes */
> \
> >>> + static const s all_zeroes;
> \
> >>> + r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0);
> \
> >>> + } while (0)
>
> Not new with this patch, but do we guarantee padding bytes to be zeros?
>
> How about this instead:
>
> static inline bool
> pg_is_all_zeros(const char *p, size_t len)
> {
> for (size_t i = 0; i < len; i++)
> {
> if (p[i] != 0)
> return false;
> }
> return true;
> }
>
> It seems to me that this way is more optimized.

static inline bool
is_all_zeros(const char *p, size_t len)
{
for (size_t i = len; i >= 0; i--)
{
if (p[i] != 0)
return false;
}
return true;
}

main:
sub rsp, 24
lea rdx, [rsp + 12]
lea rcx, [rsp + 16]
lea rdi, [rip + .L.str]
lea rsi, [rsp + 8]
xor eax, eax
call __isoc99_scanf(at)PLT
lea rdi, [rip + .L.str.1]
xor esi, esi
xor eax, eax
call printf(at)PLT
xor eax, eax
add rsp, 24
ret

best regards,
Ranier Vilela

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-28 15:08:31
Message-ID:	3969083.1730128111@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> writes:
> It seems to me that [reversing the loop direction] is more optimized.

That's far from clear: you're ignoring the possibility that memory
access logic is better optimized for forward scanning than reverse
scanning. I'd stick with the forward scan without some extremely
detailed testing.

regards, tom lane

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-28 16:29:08
Message-ID:	CAEudQAo7BpBqjP5oTGLNycux+Yi3qNa3WaSLkSVrvN9GbnhQfw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em seg., 28 de out. de 2024 às 12:08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> escreveu:

> Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> writes:
> > It seems to me that [reversing the loop direction] is more optimized.
>
> That's far from clear: you're ignoring the possibility that memory
> access logic is better optimized for forward scanning than reverse
> scanning. I'd stick with the forward scan without some extremely
> detailed testing.
>
I don't disagree.
After posting, I was wondering if the first possible is not zero, it should
probably be at the beginning and not at the end.

best regards,
Ranier Vilela

From:	Peter Smith <smithpb2250(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-28 22:36:34
Message-ID:	CAHut+PvHmWiPyqiDRnD7FYsc4QskXpKEZAH3Z8Ahd_GSnRXWrw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Sep 19, 2024 at 4:57 AM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
> > On 18.09.24 06:16, Bertrand Drouvot wrote:
> > > +#define pg_structiszero(addr, s, r) \
> > > + do { \
> > > + /* We assume this initializes to zeroes */ \
> > > + static const s all_zeroes; \
> > > + r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0); \
> > > + } while (0)
> >
>
> Thanks for the feedback.
>
> > This assumption is kind of the problem, isn't it? Because, you can't assume
> > that. And the existing code is arguably kind of wrong. But moreover, this
> > macro also assumes that the "addr" argument has no random padding bits.
> >
> > In the existing code, you can maybe make a local analysis that the code is
> > working correctly, although I'm not actually sure.
>
> I think it is but will give it a closer look.
>
> > But if you are
> > repackaging this as a general macro under a general-sounding name, then the
> > requirements should be more stringent.
>
> Agree. That said in v2 ([1]), it has been renamed to pgstat_entry_all_zeros().
>
> I think that I will:
>
> 1/ take a closer look regarding the existing assumption
> 2/ if 1/ outcome is fine, then add more detailed comments around
> pgstat_entry_all_zeros() to make sure it's not used outside of the existing
> context
>
> Sounds good to you?
>
> [1]: /message-id/ZuqHLCdZXtEsbyb/%40ip-10-97-1-34.eu-west-3.compute.internal
>
> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>

Hi, if this is performance critical then FWIW my understanding is that
memcmp can outperform simple loop checking, and my hacky testing
seemed to confirm that.

See https://godbolt.org/z/GWY1ob9bE

static inline bool
is_all_zeros2(const char *p, size_t len)
{
static size_t nz = 0;
static const char *z = NULL;

if (nz < len)
{
if (z) free((void *)z);
nz = len;
z = (const char *)calloc(1, nz);
}

return memcmp(p, z, len) == 0;
}

Executor x86-64 gcc 14.2 (C, Editor #1)
Program stdout

Iterate 1000000 times...
check zeros using loop -- elapsed=0.041196s
check zeros using memcmp -- elapsed=0.016407s

======
Kind Regards,
Peter Smith.
Fujitsu Australia

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-29 07:54:24
Message-ID:	ZyCUsIu/xKFz0djS@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Mon, Oct 28, 2024 at 04:32:51PM +0200, Heikki Linnakangas wrote:
> On 18/09/2024 21:57, Bertrand Drouvot wrote:
> > On Wed, Sep 18, 2024 at 10:03:21AM +0200, Peter Eisentraut wrote:
> > > On 18.09.24 06:16, Bertrand Drouvot wrote:
> > > > +#define pg_structiszero(addr, s, r) \
> > > > + do { \
> > > > + /* We assume this initializes to zeroes */ \
> > > > + static const s all_zeroes; \
> > > > + r = (memcmp(addr, &all_zeroes, sizeof(all_zeroes)) == 0); \
> > > > + } while (0)
>
> Not new with this patch

Thanks for looking at it!

> but do we guarantee padding bytes to be zeros?

I can see padding in one of the 3 structs of interest: PgStat_BgWriterStats and
PgStat_CheckpointerStats have no padding but PgStat_TableCounts has:

(gdb) ptype /o struct PgStat_TableCounts
/* offset | size */ type = struct PgStat_TableCounts {
/* 0 | 8 */ PgStat_Counter numscans;
/* 8 | 8 */ PgStat_Counter tuples_returned;
/* 16 | 8 */ PgStat_Counter tuples_fetched;
/* 24 | 8 */ PgStat_Counter tuples_inserted;
/* 32 | 8 */ PgStat_Counter tuples_updated;
/* 40 | 8 */ PgStat_Counter tuples_deleted;
/* 48 | 8 */ PgStat_Counter tuples_hot_updated;
/* 56 | 8 */ PgStat_Counter tuples_newpage_updated;
/* 64 | 1 */ _Bool truncdropped;
/* XXX 7-byte hole */
/* 72 | 8 */ PgStat_Counter delta_live_tuples;
/* 80 | 8 */ PgStat_Counter delta_dead_tuples;
/* 88 | 8 */ PgStat_Counter changed_tuples;
/* 96 | 8 */ PgStat_Counter blocks_fetched;
/* 104 | 8 */ PgStat_Counter blocks_hit;

/* total size (bytes): 112 */
}

According to my testing, I can see that "static const PgStat_TableCounts all_zeroes"
all_zeroes is fully made of zeros (padding included). OTOH I would not bet on the
fact that's guaranteed to be the case 100% of the time.

But even, if that is not the case I don't think that it is a big issue: the
check is in pgstat_relation_flush_cb() to decide if we want to avoid unnecessarily
modifying the stats entry. If padding would contain non zeros then we would
"just" unnecessarily modify the stats entry (adding "zeros" to the shared stats).

> How about this instead:
>
> static inline bool
> pg_is_all_zeros(const char *p, size_t len)
> {
> for (size_t i = 0; i < len; i++)
> {
> if (p[i] != 0)
> return false;
> }
> return true;
> }
>

Yeah, that sounds good to me. It's more generic than the initial proposal that
was taking care of a struct memory area. Also, the same "logic" is already
in PageIsVerifiedExtended().

V3 attached is using the above proposal and also makes the change in
PageIsVerifiedExtended(). Then, the new inline function has been placed in
utils/memutils.h (not sure that's the best place though).

> Is there's a de facto standard name for that function? I was surprised that
> I couldn't find one with a quick google search.

Same here. I was just able to find "memiszero" in [0]. So the naming proposal
in v3 is pg_mem_is_all_zeros().

> That seems like the kind of
> small utility function that every C program needs.

Yeah.

> How performance sensitive is this?

I don't think that's very sensitive. I think it's "cheap" as compared to what lead
to those code paths (stats increments).

> If it's not, then the above seems like
> the most straightforward way to do this, which is good. If it is performance
> sensitive, it's still good, because the compiler can optimize that well:
> https://godbolt.org/z/x9hPWjheq.

Yeah, I also think that's fine. Peter Smith did some testing in [1] comparing
memcmp and simple loop checking (thanks Peter for the testing!):

"
Iterate 1000000 times...
check zeros using loop -- elapsed=0.041196s
check zeros using memcmp -- elapsed=0.016407s
"

So, in this test, the loop is 0.024789s longer means 0.024789s/1000000=24 Nanosecond
slower per comparison (If my math is correct). I don't see this as a red flag and
still go with the loop proposal as this is the one already used in
PageIsVerifiedExtended().

[0]: https://in3.readthedocs.io/en/develop/api-c.html
[1]: /message-id/CAHut%2BPvHmWiPyqiDRnD7FYsc4QskXpKEZAH3Z8Ahd_GSnRXWrw%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v3-0001-New-pg_mem_is_all_zeros-p-len-inline-function.patch	text/x-diff	5.5 KB

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-29 09:23:37
Message-ID:	c9ade148-ff76-45c5-9760-1a45fbd42b00@eisentraut.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 29.10.24 08:54, Bertrand Drouvot wrote:
> +static inline bool
> +pg_mem_is_all_zeros(const char *p, size_t len)

This should be a void * pointer please.

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>
Cc:	Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-29 09:39:03
Message-ID:	2ed3e31e-55ee-41f6-8841-b096a798f504@iki.fi
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 29/10/2024 09:54, Bertrand Drouvot wrote:
>> https://godbolt.org/z/x9hPWjheq.
>
> Yeah, I also think that's fine. Peter Smith did some testing in [1] comparing
> memcmp and simple loop checking (thanks Peter for the testing!):
>
> "
> Iterate 1000000 times...
> check zeros using loop -- elapsed=0.041196s
> check zeros using memcmp -- elapsed=0.016407s
> "
>
> So, in this test, the loop is 0.024789s longer means 0.024789s/1000000=24 Nanosecond
> slower per comparison (If my math is correct).

I believe that test program is bogus. Look at the assembly code; the
compiler optimized away the loops.

--
Heikki Linnakangas
Neon (https://neon.tech)

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-29 13:58:15
Message-ID:	ZyDp946kZZ7OhaL5@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

hi,

On Tue, Oct 29, 2024 at 10:23:37AM +0100, Peter Eisentraut wrote:
> On 29.10.24 08:54, Bertrand Drouvot wrote:
> > +static inline bool
> > +pg_mem_is_all_zeros(const char *p, size_t len)
>
> This should be a void * pointer please.

Thanks for looking at it!
Yeah better, done in v4 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v4-0001-New-pg_mem_is_all_zeros-p-len-inline-function.patch	text/x-diff	5.5 KB

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-29 14:03:02
Message-ID:	ZyDrFgYfRwRXWSmh@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Oct 29, 2024 at 11:39:03AM +0200, Heikki Linnakangas wrote:
> On 29/10/2024 09:54, Bertrand Drouvot wrote:
> > > https://godbolt.org/z/x9hPWjheq.
> >
> > Yeah, I also think that's fine. Peter Smith did some testing in [1] comparing
> > memcmp and simple loop checking (thanks Peter for the testing!):
> >
> > "
> > Iterate 1000000 times...
> > check zeros using loop -- elapsed=0.041196s
> > check zeros using memcmp -- elapsed=0.016407s
> > "
> >
> > So, in this test, the loop is 0.024789s longer means 0.024789s/1000000=24 Nanosecond
> > slower per comparison (If my math is correct).
>
> I believe that test program is bogus. Look at the assembly code; the
> compiler optimized away the loops.

Oh right. It looks like that moving the "scanf" within each loop "helps" and
that both give pretty comparable results.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-30 08:09:54
Message-ID:	36371fab-f90c-427c-9c14-210e93e9cd12@eisentraut.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 29.10.24 14:58, Bertrand Drouvot wrote:
> hi,
>
> On Tue, Oct 29, 2024 at 10:23:37AM +0100, Peter Eisentraut wrote:
>> On 29.10.24 08:54, Bertrand Drouvot wrote:
>>> +static inline bool
>>> +pg_mem_is_all_zeros(const char *p, size_t len)
>>
>> This should be a void * pointer please.
>
> Thanks for looking at it!
> Yeah better, done in v4 attached.

Sorry for the confusion. I didn't mean to discourage you from the const
qualifier. That should still be there.

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-30 09:23:41
Message-ID:	ZyH7HZfDh7DYNvKv@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Wed, Oct 30, 2024 at 09:09:54AM +0100, Peter Eisentraut wrote:
> On 29.10.24 14:58, Bertrand Drouvot wrote:
> > hi,
> >
> > On Tue, Oct 29, 2024 at 10:23:37AM +0100, Peter Eisentraut wrote:
> > > On 29.10.24 08:54, Bertrand Drouvot wrote:
> > > > +static inline bool
> > > > +pg_mem_is_all_zeros(const char *p, size_t len)
> > >
> > > This should be a void * pointer please.
> >
> > Thanks for looking at it!
> > Yeah better, done in v4 attached.
>
> Sorry for the confusion. I didn't mean to discourage you from the const
> qualifier. That should still be there.

doh! Of course we want it to be read only. No problem, I should have put more
thoughts on it. Done in v5 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v5-0001-New-pg_mem_is_all_zeros-p-len-inline-function.patch	text/x-diff	5.5 KB

From:	Peter Smith <smithpb2250(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-30 20:55:45
Message-ID:	CAHut+Pu9Un427eoPuEetNhk2PZe1-sgSxEuG_2OuTwtxC3cBOw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

+/*
+ * Test if a memory region starting at p and of size len is full of zeroes.
+ */
+static inline bool
+pg_mem_is_all_zeros(const void *ptr, size_t len)

The function comment should say 'ptr' instead of 'p', right?

======
Kind Regards,
Peter Smith.
Fujitsu Australia

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Peter Smith <smithpb2250(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-31 00:59:35
Message-ID:	ZyLWd2jR8FO7j5UL@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Oct 31, 2024 at 07:55:45AM +1100, Peter Smith wrote:
> +/*
> + * Test if a memory region starting at p and of size len is full of zeroes.
> + */
> +static inline bool
> +pg_mem_is_all_zeros(const void *ptr, size_t len)
>
> The function comment should say 'ptr' instead of 'p', right?

Yes.

+static inline bool
+pg_mem_is_all_zeros(const void *ptr, size_t len)

While we're talking about wordsmithing things, I would not choose
"mem" for this routine, just stick to "memory". There is not much in
the code that does memory-specific things like what you are proposing
here. Still, this would be more consistent with the macros for memory
barriers at least. Hence, "pg_memory_is_all_zeros()" makes more
sense?

The location of memutils.h is sensible.

+ if (pg_mem_is_all_zeros(pagebytes , (BLCKSZ / sizeof(size_t))))

Extra space not required after pagebytes.
--
Michael

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-31 05:52:49
Message-ID:	ZyMbMZMYPRIHGFzb@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Thu, Oct 31, 2024 at 09:59:35AM +0900, Michael Paquier wrote:
> On Thu, Oct 31, 2024 at 07:55:45AM +1100, Peter Smith wrote:
> > +/*
> > + * Test if a memory region starting at p and of size len is full of zeroes.
> > + */
> > +static inline bool
> > +pg_mem_is_all_zeros(const void *ptr, size_t len)
> >
> > The function comment should say 'ptr' instead of 'p', right?
>
> Yes.

Thank you both for looking at it. Agree, done in the new version attached.

> +static inline bool
> +pg_mem_is_all_zeros(const void *ptr, size_t len)
>
> While we're talking about wordsmithing things, I would not choose
> "mem" for this routine, just stick to "memory". There is not much in
> the code that does memory-specific things like what you are proposing
> here. Still, this would be more consistent with the macros for memory
> barriers at least. Hence, "pg_memory_is_all_zeros()" makes more
> sense?

That makes sense to me, done that way in the attached.

> The location of memutils.h is sensible.

Thanks for sharing your thoughts on it.

> + if (pg_mem_is_all_zeros(pagebytes , (BLCKSZ / sizeof(size_t))))
>
> Extra space not required after pagebytes.

Fat finger here, thanks!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v6-0001-New-pg_memory_is_all_zeros-ptr-len-inline-functio.patch	text/x-diff	5.5 KB

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-10-31 06:17:23
Message-ID:	ZyMg8zXEfJZ74ZBX@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Oct 31, 2024 at 05:52:49AM +0000, Bertrand Drouvot wrote:
> That makes sense to me, done that way in the attached.

Seems kind of OK seen from here. I am wondering if others more
comments about the name of the macro, its location, the fact that we
still have pagebytes in bufpage.c, etc.
--
Michael

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 04:45:44
Message-ID:	CAApHDvrXzPAr3FxoBuB7b3D-okNoNA2jxLun1rW8Yw5wkbqusw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 31 Oct 2024 at 19:17, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> Seems kind of OK seen from here. I am wondering if others more
> comments about the name of the macro, its location, the fact that we
> still have pagebytes in bufpage.c, etc.

This change looks incorrect:

@@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber
blkno, int flags)
}

/* Check all-zeroes case */
- all_zeroes = true;
pagebytes = (size_t *) page;
- for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
- {
- if (pagebytes[i] != 0)
- {
- all_zeroes = false;
- break;
- }
- }

- if (all_zeroes)
+ if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
return true;

I think this should be passing BLCKSZ rather than (BLCKSZ /
sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
zero rather than the entire page.

However, I've also performance concerns as if I look at the assembly
of PageIsVerifiedExtended, I see the zero checking is now done 1 byte
at a time:

(gcc 11.4)

leaq 1024(%rbx), %rdx <-- 1KB bug
.p2align 4,,10
.p2align 3
.L60:
cmpb $0, (%rax) <-- check 1 byte is zero.
jne .L59
addq $1, %rax <-- increment by 1 byte.
cmpq %rax, %rdx <-- check if we've done 1024 bytes yet.
jne .L60

Whereas previously it was doing:

movq %rbx, %rax
leaq 8192(%rbx), %rdx <-- 8KB
jmp .L60
.p2align 4,,10
.p2align 3
.L90:
addq $8, %rax <-- increment by 8 bytes (or sizeof(size_t))
cmpq %rax, %rdx
je .L61
.L60:
cmpq $0, (%rax) <-- checks an entire 8 bytes are zero.

I didn't test how performance-critical this is, but the header comment
for this function does use the words "cheaply detect".

* This is called when a page has just been read in from disk. The idea is
* to cheaply detect trashed pages before we go nuts following bogus line
* pointers, testing invalid transaction identifiers, etc.

so it seems a bit dangerous to switch this loop to byte-at-a-time
rather than doing 8 bytes at once without testing the performance
isn't affected.

David

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 06:27:38
Message-ID:	ZyR02ofHiWG1HmLI@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 01, 2024 at 05:45:44PM +1300, David Rowley wrote:
> I think this should be passing BLCKSZ rather than (BLCKSZ /
> sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
> zero rather than the entire page.

Ugh, Friday brain fart. The attached should be able to fix that, this
brings back the movl to its correct path:
- movl $1024, %esi
+ movl $8192, %esi

Does that look fine to you?

> I didn't test how performance-critical this is, but the header comment
> for this function does use the words "cheaply detect".

Under gcc -O2 or -O3, the single-byte check or the 8-byte check don't
make a difference. Please see the attached (allzeros.txt) for a quick
check if you want to check by yourself. With 1M iterations, both
average around 3ms for 1M iterations on my laptop (not the fastest
thing around).

Under -O0, though, the difference is noticeable:
- 1-byte check: 3.52s for 1M iterations, averaging one check at
3.52ns.
- 8-byte check: 0.46s for 1M iterations, averaging one check at
0.46ns.

Even for that, I doubt that this is going to be noticeable in
practice, still the difference exists.
--
Michael

Attachment	Content-Type	Size
allzeros.c	text/x-csrc	699 bytes
page-allzeros.patch	text/x-diff	729 bytes

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 06:33:51
Message-ID:	ZyR2Tzdmj9c6E/QB@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Fri, Nov 01, 2024 at 05:45:44PM +1300, David Rowley wrote:
> On Thu, 31 Oct 2024 at 19:17, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > Seems kind of OK seen from here. I am wondering if others more
> > comments about the name of the macro, its location, the fact that we
> > still have pagebytes in bufpage.c, etc.
>
> This change looks incorrect:
>
> @@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber
> blkno, int flags)
> }
>
> /* Check all-zeroes case */
> - all_zeroes = true;
> pagebytes = (size_t *) page;
> - for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
> - {
> - if (pagebytes[i] != 0)
> - {
> - all_zeroes = false;
> - break;
> - }
> - }
>
> - if (all_zeroes)
> + if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
> return true;
>
> I think this should be passing BLCKSZ rather than (BLCKSZ /
> sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
> zero rather than the entire page.

Thanks for looking at it!

Nice catch, indeed by using the new function we are changing the pointer arithmetic
here and then we should pass BLCKSZ instead.

> However, I've also performance concerns as if I look at the assembly
> of PageIsVerifiedExtended, I see the zero checking is now done 1 byte
> at a time:
>
> (gcc 11.4)
>
> leaq 1024(%rbx), %rdx <-- 1KB bug
> .p2align 4,,10
> .p2align 3
> .L60:
> cmpb $0, (%rax) <-- check 1 byte is zero.
> jne .L59
> addq $1, %rax <-- increment by 1 byte.
> cmpq %rax, %rdx <-- check if we've done 1024 bytes yet.
> jne .L60
>
> Whereas previously it was doing:
>
> movq %rbx, %rax
> leaq 8192(%rbx), %rdx <-- 8KB
> jmp .L60
> .p2align 4,,10
> .p2align 3
> .L90:
> addq $8, %rax <-- increment by 8 bytes (or sizeof(size_t))
> cmpq %rax, %rdx
> je .L61
> .L60:
> cmpq $0, (%rax) <-- checks an entire 8 bytes are zero.
>
> I didn't test how performance-critical this is, but the header comment
> for this function does use the words "cheaply detect".
>
> * This is called when a page has just been read in from disk. The idea is
> * to cheaply detect trashed pages before we go nuts following bogus line
> * pointers, testing invalid transaction identifiers, etc.
>
> so it seems a bit dangerous to switch this loop to byte-at-a-time
> rather than doing 8 bytes at once without testing the performance
> isn't affected.

Agree, I did a quick test (see [0]) and it looks like it's significantly slower
using the new inline function.

We could try to write a more elaborate version of pg_memory_is_all_zeros(), but
as it looks like there is only one use case, then it's probably better to not
implement (revert) this change here and "just" add a comment as to why pg_memory_is_all_zeros()
should not be used here, thoughts?

[0]: https://godbolt.org/z/xqnW4MPY5

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 07:14:09
Message-ID:	ZyR_wewtRWQsBBCT@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 01, 2024 at 06:33:51AM +0000, Bertrand Drouvot wrote:
> We could try to write a more elaborate version of pg_memory_is_all_zeros(), but
> as it looks like there is only one use case, then it's probably better to not
> implement (revert) this change here and "just" add a comment as to why pg_memory_is_all_zeros()
> should not be used here, thoughts?
>
> [0]: https://godbolt.org/z/xqnW4MPY5

Note that the two printf() calls make the code less optimized.

Anyway, I see the following from bufpage.s for these lines under -O2:

1) On HEAD at 07e9e28b56db:
.LVL306:
.loc 3 201 23 is_stmt 1 discriminator 1 view .LVU547
cmpq $1024, %rbx <- Yep, that's wrong.
jne .L417

2) On HEAD at 49d6c7d8daba:
.LVL299:
.loc 1 131 16 is_stmt 0 discriminator 1 view .LVU524
cmpq $8192, %rbx
je .L419

3) With the patch sent at [1]:
.LVL306:
.loc 3 201 23 is_stmt 1 discriminator 1 view .LVU545
cmpq $8192, %rbx
jne .L417

So it does not matter one way or another for 2) or 3), does it?

[1]: /message-id/ZyR02ofHiWG1HmLI@paquier.xyz
--
Michael

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 07:19:39
Message-ID:	CAApHDvoL5B87E8YamUu+-5RhuGzV1C8tvHQehZ78nV1nCXq7Jw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 1 Nov 2024 at 19:27, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> Under gcc -O2 or -O3, the single-byte check or the 8-byte check don't
> make a difference. Please see the attached (allzeros.txt) for a quick
> check if you want to check by yourself. With 1M iterations, both
> average around 3ms for 1M iterations on my laptop (not the fastest
> thing around).
>
> Under -O0, though, the difference is noticeable:
> - 1-byte check: 3.52s for 1M iterations, averaging one check at
> 3.52ns.
> - 8-byte check: 0.46s for 1M iterations, averaging one check at
> 0.46ns.
>
> Even for that, I doubt that this is going to be noticeable in
> practice, still the difference exists.

The reason you're not seeing the slowdown with -O2 and -O3 is because
your compiler didn't think there was anything to do so didn't emit the
code you were trying to benchmark. Try looking at allzeros.s after
doing "gcc allzeros.c -S -O2".

I've attached an updated version for you to try. I used a volatile
bool and assigned the function result to it to prevent the compiler
from optimising out the test.

$ gcc allzeros.c -O2 -o allzeros
$ ./allzeros
char: done in 1607800 nanoseconds
size_t: done in 208800 nanoseconds (7.70019 times faster)

$ gcc allzeros.c -O3 -o allzeros
$ ./allzeros
char: done in 1584500 nanoseconds
size_t: done in 225700 nanoseconds (7.02038 times faster)

David

Attachment	Content-Type	Size
allzeros_v2.c	text/plain	1.6 KB

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 07:21:50
Message-ID:	CAApHDvrEDWPEyWQF7UhAt+8JR6w2eVpv1xP21FqFH+c=8Lb3JA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 1 Nov 2024 at 19:33, Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> Agree, I did a quick test (see [0]) and it looks like it's significantly slower
> using the new inline function.
>
> We could try to write a more elaborate version of pg_memory_is_all_zeros(), but
> as it looks like there is only one use case, then it's probably better to not
> implement (revert) this change here and "just" add a comment as to why pg_memory_is_all_zeros()
> should not be used here, thoughts?

My vote is to just revert this usage of the function. Anything more
elaborate would need to check pointer alignment before using any types
larger than char. The previous code does not need to do that because
the page is going to be at least MAXALIGNed.

David

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 07:27:45
Message-ID:	CAApHDvq=udkVyRBTvU+K6Ov4VTOzDQpG8=081-8+bNQTTTHxFQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 1 Nov 2024 at 20:14, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> 2) On HEAD at 49d6c7d8daba:
> .LVL299:
> .loc 1 131 16 is_stmt 0 discriminator 1 view .LVU524
> cmpq $8192, %rbx
> je .L419
>
> 3) With the patch sent at [1]:
> .LVL306:
> .loc 3 201 23 is_stmt 1 discriminator 1 view .LVU545
> cmpq $8192, %rbx
> jne .L417
>
> So it does not matter one way or another for 2) or 3), does it?

The patch in [1] will fix the bug. But I'm still concerned about the
performance implications of moving to byte-at-a-time processing. There
are about 8 times more instructions being expected to do the same
work.

David

> [1]: /message-id/ZyR02ofHiWG1HmLI@paquier.xyz

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 07:36:45
Message-ID:	ZySFDUu5-1jConZE@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 01, 2024 at 08:21:50PM +1300, David Rowley wrote:
> My vote is to just revert this usage of the function. Anything more
> elaborate would need to check pointer alignment before using any types
> larger than char. The previous code does not need to do that because
> the page is going to be at least MAXALIGNed.

Fine, here you go. The attached reverts back this part in bufpage.c
to what it was in 49d6c7d8daba.
--
Michael

Attachment	Content-Type	Size
bufpage-revert.patch	text/x-diff	895 bytes

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 07:44:22
Message-ID:	ZySG1uHmmgQT+dpE@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Fri, Nov 01, 2024 at 04:36:45PM +0900, Michael Paquier wrote:
> On Fri, Nov 01, 2024 at 08:21:50PM +1300, David Rowley wrote:
> > My vote is to just revert this usage of the function. Anything more
> > elaborate would need to check pointer alignment before using any types
> > larger than char. The previous code does not need to do that because
> > the page is going to be at least MAXALIGNed.
>
> Fine, here you go. The attached reverts back this part in bufpage.c
> to what it was in 49d6c7d8daba.

Thanks! Worth to add a comment as to why pg_memory_is_all_zeros() should not
be used here?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 07:49:21
Message-ID:	ZySIAShKvyGxdc8n@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 01, 2024 at 07:44:22AM +0000, Bertrand Drouvot wrote:
> Thanks!

Cool, will fix that in a bit.

> Worth to add a comment as to why pg_memory_is_all_zeros() should not
> be used here?

I would not add one in bufpage.c, documenting that where
pg_memory_is_all_zeros() is defined may be more adapted.
--
Michael

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 08:47:05
Message-ID:	CAApHDvqULM7KhreSa8PU2bshrguOaRdJsfjKCTvDkByYRFLaMg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 1 Nov 2024 at 20:49, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Fri, Nov 01, 2024 at 07:44:22AM +0000, Bertrand Drouvot wrote:
> > Worth to add a comment as to why pg_memory_is_all_zeros() should not
> > be used here?
>
> I would not add one in bufpage.c, documenting that where
> pg_memory_is_all_zeros() is defined may be more adapted.

The thought of having to write a comment to warn people not to use it
for performance-critical things makes me think it might be better just
to write a more optimal version of the function so we don't need to
warn people. I looked around at the callers of the function I saw the
following numbers of bytes being used for the length: 8192 (the one in
question), 88, 32 and 112.

I don't know how performance-critical the final three of those are,
but I imagine all apart from the 32-byte one might be better with a
non-inlined and more optimised version of the function. The problem
with inlining the optimised version is that it's more code to inline.

I've attached what I thought a more optimal version might look like in
case anyone thinks making it better is a good idea.

David

Attachment	Content-Type	Size
pg_memory_is_all_zeros_speedup.patch	application/octet-stream	808 bytes

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 12:50:10
Message-ID:	ZyTOgpIlQTgMBrah@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> On Fri, 1 Nov 2024 at 20:49, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Fri, Nov 01, 2024 at 07:44:22AM +0000, Bertrand Drouvot wrote:
> > > Worth to add a comment as to why pg_memory_is_all_zeros() should not
> > > be used here?
> >
> > I would not add one in bufpage.c, documenting that where
> > pg_memory_is_all_zeros() is defined may be more adapted.
>
> The thought of having to write a comment to warn people not to use it
> for performance-critical things makes me think it might be better just
> to write a more optimal version of the function so we don't need to
> warn people.

Yeah, that's probably a good idea to write a more elaborate function.

> I looked around at the callers of the function I saw the
> following numbers of bytes being used for the length: 8192 (the one in
> question), 88, 32 and 112.
>
> I don't know how performance-critical the final three of those are,
> but I imagine all apart from the 32-byte one might be better with a
> non-inlined and more optimised version of the function. The problem
> with inlining the optimised version is that it's more code to inline.

I agree that's more code to inline and contains multiple loops and branches.

For the last 3 callers, I think that non inline would still be "cheap" as compared
to what lead to those code paths (stats increments).

> I've attached what I thought a more optimal version might look like in
> case anyone thinks making it better is a good idea.
>

Thanks for the proposal!

I like the idea, I think that's worth to add a few comments, something like:

1 ===

+ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)

Add a comment like "Checks bytes, byte by byte, until the pointer is aligned"?

2 ===

+ for (; p < aligned_end; p += sizeof(size_t))

Add a comment like "Multiple bytes comparison(s) at once"?

3 ===

+ while (p < end)
+ {

Add a comment like "Compare remaining bytes, byte by byte"?

4 ===

Out of curiosity I did test your proposal and it performs well (see [0]) for
the PageIsVerifiedExtended() case.

[0]: https://godbolt.org/z/Mdaxz5W7c

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Japin Li <japinli(at)hotmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-01 15:01:13
Message-ID:	ME0P300MB044533A6CD665AFA30ED9354B6562@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 01 Nov 2024 at 21:47, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> On Fri, 1 Nov 2024 at 20:49, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>>
>> On Fri, Nov 01, 2024 at 07:44:22AM +0000, Bertrand Drouvot wrote:
>> > Worth to add a comment as to why pg_memory_is_all_zeros() should not
>> > be used here?
>>
>> I would not add one in bufpage.c, documenting that where
>> pg_memory_is_all_zeros() is defined may be more adapted.
>
> The thought of having to write a comment to warn people not to use it
> for performance-critical things makes me think it might be better just
> to write a more optimal version of the function so we don't need to
> warn people. I looked around at the callers of the function I saw the
> following numbers of bytes being used for the length: 8192 (the one in
> question), 88, 32 and 112.
>
> I don't know how performance-critical the final three of those are,
> but I imagine all apart from the 32-byte one might be better with a
> non-inlined and more optimised version of the function. The problem
> with inlining the optimised version is that it's more code to inline.
>
+1

Is there a possible overflow?

+ const char *end = &p[len];

How about use MAXALIGN64 macro here?

When handling the aligned, is it possible to handle multiple values
(such as 4 or 8) in one iteration?

It might be faster. However, I'm not tested.

--
Regrads,
Japin Li

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-04 09:50:05
Message-ID:	ZyiYzakU83uFMfX0@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Fri, Nov 01, 2024 at 12:50:10PM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > On Fri, 1 Nov 2024 at 20:49, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > I've attached what I thought a more optimal version might look like in
> > case anyone thinks making it better is a good idea.
> >
>
> Thanks for the proposal!
>
> I like the idea, I think that's worth to add a few comments, something like:
>
> 1 ===
>
> + while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
>
> Add a comment like "Checks bytes, byte by byte, until the pointer is aligned"?
>
> 2 ===
>
> + for (; p < aligned_end; p += sizeof(size_t))
>
> Add a comment like "Multiple bytes comparison(s) at once"?
>
> 3 ===
>
> + while (p < end)
> + {
>
> Add a comment like "Compare remaining bytes, byte by byte"?
>
> 4 ===
>
> Out of curiosity I did test your proposal and it performs well (see [0]) for
> the PageIsVerifiedExtended() case.

Also,

5 ===

Shouldn't we handle the cases where ptr is NULL and/or len == 0? (should probably
have already been done in the current version of pg_memory_is_all_zeros() though).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-04 11:24:48
Message-ID:	CAApHDvoU1SfD6rWjGp2tpLd+VGg_TfXhPJgCSNY5Z5LULXiv1A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2 Nov 2024 at 01:50, Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > I've attached what I thought a more optimal version might look like in
> > case anyone thinks making it better is a good idea.
> >
>
> Thanks for the proposal!
>
> I like the idea, I think that's worth to add a few comments, something like:

I'm happy if you want to pick this up and continue working on it. I'm
mostly just keen to not leave the suboptimal version of the function
in core as it is.

One thing you might want to consider is if it's worth having a macro
to help decide if you want to inline the function for smaller sizes
and not inline for larger sizes. A macro that checks something like:
if (__builtin_constant_p(len) && len <= 32) could call an inline
version of the function for smaller sizes and do a function call for
lager sizes. Compilers seem to have heuristics that result in
behaviour like this for library functions such as memset and memcpy.
Maybe some experimentation with godbolt.org would yield the crossover
point in bytes where compilers switch tactics.

I just feel that at the rate we receive small code change suggestions
on the mailing lists, it's just a matter of time before someone will
come along and suggest we use pg_memory_is_all_zeros() in
PageIsVerifiedExtended() again. If we don't optimize that function,
then there's a chance a committer might re-commit what's just been
reverted.

David

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-04 17:17:54
Message-ID:	ZykBwloj77DVlhnN@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Nov 05, 2024 at 12:24:48AM +1300, David Rowley wrote:
> On Sat, 2 Nov 2024 at 01:50, Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >
> > On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > > I've attached what I thought a more optimal version might look like in
> > > case anyone thinks making it better is a good idea.
> > >
> >
> > Thanks for the proposal!
> >
> > I like the idea, I think that's worth to add a few comments, something like:
>
> I'm happy if you want to pick this up and continue working on it.

Sure, please find attached v1, the changes are:

- switch from "const char" to "const unsigned char" (could have been done in the
current version of pg_memory_is_all_zeros() though)
- added some comments
- adding an Assert for ptr != 0
- re-introduce the function call in PageIsVerifiedExtended()
- propose a commit message

> One thing you might want to consider is if it's worth having a macro
> to help decide if you want to inline the function for smaller sizes
> and not inline for larger sizes. A macro that checks something like:
> if (__builtin_constant_p(len) && len <= 32) could call an inline
> version of the function for smaller sizes and do a function call for
> lager sizes. Compilers seem to have heuristics that result in
> behaviour like this for library functions such as memset and memcpy.
> Maybe some experimentation with godbolt.org would yield the crossover
> point in bytes where compilers switch tactics.

Yeah, I did some experiments in [0], what we can observe:

- with gcc 14.2 (X86-64), the assembly code contains calls to memset and memcpy
as soon as the BLCKSZ > 8192 (means, IIUC, those are inlined if <= 8192)
- with clang 19.1.0 the switch is done once BLCKSZ > 256

Given that our sizes of interest here are 8192, 88, 32 and 112 (as mentioned
up-thead) and if we follow the same logic as the compilers above do for memset
and memcpy, then we may want to keep the function inlined (gcc case) or do
a switch at 256 (clang). The switch at 256 would concern only the page case.

More use cases could come in the future with different sizes but it looks
hazardous to commit on a len that would trigger the inline/non-inline switch
on our side.

So, I'm tempted to just keep the function inlined, thoughts?

> I just feel that at the rate we receive small code change suggestions
> on the mailing lists, it's just a matter of time before someone will
> come along and suggest we use pg_memory_is_all_zeros() in
> PageIsVerifiedExtended() again. If we don't optimize that function,
> then there's a chance a committer might re-commit what's just been
> reverted.

Agree.

[0]: https://godbolt.org/z/hrYda53Tj

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v1-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	3.3 KB

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-04 17:38:54
Message-ID:	CAEudQAqteX_B_2a+qRfG3gzOP5vOGTmihHkXCbC_FhjuZhfGhw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi.

Em seg., 4 de nov. de 2024 às 14:18, Bertrand Drouvot <
bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:

> Hi,
>
> On Tue, Nov 05, 2024 at 12:24:48AM +1300, David Rowley wrote:
> > On Sat, 2 Nov 2024 at 01:50, Bertrand Drouvot
> > <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > > > I've attached what I thought a more optimal version might look like
> in
> > > > case anyone thinks making it better is a good idea.
> > > >
> > >
> > > Thanks for the proposal!
> > >
> > > I like the idea, I think that's worth to add a few comments, something
> like:
> >
> > I'm happy if you want to pick this up and continue working on it.
>
> Sure, please find attached v1, the changes are:
>
> - switch from "const char" to "const unsigned char" (could have been done
> in the
> current version of pg_memory_is_all_zeros() though)
> - added some comments
> - adding an Assert for ptr != 0
> - re-introduce the function call in PageIsVerifiedExtended()
> - propose a commit message
>

I think we can add a small optimization to this last patch [1].
The variable *aligned_end* is only needed in the second loop (for).
So, only before the for loop do we actually declare it.

Result before this change:
check zeros using BERTRAND 1 0.000031s

Result after this change:
check zeros using BERTRAND 1 0.000018s

+ const unsigned char *aligned_end;

+ /* Multiple bytes comparison(s) at once */
+ aligned_end = (const unsigned char *) ((uintptr_t) end &
(~(sizeof(size_t) - 1)));
+ for (; p < aligned_end; p += sizeof(size_t))

best regards,
Ranier Vilela

[1] https://godbolt.org/z/fvdzjbnc7

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 03:23:34
Message-ID:	CAApHDvrYW5m8iwwd3NxZxtcNJ_+M9BdMiDidvyUcPKJHxb4Zdg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> wrote:
> I think we can add a small optimization to this last patch [1].
> The variable *aligned_end* is only needed in the second loop (for).
> So, only before the for loop do we actually declare it.
>
> Result before this change:
> check zeros using BERTRAND 1 0.000031s
>
> Result after this change:
> check zeros using BERTRAND 1 0.000018s
>
> + const unsigned char *aligned_end;
>
> + /* Multiple bytes comparison(s) at once */
> + aligned_end = (const unsigned char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));
> + for (; p < aligned_end; p += sizeof(size_t))

I think we all need to stop using Godbolt's servers to run benchmarks
on. These servers are likely to be running various other workloads in
highly virtualised environments and are not going to be stable servers
that would give consistent benchmark results.

I tried your optimisation in the attached allzeros.c and here are my results:

# My version
$ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
char: done in 1566400 nanoseconds
size_t: done in 195400 nanoseconds (8.01638 times faster than char)
char: done in 1537500 nanoseconds
size_t: done in 196300 nanoseconds (7.8324 times faster than char)
char: done in 1543600 nanoseconds
size_t: done in 196300 nanoseconds (7.86347 times faster than char)

# Ranier's optimization
$ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
{1..3}; do ./allzeros; done
char: done in 1943100 nanoseconds
size_t: done in 531700 nanoseconds (3.6545 times faster than char)
char: done in 1957200 nanoseconds
size_t: done in 458400 nanoseconds (4.26963 times faster than char)
char: done in 1949500 nanoseconds
size_t: done in 469000 nanoseconds (4.15672 times faster than char)

Seems to be about half as fast with gcc on -O2

David

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 04:08:41
Message-ID:	CAApHDvopp0AUMLQE4TpH1e=qF3dhxfVjU8eJE7_tsWMjdrzi-A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> wrote:
> I think we can add a small optimization to this last patch [1].

I think if you want to make it faster, you could partially unroll the
inner-most loop, like:

// size_t * 4
for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
{
if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 | ((size_t *)
p)[2] != 0 | ((size_t *) p)[3] != 0)
return false;
}

$ gcc allzeros.c -O2 -o allzeros && ./allzeros
char: done in 1595000 nanoseconds
size_t: done in 198300 nanoseconds (8.04337 times faster than char)
size_t * 4: done in 81500 nanoseconds (19.5706 times faster than char)
size_t * 8: done in 71000 nanoseconds (22.4648 times faster than char)

The final one above is 110GB/sec, so probably only going that fast
because the memory being checked is in L1. DDR5 is only 64GB/sec. So
it's probably overkill to unroll the loop that much.

Also, doing something like that means the final byte-at-a-time loop
might have more to do, which might cases with a long remainder slower.
To make up for that there's some incentive to introduce yet another
loop to process single size_t's up to aligned_end. Then you end up
with even more code.

I was happy enough with my patch with Bertrand's comments. I'm not
sure why unsigned chars are better than chars. It doesn't seem to have
any effect on the compiled code.

David

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 04:12:03
Message-ID:	ZymbExZuntJmxuWK@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 05, 2024 at 04:23:34PM +1300, David Rowley wrote:
> I tried your optimisation in the attached allzeros.c and here are my results:
>
> # My version
> $ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
> char: done in 1543600 nanoseconds
> size_t: done in 196300 nanoseconds (7.86347 times faster than char)
>
> # Ranier's optimization
> $ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
> size_t: done in 531700 nanoseconds (3.6545 times faster than char)
> char: done in 1957200 nanoseconds

I am not seeing numbers as good as yours, but the winner is clear as
well here:

$ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do
./allzeros; done
char: done in 6578995 nanoseconds
size_t: done in 829916 nanoseconds (7.9273 times faster than char)
char: done in 6581465 nanoseconds
size_t: done in 829948 nanoseconds (7.92997 times faster than char)
char: done in 6585748 nanoseconds
size_t: done in 834929 nanoseconds (7.88779 times faster than char)

$ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
{1..3}; do ./allzeros;
done char: done in 6591803 nanoseconds
size_t: done in 1236102 nanoseconds (5.33273 times faster than char)
char: done in 6606219 nanoseconds
size_t: done in 1235979 nanoseconds (5.34493 times faster than char)
char: done in 6594413 nanoseconds
size_t: done in 1238770 nanoseconds (5.32336 times faster than char)

I'm surprised to see that assigning aligned_end at these two different
locations has this much effect once the compiler optimizes the
surroundings, but well.
--
Michael

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 04:31:58
Message-ID:	ZymfviixIBCNvzTJ@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 04, 2024 at 05:17:54PM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Tue, Nov 05, 2024 at 12:24:48AM +1300, David Rowley wrote:
> > On Sat, 2 Nov 2024 at 01:50, Bertrand Drouvot
> > <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > > > I've attached what I thought a more optimal version might look like in
> > > > case anyone thinks making it better is a good idea.
> > > >
> > >
> > > Thanks for the proposal!
> > >
> > > I like the idea, I think that's worth to add a few comments, something like:
> >
> > I'm happy if you want to pick this up and continue working on it.
>
> Sure, please find attached v1, the changes are:
>
> - switch from "const char" to "const unsigned char" (could have been done in the
> current version of pg_memory_is_all_zeros() though)
> - added some comments

+ * The test is divided into three phases for efficiency:
+ * - Initial alignment (byte per byte comparison)
+ * - Multiple bytes comparison at once
+ * - Remaining bytes (byte per byte comparison)

It does not look like this insists enough on the alignment part of the
optization? A MAXALIGN'd size would use only size_t comparisons, and
a pointer aligned would do no byte comparisons.

> - adding an Assert for ptr != 0

I'm not sure that the Assert() addition is a good idea. That could
get hot very easily depending on the caller, even if for assert
builds we don't care much about the performance, that could lead to
some paths being a lot slower.

On Fri, 01 Nov 2024 at 21:47, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> The thought of having to write a comment to warn people not to use it
> for performance-critical things makes me think it might be better just
> to write a more optimal version of the function so we don't need to
> warn people. I looked around at the callers of the function I saw the
> following numbers of bytes being used for the length: 8192 (the one in
> question), 88, 32 and 112.
>
> I don't know how performance-critical the final three of those are,
> but I imagine all apart from the 32-byte one might be better with a
> non-inlined and more optimised version of the function. The problem
> with inlining the optimised version is that it's more code to inline.

These three don't matter in terms of performance. The cycles spent
for all-zero checks of the checkpointer and bgwriter are nothing
compared to the concurrent I/O activity they handle, and the sizes are
small compared to the 8k pages. The flush callback for relation
pgstats happens once at a given interval (see around
PGSTAT_[MIN|MAX]_INTERVAL).
--
Michael

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 05:49:22
Message-ID:	Zymx4q58bpBp37mw@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:
> I was happy enough with my patch with Bertrand's comments. I'm not
> sure why unsigned chars are better than chars. It doesn't seem to have
> any effect on the compiled code.
>

I think that unsigned chars is better than char for byte level memory
operations (no sign extension issues).

Though I agree that using char in this function is not an issue as p is only
compared with 0. This is more a matter of taste here.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 05:59:46
Message-ID:	Zym0UvNM+fswQsIZ@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Nov 05, 2024 at 01:31:58PM +0900, Michael Paquier wrote:
> On Mon, Nov 04, 2024 at 05:17:54PM +0000, Bertrand Drouvot wrote:
> > Hi,
> >
> > On Tue, Nov 05, 2024 at 12:24:48AM +1300, David Rowley wrote:
> > > On Sat, 2 Nov 2024 at 01:50, Bertrand Drouvot
> > > <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > > >
> > > > On Fri, Nov 01, 2024 at 09:47:05PM +1300, David Rowley wrote:
> > > > > I've attached what I thought a more optimal version might look like in
> > > > > case anyone thinks making it better is a good idea.
> > > > >
> > > >
> > > > Thanks for the proposal!
> > > >
> > > > I like the idea, I think that's worth to add a few comments, something like:
> > >
> > > I'm happy if you want to pick this up and continue working on it.
> >
> > Sure, please find attached v1, the changes are:
> >
> > - switch from "const char" to "const unsigned char" (could have been done in the
> > current version of pg_memory_is_all_zeros() though)
> > - added some comments
>
> + * The test is divided into three phases for efficiency:
> + * - Initial alignment (byte per byte comparison)
> + * - Multiple bytes comparison at once
> + * - Remaining bytes (byte per byte comparison)
>
> It does not look like this insists enough on the alignment part of the
> optization? A MAXALIGN'd size would use only size_t comparisons, and
> a pointer aligned would do no byte comparisons.

I'm not sure to get this one. Is it more clear in the code comments that
we can start multiple bytes comparison once p is aligned?

> > - adding an Assert for ptr != 0
>
> I'm not sure that the Assert() addition is a good idea. That could
> get hot very easily depending on the caller, even if for assert
> builds we don't care much about the performance, that could lead to
> some paths being a lot slower.

Yeah, agree, removed in v2 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v2-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	3.3 KB

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 12:49:53
Message-ID:	CAEudQAqWbLS1q4VZ+d4Tx5T-DHryQ8juu=Kgn9wfiV8iB-pzyA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em ter., 5 de nov. de 2024 às 00:23, David Rowley <dgrowleyml(at)gmail(dot)com>
escreveu:

> On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> wrote:
> > I think we can add a small optimization to this last patch [1].
> > The variable *aligned_end* is only needed in the second loop (for).
> > So, only before the for loop do we actually declare it.
> >
> > Result before this change:
> > check zeros using BERTRAND 1 0.000031s
> >
> > Result after this change:
> > check zeros using BERTRAND 1 0.000018s
> >
> > + const unsigned char *aligned_end;
> >
> > + /* Multiple bytes comparison(s) at once */
> > + aligned_end = (const unsigned char *) ((uintptr_t) end &
> (~(sizeof(size_t) - 1)));
> > + for (; p < aligned_end; p += sizeof(size_t))
>
> I think we all need to stop using Godbolt's servers to run benchmarks
> on. These servers are likely to be running various other workloads in
> highly virtualised environments and are not going to be stable servers
> that would give consistent benchmark results.
>
> I tried your optimisation in the attached allzeros.c and here are my
> results:
>
> # My version
> $ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
> char: done in 1566400 nanoseconds
> size_t: done in 195400 nanoseconds (8.01638 times faster than char)
> char: done in 1537500 nanoseconds
> size_t: done in 196300 nanoseconds (7.8324 times faster than char)
> char: done in 1543600 nanoseconds
> size_t: done in 196300 nanoseconds (7.86347 times faster than char)
>
> # Ranier's optimization
> $ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
> {1..3}; do ./allzeros; done
> char: done in 1943100 nanoseconds
> size_t: done in 531700 nanoseconds (3.6545 times faster than char)
> char: done in 1957200 nanoseconds
> size_t: done in 458400 nanoseconds (4.26963 times faster than char)
> char: done in 1949500 nanoseconds
> size_t: done in 469000 nanoseconds (4.15672 times faster than char)
>
> Seems to be about half as fast with gcc on -O2
>
Thanks for test coding.

I've tried with msvc 2022 32bits
Here the results:

C:\usr\src\tests\allzeros>allzeros
char: done in 71431900 nanoseconds
size_t: done in 18010900 nanoseconds (3.96604 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 71070100 nanoseconds
size_t: done in 19654300 nanoseconds (3.61601 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 68682400 nanoseconds
size_t: done in 19841100 nanoseconds (3.46162 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 63215100 nanoseconds
size_t: done in 17920200 nanoseconds (3.52759 times faster than char)

C:\usr\src\tests\allzeros>c /DRANIERS_OPTIMIZATION

Microsoft (R) Program Maintenance Utility Versão 14.40.33813.0
Direitos autorais da Microsoft Corporation. Todos os direitos reservados.

C:\usr\src\tests\allzeros>allzeros
char: done in 67213800 nanoseconds
size_t: done in 15049200 nanoseconds (4.46627 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 51505900 nanoseconds
size_t: done in 13645700 nanoseconds (3.77452 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 62852600 nanoseconds
size_t: done in 17863800 nanoseconds (3.51843 times faster than char)

C:\usr\src\tests\allzeros>allzeros
char: done in 51877200 nanoseconds
size_t: done in 13759900 nanoseconds (3.77017 times faster than char)

The function used to replace clock_getime is:
timespec_get(ts, TIME_UTC)

best regards,
Ranier Vilela

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 12:53:19
Message-ID:	CAEudQApPnmNpS+2h-mMRERYz+Xthzmt+_=18vwLJTCgFjnBhmA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em ter., 5 de nov. de 2024 às 01:12, Michael Paquier <michael(at)paquier(dot)xyz>
escreveu:

> On Tue, Nov 05, 2024 at 04:23:34PM +1300, David Rowley wrote:
> > I tried your optimisation in the attached allzeros.c and here are my
> results:
> >
> > # My version
> > $ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do ./allzeros; done
> > char: done in 1543600 nanoseconds
> > size_t: done in 196300 nanoseconds (7.86347 times faster than char)
> >
> > # Ranier's optimization
> > $ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
> > size_t: done in 531700 nanoseconds (3.6545 times faster than char)
> > char: done in 1957200 nanoseconds
>
> I am not seeing numbers as good as yours, but the winner is clear as
> well here:
>
Thanks for testing.

>
> $ gcc allzeros.c -O2 -o allzeros && for i in {1..3}; do
> ./allzeros; done
> char: done in 6578995 nanoseconds
> size_t: done in 829916 nanoseconds (7.9273 times faster than char)
> char: done in 6581465 nanoseconds
> size_t: done in 829948 nanoseconds (7.92997 times faster than char)
> char: done in 6585748 nanoseconds
> size_t: done in 834929 nanoseconds (7.88779 times faster than char)
>
> $ gcc allzeros.c -O2 -D RANIERS_OPTIMIZATION -o allzeros && for i in
> {1..3}; do ./allzeros;
> done char: done in 6591803 nanoseconds
> size_t: done in 1236102 nanoseconds (5.33273 times faster than char)
> char: done in 6606219 nanoseconds
> size_t: done in 1235979 nanoseconds (5.34493 times faster than char)
> char: done in 6594413 nanoseconds
> size_t: done in 1238770 nanoseconds (5.32336 times faster than char)
>
> I'm surprised to see that assigning aligned_end at these two different
> locations has this much effect once the compiler optimizes the
> surroundings, but well.
>
I think that's a plus point for the benefit of not touching the memory if
it's not explicitly necessary.

best regards,
Ranier Vilela

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 15:03:23
Message-ID:	Zyozu5B6+W8FyF8m@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:
> On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> wrote:
> > I think we can add a small optimization to this last patch [1].
>
> I think if you want to make it faster, you could partially unroll the
> inner-most loop, like:
>
> // size_t * 4
> for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
> {
> if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 | ((size_t *)
> p)[2] != 0 | ((size_t *) p)[3] != 0)
> return false;
> }

Another option could be to use SIMD instructions to check multiple bytes
is zero in a single operation. Maybe just an idea to keep in mind and experiment
if we feel the need later on.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-05 23:16:33
Message-ID:	CAApHDvq7P-JgFhgtxUPqhavG-qSDVUhyWaEX9M8_MNorFEijZA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> Another option could be to use SIMD instructions to check multiple bytes
> is zero in a single operation. Maybe just an idea to keep in mind and experiment
> if we feel the need later on.

Could do. I just wrote it that way to give the compiler flexibility to
do SIMD implicitly. That seemed easier than messing around with SIMD
intrinsics. I guess the compiler won't use SIMD with the single
size_t-at-a-time version as it can't be certain it's ok to access the
memory beyond the first zero word. Because I wrote the "if" condition
using bitwise-OR, there's no boolean short-circuiting, so the compiler
sees it must be safe to access all the memory for the loop iteration.

If I use -march=native or -march=znver2 on my Zen2 machine, gcc does
use SIMD operators. Clang uses some 128-bit registers without
specifying -march:

drowley(at)amd3990x:~$ gcc -O2 allzeros.c -march=native -o allzeros &&
for i in {1..3}; do ./allzeros; done
char: done in 1940539 nanoseconds
size_t: done in 261731 nanoseconds (7.41425 times faster than char)
size_t * 4: done in 130415 nanoseconds (14.8797 times faster than char)
size_t * 8: done in 70031 nanoseconds (27.7097 times faster than char)
char: done in 3030132 nanoseconds
size_t: done in 477044 nanoseconds (6.35189 times faster than char)
size_t * 4: done in 123551 nanoseconds (24.5254 times faster than char)
size_t * 8: done in 68549 nanoseconds (44.2039 times faster than char)
char: done in 3214037 nanoseconds
size_t: done in 256901 nanoseconds (12.5108 times faster than char)
size_t * 4: done in 126017 nanoseconds (25.5048 times faster than char)
size_t * 8: done in 73167 nanoseconds (43.9274 times faster than char)

David

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 00:19:54
Message-ID:	Zyq2Kp98n8KVAOBJ@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Nov 06, 2024 at 12:16:33PM +1300, David Rowley wrote:
> On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>> Another option could be to use SIMD instructions to check multiple bytes
>> is zero in a single operation. Maybe just an idea to keep in mind and experiment
>> if we feel the need later on.
>
> Could do. I just wrote it that way to give the compiler flexibility to
> do SIMD implicitly. That seemed easier than messing around with SIMD
> intrinsics. I guess the compiler won't use SIMD with the single
> size_t-at-a-time version as it can't be certain it's ok to access the
> memory beyond the first zero word. Because I wrote the "if" condition
> using bitwise-OR, there's no boolean short-circuiting, so the compiler
> sees it must be safe to access all the memory for the loop iteration.

How complex would that be compared to the latest patch proposed if
done this way? If you can force SIMD without having to know about
these specific gcc switches (aka -march is not set by default in the
tree except for some armv8 path), then the performance happens
magically. If that makes the code more readable, that's even better.
--
Michael

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 00:38:50
Message-ID:	CAApHDvpcEMOnWRxsHEnZKnJB2fTA0xZQoJGKVEsa7d_CiYp+mQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 6 Nov 2024 at 13:20, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Wed, Nov 06, 2024 at 12:16:33PM +1300, David Rowley wrote:
> > On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
> > <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >> Another option could be to use SIMD instructions to check multiple bytes
> >> is zero in a single operation. Maybe just an idea to keep in mind and experiment
> >> if we feel the need later on.
> >
> > Could do. I just wrote it that way to give the compiler flexibility to
> > do SIMD implicitly. That seemed easier than messing around with SIMD
> > intrinsics. I guess the compiler won't use SIMD with the single
> > size_t-at-a-time version as it can't be certain it's ok to access the
> > memory beyond the first zero word. Because I wrote the "if" condition
> > using bitwise-OR, there's no boolean short-circuiting, so the compiler
> > sees it must be safe to access all the memory for the loop iteration.
>
> How complex would that be compared to the latest patch proposed if
> done this way? If you can force SIMD without having to know about
> these specific gcc switches (aka -march is not set by default in the
> tree except for some armv8 path), then the performance happens
> magically. If that makes the code more readable, that's even better.

We could just write it that way and leave it up to the compiler to
decide whether to use SIMD or not. Going by [1], gcc with -O2 uses
SIMD instructions from 14.1 and clang with -O2 does it from version
8.0.0. gcc 14.1 was release in May 2024, so still quite new. It'll be
quite a bit older once PG18 is out. Using the bitwise-OR method, more
and more people will benefit as gcc14.1 and beyond becomes more
mainstream.

Clang 8.0.0 is from March 2019, so quite old already.

David

[1] https://godbolt.org/z/MTqao8scW

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 00:52:21
Message-ID:	Zyq9xTkt0-BLCZci@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Nov 06, 2024 at 01:38:50PM +1300, David Rowley wrote:
> We could just write it that way and leave it up to the compiler to
> decide whether to use SIMD or not. Going by [1], gcc with -O2 uses
> SIMD instructions from 14.1 and clang with -O2 does it from version
> 8.0.0. gcc 14.1 was release in May 2024, so still quite new. It'll be
> quite a bit older once PG18 is out. Using the bitwise-OR method, more
> and more people will benefit as gcc14.1 and beyond becomes more
> mainstream.
>
> Clang 8.0.0 is from March 2019, so quite old already.

Okay, WFM to keep things the way they are in the patch.
--
Michael

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 02:53:48
Message-ID:	CAApHDvp2qcO_0=FZy6PHPZKsRgOUuEXqdCiki3bCNweex9-_Uw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 6 Nov 2024 at 13:52, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Wed, Nov 06, 2024 at 01:38:50PM +1300, David Rowley wrote:
> > We could just write it that way and leave it up to the compiler to
> > decide whether to use SIMD or not. Going by [1], gcc with -O2 uses
> > SIMD instructions from 14.1 and clang with -O2 does it from version
> > 8.0.0. gcc 14.1 was release in May 2024, so still quite new. It'll be
> > quite a bit older once PG18 is out. Using the bitwise-OR method, more
> > and more people will benefit as gcc14.1 and beyond becomes more
> > mainstream.
> >
> > Clang 8.0.0 is from March 2019, so quite old already.
>
> Okay, WFM to keep things the way they are in the patch.

I'm not sure if I'm clear on what works for you. The latest patch I
saw did 1 size_t per iteration. Are you saying we should do just
size_t per loop? or we should form the code in a way that allows the
compiler to use SIMD instructions?

David

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 04:44:58
Message-ID:	ZyryQMT7X35dtMpt@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Nov 06, 2024 at 03:53:48PM +1300, David Rowley wrote:
> I'm not sure if I'm clear on what works for you. The latest patch I
> saw did 1 size_t per iteration. Are you saying we should do just
> size_t per loop? or we should form the code in a way that allows the
> compiler to use SIMD instructions?

Oh, sorry. I thought that you wanted to keep the size_t checks as
done in [1] anyway.

But your suggestion is different, and instructions like xmmword would
be enough to show up as you group more the checks 8 at a time:
bool
pg_memory_is_all_zeros_size_t_times_8(const void *ptr, size_t len)
{
const char *p = (const char *) ptr;
const char *end = &p[len];
const char *aligned_end = (const char *) ((uintptr_t) end & (~(sizeof(size_t) - 1)));

while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;
if (*p++ != 0)
return false;
}
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if (((size_t *) p)[0] != 0 |
((size_t *) p)[1] != 0 |
((size_t *) p)[2] != 0 |
((size_t *) p)[3] != 0 |
((size_t *) p)[4] != 0 |
((size_t *) p)[5] != 0 |
((size_t *) p)[6] != 0 |
((size_t *) p)[7] != 0)
return false;
}
while (p < end)
{
if (*p++ != 0)
return false;
}
return true;
}

That's smart for large areas to cover. The patch should document why
we are doing it this way. This should have a few more parenthesis in
the second loop, or -Wparentheses would complain.

Should the last loop check only 1 byte at a time or should this stuff
include one more step before the last one you wrote to do a couple of
checks with size_t? That may matter for areas small enough (len <
sizeof(size_t) * 8) causing the second step to not be taken, but large
enough (len > sizeof(size_t)) to apply a couple of size_t checks per
loop.

[1]: /message-id/Zym0UvNM+fswQsIZ@ip-10-97-1-34.eu-west-3.compute.internal
--
Michael

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 11:38:30
Message-ID:	ZytVNvIbc1vc0qZz@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

ohhh, great, thanks!

> That seemed easier than messing around with SIMD
> intrinsics.

I had in mind to use SIMD intrinsics actually when posting the SIMD idea but...

> I guess the compiler won't use SIMD with the single
> size_t-at-a-time version as it can't be certain it's ok to access the
> memory beyond the first zero word. Because I wrote the "if" condition
> using bitwise-OR, there's no boolean short-circuiting, so the compiler
> sees it must be safe to access all the memory for the loop iteration.

that's a better idea! Yeah, I think that now the compiler sees that all comparisons
can be done in parallel and combined with a single OR operation (so, good candidate
to use SIMD optimization).

> If I use -march=native or -march=znver2 on my Zen2 machine, gcc does
> use SIMD operators. Clang uses some 128-bit registers without
> specifying -march:
>
> drowley(at)amd3990x:~$ gcc -O2 allzeros.c -march=native -o allzeros &&
> for i in {1..3}; do ./allzeros; done
> char: done in 1940539 nanoseconds
> size_t: done in 261731 nanoseconds (7.41425 times faster than char)
> size_t * 4: done in 130415 nanoseconds (14.8797 times faster than char)
> size_t * 8: done in 70031 nanoseconds (27.7097 times faster than char)
> char: done in 3030132 nanoseconds
> size_t: done in 477044 nanoseconds (6.35189 times faster than char)
> size_t * 4: done in 123551 nanoseconds (24.5254 times faster than char)
> size_t * 8: done in 68549 nanoseconds (44.2039 times faster than char)
> char: done in 3214037 nanoseconds
> size_t: done in 256901 nanoseconds (12.5108 times faster than char)
> size_t * 4: done in 126017 nanoseconds (25.5048 times faster than char)
> size_t * 8: done in 73167 nanoseconds (43.9274 times faster than char)
>

Thanks for the tests! Out of curiosity, using gcc 11.4.0 (SIMD instructions not
generated) and get:

$ gcc -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2655385 nanoseconds
size_t: done in 476021 nanoseconds (5.57829 times faster than char)
size_t SIMD DAVID: done in 174816 nanoseconds (15.1896 times faster than char)

$ gcc -march=native -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2681146 nanoseconds
size_t: done in 395041 nanoseconds (6.78701 times faster than char)
size_t SIMD DAVID: done in 175608 nanoseconds (15.2678 times faster than char)

=> It's faster than the size_t one.

But of course, it's even faster with SIMD:

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 5318674 nanoseconds
size_t: done in 443591 nanoseconds (11.99 times faster than char)
size_t SIMD DAVID: done in 179650 nanoseconds (29.6058 times faster than char)

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 5319534 nanoseconds
size_t: done in 426599 nanoseconds (12.4696 times faster than char)
size_t SIMD DAVID: done in 128687 nanoseconds (41.337 times faster than char)

So, I don't see any reason why not to use this SIMD approach: please find v7
attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v7-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	3.7 KB

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 11:40:06
Message-ID:	ZytVlmPJSRzmu5nS@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Wed, Nov 06, 2024 at 01:44:58PM +0900, Michael Paquier wrote:
> Should the last loop check only 1 byte at a time or should this stuff
> include one more step before the last one you wrote to do a couple of
> checks with size_t? That may matter for areas small enough (len <
> sizeof(size_t) * 8) causing the second step to not be taken, but large
> enough (len > sizeof(size_t)) to apply a couple of size_t checks per
> loop.

Do you mean add:

"
for (; p < aligned_end; p += sizeof(size_t))
{
if (*(size_t *)p != 0)
return false;
}
"

just before the last loop?

If so, I did a few tests and did not see any major improvements. So, I thought
it's simpler to not add more code in this inline function in v7 shared up-thread.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 18:34:19
Message-ID:	0ead6139-707f-4a21-b1df-afdd7ae5d1a6@eisentraut.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 05.11.24 16:03, Bertrand Drouvot wrote:
> On Tue, Nov 05, 2024 at 05:08:41PM +1300, David Rowley wrote:
>> On Tue, 5 Nov 2024 at 06:39, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> wrote:
>>> I think we can add a small optimization to this last patch [1].
>>
>> I think if you want to make it faster, you could partially unroll the
>> inner-most loop, like:
>>
>> // size_t * 4
>> for (; p < aligned_end - (sizeof(size_t) * 3); p += sizeof(size_t) * 4)
>> {
>> if (((size_t *) p)[0] != 0 | ((size_t *) p)[1] != 0 | ((size_t *)
>> p)[2] != 0 | ((size_t *) p)[3] != 0)
>> return false;
>> }
>
> Another option could be to use SIMD instructions to check multiple bytes
> is zero in a single operation. Maybe just an idea to keep in mind and experiment
> if we feel the need later on.

Speaking of which, couldn't you just use

pg_popcount(ptr, len) == 0

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 19:05:10
Message-ID:	CAApHDvp==ynKyrV0OjuBH_rJ208hcK4kUJUc8v_0=EyFYAU2Zw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 7 Nov 2024 at 07:34, Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
> Speaking of which, couldn't you just use
>
> pg_popcount(ptr, len) == 0

That might be quite good for small lengths or for use cases where the
memory is always or almost always zero. The problem is there's no
early exit when you find the first non-zero which means, for larger
lengths, reading much more memory. That'll both take longer and
possibly evict cache lines which might be useful to have in the near
future.

David

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-06 19:10:17
Message-ID:	CAApHDvpPjr32eBVdtgmDs0pZa=ODc+odAwGFVocYAnAeGyxqng@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 7 Nov 2024 at 00:40, Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> Do you mean add:
>
> "
> for (; p < aligned_end; p += sizeof(size_t))
> {
> if (*(size_t *)p != 0)
> return false;
> }
> "
>
> just before the last loop?
>
> If so, I did a few tests and did not see any major improvements. So, I thought
> it's simpler to not add more code in this inline function in v7 shared up-thread.

Did you try with a size where there's a decent remainder, say 124
bytes? FWIW, one of the cases has 112 bytes, and I think that is
aligned memory meaning we'll do the first 64 in the SIMD loop and have
to do 48 bytes in the byte-at-a-time loop. If you had the loop Michael
mentioned, that would instead be 6 loops of size_t-at-a-time.

David

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-07 00:44:32
Message-ID:	ZywNcJRvkJ73lcb-@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 07, 2024 at 08:10:17AM +1300, David Rowley wrote:
> Did you try with a size where there's a decent remainder, say 124
> bytes? FWIW, one of the cases has 112 bytes, and I think that is
> aligned memory meaning we'll do the first 64 in the SIMD loop and have
> to do 48 bytes in the byte-at-a-time loop. If you had the loop Michael
> mentioned, that would instead be 6 loops of size_t-at-a-time.

See the attached allzeros.c, based on the previous versions exchanged.
And now just imagine a structure like that:
#define BLCKSZ 48
typedef union AlignedBlock
{
char data[BLCKSZ];
double force_align_d;
int64_t force_align_i64;
} AlignedBlock;

This structure is optimized so as the first step to do the char step
is skipped because the pointer is aligned when allocated, and the
second step with the potential SIMD is skipped because the structure
is small enough at 48 bytes. Hence only the last step would do the
allzero check. Adding a size_t step to force a loop is going to be
more efficient, as proved upthread:
$ gcc -o allzeros -march=native -O2 allzeros.c
$ ./allzeros
allzeros: done in 118332297 nanoseconds
allzeros_v2: done in 13877745 nanoseconds (8.52677 times faster)

The allzero check is used for pgstat entries, and it could be possible
that some out-of-core code needs to rely on such small-ish sizes, or
even something else when a patch author feels like it. So let's make
that optimized as much as we think we can: that's what this discussion
is about.
--
Michael

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Peter Eisentraut <peter(at)eisentraut(dot)org>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-07 00:45:44
Message-ID:	ZywNuH34bxvo7CL4@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 07, 2024 at 08:05:10AM +1300, David Rowley wrote:
> That might be quite good for small lengths or for use cases where the
> memory is always or almost always zero. The problem is there's no
> early exit when you find the first non-zero which means, for larger
> lengths, reading much more memory. That'll both take longer and
> possibly evict cache lines which might be useful to have in the near
> future.

Didn't know this one either, thanks for the explanation.
--
Michael

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-07 08:41:34
Message-ID:	Zyx9PhpGFKFktn9M@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Thu, Nov 07, 2024 at 09:44:32AM +0900, Michael Paquier wrote:
> On Thu, Nov 07, 2024 at 08:10:17AM +1300, David Rowley wrote:
> > Did you try with a size where there's a decent remainder, say 124
> > bytes? FWIW, one of the cases has 112 bytes, and I think that is
> > aligned memory meaning we'll do the first 64 in the SIMD loop and have
> > to do 48 bytes in the byte-at-a-time loop. If you had the loop Michael
> > mentioned, that would instead be 6 loops of size_t-at-a-time.
>
> See the attached allzeros.c, based on the previous versions exchanged.
> And now just imagine a structure like that:
> #define BLCKSZ 48
> typedef union AlignedBlock
> {
> char data[BLCKSZ];
> double force_align_d;
> int64_t force_align_i64;
> } AlignedBlock;
>
> The allzero check is used for pgstat entries, and it could be possible
> that some out-of-core code needs to rely on such small-ish sizes, or
> even something else when a patch author feels like it. So let's make
> that optimized as much as we think we can: that's what this discussion
> is about.

Yeah, fully agree. My initial testing was not "good" enough and so was not showing
as much improvement as your's and David's ones.

Please find v8 attached.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v8-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	4.0 KB

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-07 08:41:52
Message-ID:	Zyx9UF8WWeherIHj@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Thu, Nov 07, 2024 at 09:45:44AM +0900, Michael Paquier wrote:
> On Thu, Nov 07, 2024 at 08:05:10AM +1300, David Rowley wrote:
> > That might be quite good for small lengths or for use cases where the
> > memory is always or almost always zero. The problem is there's no
> > early exit when you find the first non-zero which means, for larger
> > lengths, reading much more memory. That'll both take longer and
> > possibly evict cache lines which might be useful to have in the near
> > future.
>
> Didn't know this one either, thanks for the explanation.

+1, thanks!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-08 05:34:34
Message-ID:	Zy2i6qf0AdJ_4Tq8@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 07, 2024 at 08:41:34AM +0000, Bertrand Drouvot wrote:
> Yeah, fully agree. My initial testing was not "good" enough and so was not showing
> as much improvement as your's and David's ones.
>
> Please find v8 attached.

I've tested that with a couple of structures in an independent module
and that seems to do the job. The patch should be split into two
parts for clarity: one to switch pg_memory_is_all_zeros to use the
optimized version and a second to change PageIsVerifiedExtended().
Not a big deal as the code paths are entirely independent, that's just
my view of the matter.

I've done a round of comment and term cleanup for the whole patch,
while on it.

Btw, gcc seems a bit smarter than clang when it comes to optimizing
the code depending on the size of the structures. gcc gives up on
SIMD if it's sure that the structure on which we are going to use the
allzero check won't need it at all, and clang keeps it even if it does
not need it. That was interesting to see, while going through the
review..
--
Michael

Attachment	Content-Type	Size
v9-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	4.2 KB

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-08 10:18:09
Message-ID:	CAApHDvp2jx_=pFbgj-O1_ZmzP9WOZKfwLzZrS_=ZmbsqMQQ59g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 8 Nov 2024 at 18:34, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> I've done a round of comment and term cleanup for the whole patch,
> while on it.

I don't think "intrinsics" is the correct word to use here:

+ * - 8 * sizeof(size_t) comparisons using bitwise OR, to encourage compilers
+ * to use SIMD intrinsics if available, up to the last aligned location

and

+ * All comparisons are combined with a single OR operation, making it a
+ * good candidate for SIMD intrinsics, if available.

an intrinsic function is a function built into the compiler that
provides some lower-level functionality. e.g. __builtin_popcount().

I'm slightly worried due to the current rate we're receiving cleanup
suggestions that someone might come along and think they'd be doing us
a favour by submitting a patch to "fixup the inefficient bitwise-ORs
and use boolean-OR". Maybe a comment like the following might prevent
that from happening.

+ * For performance reasons, we manually unroll this loop and purposefully
+ * use bitwise-ORs to combine each comparison. This prevents boolean
+ * short-circuiting and lets the compiler know that it's safe to access all 8
+ * elements regardless of the result of the other comparisons. This seems
+ * to be enough to coax a few compilers into using SIMD instructions.

> Btw, gcc seems a bit smarter than clang when it comes to optimizing
> the code depending on the size of the structures. gcc gives up on
> SIMD if it's sure that the structure on which we are going to use the
> allzero check won't need it at all, and clang keeps it even if it does
> not need it. That was interesting to see, while going through the
> review..

Can you share your test case for this? I tried with [1] and the
latest gcc does not seem to be smart enough to figure this out. I
tried adding some additional len checks that the compiler can use as a
cue and won't need to emit code for the checks providing the function
does get inlined. That was enough to get the compiler to not emit the
loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
in the 2nd compiler window. I just don't know if putting something
like that into the code is a good idea as if the function wasn't
inlined for some reason, the extra len checks would have to be
compiled into the function.

David

[1] https://godbolt.org/z/xa81ro8GK

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-08 17:33:52
Message-ID:	Zy5LgNyHzOhnYTTy@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> On Fri, 8 Nov 2024 at 18:34, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > I've done a round of comment and term cleanup for the whole patch,
> > while on it.
>
> I don't think "intrinsics" is the correct word to use here:
>
> + * - 8 * sizeof(size_t) comparisons using bitwise OR, to encourage compilers
> + * to use SIMD intrinsics if available, up to the last aligned location
>
> and
>
> + * All comparisons are combined with a single OR operation, making it a
> + * good candidate for SIMD intrinsics, if available.
>
> an intrinsic function is a function built into the compiler that
> provides some lower-level functionality. e.g. __builtin_popcount().

Agree, replaced by "instructions" in v10 attached.

> I'm slightly worried due to the current rate we're receiving cleanup
> suggestions that someone might come along and think they'd be doing us
> a favour by submitting a patch to "fixup the inefficient bitwise-ORs
> and use boolean-OR".

That's a good point, better to be cautious here.

> Maybe a comment like the following might prevent
> that from happening.
>
> + * For performance reasons, we manually unroll this loop and purposefully
> + * use bitwise-ORs to combine each comparison. This prevents boolean
> + * short-circuiting and lets the compiler know that it's safe to access all 8
> + * elements regardless of the result of the other comparisons. This seems
> + * to be enough to coax a few compilers into using SIMD instructions.

Sounds good to me, used the above in v10.

v10 also splits the patch into 2 parts as suggested by Michael up-thread.

>
> > Btw, gcc seems a bit smarter than clang when it comes to optimizing
> > the code depending on the size of the structures. gcc gives up on
> > SIMD if it's sure that the structure on which we are going to use the
> > allzero check won't need it at all, and clang keeps it even if it does
> > not need it. That was interesting to see, while going through the
> > review..
>
> Can you share your test case for this?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v10-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	3.4 KB
v10-0002-Make-use-of-pg_memory_is_all_zeros-in-PageIsVeri.patch	text/x-diff	1.4 KB

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-08 23:00:35
Message-ID:	Zy6YE27k1xV4I-cf@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> I'm slightly worried due to the current rate we're receiving cleanup
> suggestions that someone might come along and think they'd be doing us
> a favour by submitting a patch to "fixup the inefficient bitwise-ORs
> and use boolean-OR". Maybe a comment like the following might prevent
> that from happening.

Not sure, but OK by me to tweak things more.

> Can you share your test case for this? I tried with [1] and the
> latest gcc does not seem to be smart enough to figure this out. I
> tried adding some additional len checks that the compiler can use as a
> cue and won't need to emit code for the checks providing the function
> does get inlined. That was enough to get the compiler to not emit the
> loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
> in the 2nd compiler window. I just don't know if putting something
> like that into the code is a good idea as if the function wasn't
> inlined for some reason, the extra len checks would have to be
> compiled into the function.

Feel free to use that (I hope it works), and see the difference once
the aligned structure is 121 bytes or more:
https://godbolt.org/z/94393nPGG

At least, I can see that the SIMD loop is ignored.
--
Michael

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-09 04:02:20
Message-ID:	Zy7ezDuTTudvGzPY@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Sat, Nov 09, 2024 at 08:00:35AM +0900, Michael Paquier wrote:
> On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> > I'm slightly worried due to the current rate we're receiving cleanup
> > suggestions that someone might come along and think they'd be doing us
> > a favour by submitting a patch to "fixup the inefficient bitwise-ORs
> > and use boolean-OR". Maybe a comment like the following might prevent
> > that from happening.
>
> Not sure, but OK by me to tweak things more.
>
> > Can you share your test case for this? I tried with [1] and the
> > latest gcc does not seem to be smart enough to figure this out. I
> > tried adding some additional len checks that the compiler can use as a
> > cue and won't need to emit code for the checks providing the function
> > does get inlined. That was enough to get the compiler to not emit the
> > loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
> > in the 2nd compiler window. I just don't know if putting something
> > like that into the code is a good idea as if the function wasn't
> > inlined for some reason, the extra len checks would have to be
> > compiled into the function.
>
> Feel free to use that (I hope it works), and see the difference once
> the aligned structure is 121 bytes or more:
> https://godbolt.org/z/94393nPGG
>
> At least, I can see that the SIMD loop is ignored.

What I see (with the godbolt you shared) is that with BLCKSZ of 120:

gcc: then no SIMD instructions are used (I think that's because sizeof(AlignedBlock)
is 120 which is not a multiple of 16 (SIMD xmm register size)).

with BLCKSZ of 121:

gcc: then SIMD instructions are used (I think that's because sizeof(AlignedBlock)
is 128 which is a multiple of 16 (SIMD xmm register size)).

While clang uses SIMD instructions in both cases (more complex code with more
branches at least in the 120 case).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-09 04:15:04
Message-ID:	Zy7hyG8JUMC5P2T3@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> I tried with [1] and the
> latest gcc does not seem to be smart enough to figure this out. I
> tried adding some additional len checks that the compiler can use as a
> cue and won't need to emit code for the checks providing the function
> does get inlined. That was enough to get the compiler to not emit the
> loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
> in the 2nd compiler window. I just don't know if putting something
> like that into the code is a good idea as if the function wasn't
> inlined for some reason, the extra len checks would have to be
> compiled into the function.
>
> David
>
> [1] https://godbolt.org/z/xa81ro8GK

Looking at it, that looks like an issue.

I mean, without the -DCHECK_LEN flag then the SIMD code will read up to 48 bytes
beyond the struct's memory (which is 16 bytes):

This is fine:
"
movdqu xmm0, XMMWORD PTR [rdi]
"

But I don't think it is:

"
movdqu xmm2, XMMWORD PTR [rdi+16]
movdqu xmm1, XMMWORD PTR [rdi+32]
movdqu xmm3, XMMWORD PTR [rdi+48]
"

given that the struct size is only 16 bytes.

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-11 06:19:50
Message-ID:	ZzGiBtUWWfgga9TH@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Sat, Nov 09, 2024 at 04:15:04AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> > I tried with [1] and the
> > latest gcc does not seem to be smart enough to figure this out. I
> > tried adding some additional len checks that the compiler can use as a
> > cue and won't need to emit code for the checks providing the function
> > does get inlined. That was enough to get the compiler to not emit the
> > loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
> > in the 2nd compiler window. I just don't know if putting something
> > like that into the code is a good idea as if the function wasn't
> > inlined for some reason, the extra len checks would have to be
> > compiled into the function.
> >
> > David
> >
> > [1] https://godbolt.org/z/xa81ro8GK
>
> Looking at it, that looks like an issue.
>
> I mean, without the -DCHECK_LEN flag then the SIMD code will read up to 48 bytes
> beyond the struct's memory (which is 16 bytes):
>
> This is fine:
> "
> movdqu xmm0, XMMWORD PTR [rdi]
> "
>
> But I don't think it is:
>
> "
> movdqu xmm2, XMMWORD PTR [rdi+16]
> movdqu xmm1, XMMWORD PTR [rdi+32]
> movdqu xmm3, XMMWORD PTR [rdi+48]
> "
>
> given that the struct size is only 16 bytes.
>
> Thoughts?

What about "simply" starting pg_memory_is_all_zeros() with?

"
if (len < sizeof(size_t)*8) {
while (p < end) {
if (*p++ != 0)
return false;
}
return true;
}
"

That way:

- we prevents reading beyond the memory area in the SIMD section (if < 64 bytes)
- we make sure that aligned_end can not be after the real end (could be if the
len is < 8 bytes)
- there is no need for additional size checks later in the code
- len < 64 bytes will be read byte per byte but that's likely "enough" (if not
faster) for those "small" sizes

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-11 17:07:51
Message-ID:	ZzI552QwMh4pF50N@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Mon, Nov 11, 2024 at 06:19:50AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Sat, Nov 09, 2024 at 04:15:04AM +0000, Bertrand Drouvot wrote:
> > Hi,
> >
> > On Fri, Nov 08, 2024 at 11:18:09PM +1300, David Rowley wrote:
> > > I tried with [1] and the
> > > latest gcc does not seem to be smart enough to figure this out. I
> > > tried adding some additional len checks that the compiler can use as a
> > > cue and won't need to emit code for the checks providing the function
> > > does get inlined. That was enough to get the compiler to not emit the
> > > loops when they'll not be used. See the -DCHECK_LEN flag I'm passing
> > > in the 2nd compiler window. I just don't know if putting something
> > > like that into the code is a good idea as if the function wasn't
> > > inlined for some reason, the extra len checks would have to be
> > > compiled into the function.
> > >
> > > David
> > >
> > > [1] https://godbolt.org/z/xa81ro8GK
> >
> > Looking at it, that looks like an issue.
> >
> > I mean, without the -DCHECK_LEN flag then the SIMD code will read up to 48 bytes
> > beyond the struct's memory (which is 16 bytes):
> >
> > This is fine:
> > "
> > movdqu xmm0, XMMWORD PTR [rdi]
> > "
> >
> > But I don't think it is:
> >
> > "
> > movdqu xmm2, XMMWORD PTR [rdi+16]
> > movdqu xmm1, XMMWORD PTR [rdi+32]
> > movdqu xmm3, XMMWORD PTR [rdi+48]
> > "
> >
> > given that the struct size is only 16 bytes.
> >
> > Thoughts?
>
> What about "simply" starting pg_memory_is_all_zeros() with?
>
> "
> if (len < sizeof(size_t)*8) {
> while (p < end) {
> if (*p++ != 0)
> return false;
> }
> return true;
> }
> "
>
> That way:
>
> - we prevents reading beyond the memory area in the SIMD section (if < 64 bytes)
> - we make sure that aligned_end can not be after the real end (could be if the
> len is < 8 bytes)
> - there is no need for additional size checks later in the code
> - len < 64 bytes will be read byte per byte but that's likely "enough" (if not
> faster) for those "small" sizes
>

To handle the "what about the len check if the function is not inlined?", I
can't think about a good approach.

I thought about using a macro like:

"
#define pg_memory_is_all_zeros(ptr, len) \
((len) < sizeof(size_t) * 8 ? \
pg_memory_is_all_zeros_small((ptr), (len)) : \
pg_memory_is_all_zeros_large((ptr), (len)))
"

but that would not help (as the len check would need to be done too, so same
run time cost).

I also thought about using __builtin_constant_p(len) but that would not help
because still we need the len check for safety.

So please find attached v11 that uses "only" one len check into the function to
ensure that we won't read beyond the memory area (should its size be < 64 or < 8).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v11-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	4.2 KB
v11-0002-Make-use-of-pg_memory_is_all_zeros-in-PageIsVeri.patch	text/x-diff	1.4 KB

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-12 03:28:53
Message-ID:	ZzLLdVBYC9Udi9mF@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 11, 2024 at 05:07:51PM +0000, Bertrand Drouvot wrote:
> To handle the "what about the len check if the function is not inlined?", I
> can't think about a good approach.

FWIW, my choice would be to not over-engineer things more than what's
in v10 posted at [1], hence do something without the exception case
where the size is less than 64b. We've proved that this would be
better for the 8k block case that does the size_t based comparison
anyway, without impacting the existing other cases with the pgstats
entries.

[1]: /message-id/Zy5LgNyHzOhnYTTy%40ip-10-97-1-34.eu-west-3.compute.internal
--
Michael

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-12 06:09:04
Message-ID:	ZzLxAJuGzyqA7cUo@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Nov 12, 2024 at 12:28:53PM +0900, Michael Paquier wrote:
> On Mon, Nov 11, 2024 at 05:07:51PM +0000, Bertrand Drouvot wrote:
> > To handle the "what about the len check if the function is not inlined?", I
> > can't think about a good approach.
>
> FWIW, my choice would be to not over-engineer things more than what's
> in v10 posted at [1], hence do something without the exception case
> where the size is less than 64b.

I think that the 64b len check done in v11 is mandatory for safety reasons.

1. First reason:

"
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
"

The loop above reads 64 bytes at once, so would read beyond the memory area bounds
if len < 64: That could cause crash or read invalid data.

It's observed in [1] (using the godbolt shared in [2]), where we can see:

"
movdqu xmm2, XMMWORD PTR [rdi+16]
movdqu xmm1, XMMWORD PTR [rdi+32]
movdqu xmm3, XMMWORD PTR [rdi+48]
"

while the struct size is 16 bytes (so we are reading 48 bytes beyond it).

2. Second reason

"
const unsigned char *aligned_end = (const unsigned char *)
((uintptr_t) end & (~(sizeof(size_t) - 1)));
"

aligned_end could be beyond the end for len < 8, so that we could read
invalid data or crash here:

"
for (; p < aligned_end; p += sizeof(size_t)) {
"

The len < 8 check is covered into the len < 64 check, so only the 64b check is
needed.

[1]: /message-id/Zy7hyG8JUMC5P2T3%40ip-10-97-1-34.eu-west-3.compute.internal
[2]: /message-id/CAApHDvp2jx_%3DpFbgj-O1_ZmzP9WOZKfwLzZrS_%3DZmbsqMQQ59g%40mail.gmail.com

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-12 06:56:13
Message-ID:	ZzL8DTzWJTlYmQN4@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 12, 2024 at 06:09:04AM +0000, Bertrand Drouvot wrote:
> I think that the 64b len check done in v11 is mandatory for safety reasons.
>
> The loop above reads 64 bytes at once, so would read beyond the memory area bounds
> if len < 64: That could cause crash or read invalid data.

Sorry, I was not following your argument. You're right that we need
something else here. However..

+ /*
+ * For len < 64, compare byte per byte to ensure we'll not read beyond the
+ * memory area.
+ */
+ if (len < sizeof(size_t) * 8)
+ {
+ while (p < end)
+ {
+ if (*p++ != 0)
+ return false;
+ }
+ return true;
+ }
+
+ /* Compare bytes until the pointer "p" is aligned */
+ while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
+ {
+ if (p == end)
+ return true;
+
+ if (*p++ != 0)
+ return false;
+ }
+

Still, this is not optimal, based on what's been discussed upthread.
The byte-per-byte check is more expensive than the size_t check, so
shouldn't you make sure that you stack some size_t checks if dealing
with something smaller than 64 bytes? That would be a bit more
complex, sure, but if you leave that within the block doing "len <
sizeof(size_t) * 8", perhaps that's OK.. Or just do what you
mentioned upthread with a second macro for sizes <= 64. You'd need
three steps in this first block rather than one:
- byte-per-byte, up to aligned location.
- size_t loop.
- Again byte-per-byte, until the end,
--
Michael

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-12 10:56:20
Message-ID:	ZzM0VNK161K95ZHn@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Nov 12, 2024 at 03:56:13PM +0900, Michael Paquier wrote:
> On Tue, Nov 12, 2024 at 06:09:04AM +0000, Bertrand Drouvot wrote:
> > I think that the 64b len check done in v11 is mandatory for safety reasons.
> >
> > The loop above reads 64 bytes at once, so would read beyond the memory area bounds
> > if len < 64: That could cause crash or read invalid data.
>
> Sorry, I was not following your argument. You're right that we need
> something else here. However..
>
> + /*
> + * For len < 64, compare byte per byte to ensure we'll not read beyond the
> + * memory area.
> + */
> + if (len < sizeof(size_t) * 8)
> + {
> + while (p < end)
> + {
> + if (*p++ != 0)
> + return false;
> + }
> + return true;
> + }
> +
> + /* Compare bytes until the pointer "p" is aligned */
> + while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
> + {
> + if (p == end)
> + return true;
> +
> + if (*p++ != 0)
> + return false;
> + }
> +
>
> Still, this is not optimal, based on what's been discussed upthread.
> The byte-per-byte check is more expensive than the size_t check,

I think that depends of the memory area size. If the size is small enough then the
byte per byte can be good enough.

For example, with the allzeros_small.c attached:

== with BLCKSZ 32

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 22528 nanoseconds
size_t: done in 6949 nanoseconds (3.24191 times faster than byte per byte)
SIMD v10: done in 7562 nanoseconds (2.97911 times faster than byte per byte)
SIMD v11: done in 22096 nanoseconds (1.01955 times faster than byte per byte)

== with BLCKSZ 63

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 29246 nanoseconds
size_t: done in 10555 nanoseconds (2.77082 times faster than byte per byte)
SIMD v10: done in 11220 nanoseconds (2.6066 times faster than byte per byte)
SIMD v11: done in 29126 nanoseconds (1.00412 times faster than byte per byte)

Obviously v11 is about the same time as "byte per byte" but we can see that the
size_t or v10 improvment is not that much for small size.

While for larger size:

== with BLCKSZ 256

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 102703 nanoseconds
size_t: done in 15381 nanoseconds (6.67726 times faster than byte per byte)
SIMD v10: done in 7241 nanoseconds (14.1835 times faster than byte per byte)
SIMD v11: done in 7899 nanoseconds (13.002 times faster than byte per byte)

== with BLCKSZ 8192

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 2993458 nanoseconds
size_t: done in 436650 nanoseconds (6.85551 times faster than byte per byte)
SIMD v10: done in 136413 nanoseconds (21.9441 times faster than byte per byte)
SIMD v11: done in 155474 nanoseconds (19.2538 times faster than byte per byte)

It's sensitive improvment.

> shouldn't you make sure that you stack some size_t checks if dealing
> with something smaller than 64 bytes?

Based on the above I've the feeling that doing byte per byte comparison for
small size only (< 64b) is good enough. I'm not sure that adding extra complexity
for small sizes is worth it.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
allzeros_small.c	text/x-csrc	7.2 KB

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-12 16:32:36
Message-ID:	CAEudQAqLKUaaaQMtJEUmQaCKxB=sU_2CcN=s7TeWAtRCx4oaPw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em ter., 12 de nov. de 2024 às 07:56, Bertrand Drouvot <
bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:

> Hi,
>
> On Tue, Nov 12, 2024 at 03:56:13PM +0900, Michael Paquier wrote:
> > On Tue, Nov 12, 2024 at 06:09:04AM +0000, Bertrand Drouvot wrote:
> > > I think that the 64b len check done in v11 is mandatory for safety
> reasons.
> > >
> > > The loop above reads 64 bytes at once, so would read beyond the memory
> area bounds
> > > if len < 64: That could cause crash or read invalid data.
> >
> > Sorry, I was not following your argument. You're right that we need
> > something else here. However..
> >
> > + /*
> > + * For len < 64, compare byte per byte to ensure we'll not read
> beyond the
> > + * memory area.
> > + */
> > + if (len < sizeof(size_t) * 8)
> > + {
> > + while (p < end)
> > + {
> > + if (*p++ != 0)
> > + return false;
> > + }
> > + return true;
> > + }
> > +
> > + /* Compare bytes until the pointer "p" is aligned */
> > + while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
> > + {
> > + if (p == end)
> > + return true;
> > +
> > + if (*p++ != 0)
> > + return false;
> > + }
> > +
> >
> > Still, this is not optimal, based on what's been discussed upthread.
> > The byte-per-byte check is more expensive than the size_t check,
>
> I think that depends of the memory area size. If the size is small enough
> then the
> byte per byte can be good enough.
>
> For example, with the allzeros_small.c attached:
>
It seems to me that it is enough to protect the SIMD loop when the size is
smaller.

if (len > sizeof(size_t) * 8)
{
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) *
8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
}

See v1_allzeros_small.c attached.

>
> == with BLCKSZ 32
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 22528 nanoseconds
> size_t: done in 6949 nanoseconds (3.24191 times faster than byte per byte)
> SIMD v10: done in 7562 nanoseconds (2.97911 times faster than byte per
> byte)
> SIMD v11: done in 22096 nanoseconds (1.01955 times faster than byte per
> byte)
>
gcc -march=native -O2 v1_allzeros_small.c -o v1_allzeros_small ;
./v1_allzeros_small
byte per byte: done in 97345 nanoseconds
size_t: done in 20305 nanoseconds (4.79414 times faster than byte per byte)
SIMD v10: done in 25813 nanoseconds (3.77116 times faster than byte per
byte)
SIMD v11: done in 24580 nanoseconds (3.96033 times faster than byte per
byte)

>
> == with BLCKSZ 63
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 29246 nanoseconds
> size_t: done in 10555 nanoseconds (2.77082 times faster than byte per byte)
> SIMD v10: done in 11220 nanoseconds (2.6066 times faster than byte per
> byte)
> SIMD v11: done in 29126 nanoseconds (1.00412 times faster than byte per
> byte)
>
gcc -march=native -O2 v1_allzeros_small.c -o v1_allzeros_small ;
./v1_allzeros_small
byte per byte: done in 57763 nanoseconds
size_t: done in 19760 nanoseconds (2.92323 times faster than byte per byte)
SIMD v10: done in 24088 nanoseconds (2.398 times faster than byte per byte)
SIMD v11: done in 20151 nanoseconds (2.86651 times faster than byte per
byte)

> Obviously v11 is about the same time as "byte per byte" but we can see
> that the
> size_t or v10 improvment is not that much for small size.
>
> While for larger size:
>
> == with BLCKSZ 256
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 102703 nanoseconds
> size_t: done in 15381 nanoseconds (6.67726 times faster than byte per byte)
> SIMD v10: done in 7241 nanoseconds (14.1835 times faster than byte per
> byte)
> SIMD v11: done in 7899 nanoseconds (13.002 times faster than byte per byte)
>
gcc -march=native -O2 v1_allzeros_small.c -o v1_allzeros_small ;
./v1_allzeros_small
byte per byte: done in 213276 nanoseconds
size_t: done in 45288 nanoseconds (4.70933 times faster than byte per byte)
SIMD v10: done in 15840 nanoseconds (13.4644 times faster than byte per
byte)
SIMD v11: done in 15773 nanoseconds (13.5216 times faster than byte per
byte)

> == with BLCKSZ 8192
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 2993458 nanoseconds
> size_t: done in 436650 nanoseconds (6.85551 times faster than byte per
> byte)
> SIMD v10: done in 136413 nanoseconds (21.9441 times faster than byte per
> byte)
> SIMD v11: done in 155474 nanoseconds (19.2538 times faster than byte per
> byte)
>
gcc -march=native -O2 v1_allzeros_small.c -o v1_allzeros_small ;
./v1_allzeros_small
byte per byte: done in 10358761 nanoseconds
size_t: done in 864673 nanoseconds (11.98 times faster than byte per byte)
SIMD v10: done in 342880 nanoseconds (30.211 times faster than byte per
byte)
SIMD v11: done in 341332 nanoseconds (30.3481 times faster than byte per
byte)

best regards,
Ranier Vilela

Attachment	Content-Type	Size
v1_allzeros_small.c	text/x-csrc	7.2 KB

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-13 00:25:37
Message-ID:	ZzPyAZmanAZDa8ir@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 12, 2024 at 10:56:20AM +0000, Bertrand Drouvot wrote:
> I think that depends of the memory area size. If the size is small enough then the
> byte per byte can be good enough.
>
> For example, with the allzeros_small.c attached:
>
> == with BLCKSZ 32
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 22528 nanoseconds
> size_t: done in 6949 nanoseconds (3.24191 times faster than byte per byte)
> SIMD v10: done in 7562 nanoseconds (2.97911 times faster than byte per byte)
> SIMD v11: done in 22096 nanoseconds (1.01955 times faster than byte per byte)

Some numbers from here, for the same test case at 32 bytes, with an
older version of gcc:
$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ;
./allzeros_small
byte per byte: done in 28193 nanoseconds
size_t: done in 4382 nanoseconds (6.43382 times faster than byte per byte)
SIMD v10: done in 8074 nanoseconds (3.49183 times faster than byte per byte)
SIMD v11: done in 26970 nanoseconds (1.04535 times faster than byte per byte)

> == with BLCKSZ 63
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
> byte per byte: done in 29246 nanoseconds
> size_t: done in 10555 nanoseconds (2.77082 times faster than byte per byte)
> SIMD v10: done in 11220 nanoseconds (2.6066 times faster than byte per byte)
> SIMD v11: done in 29126 nanoseconds (1.00412 times faster than byte per byte)
>
> Obviously v11 is about the same time as "byte per byte" but we can see that the
> size_t or v10 improvment is not that much for small size.

For 63 bytes:
byte per byte: done in 52611 nanoseconds
size_t: done in 21309 nanoseconds (2.46896 times faster than byte per byte)
SIMD v10: done in 16181 nanoseconds (3.25141 times faster than byte per byte)
SIMD v11: done in 51931 nanoseconds (1.01309 times faster than byte per byte)

> While for larger size:
>
> It's sensitive improvment.

Yep, for large sizes.

> Based on the above I've the feeling that doing byte per byte comparison for
> small size only (< 64b) is good enough. I'm not sure that adding extra complexity
> for small sizes is worth it.

Well, this is also telling us that we are at least 2 times faster if
we use allzeros_size_t() for areas smaller than 64 bytes rather than
allzeros_byte_per_byte() per your measurement, and I'm seeing even
faster numbers. So that seems worth the addition, especially for
smaller sizes where this is 6 times faster here.
--
Michael

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-13 00:33:31
Message-ID:	ZzPz21lsle97ixU7@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 12, 2024 at 01:32:36PM -0300, Ranier Vilela wrote:
> See v1_allzeros_small.c attached.

In your pg_memory_is_all_zeros_v11:
while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
{
if (p == end)
return true;

if (*p++ != 0)
return false;
}

if (len > sizeof(size_t) * 8)
{
for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
}

If I'm reading that right, this could still read a couple of bytes
past the wanted memory area. For example, imagine a case of 65 bytes
with a location a bit unaligned (more than 2 bytes). You'd want to
check the remaining size after the first loop, not the initial one.

I'd be OK to have a quick loop for the less-than-64-byte case rather
than more checks depending on sizeof(size_t) spread, like Bertrand is
suggesting. I'd like to imagine that compilers would like that a bit
better, though I am not completely sure, either.
--
Michael

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-13 00:51:23
Message-ID:	CAEudQApWsEr05z4vr+ckv6fvhnHGWh+0pT1eqyJVzS6yvyPHXQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em ter., 12 de nov. de 2024 às 21:33, Michael Paquier <michael(at)paquier(dot)xyz>
escreveu:

> On Tue, Nov 12, 2024 at 01:32:36PM -0300, Ranier Vilela wrote:
> > See v1_allzeros_small.c attached.
>
> In your pg_memory_is_all_zeros_v11:
> while (((uintptr_t) p & (sizeof(size_t) - 1)) != 0)
> {
> if (p == end)
> return true;
>
> if (*p++ != 0)
> return false;
> }
>
> if (len > sizeof(size_t) * 8)
> {
> for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) *
> 8)
> {
> if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
> (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
> (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
> (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
> return false;
> }
> }
>
> If I'm reading that right, this could still read a couple of bytes
> past the wanted memory area.

Yeah, this is possible.

> For example, imagine a case of 65 bytes
> with a location a bit unaligned (more than 2 bytes). You'd want to
> check the remaining size after the first loop, not the initial one.
>
> I'd be OK to have a quick loop for the less-than-64-byte case rather
> than more checks depending on sizeof(size_t) spread, like Bertrand is
> suggesting.

I'm ok too. Maybe we are trying to optimize early.

best regards,
Ranier Vilela

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-13 06:41:08
Message-ID:	ZzRKBKgxgiF+GVry@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Nov 12, 2024 at 01:32:36PM -0300, Ranier Vilela wrote:
> It seems to me that it is enough to protect the SIMD loop when the size is
> smaller.
>
> if (len > sizeof(size_t) * 8)
> {
> for (; p < aligned_end - (sizeof(size_t) * 7); p += sizeof(size_t) *
> 8)
> {
> if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
> (((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
> (((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
> (((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
> return false;
> }
> }
>
> See v1_allzeros_small.c attached.

Thanks for looking at it!

It's not enough, as that would not fix the second reason mentioned in [1].

[1]: /message-id/ZzLxAJuGzyqA7cUo%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-13 07:50:50
Message-ID:	ZzRaWs4KxgPnJGNO@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Wed, Nov 13, 2024 at 09:25:37AM +0900, Michael Paquier wrote:
> So that seems worth the addition, especially for
> smaller sizes where this is 6 times faster here.

So, something like v12 in pg_memory_is_all_zeros_v12() in allzeros_small.c
attached?

If so, that gives us:

== with BLCKSZ 32

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 22421 nanoseconds
size_t: done in 7269 nanoseconds (3.08447 times faster than byte per byte)
SIMD v10: done in 6349 nanoseconds (3.53142 times faster than byte per byte)
SIMD v11: done in 22080 nanoseconds (1.01544 times faster than byte per byte)
SIMD v12: done in 5595 nanoseconds (4.00733 times faster than byte per byte)

== with BLCKSZ 63

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 29525 nanoseconds
size_t: done in 11232 nanoseconds (2.62865 times faster than byte per byte)
SIMD v10: done in 10828 nanoseconds (2.72673 times faster than byte per byte)
SIMD v11: done in 42056 nanoseconds (0.70204 times faster than byte per byte)
SIMD v12: done in 10468 nanoseconds (2.8205 times faster than byte per byte)

== with BLCKSZ 256

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 120483 nanoseconds
size_t: done in 23098 nanoseconds (5.21617 times faster than byte per byte)
SIMD v10: done in 6737 nanoseconds (17.8838 times faster than byte per byte)
SIMD v11: done in 6621 nanoseconds (18.1971 times faster than byte per byte)
SIMD v12: done in 6519 nanoseconds (18.4818 times faster than byte per byte)

== with BLCKSZ 8192

$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 3393459 nanoseconds
size_t: done in 707304 nanoseconds (4.79774 times faster than byte per byte)
SIMD v10: done in 233559 nanoseconds (14.5293 times faster than byte per byte)
SIMD v11: done in 225951 nanoseconds (15.0186 times faster than byte per byte)
SIMD v12: done in 225766 nanoseconds (15.0309 times faster than byte per byte)

That's better for small size but given the extra len checks that
has been added I think we're back to David's point in [1]: What if the function
is not inlined for some reason?

So, out of curiosity, let's see what happens if not inlined in [2] (see the
-O2 -DNOT_INLINE compiler window):

- if a[3]: it looks like gcc is smart enough to create an optimized version
for that size using constant propagation
- if a[63]: Same as above
- if a[256]: Same as above
- if a[8192]: Same as above

I did a quick check with clang and it looks like it is not as smart as gcc
for the non inline case.

Anyway it's not like we have the choice: we need (at least) one len check for
safety reason (to not crash or read invalid data).

So, I'd vote for pg_memory_is_all_zeros_v12() then, thoughts?

[1]: /message-id/CAApHDvp2jx_%3DpFbgj-O1_ZmzP9WOZKfwLzZrS_%3DZmbsqMQQ59g%40mail.gmail.com
[2]: https://godbolt.org/z/8s44GKqcc

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
allzeros_small.c	text/x-csrc	10.3 KB

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-13 13:28:35
Message-ID:	CAEudQAqGprE1QjctScmJqpSM_vgqOmKsChMiPUeKX1R8MrPZ-A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em qua., 13 de nov. de 2024 às 04:50, Bertrand Drouvot <
bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:

> Hi,
>
> On Wed, Nov 13, 2024 at 09:25:37AM +0900, Michael Paquier wrote:
> > So that seems worth the addition, especially for
> > smaller sizes where this is 6 times faster here.
>
> So, something like v12 in pg_memory_is_all_zeros_v12() in allzeros_small.c
> attached?
>
I ran the latest version (allzeros_small.c) with v12.

>
> If so, that gives us:
>
> == with BLCKSZ 32
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 22421 nanoseconds
> size_t: done in 7269 nanoseconds (3.08447 times faster than byte per byte)
> SIMD v10: done in 6349 nanoseconds (3.53142 times faster than byte per
> byte)
> SIMD v11: done in 22080 nanoseconds (1.01544 times faster than byte per
> byte)
> SIMD v12: done in 5595 nanoseconds (4.00733 times faster than byte per
> byte)
>
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ;
./allzeros_small
byte per byte: done in 43882 nanoseconds
size_t: done in 8845 nanoseconds (4.96122 times faster than byte per byte)
SIMD v10: done in 10673 nanoseconds (4.1115 times faster than byte per byte)
SIMD v11: done in 29177 nanoseconds (1.50399 times faster than byte per
byte)
SIMD v12: done in 9992 nanoseconds (4.39171 times faster than byte per
byte)

> == with BLCKSZ 63
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 29525 nanoseconds
> size_t: done in 11232 nanoseconds (2.62865 times faster than byte per byte)
> SIMD v10: done in 10828 nanoseconds (2.72673 times faster than byte per
> byte)
> SIMD v11: done in 42056 nanoseconds (0.70204 times faster than byte per
> byte)
> SIMD v12: done in 10468 nanoseconds (2.8205 times faster than byte per
> byte)
>
gcc -march=native -O2 allzeros_small.c -o allzeros_small ; ./allzeros_small
byte per byte: done in 68887 nanoseconds
size_t: done in 20147 nanoseconds (3.41922 times faster than byte per byte)
SIMD v10: done in 21410 nanoseconds (3.21752 times faster than byte per
byte)
SIMD v11: done in 56987 nanoseconds (1.20882 times faster than byte per
byte)
SIMD v12: done in 25102 nanoseconds (2.74428 times faster than byte per
byte)

>
> == with BLCKSZ 256
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 120483 nanoseconds
> size_t: done in 23098 nanoseconds (5.21617 times faster than byte per byte)
> SIMD v10: done in 6737 nanoseconds (17.8838 times faster than byte per
> byte)
> SIMD v11: done in 6621 nanoseconds (18.1971 times faster than byte per
> byte)
> SIMD v12: done in 6519 nanoseconds (18.4818 times faster than byte per
> byte)
>
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ;
./allzeros_small
byte per byte: done in 211759 nanoseconds
size_t: done in 45879 nanoseconds (4.6156 times faster than byte per byte)
SIMD v10: done in 12262 nanoseconds (17.2695 times faster than byte per
byte)
SIMD v11: done in 12018 nanoseconds (17.6202 times faster than byte per
byte)
SIMD v12: done in 11993 nanoseconds (17.6569 times faster than byte per
byte)

>
> == with BLCKSZ 8192
>
> $ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_small.c
> -o allzeros_small ; ./allzeros_small
> byte per byte: done in 3393459 nanoseconds
> size_t: done in 707304 nanoseconds (4.79774 times faster than byte per
> byte)
> SIMD v10: done in 233559 nanoseconds (14.5293 times faster than byte per
> byte)
> SIMD v11: done in 225951 nanoseconds (15.0186 times faster than byte per
> byte)
> SIMD v12: done in 225766 nanoseconds (15.0309 times faster than byte per
> byte)
>
$ gcc -march=native -O2 allzeros_small.c -o allzeros_small ;
./allzeros_small
byte per byte: done in 12786295 nanoseconds
size_t: done in 1071590 nanoseconds (11.9321 times faster than byte per
byte)
SIMD v10: done in 413219 nanoseconds (30.9431 times faster than byte per
byte)
SIMD v11: done in 423469 nanoseconds (30.1942 times faster than byte per
byte)
SIMD v12: done in 414106 nanoseconds (30.8769 times faster than byte per
byte

>
> That's better for small size but given the extra len checks that
> has been added I think we're back to David's point in [1]: What if the
> function
> is not inlined for some reason?
>
> So, out of curiosity, let's see what happens if not inlined in [2] (see the
> -O2 -DNOT_INLINE compiler window):
>
> - if a[3]: it looks like gcc is smart enough to create an optimized version
> for that size using constant propagation
> - if a[63]: Same as above
> - if a[256]: Same as above
> - if a[8192]: Same as above
>
> I did a quick check with clang and it looks like it is not as smart as gcc
> for the non inline case.
>
> Anyway it's not like we have the choice: we need (at least) one len check
> for
> safety reason (to not crash or read invalid data).
>
> So, I'd vote for pg_memory_is_all_zeros_v12() then, thoughts?
>
I think that's good enough.

best regards,
Ranier Vilela

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-14 00:27:06
Message-ID:	ZzVD2tUN-hSVT6xG@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Nov 13, 2024 at 07:50:50AM +0000, Bertrand Drouvot wrote:
> I did a quick check with clang and it looks like it is not as smart as gcc
> for the non inline case.

Not as much, still smart enough to skip the > 64B part when dealing
with a structure that does not require it. So it's actually still
good considering where we are at now on HEAD for the 8kB all-zero page
case.

> Anyway it's not like we have the choice: we need (at least) one len check for
> safety reason (to not crash or read invalid data).
>
> So, I'd vote for pg_memory_is_all_zeros_v12() then, thoughts?

Makes sense to me to just do that, with a first < 8 loop, and a second
for the 8~63 range. And I can understand the code I read here as it
self-documents what it does.

* There is no risk to read beyond the memory area thanks to the len < 64
* check done below.

This comment should do a s/below/above/ and a s/check/checks/, as it
refers to the two checks done before the trick with the 64B-per-loop
check.

There is also a "cant'" in the last size_t check. Simple typo.
--
Michael

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-14 10:09:49
Message-ID:	ZzXMbY+1BPEvcEnY@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> Makes sense to me to just do that, with a first < 8 loop, and a second
> for the 8~63 range.

Thanks for looking at it!

> There is also a "cant'" in the last size_t check. Simple typo.

Please find attached v12, with more comments and comments changes to explain
the multiple cases (for safety) and phases (for efficiency).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v12-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	5.5 KB
v12-0002-Make-use-of-pg_memory_is_all_zeros-in-PageIsVeri.patch	text/x-diff	1.4 KB

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-14 11:22:23
Message-ID:	CAEudQArxZZk2YiZ8nvGym26yTtNoCiS6bTtmdWq1QvKayevOZQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em qui., 14 de nov. de 2024 às 07:09, Bertrand Drouvot <
bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:

> Hi,
>
> On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> > Makes sense to me to just do that, with a first < 8 loop, and a second
> > for the 8~63 range.
>
> Thanks for looking at it!
>
> > There is also a "cant'" in the last size_t check. Simple typo.
>
> Please find attached v12, with more comments and comments changes to
> explain
> the multiple cases (for safety) and phases (for efficiency).
>
Is it worth mentioning that pg_memory_is_all_zeros does not work correctly
on 32-bit systems?

(63 < (size_t) * 8) /* 63 - 32*/

Or do we adjust magic constants according to 32/64 bit?

best regards,
Ranier Vilela

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-14 11:58:01
Message-ID:	ZzXlydqdb9ajBhA4@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Thu, Nov 14, 2024 at 08:22:23AM -0300, Ranier Vilela wrote:
> Em qui., 14 de nov. de 2024 às 07:09, Bertrand Drouvot <
> bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:
>
> > Hi,
> >
> > On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> > > Makes sense to me to just do that, with a first < 8 loop, and a second
> > > for the 8~63 range.
> >
> > Thanks for looking at it!
> >
> > > There is also a "cant'" in the last size_t check. Simple typo.
> >
> > Please find attached v12, with more comments and comments changes to
> > explain
> > the multiple cases (for safety) and phases (for efficiency).
> >
> Is it worth mentioning that pg_memory_is_all_zeros does not work correctly
> on 32-bit systems?
>
> (63 < (size_t) * 8) /* 63 - 32*/

I think that the code is fully portable on 32-bit systems as it's using size_t
in all the places.

I agree that the comments are "64-bit" focused though, but I don't think that's
an issue (as I think it's already the case in multiple places in the core code).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-14 12:13:19
Message-ID:	CAEudQApQP_25qBMvP2m2LYRQ_B=_ZWzDgFVO-5jA4sh5DfjOCg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em qui., 14 de nov. de 2024 às 08:58, Bertrand Drouvot <
bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:

> Hi,
>
> On Thu, Nov 14, 2024 at 08:22:23AM -0300, Ranier Vilela wrote:
> > Em qui., 14 de nov. de 2024 ąs 07:09, Bertrand Drouvot <
> > bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:
> >
> > > Hi,
> > >
> > > On Thu, Nov 14, 2024 at 09:27:06AM +0900, Michael Paquier wrote:
> > > > Makes sense to me to just do that, with a first < 8 loop, and a
> second
> > > > for the 8~63 range.
> > >
> > > Thanks for looking at it!
> > >
> > > > There is also a "cant'" in the last size_t check. Simple typo.
> > >
> > > Please find attached v12, with more comments and comments changes to
> > > explain
> > > the multiple cases (for safety) and phases (for efficiency).
> > >
> > Is it worth mentioning that pg_memory_is_all_zeros does not work
> correctly
> > on 32-bit systems?
> >
> > (63 < (size_t) * 8) /* 63 - 32*/
>
> I think that the code is fully portable on 32-bit systems as it's using
> size_t
> in all the places.
>
Maybe I'm doing something wrong.
But I'm testing in 32-bit, with the size set to 63, with v12 and I'm seeing
the SIMD loop execute.
Because the test
if (len < sizeof(size_t) * 8) // 8-63 bytes
failed.

I expected that with size 63, it would be solved by case 2, or am I wrong?

best regards,
Ranier Vilela

PS. Windows 11 64 bits
msvc 32 bits 2022

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-14 12:33:20
Message-ID:	ZzXuECLOduurf4NX@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Thu, Nov 14, 2024 at 09:13:19AM -0300, Ranier Vilela wrote:
> Em qui., 14 de nov. de 2024 às 08:58, Bertrand Drouvot <
> Maybe I'm doing something wrong.
> But I'm testing in 32-bit, with the size set to 63, with v12 and I'm seeing
> the SIMD loop execute.

Yeah, that's expected and safe as each iteration reads 32 bytes on 32-bit.

> if (len < sizeof(size_t) * 8) // 8-63 bytes
> failed.
>
> I expected that with size 63, it would be solved by case 2, or am I wrong?

Case 2 should be read as "in the 4-31" bytes range on 32-bit system as all
comparisons are done in size_t.

What would be unsafe on 32-bit would be to read up to 32 bytes while len < 32
and that can not happen.

As mentioned up-thread the comments are wrong on 32-bit, indeed they must be read
as:

Case 1: len < 4 bytes
Case 2: len in the 4-31 bytes range
Case 3: len >= 32 bytes

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-15 00:30:25
Message-ID:	ZzaWIQJ6sU8UuQ-u@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Nov 14, 2024 at 12:33:20PM +0000, Bertrand Drouvot wrote:
> Case 2 should be read as "in the 4-31" bytes range on 32-bit system as all
> comparisons are done in size_t.

I'd suggest to use a -m32 in your gcc switches, when it comes to
tests, but you already know that.. Anyway, as you say, the
portability of v12 is OK even for sizeof(size_t) == 4 because we don't
rely on any hardcoded values, and this patch does what it should in
this case (double-checked myself manually for the three cases with
-m32).

> What would be unsafe on 32-bit would be to read up to 32 bytes while len < 32
> and that can not happen.
>
> As mentioned up-thread the comments are wrong on 32-bit, indeed they must be read
> as:
>
> Case 1: len < 4 bytes
> Case 2: len in the 4-31 bytes range
> Case 3: len >= 32 bytes

This part could be indeed better than what's proposed in v12, so I
would recommend to use sizeof(size_t) a bit more consistently rather
than have the reader guess that. Note that some parts of v12-0001 use
sizeof(size_t) in the comments, which makes things inconsistent.

Some comments feel duplicated, as well, like the "no risk" mentions,
which are clear enough based on the description and the limitations of
the previous cases. I'd like to suggest a few tweaks, making the
comments more flexible. See 0003 that applies on top of your latest
patch set, reattaching v12 again.
--
Michael

Attachment	Content-Type	Size
v13-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	5.5 KB
v13-0002-Make-use-of-pg_memory_is_all_zeros-in-PageIsVeri.patch	text/x-diff	1.4 KB
v13-0003-Tweak-some-comments-in-0001.patch	text/x-diff	4.4 KB

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-15 06:46:52
Message-ID:	ZzbuXK+Mo15/MVLJ@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Fri, Nov 15, 2024 at 09:30:25AM +0900, Michael Paquier wrote:
> On Thu, Nov 14, 2024 at 12:33:20PM +0000, Bertrand Drouvot wrote:
> Anyway, as you say, the
> portability of v12 is OK even for sizeof(size_t) == 4 because we don't
> rely on any hardcoded values, and this patch does what it should in
> this case (double-checked myself manually for the three cases with
> -m32).

Yeah, thanks for the testing!

> > What would be unsafe on 32-bit would be to read up to 32 bytes while len < 32
> > and that can not happen.
> >
> > As mentioned up-thread the comments are wrong on 32-bit, indeed they must be read
> > as:
> >
> > Case 1: len < 4 bytes
> > Case 2: len in the 4-31 bytes range
> > Case 3: len >= 32 bytes
>
> This part could be indeed better than what's proposed in v12, so I
> would recommend to use sizeof(size_t) a bit more consistently rather
> than have the reader guess that.

Makes sense even if that looks "more difficult" to read.

> Some comments feel duplicated, as well, like the "no risk" mentions,
> which are clear enough based on the description and the limitations of
> the previous cases. I'd like to suggest a few tweaks, making the
> comments more flexible. See 0003 that applies on top of your latest
> patch set, reattaching v12 again.

Thanks! Applied on v13 attached, except for things like:

"
- /* Compare bytes until the pointer "p" is aligned */
+ /* Compare bytes until the pointer "p" is aligned. */
"

which is adding a "." at the end of single line comments (as the few already
part of this file don't do so).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment	Content-Type	Size
v13-0001-Optimize-pg_memory_is_all_zeros.patch	text/x-diff	5.4 KB
v13-0002-Make-use-of-pg_memory_is_all_zeros-in-PageIsVeri.patch	text/x-diff	1.4 KB

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-15 07:49:26
Message-ID:	Zzb9BqUxuXWk34Vt@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 15, 2024 at 06:46:52AM +0000, Bertrand Drouvot wrote:
> Makes sense even if that looks "more difficult" to read.

I don't think it's that "bad".

> which is adding a "." at the end of single line comments (as the few already
> part of this file don't do so).

I did so after looking at the surroundings. One way or the other is
fine by me as long as we're consistent.

I still need to have a second look. For now, let's wait a bit, in
case there are any comments.
--
Michael

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-15 12:54:33
Message-ID:	CAEudQAr3TTTs7et_ke8V0JhggmPnYOJdwZPYuZePF7MPjcn-yQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em sex., 15 de nov. de 2024 às 03:46, Bertrand Drouvot <
bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:

> Hi,
>
> On Fri, Nov 15, 2024 at 09:30:25AM +0900, Michael Paquier wrote:
> > On Thu, Nov 14, 2024 at 12:33:20PM +0000, Bertrand Drouvot wrote:
> > Anyway, as you say, the
> > portability of v12 is OK even for sizeof(size_t) == 4 because we don't
> > rely on any hardcoded values, and this patch does what it should in
> > this case (double-checked myself manually for the three cases with
> > -m32).
>
> Yeah, thanks for the testing!
>
> > > What would be unsafe on 32-bit would be to read up to 32 bytes while
> len < 32
> > > and that can not happen.
> > >
> > > As mentioned up-thread the comments are wrong on 32-bit, indeed they
> must be read
> > > as:
> > >
> > > Case 1: len < 4 bytes
> > > Case 2: len in the 4-31 bytes range
> > > Case 3: len >= 32 bytes
> >
> > This part could be indeed better than what's proposed in v12, so I
> > would recommend to use sizeof(size_t) a bit more consistently rather
> > than have the reader guess that.
>
> Makes sense even if that looks "more difficult" to read.
>
> > Some comments feel duplicated, as well, like the "no risk" mentions,
> > which are clear enough based on the description and the limitations of
> > the previous cases. I'd like to suggest a few tweaks, making the
> > comments more flexible. See 0003 that applies on top of your latest
> > patch set, reattaching v12 again.
>
> Thanks! Applied on v13 attached, except for things like:
>
> "
> - /* Compare bytes until the pointer "p" is aligned */
> + /* Compare bytes until the pointer "p" is aligned. */
> "
>
> which is adding a "." at the end of single line comments (as the few
> already
> part of this file don't do so).
>
There is a tiny typo with V13.
+ /* "len" in the [sizeof(size_t) * 8, inf] range */

But, I'm not sure if I'm still doing something wrong.
If so, forgive me for the noise.

In the v3_allzeros_check.c attached,
the results is:
cc -march=native -O2 v3_allzeros_check.c -o v3_allzeros_check ;
./v3_allzeros_check
pagebytes[BLCKSZ-2]=1
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
SIMD v14: is_allzeros

Of course I expected "not is_allzeros".

Anyway, I made another attempt to optimize a bit more, I don't know if it's
safe though.
results with v3_allzeros_small.c attached:
WITH 8192 BLCKSZ
Ubuntu 22.04 64 bits

gcc -march=native -O2 v3_allzeros_small.c -o v3_allzeros_small ;
./v3_allzeros_small
byte per byte: done in 5027744 nanoseconds
size_t: done in 382521 nanoseconds (13.1437 times faster than byte per byte)
SIMD v10: done in 157777 nanoseconds (31.8661 times faster than byte per
byte)
SIMD v11: done in 159696 nanoseconds (31.4832 times faster than byte per
byte)
SIMD v12: done in 168117 nanoseconds (29.9062 times faster than byte per
byte)
SIMD v14: done in 21008 nanoseconds (239.325 times faster than byte per
byte)

best regards,
Ranier Vilela

Attachment	Content-Type	Size
v3_allzeros_small.c	text/x-csrc	14.3 KB
v3_allzeros_check.c	text/x-csrc	13.0 KB

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-15 14:43:32
Message-ID:	ZzdeFPnAdtYBu/tm@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote:
> There is a tiny typo with V13.
> + /* "len" in the [sizeof(size_t) * 8, inf] range */

I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity can not be included
into a interval.

Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at the end)
might be the proper notation from a mathematical point of view.

> But, I'm not sure if I'm still doing something wrong.
> If so, forgive me for the noise.
>
> Of course I expected "not is_allzeros".

That's the test case which is "wrong" (not the function):

"
size_t pagebytes[BLCKSZ] = {0};
volatile bool result;

pagebytes[BLCKSZ-2] = 1;

result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
"

The pagebytes is an array of size_t (8 bytes each), so the actual array size
is 8192 * 8 = 65536 bytes.

So, pagebytes[BLCKSZ-2] = 1, sets byte 65528 ((8192-2)*8) to 1.

But the function is checking up to BLCKSZ bytes (8192), so the results you
observed (which are correct).

> Anyway, I made another attempt to optimize a bit more, I don't know if it's
> safe though.

There is an issue in your v14, it calls:

"
return pg_memory_is_all_zeros_simd(ptr, ptr + len);
"

but you defined it that way:

"
static inline bool
pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end)

while that should be:

"
static inline bool
pg_memory_is_all_zeros_simd(const void *p, const void *end)
"

Doing so, I do not observe any improvments with v14.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-16 14:40:58
Message-ID:	CAEudQApzEUAg+EYs=jtvSTA7E7zZ21vc5Xcz64Ra_1FAxq8LcA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em sex., 15 de nov. de 2024 às 11:43, Bertrand Drouvot <
bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:

> Hi,
>
> On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote:
> > There is a tiny typo with V13.
> > + /* "len" in the [sizeof(size_t) * 8, inf] range */
>
> I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity can not be
> included
> into a interval.
>
> Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at
> the end)
> might be the proper notation from a mathematical point of view.
>
Thanks for clarifying.

>
> > But, I'm not sure if I'm still doing something wrong.
> > If so, forgive me for the noise.
> >
> > Of course I expected "not is_allzeros".
>
> That's the test case which is "wrong" (not the function):
>
> "
> size_t pagebytes[BLCKSZ] = {0};
> volatile bool result;
>
> pagebytes[BLCKSZ-2] = 1;
>
> result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
> "
>
> The pagebytes is an array of size_t (8 bytes each), so the actual array
> size
> is 8192 * 8 = 65536 bytes.
>
> So, pagebytes[BLCKSZ-2] = 1, sets byte 65528 ((8192-2)*8) to 1.
>
> But the function is checking up to BLCKSZ bytes (8192), so the results you
> observed (which are correct).
>
Thanks for pointing out my mistake.

>
> > Anyway, I made another attempt to optimize a bit more, I don't know if
> it's
> > safe though.
>
> There is an issue in your v14, it calls:
>
> "
> return pg_memory_is_all_zeros_simd(ptr, ptr + len);
> "
>
> but you defined it that way:
>
> "
> static inline bool
> pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end)
>
> "
>
> while that should be:
>
> "
> static inline bool
> pg_memory_is_all_zeros_simd(const void *p, const void *end)
>
What I'm trying here, obviously, is a hack.
If it works, and the compiler accepts it, it's ok for me.

> "
>
> Doing so, I do not observe any improvments with v14.
>
So.
Again new results from v4_allzeros_small.c attached:
Linux Ubuntu 22.04
gcc 13 64 bits

With BLCKSZ 32
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
./v4_allzeros_small
byte per byte: done in 44092 nanoseconds
size_t: done in 13456 nanoseconds (3.27675 times faster than byte per byte)
SIMD v10: done in 14249 nanoseconds (3.09439 times faster than byte per
byte)
SIMD v11: done in 32516 nanoseconds (1.35601 times faster than byte per
byte)
SIMD v12: done in 14973 nanoseconds (2.94477 times faster than byte per
byte)
SIMD v14: done in 13101 nanoseconds (3.36554 times faster than byte per
byte)

With BLCKSZ 63
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
./v4_allzeros_small
byte per byte: done in 67656 nanoseconds
size_t: done in 25768 nanoseconds (2.62558 times faster than byte per byte)
SIMD v10: done in 21446 nanoseconds (3.15471 times faster than byte per
byte)
SIMD v11: done in 56887 nanoseconds (1.18931 times faster than byte per
byte)
SIMD v12: done in 22863 nanoseconds (2.95919 times faster than byte per
byte)
SIMD v14: done in 21273 nanoseconds (3.18037 times faster than byte per
byte)

With BLCKSZ 256
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
./v4_allzeros_small
byte per byte: done in 220064 nanoseconds
size_t: done in 45886 nanoseconds (4.79589 times faster than byte per byte)
SIMD v10: done in 12032 nanoseconds (18.2899 times faster than byte per
byte)
SIMD v11: done in 11965 nanoseconds (18.3923 times faster than byte per
byte)
SIMD v12: done in 12041 nanoseconds (18.2762 times faster than byte per
byte)
SIMD v14: done in 12575 nanoseconds (17.5001 times faster than byte per
byte)

With BLCKSZ 8192
gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
./v4_allzeros_small
byte per byte: done in 10365876 nanoseconds
size_t: done in 827654 nanoseconds (12.5244 times faster than byte per byte)
SIMD v10: done in 347755 nanoseconds (29.808 times faster than byte per
byte)
SIMD v11: done in 342813 nanoseconds (30.2377 times faster than byte per
byte)
SIMD v12: done in 341124 nanoseconds (30.3874 times faster than byte per
byte)
SIMD v14: done in 50646 nanoseconds (204.673 times faster than byte per
byte)

Results with v4_allzeros_check.c attached:
gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
./v4_allzeros_check
sizeof(pagebytes)=32
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
SIMD v14: is_allzeros

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
./v4_allzeros_check
sizeof(pagebytes)=63
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
SIMD v14: is_allzeros

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
./v4_allzeros_check
sizeof(pagebytes)=256
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
p01=(0x7ffedb8ac430)
end=(0x7ffedb8ac530)
p02=(0x7ffedb8ac530)
SIMD v14: is_allzeros

gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
./v4_allzeros_check
sizeof(pagebytes)=8192
byte per byte: is_allzeros
size_t: is_allzeros
SIMD v10: is_allzeros
SIMD v11: is_allzeros
SIMD v12: is_allzeros
p01=(0x7ffd8864c200)
end=(0x7ffd8864e200)
p02=(0x7ffd8864e200)
SIMD v14: is_allzeros

If this hack is safe and correct, I think that 204 times faster,
it is very good, for a block size 8192.

That said,
V13 is fine as is.
LGTM.

best regards,
Ranier Vilela

From:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-16 14:42:54
Message-ID:	CAEudQAqaGt7_doBsFcr-08283oXCn1=RGfEjsaozc72RxS3eFQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Em sáb., 16 de nov. de 2024 às 11:40, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
escreveu:

>
> Em sex., 15 de nov. de 2024 às 11:43, Bertrand Drouvot <
> bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:
>
>> Hi,
>>
>> On Fri, Nov 15, 2024 at 09:54:33AM -0300, Ranier Vilela wrote:
>> > There is a tiny typo with V13.
>> > + /* "len" in the [sizeof(size_t) * 8, inf] range */
>>
>> I think "[sizeof(size_t) * 8, inf[ range" is correct. Infinity can not be
>> included
>> into a interval.
>>
>> Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at
>> the end)
>> might be the proper notation from a mathematical point of view.
>>
> Thanks for clarifying.
>
>
>>
>> > But, I'm not sure if I'm still doing something wrong.
>> > If so, forgive me for the noise.
>> >
>> > Of course I expected "not is_allzeros".
>>
>> That's the test case which is "wrong" (not the function):
>>
>> "
>> size_t pagebytes[BLCKSZ] = {0};
>> volatile bool result;
>>
>> pagebytes[BLCKSZ-2] = 1;
>>
>> result = pg_memory_is_all_zeros_v12(pagebytes, BLCKSZ);
>> "
>>
>> The pagebytes is an array of size_t (8 bytes each), so the actual array
>> size
>> is 8192 * 8 = 65536 bytes.
>>
>> So, pagebytes[BLCKSZ-2] = 1, sets byte 65528 ((8192-2)*8) to 1.
>>
>> But the function is checking up to BLCKSZ bytes (8192), so the results you
>> observed (which are correct).
>>
> Thanks for pointing out my mistake.
>
>
>>
>> > Anyway, I made another attempt to optimize a bit more, I don't know if
>> it's
>> > safe though.
>>
>> There is an issue in your v14, it calls:
>>
>> "
>> return pg_memory_is_all_zeros_simd(ptr, ptr + len);
>> "
>>
>> but you defined it that way:
>>
>> "
>> static inline bool
>> pg_memory_is_all_zeros_simd(const size_t *p, const size_t * end)
>>
>> "
>>
>> while that should be:
>>
>> "
>> static inline bool
>> pg_memory_is_all_zeros_simd(const void *p, const void *end)
>>
> What I'm trying here, obviously, is a hack.
> If it works, and the compiler accepts it, it's ok for me.
>
>
>> "
>>
>> Doing so, I do not observe any improvments with v14.
>>
> So.
> Again new results from v4_allzeros_small.c attached:
> Linux Ubuntu 22.04
> gcc 13 64 bits
>
> With BLCKSZ 32
> gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
> ./v4_allzeros_small
> byte per byte: done in 44092 nanoseconds
> size_t: done in 13456 nanoseconds (3.27675 times faster than byte per byte)
> SIMD v10: done in 14249 nanoseconds (3.09439 times faster than byte per
> byte)
> SIMD v11: done in 32516 nanoseconds (1.35601 times faster than byte per
> byte)
> SIMD v12: done in 14973 nanoseconds (2.94477 times faster than byte per
> byte)
> SIMD v14: done in 13101 nanoseconds (3.36554 times faster than byte per
> byte)
>
> With BLCKSZ 63
> gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
> ./v4_allzeros_small
> byte per byte: done in 67656 nanoseconds
> size_t: done in 25768 nanoseconds (2.62558 times faster than byte per byte)
> SIMD v10: done in 21446 nanoseconds (3.15471 times faster than byte per
> byte)
> SIMD v11: done in 56887 nanoseconds (1.18931 times faster than byte per
> byte)
> SIMD v12: done in 22863 nanoseconds (2.95919 times faster than byte per
> byte)
> SIMD v14: done in 21273 nanoseconds (3.18037 times faster than byte per
> byte)
>
> With BLCKSZ 256
> gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
> ./v4_allzeros_small
> byte per byte: done in 220064 nanoseconds
> size_t: done in 45886 nanoseconds (4.79589 times faster than byte per byte)
> SIMD v10: done in 12032 nanoseconds (18.2899 times faster than byte per
> byte)
> SIMD v11: done in 11965 nanoseconds (18.3923 times faster than byte per
> byte)
> SIMD v12: done in 12041 nanoseconds (18.2762 times faster than byte per
> byte)
> SIMD v14: done in 12575 nanoseconds (17.5001 times faster than byte per
> byte)
>
> With BLCKSZ 8192
> gcc -march=native -O2 v4_allzeros_small.c -o v4_allzeros_small ;
> ./v4_allzeros_small
> byte per byte: done in 10365876 nanoseconds
> size_t: done in 827654 nanoseconds (12.5244 times faster than byte per
> byte)
> SIMD v10: done in 347755 nanoseconds (29.808 times faster than byte per
> byte)
> SIMD v11: done in 342813 nanoseconds (30.2377 times faster than byte per
> byte)
> SIMD v12: done in 341124 nanoseconds (30.3874 times faster than byte per
> byte)
> SIMD v14: done in 50646 nanoseconds (204.673 times faster than byte per
> byte)
>
> Results with v4_allzeros_check.c attached:
> gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
> ./v4_allzeros_check
> sizeof(pagebytes)=32
> byte per byte: is_allzeros
> size_t: is_allzeros
> SIMD v10: is_allzeros
> SIMD v11: is_allzeros
> SIMD v12: is_allzeros
> SIMD v14: is_allzeros
>
> gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
> ./v4_allzeros_check
> sizeof(pagebytes)=63
> byte per byte: is_allzeros
> size_t: is_allzeros
> SIMD v10: is_allzeros
> SIMD v11: is_allzeros
> SIMD v12: is_allzeros
> SIMD v14: is_allzeros
>
> gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
> ./v4_allzeros_check
> sizeof(pagebytes)=256
> byte per byte: is_allzeros
> size_t: is_allzeros
> SIMD v10: is_allzeros
> SIMD v11: is_allzeros
> SIMD v12: is_allzeros
> p01=(0x7ffedb8ac430)
> end=(0x7ffedb8ac530)
> p02=(0x7ffedb8ac530)
> SIMD v14: is_allzeros
>
> gcc -march=native -O2 v4_allzeros_check.c -o v4_allzeros_check ;
> ./v4_allzeros_check
> sizeof(pagebytes)=8192
> byte per byte: is_allzeros
> size_t: is_allzeros
> SIMD v10: is_allzeros
> SIMD v11: is_allzeros
> SIMD v12: is_allzeros
> p01=(0x7ffd8864c200)
> end=(0x7ffd8864e200)
> p02=(0x7ffd8864e200)
> SIMD v14: is_allzeros
>
> If this hack is safe and correct, I think that 204 times faster,
> it is very good, for a block size 8192.
>
> That said,
> V13 is fine as is.
> LGTM.
>
Now with files attached.

best regards,
Ranier Vilela

Attachment	Content-Type	Size
v4_allzeros_small.c	text/x-csrc	14.4 KB
v4_allzeros_check.c	text/x-csrc	13.2 KB

From:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-17 05:36:27
Message-ID:	ZzmA24RVVBUhRVg8@ip-10-97-1-34.eu-west-3.compute.internal
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Sat, Nov 16, 2024 at 11:42:54AM -0300, Ranier Vilela wrote:
> > Em sex., 15 de nov. de 2024 às 11:43, Bertrand Drouvot <
> > bertranddrouvot(dot)pg(at)gmail(dot)com> escreveu:
> >
> >> while that should be:
> >>
> >> "
> >> static inline bool
> >> pg_memory_is_all_zeros_simd(const void *p, const void *end)
> >>
> > What I'm trying here, obviously, is a hack.
> > If it works, and the compiler accepts it, it's ok for me.
> >
> > If this hack is safe and correct, I think that 204 times faster,
> > it is very good, for a block size 8192.

The "hack" is not correct, because it's doing:

"
static inline bool
all_zeros_simd(const size_t *p, const size_t * end)
{
for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
if ((((size_t *) p)[0] != 0) | (((size_t *) p)[1] != 0) |
(((size_t *) p)[2] != 0) | (((size_t *) p)[3] != 0) |
(((size_t *) p)[4] != 0) | (((size_t *) p)[5] != 0) |
(((size_t *) p)[6] != 0) | (((size_t *) p)[7] != 0))
return false;
}
.
.
"

"p += sizeof(size_t) * 8" advances by 64 elements. But those elements are "size_t"
elements (since you're using size_t pointers as the function arguments).

Then instead of advancing by 64 bytes, you know advance by 512 bytes. But
you only check 64 bytes per iteration -> you're missing 448 bytes to check
per iteration.

We can "visualize" this by adding a few output messages like:

"
static inline bool
all_zeros_simd(const size_t *p, const size_t * end)
{
for (; p < (end - sizeof(size_t) * 7); p += sizeof(size_t) * 8)
{
printf("Current p: %p\n", (void*)p);
printf("Checking elements:\n");
printf("[0]: %p = %zu\n", (void*)&((size_t *)p)[0], ((size_t *)p)[0]);
printf("[1]: %p = %zu\n", (void*)&((size_t *)p)[1], ((size_t *)p)[1]);
printf("[2]: %p = %zu\n", (void*)&((size_t *)p)[2], ((size_t *)p)[2]);
printf("[3]: %p = %zu\n", (void*)&((size_t *)p)[3], ((size_t *)p)[3]);
printf("[4]: %p = %zu\n", (void*)&((size_t *)p)[4], ((size_t *)p)[4]);
printf("[5]: %p = %zu\n", (void*)&((size_t *)p)[5], ((size_t *)p)[5]);
printf("[6]: %p = %zu\n", (void*)&((size_t *)p)[6], ((size_t *)p)[6]);
printf("[7]: %p = %zu\n", (void*)&((size_t *)p)[7], ((size_t *)p)[7]);

const size_t *next_p = p + sizeof(size_t) * 8;
printf("Next p will be: %p (advance of %zu bytes)\n",
(void*)next_p,
(size_t)((char*)next_p - (char*)p));
.
.
.
"

Then we get things like:

"
Current p: 0x7fff2a93e500
Checking elements:
[0]: 0x7fff2a93e500 = 0
[1]: 0x7fff2a93e508 = 0
[2]: 0x7fff2a93e510 = 0
[3]: 0x7fff2a93e518 = 0
[4]: 0x7fff2a93e520 = 0
[5]: 0x7fff2a93e528 = 0
[6]: 0x7fff2a93e530 = 0
[7]: 0x7fff2a93e538 = 0
Next p will be: 0x7fff2a93e700 (advance of 512 bytes)
"

Meaning that you're checking 64 bytes per iteration (for example from 0x500 to
0x508 is 8 bytes) while advancing by 512 bytes: you're missing 448 bytes to check
per iteration.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: define pg_structiszero(addr, s, r)
Date:	2024-11-18 03:04:57
Message-ID:	Zzqu2SNowXwcxcx9@paquier.xyz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Nov 15, 2024 at 02:43:32PM +0000, Bertrand Drouvot wrote:
> Thinking about it, actually, "[sizeof(size_t) * 8, inf)" (note the ')' at the end)
> might be the proper notation from a mathematical point of view.

v13 relied on a notation I've always used in maths, as infinity cannot
be included. Anyway, yes, using a parenthesis seems to me more
consistent after looking around with what we have in the tree, so I've
tweaked that and applied the two patches after an extra round of
checks.
--
Michael