From: Michael Banck <michael(dot)banck(at)credativ(dot)de>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Online verification of checksums
Date: 2018-07-26 11:59:33
Message-ID: 1532606373.3422.5.camel@credativ.de

Hi,
v11 almost added online activation of checksums, but all we've got is
pg_verify_checksums, i.e. offline verification of checksums.
However, we also got (online) checksum verification during base backups,
and I have ported/adapted David Steele's recheck code to my personal
fork of pg_checksums[1], removed the online check (for verification) and
that seems to work fine.
I've now forward-ported this change to pg_verify_checksums, in order to
make this application useful for online clusters, see attached patch.
I've tested this in a tight loop[2]:

    while true; do pg_verify_checksums -D data1 -d > /dev/null || /bin/true; done

while concurrently running:

    while true; do createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench; done

This is the same workload I already used to develop the original code in
the fork, and it brought up a few bugs.
I got one checksum verification failure this way; all others were
caught by the recheck (I've introduced a 500ms delay for the first ten
failures) like this:
|pg_verify_checksums: checksum verification failed on first attempt in
|file "data1/base/16837/16850", block 7770: calculated checksum 785 but
|expected 5063
|pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
|verified ok on recheck
However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
failures like this:
|pg_verify_checksums: short read of block 2644 in file
|"data1/base/16637/16650", got only 4096 bytes
This is not strictly a verification failure; should we do anything about
this? In my fork, I am also rechecking on this[3] (and I am happy to
extend the patch that way), but that makes the code and the patch more
complicated, and I wanted to check the general opinion on this case
first.
Michael
[1] https://github.com/credativ/pg_checksums/commit/dc052f0d6f1282d3c8215b0eb28b8e7c4e74f9e5
[2] while patching out the somewhat unhelpful (in regular operation,
anyway) debug message for every successful checksum verification
[3] https://github.com/credativ/pg_checksums/blob/master/pg_checksums.c#L160
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
VAT ID: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Our handling of personal data is subject to the following provisions:
https://www.credativ.de/datenschutz
Attachment: online-verification-of-checksums_V1.patch (text/x-patch, 4.1 KB)

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-08-30 18:06:18
Message-ID: e55c9f5b-c092-8535-6d0a-7e5a70fd71e7@2ndquadrant.com

On 26/07/2018 13:59, Michael Banck wrote:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.
Why not provide this functionality as a server function or command.
Then you can access blocks with proper locks and don't have to do this
rather ad hoc retry logic on concurrent access.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-08-30 19:17:25
Message-ID: CABUevEyvhjRbcARcd51HPMsVipEvopU-qo9bS21yrryhPvjSyw@mail.gmail.com

On Thu, Aug 30, 2018 at 8:06 PM, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> On 26/07/2018 13:59, Michael Banck wrote:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
>
> Why not provide this functionality as a server function or command.
> Then you can access blocks with proper locks and don't have to do this
> rather ad hoc retry logic on concurrent access.
>
I think it would make sense to provide this functionality in the "checksum
worker" infrastructure suggested in the online checksum enabling patch. But
I think being able to run it from the outside would also be useful,
particularly when it's this simple.
But why do we need a sleep in it? AFAICT this is basically the same code
that we have in basebackup.c, and that one does not need the sleep?
Certainly 500ms would be very long since we're just protecting against a
torn page, but the comment is wrong I think, and we're actually sleeping
0.5ms?
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-01 06:27:42
Message-ID: alpine.DEB.2.21.1809010826340.32764@lancre

Hallo Michael,
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.
Patch does not seem to apply anymore, could you rebase it?
--
Fabien.

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-03 20:29:18
Message-ID: 3d9a49ce-f7ba-fbbb-409a-4d3d395493e8@2ndquadrant.com

Hi,
The patch is mostly copying the verification / retry logic from
basebackup.c, but I think it omitted a rather important detail that
makes it incorrect in the presence of concurrent writes.
The very first thing basebackup does is this:
    startptr = do_pg_start_backup(...);
i.e. it waits for a checkpoint, remembering the LSN. And then when
checking a page it does this:
    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
    {
        ... verify the page checksum
    }
Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance. But the startptr
detail is pretty important because it supports this retry reasoning:
    /*
     * Retry the block on the first failure.  It's
     * possible that we read the first 4K page of the
     * block just before postgres updated the entire block
     * so it ends up looking torn to us.  We only need to
     * retry once because the LSN should be updated to
     * something we can ignore on the next pass.  If the
     * error happens again then it is a true validation
     * failure.
     */
Imagine the 8kB page as two 4kB pages, with the initial state being
[A1,A2] and another process over-writing it with [B1,B2]. If you read
the 8kB page, what states can you see?
I don't think POSIX provides any guarantees about atomicity of the write
calls (and even if it does, the filesystems on Linux don't seem to). So
you may observe either [A1,B2] or [A2,B1], or various inconsistent mixes
of the two versions, depending on timing. Well, torn pages ...
Pretty much the only thing you can rely on is that when one process does
write([B1,B2])
the other process may first read [A1,B2], but the next read will return
[B1,B2] (or possibly newer data, if there was another write). It will
not read the "stale" A1 again.
The basebackup relies on this kinda implicitly - on the retry it'll
notice the LSN changed (thanks to the startptr check), and the page will
be skipped entirely. This is pretty important, because the new page
might be torn in some other way.
The pg_verify_checksum apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.
So IMHO this should do something similar to basebackup - check the page
LSN, and if it changed then skip the page.
I'm afraid this requires using the last checkpoint LSN, the way startptr
is used in basebackup. In particular we can't simply remember LSN from
the first read, because we might actually read [B1,A2] on the first try,
and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
torn in various other ways, not necessarily at the 4kB boundary - it
might be torn right after the LSN, for example).
FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From: Michael Banck <michael(dot)banck(at)credativ(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-17 14:04:18
Message-ID: 20180917140418.GA23519@nighthawk.caipicrew.dd-dns.de

Hi,
On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> The patch is mostly copying the verification / retry logic from
> basebackup.c, but I think it omitted a rather important detail that
> makes it incorrect in the presence of concurrent writes.
>
> The very first thing basebackup does is this:
>
> startptr = do_pg_start_backup(...);
>
> i.e. it waits for a checkpoint, remembering the LSN. And then when
> checking a page it does this:
>
> if (!PageIsNew(page) && PageGetLSN(page) < startptr)
> {
> ... verify the page checksum
> }
>
> Obviously, pg_verify_checksums can't do that easily because it's
> supposed to run from outside the database instance.
It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?
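
For illustration, a minimal sketch of how that could look; the wrapper
function is made up, but get_controlfile() is the existing frontend
helper from src/common/controldata_utils.c (PG11 signature), which
pg_verify_checksums already uses to read pg_control:

    #include "postgres_fe.h"

    #include "catalog/pg_control.h"
    #include "common/controldata_utils.h"

    /*
     * Hypothetical wrapper: fetch the last checkpoint's LSN from
     * pg_control, bailing out if the control file looks corrupt.
     */
    static XLogRecPtr
    last_checkpoint_lsn(const char *DataDir, const char *progname)
    {
        bool             crc_ok;
        ControlFileData *ControlFile;

        ControlFile = get_controlfile(DataDir, progname, &crc_ok);
        if (!crc_ok)
        {
            fprintf(stderr, "%s: pg_control CRC check failed\n", progname);
            exit(1);
        }
        return ControlFile->checkPoint;
    }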
Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?
> But the startptr detail is pretty important because it supports this
> retry reasoning:
>
> /*
> * Retry the block on the first failure. It's
> * possible that we read the first 4K page of the
> * block just before postgres updated the entire block
> * so it ends up looking torn to us. We only need to
> * retry once because the LSN should be updated to
> * something we can ignore on the next pass. If the
> * error happens again then it is a true validation
> * failure.
> */
>
> Imagine the 8kB page as two 4kB pages, with the initial state being
> [A1,A2] and another process over-writing it with [B1,B2]. If you read
> the 8kB page, what states can you see?
>
> I don't think POSIX provides any guarantees about atomicity of the write
> calls (and even if it does, the filesystems on Linux don't seem to). So
> you may observe either [A1,B2] or [A2,B1], or various inconsistent mixes
> of the two versions, depending on timing. Well, torn pages ...
>
> Pretty much the only thing you can rely on is that when one process does
>
> write([B1,B2])
>
> the other process may first read [A1,B2], but the next read will return
> [B1,B2] (or possibly newer data, if there was another write). It will
> not read the "stale" A1 again.
>
> The basebackup relies on this kinda implicitly - on the retry it'll
> notice the LSN changed (thanks to the startptr check), and the page will
> be skipped entirely. This is pretty important, because the new page
> might be torn in some other way.
>
> The pg_verify_checksum apparently ignores this skip logic, because on
> the retry it simply re-reads the page again, verifies the checksum and
> reports an error. Which is broken, because the newly read page might be
> torn again due to a concurrent write.
Well, ok.
> So IMHO this should do something similar to basebackup - check the page
> LSN, and if it changed then skip the page.
>
> I'm afraid this requires using the last checkpoint LSN, the way startptr
> is used in basebackup. In particular we can't simply remember LSN from
> the first read, because we might actually read [B1,A2] on the first try,
> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> torn in various other ways, not necessarily at the 4kB boundary - it
> might be torn right after the LSN, for example).
I'd prefer to come up with a plan where we don't just give up once we
see a new LSN, if possible. If I run a modified pg_verify_checksums
which skips on newer pages in a tight benchmark, basically everything
gets skipped as checkpoints don't happen often enough.
So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?
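
To make that concrete, here is a rough per-block sketch of the logic I
have in mind (the function name and counters are made up; this is not
the actual patch):

    #include "postgres_fe.h"

    #include <unistd.h>

    #include "storage/bufpage.h"
    #include "storage/checksum.h"
    #include "storage/checksum_impl.h"

    static long skippedblocks = 0;
    static long badblocks = 0;

    /*
     * Verify one block: re-read once on a checksum mismatch, and skip
     * the block only if it still fails and its LSN is newer than the
     * last checkpoint.  blkno must be the block's offset within the
     * whole relation, not within the segment file.
     */
    static void
    check_block(int fd, BlockNumber blkno, XLogRecPtr checkpointLSN)
    {
        char        buf[BLCKSZ];
        PageHeader  header = (PageHeader) buf;
        int         tries;

        for (tries = 0; tries < 2; tries++)
        {
            if (pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
                return;            /* short read, e.g. file being extended */

            if (PageIsNew(buf))
                return;            /* new pages carry no checksum yet */

            if (pg_checksum_page(buf, blkno) == header->pd_checksum)
                return;            /* checksum ok */

            /* mismatch: re-read once, the first read may have been torn */
        }

        if (PageGetLSN(buf) > checkpointLSN)
            skippedblocks++;       /* rewritten concurrently, skip it */
        else
        {
            fprintf(stderr, "checksum failure in block %u\n", blkno);
            badblocks++;
        }
    }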
In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.
> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> to protect against anything, really.
Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.
Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
or some such file which pg_verify_checksums gets confused on, I guess we
should skip that as well. Can we assume that all files that start with
the ones in skip[] are safe to skip or should we have an exception for
files starting with pg_internal.init?
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-17 14:46:51
Message-ID: 20180917144651.GX4184@tamriel.snowman.net

Greetings,
* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> > Obviously, pg_verify_checksums can't do that easily because it's
> > supposed to run from outside the database instance.
>
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
>
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?
Right, basebackup doesn't need to look at other pages.
> > The pg_verify_checksum apparently ignores this skip logic, because on
> > the retry it simply re-reads the page again, verifies the checksum and
> > reports an error. Which is broken, because the newly read page might be
> > torn again due to a concurrent write.
>
> Well, ok.
The newly read page will have an updated LSN though then on the re-read,
in which case basebackup can know that what happened was a rewrite of
the page and it no longer has to care about the page and can skip it.
I haven't looked, but if basebackup isn't checking the LSN again for the
newly read page then that'd be broken, but I believe it does (at least,
that's the algorithm we came up with for pgBackRest, and I know David
shared that when the basebackup code was being written).
> > So IMHO this should do something similar to basebackup - check the page
> > LSN, and if it changed then skip the page.
> >
> > I'm afraid this requires using the last checkpoint LSN, the way startptr
> > is used in basebackup. In particular we can't simply remember LSN from
> > the first read, because we might actually read [B1,A2] on the first try,
> > and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> > torn in various other ways, not necessarily at the 4kB boundary - it
> > might be torn right after the LSN, for example).
>
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.
I'm really not sure how you expect to be able to do something different
here. Even if we started poking into shared buffers, all you'd be able
to see is that there's a bunch of dirty pages- and we don't maintain the
checksums in shared buffers, so it's not like you could verify them
there.
You could possibly have an option that says "force a checkpoint" but,
honestly, that's really not all that interesting either- all you'd be
doing is forcing all the pages to be written out from shared buffers
into the kernel cache and then reading them from there instead, it's not
like you'd actually be able to tell if there was a disk/storage error
because you'll only be looking at the kernel cache.
> So how about we do check every page, but if one fails on retry, and the
> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
I thought that's what basebackup did- if it doesn't do that today, then
it really should.
> In any case, if we decide we really should skip the page if it is newer
> than the checkpoint, I think it makes sense to track those skipped pages
> and print their number out at the end, if there are any.
Not sure what the point of this is. If we wanted to really do something
to cross-check here, we'd track the pages that were skipped and then
look through the WAL to make sure that they're there. That's something
we've talked about doing with pgBackRest, but don't currently.
> > FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> > to protect against anything, really.
>
> Well, I've noticed that without it I get sporadic checksum failures on
> reread, so I've added it to make them go away. It was certainly a
> phenomenological decision that I am happy to trade for a better one.
That then sounds like we really aren't re-checking the LSN, and we
really should be, to avoid getting these sporadic checksum failures on
reread..
> Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
> or some such file which pg_verify_checksums gets confused on, I guess we
> should skip that as well. Can we assume that all files that start with
> the ones in skip[] are safe to skip or should we have an exception for
> files starting with pg_internal.init?
Everything listed in skip is safe to skip on a restore.. I've not
really thought too much about if they're all safe to skip when checking
checksums for an online system, but I would generally think so..
Thanks!
Stephen

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-17 16:13:44
Message-ID: 1daf5fb0-adee-2a65-772a-579cbdd063c0@2ndquadrant.com

On 09/17/2018 04:46 PM, Stephen Frost wrote:
> Greetings,
>
> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>>> Obviously, pg_verify_checksums can't do that easily because it's
>>> supposed to run from outside the database instance.
>>
>> It reads pg_control anyway, so couldn't we just take
>> ControlFile->checkPoint?
>>
>> Other than that, basebackup.c seems to only look at pages which haven't
>> been changed since the backup starting checkpoint (see above if
>> statement). That's reasonable for backups, but is it just as reasonable
>> for online verification?
>
> Right, basebackup doesn't need to look at other pages.
>
>>> The pg_verify_checksum apparently ignores this skip logic, because on
>>> the retry it simply re-reads the page again, verifies the checksum and
>>> reports an error. Which is broken, because the newly read page might be
>>> torn again due to a concurrent write.
>>
>> Well, ok.
>
> The newly read page will have an updated LSN though then on the re-read,
> in which case basebackup can know that what happened was a rewrite of
> the page and it no longer has to care about the page and can skip it.
>
> I haven't looked, but if basebackup isn't checking the LSN again for the
> newly read page then that'd be broken, but I believe it does (at least,
> that's the algorithm we came up with for pgBackRest, and I know David
> shared that when the basebackup code was being written).
>
Yes, basebackup does check the LSN on re-read, and skips the page if it
changed on re-read (because it eliminates the consistency guarantees
provided by the checkpoint).
>>> So IMHO this should do something similar to basebackup - check the page
>>> LSN, and if it changed then skip the page.
>>>
>>> I'm afraid this requires using the last checkpoint LSN, the way startptr
>>> is used in basebackup. In particular we can't simply remember LSN from
>>> the first read, because we might actually read [B1,A2] on the first try,
>>> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
>>> torn in various other ways, not necessarily at the 4kB boundary - it
>>> might be torn right after the LSN, for example).
>>
>> I'd prefer to come up with a plan where we don't just give up once we
>> see a new LSN, if possible. If I run a modified pg_verify_checksums
>> which skips on newer pages in a tight benchmark, basically everything
>> gets skipped as checkpoints don't happen often enough.
>
> I'm really not sure how you expect to be able to do something different
> here. Even if we started poking into shared buffers, all you'd be able
> to see is that there's a bunch of dirty pages- and we don't maintain the
> checksums in shared buffers, so it's not like you could verify them
> there.
>
> You could possibly have an option that says "force a checkpoint" but,
> honestly, that's really not all that interesting either- all you'd be
> doing is forcing all the pages to be written out from shared buffers
> into the kernel cache and then reading them from there instead, it's not
> like you'd actually be able to tell if there was a disk/storage error
> because you'll only be looking at the kernel cache.
>
Yeah.
>> So how about we do check every page, but if one fails on retry, and the
>> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>
> I thought that's what basebackup did- if it doesn't do that today, then
> it really should.
>
The crucial distinction here is that the trick is not in comparing LSNs
from the two page reads, but in comparing the page LSN to the checkpoint
LSN. If it's greater, the page may be torn or broken, and there's no way
to know which case it is - so basebackup simply skips it.
>> In any case, if we decide we really should skip the page if it is newer
>> than the checkpoint, I think it makes sense to track those skipped pages
>> and print their number out at the end, if there are any.
>
> Not sure what the point of this is. If we wanted to really do something
> to cross-check here, we'd track the pages that were skipped and then
> look through the WAL to make sure that they're there. That's something
> we've talked about doing with pgBackRest, but don't currently.
>
I agree simply printing the page numbers seems rather useless. What we
could do is remember which pages we skipped and then try checking them
after another checkpoint. Or something like that.
>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>>> to protect against anything, really.
>>
>> Well, I've noticed that without it I get sporadic checksum failures on
>> reread, so I've added it to make them go away. It was certainly a
>> phenomenological decision that I am happy to trade for a better one.
>
> That then sounds like we really aren't re-checking the LSN, and we
> really should be, to avoid getting these sporadic checksum failures on
> reread..
>
Again, it's not enough to check the LSN against the preceding read. We
need a checkpoint LSN or something like that.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-17 16:25:02
Message-ID: 795c567a-936f-c765-8c9d-4a4427dbd243@2ndquadrant.com

On 09/17/2018 04:04 PM, Michael Banck wrote:
> Hi,
>
> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>> The patch is mostly copying the verification / retry logic from
>> basebackup.c, but I think it omitted a rather important detail that
>> makes it incorrect in the presence of concurrent writes.
>>
>> The very first thing basebackup does is this:
>>
>> startptr = do_pg_start_backup(...);
>>
>> i.e. it waits for a checkpoint, remembering the LSN. And then when
>> checking a page it does this:
>>
>> if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>> {
>> ... verify the page checksum
>> }
>>
>> Obviously, pg_verify_checksums can't do that easily because it's
>> supposed to run from outside the database instance.
>
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
>
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?
>
I suppose we might refresh the checkpoint LSN regularly, and use the
most recent one. On large/busy databases that would allow checking a
larger part of the database.
>> But the startptr detail is pretty important because it supports this
>> retry reasoning:
>>
>> /*
>> * Retry the block on the first failure. It's
>> * possible that we read the first 4K page of the
>> * block just before postgres updated the entire block
>> * so it ends up looking torn to us. We only need to
>> * retry once because the LSN should be updated to
>> * something we can ignore on the next pass. If the
>> * error happens again then it is a true validation
>> * failure.
>> */
>>
>> Imagine the 8kB page as two 4kB pages, with the initial state being
>> [A1,A2] and another process over-writing it with [B1,B2]. If you read
>> the 8kB page, what states can you see?
>>
>> I don't think POSIX provides any guarantees about atomicity of the write
>> calls (and even if it does, the filesystems on Linux don't seem to). So
>> you may observe either [A1,B2] or [A2,B1], or various inconsistent mixes
>> of the two versions, depending on timing. Well, torn pages ...
>>
>> Pretty much the only thing you can rely on is that when one process does
>>
>> write([B1,B2])
>>
>> the other process may first read [A1,B2], but the next read will return
>> [B1,B2] (or possibly newer data, if there was another write). It will
>> not read the "stale" A1 again.
>>
>> The basebackup relies on this kinda implicitly - on the retry it'll
>> notice the LSN changed (thanks to the startptr check), and the page will
>> be skipped entirely. This is pretty important, because the new page
>> might be torn in some other way.
>>
>> The pg_verify_checksum apparently ignores this skip logic, because on
>> the retry it simply re-reads the page again, verifies the checksum and
>> reports an error. Which is broken, because the newly read page might be
>> torn again due to a concurrent write.
>
> Well, ok.
>
>> So IMHO this should do something similar to basebackup - check the page
>> LSN, and if it changed then skip the page.
>>
>> I'm afraid this requires using the last checkpoint LSN, the way startptr
>> is used in basebackup. In particular we can't simply remember LSN from
>> the first read, because we might actually read [B1,A2] on the first try,
>> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
>> torn in various other ways, not necessarily at the 4kB boundary - it
>> might be torn right after the LSN, for example).
>
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.
>
But in that case the checksums are verified when reading the buffer into
shared buffers; it's not like we don't notice the checksum error at all.
We are interested in the pages that have not been read/written for an
extended period of time. So I think this is not a problem.
> So how about we do check every page, but if one fails on retry, and the
> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>
Hmmm, maybe.
> In any case, if we decide we really should skip the page if it is newer
> than the checkpoint, I think it makes sense to track those skipped pages
> and print their number out at the end, if there are any.
>
I agree it might be useful to know how many pages were skipped, and how
many actually passed the checksum check.
>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>> to protect against anything, really.
>
> Well, I've noticed that without it I get sporadic checksum failures on
> reread, so I've added it to make them go away. It was certainly a
> phenomenological decision that I am happy to trade for a better one.
>
My guess is this happened because both the read and re-read completed
during the same write.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-17 16:42:46
Message-ID: 20180917164246.GZ4184@tamriel.snowman.net

Greetings,
* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> On 09/17/2018 04:46 PM, Stephen Frost wrote:
> > * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> >> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> >>> Obviously, pg_verify_checksums can't do that easily because it's
> >>> supposed to run from outside the database instance.
> >>
> >> It reads pg_control anyway, so couldn't we just take
> >> ControlFile->checkPoint?
> >>
> >> Other than that, basebackup.c seems to only look at pages which haven't
> >> been changed since the backup starting checkpoint (see above if
> >> statement). That's reasonable for backups, but is it just as reasonable
> >> for online verification?
> >
> > Right, basebackup doesn't need to look at other pages.
> >
> >>> The pg_verify_checksum apparently ignores this skip logic, because on
> >>> the retry it simply re-reads the page again, verifies the checksum and
> >>> reports an error. Which is broken, because the newly read page might be
> >>> torn again due to a concurrent write.
> >>
> >> Well, ok.
> >
> > The newly read page will have an updated LSN though then on the re-read,
> > in which case basebackup can know that what happened was a rewrite of
> > the page and it no longer has to care about the page and can skip it.
> >
> > I haven't looked, but if basebackup isn't checking the LSN again for the
> > newly read page then that'd be broken, but I believe it does (at least,
> > that's the algorithm we came up with for pgBackRest, and I know David
> > shared that when the basebackup code was being written).
>
> Yes, basebackup does check the LSN on re-read, and skips the page if it
> changed on re-read (because it eliminates the consistency guarantees
> provided by the checkpoint).
Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'. The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.
> >> So how about we do check every page, but if one fails on retry, and the
> >> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
> >
> > I thought that's what basebackup did- if it doesn't do that today, then
> > it really should.
>
> The crucial distinction here is that the trick is not in comparing LSNs
> from the two page reads, but in comparing the page LSN to the checkpoint
> LSN. If it's greater, the page may be torn or broken, and there's no way
> to know which case it is - so basebackup simply skips it.
Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it. IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.
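
Roughly, from memory, as C-like pseudocode; checksum_ok(), reread(),
skip() and report_failure() are made-up helper names, and startptr is
the backup start / checkpoint LSN:

    if (!checksum_ok(page))
    {
        if (PageGetLSN(page) > startptr)
            skip(page);             /* rewritten since backup start */
        else
        {
            reread(page);
            if (PageGetLSN(page) > startptr)
                skip(page);         /* updated between the two reads */
            else if (!checksum_ok(page))
                report_failure(page);   /* genuine checksum failure */
            /* else: ok on re-read, the first read was torn */
        }
    }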
> >> In any case, if we decide we really should skip the page if it is newer
> >> than the checkpoint, I think it makes sense to track those skipped pages
> >> and print their number out at the end, if there are any.
> >
> > Not sure what the point of this is. If we wanted to really do something
> > to cross-check here, we'd track the pages that were skipped and then
> > look through the WAL to make sure that they're there. That's something
> > we've talked about doing with pgBackRest, but don't currently.
>
> I agree simply printing the page numbers seems rather useless. What we
> could do is remember which pages we skipped and then try checking them
> after another checkpoint. Or something like that.
I'm still not sure I'm seeing the point of that. They're still going to
almost certainly be in the kernel cache. The reason for checking
against the WAL would be to detect errors in PG where we aren't putting
a page into the WAL when it really should be, or something similar,
which seems like it at least could be useful.
Maybe to put it another way- there's very little point in checking the
checksum of a page which we know must be re-written during recovery to
get to a consistent point. I don't think it hurts in the general case,
but I wouldn't write a lot of code which then needs to be tested to
handle it. I also don't think that we really need to make
pg_verify_checksum spend lots of extra cycles trying to verify that
*every* page had its checksum validated when we know that lots of pages
are going to be in memory marked dirty and our checking of them will be
ultimately pointless as they'll either be written out by the
checkpointer or some other process, or we'll replay them from the WAL if
we crash.
> >>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> >>> to protect against anything, really.
> >>
> >> Well, I've noticed that without it I get sporadic checksum failures on
> >> reread, so I've added it to make them go away. It was certainly a
> >> phenomenological decision that I am happy to trade for a better one.
> >
> > That then sounds like we really aren't re-checking the LSN, and we
> > really should be, to avoid getting these sporadic checksum failures on
> > reread..
>
> Again, it's not enough to check the LSN against the preceding read. We
> need a checkpoint LSN or something like that.
I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.
Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway). The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.
When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that.. I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.
Thanks!
Stephen

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-17 17:00:31
Message-ID: ab83e470-4077-a6c1-2dbc-083d7df274ac@2ndquadrant.com

Hi,
On 09/17/2018 06:42 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
>> On 09/17/2018 04:46 PM, Stephen Frost wrote:
>>> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
>>>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>>>>> Obviously, pg_verify_checksums can't do that easily because it's
>>>>> supposed to run from outside the database instance.
>>>>
>>>> It reads pg_control anyway, so couldn't we just take
>>>> ControlFile->checkPoint?
>>>>
>>>> Other than that, basebackup.c seems to only look at pages which haven't
>>>> been changed since the backup starting checkpoint (see above if
>>>> statement). That's reasonable for backups, but is it just as reasonable
>>>> for online verification?
>>>
>>> Right, basebackup doesn't need to look at other pages.
>>>
>>>>> The pg_verify_checksum apparently ignores this skip logic, because on
>>>>> the retry it simply re-reads the page again, verifies the checksum and
>>>>> reports an error. Which is broken, because the newly read page might be
>>>>> torn again due to a concurrent write.
>>>>
>>>> Well, ok.
>>>
>>> The newly read page will have an updated LSN though then on the re-read,
>>> in which case basebackup can know that what happened was a rewrite of
>>> the page and it no longer has to care about the page and can skip it.
>>>
>>> I haven't looked, but if basebackup isn't checking the LSN again for the
>>> newly read page then that'd be broken, but I believe it does (at least,
>>> that's the algorithm we came up with for pgBackRest, and I know David
>>> shared that when the basebackup code was being written).
>>
>> Yes, basebackup does check the LSN on re-read, and skips the page if it
>> changed on re-read (because it eliminates the consistency guarantees
>> provided by the checkpoint).
>
> Ok, good, though I'm not sure what you mean by 'eliminates the
> consistency guarantees provided by the checkpoint'. The point is that
> the page will be in the WAL and the WAL will be replayed during the
> restore of the backup.
>
The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
such an LSN, you know the whole write already completed and a read won't
return data from before the LSN.
Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:
1) write the first 512B of the page (sector), which includes the LSN
2) read the whole page, which will be a mix [new 512B, ... old ... ]
3) the checksum verification fails
4) read the page again (possibly reading a bit more new data)
5) the LSN did not change compared to the first read, yet the checksum
still fails
>>>> So how about we do check every page, but if one fails on retry, and the
>>>> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>>>
>>> I thought that's what basebackup did- if it doesn't do that today, then
>>> it really should.
>>
>> The crucial distinction here is that the trick is not in comparing LSNs
>> from the two page reads, but in comparing the page LSN to the checkpoint
>> LSN. If it's greater, the page may be torn or broken, and there's no way
>> to know which case it is - so basebackup simply skips it.
>
> Sure, because we don't care about it any longer- that page isn't
> interesting because the WAL will replay over it. IIRC it actually goes
> something like: check the checksum, if it failed then check if the LSN
> is greater than the checkpoint (of the backup start..), if not, then
> re-read, if the LSN is now newer than the checkpoint then skip, if the
> LSN is the same then throw an error.
>
Nope, we only verify the checksum if its LSN precedes the checkpoint:
https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
>>>> In any case, if we decide we really should skip the page if it is newer
>>>> than the checkpoint, I think it makes sense to track those skipped pages
>>>> and print their number out at the end, if there are any.
>>>
>>> Not sure what the point of this is. If we wanted to really do something
>>> to cross-check here, we'd track the pages that were skipped and then
>>> look through the WAL to make sure that they're there. That's something
>>> we've talked about doing with pgBackRest, but don't currently.
>>
>> I agree simply printing the page numbers seems rather useless. What we
>> could do is remember which pages we skipped and then try checking them
>> after another checkpoint. Or something like that.
>
> I'm still not sure I'm seeing the point of that. They're still going to
> almost certainly be in the kernel cache. The reason for checking
> against the WAL would be to detect errors in PG where we aren't putting
> a page into the WAL when it really should be, or something similar,
> which seems like it at least could be useful.
>
> Maybe to put it another way- there's very little point in checking the
> checksum of a page which we know must be re-written during recovery to
> get to a consistent point. I don't think it hurts in the general case,
> but I wouldn't write a lot of code which then needs to be tested to
> handle it. I also don't think that we really need to make
> pg_verify_checksum spend lots of extra cycles trying to verify that
> *every* page had its checksum validated when we know that lots of pages
> are going to be in memory marked dirty and our checking of them will be
> ultimately pointless as they'll either be written out by the
> checkpointer or some other process, or we'll replay them from the WAL if
> we crash.
>
Yeah, I agree.
>>>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>>>>> to protect against anything, really.
>>>>
>>>> Well, I've noticed that without it I get sporadic checksum failures on
>>>> reread, so I've added it to make them go away. It was certainly a
>>>> phenomenological decision that I am happy to trade for a better one.
>>>
>>> That then sounds like we really aren't re-checking the LSN, and we
>>> really should be, to avoid getting these sporadic checksum failures on
>>> reread..
>>
>> Again, it's not enough to check the LSN against the preceding read. We
>> need a checkpoint LSN or something like that.
>
> I actually tend to disagree with you that, for this purpose, it's
> actually necessary to check against the checkpoint LSN- if the LSN
> changed and everything is operating correctly then the new LSN must be
> more recent than the last checkpoint location or things are broken
> badly.
>
I don't follow. Are you suggesting we don't need the checkpoint LSN?
I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.
> Now, that said, I do think it's a good *idea* to check against the
> checkpoint LSN (presuming this is for online checking of checksums- for
> basebackup, we could just check against the backup-start LSN as anything
> after that point will be rewritten by WAL anyway). The reason that I
> think it's a good idea to check against the checkpoint LSN is that we'd
> want to throw a big warning if the kernel is just feeding us random
> garbage on reads and only finding a difference between two reads isn't
> really doing any kind of validation, whereas checking against the
> checkpoint-LSN would at least give us some idea that the value being
> read isn't completely ridiculous.
>
> When it comes to if the pg_sleep() is necessary or not, I have to admit
> to being unsure about that.. I could see how it might be but it seems a
> bit surprising- I'd probably want to see exactly what the page was at
> the time of the failure and at the time of the second (no-sleep) re-read
> and then after a delay and convince myself that it was just an unlucky
> case of being scheduled in twice to read that page before the process
> writing it out got a chance to finish the write.
>
I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory pressure, etc.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-17 17:11:06
Message-ID: 20180917171106.GA4184@tamriel.snowman.net

Greetings,
* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > Ok, good, though I'm not sure what you mean by 'eliminates the
> > consistency guarantees provided by the checkpoint'. The point is that
> > the page will be in the WAL and the WAL will be replayed during the
> > restore of the backup.
>
> The checkpoint guarantees that the whole page was written and flushed to
> disk with an LSN before the checkpoint LSN. So when you read a page with
> such an LSN, you know the whole write already completed and a read won't
> return data from before the LSN.
Well, you know that the first part was written out at some prior point,
but you could end up reading the first part of a page with an older LSN
while also reading the second part with new data.
> Without the checkpoint that's not guaranteed, and simply re-reading the
> page and rechecking it vs. the first read does not help:
>
> 1) write the first 512B of the page (sector), which includes the LSN
>
> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>
> 3) the checksum verification fails
>
> 4) read the page again (possibly reading a bit more new data)
>
> 5) the LSN did not change compared to the first read, yet the checksum
> still fails
So, I agree with all of the above, though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write; getting multiple of them over a period of time strikes me as
even more unlikely. Still, if we can come up with a solution that solves
all of this, great, but I'm not sure that I'm hearing one.
> > Sure, because we don't care about it any longer- that page isn't
> > interesting because the WAL will replay over it. IIRC it actually goes
> > something like: check the checksum, if it failed then check if the LSN
> > is greater than the checkpoint (of the backup start..), if not, then
> > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > LSN is the same then throw an error.
>
> Nope, we only verify the checksum if its LSN precedes the checkpoint:
>
> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
That seems like it's leaving something on the table, but, to be fair, we
know that all of those pages should be rewritten by WAL anyway so they
aren't all that interesting to us, particularly in the basebackup case.
> > I actually tend to disagree with you that, for this purpose, it's
> > actually necessary to check against the checkpoint LSN- if the LSN
> > changed and everything is operating correctly then the new LSN must be
> > more recent than the last checkpoint location or things are broken
> > badly.
>
> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>
> I'm pretty sure that's not the case. The thing is - the LSN may not
> change between the two reads, but that's not a guarantee the page was
> not torn. The example I posted earlier in this message illustrates that.
I agree that there's some risk there, but it's certainly much less
likely.
> > Now, that said, I do think it's a good *idea* to check against the
> > checkpoint LSN (presuming this is for online checking of checksums- for
> > basebackup, we could just check against the backup-start LSN as anything
> > after that point will be rewritten by WAL anyway). The reason that I
> > think it's a good idea to check against the checkpoint LSN is that we'd
> > want to throw a big warning if the kernel is just feeding us random
> > garbage on reads and only finding a difference between two reads isn't
> > really doing any kind of validation, whereas checking against the
> > checkpoint-LSN would at least give us some idea that the value being
> > read isn't completely ridiculous.
> >
> > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > to being unsure about that.. I could see how it might be but it seems a
> > bit surprising- I'd probably want to see exactly what the page was at
> > the time of the failure and at the time of the second (no-sleep) re-read
> > and then after a delay and convince myself that it was just an unlucky
> > case of being scheduled in twice to read that page before the process
> > writing it out got a chance to finish the write.
>
> I think the pg_sleep() is a pretty strong sign there's something broken.
> At the very least, it's likely to misbehave on machines with different
> timings, machines under memory pressure, etc.
If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that level. It'd certainly add
a huge amount of additional complexity for what appears to be a quite
unlikely gain.
I'll chat w/ David shortly about this again though and get his thoughts
on it. This is certainly an area we've spent time thinking about but
are obviously also open to finding a better solution.
Thanks!
Stephen

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-17 17:19:50
Message-ID: dc63116a-d88b-0f7f-1a3d-621b6e002329@2ndquadrant.com

On 09/17/2018 07:11 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
>> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>>> Ok, good, though I'm not sure what you mean by 'eliminates the
>>> consistency guarantees provided by the checkpoint'. The point is that
>>> the page will be in the WAL and the WAL will be replayed during the
>>> restore of the backup.
>>
>> The checkpoint guarantees that the whole page was written and flushed to
>> disk with an LSN before the checkpoint LSN. So when you read a page with
>> such an LSN, you know the whole write already completed and a read won't
>> return data from before the LSN.
>
> Well, you know that the first part was written out at some prior point,
> but you could end up reading the first part of a page with an older LSN
> while also reading the second part with new data.
>
Doesn't the checkpoint fsync pretty much guarantee this can't happen?
>> Without the checkpoint that's not guaranteed, and simply re-reading the
>> page and rechecking it vs. the first read does not help:
>>
>> 1) write the first 512B of the page (sector), which includes the LSN
>>
>> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>>
>> 3) the checksum verification fails
>>
>> 4) read the page again (possibly reading a bit more new data)
>>
>> 5) the LSN did not change compared to the first read, yet the checksum
>> still fails
>
> So, I agree with all of the above though I've found it to be extremely
> rare to get a single read which you've managed to catch part-way through
> a write, getting multiple of them over a period of time strikes me as
> even more unlikely. Still, if we can come up with a solution to solve
> all of this, great, but I'm not sure that I'm hearing one.
>
I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).
>>> Sure, because we don't care about it any longer- that page isn't
>>> interesting because the WAL will replay over it. IIRC it actually goes
>>> something like: check the checksum, if it failed then check if the LSN
>>> is greater than the checkpoint (of the backup start..), if not, then
>>> re-read, if the LSN is now newer than the checkpoint then skip, if the
>>> LSN is the same then throw an error.
>>
>> Nope, we only verify the checksum if its LSN precedes the checkpoint:
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
>
> That seems like it's leaving something on the table, but, to be fair, we
> know that all of those pages should be rewritten by WAL anyway so they
> aren't all that interesting to us, particularly in the basebackup case.
>
Yep.
>>> I actually tend to disagree with you that, for this purpose, it's
>>> actually necessary to check against the checkpoint LSN- if the LSN
>>> changed and everything is operating correctly then the new LSN must be
>>> more recent than the last checkpoint location or things are broken
>>> badly.
>>
>> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>>
>> I'm pretty sure that's not the case. The thing is - the LSN may not
>> change between the two reads, but that's not a guarantee the page was
>> not torn. The example I posted earlier in this message illustrates that.
>
> I agree that there's some risk there, but it's certainly much less
> likely.
>
Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.
>>> Now, that said, I do think it's a good *idea* to check against the
>>> checkpoint LSN (presuming this is for online checking of checksums- for
>>> basebackup, we could just check against the backup-start LSN as anything
>>> after that point will be rewritten by WAL anyway). The reason that I
>>> think it's a good idea to check against the checkpoint LSN is that we'd
>>> want to throw a big warning if the kernel is just feeding us random
>>> garbage on reads and only finding a difference between two reads isn't
>>> really doing any kind of validation, whereas checking against the
>>> checkpoint-LSN would at least give us some idea that the value being
>>> read isn't completely ridiculous.
>>>
>>> When it comes to if the pg_sleep() is necessary or not, I have to admit
>>> to being unsure about that.. I could see how it might be but it seems a
>>> bit surprising- I'd probably want to see exactly what the page was at
>>> the time of the failure and at the time of the second (no-sleep) re-read
>>> and then after a delay and convince myself that it was just an unlucky
>>> case of being scheduled in twice to read that page before the process
>>> writing it out got a chance to finish the write.
>>
>> I think the pg_sleep() is a pretty strong sign there's something broken.
>> At the very least, it's likely to misbehave on machines with different
>> timings, machines under CPU and/or memory pressure, etc.
>
> If we assume that what you've outlined above is a serious enough issue
> that we have to address it, and do so without a pg_sleep(), then I think
> we have to bake into this a way for the process to check with PG as to
> what the page's current LSN is, in shared buffers, because that's the
> only place where we've got the locking required to ensure that we don't
> end up with a read of a partially written page, and I'm really not
> entirely convinced that we need to go to that level. It'd certainly add
> a huge amount of additional complexity for what appears to be a quite
> unlikely gain.
>
> I'll chat w/ David shortly about this again though and get his thoughts
> on it. This is certainly an area we've spent time thinking about but
> are obviously also open to finding a better solution.
>
Why not simply look at the last checkpoint LSN and use it the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-17 17:35:28 |
Message-ID: | CAOuzzgq7AgPU--tQJqB0qsaN72A4DG4it6KPgob7t4LQtTBLSQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> >> On 09/17/2018 06:42 PM, Stephen Frost wrote:
> >>> Ok, good, though I'm not sure what you mean by 'eliminates the
> >>> consistency guarantees provided by the checkpoint'. The point is that
> >>> the page will be in the WAL and the WAL will be replayed during the
> >>> restore of the backup.
> >>
> >> The checkpoint guarantees that the whole page was written and flushed to
> >> disk with an LSN before the checkpoint LSN. So when you read a page with
> >> that LSN, you know the whole write already completed and a read won't
> >> return data from before the LSN.
> >
> > Well, you know that the first part was written out at some prior point,
> > but you could end up reading the first part of a page with an older LSN
> > while also reading the second part with new data.
Doesn't the checkpoint fsync pretty much guarantee this can't happen?
How? Either it’s possible for the latter half of a page to be updated
before the first half (where the LSN lives), or it isn’t. If it’s possible
then that LSN could be ancient and it wouldn’t matter.
>> Without the checkpoint that's not guaranteed, and simply re-reading the
> >> page and rechecking it vs. the first read does not help:
> >>
> >> 1) write the first 512B of the page (sector), which includes the LSN
> >>
> >> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> >>
> >> 3) the checksum verification fails
> >>
> >> 4) read the page again (possibly reading a bit more new data)
> >>
> >> 5) the LSN did not change compared to the first read, yet the checksum
> >> still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely. Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
I don't recall claiming to catch many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
The question is if it’s possible to catch a torn page where the second half
is updated *before* the first half of the page in a read (and then to have
that state maintained across subsequent reads). I have some skepticism that
it can really happen in the first place; having an interrupted system call
stalled across two more system calls just seems terribly unlikely, and this
is all based on the assumption that the kernel might write the second half
of a write before the first to the kernel cache in the first place.
>>> Sure, because we don't care about it any longer- that page isn't
> >>> interesting because the WAL will replay over it. IIRC it actually goes
> >>> something like: check the checksum, if it failed then check if the LSN
> >>> is greater than the checkpoint (of the backup start..), if not, then
> >>> re-read, if the LSN is now newer than the checkpoint then skip, if the
> >>> LSN is the same then throw an error.
> >>
> >> Nope, we only verify the checksum if its LSN precedes the
> >>
> >>
> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> >
> > That seems like it's leaving something on the table, but, to be fair, we
> > know that all of those pages should be rewritten by WAL anyway so they
> > aren't all that interesting to us, particularly in the basebackup case.
> >
>
> Yep.
>
> >>> I actually tend to disagree with you that, for this purpose, it's
> >>> actually necessary to check against the checkpoint LSN- if the LSN
> >>> changed and everything is operating correctly then the new LSN must be
> >>> more recent than the last checkpoint location or things are broken
> >>> badly.
> >>
> >> I don't follow. Are you suggesting we don't need the checkpoint LSN?
> >>
> >> I'm pretty sure that's not the case. The thing is - the LSN may not
> >> change between the two reads, but that's not a guarantee the page was
> >> not torn. The example I posted earlier in this message illustrates that.
> >
> > I agree that there's some risk there, but it's certainly much less
> > likely.
> >
>
> Well. If we're going to report a checksum failure, we better be sure it
> actually is a broken page. I don't want users to start chasing bogus
> data corruption issues.
Yes, I definitely agree that we don’t want to mis-report checksum failures
if we can avoid it.
>>> Now, that said, I do think it's a good *idea* to check against the
> >>> checkpoint LSN (presuming this is for online checking of checksums- for
> >>> basebackup, we could just check against the backup-start LSN as
> anything
> >>> after that point will be rewritten by WAL anyway). The reason that I
> >>> think it's a good idea to check against the checkpoint LSN is that we'd
> >>> want to throw a big warning if the kernel is just feeding us random
> >>> garbage on reads and only finding a difference between two reads isn't
> >>> really doing any kind of validation, whereas checking against the
> >>> checkpoint-LSN would at least give us some idea that the value being
> >>> read isn't completely ridiculous.
> >>>
> >>> When it comes to if the pg_sleep() is necessary or not, I have to admit
> >>> to being unsure about that.. I could see how it might be but it seems
> a
> >>> bit surprising- I'd probably want to see exactly what the page was at
> >>> the time of the failure and at the time of the second (no-sleep)
> re-read
> >>> and then after a delay and convince myself that it was just an unlucky
> >>> case of being scheduled in twice to read that page before the process
> >>> writing it out got a chance to finish the write.
> >>
> >> I think the pg_sleep() is a pretty strong sign there's something broken.
> >> At the very least, it's likely to misbehave on machines with different
> >> timings, machines under CPU and/or memory pressure, etc.
> >
> > If we assume that what you've outlined above is a serious enough issue
> > that we have to address it, and do so without a pg_sleep(), then I think
> > we have to bake into this a way for the process to check with PG as to
> > what the page's current LSN is, in shared buffers, because that's the
> > only place where we've got the locking required to ensure that we don't
> > end up with a read of a partially written page, and I'm really not
> > entirely convinced that we need to go to that level. It'd certainly add
> > a huge amount of additional complexity for what appears to be a quite
> > unlikely gain.
> >
> > I'll chat w/ David shortly about this again though and get his thoughts
> > on it. This is certainly an area we've spent time thinking about but
> > are obviously also open to finding a better solution.
Why not simply look at the last checkpoint LSN and use it the same
> way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
Use that to compare to what? The LSN in the first half of the page could
be from well before the checkpoint or even from before the backup started.
Thanks!
Stephen
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-17 17:38:20 |
Message-ID: | 1537205900.3800.11.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
so, trying some intermediate summary here, sorry for (also) top-posting:
1. the basebackup checksum verification logic only checks pages not
changed since the checkpoint, which makes sense for the basebackup.
2. However, it would be desirable to go further for pg_verify_checksums
and (re-)check all pages.
3. pg_verify_checksums should read the checkpoint LSN on startup and
compare the page LSN against it on re-read, and discard pages which have
checksum failures but are new; a sketch of this follows after this list.
(Maybe it should read new checkpoint LSNs as they come in during its
runtime as well? See below).
4. The pg_sleep should go.
5. There seems to be no consensus on whether the number of skipped pages
should be summarized at the end.
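To make point 3 concrete, here is a minimal sketch of the re-read logic I
have in mind (the function shape, variable names and error handling are
illustrative assumptions, not the actual patch):

    /* Sketch only. Assumes checkpointLSN was read from pg_control once
     * at startup, and buf holds the block just read from fd. */
    static bool
    check_block(int fd, char *buf, BlockNumber blockno, int segmentno,
                XLogRecPtr checkpointLSN)
    {
        PageHeader  header = (PageHeader) buf;
        uint16      csum;

        csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
        if (csum == header->pd_checksum)
            return true;            /* checksum fine on first read */

        /* Possibly a torn read: re-read the block once. */
        if (pread(fd, buf, BLCKSZ, (off_t) blockno * BLCKSZ) != BLCKSZ)
            return false;           /* short read: let the caller decide */

        if (PageGetLSN((Page) buf) >= checkpointLSN)
            return true;            /* rewritten after the checkpoint: torn
                                     * reads are expected here, and WAL
                                     * replay covers the page anyway */

        csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
        return csum == header->pd_checksum;   /* stable failure -> report */
    }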
Further comments:
On Monday, 17.09.2018 at 19:19 +0200, Tomas Vondra wrote:
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > Without the checkpoint that's not guaranteed, and simply re-reading the
> > > page and rechecking it vs. the first read does not help:
> > >
> > > 1) write the first 512B of the page (sector), which includes the LSN
> > >
> > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > >
> > > 3) the checksum verification fails
> > >
> > > 4) read the page again (possibly reading a bit more new data)
> > >
> > > 5) the LSN did not change compared to the first read, yet the checksum
> > > still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely. Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
>
> I don't recall claiming to catch many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
>
> > > > Sure, because we don't care about it any longer- that page isn't
> > > > interesting because the WAL will replay over it. IIRC it actually goes
> > > > something like: check the checksum, if it failed then check if the LSN
> > > > is greater than the checkpoint (of the backup start..), if not, then
> > > > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > > > LSN is the same then throw an error.
> > >
> > > Nope, we only verify the checksum if its LSN precedes the checkpoint:
> > >
> > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> >
> > That seems like it's leaving something on the table, but, to be fair, we
> > know that all of those pages should be rewritten by WAL anyway so they
> > aren't all that interesting to us, particularly in the basebackup case.
>
> Yep.
Right, see point 1 above.
> > > > I actually tend to disagree with you that, for this purpose, it's
> > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > changed and everything is operating correctly then the new LSN must be
> > > > more recent than the last checkpoint location or things are broken
> > > > badly.
> > >
> > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > >
> > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > change between the two reads, but that's not a guarantee the page was
> > > not torn. The example I posted earlier in this message illustrates that.
> >
> > I agree that there's some risk there, but it's certainly much less
> > likely.
>
> Well. If we're going to report a checksum failure, we better be sure it
> actually is a broken page. I don't want users to start chasing bogus
> data corruption issues.
I agree.
> > > > Now, that said, I do think it's a good *idea* to check against the
> > > > checkpoint LSN (presuming this is for online checking of checksums- for
> > > > basebackup, we could just check against the backup-start LSN as anything
> > > > after that point will be rewritten by WAL anyway). The reason that I
> > > > think it's a good idea to check against the checkpoint LSN is that we'd
> > > > want to throw a big warning if the kernel is just feeding us random
> > > > garbage on reads and only finding a difference between two reads isn't
> > > > really doing any kind of validation, whereas checking against the
> > > > checkpoint-LSN would at least give us some idea that the value being
> > > > read isn't completely ridiculous.
Are you suggesting here that we always check against the current
checkpoint, or is checking against the checkpoint that we saw at startup
enough? I think re-reading pg_control all the time might be more
error-prone than what we would gain from it, so I would prefer not to do
this.
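For reference, what I have in mind is reading it once at startup, roughly
like this (a sketch; get_controlfile() lives in
src/common/controldata_utils.c, and its exact signature has varied between
versions):

    bool             crc_ok;
    ControlFileData *ControlFile;
    XLogRecPtr       checkpointLSN;

    /* read pg_control once, bail out if its CRC doesn't verify */
    ControlFile = get_controlfile(DataDir, progname, &crc_ok);
    if (!crc_ok)
    {
        fprintf(stderr, "%s: pg_control CRC check failed\n", progname);
        exit(1);
    }
    checkpointLSN = ControlFile->checkPoint;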
> > > > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > > > to being unsure about that.. I could see how it might be but it seems a
> > > > bit surprising- I'd probably want to see exactly what the page was at
> > > > the time of the failure and at the time of the second (no-sleep) re-read
> > > > and then after a delay and convince myself that it was just an unlucky
> > > > case of being scheduled in twice to read that page before the process
> > > > writing it out got a chance to finish the write.
> > >
> > > I think the pg_sleep() is a pretty strong sign there's something broken.
> > > At the very least, it's likely to misbehave on machines with different
> > > timings, machines under CPU and/or memory pressure, etc.
I swapped out the pg_sleep earlier today for the check-against-
checkpoint-LSN-on-reread, and that seems to work just as well, at least
in the tests I ran.
> > If we assume that what you've outlined above is a serious enough issue
> > that we have to address it, and do so without a pg_sleep(), then I think
> > we have to bake into this a way for the process to check with PG as to
> > what the page's current LSN is, in shared buffers, because that's the
> > only place where we've got the locking required to ensure that we don't
> > end up with a read of a partially written page, and I'm really not
> > entirely convinced that we need to go to that level. It'd certainly add
> > a huge amount of additional complexity for what appears to be a quite
> > unlikely gain.
> >
> > I'll chat w/ David shortly about this again though and get his thoughts
> > on it. This is certainly an area we've spent time thinking about but
> > are obviously also open to finding a better solution.
>
> Why not simply look at the last checkpoint LSN and use it the same
> way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
Right.
Michael
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-17 18:09:45 |
Message-ID: | CAOuzzgq0zxW7-26bKz=2ER8n34Y0iVG2gmB78pOfg0wACKM9Vw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
On Mon, Sep 17, 2018 at 13:38 Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:
> so, trying some intermediate summary here, sorry for (also) top-posting:
>
> 1. the basebackup checksum verification logic only checks pages not
> changed since the checkpoint, which makes sense for the basebackup.
Right. I’m tending towards the idea that this should also be adopted for
pg_verify_checksums.
2. However, it would be desirable to go further for pg_verify_checksums
> and (re-)check all pages.
Maybe. I’m not entirely convinced that it’s all that useful.
3. pg_verify_checksums should read the checkpoint LSN on startup and
> compare the page LSN against it on re-read, and discard pages which have
> checksum failures but are new. (Maybe it should read new checkpoint LSNs
> as they come in during its runtime as well? See below).
I’m not sure that we really need to, but I’m not against it either; in
that case you’re definitely going to see checksum failures on torn pages.
4. The pg_sleep should go.
I know that pgBackRest does not have a sleep currently, and we’ve not yet
seen or been able to reproduce the case where, on a reread, we still see
an older LSN, though we also check the LSN first. If it’s possible that the
LSN still hasn’t changed on the reread, then maybe we do need a sleep to
force ourselves off the CPU and allow the other process to finish its
write, or maybe we should finish the file and come back around to these
pages later; but we have yet to see this behavior in the wild, nor have we
been able to reproduce it.
5. There seems to be no consensus on whether the number of skipped pages
> should be summarized at the end.
I agree with printing the number of skipped pages; that does seem like a
nice-to-have. I don’t know that actually printing the pages themselves is
all that useful, though.
Further comments:
>
> On Monday, 17.09.2018 at 19:19 +0200, Tomas Vondra wrote:
> > On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > > * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> > > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > > Without the checkpoint that's not guaranteed, and simply re-reading
> the
> > > > page and rechecking it vs. the first read does not help:
> > > >
> > > > 1) write the first 512B of the page (sector), which includes the LSN
> > > >
> > > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > > >
> > > > 3) the checksum verification fails
> > > >
> > > > 4) read the page again (possibly reading a bit more new data)
> > > >
> > > > 5) the LSN did not change compared to the first read, yet the
> checksum
> > > > still fails
> > >
> > > So, I agree with all of the above though I've found it to be extremely
> > > rare to get a single read which you've managed to catch part-way
> through
> > > a write, getting multiple of them over a period of time strikes me as
> > > even more unlikely. Still, if we can come up with a solution to solve
> > > all of this, great, but I'm not sure that I'm hearing one.
> >
> > I don't recall claiming to catch many such torn pages - I'm sure it's
> > not very common in most workloads. But I suspect constructing workloads
> > hitting them regularly is not very difficult either (something with a
> > lot of churn in shared buffers should do the trick).
> >
> > > > > Sure, because we don't care about it any longer- that page isn't
> > > > > interesting because the WAL will replay over it. IIRC it actually
> goes
> > > > > something like: check the checksum, if it failed then check if the
> LSN
> > > > > is greater than the checkpoint (of the backup start..), if not,
> then
> > > > > re-read, if the LSN is now newer than the checkpoint then skip, if
> the
> > > > > LSN is the same then throw an error.
> > > >
> > > > Nope, we only verify the checksum if its LSN precedes the
> checkpoint:
> > > >
> > > >
> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> > >
> > > That seems like it's leaving something on the table, but, to be fair,
> we
> > > know that all of those pages should be rewritten by WAL anyway so they
> > > aren't all that interesting to us, particularly in the basebackup case.
> >
> > Yep.
>
> Right, see point 1 above.
>
> > > > > I actually tend to disagree with you that, for this purpose, it's
> > > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > > changed and everything is operating correctly then the new LSN
> must be
> > > > > more recent than the last checkpoint location or things are broken
> > > > > badly.
> > > >
> > > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > > >
> > > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > > change between the two reads, but that's not a guarantee the page was
> > > > not torn. The example I posted earlier in this message illustrates
> that.
> > >
> > > I agree that there's some risk there, but it's certainly much less
> > > likely.
> >
> > Well. If we're going to report a checksum failure, we better be sure it
> > actually is a broken page. I don't want users to start chasing bogus
> > data corruption issues.
>
> I agree.
>
> > > > > Now, that said, I do think it's a good *idea* to check against the
> > > > > checkpoint LSN (presuming this is for online checking of
> checksums- for
> > > > > basebackup, we could just check against the backup-start LSN as
> anything
> > > > > after that point will be rewritten by WAL anyway). The reason
> that I
> > > > > think it's a good idea to check against the checkpoint LSN is that
> we'd
> > > > > want to throw a big warning if the kernel is just feeding us random
> > > > > garbage on reads and only finding a difference between two reads
> isn't
> > > > > really doing any kind of validation, whereas checking against the
> > > > > checkpoint-LSN would at least give us some idea that the value
> being
> > > > > read isn't completely ridiculous.
>
> Are you suggesting here that we always check against the current
> checkpoint, or is checking against the checkpoint that we saw at startup
> enough? I think re-reading pg_control all the time might be more
> error-prone than what we would gain from it, so I would prefer not to do
> this.
I don’t follow why rereading pg_control would be error-prone. That said, I
don’t have a particularly strong opinion either way on this.
> > > > When it comes to if the pg_sleep() is necessary or not, I have to
> admit
> > > > > to being unsure about that.. I could see how it might be but it
> seems a
> > > > > bit surprising- I'd probably want to see exactly what the page was
> at
> > > > > the time of the failure and at the time of the second (no-sleep)
> re-read
> > > > > and then after a delay and convince myself that it was just an
> unlucky
> > > > > case of being scheduled in twice to read that page before the
> process
> > > > > writing it out got a chance to finish the write.
> > > >
> > > > I think the pg_sleep() is a pretty strong sign there's something
> broken.
> > > > At the very least, it's likely to misbehave on machines with
> different
> > > > timings, machines under CPU and/or memory pressure, etc.
>
> I swapped out the pg_sleep earlier today for the check-against-
> checkpoint-LSN-on-reread, and that seems to work just as well, at least
> in the tests I ran.
Ok, this sounds like you were probably seeing normal forward torn pages,
and we have certainly seen that before.
> > If we assume that what you've outlined above is a serious enough issue
> > > that we have to address it, and do so without a pg_sleep(), then I
> think
> > > we have to bake into this a way for the process to check with PG as to
> > > what the page's current LSN is, in shared buffers, because that's the
> > > only place where we've got the locking required to ensure that we don't
> > > end up with a read of a partially written page, and I'm really not
> > > entirely convinced that we need to go to that level. It'd certainly
> add
> > > a huge amount of additional complexity for what appears to be a quite
> > > unlikely gain.
> > >
> > > I'll chat w/ David shortly about this again though and get his thoughts
> > > on it. This is certainly an area we've spent time thinking about but
> > > are obviously also open to finding a better solution.
> >
> > Why not simply look at the last checkpoint LSN and use it the same
> > way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
>
> Right.
This is fine if you know the kernel will always write the first part of the page first,
or you accept that a reread of a page which isn’t valid will always result
in seeing a completely updated page.
We’ve made the assumption that a reread, on a failure where the LSN from
the first read was older than the backup-start LSN, will give us an updated
first half of the page whose LSN we can then check, but we have yet to
prove that this actually holds.
Thanks!
Stephen
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-17 21:33:52 |
Message-ID: | a6f6a9f7-3fb6-1cb5-631a-3e36012abde0@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 09/17/2018 07:35 PM, Stephen Frost wrote:
> Greetings,
>
> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com
> <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
>
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com
> <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>) wrote:
> >> On 09/17/2018 06:42 PM, Stephen Frost wrote:
> >>> Ok, good, though I'm not sure what you mean by 'eliminates the
> >>> consistency guarantees provided by the checkpoint'. The point
> is that
> >>> the page will be in the WAL and the WAL will be replayed during the
> >>> restore of the backup.
> >>
> >> The checkpoint guarantees that the whole page was written and
> flushed to
> >> disk with an LSN before the checkpoint LSN. So when you read a
> page with
> >> that LSN, you know the whole write already completed and a read won't
> >> return data from before the LSN.
> >
> > Well, you know that the first part was written out at some prior
> point,
> > but you could end up reading the first part of a page with an
> older LSN
> > while also reading the second part with new data.
>
>
>
> Doesn't the checkpoint fsync pretty much guarantee this can't happen?
>
>
> How? Either it’s possible for the latter half of a page to be updated
> before the first half (where the LSN lives), or it isn’t. If it’s
> possible then that LSN could be ancient and it wouldn’t matter.
>
I'm not sure I understand what you're saying here.
It is not about the latter half of the page being updated before the first
half. I don't think that's quite possible, because write() into page cache
does in fact write the data sequentially.
The problem is that the write is not atomic, and AFAIK it happens in
sectors (which are either 512B or 4K these days). And it may arbitrarily
interleave with reads.
So you may do write(8k), but it actually happens in 512B chunks and a
concurrent read may observe some mix of those.
But the trick is that if the read sees the effect of the write somewhere
in the middle of the page, the next read is guaranteed to see all the
preceding new data.
Without the checkpoint we risk seeing the same write() both in read and
re-read, just in a different stage - so the LSN would not change, making
the check futile.
But by waiting for the checkpoint we know that the original write is no
longer in progress, so if we saw a partial write we're guaranteed to see
a new LSN on re-read.
This is what I mean by the checkpoint / fsync guarantee.
> >> Without the checkpoint that's not guaranteed, and simply
> re-reading the
> >> page and rechecking it vs. the first read does not help:
> >>
> >> 1) write the first 512B of the page (sector), which includes the LSN
> >>
> >> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> >>
> >> 3) the checksum verification fails
> >>
> >> 4) read the page again (possibly reading a bit more new data)
> >>
> >> 5) the LSN did not change compared to the first read, yet the
> checksum
> >> still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way
> through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely. Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
>
>
> I don't recall claiming to catch many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
>
>
> The question is if it’s possible to catch a torn page where the second
> half is updated *before* the first half of the page in a read (and then
> to have that state maintained across subsequent reads). I have some
> skepticism that it can really happen in the first place; having an
> interrupted system call stalled across two more system calls just seems
> terribly unlikely, and this is all based on the assumption that the
> kernel might write the second half of a write before the first to the
> kernel cache in the first place.
>
Yes, if that was possible, the explanation about the checkpoint fsync
guarantee would be bogus, obviously.
I've spent quite a bit of time looking into how write() is handled, and
I believe seeing only the second half is not possible. You may observe a
page torn in various ways (not necessarily in half), e.g.
[old,new,old]
but then on re-read you should be guaranteed to see new data up until
the last "new" chunk:
[new,new,old]
At least that's my understanding. I failed to deduce what POSIX says
about this, or how it behaves on various OS/filesystems.
The one thing I've done was to write a simple stress test that writes a
single 8kB page in a loop, reads it concurrently and checks the behavior.
And it seems consistent with my understanding.
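(For illustration, a minimal version of such a test could look like the
sketch below; this is an assumption of the shape, not the actual program.
A writer stamps every 512B sector of an 8kB block with a generation
counter, front to back, while the reader checks that a read never shows
older data than what a previous read already saw at the same or a later
sector - which is exactly the prefix guarantee described above.)

    #include <fcntl.h>
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ  8192
    #define SECSZ   512
    #define NSEC    (BLCKSZ / SECSZ)

    int
    main(void)
    {
        char        buf[BLCKSZ];
        uint32_t    gen[NSEC], prevmax[NSEC] = {0};
        int         fd = open("page.bin", O_CREAT | O_RDWR, 0600);
        pid_t       writer = fork();

        if (writer == 0)
        {
            /* writer: stamp all sectors with generation g, then g+1, ... */
            for (uint32_t g = 1;; g++)
            {
                for (int i = 0; i < NSEC; i++)
                    memcpy(buf + i * SECSZ, &g, sizeof(g));
                pwrite(fd, buf, BLCKSZ, 0);
            }
        }

        for (long iter = 0; iter < 1000000; iter++)
        {
            if (pread(fd, buf, BLCKSZ, 0) != BLCKSZ)
                continue;
            for (int i = 0; i < NSEC; i++)
                memcpy(&gen[i], buf + i * SECSZ, sizeof(gen[i]));

            /* if an earlier read saw generation g at sector k, then all
             * sectors up to k must show >= g in this read */
            for (int i = 0; i < NSEC; i++)
                if (gen[i] < prevmax[i])
                    printf("violation at sector %d: %u < %u\n",
                           i, gen[i], prevmax[i]);

            /* remember the suffix maximum for the next iteration */
            uint32_t m = 0;
            for (int i = NSEC - 1; i >= 0; i--)
            {
                if (gen[i] > m)
                    m = gen[i];
                prevmax[i] = m;
            }
        }
        kill(writer, SIGKILL);
        return 0;
    }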
>
> >>> Now, that said, I do think it's a good *idea* to check against the
> >>> checkpoint LSN (presuming this is for online checking of
> checksums- for
> >>> basebackup, we could just check against the backup-start LSN as
> anything
> >>> after that point will be rewritten by WAL anyway). The reason
> that I
> >>> think it's a good idea to check against the checkpoint LSN is
> that we'd
> >>> want to throw a big warning if the kernel is just feeding us random
> >>> garbage on reads and only finding a difference between two reads
> isn't
> >>> really doing any kind of validation, whereas checking against the
> >>> checkpoint-LSN would at least give us some idea that the value being
> >>> read isn't completely ridiculous.
> >>>
> >>> When it comes to if the pg_sleep() is necessary or not, I have
> to admit
> >>> to being unsure about that.. I could see how it might be but it
> seems a
> >>> bit surprising- I'd probably want to see exactly what the page
> was at
> >>> the time of the failure and at the time of the second (no-sleep)
> re-read
> >>> and then after a delay and convince myself that it was just an
> unlucky
> >>> case of being scheduled in twice to read that page before the
> process
> >>> writing it out got a chance to finish the write.
> >>
> >> I think the pg_sleep() is a pretty strong sign there's something
> broken.
> >> At the very least, it's likely to misbehave on machines with
> different
> >> timings, machines under CPU and/or memory pressure, etc.
> >
> > If we assume that what you've outlined above is a serious enough issue
> > that we have to address it, and do so without a pg_sleep(), then I
> think
> > we have to bake into this a way for the process to check with PG as to
> > what the page's current LSN is, in shared buffers, because that's the
> > only place where we've got the locking required to ensure that we
> don't
> > end up with a read of a partially written page, and I'm really not
> > entirely convinced that we need to go to that level. It'd
> certainly add
> > a huge amount of additional complexity for what appears to be a quite
> > unlikely gain.
> >
> > I'll chat w/ David shortly about this again though and get his
> thoughts
> > on it. This is certainly an area we've spent time thinking about but
> > are obviously also open to finding a better solution.
>
>
> Why not simply look at the last checkpoint LSN and use it the same
> way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
>
>
> Use that to compare to what? The LSN in the first half of the page
> could be from well before the checkpoint or even from before the backup started.
>
Not sure I follow. If the LSN in the page header is old, and the
checksum check failed, then on re-read we either find a new LSN (in
which case we skip the page) or consider this to be a checksum failure.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-17 22:01:33 |
Message-ID: | 20180917220133.GC4184@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> > On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com
> > <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
> > Doesn't the checkpoint fsync pretty much guarantee this can't happen?
> >
> > How? Either it’s possible for the latter half of a page to be updated
> > before the first half (where the LSN lives), or it isn’t. If it’s
> > possible then that LSN could be ancient and it wouldn’t matter.
>
> I'm not sure I understand what you're saying here.
>
> It is not about the latter half of the page being updated before the first
> half. I don't think that's quite possible, because write() into page cache
> does in fact write the data sequentially.
Well, maybe 'updated before' wasn't quite the right way to talk about
it, but consider if a read(8K) gets only half-way through the copy
before having to go do something else and by the time it gets back, a
write has come in and rewritten the page, such that the read(8K)
returns half-old and half-new data.
> The problem is that the write is not atomic, and AFAIK it happens in
> sectors (which are either 512B or 4K these days). And it may arbitrarily
> interleave with reads.
Yes, of course the write isn't atomic, that's clear.
> So you may do write(8k), but it actually happens in 512B chunks and a
> concurrent read may observe some mix of those.
Right, I'm not sure that we really need to worry about sub-4K writes
though I suppose they're technically possible, but it doesn't much
matter in this case since the LSN is early on in the page, of course.
> But the trick is that if the read sees the effect of the write somewhere
> in the middle of the page, the next read is guaranteed to see all the
> preceding new data.
If that's guaranteed then we can just check the LSN and be done.
> Without the checkpoint we risk seeing the same write() both in read and
> re-read, just in a different stage - so the LSN would not change, making
> the check futile.
This is the part that isn't making much sense to me. If we are
guaranteed that writes into the kernel cache are always in order and
always at least 512B in size, then if we check the LSN first and
discover it's "old", and then read the rest of the page and calculate
the checksum, discover it's a bad checksum, and then go back and re-read
the page then we *must* see that the LSN has changed OR conclude that
the checksum is invalidated.
The reason this can happen in the first place is that our 8K read might
only get half-way done before getting scheduled off and an 8K write
happens on the page before our read(8K) gets back to finishing the
read, but if what you're saying is true, then we can't ever have a case
where such a thing would happen and a re-read would still see the "old"
LSN.
If we check the LSN first and discover it's "new" (as in, more recent
than our last checkpoint, or the checkpoint where the backup started)
then, sure, there's going to be a risk that the page is currently being
written right that moment and isn't yet completely valid.
The problem that we aren't solving for is if, somehow, we do a read(8K)
and get the first half/second half mixup and then on a subsequent
read(8K) we see that *again*, implying that somehow the kernel's copy
has the latter-half of the page updated consistently but not the first
half. That's a problem that I haven't got a solution to today. I'd
love to have a guarantee that it's not possible- we've certainly never
seen it but it's been a concern and I thought Michael was suggesting
he'd seen that, but it sounds like there wasn't a check on the LSN in
the first read, in which case it could have just been a 'regular' torn
page case.
> But by waiting for the checkpoint we know that the original write is no
> longer in progress, so if we saw a partial write we're guaranteed to see
> a new LSN on re-read.
>
> This is what I mean by the checkpoint / fsync guarantee.
I don't think any of this really has anything to do with either fsync
being called or with the actual checkpointing process (except to the
extent that the checkpointer is the thing doing the writing, and that we
should be checking the LSN against the LSN of the last checkpoint when
we started, or against the start of the backup LSN if we're talking
about doing a backup).
> > The question is if it’s possible to catch a torn page where the second
> > half is updated *before* the first half of the page in a read (and then
> > to have that state maintained across subsequent reads). I have some
> > skepticism that it can really happen in the first place; having an
> > interrupted system call stalled across two more system calls just seems
> > terribly unlikely, and this is all based on the assumption that the
> > kernel might write the second half of a write before the first to the
> > kernel cache in the first place.
>
> Yes, if that was possible, the explanation about the checkpoint fsync
> guarantee would be bogus, obviously.
>
> I've spent quite a bit of time looking into how write() is handled, and
> I believe seeing only the second half is not possible. You may observe a
> page torn in various ways (not necessarily in half), e.g.
>
> [old,new,old]
>
> but then on re-read you should be guaranteed to see new data up until
> the last "new" chunk:
>
> [new,new,old]
>
> At least that's my understanding. I failed to deduce what POSIX says
> about this, or how it behaves on various OS/filesystems.
>
> The one thing I've done was to write a simple stress test that writes a
> single 8kB page in a loop, reads it concurrently and checks the behavior.
> And it seems consistent with my understanding.
Good.
> > Use that to compare to what? The LSN in the first half of the page
> > could be from well before the checkpoint or even from before the backup started.
>
> Not sure I follow. If the LSN in the page header is old, and the
> checksum check failed, then on re-read we either find a new LSN (in
> which case we skip the page) or consider this to be a checksum failure.
Right, I'm in agreement with doing that and it's what is done in
pg_basebackup and pgBackRest.
Thanks!
Stephen
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-18 00:34:35 |
Message-ID: | a96dcaa9-3e36-bcb5-34f3-804fefd2a571@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 09/18/2018 12:01 AM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
>> On 09/17/2018 07:35 PM, Stephen Frost wrote:
>>> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com
>>> <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
>>> Doesn't the checkpoint fsync pretty much guarantee this can't happen?
>>>
>>> How? Either it’s possible for the latter half of a page to be updated
>>> before the first half (where the LSN lives), or it isn’t. If it’s
>>> possible then that LSN could be ancient and it wouldn’t matter.
>>
>> I'm not sure I understand what you're saying here.
>>
>> It is not about the latter half of the page being updated before the first
>> half. I don't think that's quite possible, because write() into page cache
>> does in fact write the data sequentially.
>
> Well, maybe 'updated before' wasn't quite the right way to talk about
> it, but consider if a read(8K) gets only half-way through the copy
> before having to go do something else and by the time it gets back, a
> write has come in and rewritten the page, such that the read(8K)
> returns half-old and half-new data.
>
>> The problem is that the write is not atomic, and AFAIK it happens in
>> sectors (which are either 512B or 4K these days). And it may arbitrarily
>> interleave with reads.
>
> Yes, of course the write isn't atomic, that's clear.
>
>> So you may do write(8k), but it actually happens in 512B chunks and a
>> concurrent read may observe some mix of those.
>
> Right, I'm not sure that we really need to worry about sub-4K writes
> though I suppose they're technically possible, but it doesn't much
> matter in this case since the LSN is early on in the page, of course.
>
>> But the trick is that if the read sees the effect of the write somewhere
>> in the middle of the page, the next read is guaranteed to see all the
>> preceding new data.
>
> If that's guaranteed then we can just check the LSN and be done.
>
What do you mean by "check the LSN"? Compare it to the LSN from the first
read? You don't know if the first read already saw the new LSN or not
(see the next example).
>> Without the checkpoint we risk seeing the same write() both in read and
>> re-read, just in a different stage - so the LSN would not change, making
>> the check futile.
>
> This is the part that isn't making much sense to me. If we are
> guaranteed that writes into the kernel cache are always in order and
> always at least 512B in size, then if we check the LSN first and
> discover it's "old", and then read the rest of the page and calculate
> the checksum, discover it's a bad checksum, and then go back and re-read
> the page then we *must* see that the LSN has changed OR conclude that
> the checksum is invalidated.
>
Even if the writes are in order and in 512B chunks, you don't know how
they are interleaved with the reads.
Let's assume we're doing a write(), which splits the 8kB page into 512B
chunks. A concurrent read may observe a random mix of old and new data,
depending on timing.
So let's say a read sees the first 4kB of data (eight 512B chunks) like this:
[new, new, new, old, new, old, new, old]
OK, the page is obviously torn, checksum fails, and we try reading it
again. We should see new data at least until the last 'new' chunk in the
first read, so let's say we got this:
[new, new, new, new, new, new, new, old]
Obviously, this page is also torn (there are old data at the end), but
we've read the new data in both cases, which includes the LSN. So the
LSN is the same in both cases, and your detection fails.
Comparing the page LSN to the last checkpoint LSN solves this, because
if the LSN is older than the checkpoint LSN, that write must have been
completed by now, and so we're not in danger of seeing only incomplete
effects of it. And a newer write will update the LSN.
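(In code form, the decision rule is then simply this; an illustrative
fragment, where checksum_ok() and recheck_ok() are hypothetical helpers
around pg_checksum_page() and the re-read:)

    if (checksum_ok(buf, blockno))
        continue;               /* page verified fine */
    else if (PageGetLSN((Page) buf) >= checkpointLSN)
        skippedblocks++;        /* written after the checkpoint: a torn
                                 * read is possible but harmless */
    else if (!recheck_ok(fd, buf, blockno))
        badblocks++;            /* old LSN and still a bad checksum on
                                 * re-read: genuine corruption */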
> The reason this can happen in the first place is that our 8K read might
> only get half-way done before getting scheduled off and an 8K write
> happens on the page before our read(8K) gets back to finishing the
> read, but if what you're saying is true, then we can't ever have a case
> where such a thing would happen and a re-read would still see the "old"
> LSN.
>
> If we check the LSN first and discover it's "new" (as in, more recent
> than our last checkpoint, or the checkpoint where the backup started)
> then, sure, there's going to be a risk that the page is currently being
> written right that moment and isn't yet completely valid.
>
Right.
> The problem that we aren't solving for is if, somehow, we do a read(8K)
> and get the first half/second half mixup and then on a subsequent
> read(8K) we see that *again*, implying that somehow the kernel's copy
> has the latter-half of the page updated consistently but not the first
> half. That's a problem that I haven't got a solution to today. I'd
> love to have a guarantee that it's not possible- we've certainly never
> seen it but it's been a concern and I thought Michael was suggesting
> he'd seen that, but it sounds like there wasn't a check on the LSN in
> the first read, in which case it could have just been a 'regular' torn
> page case.
>
Well, yeah. If that were possible, we'd be in serious trouble. I've
done quite a bit of experimentation with concurrent reads and writes and
I have not observed such behavior. Of course, that's hardly a proof it
can't happen, and it wouldn't be the first surprise with respect to
kernel I/O this year ...
>> But by waiting for the checkpoint we know that the original write is no
>> longer in progress, so if we saw a partial write we're guaranteed to see
>> a new LSN on re-read.
>>
>> This is what I mean by the checkpoint / fsync guarantee.
>
> I don't think any of this really has anything to do with either fsync
> being called or with the actual checkpointing process (except to the
> extent that the checkpointer is the thing doing the writing, and that we
> should be checking the LSN against the LSN of the last checkpoint when
> we started, or against the start of the backup LSN if we're talking
> about doing a backup).
>
You're right it's not about the fsync, sorry for the confusion. My point
is that using the checkpoint LSN gives us a guarantee that the write is no
longer in progress, and so we can't see a page torn because of it. And
if we see a partial write due to a new write, it's guaranteed to update
the page LSN (and we'll notice it).
>>> The question is if it’s possible to catch a torn page where the second
>>> half is updated *before* the first half of the page in a read (and then
>>> to have that state maintained across subsequent reads). I have some
>>> skepticism that it can really happen in the first place; having an
>>> interrupted system call stalled across two more system calls just seems
>>> terribly unlikely, and this is all based on the assumption that the
>>> kernel might write the second half of a write before the first to the
>>> kernel cache in the first place.
>>
>> Yes, if that was possible, the explanation about the checkpoint fsync
>> guarantee would be bogus, obviously.
>>
>> I've spent quite a bit of time looking into how write() is handled, and
>> I believe seeing only the second half is not possible. You may observe a
>> page torn in various ways (not necessarily in half), e.g.
>>
>> [old,new,old]
>>
>> but then on re-read you should be guaranteed to see new data up until
>> the last "new" chunk:
>>
>> [new,new,old]
>>
>> At least that's my understanding. I failed to deduce what POSIX says
>> about this, or how it behaves on various OS/filesystems.
>>
>> The one thing I've done was to write a simple stress test that writes a
>> single 8kB page in a loop, reads it concurrently and checks the behavior.
>> And it seems consistent with my understanding.
>
> Good.
>
>>> Use that to compare to what? The LSN in the first half of the page
>>> could be from well before the checkpoint or even from before the backup started.
>>
>> Not sure I follow. If the LSN in the page header is old, and the
>> checksum check failed, then on re-read we either find a new LSN (in
>> which case we skip the page) or consider this to be a checksum failure.
>
> Right, I'm in agreement with doing that and it's what is done in
> pg_basebackup and pgBackRest.
>
OK. All I'm saying is pg_verify_checksums should probably do the same
thing, i.e. grab checkpoint LSN and roll with that.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-18 00:45:06 |
Message-ID: | 20180918004506.GF4184@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> On 09/18/2018 12:01 AM, Stephen Frost wrote:
> > * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> >> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> >> But the trick is that if the read sees the effect of the write somewhere
> >> in the middle of the page, the next read is guaranteed to see all the
> >> preceding new data.
> >
> > If that's guaranteed then we can just check the LSN and be done.
>
> What do you mean by "check the LSN"? Compare it to the LSN from the first
> read? You don't know if the first read already saw the new LSN or not
> (see the next example).
Hmm, ok, I can see your point there. I've been going back and forth
between checking against what the prior LSN was on the page and checking
it against an independent source (like the last checkpoint's LSN), but..
[...]
> Comparing the page LSN to the last checkpoint LSN solves this, because
> if the LSN is older than the checkpoint LSN, that write must have been
> completed by now, and so we're not in danger of seeing only incomplete
> effects of it. And a newer write will update the LSN.
Yeah, that makes sense- we need to be looking at something which only
gets updated once the write has actually completed, and the last
checkpoint's LSN gives us that guarantee.
> > The problem that we aren't solving for is if, somehow, we do a read(8K)
> > and get the first half/second half mixup and then on a subsequent
> > read(8K) we see that *again*, implying that somehow the kernel's copy
> > has the latter-half of the page updated consistently but not the first
> > half. That's a problem that I haven't got a solution to today. I'd
> > love to have a guarantee that it's not possible- we've certainly never
> > seen it but it's been a concern and I thought Michael was suggesting
> > he'd seen that, but it sounds like there wasn't a check on the LSN in
> > the first read, in which case it could have just been a 'regular' torn
> > page case.
>
> Well, yeah. If that were possible, we'd be in serious trouble. I've
> done quite a bit of experimentation with concurrent reads and writes and
> I have not observed such behavior. Of course, that's hardly a proof it
> can't happen, and it wouldn't be the first surprise with respect to
> kernel I/O this year ...
I'm glad to hear that you've done a lot of experimentation in this area
and haven't seen such strange behavior happen- we've got quite a few
people running pgBackRest with checksum-checking and haven't seen it
either, but it's always been a bit of a concern.
> You're right it's not about the fsync, sorry for the confusion. My point
> is that using the checkpoint LSN gives us a guarantee that the write is no
> longer in progress, and so we can't see a page torn because of it. And
> if we see a partial write due to a new write, it's guaranteed to update
> the page LSN (and we'll notice it).
Right, no worries about the confusion, I hadn't been fully thinking
through the LSN bit either; what we really need is some external
confirmation of a write having *completed* (not just started), and that
makes a definite difference.
> > Right, I'm in agreement with doing that and it's what is done in
> > pg_basebackup and pgBackRest.
>
> OK. All I'm saying is pg_verify_checksums should probably do the same
> thing, i.e. grab checkpoint LSN and roll with that.
Agreed.
Thanks!
Stephen
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-18 10:11:42 |
Message-ID: | 1537265502.3800.14.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Monday, 17.09.2018 at 14:09 -0400, Stephen Frost wrote:
> > 5. There seems to be no consensus on whether the number of skipped pages
> > should be summarized at the end.
>
> I agree with printing the number of skipped pages, that does seem like
> a nice to have. I don’t know that actually printing the pages
> themselves is all that useful though.
Oh ok - I never intended to print out the block numbers themselves, just
the final number of skipped blocks in the summary. So I guess that's
fine and I will add that in my branch.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-18 10:37:02 |
Message-ID: | 1537267022.3800.17.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi.
On Monday, 17.09.2018 at 20:45 -0400, Stephen Frost wrote:
> > You're right it's not about the fsync, sorry for the confusion. My point
> > is that using the checkpoint LSN gives us a guarantee that write is no
> > longer in progress, and so we can't see a page torn because of it. And
> > if we see a partial write due to a new write, it's guaranteed to update
> > the page LSN (and we'll notice it).
>
> Right, no worries about the confusion, I hadn't been fully thinking
> through the LSN bit either; what we really need is some external
> confirmation of a write having *completed* (not just started) and that
> makes a definite difference.
>
> > > Right, I'm in agreement with doing that and it's what is done in
> > > pg_basebackup and pgBackRest.
> >
> > OK. All I'm saying is pg_verify_checksums should probably do the same
> > thing, i.e. grab checkpoint LSN and roll with that.
>
> Agreed.
I've attached the patch I added to my branch to swap out the pg_sleep()
with a check against the checkpoint LSN on a recheck verification
failure.
Let me know if there are still issues with it. I'll send a new patch for
the whole online verification feature in a bit.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
pg_verify_checksums_recheck_lsn.patch | text/x-patch | 2.9 KB |
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-18 14:37:22 |
Message-ID: | 1537281442.3800.20.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
please find attached version 2 of the patch.
On Thursday, 26.07.2018 at 13:59 +0200, Michael Banck wrote:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.
>
> I've tested this in a tight loop (while true; do pg_verify_checksums -D
> data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> done", which I already used to develop the original code in the fork and
> which brought up a few bugs.
>
> I got one checksums verification failure this way, all others were
> caught by the recheck (I've introduced a 500ms delay for the first ten
> failures) like this:
>
> > pg_verify_checksums: checksum verification failed on first attempt in
> > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > expected 5063
> > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > verified ok on recheck
I have now changed this from the pg_sleep() to a check against the
checkpoint LSN as discussed upthread.
> However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> failures like this:
>
> > pg_verify_checksums: short read of block 2644 in file
> > "data1/base/16637/16650", got only 4096 bytes
>
> This is not strictly a verification failure, should we do anything about
> this? In my fork, I am also rechecking on this[3] (and I am happy to
> extend the patch that way), but that makes the code and the patch more
> complicated and I wanted to check the general opinion on this case
> first.
I have now added a retry for this as well, again without a pg_sleep().
This catches around 80% of the half-reads, but a few slip through. At
that point we bail out with exit(1), and the user can try again, which I
think is fine?
Alternatively, we could just skip to the next file then and not count
it as a checksum failure.
Other changes from V1:
1. Rebased to 422952ee
2. Ignore ENOENT failure during file open and skip to next file
3. Mention total number of skipped blocks during the summary at the end
of the run
4. Skip files starting with pg_internal.init*
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V2.patch | text/x-patch | 5.9 KB |
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-18 15:45:36 |
Message-ID: | 20180918154536.GE4184@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> please find attached version 2 of the patch.
>
> On Thursday, 26.07.2018 at 13:59 +0200, Michael Banck wrote:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
> >
> > I've tested this in a tight loop (while true; do pg_verify_checksums -D
> > data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> > createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> > done", which I already used to develop the original code in the fork and
> > which brought up a few bugs.
> >
> > I got one checksums verification failure this way, all others were
> > caught by the recheck (I've introduced a 500ms delay for the first ten
> > failures) like this:
> >
> > > pg_verify_checksums: checksum verification failed on first attempt in
> > > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > > expected 5063
> > > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > > verified ok on recheck
>
> I have now changed this from the pg_sleep() to a check against the
> checkpoint LSN as discussed upthread.
Ok.
> > However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> > failures like this:
> >
> > > pg_verify_checksums: short read of block 2644 in file
> > > "data1/base/16637/16650", got only 4096 bytes
> >
> > This is not strictly a verification failure, should we do anything about
> > this? In my fork, I am also rechecking on this[3] (and I am happy to
> > extend the patch that way), but that makes the code and the patch more
> > complicated and I wanted to check the general opinion on this case
> > first.
>
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?
No, this is perfectly normal behavior, as is having completely blank
pages, now that I think about it. If we get a short read then I'd say
we simply check that we got an EOF and, in that case, we just move on.
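In rough code, that's something like this (a simplified sketch of the
loop, not the actual patch; the function name and error handling are
invented):

#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Sketch: per-file read loop that treats a short read as end-of-file. */
static void
scan_file(int fd, const char *fn)
{
    char    buf[BLCKSZ];
    ssize_t r;

    for (;;)
    {
        r = read(fd, buf, BLCKSZ);
        if (r == 0)
            break;          /* clean EOF, done with this file */
        if (r < 0)
        {
            fprintf(stderr, "could not read \"%s\"\n", fn);
            return;         /* a real read() failure is still reported */
        }
        if (r < BLCKSZ)
            break;          /* short read: partial page at EOF while the
                               relation is being extended, just move on */
        /* ... verify the checksum of buf here ... */
    }
}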
> Alternatively, we could just skip to the next file then and don't make
> it count as a checksum failure.
No, I wouldn't count it as a checksum failure. We could possibly count
it towards the skipped pages, though I'm even on the fence about that.
Thanks!
Stephen
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-18 17:52:03 |
Message-ID: | 47e26e3d-989f-b034-f2fc-926b67cc22bf@pgmasters.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 9/18/18 11:45 AM, Stephen Frost wrote:
> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
>> I have added a retry for this as well now, without a pg_sleep() as well.
>
>> This catches around 80% of the half-reads, but a few slip through. At
>> that point we bail out with exit(1), and the user can try again, which I
>> think is fine?
>
> No, this is perfectly normal behavior, as is having completely blank
> pages, now that I think about it. If we get a short read then I'd say
> we simply check that we got an EOF and, in that case, we just move on.
>
>> Alternatively, we could just skip to the next file then and don't make
>> it count as a checksum failure.
>
> No, I wouldn't count it as a checksum failure. We could possibly count
> it towards the skipped pages, though I'm even on the fence about that.
+1 for it not being a failure. Personally I'd count it as a skipped
page, since we know the page exists but it can't be verified.
The other option is to wait for the page to stabilize, which doesn't
seem like it would take very long in most cases -- unless you are doing
this test from another host with shared storage. Then I would expect to
see all kinds of interesting torn pages after the last checkpoint.
Regards,
--
-David
david(at)pgmasters(dot)net
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-19 13:52:53 |
Message-ID: | 1537365173.3800.26.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Tuesday, 18.09.2018 at 13:52 -0400, David Steele wrote:
> On 9/18/18 11:45 AM, Stephen Frost wrote:
> > * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > No, this is perfectly normal behavior, as is having completely blank
> > pages, now that I think about it. If we get a short read then I'd say
> > we simply check that we got an EOF and, in that case, we just move on.
> >
> > > Alternatively, we could just skip to the next file then and don't make
> > > it count as a checksum failure.
> >
> > No, I wouldn't count it as a checksum failure. We could possibly count
> > it towards the skipped pages, though I'm even on the fence about that.
>
> +1 for it not being a failure. Personally I'd count it as a skipped
> page, since we know the page exists but it can't be verified.
>
> The other option is to wait for the page to stabilize, which doesn't
> seem like it would take very long in most cases -- unless you are doing
> this test from another host with shared storage. Then I would expect to
> see all kinds of interesting torn pages after the last checkpoint.
OK, I'm now skipping the block on the first try, as this (i) makes
sense and (ii) simplifies the code (again).
Version 3 is attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V3.patch | text/x-patch | 5.2 KB |
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-26 11:23:44 |
Message-ID: | alpine.DEB.2.21.1809191738070.901@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hallo Michael,
Patch v3 applies cleanly, code compiles and make check is ok, but the
command is probably not tested anywhere, as already mentioned on other
threads.
The patch is missing a documentation update.
There are debatable changes of behavior:
if (errno == ENOENT) return / continue...
For instance, a file disappearing is ok online, but not so if offline. On
the other hand, the probability that a file suddenly disappears while the
server is offline looks remote, so reporting such issues does not seem
useful.
However I'm more wary of the other continues/skips added. ISTM that
skipping a block because of a read error, or because it is new, or for
some other reason, is not the same thing, so these should be counted &
reported differently?
+ if (block_retry == false)
Why not trust boolean operations?
if (!block_retry)
--
Fabien.
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-26 14:37:18 |
Message-ID: | 1537972638.3800.39.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Wednesday, 26.09.2018 at 13:23 +0200, Fabien COELHO wrote:
> Patch v3 applies cleanly, code compiles and make check is ok, but the
> command is probably not tested anywhere, as already mentioned on other
> threads.
Right.
> The patch is missing a documentation update.
I've added that now. I think the only change needed was removing the
"server needs to be offline" part?
> There are debatable changes of behavior:
>
> if (errno == ENOENT) return / continue...
>
> For instance, a file disappearing is ok online, but not so if offline. On
> the other hand, the probability that a file suddenly disappears while the
> server is offline looks remote, so reporting such issues does not seem
> useful.
>
> However I'm more wary with other continues/skips added. ISTM that skipping
> a block because of a read error, or because it is new, or some other
> reasons, is not the same thing, so should be counted & reported
> differently?
I think that would complicate things further without a lot of benefit.
After all, we are interested in checksum failures, not necessarily read
failures etc., so exiting on them (and skipping checks of possibly large
parts of PGDATA) looks undesirable to me.
So I have made no changes in this part so far; what do others think
about this?
> + if (block_retry == false)
>
> Why not trust boolean operations?
>
> if (!block_retry)
I've changed that as well.
Version 4 is attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V4.patch | text/x-patch | 6.0 KB |
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-26 14:54:45 |
Message-ID: | 20180926145445.GZ4184@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> On Wednesday, 26.09.2018 at 13:23 +0200, Fabien COELHO wrote:
> > There are debatable changes of behavior:
> >
> > if (errno == ENOENT) return / continue...
> >
> > For instance, a file disappearing is ok online, but not so if offline. On
> > the other hand, the probability that a file suddenly disappears while the
> > server is offline looks remote, so reporting such issues does not seem
> > useful.
> >
> > However I'm more wary with other continues/skips added. ISTM that skipping
> > a block because of a read error, or because it is new, or some other
> > reasons, is not the same thing, so should be counted & reported
> > differently?
>
> I think that would complicate things further without a lot of benefit.
>
> After all, we are interested in checksum failures, not necessarily read
> failures etc. so exiting on them (and skip checking possibly large parts
> of PGDATA) looks undesirable to me.
>
> So I have done no changes in this part so far, what do others think
> about this?
I certainly don't see a lot of point in doing much more than what was
discussed previously for 'new' blocks (counting them as skipped and
moving on).
An actual read() error (that is, a failure on a read() call such as
getting back EIO), on the other hand, is something which I'd probably
report back to the user immediately and then move on, and perhaps
report again at the end.
Note that a short read isn't an error and falls under the 'new' blocks
discussion above.
Thanks!
Stephen
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-26 15:14:02 |
Message-ID: | alpine.DEB.2.21.1809261703520.22248@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
>> The patch is missing a documentation update.
>
> I've added that now. I think the only change needed was removing the
> "server needs to be offline" part?
Yes, and also checking that the described behavior corresponds to the new
version.
>> There are debatable changes of behavior:
>>
>> if (errno == ENOENT) return / continue...
>>
>> For instance, a file disappearing is ok online, but not so if offline. On
>> the other hand, the probability that a file suddenly disappears while the
>> server is offline looks remote, so reporting such issues does not seem
>> useful.
>>
>> However I'm more wary with other continues/skips added. ISTM that skipping
>> a block because of a read error, or because it is new, or some other
>> reasons, is not the same thing, so should be counted & reported
>> differently?
>
> I think that would complicate things further without a lot of benefit.
>
> After all, we are interested in checksum failures, not necessarily read
> failures etc. so exiting on them (and skip checking possibly large parts
> of PGDATA) looks undesirable to me.
Hmmm.
I'm really saying that it is debatable, so here is some fuel to the
debate:
If I run the check command and it cannot do its job, there is a problem
which is as bad as a failing checksum. The only safe assumption on a
cannot-read block is that the checksum is bad... So ISTM that on some
of the "skipped" errors there should be an appropriate report (exit
code, final output) that something is amiss.
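Concretely, the kind of end-of-run reporting I am after could be as
simple as this (a hypothetical sketch, all names invented):

#include <stdio.h>
#include <stdlib.h>

/*
 * An unreadable block is not proven corrupt, but it is not proven good
 * either, so the final output and the exit status should say that not
 * everything could be verified.
 */
static void
final_report(long badblocks, long unreadable)
{
    printf("Bad checksums:     %ld\n", badblocks);
    printf("Unverified blocks: %ld\n", unreadable);

    if (badblocks > 0)
        exit(1);            /* verified corruption */
    if (unreadable > 0)
        exit(2);            /* verification incomplete */
    exit(0);
}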
--
Fabien.
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-26 15:15:27 |
Message-ID: | 1537974927.3800.41.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Wednesday, 26.09.2018 at 10:54 -0400, Stephen Frost wrote:
> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> > On Wednesday, 26.09.2018 at 13:23 +0200, Fabien COELHO wrote:
> > > There are debatable changes of behavior:
> > >
> > > if (errno == ENOENT) return / continue...
> > >
> > > For instance, a file disappearing is ok online, but not so if offline. On
> > > the other hand, the probability that a file suddenly disappears while the
> > > server is offline looks remote, so reporting such issues does not seem
> > > useful.
> > >
> > > However I'm more wary with other continues/skips added. ISTM that skipping
> > > a block because of a read error, or because it is new, or some other
> > > reasons, is not the same thing, so should be counted & reported
> > > differently?
> >
> > I think that would complicate things further without a lot of benefit.
> >
> > After all, we are interested in checksum failures, not necessarily read
> > failures etc. so exiting on them (and skip checking possibly large parts
> > of PGDATA) looks undesirable to me.
> >
> > So I have done no changes in this part so far, what do others think
> > about this?
>
> I certainly don't see a lot of point in doing much more than what was
> discussed previously for 'new' blocks (counting them as skipped and
> moving on).
>
> An actual read() error (that is, a failure on a read() call such as
> getting back EIO), on the other hand, is something which I'd probably
> report back to the user immediately and then move on, and perhaps
> report again at the end.
>
> Note that a short read isn't an error and falls under the 'new' blocks
> discussion above.
So I've added ENOENT checks when opening or statting files, i.e. EIO
would still be reported.
The current code in master exits on reads which do not return BLCKSZ,
which I've changed to a skip. That meant we no longer checked for read
failures (return code < 0) at all, so I have now added a check for that
which emits an error message and returns.
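The open() path now looks roughly like this (a simplified sketch of the
logic, not a verbatim excerpt from the patch):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * ENOENT just means the relation was dropped between readdir() and
 * open(), which is expected on a running cluster, so skip the file;
 * any other error (EIO, EACCES, ...) is still reported.
 */
static int
open_relfile(const char *fn)
{
    int     fd = open(fn, O_RDONLY, 0);

    if (fd < 0)
    {
        if (errno == ENOENT)
            return -1;      /* file vanished, skip to the next one */
        fprintf(stderr, "could not open file \"%s\": %s\n",
                fn, strerror(errno));
        exit(1);
    }
    return fd;
}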
New version 5 attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V5.patch | text/x-patch | 6.1 KB |
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-26 15:18:21 |
Message-ID: | alpine.DEB.2.21.1809261714060.22248@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello Stephen,
> I certainly don't see a lot of point in doing much more than what was
> discussed previously for 'new' blocks (counting them as skipped and
> moving on).
Sure.
> An actual read() error (that is, a failure on a read() call such as
> getting back EIO), on the other hand, is something which I'd probably
> report back to the user immediately and then move on, and perhaps
> report again at the end.
Yep.
> Note that a short read isn't an error and falls under the 'new' blocks
> discussion above.
I'm really unsure that a short read should just be coldly skipped:
If the check is offline, then one file is in a very bad state, and this
is really a panic situation.
If the check is online, given that both postgres and the verify command
interact with the same OS (?) and at the pg page level, I'm not sure in
which situation there could be a partial block, because pg would only
send full pages to the OS.
--
Fabien.
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-26 15:30:31 |
Message-ID: | 20180926153031.GB4184@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Fabien COELHO (coelho(at)cri(dot)ensmp(dot)fr) wrote:
> >Note that a short read isn't an error and falls under the 'new' blocks
> >discussion above.
>
> I'm really unsure that a short read should really be coldly skipped:
>
> If the check is offline, then one file is in a very bad state, this is
> really a panic situation.
Why? Are we sure that's really something which can't ever happen, even
if the database was shut down with 'immediate'? I don't think it can
but that's something to consider. In any case, my comments were coming
specifically from an 'online' perspective.
> If the check is online, given that both postgres and the verify command
> interact with the same OS (?) and at the pg page level, I'm not sure in
> which situation there could be a partial block, because pg would only send
> full pages to the OS.
The OS doesn't operate at the same level that PG does- a single write in
PG could get blocked and scheduled off after having only copied half of
the 8k that PG sends. This isn't really debatable- we've seen it happen
and everything is operating perfectly correctly, it just happens that
you were able to get a read() at the same time a write() was happening
and that only part of the page had been updated at that point.
Thanks!
Stephen
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-29 08:27:43 |
Message-ID: | 7fd462c9-27c1-4ba9-3cf2-83f9bf0ed7ef@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 09/26/2018 05:15 PM, Michael Banck wrote:
> ...
>
> New version 5 attached.
>
I've looked at v5, and the retry/recheck logic seems OK to me - I'd
still vote to keep it consistent with what pg_basebackup does (i.e.
doing the LSN check first, before looking at the checksum), but I don't
think it's a bug.
I'm not sure about the other issues brought up (ENOENT, short reads). I
haven't given them much thought.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-29 08:51:23 |
Message-ID: | ad9df50b-6d9d-f91e-8146-45c0b2cd6c9b@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
One more thought - when running similar tools on a live system, it's
usually a good idea to limit the impact by throttling the throughput. As
the verification runs in an independent process it can't reuse the
vacuum-like cost limit directly, but perhaps it could do something
similar? Like, limit the number of blocks read/second, or so?
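Even something as crude as this would be a start (a sketch only; a real
implementation would measure elapsed time the way pg_basebackup's
--max-rate throttling does, instead of sleeping after every block):

#include <time.h>

/*
 * Called once per block read: caps the scan at roughly
 * max_blocks_per_sec by sleeping for the per-block time budget.
 * Ignores the time spent actually reading, so the real rate ends up
 * somewhat below the target.
 */
static void
throttle(long max_blocks_per_sec)
{
    struct timespec ts;

    if (max_blocks_per_sec <= 0)
        return;             /* throttling disabled */

    ts.tv_sec = 0;
    ts.tv_nsec = 1000000000L / max_blocks_per_sec;
    nanosleep(&ts, NULL);
}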
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-29 09:20:33 |
Message-ID: | 20180929092033.GE1823@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Sep 29, 2018 at 10:51:23AM +0200, Tomas Vondra wrote:
> One more thought - when running similar tools on a live system, it's
> usually a good idea to limit the impact by throttling the throughput. As
> the verification runs in an independent process it can't reuse the
> vacuum-like cost limit directly, but perhaps it could do something
> similar? Like, limit the number of blocks read/second, or so?
When it comes to such parameters, not using a number of blocks but
throttling with a value in bytes (kB or MB of course) speaks more to the
user. The past experience with checkpoint_segments is one example of
that. Converting that to a number of blocks internally would definitely
make the most sense. +1 for this idea.
--
Michael
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-29 12:14:02 |
Message-ID: | 20180929121402.GM4184@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> On Sat, Sep 29, 2018 at 10:51:23AM +0200, Tomas Vondra wrote:
> > One more thought - when running similar tools on a live system, it's
> > usually a good idea to limit the impact by throttling the throughput. As
> > the verification runs in an independent process it can't reuse the
> > vacuum-like cost limit directly, but perhaps it could do something
> > similar? Like, limit the number of blocks read/second, or so?
>
> When it comes to such parameters, not using a number of blocks but
> throttling with a value in bytes (kB or MB of course) speaks more to the
> user. The past experience with checkpoint_segments is one example of
> that. Converting that to a number of blocks internally would definitely
> make the most sense. +1 for this idea.
While I agree this would be a nice additional feature to have, it seems
like something which could certainly be added later and doesn't
necessarily have to be included in the initial patch. If Michael has
time to add that, great, if not, I'd rather have this as-is than not.
I do tend to agree with Michael that having the parameter be specified
as (or at least able to accept) a byte-based value is a good idea. As
another feature idea, having this able to work in parallel across
tablespaces would be nice too. I can certainly imagine some point where
this is a default process which scans the database at a slow pace across
all the tablespaces more-or-less all the time checking for corruption.
Thanks!
Stephen
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-09-29 15:49:55 |
Message-ID: | 5477fe69-afb8-f759-2d45-680b187a2b81@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 09/29/2018 02:14 PM, Stephen Frost wrote:
> Greetings,
>
> * Michael Paquier (michael(at)paquier(dot)xyz) wrote:
>> On Sat, Sep 29, 2018 at 10:51:23AM +0200, Tomas Vondra wrote:
>>> One more thought - when running similar tools on a live system, it's
>>> usually a good idea to limit the impact by throttling the throughput. As
>>> the verification runs in an independent process it can't reuse the
>>> vacuum-like cost limit directly, but perhaps it could do something
>>> similar? Like, limit the number of blocks read/second, or so?
>>
>> When it comes to such parameters, not using a number of blocks but
>> throttling with a value in bytes (kB or MB of course) speaks more to the
>> user. The past experience with checkpoint_segments is one example of
>> that. Converting that to a number of blocks internally would definitely
>> make the most sense. +1 for this idea.
>
> While I agree this would be a nice additional feature to have, it seems
> like something which could certainly be added later and doesn't
> necessarily have to be included in the initial patch. If Michael has
> time to add that, great, if not, I'd rather have this as-is than not.
>
True, although I don't think it'd be particularly difficult.
> I do tend to agree with Michael that having the parameter be specified
> as (or at least able to accept) a byte-based value is a good idea.
Sure, I was not really expecting it to be exposed as a raw block count. I
agree it should be in byte-based values (i.e. just like --max-rate in
pg_basebackup).
> As another feature idea, having this able to work in parallel across
> tablespaces would be nice too. I can certainly imagine some point where
> this is a default process which scans the database at a slow pace across
> all the tablespaces more-or-less all the time checking for corruption.
>
Maybe, but that's certainly a non-trivial feature.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-10-25 08:16:03 |
Message-ID: | alpine.DEB.2.21.1810251010331.26778@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hallo Michael,
> New version 5 attached.
Patch does not seem to apply anymore.
Moreover, ISTM that some discussions about behavioral changes are not
fully settled.
My current opinion is that when offline some errors are not admissible,
whereas the same errors are admissible when online because they may be due
to the ongoing database processing, so the behavior should not be strictly
the same.
This might suggest some option to tell the command that it should work in
online or offline mode, so that it may be stricter in some cases. The
default may be one of the options, eg the stricter offline mode, or maybe
guessed at startup.
I put the patch in "waiting on author" state.
--
Fabien.
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-10-30 13:22:26 |
Message-ID: | 20181030132225.GB23740@nighthawk.caipicrew.dd-dns.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Fabien,
On Thu, Oct 25, 2018 at 10:16:03AM +0200, Fabien COELHO wrote:
> >New version 5 attached.
>
> Patch does not seem to apply anymore.
Thanks, rebased version attached.
> Moreover, ISTM that some discussions about behavioral changes are not fully
> settled.
>
> My current opinion is that when offline some errors are not admissible,
> whereas the same errors are admissible when online because they may be due
> to the ongoing database processing, so the behavior should not be strictly
> the same.
Indeed, the recently-added pg_verify_checksums testsuite adds a few
files with just 'foo' in them and with V5 of the patch,
pg_verify_checksums no longer bails out with an error on those.
I have now re-added the retry logic for partially-read pages, so that it
bails out if it reads a page partially twice. This makes the testsuite
work again.
I am not convinced we need to differentiate further between online and
offline operation; can you explain in more detail which other
differences are ok in online mode and why?
> This might suggest some option to tell the command that it should work in
> online or offline mode, so that it may be stricter in some cases. The
> default may be one of the options, eg the stricter offline mode, or maybe
> guessed at startup.
If we believe the operation should be different, the patch removes the
"is cluster online?" check (as it is no longer necessary), so we could
just replace the current error message with a global variable with the
result of that check and use it where needed (if any).
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V6.patch | text/x-diff | 7.2 KB |
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-10-30 17:22:52 |
Message-ID: | alpine.DEB.2.21.1810301754020.9086@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hallo Michael,
Patch v6 applies cleanly, compiles, local make check is ok.
>> My current opinion is that when offline some errors are not admissible,
>> whereas the same errors are admissible when online because they may be due
>> to the ongoing database processing, so the behavior should not be strictly
>> the same.
>
> Indeed, the recently-added pg_verify_checksums testsuite
A welcome addition!
> adds a few files with just 'foo' in them and with V5 of the patch,
> pg_verify_checksums no longer bails out with an error on those.
> I have now re-added the retry logic for partially-read pages, so that it
> bails out if it reads a page partially twice. This makes the testsuite
> work again.
>
> I am not convinced we need to differentiate further between online and
> offline operation, can you explain in more detail which other
> differences are ok in online mode and why?
For instance, the "file/directory was removed" errors do not look okay
at all when offline, even if unlikely. Moreover, the check hides the
error message and is fully silent in this case, while beforehand it was
not silent on the same error when offline.
The "check if page was modified since checkpoint" does not look useful
when offline. Maybe it lacks a comment to say that this cannot (should not
?) happen when offline, but even then I would not like it to be true: ISTM
that no page should be allowed to be skipped on the checkpoint condition
when offline, but it is probably ok to skip with the new page test, which
makes me still think that they should be counted and reported separately,
or at least the checkpoint skip test should not be run when offline.
When offline, the retry logic does not make much sense; it should
complain directly on the first error? Also, I'm unsure of the read &
checksum retry logic *without any delay*.
>> This might suggest some option to tell the command that it should work in
>> online or offline mode, so that it may be stricter in some cases. The
>> default may be one of the options, eg the stricter offline mode, or maybe
>> guessed at startup.
>
> If we believe the operation should be different, the patch removes the
> "is cluster online?" check (as it is no longer necessary), so we could
> just replace the current error message with a global variable with the
> result of that check and use it where needed (if any).
That could leave open the issue of someone starting the check offline, and
then starting the database while it is not finished. Maybe it is not worth
sweating about such a narrow use case.
If operations are to be different, and it seems to me they should be, I'd
suggest (1) auto detect default based on the existing "is cluster online"
code, (2) force options, eg --online vs --offline, which would complain
and exit if the cluster is not in the right state on startup.
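In sketch form (all names hypothetical):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { MODE_AUTO, MODE_ONLINE, MODE_OFFLINE } RunMode;

/*
 * Sketch of the proposed startup check: --online/--offline force a
 * mode and complain if the detected cluster state does not match;
 * the default (auto) simply follows the detected state.
 */
static bool
resolve_online(RunMode requested, bool cluster_is_online)
{
    if (requested == MODE_ONLINE && !cluster_is_online)
    {
        fprintf(stderr, "--online given but cluster is offline\n");
        exit(1);
    }
    if (requested == MODE_OFFLINE && cluster_is_online)
    {
        fprintf(stderr, "--offline given but cluster is online\n");
        exit(1);
    }
    return cluster_is_online;
}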
I'd suggest adding a failing checksum online test, if possible. At least a
"foo" file? It would also be nice if the test could apply on an active
database, eg with a low-rate pgbench running in parallel to the
verification, but I'm not sure how easy it is to add such a thing.
--
Fabien.
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-11-21 12:35:35 |
Message-ID: | 20181121123535.GD23740@nighthawk.caipicrew.dd-dns.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
> >I am not convinced we need to differentiate further between online and
> >offline operation, can you explain in more detail which other
> >differences are ok in online mode and why?
>
> For instance the "file/directory was removed" do not look okay at all when
> offline, even if unlikely. Moreover, the checks hides the error message and
> is fully silent in this case, while it was not beforehand on the same error
> when offline.
OK, I kinda see the point here and added that.
> The "check if page was modified since checkpoint" does not look useful when
> offline. Maybe it lacks a comment to say that this cannot (should not ?)
> happen when offline, but even then I would not like it to be true: ISTM that
> no page should be allowed to be skipped on the checkpoint condition when
> offline, but it is probably ok to skip with the new page test, which makes me
> still think that they should be counted and reported separately, or at least
> the checkpoint skip test should not be run when offline.
What is the rationale to not skip on the checkpoint condition when the
instance is offline? If it was shut down cleanly, this should not
happen; if the instance crashed, those would be spurious errors that
would get repaired on recovery.
I have not changed that for now.
> When offline, the retry logic does not make much sense, it should complain
> directly on the first error? Also, I'm unsure of the read & checksum retry
> logic *without any delay*.
I think the small overhead of retrying in offline mode, even if
useless, is worth it to avoid making the code more complicated by
catering for both modes.
Initially there was a delay, but this was removed after analysis and
requests by several other reviewers.
> >>This might suggest some option to tell the command that it should work in
> >>online or offline mode, so that it may be stricter in some cases. The
> >>default may be one of the options, eg the stricter offline mode, or maybe
> >>guessed at startup.
> >
> >If we believe the operation should be different, the patch removes the
> >"is cluster online?" check (as it is no longer necessary), so we could
> >just replace the current error message with a global variable with the
> >result of that check and use it where needed (if any).
>
> That could leave open the issue of someone starting the check offline, and
> then starting the database while it is not finished. Maybe it is not worth
> sweating about such a narrow use case.
I don't think we need to cater for that, yeah.
> If operations are to be different, and it seems to me they should be, I'd
> suggest (1) auto detect default based one the existing "is cluster online"
> code, (2) force options, eg --online vs --offline, which would complain and
> exit if the cluster is not in the right state on startup.
The current code bails out if it thinks the cluster is online. What is
wrong with just setting a flag now in case it is?
> I'd suggest to add a failing checksum online test, if possible. At least a
> "foo" file?
Ok, done so.
> It would also be nice if the test could apply on an active database,
> eg with a low-rate pgbench running in parallel to the verification,
> but I'm not sure how easy it is to add such a thing.
That sounds much more complicated so I have not tackled that yet.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V7.patch | text/x-diff | 10.2 KB |
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2018-11-22 01:12:19 |
Message-ID: | 20181122011219.GA3415@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
> > The "check if page was modified since checkpoint" does not look useful when
> > offline. Maybe it lacks a comment to say that this cannot (should not ?)
> > happen when offline, but even then I would not like it to be true: ISTM that
> > no page should be allowed to be skipped on the checkpoint condition when
> > offline, but it is probably ok to skip with the new page test, which makes me
> > still think that they should be counted and reported separately, or at least
> > the checkpoint skip test should not be run when offline.
>
> What is the rationale to not skip on the checkpoint condition when the
> instance is offline? If it was shutdown cleanly, this should not
> happen, if the instance crashed, those would be spurious errors that
> would get repaired on recovery.
>
> I have not changed that for now.
Agreed- this is an important check even in offline mode.
> > When offline, the retry logic does not make much sense, it should complain
> > directly on the first error? Also, I'm unsure of the read & checksum retry
> > logic *without any delay*.
The race condition being considered here is where an 8k read somehow
gets the first 4k, then is scheduled off-cpu, and the full 8k page is
then written by some other process, and then this process is woken up
to read the second 4k. I agree that this is unnecessary when the
database is offline, but it's also pretty cheap. When the database is
online, it's an extremely unlikely case to hit (just try to reproduce
it...) but if it does get hit then it's easy enough to recheck by doing
a reread, which should show that the LSN has been updated in the first
4k and we can then know that this page is in the WAL. We have not yet
seen a case where such a re-read returns an old LSN and an invalid
checksum; based on discussion with other hackers, that shouldn't be
possible as every kernel seems to consistently write in-order, meaning
that the first 4k will be updated before the second, so a single re-read
should be sufficient.
Remember- this is all in-memory activity also, we aren't talking about
what might happen on disk here.
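To make the shape of that recheck concrete (a sketch with hypothetical
helpers standing in for the real page-inspection routines; this is not
the patch itself):

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

typedef uint64_t XLogRecPtr;

/* Hypothetical helpers; the real code gets these from the page header. */
extern uint16_t calc_checksum(const char *page, uint32_t blkno);
extern uint16_t stored_checksum(const char *page);
extern XLogRecPtr page_lsn(const char *page);

/*
 * Reread once on a mismatch, with no delay: by the time the reread
 * returns, a concurrent writer that tore our first read has already
 * rewritten the first 4k, including the new page LSN, which will then
 * be newer than the checkpoint LSN and make us skip the page.
 */
static bool
block_ok(int fd, off_t off, uint32_t blkno, XLogRecPtr checkpoint_lsn)
{
    char    page[BLCKSZ];

    for (int attempt = 0; attempt < 2; attempt++)
    {
        if (pread(fd, page, BLCKSZ, off) != BLCKSZ)
            return true;    /* short read at EOF: new block, skip */
        if (calc_checksum(page, blkno) == stored_checksum(page))
            return true;    /* checksum verified */
        if (page_lsn(page) >= checkpoint_lsn)
            return true;    /* newer than checkpoint: in WAL, skip */
    }
    return false;           /* bad checksum and stale LSN, twice */
}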
> I think the small overhead of retrying in offline mode even if useless
> is worth avoiding making the code more complicated in order to cater for
> both modes.
Agreed.
> Initially there was a delay, but this was removed after analysis and
> requests by several other reviewers.
Agreed, there's no need for or point to having such a delay.
> > >>This might suggest some option to tell the command that it should work in
> > >>online or offline mode, so that it may be stricter in some cases. The
> > >>default may be one of the options, eg the stricter offline mode, or maybe
> > >>guessed at startup.
> > >
> > >If we believe the operation should be different, the patch removes the
> > >"is cluster online?" check (as it is no longer necessary), so we could
> > >just replace the current error message with a global variable with the
> > >result of that check and use it where needed (if any).
> >
> > That could leave open the issue of someone starting the check offline, and
> > then starting the database while it is not finished. Maybe it is not worth
> > sweating about such a narrow use case.
>
> I don't think we need to cater for that, yeah.
Agreed.
> > It would also be nice if the test could apply on an active database,
> > eg with a low-rate pgbench running in parallel to the verification,
> > but I'm not sure how easy it is to add such a thing.
>
> That sounds much more complicated so I have not tackled that yet.
I agree that this would be nice, but I don't want the regression tests
to become much longer...
Thanks!
Stephen
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-11-22 17:29:04 |
Message-ID: | 53b04b08-e4f4-9d94-e57b-07b38b14705f@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 11/22/18 2:12 AM, Stephen Frost wrote:
> Greetings,
>
> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
>> On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
>>> The "check if page was modified since checkpoint" does not look useful when
>>> offline. Maybe it lacks a comment to say that this cannot (should not ?)
>>> happen when offline, but even then I would not like it to be true: ISTM that
>>> no page should be allowed to be skipped on the checkpoint condition when
>>> offline, but it is probably ok to skip with the new page test, which makes me
>>> still think that they should be counted and reported separately, or at least
>>> the checkpoint skip test should not be run when offline.
>>
>> What is the rationale to not skip on the checkpoint condition when the
>> instance is offline? If it was shutdown cleanly, this should not
>> happen, if the instance crashed, those would be spurious errors that
>> would get repaired on recovery.
>>
>> I have not changed that for now.
>
> Agreed- this is an important check even in offline mode.
>
Yeah. I suppose we could detect if the shutdown was clean (like
pg_rewind does), and then skip the check. Or perhaps we should still do
the check (without a retry), and report it as an issue when we find a page
with LSN newer than the last checkpoint.
In any case, the check is pretty cheap (comparing two 64-bit values),
and I don't see how skipping it would optimize anything. It would make
the code a tad simpler, but we still need the check for the online mode.
>>> When offline, the retry logic does not make much sense, it should complain
>>> directly on the first error? Also, I'm unsure of the read & checksum retry
>>> logic *without any delay*.
>
> The race condition being considered here is where an 8k read somehow
> gets the first 4k, then is scheduled off-cpu, and the full 8k page is
> then written by some other process, and then this process is woken up
> to read the second 4k. I agree that this is unnecessary when the
> database is offline, but it's also pretty cheap. When the database is
> online, it's an extremely unlikely case to hit (just try to reproduce
> it...) but if it does get hit then it's easy enough to recheck by doing
> a reread, which should show that the LSN has been updated in the first
> 4k and we can then know that this page is in the WAL. We have not yet
> seen a case where such a re-read returns an old LSN and an invalid
> checksum; based on discussion with other hackers, that shouldn't be
> possible as every kernel seems to consistently write in-order, meaning
> that the first 4k will be updated before the second, so a single re-read
> should be sufficient.
>
Right.
A minor detail is that the reads/writes should be atomic at the sector
level, which used to be 512B, so it's not just about pages torn in a
4kB/4kB manner, but possibly an arbitrary mix of 512B chunks from the
old and new versions.
This also explains why we don't need any delay - by the time we reread,
the write must have already written the page header, so the new LSN
must already be visible.
So no delay is necessary. And if one were needed, how long should it
be? The processes might end up off-CPU for an arbitrary amount of time,
so picking a good value would be pretty tricky.
> Remember- this is all in-memory activity also, we aren't talking about
> what might happen on disk here.
>
>> I think the small overhead of retrying in offline mode even if useless
>> is worth avoiding making the code more complicated in order to cater for
>> both modes.
>
> Agreed.
>
>> Initially there was a delay, but this was removed after analysis and
>> requests by several other reviewers.
>
> Agreed, there's no need for or point to having such a delay.
>
Yep.
>>>>> This might suggest some option to tell the command that it should work in
>>>>> online or offline mode, so that it may be stricter in some cases. The
>>>>> default may be one of the option, eg the stricter offline mode, or maybe
>>>>> guessed at startup.
>>>>
>>>> If we believe the operation should be different, the patch removes the
>>>> "is cluster online?" check (as it is no longer necessary), so we could
>>>> just replace the current error message with a global variable with the
>>>> result of that check and use it where needed (if any).
>>>
>>> That could let open the issue of someone starting the check offline, and
>>> then starting the database while it is not finished. Maybe it is not worth
>>> sweating about such a narrow use case.
>>
>> I don't think we need to cater for that, yeah.
>
> Agreed.
>
Yep. I don't think other tools protect against that either. And
pg_rewind does actually modify the cluster state, unlike checksum
verification.
>>> It would also be nice if the test could apply on an active database,
>>> eg with a low-rate pgbench running in parallel to the verification,
>>> but I'm not sure how easy it is to add such a thing.
>>
>> That sounds much more complicated so I have not tackled that yet.
>
> I agree that this would be nice, but I don't want the regression tests
> to become much longer...
>
I have to admit I find this thread rather confusing, because the subject
is "online verification of checksums" yet we're discussing verification
on offline instances.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-11-22 17:39:28 |
Message-ID: | 20181122173928.GB3415@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> On 11/22/18 2:12 AM, Stephen Frost wrote:
> >* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> >>On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
> >>>The "check if page was modified since checkpoint" does not look useful when
> >>>offline. Maybe it lacks a comment to say that this cannot (should not ?)
> >>>happen when offline, but even then I would not like it to be true: ISTM that
> >>>no page should be allowed to be skipped on the checkpoint condition when
> >>>offline, but it is probably ok to skip with the new page test, which make me
> >>>still think that they should be counted and reported separately, or at least
> >>>the checkpoint skip test should not be run when offline.
> >>
> >>What is the rationale to not skip on the checkpoint condition when the
> >>instance is offline? If it was shutdown cleanly, this should not
> >>happen, if the instance crashed, those would be spurious errors that
> >>would get repaired on recovery.
> >>
> >>I have not changed that for now.
> >
> >Agreed- this is an important check even in offline mode.
>
> Yeah. I suppose we could detect if the shutdown was clean (like pg_rewind
> does), and then skip the check. Or perhaps we should still do the check
> (without a retry), and report it as issue when we find a page with LSN newer
> than the last checkpoint.
I agree that it'd be nice to report an issue if it's a clean shutdown
but there's an LSN newer than the last checkpoint, though I suspect that
would be more useful for debugging than for regular users.
> In any case, the check is pretty cheap (comparing two 64-bit values), and I
> don't see how skipping it would optimize anything. It would make the code a
> tad simpler, but we still need the check for the online mode.
Yeah, I'd just keep the check.
> A minor detail is that the reads/writes should be atomic at the sector
> level, which used to be 512B, so it's not just about pages torn in 4kB/4kB
> manner, but possibly an arbitrary mix of 512B chunks from old and new
> version.
Sure.
> This also explains why we don't need any delay - the reread happens after
> the write must have already written the page header, so the new LSN must be
> already visible.
Agreed.
Thanks!
Stephen
From: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
---|---|
To: | michael(dot)banck(at)credativ(dot)de |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-12-01 11:47:13 |
Message-ID: | CA+q6zcXaw+firw_z-XuOeTfz0OQKoRN3t5LsCV13YP-0-DEoTQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On Wed, Nov 21, 2018 at 1:38 PM Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
>
> Hi,
>
> On Tue, Oct 30, 2018 at 06:22:52PM +0100, Fabien COELHO wrote:
> > >I am not convinced we need to differentiate further between online and
> > >offline operation, can you explain in more detail which other
> > >differences are ok in online mode and why?
> >
> > For instance the "file/directory was removed" do not look okay at all when
> > offline, even if unlikely. Moreover, the checks hides the error message and
> > is fully silent in this case, while it was not beforehand on the same error
> > when offline.
>
> OK, I kinda see the point here and added that.
Hi,
Just for information: it looks like part of this patch (or at least some
similar code), related to the tests in 002_actions.pl, was committed
recently as 5c99513975, so there are minor conflicts with master.
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
Cc: | michael(dot)banck(at)credativ(dot)de, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-12-03 00:48:43 |
Message-ID: | 20181203004843.GC3423@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Dec 01, 2018 at 12:47:13PM +0100, Dmitry Dolgov wrote:
> Just for the information, looks like part of this patch (or at least some
> similar code), related to the tests in 002_actions.pl, was committed recently
> in 5c99513975, so there are minor conflicts with the master.
From what I can see in v7 of the patch as posted in [1], all the changes
to 002_actions.pl could just be removed, because equivalents already
exist.
[1]: https://postgr.es/m/20181121123535.GD23740@nighthawk.caipicrew.dd-dns.de
--
Michael
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-12-20 15:19:11 |
Message-ID: | 20181220151911.GA4974@nighthawk.caipicrew.dd-dns.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Mon, Dec 03, 2018 at 09:48:43AM +0900, Michael Paquier wrote:
> On Sat, Dec 01, 2018 at 12:47:13PM +0100, Dmitry Dolgov wrote:
> > Just for the information, looks like part of this patch (or at least some
> > similar code), related to the tests in 002_actions.pl, was committed recently
> > in 5c99513975, so there are minor conflicts with the master.
>
> What what I can see in v7 of the patch as posted in [1], all the changes
> to 002_actions.pl could just be removed because there are already
> equivalents.
Yeah, new rebased version attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V8.patch | text/x-diff | 8.2 KB |
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-12-21 10:08:06 |
Message-ID: | 20181221100805.GC4974@nighthawk.caipicrew.dd-dns.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Thu, Dec 20, 2018 at 04:19:11PM +0100, Michael Banck wrote:
> Yeah, new rebased version attached.
By the way, one thing that this patch also fixes is checksum
verification on basebackups (as pointed out the other day by my
colleague Bernd Helmele):
postgres(at)kohn:~$ initdb -k data
postgres(at)kohn:~$ pg_ctl -D data -l logfile start
waiting for server to start.... done
server started
postgres(at)kohn:~$ pg_verify_checksums -D data
pg_verify_checksums: cluster must be shut down to verify checksums
postgres(at)kohn:~$ pg_basebackup -h /tmp -D backup1
postgres(at)kohn:~$ pg_verify_checksums -D backup1
pg_verify_checksums: cluster must be shut down to verify checksums
postgres(at)kohn:~$ pg_checksums -c -D backup1
Checksum scan completed
Files scanned: 1094
Blocks scanned: 2867
Bad checksums: 0
Data checksum version: 1
Here, pg_checksums has the online verification patch applied.
As I don't think many people will take down their production servers in
order to verify checksums, verifying them on basebackups looks like a
useful use-case that is currently broken with pg_verify_checksums.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2018-12-25 09:25:46 |
Message-ID: | alpine.DEB.2.21.1812250943120.32444@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hallo Michael,
> Yeah, new rebased version attached.
Patch v8 applies cleanly, compiles, global & local make check are ok.
A few comments:
About the added tests: the node is left running at the end of the
script, which is not very clean. I'd suggest either moving the added
checks before the node is stopped, or stopping it again at the end of
the script, depending on the intention.
I'm wondering (possibly again) about the existing early exit if one
block cannot be read on retry: the command should count this as a kind
of bad block, proceed with checking the other files, and obviously fail
in the end, but only after having checked everything else and generated
a report. I do not think that this condition warrants a full stop. ISTM
that under rare race conditions (eg, an unlucky concurrent "drop
database" or "drop table") this could happen when online, although I
could not trigger one despite heavy testing, so I'm possibly mistaken.
--
Fabien.
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-03 10:06:45 |
Message-ID: | 20190203100645.evinbgnnj3xggwkb@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2018-12-25 10:25:46 +0100, Fabien COELHO wrote:
> Hallo Michael,
>
> > Yeah, new rebased version attached.
>
> Patch v8 applies cleanly, compiles, global & local make check are ok.
>
> A few comments:
>
> About added tests: the node is left running at the end of the script, which
> is not very clean. I'd suggest to either move the added checks before
> stopping, or to stop again at the end of the script, depending on the
> intention.
Michael?
> I'm wondering (possibly again) about the existing early exit if one block
> cannot be read on retry: the command should count this as a kind of bad
> block, proceed on checking other files, and obviously fail in the end, but
> having checked everything else and generated a report. I do not think that
> this condition warrants a full stop. ISTM that under rare race conditions
> (eg, an unlucky concurrent "drop database" or "drop table") this could
> happen when online, although I could not trigger one despite heavy testing,
> so I'm possibly mistaken.
This seems like a defensible judgement call either way.
Greetings,
Andres Freund
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-04 02:36:27 |
Message-ID: | 20190204023627.GE1881@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Feb 03, 2019 at 02:06:45AM -0800, Andres Freund wrote:
> On 2018-12-25 10:25:46 +0100, Fabien COELHO wrote:
>> About added tests: the node is left running at the end of the script, which
>> is not very clean. I'd suggest to either move the added checks before
>> stopping, or to stop again at the end of the script, depending on the
>> intention.
>
> Michael?
Unlikely P., and most likely B.
I have marked the patch as returned with feedback as it has been a
couple of weeks already.
--
Michael
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-04 07:57:17 |
Message-ID: | 1549267037.796.2.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Sonntag, den 03.02.2019, 02:06 -0800 schrieb Andres Freund:
> Hi,
>
> On 2018-12-25 10:25:46 +0100, Fabien COELHO wrote:
> > Hallo Michael,
> >
> > > Yeah, new rebased version attached.
> >
> > Patch v8 applies cleanly, compiles, global & local make check are ok.
> >
> > A few comments:
> >
> > About added tests: the node is left running at the end of the script, which
> > is not very clean. I'd suggest to either move the added checks before
> > stopping, or to stop again at the end of the script, depending on the
> > intention.
>
> Michael?
Uh, I kinda forgot about this, I've made the tests stop the node now.
> > I'm wondering (possibly again) about the existing early exit if one block
> > cannot be read on retry: the command should count this as a kind of bad
> > block, proceed on checking other files, and obviously fail in the end, but
> > having checked everything else and generated a report. I do not think that
> > this condition warrants a full stop. ISTM that under rare race conditions
> > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > happen when online, although I could not trigger one despite heavy testing,
> > so I'm possibly mistaken.
>
> This seems like a defensible judgement call either way.
Right now we have a few tests that explicitly check that
pg_verify_checksums fails on broken data ("foo" in the file). Those
would then just get skipped AFAICT, which I think is the worse
behaviour, but if everybody thinks that should be the way to go, we can
drop/adjust those tests and make pg_verify_checksums skip them.
Thoughts?
In the meantime, v9 is attached with the above change and rebased
(without further changes) onto master.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V9.patch | text/x-patch | 8.2 KB |
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-05 05:57:06 |
Message-ID: | alpine.DEB.2.21.1902050652170.32208@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hallo Michael,
>>> I'm wondering (possibly again) about the existing early exit if one block
>>> cannot be read on retry: the command should count this as a kind of bad
>>> block, proceed on checking other files, and obviously fail in the end, but
>>> having checked everything else and generated a report. I do not think that
>>> this condition warrants a full stop. ISTM that under rare race conditions
>>> (eg, an unlucky concurrent "drop database" or "drop table") this could
>>> happen when online, although I could not trigger one despite heavy testing,
>>> so I'm possibly mistaken.
>>
>> This seems like a defensible judgement call either way.
>
> Right now we have a few tests that explicitly check that
> pg_verify_checksums fail on broken data ("foo" in the file). Those
> would then just get skipped AFAICT, which I think is the worse behaviour
> , but if everybody thinks that should be the way to go, we can
> drop/adjust those tests and make pg_verify_checksums skip them.
>
> Thoughts?
My point is that it should fail as it does, only not immediately (early
exit), but after having checked everything else. This means avoiding
calling "exit(1)" here and there (lseek, fopen...), instead taking note
that something bad happened, and calling exit only at the end.
--
Fabien.
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-05 07:01:43 |
Message-ID: | 20190205070143.ntsbcldj22mwwf2d@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > I'm wondering (possibly again) about the existing early exit if one block
> > > > cannot be read on retry: the command should count this as a kind of bad
> > > > block, proceed on checking other files, and obviously fail in the end, but
> > > > having checked everything else and generated a report. I do not think that
> > > > this condition warrants a full stop. ISTM that under rare race conditions
> > > > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > > > happen when online, although I could not trigger one despite heavy testing,
> > > > so I'm possibly mistaken.
> > >
> > > This seems like a defensible judgement call either way.
> >
> > Right now we have a few tests that explicitly check that
> > pg_verify_checksums fail on broken data ("foo" in the file). Those
> > would then just get skipped AFAICT, which I think is the worse behaviour
> > , but if everybody thinks that should be the way to go, we can
> > drop/adjust those tests and make pg_verify_checksums skip them.
> >
> > Thoughts?
>
> My point is that it should fail as it does, only not immediately (early
> exit), but after having checked everything else. This mean avoiding calling
> "exit(1)" here and there (lseek, fopen...), but taking note that something
> bad happened, and call exit only in the end.
I can see both as being valuable (one gives you a more complete picture,
the other a quicker answer in scripts). For me that's the point where
it's the prerogative of the author to make that choice.
Greetings,
Andres Freund
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-05 10:30:48 |
Message-ID: | 68ffd2de-be8b-b49f-a077-7ab529559d6f@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2/5/19 8:01 AM, Andres Freund wrote:
> Hi,
>
> On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
>>>>> I'm wondering (possibly again) about the existing early exit if one block
>>>>> cannot be read on retry: the command should count this as a kind of bad
>>>>> block, proceed on checking other files, and obviously fail in the end, but
>>>>> having checked everything else and generated a report. I do not think that
>>>>> this condition warrants a full stop. ISTM that under rare race conditions
>>>>> (eg, an unlucky concurrent "drop database" or "drop table") this could
>>>>> happen when online, although I could not trigger one despite heavy testing,
>>>>> so I'm possibly mistaken.
>>>>
>>>> This seems like a defensible judgement call either way.
>>>
>>> Right now we have a few tests that explicitly check that
>>> pg_verify_checksums fail on broken data ("foo" in the file). Those
>>> would then just get skipped AFAICT, which I think is the worse behaviour
>>> , but if everybody thinks that should be the way to go, we can
>>> drop/adjust those tests and make pg_verify_checksums skip them.
>>>
>>> Thoughts?
>>
>> My point is that it should fail as it does, only not immediately (early
>> exit), but after having checked everything else. This mean avoiding calling
>> "exit(1)" here and there (lseek, fopen...), but taking note that something
>> bad happened, and call exit only in the end.
>
> I can see both as being valuable (one gives you a more complete picture,
> the other a quicker answer in scripts). For me that's the point where
> it's the prerogative of the author to make that choice.
>
Why not make this configurable, using a command-line option?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-05 11:29:53 |
Message-ID: | 1549366193.796.9.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Dienstag, den 05.02.2019, 11:30 +0100 schrieb Tomas Vondra:
> On 2/5/19 8:01 AM, Andres Freund wrote:
> > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > I'm wondering (possibly again) about the existing early exit if one block
> > > > > > cannot be read on retry: the command should count this as a kind of bad
> > > > > > block, proceed on checking other files, and obviously fail in the end, but
> > > > > > having checked everything else and generated a report. I do not think that
> > > > > > this condition warrants a full stop. ISTM that under rare race conditions
> > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > > > > > happen when online, although I could not trigger one despite heavy testing,
> > > > > > so I'm possibly mistaken.
> > > > >
> > > > > This seems like a defensible judgement call either way.
> > > >
> > > > Right now we have a few tests that explicitly check that
> > > > pg_verify_checksums fail on broken data ("foo" in the file). Those
> > > > would then just get skipped AFAICT, which I think is the worse behaviour
> > > > , but if everybody thinks that should be the way to go, we can
> > > > drop/adjust those tests and make pg_verify_checksums skip them.
> > > >
> > > > Thoughts?
> > >
> > > My point is that it should fail as it does, only not immediately (early
> > > exit), but after having checked everything else. This mean avoiding calling
> > > "exit(1)" here and there (lseek, fopen...), but taking note that something
> > > bad happened, and call exit only in the end.
> >
> > I can see both as being valuable (one gives you a more complete picture,
> > the other a quicker answer in scripts). For me that's the point where
> > it's the prerogative of the author to make that choice.
Personally, I would prefer to keep it as simple as possible for now and
get this patch committed; in my opinion the behaviour is already like
this (early exit on corrupt files) so I don't think the online
verification patch should change this.
If we see complaints about this, then I'd be happy to change it
afterwards.
> Why not make this configurable, using a command-line option?
I like this even less - this tool is about verifying checksums, so
adding options on what to do when it encounters broken pages looks out-
of-scope to me. Unless we want to say it should generally abort on the
first issue (i.e. on wrong checksums as well).
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-06 16:39:33 |
Message-ID: | 20190206163933.GA6197@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> Am Dienstag, den 05.02.2019, 11:30 +0100 schrieb Tomas Vondra:
> > On 2/5/19 8:01 AM, Andres Freund wrote:
> > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > > I'm wondering (possibly again) about the existing early exit if one block
> > > > > > > cannot be read on retry: the command should count this as a kind of bad
> > > > > > > block, proceed on checking other files, and obviously fail in the end, but
> > > > > > > having checked everything else and generated a report. I do not think that
> > > > > > > this condition warrants a full stop. ISTM that under rare race conditions
> > > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > > > > > > happen when online, although I could not trigger one despite heavy testing,
> > > > > > > so I'm possibly mistaken.
> > > > > >
> > > > > > This seems like a defensible judgement call either way.
> > > > >
> > > > > Right now we have a few tests that explicitly check that
> > > > > pg_verify_checksums fail on broken data ("foo" in the file). Those
> > > > > would then just get skipped AFAICT, which I think is the worse behaviour
> > > > > , but if everybody thinks that should be the way to go, we can
> > > > > drop/adjust those tests and make pg_verify_checksums skip them.
> > > > >
> > > > > Thoughts?
> > > >
> > > > My point is that it should fail as it does, only not immediately (early
> > > > exit), but after having checked everything else. This mean avoiding calling
> > > > "exit(1)" here and there (lseek, fopen...), but taking note that something
> > > > bad happened, and call exit only in the end.
> > >
> > > I can see both as being valuable (one gives you a more complete picture,
> > > the other a quicker answer in scripts). For me that's the point where
> > > it's the prerogative of the author to make that choice.
... unless people here object or prefer other options, and then it's up
to discussion and hopefully some consensus comes out of it.
Also, I have to say that I really don't think the 'quicker answer'
argument holds any weight, making me question if that's a valid
use-case. If there *isn't* an issue, which we would likely all agree is
the case the vast majority of the time that this is going to be run,
then it's going to take quite a while and anyone calling it should
expect and be prepared for that. In the extremely rare cases, what does
exiting early actually do for us?
> Personally, I would prefer to keep it as simple as possible for now and
> get this patch committed; in my opinion the behaviour is already like
> this (early exit on corrupt files) so I don't think the online
> verification patch should change this.
I'm also in the camp of "would rather it not exit immediately, so the
extent of the issue is clear".
> If we see complaints about this, then I'd be happy to change it
> afterwards.
I really don't think this is something we should change later on in a
future release. If the consensus is that there are really two different
but valid use-cases then we should make it configurable, but I'm not
convinced there is.
> > Why not make this configurable, using a command-line option?
>
> I like this even less - this tool is about verifying checksums, so
> adding options on what to do when it encounters broken pages looks out-
> of-scope to me. Unless we want to say it should generally abort on the
> first issue (i.e. on wrong checksums as well).
I definitely disagree that it's somehow 'out of scope' for this tool to
skip broken pages, when we can tell that they're broken. There is a
question here about how to handle a short read since that can happen
under normal conditions if we're unlucky. The same is also true for
files disappearing entirely.
So, let's talk/think through a few cases:
A file with just 'foo\n' in it - could that be a page starting with
an LSN around 666F6F0A that we somehow only read the first few bytes of?
If not, why not? I could possibly see an argument that we expect to
always get at least 512 bytes in a read, or 4K, but it seems like we
could possibly run into edge cases on odd filesystems or such. In the
end, I'm leaning towards categorizing different things, well,
differently - a short read would be reported as a NOTICE or equivalent,
perhaps, meaning that the test case needs to do something more than just
have a file with 'foo' in it, but that is likely a good thing anyway -
the test cases would be better if they were closer to the real world.
Other read failures would be reported in a more serious category,
assuming they are "this really shouldn't happen" cases. A file
disappearing isn't a "can't happen" case, and might be reported at the
same NOTICE level (or maybe only with a 'verbose' option).
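To spell out the hex in that example: the first 8 bytes of a page are
pd_lsn, so the bytes of 'foo\n' would indeed read back as (part of) an
LSN. A standalone sketch, using a simplified stand-in for the real
PageHeaderData:

/*
 * Sketch: why 'foo\n' can masquerade as a page LSN.  The struct below
 * is a simplified stand-in for the pd_lsn field at the start of
 * PageHeaderData, not the real definition.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct
{
	uint32_t	xlogid;			/* high 32 bits of the LSN */
	uint32_t	xrecoff;		/* low 32 bits of the LSN */
} FakePageLSN;

int
main(void)
{
	char		buf[8] = "foo\n";	/* remaining bytes are zero */
	FakePageLSN lsn;

	memcpy(&lsn, buf, sizeof(lsn));

	/*
	 * 'f' 'o' 'o' '\n' are the bytes 66 6F 6F 0A, hence "an LSN around
	 * 666F6F0A"; on a little-endian machine this prints A6F6F66/0.
	 */
	printf("pd_lsn would read back as %X/%X\n", lsn.xlogid, lsn.xrecoff);
	return 0;
}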
A file that's 8k in size and has a checksum that's not right seems
pretty clear to me. Might as well include a count of pages which have a
valid checksum, I would think, though perhaps that would only get
reported in a 'verbose' mode.
A completely zero'd page could also be reported at a NOTICE level or
with a count, or perhaps only with verbose.
Other thoughts about use-cases and what should happen..?
Thanks!
Stephen
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-16 12:22:58 |
Message-ID: | 1550319778.12689.8.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Mittwoch, den 06.02.2019, 11:39 -0500 schrieb Stephen Frost:
> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> > Am Dienstag, den 05.02.2019, 11:30 +0100 schrieb Tomas Vondra:
> > > On 2/5/19 8:01 AM, Andres Freund wrote:
> > > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > > > I'm wondering (possibly again) about the existing early exit if one block
> > > > > > > > cannot be read on retry: the command should count this as a kind of bad
> > > > > > > > block, proceed on checking other files, and obviously fail in the end, but
> > > > > > > > having checked everything else and generated a report. I do not think that
> > > > > > > > this condition warrants a full stop. ISTM that under rare race conditions
> > > > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > > > > > > > happen when online, although I could not trigger one despite heavy testing,
> > > > > > > > so I'm possibly mistaken.
> > > > > > >
> > > > > > > This seems like a defensible judgement call either way.
> > > > > >
> > > > > > Right now we have a few tests that explicitly check that
> > > > > > pg_verify_checksums fail on broken data ("foo" in the file). Those
> > > > > > would then just get skipped AFAICT, which I think is the worse behaviour
> > > > > > , but if everybody thinks that should be the way to go, we can
> > > > > > drop/adjust those tests and make pg_verify_checksums skip them.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > My point is that it should fail as it does, only not immediately (early
> > > > > exit), but after having checked everything else. This mean avoiding calling
> > > > > "exit(1)" here and there (lseek, fopen...), but taking note that something
> > > > > bad happened, and call exit only in the end.
> > > >
> > > > I can see both as being valuable (one gives you a more complete picture,
> > > > the other a quicker answer in scripts). For me that's the point where
> > > > it's the prerogative of the author to make that choice.
>
> ... unless people here object or prefer other options, and then it's up
> to discussion and hopefully some consensus comes out of it.
>
> Also, I have to say that I really don't think the 'quicker answer'
> argument holds any weight, making me question if that's a valid
> use-case. If there *isn't* an issue, which we would likely all agree is
> the case the vast majority of the time that this is going to be run,
> then it's going to take quite a while and anyone calling it should
> expect and be prepared for that. In the extremely rare cases, what does
> exiting early actually do for us?
>
> > Personally, I would prefer to keep it as simple as possible for now and
> > get this patch committed; in my opinion the behaviour is already like
> > this (early exit on corrupt files) so I don't think the online
> > verification patch should change this.
>
> I'm also in the camp of "would rather it not exit immediately, so the
> extent of the issue is clear".
>
> > If we see complaints about this, then I'd be happy to change it
> > afterwards.
>
> I really don't think this is something we should change later on in a
> future release.. If the consensus is that there's really two different
> but valid use-cases then we should make it configurable, but I'm not
> convinced there is.
OK, fair enough.
> > > Why not make this configurable, using a command-line option?
> >
> > I like this even less - this tool is about verifying checksums, so
> > adding options on what to do when it encounters broken pages looks out-
> > of-scope to me. Unless we want to say it should generally abort on the
> > first issue (i.e. on wrong checksums as well).
>
> I definitely disagree that it's somehow 'out of scope' for this tool to
> skip broken pages, when we can tell that they're broken.
I didn't mean that it's out-of-scope for pg_verify_checksums, I meant it
is out-of-scope for this patch, which adds online checking.
> There is a question here about how to handle a short read since that
> can happen under normal conditions if we're unlucky. The same is also
> true for files disappearing entirely.
>
> So, let's talk/think through a few cases:
>
> A file with just 'foo\n' in it- could that be a page starting with
> an LSN around 666F6F0A that we somehow only read the first few bytes of?
> If not, why not? I could possibly see an argument that we expect to
> always get at least 512 bytes in a read, or 4K, but it seems like we
> could possibly run into edge cases on odd filesystems or such. In the
> end, I'm leaning towards categorizing different things, well,
> differently- a short read would be reported as a NOTICE or equivilant,
> perhaps, meaning that the test case needs to do something more than just
> have a file with 'foo' in it, but that is likely a good things anyway-
> the test cases would be better if they were closer to real world. Other
> read failures would be reported in a more serious category assuming they
> are "this really shouldn't happen" cases. A file disappearing isn't a
> "can't happen" case, and might be reported at the same 'NOTICE' level
> (or maybe with a 'verbose' ption).
In the context of this patch, we should also discern whether a
particular case is merely a notice (or warning) on an offline cluster as
well - I guess you think it should be?
So I've changed it such that a short read emits a "warning" message,
increments a new skippedfiles variable (as it is not just a skipped
block) and reports its count at the end - should the program then exit
with a status > 0 even if there were no wrong checksums?
> A file that's 8k in size and has a checksum but it's not right seems
> pretty clear to me. Might as well include a count of pages which have a
> valid checksum, I would think, though perhaps only in a 'verbose' mode
> would that get reported.
What's the use of that? The number of scanned blocks is already
reported at the end, so the count of valid pages is easy to figure out
from it and the number of bad checksums.
> A completely zero'd page could also be reported at a NOTICE level or
> with a count, or perhaps only with verbose.
It is counted as a skipped block right now (well, every block that
qualifies for PageIsNew() is), but skipped blocks are currently not
mentioned in the output. I guess the rationale is that it might lead to
excessive screen output (but then, verbose originally logged /every/
block); you'd have to check with the original authors.
So I have now changed the behaviour so that short reads count as
skipped files and pg_verify_checksums no longer bails out on them. When
this occurs, a warning is written to stderr and the overall count is
also reported at the end. However, unless there are blocks with bad
checksums, the exit status is kept at zero.
New patch attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V10.patch | text/x-patch | 9.2 KB |
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-02-28 13:29:45 |
Message-ID: | alpine.DEB.2.21.1902171410130.3339@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hallo Mickael,
> So I have now changed behaviour so that short writes count as skipped
> files and pg_verify_checksums no longer bails out on them. When this
> occors a warning is written to stderr and their overall count is also
> reported at the end. However, unless there are other blocks with bad
> checksums, the exit status is kept at zero.
This seems fair when online; however, I'm wondering whether it is when
offline. I'd say that the whole retry logic should be skipped in that
case, i.e. "if (block_retry || !online) { error message and continue }"
on both the short read and checksum failure retries.
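Spelled out, that gating would be something like this sketch (the
variable names follow the discussion, not necessarily the patch):

/*
 * Sketch of the suggested control flow: retry a failed block once, and
 * only when the cluster is online; offline, or after one retry, report
 * and skip the file.  Names are illustrative.
 */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

static long skippedfiles;

static void
scan_file(int fd, const char *path, bool online)
{
	char		buf[BLCKSZ];
	bool		block_retry = false;
	long		blockno = 0;
	ssize_t		r;

	while ((r = pread(fd, buf, BLCKSZ, blockno * BLCKSZ)) != 0)
	{
		if (r != BLCKSZ)
		{
			/* offline, or already retried: error message, skip file */
			if (block_retry || !online)
			{
				fprintf(stderr, "could not read block %ld in file \"%s\": read %zd of %d\n",
						blockno, path, r, BLCKSZ);
				skippedfiles++;
				return;
			}
			block_retry = true;	/* online: one immediate retry */
			continue;
		}
		/* ... checksum verification, with the same retry gating ... */
		block_retry = false;
		blockno++;
	}
}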
> New patch attached.
Patch applies cleanly, compiles, global & local make check ok.
I'm wondering whether it should exit(1) on "lseek" failures. Would it make
sense to skip the file and report it as such? Should it be counted as a
skippedfile?
WRT the final status, ISTM that skipped blocks & files could warrant an
error when offline, although they might be ok when online?
--
Fabien.
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-01 00:05:14 |
Message-ID: | 1551398714.4947.28.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Donnerstag, den 28.02.2019, 14:29 +0100 schrieb Fabien COELHO:
> > So I have now changed behaviour so that short writes count as skipped
> > files and pg_verify_checksums no longer bails out on them. When this
> > occors a warning is written to stderr and their overall count is also
> > reported at the end. However, unless there are other blocks with bad
> > checksums, the exit status is kept at zero.
>
> This seems fair when online, however I'm wondering whether it is when
> offline. I'd say that the whole retry logic should be skipped in this
> case? i.e. "if (block_retry || !online) { error message and continue }"
> on both short read & checksum failure retries.
Ok, the stand-alone pg_checksums program also got a PR about the LSN
skip logic not being helpful when the instance is offline and somebody
just writes /dev/urandom over the heap files:
https://github.com/credativ/pg_checksums/pull/6
So I now tried to change the patch so that it only retries blocks when
online.
> Patch applies cleanly, compiles, global & local make check ok.
>
> I'm wondering whether it should exit(1) on "lseek" failures. Would it make
> sense to skip the file and report it as such? Should it be counted as a
> skippedfile?
Ok, I think it makes sense to march on and I changed it that way.
> WRT the final status, ISTM that slippedblocks & files could warrant an
> error when offline, although they might be ok when online?
Ok, also changed it that way.
New patch attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V11.patch | text/x-patch | 9.9 KB |
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-01 23:03:21 |
Message-ID: | CA+TgmoYb4MOdiyda2E8yJm40nhyFarVkRdS7R3enTbNuwTtigw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
<michael(dot)banck(at)credativ(dot)de> wrote:
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?
Maybe I'm confused here, but catching 80% of torn pages doesn't sound
robust at all.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-02 10:45:48 |
Message-ID: | 1551523548.4947.32.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> <michael(dot)banck(at)credativ(dot)de> wrote:
> > I have added a retry for this as well now, without a pg_sleep() as well.
> > This catches around 80% of the half-reads, but a few slip through. At
> > that point we bail out with exit(1), and the user can try again, which I
> > think is fine?
>
> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> robust at all.
The chance that pg_verify_checksums hits a torn page (at least in my
tests, see below) is already pretty low, a couple of times per 1000
runs. Maybe 4 out of 5 times, the page is read fine on retry and we
march on. Otherwise, we now just issue a warning and skip the file (or
so was the idea, see below) - do you think that is not acceptable?
I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
pg_verify_checksums in tight loops) with the current patch version, and
I am seeing short reads very, very rarely (maybe every 1000th run) with
a warning like:
|1174
|pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
|pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
|Files skipped: 2
The 1174 is the sequence number, the first 1173 runs of
pg_verify_checksums only skipped blocks.
However, the fact that it shows two warnings for the same file means
there is something wrong here. It was continuing to the next block,
while I think it should just skip to the next file on read failures. So
I have changed that now; new patch attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V12.patch | text/x-patch | 10.0 KB |
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-02 16:08:16 |
Message-ID: | 20190302160816.GK6197@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > <michael(dot)banck(at)credativ(dot)de> wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> The chance that pg_verify_checksums hits a torn page (at least in my
> tests, see below) is already pretty low, a couple of times per 1000
> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> on. Otherwise, we now just issue a warning and skip the file (or so was
> the idea, see below), do you think that is not acceptable?
>
> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
> pg_verify_checksums in tight loops) with the current patch version, and
> I am seeing short reads very, very rarely (maybe every 1000th run) with
> a warning like:
>
> |1174
> |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
> |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
> |Files skipped: 2
>
> The 1174 is the sequence number, the first 1173 runs of
> pg_verify_checksums only skipped blocks.
>
> However, the fact it shows two warnings for the same file means there is
> something wrong here. It was continueing to the next block while I think
> it should just skip to the next file on read failures. So I have changed
> that now, new patch attached.
I'm confused - if previously it was continuing to the next block
instead of doing the re-read on the same block, why don't we just change
it to do the re-read on the same block properly and see if that fixes
the retry, instead of just giving up and skipping? I'm not necessarily
against skipping to the next file, to be clear, but I think I'd be
happier if we kept reading the file until we actually get EOF.
(I've not looked at the actual patch, just read what you wrote..)
Thanks!
Stephen
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-02 21:38:55 |
Message-ID: | 75ad93fd-faf9-571b-fbc6-befdebf9fd80@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/2/19 12:03 AM, Robert Haas wrote:
> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> <michael(dot)banck(at)credativ(dot)de> wrote:
>> I have added a retry for this as well now, without a pg_sleep() as well.
>> This catches around 80% of the half-reads, but a few slip through. At
>> that point we bail out with exit(1), and the user can try again, which I
>> think is fine?
>
> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> robust at all.
>
FWIW I don't think this qualifies as a torn page - i.e. it's not a full
read with a mix of old and new data. This is a partial write, most
likely because we read the blocks one by one, and when we hit the last
page while the table is being extended, we may only see the first 4kB.
And if we retry very fast, we may still see only the first 4kB.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-02 21:49:33 |
Message-ID: | 56769250-df37-6bfe-76f4-b33566700462@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/2/19 5:08 PM, Stephen Frost wrote:
> Greetings,
>
> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
>> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
>>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
>>> <michael(dot)banck(at)credativ(dot)de> wrote:
>>>> I have added a retry for this as well now, without a pg_sleep() as well.
>>>> This catches around 80% of the half-reads, but a few slip through. At
>>>> that point we bail out with exit(1), and the user can try again, which I
>>>> think is fine?
>>>
>>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
>>> robust at all.
>>
>> The chance that pg_verify_checksums hits a torn page (at least in my
>> tests, see below) is already pretty low, a couple of times per 1000
>> runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
>> on. Otherwise, we now just issue a warning and skip the file (or so was
>> the idea, see below), do you think that is not acceptable?
>>
>> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
>> pg_verify_checksums in tight loops) with the current patch version, and
>> I am seeing short reads very, very rarely (maybe every 1000th run) with
>> a warning like:
>>
>> |1174
>> |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
>> |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
>> |Files skipped: 2
>>
>> The 1174 is the sequence number, the first 1173 runs of
>> pg_verify_checksums only skipped blocks.
>>
>> However, the fact it shows two warnings for the same file means there is
>> something wrong here. It was continuing to the next block while I think
>> it should just skip to the next file on read failures. So I have changed
>> that now, new patch attached.
>
> I'm confused - if previously it was continuing to the next block instead
> of doing the re-read on the same block, why don't we just change it to
> do the re-read on the same block properly and see if that fixes the
> retry, instead of just giving up and skipping..? I'm not necessarily
> against skipping to the next file, to be clear, but I think I'd be
> happier if we kept reading the file until we actually get EOF.
>
> (I've not looked at the actual patch, just read what you wrote..)
>
Notice that those two errors are actually for two consecutive blocks in
the same file. So what probably happened is that postgres started to
extend the relation, and the verification tried to read the last page after
the kernel added just the first 4kB filesystem page. Then it probably
succeeded on a retry, and then the same thing happened on the next page.
I don't think EOF addresses this, though - the partial read happens
before we actually reach the end of the file.
And re-reads are not a solution either, because the second read may
still see only the first half, and then what - is it a permanent issue
(in which case it's data corruption), or an extension in progress?
I wonder if we can simply ignore those errors entirely, if it's the last
page in the segment? We can't really check the file is "complete"
anyway, e.g. if you have multiple segments for a table, and the "middle"
one is a page shorter, we'll happily ignore that during verification.
Also, what if we're reading a file and it gets truncated (e.g. after
vacuum notices the last few pages are empty)? Doesn't that have the same
issue?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-02 22:00:31 |
Message-ID: | 20190302220031.j7ayfoimgr42ofij@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-02 22:49:33 +0100, Tomas Vondra wrote:
>
>
> On 3/2/19 5:08 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> >> On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
> >>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> >>> <michael(dot)banck(at)credativ(dot)de> wrote:
> >>>> I have added a retry for this as well now, without a pg_sleep() as well.
> >>>> This catches around 80% of the half-reads, but a few slip through. At
> >>>> that point we bail out with exit(1), and the user can try again, which I
> >>>> think is fine?
> >>>
> >>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> >>> robust at all.
> >>
> >> The chance that pg_verify_checksums hits a torn page (at least in my
> >> tests, see below) is already pretty low, a couple of times per 1000
> >> runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
> >> on. Otherwise, we now just issue a warning and skip the file (or so was
> >> the idea, see below), do you think that is not acceptable?
> >>
> >> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
> >> pg_verify_checksums in tight loops) with the current patch version, and
> >> I am seeing short reads very, very rarely (maybe every 1000th run) with
> >> a warning like:
> >>
> >> |1174
> >> |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
> >> |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
> >> |Files skipped: 2
> >>
> >> The 1174 is the sequence number, the first 1173 runs of
> >> pg_verify_checksums only skipped blocks.
> >>
> >> However, the fact it shows two warnings for the same file means there is
> >> something wrong here. It was continuing to the next block while I think
> >> it should just skip to the next file on read failures. So I have changed
> >> that now, new patch attached.
> >
> > I'm confused - if previously it was continuing to the next block instead
> > of doing the re-read on the same block, why don't we just change it to
> > do the re-read on the same block properly and see if that fixes the
> > retry, instead of just giving up and skipping..? I'm not necessarily
> > against skipping to the next file, to be clear, but I think I'd be
> > happier if we kept reading the file until we actually get EOF.
> >
> > (I've not looked at the actual patch, just read what you wrote..)
> >
>
> Notice that those two errors are actually for two consecutive blocks in
> the same file. So what probably happened is that postgres started to
> extend the relation, and the verification tried to read the last page after
> the kernel added just the first 4kB filesystem page. Then it probably
> succeeded on a retry, and then the same thing happened on the next page.
>
> I don't think EOF addresses this, though - the partial read happens
> before we actually reach the end of the file.
>
> And re-reads are not a solution either, because the second read may
> still see only the first half, and then what - is it a permanent issue
> (in which case it's data corruption), or an extension in progress?
>
> I wonder if we can simply ignore those errors entirely, if it's the last
> page in the segment? We can't really check the file is "complete"
> anyway, e.g. if you have multiple segments for a table, and the "middle"
> one is a page shorter, we'll happily ignore that during verification.
>
> Also, what if we're reading a file and it gets truncated (e.g. after
> vacuum notices the last few pages are empty)? Doesn't that have the same
> issue?
I gotta say, my conclusion from this debate is that it's simply a
mistake to do this without involvement of the server, which can use
locking to prevent these kinds of issues. It seems pretty absurd to me
to have hacky workarounds for partial writes on a live server, for
truncation, etc., even though the server has ways to deal with that.
- Andres
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-02 23:48:30 |
Message-ID: | 20190302234830.GA1999@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Mar 02, 2019 at 02:00:31PM -0800, Andres Freund wrote:
> I gotta say, my conclusion from this debate is that it's simply a
> mistake to do this without involvement of the server, which can use
> locking to prevent these kinds of issues. It seems pretty absurd to me
> to have hacky workarounds for partial writes on a live server, for
> truncation, etc., even though the server has ways to deal with that.
I agree with Andres on this one. We are never going to make this
stuff safe if we don't handle page reads with the proper locks because
of torn pages. What I think we should do is provide a SQL function
which reads a page in shared mode, and then checks its checksum if its
LSN is older than the previous redo point. This discards cases with
rather hot pages, but if the page is hot enough then the backend
re-reading the page would just do the same by verifying the page
checksum by itself.
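Roughly the shape I have in mind, as a loose sketch (the function name
and signature are invented; ReadBufferExtended, LockBuffer, PageGetLSN,
GetRedoRecPtr and pg_checksum_page are the existing server APIs this
would lean on, and the header list is approximate):

/*
 * Loose sketch only.  Reads a block through shared buffers with a
 * shared content lock, and verifies its checksum only when the page
 * LSN predates the previous redo point.  Note that ReadBufferExtended
 * itself already validates pages coming from disk.
 */
#include "postgres.h"
#include "access/relation.h"
#include "access/xlog.h"
#include "fmgr.h"
#include "storage/bufmgr.h"
#include "storage/checksum.h"
#include "utils/rel.h"

PG_FUNCTION_INFO_V1(pg_check_page);

Datum
pg_check_page(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    BlockNumber blkno = PG_GETARG_UINT32(1);
    Relation    rel = relation_open(relid, AccessShareLock);
    Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                         RBM_NORMAL, NULL);
    char        copy[BLCKSZ];
    bool        ok = true;
    Page        page;

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);

    /* Only verify pages untouched since the previous redo point. */
    if (PageGetLSN(page) < GetRedoRecPtr())
    {
        /* work on a copy: pg_checksum_page scribbles on its input */
        memcpy(copy, page, BLCKSZ);
        ok = (pg_checksum_page(copy, blkno) ==
              ((PageHeader) page)->pd_checksum);
    }

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);
    relation_close(rel, AccessShareLock);
    PG_RETURN_BOOL(ok);
}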
--
Michael
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-03 02:12:51 |
Message-ID: | ce2984fe-1c0f-41e1-588f-a9c061e7bbf2@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/3/19 12:48 AM, Michael Paquier wrote:
> On Sat, Mar 02, 2019 at 02:00:31PM -0800, Andres Freund wrote:
>> I gotta say, my conclusion from this debate is that it's simply a
>> mistake to do this without involvement of the server, which can use
>> locking to prevent these kinds of issues. It seems pretty absurd to me
>> to have hacky workarounds for partial writes on a live server, for
>> truncation, etc., even though the server has ways to deal with that.
>
> I agree with Andres on this one. We are never going to make this
> stuff safe if we don't handle page reads with the proper locks because
> of torn pages. What I think we should do is provide a SQL function
> which reads a page in shared mode, and then checks its checksum if its
> LSN is older than the previous redo point. This discards cases with
> rather hot pages, but if the page is hot enough then the backend
> re-reading the page would just do the same by verifying the page
> checksum by itself.
Handling torn pages is not difficult, and the patch already does that
(it reads the LSN of the last checkpoint from the control file, and uses
it the same way basebackup does). That's been working since (at least)
September, so I don't see how the SQL function would help with this?
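For clarity, that defence amounts to something like this in the tool's
per-block loop (a sketch; checkpoint_lsn stands for the redo LSN taken
from pg_control, and report_failure is an invented helper):

/*
 * Sketch of the basebackup-style torn-page defence.  A page whose LSN
 * is newer than the last checkpoint's redo LSN will be overwritten at
 * replay anyway, so a checksum mismatch on it is not reported.
 */
PageHeader  phdr = (PageHeader) buf;

if (PageXLogRecPtrGet(phdr->pd_lsn) > checkpoint_lsn)
    continue;                   /* modified after checkpoint: skip */

if (pg_checksum_page(buf, blkno) != phdr->pd_checksum)
    report_failure(blkno);      /* hypothetical reporting helper */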
The other issue (raised recently) is partial reads, where we read only a
fraction of the page. Basebackup simply ignores such pages, likely on
the assumption that it's either concurrent extension or truncation (in
which case it's newer than the last checkpoint LSN anyway). So maybe we
should do the same thing here. As I mentioned before, we can't reliably
detect incomplete segments anyway (at least I believe that's the case).
You and Andres may be right that trying to verify checksums online
without close interaction with the server is ultimately futile (or at
least overly complex). But I'm not sure those issues (torn pages and
partial reads) are very good arguments, considering basebackup has to
deal with them too. Not sure.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-03 06:58:26 |
Message-ID: | alpine.DEB.2.21.1903030743240.8095@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello Michaël,
>> I gotta say, my conclusion from this debate is that it's simply a
>> mistake to do this without involvement of the server, which can use
>> locking to prevent these kinds of issues. It seems pretty absurd to me
>> to have hacky workarounds for partial writes on a live server, for
>> truncation, etc., even though the server has ways to deal with that.
>
> I agree with Andres on this one. We are never going to make this stuff
> safe if we don't handle page reads with the proper locks because of torn
> pages. What I think we should do is provide a SQL function which reads a
> page in shared mode, and then checks its checksum if its LSN is older
> than the previous redo point. This discards cases with rather hot
> pages, but if the page is hot enough then the backend re-reading the
> page would just do the same by verifying the page checksum by itself. --
> Michael
My 0.02€ about that, as one of the reviewers of the patch:
I agree that having a server function (extension?) to do a full checksum
verification, possibly bandwidth-controlled, would be a good thing.
However it would have side effects, such as interfering deeply with the
server page cache, which may or may not be desirable.
On the other hand I also see value in an independent system-level external
tool capable of a best-effort checksum verification: the current check
that the cluster is offline to prevent pg_verify_checksums from running is
kind of artificial, and when online, simply counting
online-database-related checksum issues looks like a reasonable
compromise.
So basically I think that allowing pg_verify_checksums to run on an online
cluster is still a good thing, provided that expected errors are correctly
handled.
--
Fabien.
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-03 10:51:48 |
Message-ID: | 1551610308.4947.34.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Saturday, 2019-03-02 at 11:08 -0500, Stephen Frost wrote:
> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> > On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
> > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > > <michael(dot)banck(at)credativ(dot)de> wrote:
> > > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > > This catches around 80% of the half-reads, but a few slip through. At
> > > > that point we bail out with exit(1), and the user can try again, which I
> > > > think is fine?
> > >
> > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > > robust at all.
> >
> > The chance that pg_verify_checksums hits a torn page (at least in my
> > tests, see below) is already pretty low, a couple of times per 1000
> > runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
> > on. Otherwise, we now just issue a warning and skip the file (or so was
> > the idea, see below), do you think that is not acceptable?
> >
> > I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
> > pg_verify_checksums in tight loops) with the current patch version, and
> > I am seeing short reads very, very rarely (maybe every 1000th run) with
> > a warning like:
> >
> > > 1174
> > > pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
> > > pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
> > > Files skipped: 2
> >
> > The 1174 is the sequence number, the first 1173 runs of
> > pg_verify_checksums only skipped blocks.
> >
> > However, the fact it shows two warnings for the same file means there is
> > something wrong here. It was continuing to the next block while I think
> > it should just skip to the next file on read failures. So I have changed
> > that now, new patch attached.
>
> I'm confused - if previously it was continuing to the next block instead
> of doing the re-read on the same block, why don't we just change it to
> do the re-read on the same block properly and see if that fixes the
> retry, instead of just giving up and skipping..?
It was re-reading the block and continuing to read the file after it
got a short read even on re-read.
> I'm not necessarily against skipping to the next file, to be clear,
> but I think I'd be happier if we kept reading the file until we
> actually get EOF.
So if we read half a block twice we should seek() to the next block and
continue till EOF, ok. I think in most cases those pages will be new
anyway and there will be no checksum check, but it sounds like a cleaner
approach. I've seen one or two examples where we did successfully verify
the checksum of a page after a half-read, so it might be worth it.
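Roughly like this, I suppose (a sketch, with invented helper names):

/*
 * Sketch of "skip the bad block, keep going until EOF": a block that
 * is short on both the read and the re-read is counted as skipped,
 * and the scan moves to the next block offset, stopping only when
 * pread() reports an actual end of file.
 */
for (;;)
{
    ssize_t r = pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);

    if (r == 0)
        break;                              /* actual end of file */
    if (r < 0)
        break;                              /* real I/O error: report */
    if (r == BLCKSZ)
        verify_block(buf, blkno);           /* hypothetical check */
    else if (!reread_ok(fd, buf, blkno))    /* hypothetical re-read */
        skipped_blocks++;                   /* give up on this block */
    blkno++;
}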
The alternative would be to just bail out early and skip the file on the
first short read and (possibly) log a skipped file.
I still think that an external checksum verification tool has some
merit, given that basebackup does it and the current offline requirement
is really not useful in practice.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-04 01:00:18 |
Message-ID: | 20190304010018.GC1999@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Mar 03, 2019 at 03:12:51AM +0100, Tomas Vondra wrote:
> You and Andres may be right that trying to verify checksums online
> without close interaction with the server is ultimately futile (or at
> least overly complex). But I'm not sure those issues (torn pages and
> partial reads) are very good arguments, considering basebackup has to
> deal with them too. Not sure.
FWIW, I don't think that the backend's current way of checking
checksums is right either, with its warnings and the limited set of
failures generated. I raised concerns about that unfortunately after
11 had been GA'ed, which was too late, so this time, for this patch, I
prefer raising them before the fact, and I'd rather not spread this
kind of methodology around the core code more and more. I work a lot
with virtualization, and I have seen ESX hanging on I/O requests from
time to time depending on the environment used (which is actually
wrong, anyway, but a lot of tests happen on a daily basis on the stuff
I work on). What's presented on this thread is *never* going to be
100% safe, and would generate false positives which can be confusing
for the user. This is not a good sign.
--
Michael
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-04 03:06:16 |
Message-ID: | 20190304030616.GF1999@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Mar 03, 2019 at 11:51:48AM +0100, Michael Banck wrote:
> I still think that an external checksum verification tool has some
> merit, given that basebackup does it and the current offline requirement
> is really not useful in practise.
I am not going to argue again about the way checksum verification is
done in a base backup.. :)
Being able to do an online verification of checksums has a lot of
value, don't get me wrong, and an SQL interface to do that does not
prevent having a frontend wrapper using it.
--
Michael
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-04 03:09:41 |
Message-ID: | 20190304030941.GG1999@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
> I agree that having a server function (extension?) to do a full checksum
> verification, possibly bandwidth-controlled, would be a good thing. However
> it would have side effects, such as interfering deeply with the server page
> cache, which may or may not be desirable.
How is that different from VACUUM or a sequential scan? It is
possible to use buffer ring replacement strategies in such cases using
the normal clock-sweep algorithm, so that scanning a range of pages
does not really impact the Postgres shared buffer cache.
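For reference, that existing mechanism looks about like this from
server-side code (a sketch; GetAccessStrategy, ReadBufferExtended and
FreeAccessStrategy are the real APIs):

/*
 * Sketch: scan a relation with a bulk-read strategy, so the scan
 * cycles through a small ring of buffers instead of flooding shared
 * buffers.
 */
BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);

for (BlockNumber b = 0; b < nblocks; b++)
{
    Buffer  buf = ReadBufferExtended(rel, MAIN_FORKNUM, b,
                                     RBM_NORMAL, strategy);

    /* ... inspect the page here ... */
    ReleaseBuffer(buf);
}
FreeAccessStrategy(strategy);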
--
Michael
From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-04 06:05:39 |
Message-ID: | alpine.DEB.2.21.1903040702230.8095@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello Michaël,
>> I agree that having a server function (extension?) to do a full checksum
>> verification, possibly bandwidth-controlled, would be a good thing. However
>> it would have side effects, such as interfering deeply with the server page
>> cache, which may or may not be desirable.
>
> In what is that different from VACUUM or a sequential scan?
Scrubbing would read all files, not only relation data? I'm unsure about
what VACUUM does, but it is probably pretty similar.
> It is possible to use buffer ring replacement strategies in such cases
> using the normal clock-sweep algorithm, so that scanning a range of
> pages does not really impact Postgres shared buffer cache.
Good! I did not know that there was an existing strategy to avoid filling
the cache.
--
Fabien.
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-04 07:09:33 |
Message-ID: | CABUevEz7a6ABBDKzZaEf=6P5LNmHNDNQVK70Hgq2scUKUTSW3w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 4, 2019, 04:10 Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
> > I agree that having a server function (extension?) to do a full checksum
> > verification, possibly bandwidth-controlled, would be a good thing. However
> > it would have side effects, such as interfering deeply with the server page
> > cache, which may or may not be desirable.
>
> How is that different from VACUUM or a sequential scan? It is
> possible to use buffer ring replacement strategies in such cases using
> the normal clock-sweep algorithm, so that scanning a range of pages
> does not really impact the Postgres shared buffer cache.
>
Yeah, I wouldn't worry too much about the effect on the Postgres cache
when that is done. It could of course have a much worse impact on the
OS cache or on the "smart" (aka dumb) storage system cache. But that
effect will be there just as much with a separate tool.
/Magnus
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-04 14:01:46 |
Message-ID: | 090a264c-f6ba-ae69-86df-41408df66293@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/4/19 4:09 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
>> I agree that having a server function (extension?) to do a full checksum
>> verification, possibly bandwidth-controlled, would be a good thing. However
>> it would have side effects, such as interfering deeply with the server page
>> cache, which may or may not be desirable.
>
> How is that different from VACUUM or a sequential scan? It is
> possible to use buffer ring replacement strategies in such cases using
> the normal clock-sweep algorithm, so that scanning a range of pages
> does not really impact the Postgres shared buffer cache.
> --
But Fabien was talking about the page cache, not shared buffers. And we
can't use a custom ring buffer there. OTOH I don't see why accessing the
file through a SQL function would behave any differently from direct
access (i.e. what the tool does now).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-04 14:08:09 |
Message-ID: | 42c56652-bec1-9a6b-a765-979709457cf1@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/4/19 2:00 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 03:12:51AM +0100, Tomas Vondra wrote:
>> You and Andres may be right that trying to verify checksums online
>> without close interaction with the server is ultimately futile (or at
>> least overly complex). But I'm not sure those issues (torn pages and
>> partial reads) are very good arguments, considering basebackup has to
>> deal with them too. Not sure.
>
> FWIW, I don't think that the backend's current way of checking
> checksums is right either, with its warnings and the limited set of
> failures generated. I raised concerns about that unfortunately after
> 11 had been GA'ed, which was too late, so this time, for this patch, I
> prefer raising them before the fact, and I'd rather not spread this
> kind of methodology around the core code more and more.
I still don't understand what issue you see in how basebackup verifies
checksums. Can you point me to the explanation you've sent after 11 was
released?
> I work a lot with virtualization, and I have seen ESX hanging on
> I/O requests from time to time depending on the environment used
> (which is actually wrong, anyway, but a lot of tests happen on a
> daily basis on the stuff I work on). What's presented on this thread
> is *never* going to be 100% safe, and would generate false positives
> which can be confusing for the user. This is not a good sign.
So you have a workload/configuration that actually results in data
corruption yet we fail to detect that? Or we generate false positives?
Or what do you mean by "100% safe" here?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-04 15:53:25 |
Message-ID: | CABUevEwunDWDYqCJug_=gzzkX1mDgf0BMp67OGFR+cgucz6WdQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 4, 2019 at 3:02 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
>
>
> On 3/4/19 4:09 AM, Michael Paquier wrote:
> > On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
> >> I agree that having a server function (extension?) to do a full checksum
> >> verification, possibly bandwidth-controlled, would be a good thing. However
> >> it would have side effects, such as interfering deeply with the server page
> >> cache, which may or may not be desirable.
> >
> > How is that different from VACUUM or a sequential scan? It is
> > possible to use buffer ring replacement strategies in such cases using
> > the normal clock-sweep algorithm, so that scanning a range of pages
> > does not really impact the Postgres shared buffer cache.
> > --
>
> But Fabien was talking about the page cache, not shared buffers. And we
> can't use a custom ring buffer there. OTOH I don't see why accessing the
> file through a SQL function would behave any differently from direct
> access (i.e. what the tool does now).
>
It shouldn't.
One other thought that I had around this, though - if it's been covered
before and I missed it, please disregard :)
The *online* version of the tool is very similar to running pg_basebackup
to /dev/null, is it not? Except it doesn't set the cluster to backup mode.
Perhaps what we really want is a simpler way to do *that*. That wouldn't
necessarily make it a SQL-callable function, but it would be a CLI tool
that would call a command on a walsender, for example.
(We'd of course still need the standalone tool for offline checks)
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-05 03:12:06 |
Message-ID: | 20190305031206.GC3156@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 04, 2019 at 03:08:09PM +0100, Tomas Vondra wrote:
> I still don't understand what issue you see in how basebackup verifies
> checksums. Can you point me to the explanation you've sent after 11 was
> released?
The history is mostly on this thread:
https://www.postgresql.org/message-id/20181020044248.GD2553@paquier.xyz
> So you have a workload/configuration that actually results in data
> corruption yet we fail to detect that? Or we generate false positives?
> Or what do you mean by "100% safe" here?
What's proposed on this thread could generate false positives. Checks
which have deterministic properties and clean failure handling are
reliable when it comes to reports.
--
Michael
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-05 13:08:03 |
Message-ID: | a9aa939f-fcf6-017d-d7ce-0a7d26a72cc2@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/5/19 4:12 AM, Michael Paquier wrote:
> On Mon, Mar 04, 2019 at 03:08:09PM +0100, Tomas Vondra wrote:
>> I still don't understand what issue you see in how basebackup verifies
>> checksums. Can you point me to the explanation you've sent after 11 was
>> released?
>
> The history is mostly on this thread:
> https://www.postgresql.org/message-id/20181020044248.GD2553@paquier.xyz
>
Thanks, will look.
Based on quickly skimming that thread the main issue seems to be
deciding which files in the data directory are expected to have
checksums. Which is a valid issue, of course, but I was expecting
something about partial read/writes etc.
>> So you have a workload/configuration that actually results in data
>> corruption yet we fail to detect that? Or we generate false positives?
>> Or what do you mean by "100% safe" here?
>
> What's proposed on this thread could generate false positives. Checks
> which have deterministic properties and clean failure handling are
> reliable when it comes to reports.
My understanding is that:
(a) The checksum verification should not generate false positives (same
as for basebackup).
(b) The partial reads do emit warnings, which might be considered false
positives I guess. Which is why I'm arguing for changing it to do the
same thing basebackup does, i.e. ignore this.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 02:36:40 |
Message-ID: | 20190306023640.GC30982@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> Based on quickly skimming that thread the main issue seems to be
> deciding which files in the data directory are expected to have
> checksums. Which is a valid issue, of course, but I was expecting
> something about partial read/writes etc.
I remember complaining about partial write handling as well for the
base backup checks... There should be an email about it on the list,
cannot find it now ;p
> My understanding is that:
>
> (a) The checksum verification should not generate false positives (same
> as for basebackup).
>
> (b) The partial reads do emit warnings, which might be considered false
> positives I guess. Which is why I'm arguing for changing it to do the
> same thing basebackup does, i.e. ignore this.
Well, at least that's consistent... Argh, I really think that we
ought to make the reported failures harder, because that's easier to
detect within a tool, and some deployments set log_min_messages >
WARNING, so checksum failures would just be lost. For base backups we
don't care much about that, as files are just blindly copied, so they
could have torn pages, which is fine as that's fixed at replay. Now
we are talking about a set of tools which could have reliable
detection mechanisms for those problems.
--
Michael
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 02:42:21 |
Message-ID: | CAOuzzgoMGsWx-_pJH6hLLs=_a91wa+POzyntsesnO3ajOm0MyA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
On Tue, Mar 5, 2019 at 18:36 Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> > Based on quickly skimming that thread the main issue seems to be
> > deciding which files in the data directory are expected to have
> > checksums. Which is a valid issue, of course, but I was expecting
> > something about partial read/writes etc.
>
> I remember complaining about partial write handling as well for the
> base backup checks... There should be an email about it on the list,
> cannot find it now ;p
>
> > My understanding is that:
> >
> > (a) The checksum verification should not generate false positives (same
> > as for basebackup).
> >
> > (b) The partial reads do emit warnings, which might be considered false
> > positives I guess. Which is why I'm arguing for changing it to do the
> > same thing basebackup does, i.e. ignore this.
>
> Well, at least that's consistent... Argh, I really think that we
> ought to make the reported failures harder, because that's easier to
> detect within a tool, and some deployments set log_min_messages >
> WARNING, so checksum failures would just be lost. For base backups we
> don't care much about that, as files are just blindly copied, so they
> could have torn pages, which is fine as that's fixed at replay. Now
> we are talking about a set of tools which could have reliable
> detection mechanisms for those problems.
I’m traveling but will try to comment more in the coming days but in
general I agree with Tomas on these items. Also, pg_basebackup has to
handle torn pages when it comes to checksums just like the verify tool
does, and having them be consistent (along with external tools) would
really be for the best, imv. I still feel like a retry of a short read
(try reading more to get the whole page..) would be alright and reading
until we hit eof and then moving on. I’m not sure it’s possible but I do
worry a bit that we might get a short read from a network file system or
something that isn’t actually at eof and then we would skip a significant
remaining portion of the file... another thought might be to stat the
file after we have opened it to see it’s length...
Just a few thoughts since I’m on my phone. Will try to write up something
more in a day or two.
Thanks!
Stephen
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 17:26:58 |
Message-ID: | CA+TgmoZ1wb5x2YRG-Rut4Nb2_K+T6326nGKf0DP_y1bVLsi1bA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> FWIW I don't think this qualifies as a torn page - i.e. it's not a full
> read with a mix of old and new data. This is a partial write, most likely
> because we read the blocks one by one, and when we hit the last page
> while the table is being extended, we may only see the first 4kB. And if
> we retry very fast, we may still see only the first 4kB.
I see the distinction you're making, and you're right. The problem
is, whether in this case or for a real torn page, we don't
seem to have a way to distinguish between a state that occurs
transiently due to lack of synchronization and a situation that is
permanent and means that we have corruption. And that worries me,
because it means we'll either report bogus complaints that will scare
easily-panicked users (and anybody who is running this tool has a good
chance of being in the "easily-panicked" category ...), or else we'll
skip reporting real problems. Neither is good.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 17:33:49 |
Message-ID: | CA+TgmoY3xhRD1UV5RJXDeiz=HSwmy7i8o47vKMzM0xFwPVg5iQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
> On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > <michael(dot)banck(at)credativ(dot)de> wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> The chance that pg_verify_checksums hits a torn page (at least in my
> tests, see below) is already pretty low, a couple of times per 1000
> runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
> on. Otherwise, we now just issue a warning and skip the file (or so was
> the idea, see below), do you think that is not acceptable?
Yeah. Consider a paranoid customer with 100 clusters who runs this
every day on every cluster. They're going to see failures every day
or three and go ballistic.
I suspect that better retry logic might help here. I mean, I would
guess that 10 retries at 1 second intervals or something of that sort
would be enough to virtually eliminate false positives while still
allowing us to report persistent -- and thus real -- problems. But if
even that is going to produce false positives with any measurable
probability different from zero, then I think we have a problem,
because I neither like a verification tool that ignores possible signs
of trouble nor one that "cries wolf" when things are fine.
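Something like this, roughly (a sketch; the constants and the caller's
reporting are illustrative, not a worked-out design):

#include <unistd.h>             /* pread, sleep */

#define BLCKSZ      8192
#define MAX_RETRIES 10

/*
 * Sketch of the retry policy suggested above: a short read is retried
 * up to MAX_RETRIES times, one second apart, so only a persistent
 * problem gets reported.
 */
static int
read_block_with_retry(int fd, long blkno, char *buf)
{
    for (int attempt = 0; attempt <= MAX_RETRIES; attempt++)
    {
        ssize_t r = pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);

        if (r == BLCKSZ)
            return 1;           /* full block: go verify its checksum */
        if (r == 0)
            return 0;           /* EOF: the block no longer exists */
        sleep(1);               /* transient short read? wait, retry */
    }
    return -1;                  /* persistent problem: report it */
}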
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 17:42:04 |
Message-ID: | 20190306174204.mynxidkfvcw7nadg@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
> > On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
> > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > > <michael(dot)banck(at)credativ(dot)de> wrote:
> > > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > > This catches around 80% of the half-reads, but a few slip through. At
> > > > that point we bail out with exit(1), and the user can try again, which I
> > > > think is fine?
> > >
> > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > > robust at all.
> >
> > The chance that pg_verify_checksums hits a torn page (at least in my
> > tests, see below) is already pretty low, a couple of times per 1000
> > > runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
> > on. Otherwise, we now just issue a warning and skip the file (or so was
> > the idea, see below), do you think that is not acceptable?
>
> Yeah. Consider a paranoid customer with 100 clusters who runs this
> every day on every cluster. They're going to see failures every day
> or three and go ballistic.
+1
> I suspect that better retry logic might help here. I mean, I would
> guess that 10 retries at 1 second intervals or something of that sort
> would be enough to virtually eliminate false positives while still
> allowing us to report persistent -- and thus real -- problems. But if
> even that is going to produce false positives with any measurable
> probability different from zero, then I think we have a problem,
> because I neither like a verification tool that ignores possible signs
> of trouble nor one that "cries wolf" when things are fine.
To me the right way seems to be to IO lock the page via PG after such a
failure, and then retry. Which should be relatively easily doable for
the basebackup case, but obviously harder for the pg_verify_checksums
case.
Greetings,
Andres Freund
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 19:25:12 |
Message-ID: | cf720a09-a060-efa7-4cbb-9c5f5eb2d1ab@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/6/19 6:26 PM, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> FWIW I don't think this qualifies as a torn page - i.e. it's not a full
>> read with a mix of old and new data. This is a partial write, most likely
>> because we read the blocks one by one, and when we hit the last page
>> while the table is being extended, we may only see the first 4kB. And if
>> we retry very fast, we may still see only the first 4kB.
>
> I see the distinction you're making, and you're right. The problem
> is, whether in this case or for a real torn page, we don't
> seem to have a way to distinguish between a state that occurs
> transiently due to lack of synchronization and a situation that is
> permanent and means that we have corruption. And that worries me,
> because it means we'll either report bogus complaints that will scare
> easily-panicked users (and anybody who is running this tool has a good
> chance of being in the "easily-panicked" category ...), or else we'll
> skip reporting real problems. Neither is good.
>
Sure, I'd also prefer having a tool that reliably detects all cases of
data corruption, and I certainly do share your concerns about false
positives and false negatives.
But maybe we shouldn't expect a tool meant to verify checksums to detect
various other issues.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 19:37:39 |
Message-ID: | b15d1e0b-2e66-1cb8-65e0-dc51b4fe7d2f@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/6/19 6:42 PM, Andres Freund wrote:
> On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
>> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
>>> On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
>>>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
>>>> <michael(dot)banck(at)credativ(dot)de> wrote:
>>>>> I have added a retry for this as well now, without a pg_sleep() as well.
>>>>> This catches around 80% of the half-reads, but a few slip through. At
>>>>> that point we bail out with exit(1), and the user can try again, which I
>>>>> think is fine?
>>>>
>>>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
>>>> robust at all.
>>>
>>> The chance that pg_verify_checksums hits a torn page (at least in my
>>> tests, see below) is already pretty low, a couple of times per 1000
>>> runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
>>> on. Otherwise, we now just issue a warning and skip the file (or so was
>>> the idea, see below), do you think that is not acceptable?
>>
>> Yeah. Consider a paranoid customer with 100 clusters who runs this
>> every day on every cluster. They're going to see failures every day
>> or three and go ballistic.
>
> +1
>
>
>> I suspect that better retry logic might help here. I mean, I would
>> guess that 10 retries at 1 second intervals or something of that sort
>> would be enough to virtually eliminate false positives while still
>> allowing us to report persistent -- and thus real -- problems. But if
>> even that is going to produce false positives with any measurable
>> probability different from zero, then I think we have a problem,
>> because I neither like a verification tool that ignores possible signs
>> of trouble nor one that "cries wolf" when things are fine.
>
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.
>
Yes, if we could ensure the retry happens after completing the current
I/O on the page (without actually initiating a read into shared buffers)
that would work I think - both for partial reads and torn pages.
Not sure how to integrate it into the CLI tool, though. Perhaps it
could require connection info so that it can execute a function when
running in online mode?
cheers
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 19:41:51 |
Message-ID: | 20190306194151.4farhcjt4iib5gxm@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
> Not sure how to integrate it into the CLI tool, though. Perhaps it
> could require connection info so that it can execute a function when
> running in online mode?
To me the right fix would be to simply have this run as part of the
cluster / in a function. I don't see much point in running this outside
of the cluster.
Greetings,
Andres Freund
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 19:53:57 |
Message-ID: | 6c8c0acb-b1b3-2ec1-a203-c605f77a43a9@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/6/19 8:41 PM, Andres Freund wrote:
> Hi,
>
> On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
>> Not sure how to integrate it into the CLI tool, though. Perhaps it
>> could require connection info so that it can execute a function when
>> running in online mode?
>
> To me the right fix would be to simply have this run as part of the
> cluster / in a function. I don't see much point in running this outside
> of the cluster.
>
Not sure. AFAICS that would require a single transaction, and if we
happen to add some sort of throttling (which is a feature request I'd
expect pretty soon to make it usable on live clusters) that might be
quite long-running. So, not great.
If we want to run it from the server itself, then I guess a background
worker would be a better solution. Incidentally, that's something I've
been toying with some time ago, see [1].
[1] https://github.com/tvondra/scrub
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-07 02:16:41 |
Message-ID: | 20190307021641.GD17293@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Mar 06, 2019 at 08:53:57PM +0100, Tomas Vondra wrote:
> Not sure. AFAICS that would require a single transaction, and if we
> happen to add some sort of throttling (which is a feature request I'd
> expect pretty soon to make it usable on live clusters) that might be
> quite long-running. So, not great.
>
> If we want to run it from the server itself, then I guess a background
> worker would be a better solution. Incidentally, that's something I was
> toying with some time ago, see [1].
It does not prevent having a SQL function which acts as a wrapper on
top of the whole routine logic, does it? I think that it would be
nice to have the possibility to target a specific relation and a
specific page, as well as being able to check a relation fully at
once. It gets easier to check page ranges this way, and the
throttling can be part of the function doing a full-relation check.
--
Michael
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-07 11:53:30 |
Message-ID: | 29a0ef4d-7d5d-fe5c-253b-3f7f53df0859@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/6/19 6:42 PM, Andres Freund wrote:
>
> ...
>
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.
>
Actually, what do you mean by "IO lock the page"? Just waiting for the
current IO to complete (essentially BM_IO_IN_PROGRESS)? Or essentially
acquiring a lock and holding it for the duration of the check?
The former does not really help, because there might be another I/O
request initiated right after, interfering with the retry.
The latter might work, assuming the check is fast (which it probably
is). I wonder if this might cause issues due to loading possibly
corrupted data (with invalid checksums) into shared buffers. But then
again, we could just hack a special version of ReadBuffer_common() which
would just
(a) check if a page is in shared buffers, and if it is then consider the
checksum correct (because in memory it may be stale, and it was read
successfully so it was OK at that moment)
(b) if it's not in shared buffers already, try reading it and verify the
checksum, and then just evict it right away (not to spoil sb)
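A compressed sketch of (a) and (b), with illustrative names; the
buffer-table lookup is the standard pattern, and (b) here avoids the
eviction step entirely by reading through smgrread() instead of shared
buffers (new, all-zero pages would need the usual special-casing):

    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"
    #include "storage/smgr.h"

    static bool
    check_block_sketch(SMgrRelation reln, ForkNumber forknum,
                       BlockNumber blkno)
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partLock;
        int         buf_id;
        PGAlignedBlock page;

        /* (a): if the page is in shared buffers it was read successfully
         * at some point, so consider its checksum correct */
        INIT_BUFFERTAG(tag, reln->smgr_rnode.node, forknum, blkno);
        hash = BufTableHashCode(&tag);
        partLock = BufMappingPartitionLock(hash);
        LWLockAcquire(partLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partLock);
        if (buf_id >= 0)
            return true;

        /* (b): not cached, so read from disk and verify; smgrread() never
         * loads the page into shared buffers, hence nothing to evict */
        smgrread(reln, forknum, blkno, page.data);
        return pg_checksum_page(page.data, blkno) ==
            ((PageHeader) page.data)->pd_checksum;
    }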
Or did you have something else in mind?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-07 18:00:35 |
Message-ID: | 20190307180035.2sc4byi2mivjayy6@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> On 3/6/19 6:42 PM, Andres Freund wrote:
> >
> > ...
> >
> > To me the right way seems to be to IO lock the page via PG after such a
> > failure, and then retry. Which should be relatively easily doable for
> > the basebackup case, but obviously harder for the pg_verify_checksums
> > case.
> >
>
> Actually, what do you mean by "IO lock the page"? Just waiting for the
> current IO to complete (essentially BM_IO_IN_PROGRESS)? Or essentially
> acquiring a lock and holding it for the duration of the check?
The latter. And with IO lock I meant BufferDescriptorGetIOLock(), in
contrast to a buffer's content lock. That way we wouldn't block
modifications to the in-memory page.
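For concreteness, a hedged sketch of that recheck (pinning, and the case
where the page is not in shared buffers at all, are glossed over; names
are illustrative):

    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"
    #include "storage/smgr.h"

    /*
     * Re-read the on-disk copy while holding the buffer's I/O lock, so no
     * PostgreSQL write of this page can tear our read; the content lock is
     * left alone, so in-memory modifications are not blocked.
     */
    static bool
    recheck_page_under_iolock(SMgrRelation reln, ForkNumber forknum,
                              BlockNumber blkno, int buf_id)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
        PGAlignedBlock page;
        bool        ok;

        LWLockAcquire(BufferDescriptorGetIOLock(bufHdr), LW_SHARED);
        smgrread(reln, forknum, blkno, page.data);
        ok = pg_checksum_page(page.data, blkno) ==
            ((PageHeader) page.data)->pd_checksum;
        LWLockRelease(BufferDescriptorGetIOLock(bufHdr));

        return ok;
    }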
> The former does not really help, because there might be another I/O request
> initiated right after, interfering with the retry.
>
> The latter might work, assuming the check is fast (which it probably is). I
> wonder if this might cause issues due to loading possibly corrupted data
> (with invalid checksums) into shared buffers.
Oh, I was basically thinking that we'd just reread from disk outside of
postgres in that case, while preventing postgres related IO by holding
the IO lock.
But:
> But then again, we could just
> hack a special version of ReadBuffer_common() which would just
> (a) check if a page is in shared buffers, and if it is then consider the
> checksum correct (because in memory it may be stale, and it was read
> successfully so it was OK at that moment)
>
> (b) if it's not in shared buffers already, try reading it and verify the
> checksum, and then just evict it right away (not to spoil sb)
This'd also make sense and make the whole process more efficient. OTOH,
it might actually be worthwhile to check the on-disk page even if
there's in-memory state. Unless IO is in progress the on-disk page
always should be valid.
Greetings,
Andres Freund
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-08 11:51:21 |
Message-ID: | 1552045881.4947.43.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Sonntag, den 03.03.2019, 11:51 +0100 schrieb Michael Banck:
> Am Samstag, den 02.03.2019, 11:08 -0500 schrieb Stephen Frost:
> > I'm not necessairly against skipping to the next file, to be clear,
> > but I think I'd be happier if we kept reading the file until we
> > actually get EOF.
>
> So if we read half a block twice we should seek() to the next block and
> continue till EOF, ok. I think in most cases those pages will be new
> anyway and there will be no checksum check, but it sounds like a cleaner
> approach. I've seen one or two examples where we did successfully verify
> the checksum of a page after a half-read, so it might be worth it.
I've done that now, i.e. it seeks to the next block and continues to
read there (possibly getting an EOF).
I don't issue a warning for this skipped block anymore as it is somewhat
to be expected that we see some half-reads. If the seek fails for some
reason, we still issue a warning.
> I still think that an external checksum verification tool has some
> merit, given that basebackup does it and the current offline requirement
> is really not useful in practice.
I've read the rest of the thread, and it seems several people prefer a
solution that interacts with the server. I won't be able to work on that
for v12 and I guess it would be too late in the cycle anyway.
I thought about I/O throttling in online mode, but it seems to be most
easily tied in with the progress reporting (that already keeps track of
everything or most of what we'd need), so I will work on it in that
context.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V13.patch | text/x-patch | 10.1 KB |
From: | Julien Rouhaud <rjuju123(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-08 15:19:03 |
Message-ID: | CAOBaU_ZaC3PSm50Dih8w0AiyUNd+GbP3M90RR3LstVbr1jOdgQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >
> > But then again, we could just
> > hack a special version of ReadBuffer_common() which would just
>
> > (a) check if a page is in shared buffers, and if it is then consider the
> > checksum correct (because in memory it may be stale, and it was read
> > successfully so it was OK at that moment)
> >
> > (b) if it's not in shared buffers already, try reading it and verify the
> > checksum, and then just evict it right away (not to spoil sb)
>
> This'd also make sense and make the whole process more efficient. OTOH,
> it might actually be worthwhile to check the on-disk page even if
> there's in-memory state. Unless IO is in progress the on-disk page
> always should be valid.
Definitely. I've already seen servers with all-frozen read-only blocks
popular enough to never get evicted for months, and then a minor
upgrade / restart having catastrophic consequences.
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Julien Rouhaud <rjuju123(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-08 17:49:56 |
Message-ID: | 986bd2ca-4318-f8e4-8a5f-320082d31f52@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>>
>> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
>>>
>>> But then again, we could just
>>> hack a special version of ReadBuffer_common() which would just
>>
>>> (a) check if a page is in shared buffers, and if it is then consider the
>>> checksum correct (because in memory it may be stale, and it was read
>>> successfully so it was OK at that moment)
>>>
>>> (b) if it's not in shared buffers already, try reading it and verify the
>>> checksum, and then just evict it right away (not to spoil sb)
>>
>> This'd also make sense and make the whole process more efficient. OTOH,
>> it might actually be worthwhile to check the on-disk page even if
>> there's in-memory state. Unless IO is in progress the on-disk page
>> always should be valid.
>
> Definitely. I've already seen servers with all-frozen read-only blocks
> popular enough to never get evicted for months, and then a minor
> upgrade / restart having catastrophic consequences.
>
Do I understand correctly that the "catastrophic consequences" here are due
to data corruption / broken checksums on those on-disk pages?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Julien Rouhaud <rjuju123(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-08 17:59:41 |
Message-ID: | CAOBaU_ZaNXN1TMToZG4q4gQZeS7KVUx2_Hsa4MpAwhbu1N-WXA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Mar 8, 2019 at 6:50 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On 3/8/19 4:19 PM, Julien Rouhaud wrote:
> > On Thu, Mar 7, 2019 at 7:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >>
> >> On 2019-03-07 12:53:30 +0100, Tomas Vondra wrote:
> >>>
> >>> But then again, we could just
> >>> hack a special version of ReadBuffer_common() which would just
> >>
> >>> (a) check if a page is in shared buffers, and if it is then consider the
> >>> checksum correct (because in memory it may be stale, and it was read
> >>> successfully so it was OK at that moment)
> >>>
> >>> (b) if it's not in shared buffers already, try reading it and verify the
> >>> checksum, and then just evict it right away (not to spoil sb)
> >>
> >> This'd also make sense and make the whole process more efficient. OTOH,
> >> it might actually be worthwhile to check the on-disk page even if
> >> there's in-memory state. Unless IO is in progress the on-disk page
> >> always should be valid.
> >
> > Definitely. I've already seen servers with all-frozen read-only blocks
> > popular enough to never get evicted for months, and then a minor
> > upgrade / restart having catastrophic consequences.
> >
>
> Do I understand correctly that the "catastrophic consequences" here are due
> to data corruption / broken checksums on those on-disk pages?
Ah, yes, sorry, I should have been clearer. Indeed, there were silent
data corruptions (no checksums though) that were revealed by the
restart. So a routine minor update resulted in a massive outage.
Such a scenario can't be avoided if we always bypass the checksum check
for pages already in shared_buffers.
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 05:43:08 |
Message-ID: | 20190318054308.GC6197@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> On 3/2/19 12:03 AM, Robert Haas wrote:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > <michael(dot)banck(at)credativ(dot)de> wrote:
> >> I have added a retry for this as well now, without a pg_sleep() as well.
> >> This catches around 80% of the half-reads, but a few slip through. At
> >> that point we bail out with exit(1), and the user can try again, which I
> >> think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> FWIW I don't think this qualifies as torn page - i.e. it's not a full
> read with a mix of old and new data. This is a partial write, most likely
> because we read the blocks one by one, and when we hit the last page
> while the table is being extended, we may only see the first 4kB. And if
> we retry very fast, we may still see only the first 4kB.
I really still am not following why this is such an issue- we do a read,
get back 4KB, do another read, check if it's zero, and if so then we
should be able to conclude that we're at the end of the file, no? If
we're at the end of the file and we don't have a final complete block to
run a checksum check on then it seems clear to me that the file was
being extended and it's ok to skip that block. We could also stat the
file and keep track of where we are, to detect such an extension of the
file happening, if we wanted an additional cross-check, couldn't we? If
we do a read and get 4KB back and then do another and get 4KB back, then
we just treat it like we would an 8KB block. Really, as long as a
subsequent read is returning bytes then we keep going, and if it returns
zero then it's EOF. I could maybe see a "one final read" option, but I
don't think it makes sense to have some kind of time-based delay around
this where we keep trying to read.
All of this about hacking up a way to connect to PG and lock pages in
shared buffers so that we can perform a checksum check seems really
rather ridiculous for either the extension case or the regular mid-file
torn-page case.
To be clear, I agree completely that we don't want to be reporting false
positives or "this might mean corruption!" to users running the tool,
but I haven't seen a good explanation of why this needs to involve the
server to avoid that happening. If someone would like to point that out
to me, I'd be happy to go read about it and try to understand.
Thanks!
Stephen
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 05:47:06 |
Message-ID: | 20190318054706.GD6197@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> If we want to run it from the server itself, then I guess a background
> worker would be a better solution. Incidentally, that's something I
> was toying with some time ago, see [1].
So, I'm a big fan of this idea of having a background worker that's
running and (slowly, maybe configurably) scanning through the data
directory checking for corrupted pages. I'd certainly prefer it if that
background worker didn't fault those pages into shared buffers though,
and I don't really think it should need to even check if a given page is
currently being written out or is presently in shared buffers.
Basically, I'd think it would work just fine to have it essentially do
what I am imagining pg_checksums to do, but as a background worker.
Thanks!
Stephen
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 06:05:59 |
Message-ID: | 20190318060559.GF1885@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> To be clear, I agree completely that we don't want to be reporting false
> positives or "this might mean corruption!" to users running the tool,
> but I haven't seen a good explanation of why this needs to involve the
> server to avoid that happening. If someone would like to point that out
> to me, I'd be happy to go read about it and try to understand.
The mentions on this thread that the server has all the facility in
place to properly lock a buffer and make sure that a partial read
*never* happens and that we *never* have any kind of false positives,
directly preventing the set of issues we are trying to implement
workarounds for in a frontend tool are rather good arguments in my
opinion (you can grep for BufferDescriptorGetIOLock() on this thread
for example).
--
Michael
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 06:38:10 |
Message-ID: | 20190318063810.GI6197@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > To be clear, I agree completely that we don't want to be reporting false
> > positives or "this might mean corruption!" to users running the tool,
> > but I haven't seen a good explanation of why this needs to involve the
> > server to avoid that happening. If someone would like to point that out
> > to me, I'd be happy to go read about it and try to understand.
>
> The mentions on this thread that the server has all the facility in
> place to properly lock a buffer and make sure that a partial read
> *never* happens and that we *never* have any kind of false positives,
Uh, we are, of course, going to have partial reads- we just need to
handle them appropriately, and that's not hard to do in a way that we
never have false positives.
I do not understand, at all, the whole sub-thread argument that we have
to avoid partial reads. We certainly don't worry about that when doing
backups, and I don't see why we need to avoid it here. We are going to
have partial reads- and that's ok, as long as it's because we're at the
end of the file, and that's easy enough to check by just doing another
read to see if we get back zero bytes, which indicates we're at the end
of the file, and then we move on, no need to coordinate anything with
the backend for this.
> directly preventing the set of issues we are trying to implement
> workarounds for in a frontend tool are rather good arguments in my
> opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> for example).
Sure the backend has those facilities since it needs to, but these
frontend tools *don't* need that to *never* have any false positives, so
why are we complicating things by saying that this frontend tool and the
backend have to coordinate?
If there's an explanation of why we can't avoid having false positives
in the frontend tool, I've yet to see it. I definitely understand that
we can get partial reads, but a partial read isn't a failure, and
shouldn't be reported as such.
Thanks!
Stephen
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 07:18:18 |
Message-ID: | 1552893498.9697.30.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> * Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > To be clear, I agree completely that we don't want to be reporting false
> > > positives or "this might mean corruption!" to users running the tool,
> > > but I haven't seen a good explanation of why this needs to involve the
> > > server to avoid that happening. If someone would like to point that out
> > > to me, I'd be happy to go read about it and try to understand.
> >
> > The mentions on this thread that the server has all the facility in
> > place to properly lock a buffer and make sure that a partial read
> > *never* happens and that we *never* have any kind of false positives,
>
> Uh, we are, of course, going to have partial reads- we just need to
> handle them appropriately, and that's not hard to do in a way that we
> never have false positives.
I think the current patch
(V13 from /message-id/1552045881(dot)4947(dot)43(dot)camel(at)credativ(dot)de)
does that, modulo possible bugs.
> I do not understand, at all, the whole sub-thread argument that we have
> to avoid partial reads. We certainly don't worry about that when doing
> backups, and I don't see why we need to avoid it here. We are going to
> have partial reads- and that's ok, as long as it's because we're at the
> end of the file, and that's easy enough to check by just doing another
> read to see if we get back zero bytes, which indicates we're at the end
> of the file, and then we move on, no need to coordinate anything with
> the backend for this.
Well, I agree with you, but we don't seem to have consensus on that.
> > directly preventing the set of issues we are trying to implement
> > workarounds for in a frontend tool are rather good arguments in my
> > opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> > for example).
>
> Sure the backend has those facilities since it needs to, but these
> frontend tools *don't* need that to *never* have any false positives, so
> why are we complicating things by saying that this frontend tool and the
> backend have to coordinate?
>
> If there's an explanation of why we can't avoid having false positives
> in the frontend tool, I've yet to see it. I definitely understand that
> we can get partial reads, but a partial read isn't a failure, and
> shouldn't be reported as such.
It is not in the current patch; it should just get reported as a skipped
block in the end. That is, if the cluster is online; if it is offline,
we do consider it a failure.
I have now rebased that patch on top of the pg_verify_checksums ->
pg_checksums renaming, see attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V14.patch | text/x-patch | 10.1 KB |
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 07:34:03 |
Message-ID: | 20190318073403.GL6197@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > * Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > To be clear, I agree completely that we don't want to be reporting false
> > > > positives or "this might mean corruption!" to users running the tool,
> > > > but I haven't seen a good explanation of why this needs to involve the
> > > > server to avoid that happening. If someone would like to point that out
> > > > to me, I'd be happy to go read about it and try to understand.
> > >
> > > The mentions on this thread that the server has all the facility in
> > > place to properly lock a buffer and make sure that a partial read
> > > *never* happens and that we *never* have any kind of false positives,
> >
> > Uh, we are, of course, going to have partial reads- we just need to
> > handle them appropriately, and that's not hard to do in a way that we
> > never have false positives.
>
> I think the current patch
> (V13 from /message-id/1552045881(dot)4947(dot)43(dot)camel(at)credativ(dot)de)
> does that, modulo possible bugs.
I think the question here is- do you ever see false positives with this
latest version..? If you are, then that's an issue and we should
discuss and try to figure out what's happening. If you aren't seeing
false positives, then it seems like we're done here, right?
> > I do not understand, at all, the whole sub-thread argument that we have
> > to avoid partial reads. We certainly don't worry about that when doing
> > backups, and I don't see why we need to avoid it here. We are going to
> > have partial reads- and that's ok, as long as it's because we're at the
> > end of the file, and that's easy enough to check by just doing another
> > read to see if we get back zero bytes, which indicates we're at the end
> > of the file, and then we move on, no need to coordinate anything with
> > the backend for this.
>
> Well, I agree with you, but we don't seem to have consensus on that.
I feel like everyone is concerned that we'd report an acceptable partial
read as a failure, hence it would be a false positive, and I agree
entirely that we don't want false positives, but the answer to that
seems to be that we shouldn't report partial reads as failures, solving
the problem in a simple way that doesn't involve the server and doesn't
materially reduce the check that's being performed.
> > > directly preventing the set of issues we are trying to implement
> > > workarounds for in a frontend tool are rather good arguments in my
> > > opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> > > for example).
> >
> > Sure the backend has those facilities since it needs to, but these
> > frontend tools *don't* need that to *never* have any false positives, so
> > why are we complicating things by saying that this frontend tool and the
> > backend have to coordinate?
> >
> > If there's an explanation of why we can't avoid having false positives
> > in the frontend tool, I've yet to see it. I definitely understand that
> > we can get partial reads, but a partial read isn't a failure, and
> > shouldn't be reported as such.
>
> It is not in the current patch; it should just get reported as a skipped
> block in the end. That is, if the cluster is online; if it is offline,
> we do consider it a failure.
Ok, that sounds fine- and do we ever see false positives now?
> I have now rebased that patch on top of the pg_verify_checksums ->
> pg_checksums renaming, see attached.
Thanks for that. Reading through the code though, I don't entirely
understand why we're making things complicated for ourselves by trying
to seek and re-read the entire block, specifically this:
> if (r != BLCKSZ)
> {
> - fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
> - progname, blockno, fn, r, BLCKSZ);
> - exit(1);
> + if (online)
> + {
> + if (block_retry)
> + {
> + /* We already tried once to reread the block, skip to the next block */
> + skippedblocks++;
> + if (lseek(f, BLCKSZ-r, SEEK_CUR) == -1)
> + {
> + skippedfiles++;
> + fprintf(stderr, _("%s: could not lseek to next block in file \"%s\": %m\n"),
> + progname, fn);
> + return;
> + }
> + continue;
> + }
> +
> + /*
> + * Retry the block. It's possible that we read the block while it
> + * was extended or shrunk, so it ends up looking torn to us.
> + */
> +
> + /*
> + * Seek back by the amount of bytes we read to the beginning of
> + * the failed block.
> + */
> + if (lseek(f, -r, SEEK_CUR) == -1)
> + {
> + skippedfiles++;
> + fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"),
> + progname, fn);
> + return;
> + }
> +
> + /* Set flag so we know a retry was attempted */
> + block_retry = true;
> +
> + /* Reset loop to validate the block again */
> + blockno--;
> +
> + continue;
> + }
I would think that we could just do:
insert_location = 0;
r = read(BLCKSIZE - insert_location);
if (r < 0) error();
if (r == 0) EOF detected, move to next
if (r < (BLCKSIZE - insert_location)) {
insert_location += r;
continue;
}
At this point, we should have a full block, do our checks...
Have you seen cases where the kernel will actually return a partial read
for something that isn't at the end of the file, and where you could
actually lseek past that point and read the next block? I'd be really
curious to see that if you can reproduce it... I've definitely seen
empty pages come back with a claim that the full amount was read, but
that's a very different thing.
Obviously the same goes for anywhere else we're trying to handle a
partial read return from..
Thanks!
Stephen
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 07:34:12 |
Message-ID: | 20190318073412.GG1885@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote:
> Uh, we are, of course, going to have partial reads- we just need to
> handle them appropriately, and that's not hard to do in a way that we
> never have false positives.
Er, my apologies here. I meant the read of a torn page, not a
partial read (when extending the relation file we have locks
preventing a partial read as well, by the way).
--
Michael
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 07:38:02 |
Message-ID: | 20190318073802.GM6197@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> On Mon, Mar 18, 2019 at 02:38:10AM -0400, Stephen Frost wrote:
> > Uh, we are, of course, going to have partial reads- we just need to
> > handle them appropriately, and that's not hard to do in a way that we
> > never have false positives.
>
> Er, my apologies here. I meant the read of a torn page, not a
In the case of a torn page, we should be able to check the LSN, as
discussed extensively previously, and if the LSN is from after the
checkpoint we started at then we should be fine to skip the page.
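In frontend terms, that test could look like the following sketch, with
start_lsn standing in for the checkpoint LSN taken from pg_control when
the tool started (the function name is illustrative):

    #include "storage/bufpage.h"

    /*
     * Sketch: a page whose LSN is past the checkpoint we started from was
     * (re-)written concurrently; WAL replay covers it, so a checksum
     * mismatch on it can be skipped rather than reported.
     */
    static bool
    page_written_since_start(const char *page, XLogRecPtr start_lsn)
    {
        const PageHeaderData *phdr = (const PageHeaderData *) page;

        return PageXLogRecPtrGet(phdr->pd_lsn) > start_lsn;
    }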
> partial read (when extending the relation file we have locks
> preventing a partial read as well, by the way).
Yes, we do, in the backend... We don't have (nor do we need) to get
involved in those locks for these tools though..
Thanks!
Stephen
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 07:39:42 |
Message-ID: | 1552894782.9697.33.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Montag, den 18.03.2019, 08:18 +0100 schrieb Michael Banck:
> I have now rebased that patch on top of the pg_verify_checksums ->
> pg_checksums renaming, see attached.
Sorry, I had missed some hunks in the TAP tests, fixed-up patch
attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V15.patch | text/x-patch | 10.0 KB |
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 07:52:28 |
Message-ID: | 1552895548.9697.35.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi.
Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > > * Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > To be clear, I agree completely that we don't want to be reporting false
> > > > > positives or "this might mean corruption!" to users running the tool,
> > > > > but I haven't seen a good explanation of why this needs to involve the
> > > > > server to avoid that happening. If someone would like to point that out
> > > > > to me, I'd be happy to go read about it and try to understand.
> > > >
> > > > The mentions on this thread that the server has all the facility in
> > > > place to properly lock a buffer and make sure that a partial read
> > > > *never* happens and that we *never* have any kind of false positives,
> > >
> > > Uh, we are, of course, going to have partial reads- we just need to
> > > handle them appropriately, and that's not hard to do in a way that we
> > > never have false positives.
> >
> > I think the current patch
> > (V13 from /message-id/1552045881(dot)4947(dot)43(dot)camel(at)credativ(dot)de)
> > does that, modulo possible bugs.
>
> I think the question here is- do you ever see false positives with this
> latest version..? If you are, then that's an issue and we should
> discuss and try to figure out what's happening. If you aren't seeing
> false positives, then it seems like we're done here, right?
What do you mean by false positives here? I've never seen a bogus
checksum failure, i.e. pg_checksums claiming some checksum is wrong
because it only read half of a block or hit a torn page.
I do see sporadic partial reads and they get treated by the re-check
logic and (if that is not enough) get tallied up as a skipped block in
the end. Is that a false positive in your book?
[...]
> > I have now rebased that patch on top of the pg_verify_checksums ->
> > pg_checksums renaming, see attached.
>
> Thanks for that. Reading through the code though, I don't entirely
> understand why we're making things complicated for ourselves by trying
> to seek and re-read the entire block, specifically this:
[...]
> I would think that we could just do:
>
> insert_location = 0;
> r = read(BLCKSIZE - insert_location);
> if (r < 0) error();
> if (r == 0) EOF detected, move to next
> if (r < (BLCKSIZE - insert_location)) {
> insert_location += r;
> continue;
> }
>
> At this point, we should have a full block, do our checks...
Well, we need to read() into some buffer which you have omitted.
So if we had a short read, and then read the rest of the block via
(BLCKSIZE - insert_location) wouldn't we have to read that in a second
buffer and then join the two in order to compute the checksum? That
does not sound simpler to me than just re-reading the block entirely.
> Have you seen cases where the kernel will actually return a partial read
> for something that isn't at the end of the file, and where you could
> actually lseek past that point and read the next block? I'd be really
> curious to see that if you can reproduce it... I've definitely seen
> empty pages come back with a claim that the full amount was read, but
> that's a very different thing.
Well, I've seen partial reads and I have seen very rarely that it will
continue to read another block afterwards. If the relation is being
extended while we check it, it sounds plausible that another block could
be written before we get to read EOF on the next read() after a partial
read() so that does not sound like a bug to me either.
I might be misunderstanding your question though?
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 08:11:29 |
Message-ID: | CAOuzzgr26WY3izp8_svYHVZp0ZUmyp5WiuJB7u_2V5PHHcheHw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:
> Hi.
>
> Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> > > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > > > * Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> > > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > > To be clear, I agree completely that we don't want to be
> > > > > > reporting false positives or "this might mean corruption!" to
> > > > > > users running the tool, but I haven't seen a good explanation
> > > > > > of why this needs to involve the server to avoid that
> > > > > > happening. If someone would like to point that out to me, I'd
> > > > > > be happy to go read about it and try to understand.
> > > > >
> > > > > The mentions on this thread that the server has all the facility in
> > > > > place to properly lock a buffer and make sure that a partial read
> > > > > *never* happens and that we *never* have any kind of false positives,
> > > >
> > > > Uh, we are, of course, going to have partial reads- we just need to
> > > > handle them appropriately, and that's not hard to do in a way that we
> > > > never have false positives.
> > >
> > > I think the current patch
> > > (V13 from /message-id/1552045881(dot)4947(dot)43(dot)camel(at)credativ(dot)de)
> > > does that, modulo possible bugs.
> >
> > I think the question here is- do you ever see false positives with this
> > latest version..? If you are, then that's an issue and we should
> > discuss and try to figure out what's happening. If you aren't seeing
> > false positives, then it seems like we're done here, right?
>
> What do you mean by false positives here? I've never seen a bogus
> checksum failure, i.e. pg_checksums claiming some checksum is wrong
> because it only read half of a block or hit a torn page.
>
> I do see sporadic partial reads and they get treated by the re-check
> logic and (if that is not enough) get tallied up as a skipped block in
> the end. Is that a false positive in your book?
No, that’s clearly not a false positive.
[...]
>
> > > I have now rebased that patch on top of the pg_verify_checksums ->
> > > pg_checksums renaming, see attached.
> >
> > Thanks for that. Reading through the code though, I don't entirely
> > understand why we're making things complicated for ourselves by trying
> > to seek and re-read the entire block, specifically this:
>
> [...]
>
> > I would think that we could just do:
> >
> > insert_location = 0;
> > r = read(BLCKSIZE - insert_location);
> > if (r < 0) error();
> > if (r == 0) EOF detected, move to next
> > if (r < (BLCKSIZE - insert_location)) {
> > insert_location += r;
> > continue;
> > }
> >
> > At this point, we should have a full block, do our checks...
>
> Well, we need to read() into some buffer which you have omitted.
Surely there’s a buffer the read in the existing code is passing in, you
just need to offset by the current pointer, sorry for not being clear.
In other words the read would look more like:
read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
And then you have to reset insert_ptr once you have a full block.
> So if we had a short read, and then read the rest of the block via
> (BLCKSIZE - insert_location) wouldn't we have to read that in a second
> buffer and then join the two in order to compute the checksum? That
> does not sound simpler to me than just re-reading the block entirely.
No, just read into your existing buffer at the point where the prior
partial read left off...
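Spelled out, the accumulating loop would look roughly like this (a
sketch; the helper name and error handling are illustrative):

    #include <unistd.h>

    #define BLCKSZ 8192

    /*
     * Keep filling the same buffer until a full block is assembled.
     * Returns 1 on a full block, 0 on EOF in the middle of a block (a
     * trailing partial block: skip it), and -1 on a genuine read error.
     */
    static int
    read_full_block(int fd, char *buf)
    {
        int         insert_ptr = 0;

        while (insert_ptr < BLCKSZ)
        {
            ssize_t     r = read(fd, buf + insert_ptr, BLCKSZ - insert_ptr);

            if (r < 0)
                return -1;          /* real I/O error */
            if (r == 0)
                return 0;           /* EOF mid-block */
            insert_ptr += r;        /* partial read: keep going */
        }
        return 1;                   /* full block, ready for the checksum */
    }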
> > Have you seen cases where the kernel will actually return a partial read
> > for something that isn't at the end of the file, and where you could
> > actually lseek past that point and read the next block? I'd be really
> > curious to see that if you can reproduce it... I've definitely seen
> > empty pages come back with a claim that the full amount was read, but
> > that's a very different thing.
>
> Well, I've seen partial reads and I have seen very rarely that it will
> continue to read another block afterwards. If the relation is being
> extended while we check it, it sounds plausible that another block could
> be written before we get to read EOF on the next read() after a partial
> read() so that does not sound like a bug to me either.
Right, absolutely you can have a partial read during a relation extension
and then come back around and do another read and discover more data;
that’s entirely reasonable and I’ve seen it happen too.
> I might be misunderstanding your question though?
Yes, the question was more like this: have you ever seen a read return a
partial result when you know you’re in the middle somewhere of an existing
file and the length of the file hasn’t been changed by something else..? I
can’t say that I have, when reading from regular files, even in
kernel-error type of conditions due to hardware issues, but I’m open to
being told I’m wrong... in such a case though I would still expect an
error on a subsequent read, which would work just fine for our case. If the
kernel just decides to return a zero in that case then I don’t know that
there’s really anything we can do about that because that seems like it
would be pretty clearly broken results from the kernel and that’s out of
scope for this.
Apologies if this isn’t clear, on my phone now.
Thanks!
Stephen
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 20:02:58 |
Message-ID: | CA+Tgmoa31CEhvXC5gocnTKWa4JkcAZOdmpihjqz98qRff3CCiw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 18, 2019 at 2:06 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> The mentions on this thread that the server has all the facility in
> place to properly lock a buffer and make sure that a partial read
> *never* happens and that we *never* have any kind of false positives,
> directly preventing the set of issues we are trying to implement
> workarounds for in a frontend tool are rather good arguments in my
> opinion (you can grep for BufferDescriptorGetIOLock() on this thread
> for example).
Yeah, exactly. It may be that there is a good way to avoid those
issues without interacting with the server and that would be nice, but
... as far as I can see, nobody's figured out a way that's reliable
yet, and all of the solutions proposed so far basically amount to
"let's ignore things that might be serious problems because they might
be transient" and/or "let's retry and see if the problem goes away."
I'm more sanguine about a retry-based solution than an
ignore-possible-problems solution, but what's been proposed so far
seems quite prone to retrying so fast that it makes no difference, and
it's not clear how much code complexity we'd have to add to do better
or how reliable it would be even then.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 20:15:42 |
Message-ID: | 1552940142.9697.40.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
> > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > Thanks for that. Reading through the code though, I don't entirely
> > > understand why we're making things complicated for ourselves by trying
> > > to seek and re-read the entire block, specifically this:
> >
> > [...]
> >
> > > I would think that we could just do:
> > >
> > > insert_location = 0;
> > > r = read(BLCKSIZE - insert_location);
> > > if (r < 0) error();
> > > if (r == 0) EOF detected, move to next
> > > if (r < (BLCKSIZE - insert_location)) {
> > > insert_location += r;
> > > continue;
> > > }
> > >
> > > At this point, we should have a full block, do our checks...
> >
> > Well, we need to read() into some buffer which you have omitted.
>
> Surely there’s a buffer the read in the existing code is passing in,
> you just need to offset by the current pointer, sorry for not being
> clear.
>
> In other words the read would look more like:
>
> read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
>
> And then you have to reset insert_ptr once you have a full block.
Ok, thanks for clearing that up.
I've tried to do that now in the attached, does that suit you?
> Yes, the question was more like this: have you ever seen a read return
> a partial result when you know you’re in the middle somewhere of an
> existing file and the length of the file hasn’t been changed by
> something else..?
I don't think I've seen that, but I guess it wouldn't turn up in regular
testing anyway, only in pathological cases? I guess we are
probably dealing with this in the current version of the patch, but I
can't say for certain as it sounds pretty difficult to test.
I have also added a paragraph to the documentation about possibly
skipping new or recently updated pages:
+ If the cluster is online, pages that have been (re-)written since the last
+ checkpoint will not count as checksum failures if they cannot be read or
+ verified correctly.
Wording improvements welcome.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V16.patch | text/x-patch | 9.9 KB |
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-18 21:50:36 |
Message-ID: | CAOuzzgrZyz_vgRjd8GrLen39B7KWXY-qTyobDE5nP2mSTtGOgQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
On Tue, Mar 19, 2019 at 04:15 Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:
> Am Montag, den 18.03.2019, 16:11 +0800 schrieb Stephen Frost:
> > On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael(dot)banck(at)credativ(dot)de>
> wrote:
> > > Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > > > Thanks for that. Reading through the code though, I don't entirely
> > > > understand why we're making things complicated for ourselves by
> trying
> > > > to seek and re-read the entire block, specifically this:
> > >
> > > [...]
> > >
> > > > I would think that we could just do:
> > > >
> > > > insert_location = 0;
> > > > r = read(BLCKSIZE - insert_location);
> > > > if (r < 0) error();
> > > > if (r == 0) EOF detected, move to next
> > > > if (r < (BLCKSIZE - insert_location)) {
> > > > insert_location += r;
> > > > continue;
> > > > }
> > > >
> > > > At this point, we should have a full block, do our checks...
> > >
> > > Well, we need to read() into some buffer which you have omitted.
> >
> > Surely there’s a buffer the read in the existing code is passing in,
> > you just need to offset by the current pointer, sorry for not being
> > clear.
> >
> > In other words the read would look more like:
> >
> > read(fd,buf + insert_ptr, BUFSZ - insert_ptr)
> >
> > And then you have to reset insert_ptr once you have a full block.
>
> Ok, thanks for clearing that up.
>
> I've tried to do that now in the attached, does that suit you?
Yes, that’s what I was thinking. I’m honestly not entirely convinced that
the lseek() efforts still need to be put in- I would have thought it’d be
fine to simply check the LSN on a checksum failure and mark it as skipped
if the LSN is past the current checkpoint. That seems like it would make
things much simpler, but I’m also not against keeping that logic now that
it’s in, provided it doesn’t cause issues.
> > Yes, the question was more like this: have you ever seen a read return
> > a partial result when you know you’re in the middle somewhere of an
> > existing file and the length of the file hasn’t been changed by
> > something else..?
>
> I don't think I've seen that, but I guess it wouldn't turn up in regular
> testing anyway, only in pathological cases? I guess we are
> probably dealing with this in the current version of the patch, but I
> can't say for certain as it sounds pretty difficult to test.
Yeah, a lot of things in this area are unfortunately difficult to test.
I’m glad to hear that it doesn’t sound like you’ve seen it though.
> I have also added a paragraph to the documentation about possibly
> skipping new or recently updated pages:
>
> + If the cluster is online, pages that have been (re-)written since the last
> + checkpoint will not count as checksum failures if they cannot be read or
> + verified correctly.
I would flip this around:
---
In an online cluster, pages are being concurrently written to the files
while the check is being run, leading to possible torn pages or partial
reads. When the tool detects a concurrently written page, indicated by the
page’s LSN being beyond the checkpoint the tool started at, that page will
be reported as skipped. Note that in a crash scenario, any pages written
since the last checkpoint will be replayed from the WAL.
---
Now here’s the $64 question- have you tested this latest version under
load..? If not, could you? And when you do, can you report back what the
results are? Do you still see any actual checksum failures? Do the number
of skipped pages seem reasonable in your tests or is there a concern there?
If you still see actual checksum failures which aren’t because the LSN is
higher than the checkpoint, or because of a short read, then we need to
investigate further but hopefully that isn’t happening now. I think a lot
of the concerns raised on this thread about wanting to avoid false
positives are because the torn page (with higher LSN than current
checkpoint) and short read cases were previously reported as failures when
they really are expected. Let’s test this as much as we can and make sure
we aren’t seeing false positives anymore.
Thanks!
Stephen
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 15:22:30 |
Message-ID: | CA+TgmobAv2nYJuqdAaV3vcBppwsKYRMziyussSeFTZQ8y2eAzA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 18, 2019 at 2:38 AM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> Sure the backend has those facilities since it needs to, but these
> frontend tools *don't* need that to *never* have any false positives, so
> why are we complicating things by saying that this frontend tool and the
> backend have to coordinate?
>
> If there's an explanation of why we can't avoid having false positives
> in the frontend tool, I've yet to see it. I definitely understand that
> we can get partial reads, but a partial read isn't a failure, and
> shouldn't be reported as such.
I think there's some confusion between 'partial read' and 'torn page',
as Michael also said.
It's torn pages that I am concerned about - the server is writing and
we are reading, and we get a mix of old and new content. We have been
quite diligent about protecting ourselves from such risks elsewhere,
and checksum verification should not be held to any lesser standard.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 15:52:08 |
Message-ID: | 1553010728.9697.51.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> It's torn pages that I am concerned about - the server is writing and
> we are reading, and we get a mix of old and new content. We have been
> quite diligent about protecting ourselves from such risks elsewhere,
> and checksum verification should not be held to any lesser standard.
If we see a checksum failure on an otherwise correctly read block in
online mode, we retry the block on the theory that we might have read a
torn page. If the checksum verification still fails, we compare its LSN
to the LSN of the current checkpoint and don't mind if it's newer. This
way, a torn page should not cause a false positive either way, I think.
If it is a genuine storage failure we will see it in the next
pg_checksums run as its LSN will be older than the checkpoint. The
basebackup checksum verification works in the same way.
I am happy to look into further options for making things better, but I am
not sure what the actual problem might be that you mention above. I will
see whether I can stress-test the patch a bit more; I've already taxed the
SSD on my company notebook quite a bit during development, so I will try
to get hold of some real server hardware somewhere.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 15:58:56 |
Message-ID: | 20190319155856.esqdvzr3tfoqeqi2@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> On Tuesday, 19.03.2019, at 11:22 -0400, Robert Haas wrote:
> > It's torn pages that I am concerned about - the server is writing and
> > we are reading, and we get a mix of old and new content. We have been
> > quite diligent about protecting ourselves from such risks elsewhere,
> > and checksum verification should not be held to any lesser standard.
>
> If we see a checksum failure on an otherwise correctly read block in
> online mode, we retry the block on the theory that we might have read a
> torn page. If the checksum verification still fails, we compare its LSN
> to the LSN of the current checkpoint and don't mind if it's newer. This
> way, a torn page should not cause a false positive either way, I
> think.
False positives, no. But there's plenty of potential for false
negatives. In plenty of clusters, a large fraction of the pages are going
to be touched in most checkpoints.
> If it is a genuine storage failure we will see it in the next
> pg_checksums run as its LSN will be older than the checkpoint.
Well, but also, by that time it might be too late to recover things. Or
it might be a backup that you just made, that you later want to recover
from, ...
> The basebackup checksum verification works in the same way.
Shouldn't have been merged that way.
Greetings,
Andres Freund
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 19:27:55 |
Message-ID: | CAOuzzgrhZJ6kBaK7v+Xra=q7_XFbtXyQCBFrvcrsCqZZ20WFTw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
> On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > On Tuesday, 19.03.2019, at 11:22 -0400, Robert Haas wrote:
> > > It's torn pages that I am concerned about - the server is writing and
> > > we are reading, and we get a mix of old and new content. We have been
> > > quite diligent about protecting ourselves from such risks elsewhere,
> > > and checksum verification should not be held to any lesser standard.
> >
> > If we see a checksum failure on an otherwise correctly read block in
> > online mode, we retry the block on the theory that we might have read a
> > torn page. If the checksum verification still fails, we compare its LSN
> > to the LSN of the current checkpoint and don't mind if it's newer. This
> > way, a torn page should not cause a false positive either way, I
> > think.
>
> > False positives, no. But there's plenty of potential for false
> > negatives. In plenty of clusters, a large fraction of the pages are going
> > to be touched in most checkpoints.
How is it a false negative? The page was in the middle of being written;
if we crash, the page won’t be used because it’ll get overwritten by WAL
replay from the checkpoint, and if we don’t crash then it also won’t be
used until it’s been written out completely. I don’t agree that this is in
any way a false negative: it’s simply a page that happens to be in the
middle of a file that we can skip because it isn’t going to be used. It’s
not like there’s going to be a checksum failure if the backend reads it.
Not only that, but checksum failures and the like are much more likely to
happen on long-dormant data, not on data that’s actively being written out
and therefore is still in the Linux FS cache and hasn’t even hit actual
storage yet anyway.
> > If it is a genuine storage failure we will see it in the next
> > pg_checksums run as its LSN will be older than the checkpoint.
>
> Well, but also, by that time it might be too late to recover things. Or
> it might be a backup that you just made, that you later want to recover
> from, ...
If it’s a backup you just made, then that page is going to be in the WAL
and the torn page on disk isn’t going to be used, so how is this an issue?
This is why we have WAL: to deal with torn pages.
> The basebackup checksum verification works in the same way.
>
> Shouldn't have been merged that way.
I have a hard time not finding this offensive. These issues were
considered, discussed, and well thought out, with the result being
committed after agreement.
Do you have any example cases where the code in pg_basebackup has resulted
in either a false positive or a false negative? Any case which can be
shown to result in either?
If not, then I think we need to stop this, because if we can’t trust that
a torn page won’t actually be used in that torn state, then it seems likely
that our entire WAL system is broken, we can’t trust the way we do backups
either, and we’d have to rewrite all of that to take precautions to lock
pages while doing a backup.
Thanks!
Stephen
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 20:00:50 |
Message-ID: | 20190319200050.ncuxejradurjakdc@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-20 03:27:55 +0800, Stephen Frost wrote:
> On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > > On Tuesday, 19.03.2019, at 11:22 -0400, Robert Haas wrote:
> > > > It's torn pages that I am concerned about - the server is writing and
> > > > we are reading, and we get a mix of old and new content. We have been
> > > > quite diligent about protecting ourselves from such risks elsewhere,
> > > > and checksum verification should not be held to any lesser standard.
> > >
> > > If we see a checksum failure on an otherwise correctly read block in
> > > online mode, we retry the block on the theory that we might have read a
> > > torn page. If the checksum verification still fails, we compare its LSN
> > > to the LSN of the current checkpoint and don't mind if it's newer. This
> > > way, a torn page should not cause a false positive either way, I
> > > think.
> >
> > False positives, no. But there's plenty of potential for false
> > negatives. In plenty of clusters, a large fraction of the pages are going
> > to be touched in most checkpoints.
>
>
> How is it a false negative? The page was in the middle of being
> written,
You don't actually know that. It could just be random gunk in the LSN,
and this type of logic just ignores such failures as long as the random
gunk is above the system's LSN.
And the basebackup logic doesn't just ignore a page when both the checksum
failed and the LSN is between startptr and the current insertion pointer -
it skips verification for *any* page that has pd_upper != 0 and a pd_lsn >
startptr. Given typical startlsn values (skewing heavily towards lower
int64s), that means that random data is more likely than not to pass this
test.
As it stands, the logic seems to give more false confidence than
anything else.
> > The basebackup checksum verification works in the same way.
> >
> > Shouldn't have been merged that way.
>
>
> I have a hard time not finding this offensive. These issues were
> considered, discussed, and well thought out, with the result being
> committed after agreement.
Well, I don't know what to tell you. But:
/*
* Only check pages which have not been modified since the
* start of the base backup. Otherwise, they might have been
* written only halfway and the checksum would not be valid.
* However, replaying WAL would reinstate the correct page in
* this case. We also skip completely new pages, since they
* don't have a checksum yet.
*/
if (!PageIsNew(page) && PageGetLSN(page) < startptr)
{
doesn't consider plenty of scenarios, as pointed out above. It'd be one
thing if the concerns I point out above had actually been commented upon
and weighed as not substantial enough (not that I know how). But...
> Do you have any example cases where the code in pg_basebackup has resulted
> in either a false positive or a false negative? Any case which can be
> shown to result in either?
CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i);
SELECT pg_relation_size('corruptme');
postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme');
┌─────────────────────────────────────┐
│ ?column? │
├─────────────────────────────────────┤
│ /srv/dev/pgdev-dev/base/13390/16384 │
└─────────────────────────────────────┘
(1 row)
dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc
Try a basebackup and see how many times it'll detect the corrupt
data. In the vast majority of cases you're going to see checksum
failures when reading the data for normal operation, but not when using
basebackup (or this new tool).
At the very very least this would need to do
a) checks that the page is all zeroes if PageIsNew() (like
PageIsVerified() does for the backend). That avoids missing cases
where corruption just zeroed out the header, but not the whole page.
b) Check that pd_lsn is between startlsn and the insertion pointer. That
avoids accepting just about all random data.
And that'd *still* be less strenuous than what normal backends
check. And that's already not great (due to not noticing zeroed out
data).
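Put together, the two checks might look roughly like this in
basebackup-style code (a sketch only: report_corruption() and
verify_checksum() are stand-in names, and it assumes the backend's
GetXLogInsertRecPtr() is available in this context):

if (PageIsNew(page))
{
    /* (a) a "new" page must be all zeroes, not merely pd_upper == 0 */
    size_t *pagebytes = (size_t *) page;
    bool    all_zeroes = true;

    for (int i = 0; i < BLCKSZ / sizeof(size_t); i++)
    {
        if (pagebytes[i] != 0)
        {
            all_zeroes = false;
            break;
        }
    }

    if (!all_zeroes)
        report_corruption(blkno);   /* zeroed header, non-zero page */
}
else if (PageGetLSN(page) < startptr)
{
    /* unchanged since backup start: actually verify the checksum */
    verify_checksum(page, blkno);
}
else if (PageGetLSN(page) > GetXLogInsertRecPtr())
{
    /* (b) a pd_lsn beyond the insert pointer can only be garbage */
    report_corruption(blkno);
}
/* else: pd_lsn in (startptr, insert pointer], plausibly being written */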
I fail to see how it's offensive to describe this as "shouldn't have
been merged that way".
Greetings,
Andres Freund
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 20:49:06 |
Message-ID: | 20190319204906.kglh62lt4yvffjzh@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2019-03-19 13:00:50 -0700, Andres Freund wrote:
> As it stands, the logic seems to give more false confidence than
> anything else.
To demonstrate that, I ran a loop that verified that a) a normal backend
query using the table detects the corruption, and b) pg_basebackup doesn't.
i=0;
while true; do
i=$(($i+1));
echo attempt $i;
dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc 2>/dev/null;
psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X fetch -F t -D - -c fast > /dev/null || break;
done
(excuse the crappy one-off sh)
had, during ~12k iterations, always detected the corruption in the
backend, and never via pg_basebackup. Given the likely LSNs in a
cluster, that's not too surprising.
Greetings,
Andres Freund
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 21:34:37 |
Message-ID: | CA+TgmoaS181A8FUMWRiSF7O2QFMNLYX2ipqqyqY4Xw4FrE9z1g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Mar 19, 2019 at 4:49 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> To demonstrate that, I ran a loop that verified that a) a normal backend
> query using the table detects the corruption, and b) pg_basebackup doesn't.
>
> i=0;
> while true; do
> i=$(($i+1));
> echo attempt $i;
> dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc 2>/dev/null;
> psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
> ~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X fetch -F t -D - -c fast > /dev/null || break;
> done
>
> (excuse the crappy one-off sh)
>
> had, during ~12k iterations, always detected the corruption in the
> backend, and never via pg_basebackup. Given the likely LSNs in a
> cluster, that's not too surprising.
Wow. So we shipped a checksum-verification feature (in pg_basebackup)
that reliably fails to detect blatantly corrupt pages. That's pretty
awful. Your chances get better the more WAL you've ever generated,
but you have to generate 163 petabytes of WAL to have a 1% chance of
detecting a page of random garbage, so realistically they never get
very good.
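To spell out the arithmetic behind that figure (assuming the pd_lsn of a
garbage page is uniform over 64 bits): the page only has its checksum
verified when pd_lsn < startptr, i.e. with probability startptr / 2^64.
Setting that to 0.01 gives startptr = 2^64 / 100 = 163.84 * 2^50 bytes,
or roughly 163 PiB of WAL generated over the cluster's lifetime.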
It's probably fair to point out that flipping a couple of random bytes
on the page is a more likely error than replacing the entire page with
garbage, and the check as designed will detect that fairly reliably --
unless those bytes are very near the beginning of the page. Still,
that leaves a lot of kinds of corruption that this will not catch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 21:39:16 |
Message-ID: | 1553031556.9697.59.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Tuesday, 19.03.2019, at 13:00 -0700, Andres Freund wrote:
> On 2019-03-20 03:27:55 +0800, Stephen Frost wrote:
> > On Tue, Mar 19, 2019 at 23:59 Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > > > On Tuesday, 19.03.2019, at 11:22 -0400, Robert Haas wrote:
> > > > > It's torn pages that I am concerned about - the server is writing and
> > > > > we are reading, and we get a mix of old and new content. We have been
> > > > > quite diligent about protecting ourselves from such risks elsewhere,
> > > > > and checksum verification should not be held to any lesser standard.
> > > >
> > > > If we see a checksum failure on an otherwise correctly read block in
> > > > online mode, we retry the block on the theory that we might have read a
> > > > torn page. If the checksum verification still fails, we compare its LSN
> > > > to the LSN of the current checkpoint and don't mind if it's newer. This
> > > > way, a torn page should not cause a false positive either way, I
> > > > think.
> > >
> > > False positives, no. But there's plenty of potential for false
> > > negatives. In plenty of clusters, a large fraction of the pages are going
> > > to be touched in most checkpoints.
> >
> >
> > How is it a false negative? The page was in the middle of being
> > written,
>
> You don't actually know that. It could just be random gunk in the LSN,
> and this type of logic just ignores such failures as long as the random
> gunk is above the system's LSN.
Right, I think this needs to be taken into account. For pg_basebackup,
that'd be an additional check against GetRedoRecPtr() or something similar
in the check below:
[...]
> Well, I don't know what to tell you. But:
>
> /*
> * Only check pages which have not been modified since the
> * start of the base backup. Otherwise, they might have been
> * written only halfway and the checksum would not be valid.
> * However, replaying WAL would reinstate the correct page in
> * this case. We also skip completely new pages, since they
> * don't have a checksum yet.
> */
> if (!PageIsNew(page) && PageGetLSN(page) < startptr)
> {
>
> doesn't consider plenty scenarios, as pointed out above. It'd be one
> thing if the concerns I point out above were actually commented upon and
> weighed not substantial enough (not that I know how). But...
>
> > Do you have any example cases where the code in pg_basebackup has resulted
> > in either a false positive or a false negative? Any case which can be
> > shown to result in either?
>
> CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i);
> SELECT pg_relation_size('corruptme');
> postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme');
> ┌─────────────────────────────────────┐
> │ ?column? │
> ├─────────────────────────────────────┤
> │ /srv/dev/pgdev-dev/base/13390/16384 │
> └─────────────────────────────────────┘
> (1 row)
> dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc
>
> Try a basebackup and see how many times it'll detect the corrupt
> data. In the vast majority of cases you're going to see checksum
> failures when reading the data for normal operation, but not when using
> basebackup (or this new tool).
Right, see above.
> At the very very least this would need to do
>
> a) checks that the page is all zeroes if PageIsNew() (like
> PageIsVerified() does for the backend). That avoids missing cases
> where corruption just zeroed out the header, but not the whole page.
We can't run pg_checksum_page() on those afterwards though as it would
fire an assertion:
|pg_checksums: [...]/../src/include/storage/checksum_impl.h:194:
|pg_checksum_page: Assertion `!(((PageHeader) (&cpage->phdr))->pd_upper
|== 0)' failed.
But we should count it as a checksum error and generate an appropriate
error message in that case.
> b) Check that pd_lsn is between startlsn and the insertion pointer. That
> avoids accepting just about all random data.
However, as pg_checksums is a stand-alone application, it can't just
access the insertion pointer, can it? We could maybe set a threshold from
the last checkpoint after which we consider the pd_lsn bogus. But what's a
good threshold here?
And/or we could port the other sanity checks from PageIsVerified:
| if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
| p->pd_lower <= p->pd_upper &&
| p->pd_upper <= p->pd_special &&
| p->pd_special <= BLCKSZ &&
| p->pd_special == MAXALIGN(p->pd_special))
| header_sane = true;
That should catch large-scale random corruption like you showed above.
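Wrapped up as a stand-alone helper, those conditions could look roughly
like this (a sketch; the name page_header_is_sane() is made up here):

static bool
page_header_is_sane(PageHeader p)
{
    return (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
           p->pd_lower <= p->pd_upper &&
           p->pd_upper <= p->pd_special &&
           p->pd_special <= BLCKSZ &&
           p->pd_special == MAXALIGN(p->pd_special);
}

A block failing this test could then be reported as corrupt without
looking at pd_checksum at all.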
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 21:44:52 |
Message-ID: | 20190319214452.7ithifs7iy5yxvsi@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-19 22:39:16 +0100, Michael Banck wrote:
> On Tuesday, 19.03.2019, at 13:00 -0700, Andres Freund wrote:
> > a) checks that the page is all zeroes if PageIsNew() (like
> > PageIsVerified() does for the backend). That avoids missing cases
> > where corruption just zeroed out the header, but not the whole page.
>
> We can't run pg_checksum_page() on those afterwards though as it would
> fire an assertion:
>
> |pg_checksums: [...]/../src/include/storage/checksum_impl.h:194:
> |pg_checksum_page: Assertion `!(((PageHeader) (&cpage->phdr))->pd_upper
> |== 0)' failed.
>
> But we should count it as a checksum error and generate an appropriate
> error message in that case.
All I'm saying is that if PageIsNew() holds, you need to run the same
checks that PageIsVerified() runs in that case, namely verifying that the
page is all zeroes, rather than just checking the pd_upper field. That's
separate from running pg_checksum_page().
> > b) Check that pd_lsn is between startlsn and the insertion pointer. That
> > avoids accepting just about all random data.
>
> However, for pg_checksums being a stand-alone application it can't just
> access the insertion pointer, can it? We could maybe set a threshold
> from the last checkpoint after which we consider the pd_lsn bogus. But
> what's a good threshold here?
That's *PRECISELY* my point. I think it's a bad idea to do online
checksumming from outside the backend. It needs to be inside the
backend, and if there's any verification failures on a block, it needs
to acquire the IO lock on the page, and reread from disk.
Greetings,
Andres Freund
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-19 23:52:55 |
Message-ID: | 20190319235255.GD3488@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Mar 19, 2019 at 02:44:52PM -0700, Andres Freund wrote:
> That's *PRECISELY* my point. I think it's a bad idea to do online
> checksumming from outside the backend. It needs to be inside the
> backend, and if there's any verification failures on a block, it needs
> to acquire the IO lock on the page, and reread from disk.
Yeah, FWIW, Julien Rouhaud mentioned to me that we could use mdread() and
loop over the blocks so that we don't end up loading corrupted blocks into
shared buffers, checking along the way whether each block is already in
shared buffers or not.
--
Michael
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | [Patch] Base backups and random or zero pageheaders (was: Online verification of checksums) |
Date: | 2019-03-26 17:22:55 |
Message-ID: | 1553620975.4884.22.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Tuesday, 19.03.2019, at 13:00 -0700, Andres Freund wrote:
> CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i);
> SELECT pg_relation_size('corruptme');
> postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme');
> ┌─────────────────────────────────────┐
> │ ?column? │
> ├─────────────────────────────────────┤
> │ /srv/dev/pgdev-dev/base/13390/16384 │
> └─────────────────────────────────────┘
> (1 row)
> dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc
>
> Try a basebackup and see how many times it'll detect the corrupt
> data. In the vast majority of cases you're going to see checksum
> failures when reading the data for normal operation, but not when using
> basebackup (or this new tool).
>
> At the very very least this would need to do
>
> a) checks that the page is all zeroes if PageIsNew() (like
> PageIsVerified() does for the backend). That avoids missing cases
> where corruption just zeroed out the header, but not the whole page.
> b) Check that pd_lsn is between startlsn and the insertion pointer. That
> avoids accepting just about all random data.
>
> And that'd *still* be less strenuous than what normal backends
> check. And that's already not great (due to not noticing zeroed out
> data).
I've done the above in the attached patch now. Well, literally like an
hour ago; then I went jogging and came back to see that you outlined fixing
this differently in a separate thread. It still might be helpful for the
TAP test changes at least.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
pg_basebackup_random_or_zero_pageheader.patch | text/x-patch | 9.1 KB |
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: [Patch] Base backups and random or zero pageheaders (was: Online verification of checksums) |
Date: | 2019-03-26 17:30:40 |
Message-ID: | 20190326173040.64ra4maiqudkpdof@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2019-03-26 18:22:55 +0100, Michael Banck wrote:
> Hi,
>
> On Tuesday, 19.03.2019, at 13:00 -0700, Andres Freund wrote:
> > CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i);
> > SELECT pg_relation_size('corruptme');
> > postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme');
> > ┌─────────────────────────────────────┐
> > │ ?column? │
> > ├─────────────────────────────────────┤
> > │ /srv/dev/pgdev-dev/base/13390/16384 │
> > └─────────────────────────────────────┘
> > (1 row)
> > dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc
> >
> > Try a basebackup and see how many times it'll detect the corrupt
> > data. In the vast majority of cases you're going to see checksum
> > failures when reading the data for normal operation, but not when using
> > basebackup (or this new tool).
> >
> > At the very very least this would need to do
> >
> > a) checks that the page is all zeroes if PageIsNew() (like
> > PageIsVerified() does for the backend). That avoids missing cases
> > where corruption just zeroed out the header, but not the whole page.
> > b) Check that pd_lsn is between startlsn and the insertion pointer. That
> > avoids accepting just about all random data.
> >
> > And that'd *still* be less strenuous than what normal backends
> > check. And that's already not great (due to not noticing zeroed out
> > data).
>
> I've done the above in the attached patch now. Well, literally like an
> hour ago; then I went jogging and came back to see that you outlined fixing
> this differently in a separate thread. It still might be helpful for the
> TAP test changes at least.
Sorry, I just hadn't seen much movement on this, and I'm a bit concerned
about such a critical issue not being addressed.
> /*
> - * Only check pages which have not been modified since the
> - * start of the base backup. Otherwise, they might have been
> - * written only halfway and the checksum would not be valid.
> - * However, replaying WAL would reinstate the correct page in
> - * this case. We also skip completely new pages, since they
> - * don't have a checksum yet.
> + * We skip completely new pages after checking they are
> + * all-zero, since they don't have a checksum yet.
> */
> - if (!PageIsNew(page) && PageGetLSN(page) < startptr)
> + if (PageIsNew(page))
> {
> - checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
> - phdr = (PageHeader) page;
> - if (phdr->pd_checksum != checksum)
> + all_zeroes = true;
> + pagebytes = (size_t *) page;
> + for (int i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
Can we please abstract the zeroeness check into a separate function to
be used both by PageIsVerified() and this?
> + if (!all_zeroes)
> + {
> + /*
> + * pd_upper is zero, but the page is not all zero. We
> + * cannot run pg_checksum_page() on the page as it
> + * would throw an assertion failure. Consider this a
> + * checksum failure.
> + */
I don't think the assertion failure is the relevant bit here, it's that
the page is corrupted, no?
> + /*
> + * Only check pages which have not been modified since the
> + * start of the base backup. Otherwise, they might have been
> + * written only halfway and the checksum would not be valid.
> + * However, replaying WAL would reinstate the correct page in
> + * this case. If the page LSN is larger than the current redo
> + * pointer then we assume a bogus LSN due to random page header
> + * corruption and do verify the checksum.
> + */
> + if (PageGetLSN(page) < startptr || PageGetLSN(page) > GetRedoRecPtr())
I don't think GetRedoRecPtr() is the right check? Wouldn't it need to be
GetInsertRecPtr()?
Greetings,
Andres Freund
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: [Patch] Base backups and random or zero pageheaders (was: Online verification of checksums) |
Date: | 2019-03-26 18:23:19 |
Message-ID: | 1553624599.4884.24.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Tuesday, 26.03.2019, at 10:30 -0700, Andres Freund wrote:
> On 2019-03-26 18:22:55 +0100, Michael Banck wrote:
> > On Tuesday, 19.03.2019, at 13:00 -0700, Andres Freund wrote:
> > > CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 1000000) g(i);
> > > SELECT pg_relation_size('corruptme');
> > > postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme');
> > > ┌─────────────────────────────────────┐
> > > │ ?column? │
> > > ├─────────────────────────────────────┤
> > > │ /srv/dev/pgdev-dev/base/13390/16384 │
> > > └─────────────────────────────────────┘
> > > (1 row)
> > > dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc
> > >
> > > Try a basebackup and see how many times it'll detect the corrupt
> > > data. In the vast majority of cases you're going to see checksum
> > > failures when reading the data for normal operation, but not when using
> > > basebackup (or this new tool).
> > >
> > > At the very very least this would need to do
> > >
> > > a) checks that the page is all zeroes if PageIsNew() (like
> > > PageIsVerified() does for the backend). That avoids missing cases
> > > where corruption just zeroed out the header, but not the whole page.
> > > b) Check that pd_lsn is between startlsn and the insertion pointer. That
> > > avoids accepting just about all random data.
> > >
> > > And that'd *still* be less strenuous than what normal backends
> > > check. And that's already not great (due to not noticing zeroed out
> > > data).
> >
> > I've done the above in the attached patch now. Well, literally like an
> > hour ago, then went jogging and came back to see you outlined about
> > fixing this differently in a separate thread. Still might be helpful for
> > the TAP test changes at least.
>
> Sorry, I just hadn't seen much movement on this, and I'm a bit concerned
> about such a critical issue not being addressed.
Sure, I was working on this a bit on and off for a few days, but I had
random corruption issues which I finally tracked down earlier to reusing
"for (i=0 [...]" via copy&paste, d'oh. That's why I renamed the `i'
variable to `page_in_buf', because it's a pretty long loop and so should
have a useful variable name IMO.
> > /*
> > - * Only check pages which have not been modified since the
> > - * start of the base backup. Otherwise, they might have been
> > - * written only halfway and the checksum would not be valid.
> > - * However, replaying WAL would reinstate the correct page in
> > - * this case. We also skip completely new pages, since they
> > - * don't have a checksum yet.
> > + * We skip completely new pages after checking they are
> > + * all-zero, since they don't have a checksum yet.
> > */
> > - if (!PageIsNew(page) && PageGetLSN(page) < startptr)
> > + if (PageIsNew(page))
> > {
> > - checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
> > - phdr = (PageHeader) page;
> > - if (phdr->pd_checksum != checksum)
> > + all_zeroes = true;
> > + pagebytes = (size_t *) page;
> > + for (int i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
>
> Can we please abstract the zeroeness check into a separate function to
> be used both by PageIsVerified() and this?
Ok, done so as PageIsZero further down in bufpage.c.
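For reference, such a helper might look roughly like this (a sketch, not
necessarily the patch's exact code):

bool
PageIsZero(Page page)
{
    size_t     *pagebytes = (size_t *) page;

    for (int i = 0; i < BLCKSZ / sizeof(size_t); i++)
    {
        if (pagebytes[i] != 0)
            return false;
    }

    return true;
}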
> > + if (!all_zeroes)
> > + {
> > + /*
> > + * pd_upper is zero, but the page is not all zero. We
> > + * cannot run pg_checksum_page() on the page as it
> > + * would throw an assertion failure. Consider this a
> > + * checksum failure.
> > + */
>
> I don't think the assertion failure is the relevant bit here, it's that
> the page is corrupted, no?
Well, relevant in the sense that the reader might wonder why we don't
just call pg_checksum_page() and have a consistent error message with
the other codepath.
We could maybe run pg_checksum_block() on it and reverse the rest of the
permutations from pg_checksum_page() but that might be overly
complicated for little gain.
> > + /*
> > + * Only check pages which have not been modified since the
> > + * start of the base backup. Otherwise, they might have been
> > + * written only halfway and the checksum would not be valid.
> > + * However, replaying WAL would reinstate the correct page in
> > + * this case. If the page LSN is larger than the current redo
> > + * pointer then we assume a bogus LSN due to random page header
> > + * corruption and do verify the checksum.
> > + */
> > + if (PageGetLSN(page) < startptr || PageGetLSN(page) > GetRedoRecPtr())
>
> I don't think GetRedoRecPtr() is the right check? Wouldn't it need to be
> GetInsertRecPtr()?
Oh, right.
I also fixed a bug in the TAP tests, the $random_data string wasn't
properly set.
New patch attached.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
pg_basebackup_random_or_zero_pageheader_V2.patch | text/x-patch | 11.3 KB |
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: [Patch] Base backups and random or zero pageheaders |
Date: | 2019-03-27 10:37:25 |
Message-ID: | 1553683045.4884.31.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Tuesday, 26.03.2019, at 19:23 +0100, Michael Banck wrote:
> On Tuesday, 26.03.2019, at 10:30 -0700, Andres Freund wrote:
> > On 2019-03-26 18:22:55 +0100, Michael Banck wrote:
> > > /*
> > > - * Only check pages which have not been modified since the
> > > - * start of the base backup. Otherwise, they might have been
> > > - * written only halfway and the checksum would not be valid.
> > > - * However, replaying WAL would reinstate the correct page in
> > > - * this case. We also skip completely new pages, since they
> > > - * don't have a checksum yet.
> > > + * We skip completely new pages after checking they are
> > > + * all-zero, since they don't have a checksum yet.
> > > */
> > > - if (!PageIsNew(page) && PageGetLSN(page) < startptr)
> > > + if (PageIsNew(page))
> > > {
> > > - checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
> > > - phdr = (PageHeader) page;
> > > - if (phdr->pd_checksum != checksum)
> > > + all_zeroes = true;
> > > + pagebytes = (size_t *) page;
> > > + for (int i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
> >
> > Can we please abstract the zeroeness check into a separate function to
> > be used both by PageIsVerified() and this?
>
> Ok, done so as PageIsZero further down in bufpage.c.
It turns out that pg_checksums (current master and back branches, not
just the online version) needs this treatment as well, as it won't catch
zeroed-out page header corruption; see the attached patch to its TAP tests,
which triggers it. (I also added a random-data check similar to
pg_basebackup's, which is not a problem for the current codebase.)
Any suggestion on how to handle this? Should I duplicate the
PageIsZero() code in pg_checksums? Should I move PageIsZero into
something like bufpage_impl.h for use by external programs, similar to
pg_checksum_page()?
I've done the latter as a POC in the second attached patch.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
pg_checksums_tap_tests_random_empty_pageheader.patch | text/x-patch | 3.3 KB |
pg_basebackup_random_or_zero_pageheader_V3.patch | text/x-patch | 16.7 KB |
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-28 16:08:33 |
Message-ID: | 1553789313.4884.48.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
I have rebased this patch now.
I also fixed the two issues Andres reported, namely a zeroed-out page
header and a random LSN. The first is caught by checking for an all-zero
page in the way PageIsVerified() does. The second is caught by comparing
the upper 32 bits of the LSN as well and demanding that they are equal. If
the LSN is corrupted, the upper 32 bits should be wildly different from
those of the current checkpoint LSN.
Well, at least that is a stab at a fix; there is a window where the
upper 32 bits could legitimately be different. In order to make that as
small as possible, I update the checkpoint LSN every once in a while.
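In code form, the described test is roughly the following (a sketch;
checkpoint_lsn and the counters are stand-ins, not the patch's actual
variable names):

XLogRecPtr  page_lsn = PageGetLSN(page);

if (page_lsn > checkpoint_lsn &&
    (uint32) (page_lsn >> 32) == (uint32) (checkpoint_lsn >> 32))
    skipped_blocks++;       /* plausibly an in-flight write */
else
    checksum_failures++;    /* LSN looks like random garbage */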
On Monday, 18.03.2019, at 21:15 +0100, Michael Banck wrote:
> I have also added a paragraph to the documentation about possibly
> skipping new or recently updated pages:
>
> + If the cluster is online, pages that have been (re-)written since the last
> + checkpoint will not count as checksum failures if they cannot be read or
> + verified correctly.
I have removed that for now as it seems to be more confusing than
helpful.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V17.patch | text/x-patch | 17.3 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-28 17:19:05 |
Message-ID: | 20190328171905.GA16397@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Mar 28, 2019 at 05:08:33PM +0100, Michael Banck wrote:
>Hi,
>
>I have rebased this patch now.
>
>I also fixed the two issues Andres reported, namely a zeroed-out page
>header and a random LSN. The first is caught by checking for an all-zero
>page in the way PageIsVerified() does. The second is caught by comparing
>the upper 32 bits of the LSN as well and demanding that they are equal. If
>the LSN is corrupted, the upper 32 bits should be wildly different from
>those of the current checkpoint LSN.
>
>Well, at least that is a stab at a fix; there is a window where the
>upper 32 bits could legitimately be different. In order to make that as
>small as possible, I update the checkpoint LSN every once in a while.
>
Doesn't that mean we'll report a false positive?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-28 20:09:22 |
Message-ID: | 1553803762.4884.52.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Thursday, 28.03.2019, at 18:19 +0100, Tomas Vondra wrote:
> On Thu, Mar 28, 2019 at 05:08:33PM +0100, Michael Banck wrote:
> > I also fixed the two issues Andres reported, namely a zeroed-out page
> > header and a random LSN. The first is caught by checking for an all-zero
> > page in the way PageIsVerified() does. The second is caught by comparing
> > the upper 32 bits of the LSN as well and demanding that they are equal. If
> > the LSN is corrupted, the upper 32 bits should be wildly different from
> > those of the current checkpoint LSN.
> >
> > Well, at least that is a stab at a fix; there is a window where the
> > upper 32 bits could legitimately be different. In order to make that as
> > small as possible, I update the checkpoint LSN every once in a while.
I decided it makes more sense to just re-read the checkpoint LSN from the
control file when we encounter a wrong checksum on re-read of a page, as
that is when it counts, instead of refreshing it only every once in a
while.
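Something along these lines, assuming the frontend get_controlfile()
helper from src/common/controldata_utils.c (whose exact argument list
differs between branches):

/* On a failed recheck, refresh our notion of the checkpoint LSN. */
bool             crc_ok;
ControlFileData *ControlFile;

ControlFile = get_controlfile(DataDir, &crc_ok);
if (crc_ok)
    checkpoint_lsn = ControlFile->checkPoint;   /* or checkPointCopy.redo */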
> Doesn't that mean we'll report a false positive?
A false positive would be pg_checksums claiming a block has a wrong
checksum while in fact it does not (after it is correctly written out
and synced to disk), right?
If pg_checksums reads a current first part and a stale second part twice
in a row (we re-read the block), then the LSN of the first part would
presumably(?) be higher than the latest checkpoint LSN. If there was a
wraparound in the lower part of the LSN, so that the upper part now differs
from that of the latest checkpoint LSN, then pg_checksums would report this
as a false positive, I believe.
We could add some additional heuristics, like checking that the upper part
of the LSN has advanced by at most one, but that does not seem to make it
100% robust either, does it?
If pg_checksums reads a current second part and a stale first part twice,
then the page header LSN would presumably be lower than the checkpoint LSN
and again a false positive would be reported.
At least in my testing I haven't seen the second case at all, and the
first (disregarding the wraparound issue for now) only extremely rarely, if
ever (usually the torn page is gone on re-read). The first case, requiring
a wraparound since the latest checkpoint LSN update, also seems quite
narrow compared to the issue of random data being written due to
corruption. So I think it is more important to make sure random data won't
yield a false negative than to worry about this yielding a false positive.
Maybe we can just issue a warning in online mode that some checksum
failures could be false positives and advise the user to recheck those
files (using the -r switch) again? I have added this in the attached new
version:
+ printf(_("%s ran against an online cluster and found some bad checksums.\n"), progname);
+ printf(_("It could be that those are false positives due concurrently updated blocks,\n"));
+ printf(_("checking the offending files again with the -r option is advised.\n"));
It was not mentioned on this thread, but I want to stress again that you
cannot run the current pg_checksums on a basebackup due to the control file
claiming it is still online. This makes the current program pretty useless
for production setups right now in my opinion, as few people have the
luxury of regular maintenance downtimes during which pg_checksums could
run, and running it against base backups is quite cumbersome.
Maybe we can improve things by checking for the postmaster.pid as well
and going ahead (only for --check of course) if it is missing, but that
hasn't been implemented yet.
I agree that the current patch might have some corner-cases where it
does not guarantee 100% accuracy in online mode, but I hope the current
version at least has no more false negatives.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
online-verification-of-checksums_V18.patch | text/x-patch | 17.5 KB |
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-28 20:11:40 |
Message-ID: | 20190328201140.vtiaiics2b43745y@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> I agree that the current patch might have some corner-cases where it
> does not guarantee 100% accuracy in online mode, but I hope the current
> version at least has no more false negatives.
False positives are *bad*. We shouldn't integrate code that has them.
Greetings,
Andres Freund
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-28 21:19:02 |
Message-ID: | 20190328211902.GD16397@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
>> I agree that the current patch might have some corner-cases where it
>> does not guarantee 100% accuracy in online mode, but I hope the current
>> version at least has no more false negatives.
>
>False positives are *bad*. We shouldn't integrate code that has them.
>
Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
communicate with the server, which would presumably address these issues.
Can someone explain why not to do that?
FWIW, I initially argued against that, believing that we could address
those issues in some other way, and I'd love it if that were possible. But
considering we're still trying to make that work reliably, I think the
reasonable conclusion is that Andres was right: communicating with the
server is necessary.
Of course, I definitely appreciate people are working on this, otherwise
we wouldn't be having this discussion ...
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-29 12:15:00 |
Message-ID: | CABUevEw1afuDpqRt689uDHCDxXiXUBDH+fVXC1pJ=yY4yPxEVw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
> On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> >Hi,
> >
> >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> >> I agree that the current patch might have some corner-cases where it
> >> does not guarantee 100% accuracy in online mode, but I hope the current
> >> version at least has no more false negatives.
> >
> >False positives are *bad*. We shouldn't integrate code that has them.
> >
>
> Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> communicate with the server, which would presumably address these issues.
> Can someone explain why not to do that?
>
I agree that this effort seems better spent on fixing those issues there
(of which many are the same), and then re-using that.
> FWIW, I initially argued against that, believing that we could address
> those issues in some other way, and I'd love it if that were possible. But
> considering we're still trying to make that work reliably, I think the
> reasonable conclusion is that Andres was right: communicating with the
> server is necessary.
>
> Of course, I definitely appreciate people are working on this, otherwise
> we wouldn't be having this discussion ...
>
+1.
--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-29 15:30:15 |
Message-ID: | 20190329153014.GL6197@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> wrote:
>
> > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > >Hi,
> > >
> > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > >> I agree that the current patch might have some corner-cases where it
> > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > >> version at least has no more false negatives.
> > >
> > >False positives are *bad*. We shouldn't integrate code that has them.
> > >
> >
> > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > communicate with the server, which would presumably address these issues.
> > Can someone explain why not to do that?
>
> I agree that this effort seems better spent on fixing those issues there
> (of which many are the same), and then re-use that.
This really seems like it depends on which of the options we're talking
about. Connecting to the server and asking what the current insert point
is, so we can check that the LSN isn't completely insane, seems reasonable,
but at least one option being discussed was to have pg_basebackup actually
*lock the page* (even if just for I/O) and then re-read it, and having an
external tool do that instead of the backend seems like a whole different
level to me. That would involve having an SQL function for "lock this page
against I/O" and then another for "unlock this page", wouldn't it?
> > FWIW, I initially argued against that, believing that we could address
> > those issues in some other way, and I'd love it if that were possible. But
> > considering we're still trying to make that work reliably, I think the
> > reasonable conclusion is that Andres was right: communicating with the
> > server is necessary.
As part of a backup, you could check against the pages written out into
the WAL as a cross-check and be able to be confident that at least
everything which was backed up had been checked. That doesn't cover
things like unlogged tables though.
For my part, at least, adding additional checks around the LSN seems
like a good solution (though we can't allow those checks to turn into
false positives...) and would seriously reduce the risk that we have
false negatives (we can *not* eliminate false negatives entirely... we
could possibly get to a point where at least we don't have any more
false negatives than PG itself has, but it looks like an awful lot of
work and ends up adding its own risks...).
As I've said before, I'd certainly support a background worker which
performs ongoing checksum validation of pages and that would be able to
use the same approach as what we do with pg_basebackup, but having an
external tool locking pages seems really unlikely to be reasonable.
Thanks!
Stephen
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-29 15:34:14 |
Message-ID: | 20190329153414.h2vhhbhfexfecanm@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-29 11:30:15 -0400, Stephen Frost wrote:
> * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> > wrote:
> > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > >Hi,
> > > >
> > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > >> I agree that the current patch might have some corner-cases where it
> > > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > > >> version at least has no more false negatives.
> > > >
> > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > >
> > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > > communicate with the server, which would presumably address these issues.
> > > Can someone explain why not to do that?
> >
> > I agree that this effort seems better spent on fixing those issues there
> > (of which many are the same), and then re-use that.
>
> This really seems like it depends on which of the options we're talking
> about.. Connecting to the server and asking what the current insert
> point is, so we can check that the LSN isn't completely insane, seems
> reasonable, but at least one option being discussed was to have
> pg_basebackup actually *lock the page* (even if just for I/O..) and then
> re-read it, and having an external tool doing that instead of the
> backend seems like a whole different level to me. That would involve
> having an SQL function for "lock this page against I/O" and then another
> for "unlock this page", wouldn't it?
No, I don't think so. And we obviously couldn't have a SQL level
function hold an LWLock after it has finished, that'd make undetected
deadlocks triggerable by users. The way I'd imagine that being done is
to just perform the checksum test in the commandline tool, and whenever
there's a checksum failure that could plausibly be a torn read, call a
server side function that re-tests the page after locking it. Which then
would just return the error message in a string.
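To make that concrete, a rough sketch of what such a server-side
function could look like (the function name, signature, and the blunt
exclusive-lock approach are illustrative assumptions, not from any
posted patch; error handling, permission checks, and non-main forks
are omitted, and data checksums are assumed to be enabled):

#include "postgres.h"

#include "access/relation.h"
#include "fmgr.h"
#include "storage/bufmgr.h"
#include "storage/checksum.h"
#include "storage/smgr.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_recheck_page);

Datum
pg_recheck_page(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    BlockNumber blkno = PG_GETARG_UINT32(1);
    Relation    rel = relation_open(relid, AccessShareLock);
    Buffer      buf;
    PGAlignedBlock pagebuf;
    bool        ok;

    /*
     * Pin the buffer and take an exclusive content lock.  FlushBuffer()
     * requires a shared content lock, so no backend can be writing the
     * page out (or start doing so) while we hold this, which rules out a
     * torn read of the on-disk copy below.  (If the page is not cached,
     * ReadBuffer() itself reads and verifies it, erroring out on real
     * corruption; good enough for a sketch.)
     */
    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    /* Re-read the on-disk version of the page into private memory. */
    RelationOpenSmgr(rel);
    smgrread(rel->rd_smgr, MAIN_FORKNUM, blkno, pagebuf.data);

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);
    relation_close(rel, AccessShareLock);

    /* All-zero ("new") pages carry no checksum yet. */
    ok = PageIsNew((Page) pagebuf.data) ||
        pg_checksum_page(pagebuf.data, blkno) ==
        ((PageHeader) pagebuf.data)->pd_checksum;

    PG_RETURN_BOOL(ok);
}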
Greetings,
Andres Freund
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-29 15:38:02 |
Message-ID: | 20190329153802.GM6197@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Andres Freund (andres(at)anarazel(dot)de) wrote:
> On 2019-03-29 11:30:15 -0400, Stephen Frost wrote:
> > * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> > > wrote:
> > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > >Hi,
> > > > >
> > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > >> I agree that the current patch might have some corner-cases where it
> > > > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > > > >> version at least has no more false negatives.
> > > > >
> > > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > > >
> > > >
> > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > > > communicate with the server, which would presumably address these issues.
> > > > Can someone explain why not to do that?
> > >
> > > I agree that this effort seems better spent on fixing those issues there
> > > (of which many are the same), and then re-use that.
> >
> > This really seems like it depends on which of the options we're talking
> > about.. Connecting to the server and asking what the current insert
> > point is, so we can check that the LSN isn't completely insane, seems
> > reasonable, but at least one option being discussed was to have
> > pg_basebackup actually *lock the page* (even if just for I/O..) and then
> > re-read it, and having an external tool doing that instead of the
> > backend seems like a whole different level to me. That would involve
> > having an SQL function for "lock this page against I/O" and then another
> > for "unlock this page", wouldn't it?
>
> No, I don't think so. And we obviously couldn't have a SQL level
> function hold an LWLock after it has finished, that'd make undetected
> deadlocks triggerable by users. The way I'd imagine that being done is
> to just perform the checksum test in the commandline tool, and whenever
> there's a checksum failure that could plausibly be a torn read, call a
> server side function that re-tests the page after locking it. Which then
> would just return the error message in a string.
The server-side function would essentially lock the page against i/o,
re-read it off disk into an independent location, unlock the page, then
calculate the checksum and report back?
That seems like it would be reasonable to me. Wouldn't it make sense to
then have pg_basebackup use that same function..?
Thanks,
Stephen
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-29 15:40:33 |
Message-ID: | 20190329154033.vticzlku36mppxb4@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-29 11:38:02 -0400, Stephen Frost wrote:
> The server-side function would essentially lock the page against i/o,
> re-read it off disk into an independent location, unlock the page, then
> calculate the checksum and report back?
Right. I think there's a few minor variations of how this could be done,
but that'd be the basic approach.
> That seems like it would be reasonable to me. Wouldn't it make sense to
> then have pg_basebackup use that same function..?
Yea, probably. Or at least reuse the majority of it; I can imagine the
error reporting would be a bit different (sqlstates et al are needed for
the basebackup.c case, but not the pg_checksum case).
Greetings,
Andres Freund
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-29 15:52:16 |
Message-ID: | CABUevEz5SeHe0O4fLerkQX+R3-FNhvqipnbhNRLieMvjc4AMaw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> Greetings,
>
> * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <
> tomas(dot)vondra(at)2ndquadrant(dot)com>
> > wrote:
> >
> > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > >Hi,
> > > >
> > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > >> I agree that the current patch might have some corner-cases where it
> > > >> does not guarantee 100% accuracy in online mode, but I hope the
> current
> > > >> version at least has no more false negatives.
> > > >
> > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > >
> > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online
> mode
> > > communicate with the server, which would presumably address these
> issues.
> > > Can someone explain why not to do that?
> >
> > I agree that this effort seems better spent on fixing those issues there
> > (of which many are the same), and then re-use that.
>
> This really seems like it depends on which of the options we're talking
> about.. Connecting to the server and asking what the current insert
> point is, so we can check that the LSN isn't completely insane, seems
> reasonable, but at least one option being discussed was to have
> pg_basebackup actually *lock the page* (even if just for I/O..) and then
> re-read it, and having an external tool doing that instead of the
> backend seems like a whole different level to me. That would involve
> having an SQL function for "lock this page against I/O" and then another
> for "unlock this page", wouldn't it?
>
Right.
But what if we just added a flag to the BASE_BACKUP command in the
replication protocol that said "meh, I really just want to verify the
checksums, so please send the data to devnull and only feed me regular
status updates on this connection"?
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-29 21:08:30 |
Message-ID: | 1553893710.4884.62.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Friday, 29.03.2019 at 16:52 +0100, Magnus Hagander wrote:
> On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> > > wrote:
> > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > >> I agree that the current patch might have some corner-cases where it
> > > > >> does not guarantee 100% accuracy in online mode, but I hope the current
> > > > >> version at least has no more false negatives.
> > > > >
> > > > >False positives are *bad*. We shouldn't integrate code that has them.
> > > >
> > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the online mode
> > > > communicate with the server, which would presumably address these issues.
> > > > Can someone explain why not to do that?
> > >
> > > I agree that this effort seems better spent on fixing those issues there
> > > (of which many are the same), and then re-use that.
> >
> > This really seems like it depends on which of the options we're talking
> > about.. Connecting to the server and asking what the current insert
> > point is, so we can check that the LSN isn't completely insane, seems
> > reasonable, but at least one option being discussed was to have
> > pg_basebackup actually *lock the page* (even if just for I/O..) and then
> > re-read it, and having an external tool doing that instead of the
> > backend seems like a whole different level to me. That would involve
> > having an SQL function for "lock this page against I/O" and then another
> > for "unlock this page", wouldn't it?
>
> Right.
>
> But what if we just added a flag to the BASE_BACKUP command in the
> replication protocol that said "meh, I really just want to verify the
> checksums, so please send the data to devnull and only feed me regular
> status updates on this connection"?
I don't know whether BASE_BACKUP is the best interface for that (at
least right now) - backend/replication/basebackup.c's sendFile() gets
only an absolute filename to send, which is not adequate for more in-
depth server-based things like locking a particular page in a particular
relation of some particular tablespace.
ISTM that the fact that we had to teach it about different segment files
for checksum verification by splitting up the filename at "." implies
that it is not the correct level of abstraction (but maybe it could get
schooled some more about Postgres internals, e.g. by passing it a
RelFileNode struct and not a filename).
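For reference, the filename splitting in question looks roughly like
this (an illustrative excerpt of the sendFile() logic, not a verbatim
copy; filename comes from the function's arguments):

/*
 * Relation segments beyond the first are named like "16850.1",
 * "16850.2", ...; the segment number must be recovered from the suffix
 * so that the per-page checksum, which mixes in the absolute block
 * number, can be recomputed correctly.
 */
int         segmentno = 0;
char       *segmentpath = strstr(filename, ".");

if (segmentpath != NULL)
{
    segmentno = atoi(segmentpath + 1);
    if (segmentno == 0)
        ereport(ERROR,
                (errmsg("invalid segment number %d in file \"%s\"",
                        segmentno, filename)));
}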
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-30 11:56:21 |
Message-ID: | CABUevEy23u6-s_-q4m7Yz3QZwbAKFespAvqY4Gq7DvJGhFheVw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Mar 29, 2019 at 10:08 PM Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:
> Hi,
>
> On Friday, 29.03.2019 at 16:52 +0100, Magnus Hagander wrote:
> > On Fri, Mar 29, 2019 at 4:30 PM Stephen Frost <sfrost(at)snowman(dot)net>
> wrote:
> > > * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> > > > On Thu, Mar 28, 2019 at 10:19 PM Tomas Vondra <
> tomas(dot)vondra(at)2ndquadrant(dot)com>
> > > > wrote:
> > > > > On Thu, Mar 28, 2019 at 01:11:40PM -0700, Andres Freund wrote:
> > > > > >On 2019-03-28 21:09:22 +0100, Michael Banck wrote:
> > > > > >> I agree that the current patch might have some corner-cases
> where it
> > > > > >> does not guarantee 100% accuracy in online mode, but I hope the
> current
> > > > > >> version at least has no more false negatives.
> > > > > >
> > > > > >False positives are *bad*. We shouldn't integrate code that has
> them.
> > > > >
> > > > > Yeah, I agree. I'm a bit puzzled by the reluctance to make the
> online mode
> > > > > communicate with the server, which would presumably address these
> issues.
> > > > > Can someone explain why not to do that?
> > > >
> > > > I agree that this effort seems better spent on fixing those issues
> there
> > > > (of which many are the same), and then re-use that.
> > >
> > > This really seems like it depends on which of the options we're talking
> > > about.. Connecting to the server and asking what the current insert
> > > point is, so we can check that the LSN isn't completely insane, seems
> > > reasonable, but at least one option being discussed was to have
> > > pg_basebackup actually *lock the page* (even if just for I/O..) and
> then
> > > re-read it, and having an external tool doing that instead of the
> > > backend seems like a whole different level to me. That would involve
> > > having an SQL function for "lock this page against I/O" and then
> another
> > > for "unlock this page", wouldn't it?
> >
> > Right.
> >
> > But what if we just added a flag to the BASE_BACKUP command in the
> > replication protocol that said "meh, I really just want to verify the
> > checksums, so please send the data to devnull and only feed me regular
> > status updates on this connection"?
>
> I don't know whether BASE_BACKUP is the best interface for that (at
> least right now) - backend/replication/basebackup.c's sendFile() gets
> only an absolute filename to send, which is not adequate for more in-
> depth server-based things like locking a particular page in a particular
> relation of some particular tablespace.
> ISTM that the fact that we had to teach it about different segment files
> for checksum verification by splitting up the filename at "." implies
> that it is not the correct level of abstraction (but maybe it could get
> schooled some more about Postgres internals, e.g. by passing it a
> RelFileNode struct and not a filename).
>
But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if
we fix it there, we only have to fix it once...
//Magnus
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-30 13:35:29 |
Message-ID: | 20190330133529.t7chaepvbmi63xbi@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2019-03-30 12:56:21 +0100, Magnus Hagander wrote:
> > ISTM that the fact that we had to teach it about different segment files
> > for checksum verification by splitting up the filename at "." implies
> > that it is not the correct level of abstraction (but maybe it could get
> > schooled some more about Postgres internals, e.g. by passing it a
> > RelFileNode struct and not a filename).
> >
>
> But that has to be fixed in pg_basebackup *regardless*, doesn't it? And if
> we fix it there, we only have to fix it once...
I'm not understanding the problem here. We already need to know all of
this? sendFile() determines whether the file is checksummed, and
computes the segment number:
if (is_checksummed_file(readfilename, filename))
{
    verify_checksum = true;
    ...
checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
phdr = (PageHeader) page;
I agree that the way checksumming works is a bit of a layering
violation. In my opinion it belongs in the smgr level, not bufmgr.c etc,
so different storage methods can store it differently. But that seems
fairly independent of this problem.
Greetings,
Andres Freund
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: [Patch] Base backups and random or zero pageheaders |
Date: | 2019-04-30 13:07:43 |
Message-ID: | 1556629663.25111.4.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Wednesday, 27.03.2019 at 11:37 +0100, Michael Banck wrote:
> On Tuesday, 26.03.2019 at 19:23 +0100, Michael Banck wrote:
> > On Tuesday, 26.03.2019 at 10:30 -0700, Andres Freund wrote:
> > > On 2019-03-26 18:22:55 +0100, Michael Banck wrote:
> > > > /*
> > > > - * Only check pages which have not been modified since the
> > > > - * start of the base backup. Otherwise, they might have been
> > > > - * written only halfway and the checksum would not be valid.
> > > > - * However, replaying WAL would reinstate the correct page in
> > > > - * this case. We also skip completely new pages, since they
> > > > - * don't have a checksum yet.
> > > > + * We skip completely new pages after checking they are
> > > > + * all-zero, since they don't have a checksum yet.
> > > > */
> > > > - if (!PageIsNew(page) && PageGetLSN(page) < startptr)
> > > > + if (PageIsNew(page))
> > > > {
> > > > - checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
> > > > - phdr = (PageHeader) page;
> > > > - if (phdr->pd_checksum != checksum)
> > > > + all_zeroes = true;
> > > > + pagebytes = (size_t *) page;
> > > > + for (int i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
> > >
> > > Can we please abstract the zeroeness check into a separate function to
> > > be used both by PageIsVerified() and this?
> >
> > Ok, done so as PageIsZero further down in bufpage.c.
>
> It turns out that pg_checksums (current master and back branches, not
> just the online version) needs this treatment as well, as it won't catch
> zeroed-out pageheader corruption, see attached patch to its TAP tests
> which trigger it (I also added a random data check similar to
> pg_basebackup as well which is not a problem for the current codebase).
>
> Any suggestion on how to handle this? Should I duplicate the
> PageIsZero() code in pg_checksums? Should I move PageIsZero into
> something like bufpage_impl.h for use by external programs, similar to
> pg_checksum_page()?
>
> I've done the latter as a POC in the second attached patch.
This is still an open item for the back branches I guess, i.e. zero page
header for pg_verify_checksums and additionally random page header for
pg_basebackup's base backup.
Do you plan to work on the patch you have outlined, what would I need to
change in the patches I submitted or is another approach warranted
entirely? Should I add my patches to the next commitfest in order to
track them?
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: [Patch] Base backups and random or zero pageheaders |
Date: | 2019-05-04 12:50:46 |
Message-ID: | 20190504125046.GC4805@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Apr 30, 2019 at 03:07:43PM +0200, Michael Banck wrote:
> This is still an open item for the back branches I guess, i.e. zero page
> header for pg_verify_checksums and additionally random page header for
> pg_basebackup's base backup.
I may be missing something, but could you add an entry in the future
commit fest about the stuff discussed here? I have not looked at your
patch closely.. Sorry.
--
Michael
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: [Patch] Base backups and random or zero pageheaders |
Date: | 2019-10-18 09:05:52 |
Message-ID: | 1571389552.26469.8.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Saturday, 04.05.2019 at 21:50 +0900, Michael Paquier wrote:
> On Tue, Apr 30, 2019 at 03:07:43PM +0200, Michael Banck wrote:
> > This is still an open item for the back branches I guess, i.e. zero page
> > header for pg_verify_checksums and additionally random page header for
> > pg_basebackup's base backup.
>
> I may be missing something, but could you add an entry in the future
> commit fest about the stuff discussed here? I have not looked at your
> patch closely.. Sorry.
Here is finally a rebased patch for the (IMO) more important issue in
pg_basebackup. I've added a commitfest entry for this now:
https://commitfest.postgresql.org/25/2308/
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-checksum-verification-in-base-backups-for-random.patch | text/x-patch | 12.6 KB |
From: | Asif Rehman <asifr(dot)rehman(at)gmail(dot)com> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: [Patch] Base backups and random or zero pageheaders |
Date: | 2020-02-25 14:34:35 |
Message-ID: | CADM=JejNpZuCwS3TkSCHnMEjs8ey_j4n9ocKZ46aCrrSmxtj1A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Oct 18, 2019 at 2:06 PM Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:
> Hi,
>
> On Saturday, 04.05.2019 at 21:50 +0900, Michael Paquier wrote:
> > On Tue, Apr 30, 2019 at 03:07:43PM +0200, Michael Banck wrote:
> > > This is still an open item for the back branches I guess, i.e. zero
> page
> > > header for pg_verify_checksums and additionally random page header for
> > > pg_basebackup's base backup.
> >
> > I may be missing something, but could you add an entry in the future
> > commit fest about the stuff discussed here? I have not looked at your
> > patch closely.. Sorry.
>
> Here is finally a rebased patch for the (IMO) more important issue in
> pg_basebackup. I've added a commitfest entry for this now:
> https://commitfest.postgresql.org/25/2308/
>
>
>
Hi Michael,
The patch does not seem to apply anymore, can you rebase it?
--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | asifr(dot)rehman(at)gmail(dot)com |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Subject: | Re: [Patch] Base backups and random or zero pageheaders |
Date: | 2020-02-25 17:28:54 |
Message-ID: | d284482fed8620783f32be8bdda7865c1c87cb99.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Tuesday, 25.02.2020 at 19:34 +0500, Asif Rehman wrote:
> On Fri, Oct 18, 2019 at 2:06 PM Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
> > Here is finally a rebased patch for the (IMO) more important issue in
> > pg_basebackup. I've added a commitfest entry for this now:
> > https://commitfest.postgresql.org/25/2308/
>
> The patch does not seem to apply anymore, can you rebase it?
Thanks for letting me know, please find attached a rebased version. I
hope the StaticAssertDecl() is still correct in bufpage.h.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-checksum-verification-in-base-backups-for-random_V2.patch | text/x-patch | 12.5 KB |
From: | Asif Rehman <asifr(dot)rehman(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Subject: | Re: Online verification of checksums |
Date: | 2020-02-27 10:57:09 |
Message-ID: | 158280102962.21707.6408582526895921673.pgcf@coridan.postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
The patch applies cleanly and works as expected. Just a few minor observations:
- I would suggest refactoring the PageIsZero function by getting rid of the all_zeroes variable
and simply returning false when a non-zero byte is found, rather than setting the all_zeroes
variable to false and breaking out of the for loop. The function should simply return true at the
end otherwise.
- Remove the empty line:
+ * would throw an assertion failure. Consider this a
+ * checksum failure.
+ */
+
+ checksum_failures++;
- Code needs to run through pgindent.
Also, I'd suggest making "5" a define within the current file/function, perhaps
something like "MAX_CHECKSUM_FAILURES". You could move the second
warning outside the conditional statement, as it appears in both the "if" and "else" blocks.
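In sketch form (the constant name comes from this review; the warning
text follows the existing message in basebackup.c):

#define MAX_CHECKSUM_FAILURES 5

    if (checksum_failures == MAX_CHECKSUM_FAILURES)
        ereport(WARNING,
                (errmsg("further checksum verification failures in file "
                        "\"%s\" will not be reported", readfilename)));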
Regards,
--Asif
The new status of this patch is: Waiting on Author
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Asif Rehman <asifr(dot)rehman(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-03-12 06:32:12 |
Message-ID: | 28133de5f2acb9e2d16ba31372e80828f6262b92.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
thanks for reviewing this patch!
On Thursday, 27.02.2020 at 10:57 +0000, Asif Rehman wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: tested, passed
> Documentation: not tested
>
> The patch applies cleanly and works as expected. Just a few minor observations:
>
> - I would suggest refactoring the PageIsZero function by getting rid of the all_zeroes variable
> and simply returning false when a non-zero byte is found, rather than setting the all_zeroes
> variable to false and breaking out of the for loop. The function should simply return true at the
> end otherwise.
Good point, I have done so.
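For reference, the refactored check then looks roughly like this (an
illustrative sketch, not the literal patch hunk):

static bool
PageIsZero(Page page)
{
    size_t     *pagebytes = (size_t *) page;

    for (int i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
    {
        if (pagebytes[i] != 0)
            return false;       /* found a non-zero word, bail out early */
    }

    return true;
}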
> - Remove the empty line:
> + * would throw an assertion failure. Consider this a
> + * checksum failure.
> + */
> +
> + checksum_failures++;
Done
> - Code needs to run through pgindent.
Done.
> Also, I'd suggest making "5" a define within the current file/function, perhaps
> something like "MAX_CHECKSUM_FAILURES". You could move the second
> warning outside the conditional statement, as it appears in both the "if" and "else" blocks.
Well, I think you have a valid point, but that would be a different
(non-bug-fix) patch, as this part is not changed by this patch; the
code is at most moved around, isn't it?
New version attached.
Best regards,
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-checksum-verification-in-base-backups-for-random_V3.patch | text/x-patch | 12.4 KB |
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-04-06 19:59:15 |
Message-ID: | 26184.1586203155@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Michael Banck <michael(dot)banck(at)credativ(dot)de> writes:
> [ 0001-Fix-checksum-verification-in-base-backups-for-random_V3.patch ]
I noticed that the cfbot wasn't testing this because of a minor merge
conflict. I rebased it over that, and also readjusted things a little bit
to avoid unnecessarily reindenting existing code, in hopes of making the
patch easier to review. Doing that reveals that the patch actually
removes a chunk of code, namely a special case for EOF. Was that
intentional, or a result of a faulty merge earlier? It certainly isn't
mentioned in your proposed commit message.
Another thing that's bothering me is that the patch compares page LSN
against GetInsertRecPtr(); but that function says
* NOTE: The value *actually* returned is the position of the last full
* xlog page. It lags behind the real insert position by at most 1 page.
* For that, we don't need to scan through WAL insertion locks, and an
* approximation is enough for the current usage of this function.
I'm not convinced that an approximation is good enough here. It seems
like a page that's just now been updated could have an LSN beyond the
current XLOG page start, potentially leading to a false checksum
complaint. Maybe we could address that by adding one xlog page to
the GetInsertRecPtr result? Kind of a hack, but ...
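In sketch form, the hack would amount to something like this (lsn_limit
and header_sane are made-up names):

/*
 * Pad the approximate insert pointer by one WAL page, so that a page
 * whose LSN legitimately sits just past the last full xlog page is not
 * flagged as insane.
 */
XLogRecPtr  lsn_limit = GetInsertRecPtr() + XLOG_BLCKSZ;
bool        header_sane = true;

if (!PageIsNew(page) && PageGetLSN(page) > lsn_limit)
    header_sane = false;    /* LSN beyond anything WAL could have assigned */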
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-checksum-verification-in-base-backups-for-random_V4.patch | text/x-diff | 7.6 KB |
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-04-06 20:45:44 |
Message-ID: | 19763.1586205944@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
I wrote:
> Another thing that's bothering me is that the patch compares page LSN
> against GetInsertRecPtr(); but that function says
> ...
> I'm not convinced that an approximation is good enough here. It seems
> like a page that's just now been updated could have an LSN beyond the
> current XLOG page start, potentially leading to a false checksum
> complaint. Maybe we could address that by adding one xlog page to
> the GetInsertRecPtr result? Kind of a hack, but ...
Actually, after thinking about that a bit more: why is there an LSN-based
special condition at all? It seems like it'd be far more useful to
checksum everything, and on failure try to re-read and re-verify the page
once or twice, so as to handle the corner case where we examine a page
that's in process of being overwritten.
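That approach might look roughly like the following frontend-side
sketch (the helper name is invented, the retry count and sleep are
placeholders, and fd/blkno/segmentno would come from the surrounding
code):

#include "postgres_fe.h"

#include <unistd.h>

#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/checksum_impl.h"

/*
 * Read one block of a segment file and verify its checksum, re-reading
 * a couple of times before reporting a failure, to step over reads
 * that raced against a concurrent write of the same page.
 */
static bool
verify_block_with_retries(int fd, BlockNumber blkno, int segmentno)
{
    PGAlignedBlock buf;

    for (int attempt = 0; attempt < 3; attempt++)
    {
        if (pread(fd, buf.data, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            return false;       /* short read: handle as its own case */

        /* All-zero ("new") pages carry no checksum; accept them here. */
        if (PageIsNew((Page) buf.data) ||
            pg_checksum_page(buf.data, blkno + segmentno * RELSEG_SIZE) ==
            ((PageHeader) buf.data)->pd_checksum)
            return true;        /* verified, or legitimately new */

        usleep(100000);         /* 100ms: wait out an in-flight write */
    }

    return false;               /* persistent mismatch: likely corruption */
}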
regards, tom lane
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Online verification of checksums |
Date: | 2020-04-06 21:15:17 |
Message-ID: | dfeee5471ea96cf30f720177b12ec9053a954598.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Monday, 06.04.2020 at 16:45 -0400, Tom Lane wrote:
> I wrote:
> > Another thing that's bothering me is that the patch compares page LSN
> > against GetInsertRecPtr(); but that function says
> > ...
> > I'm not convinced that an approximation is good enough here. It seems
> > like a page that's just now been updated could have an LSN beyond the
> > current XLOG page start, potentially leading to a false checksum
> > complaint. Maybe we could address that by adding one xlog page to
> > the GetInsertRecPtr result? Kind of a hack, but ...
I was about to write that it sounds like a pragmatic solution to me,
but...
> Actually, after thinking about that a bit more: why is there an LSN-based
> special condition at all? It seems like it'd be far more useful to
> checksum everything, and on failure try to re-read and re-verify the page
> once or twice, so as to handle the corner case where we examine a page
> that's in process of being overwritten.
Andres outlined something about a year ago which on re-reading sounds
similar to what you suggest above in
20190326170820(dot)6sylklg7eh6uhabd(at)alap3(dot)anarazel(dot)de but never posted a
full patch. He seems to have had a few additional checks from PageIsVerified() in mind, though.
The original check against the checkpoint LSN wasn't suggested by me;
I've submitted this patch with the InsertRecPtr as an upper bound as a
(presumably) minimally-invasive patch which could be back-patched (when
nothing came of the above thread for a while), but the issue seems to be
quite a bit more nuanced.
Probably we need to take a step back; the question is whether something
like what Andres suggested should/could be coded up for v13 still
(before the feature freeze) and if so, by whom (I won't have the time),
or whether it would still qualify as a back-patchable bug-fix and/or
whether your suggestion above would.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Online verification of checksums |
Date: | 2020-07-05 11:52:32 |
Message-ID: | 18FC7110-4DC0-45E4-8D58-38E71E2C30A7@yesql.se |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 6 Apr 2020, at 23:15, Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
> Probably we need to take a step back;
This patch has been Waiting on Author since the last commitfest (and no longer
applies as well), and by the sounds of the thread there are some open issues
with it. Should it be Returned with Feedback to be re-opened with a fresh take
on it?
cheers ./daniel
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Online verification of checksums |
Date: | 2020-07-30 22:26:36 |
Message-ID: | 02F17A62-8D2F-4B51-99CF-6D3B2E32D552@yesql.se |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 5 Jul 2020, at 13:52, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
>
>> On 6 Apr 2020, at 23:15, Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
>
>> Probably we need to take a step back;
>
> This patch has been Waiting on Author since the last commitfest (and no longer
> applies as well), and by the sounds of the thread there are some open issues
> with it. Should it be Returned with Feedback to be re-opened with a fresh take
> on it?
Marked as Returned with Feedback, please open a new entry in case there is a
renewed interest with a new patch.
cheers ./daniel
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-10-20 09:11:03 |
Message-ID: | 20201020091103.GA1475@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote:
> Actually, after thinking about that a bit more: why is there an LSN-based
> special condition at all? It seems like it'd be far more useful to
> checksum everything, and on failure try to re-read and re-verify the page
> once or twice, so as to handle the corner case where we examine a page
> that's in process of being overwritten.
I was reviewing this area today, and that actually matches my
impression. Why do we need an LSN-based check at all? As said
upthread, that's of course weak with random data, as we would miss most
of the real checksum failures, with the odds getting better as the
current LSN of the cluster moves on. However, it seems to me
that we would have an extra advantage in removing this check
altogether: it would be possible to check pages even if they
are more recent than the start LSN of the backup, and that could be a
lot of pages on a large cluster. So by keeping
this check we also delay the detection of real problems. As things
stand, I'd like to think that it would be much more useful to remove
this check and to have one or two extra retries (the current code only
has one). I don't like much the possibility of false positives for
such critical checks, but as we need to live with what has been
released, that looks like a good move for stable branches.
--
Michael
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-10-21 10:00:23 |
Message-ID: | 79ffc294aa51d33bf3d7569a6e72977fb2051925.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Tuesday, 20.10.2020 at 18:11 +0900, Michael Paquier wrote:
> On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote:
> > Actually, after thinking about that a bit more: why is there an LSN-based
> > special condition at all? It seems like it'd be far more useful to
> > checksum everything, and on failure try to re-read and re-verify the page
> > once or twice, so as to handle the corner case where we examine a page
> > that's in process of being overwritten.
>
> I was reviewing this area today, and that actually matches my
> impression. Why do we need an LSN-based check at all? As said
> upthread, that's of course weak with random data, as we would miss most
> of the real checksum failures, with the odds getting better as the
> current LSN of the cluster moves on. However, it seems to me
> that we would have an extra advantage in removing this check
> altogether: it would be possible to check pages even if they
> are more recent than the start LSN of the backup, and that could be a
> lot of pages on a large cluster. So by keeping
> this check we also delay the detection of real problems.
The check was ported (or the concept of it adapted) from pgBackRest if I
remember correctly.
> As things stand, I'd like to think that it would be much more useful
> to remove this check and to have one or two extra retries (the current
> code only has one). I don't like much the possibility of false
> positives for such critical checks, but as we need to live with what
> has been released, that looks like a good move for stable branches.
Sounds good to me. I think some were advocating for locking the page
before re-reading. When I looked at it, the level of abstraction that
pg_basebackup has (just a list of files chopped up into blocks, no
notion of relations I think) made that non-trivial, but maybe still
possible for v14 and beyond.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-10-21 10:10:34 |
Message-ID: | 20201021101034.GE1475@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Oct 21, 2020 at 12:00:23PM +0200, Michael Banck wrote:
> The check was ported (or the concept of it adapted) from pgBackRest if I
> remember correctly.
Okay, I did not know that.
>> As things stand, I'd like to think that it would be much more useful
>> to remove this check and to have one or two extra retries (the current
>> code only has one). I don't like much the possibility of false
>> positives for such critical checks, but as we need to live with what
>> has been released, that looks like a good move for stable branches.
>
> Sounds good to me. I think some were advocating for locking the page
> before re-reading. When I looked at it, the level of abstraction that
> pg_basebackup has (just a list of files chopped up into blocks, no
> notion of relations I think) made that non-trivial, but maybe still
> possible for v14 and beyond.
That's an API layer I was looking at here:
/message-id/flat/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w(at)mail(dot)gmail(dot)com
My guess is that we should be able to make use of that for base
backups as well, but I also think that I'd rather let v13 go with more
retries without depending on a new API layer, removing the LSN
check altogether. Thinking of it, that's actually roughly what I
posted here, but without the PageGetLSN() bit in the refactored code.
So I see a pretty good argument to address the stable branches with
that, and study for the future a better API to govern them all:
/message-id/20201020062432.GA30362@paquier.xyz
--
Michael
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-10-22 01:41:53 |
Message-ID: | 20201022014153.GG1475@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Oct 21, 2020 at 07:10:34PM +0900, Michael Paquier wrote:
> My guess is that we should be able to make use of that for base
> backups as well, but I also think that I'd rather let v13 go with more
> retries without depending on a new API layer, removing the LSN
> check altogether. Thinking of it, that's actually roughly what I
> posted here, but without the PageGetLSN() bit in the refactored code.
> So I see a pretty good argument to address the stable branches with
> that, and study for the future a better API to govern them all:
> /message-id/20201020062432.GA30362@paquier.xyz
So, I was sleeping on this one, and I could not find a reason why we
should not address both the zero case and the random data case at the
same time, as mentioned here:
/message-id/20201022012519.GF1475@paquier.xyz
We cannot trust the fields of the page header because these may
have been messed up by some random corruption, so what really
matters is whether the checksums match, and we can just rely
on that. The zero-only case of a page is different because such pages
don't have a checksum set, so I would finish with something like the
attached to make the detection more robust. This does not make the
detection perfect, as there is no locking to guard against concurrent
writes (we really need that, but v13 has been released already), but
with a sufficient number of retries this can make things much more
reliable than what's present.
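The detection order this implies would be roughly the following (a
sketch of the decision logic, not the actual patch; PageIsZero is the
zero-page helper discussed earlier in the thread, and page, blkno,
segmentno and checksum_failures come from sendFile()):

if (PageIsZero((Page) page))
{
    /* ok: an all-zero page is new and carries no checksum yet */
}
else if (PageIsNew((Page) page))
{
    /* pd_upper is 0 but the page is not all-zero: corrupted header */
    checksum_failures++;
}
else if (pg_checksum_page(page, blkno + segmentno * RELSEG_SIZE) !=
         ((PageHeader) page)->pd_checksum)
{
    /* checksum mismatch: corruption, or a torn read worth a retry */
    checksum_failures++;
}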
Are there any comments? Anybody?
--
Michael
Attachment | Content-Type | Size |
---|---|---|
checksums-zeros-v7.patch | text/x-diff | 13.1 KB |
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-10-30 02:30:28 |
Message-ID: | 20201030023028.GC1693@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Oct 22, 2020 at 10:41:53AM +0900, Michael Paquier wrote:
> We cannot trust the fields of the page header because these may
> have been messed up by some random corruption, so what really
> matters is whether the checksums match, and we can just rely
> on that. The zero-only case of a page is different because such pages
> don't have a checksum set, so I would finish with something like the
> attached to make the detection more robust. This does not make the
> detection perfect, as there is no locking to guard against concurrent
> writes (we really need that, but v13 has been released already), but
> with a sufficient number of retries this can make things much more
> reliable than what's present.
>
> Are there any comments? Anybody?
So, hearing nothing, attached is a set of patches that I would like to
apply to 11~ to address the set of issues of this thread. This comes
with two parts:
- Some refactoring of PageIsVerified(), similar to d401c57 on HEAD
except that this keeps ABI compatibility.
- The actual patch, with tweaks for each stable branch.
Playing with dd and generating random pages, this detects random
corruptions, making use of a wait/retry loop if a failure is detected.
As mentioned upthread, this is a double-edged sword: increasing the
number of retries reduces the chances of false positives, at the cost
of making regression tests longer. This stuff uses up to 5 retries
with 100ms of sleep for each page. (I am aware of the fact that the
commit message of the main patch is not written yet).
--
Michael
Attachment | Content-Type | Size |
---|---|---|
v8-master-0001-Fix-page-verifications-in-base-backups.patch | text/x-diff | 8.4 KB |
v8-13-0001-Extend-PageIsVerified-to-handle-more-custom-optio.patch | text/x-diff | 6.8 KB |
v8-13-0002-Fix-page-verification-in-base-backups.patch | text/x-diff | 9.3 KB |
v8-12-0001-Extend-PageIsVerified-to-handle-more-custom-optio.patch | text/x-diff | 6.0 KB |
v8-12-0002-Fix-page-verification-in-base-backups.patch | text/x-diff | 9.3 KB |
v8-11-0001-Extend-PageIsVerified-to-handle-more-custom-optio.patch | text/x-diff | 5.1 KB |
v8-11-0002-Fix-page-verification-in-base-backups.patch | text/x-diff | 9.3 KB |
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-04 08:48:41 |
Message-ID: | 20201104084841.GF1711@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Oct 30, 2020 at 11:30:28AM +0900, Michael Paquier wrote:
> Playing with dd and generating random pages, this detects random
> corruptions, making use of a wait/retry loop if a failure is detected.
> As mentioned upthread, this is a double-edged sword: increasing the
> number of retries reduces the chances of false positives, at the cost
> of making regression tests longer. This stuff uses up to 5 retries
> with 100ms of sleep for each page. (I am aware of the fact that the
> commit message of the main patch is not written yet).
So, I have done much more testing of this patch using an instance with
a small shared buffer pool and pgbench running in parallel for having
a large eviction rate, and I cannot convince myself to do that. My
laptop got easily constrained on I/O, and within a total of 2000 base
backups or so, I have seen some 5 backup failures with a correct
detection logic. The rate is low here, but that could be annoying for
users even at 1~2%. Couldn't we take a different approach and remove
this feature instead? This still requires the grammar to be present
in back branches, but as things stand, we have a feature that fails
its promise, and that also needlessly eats resources for each base
backup taken :/
--
Michael
From: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-04 16:41:39 |
Message-ID: | f4fc5f14f48aa58d2a6e8135e848c87d95eeae82.camel@credativ.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Wednesday, 04.11.2020 at 17:48 +0900, Michael Paquier wrote:
> On Fri, Oct 30, 2020 at 11:30:28AM +0900, Michael Paquier wrote:
> > Playing with dd and generating random pages, this detects random
> > corruptions, making use of a wait/retry loop if a failure is detected.
> > As mentioned upthread, this is a double-edged sword, increasing the
> > number of retries reduces the changes of false positives, at the cost
> > of making regression tests longer. This stuff uses up to 5 retries
> > with 100ms of sleep for each page. (I am aware of the fact that the
> > commit message of the main patch is not written yet).
>
> So, I have done much more testing of this patch using an instance with
> a small shared buffer pool and pgbench running in parallel to get a
> large eviction rate, and I cannot convince myself to do that. My
> laptop easily got constrained on I/O, and within a total of 2000 base
> backups or so, I have seen some 5 backup failures with a correct
> detection logic.
I don't quite understand what you mean here: how do the base backups
fail, and what exactly is "correct detection logic"?
Michael
--
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing Directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Our handling of personal data is subject to the following
provisions: https://www.credativ.de/datenschutz
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-05 01:57:16 |
Message-ID: | 20201105015716.GC1632@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Nov 04, 2020 at 05:41:39PM +0100, Michael Banck wrote:
> Am Mittwoch, den 04.11.2020, 17:48 +0900 schrieb Michael Paquier:
>> So, I have done much more testing of this patch using an instance with
>> a small shared buffer pool and pgbench running in parallel to get a
>> large eviction rate, and I cannot convince myself to do that. My
>> laptop easily got constrained on I/O, and within a total of 2000 base
>> backups or so, I have seen some 5 backup failures with a correct
>> detection logic.
>
> I don't quite understand what you mean here: how do the base backups
> fail,
As of basebackup.c, on HEAD:
    if (total_checksum_failures)
    {
        if (total_checksum_failures > 1)
            ereport(WARNING,
                    (errmsg_plural("%lld total checksum verification failure",
                                   "%lld total checksum verification failures",
                                   total_checksum_failures,
                                   total_checksum_failures)));

        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("checksum verification failure during base backup")));
    }
This means that when at least one page verification fails,
pg_basebackup would fail.
> and what exactly is "correct detection logic"?
I was referring to the patch I sent on this thread that fixes the
detection of a corruption for the zero-only case and where pd_lsn
and/or pd_upper are trashed by a corruption of the page header. Both
cases allow a base backup to complete on HEAD, while sending pages
that could be corrupted, which is wrong. Once you make the page
verification rely only on pd_checksum, as the patch does because the
checksum is the only source of truth in the page header, corrupted
pages are correctly detected, causing pg_basebackup to complain as it
should. However, it also has the risk of causing pg_basebackup to fail
*and* to report as broken pages that are in the process of being
written, depending on how slowly a disk is able to finish an 8kB write.
That's a different kind of wrongness, and users have two more reasons
to be pissed. Note that if a page is found as torn we have a
consistent page header, meaning that on HEAD the PageIsNew() and
PageGetLSN() checks would pass, but the checksum verification would
fail as the contents at the end of the page do not match the checksum.
--
Michael
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Magnus Hagander <magnus(at)hagander(dot)net> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-10 04:44:11 |
Message-ID: | 20201110044411.GJ1887@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> I was referring to the patch I sent on this thread that fixes the
> detection of a corruption for the zero-only case and where pd_lsn
> and/or pd_upper are trashed by a corruption of the page header. Both
> cases allow a base backup to complete on HEAD, while sending pages
> that could be corrupted, which is wrong. Once you make the page
> verification rely only on pd_checksum, as the patch does because the
> checksum is the only source of truth in the page header, corrupted
> pages are correctly detected, causing pg_basebackup to complain as it
> should. However, it also has the risk of causing pg_basebackup to fail
> *and* to report as broken pages that are in the process of being
> written, depending on how slowly a disk is able to finish an 8kB write.
> That's a different kind of wrongness, and users have two more reasons
> to be pissed. Note that if a page is found as torn we have a
> consistent page header, meaning that on HEAD the PageIsNew() and
> PageGetLSN() checks would pass, but the checksum verification would
> fail as the contents at the end of the page do not match the checksum.
Magnus, as the original committer of 4eb77d5, do you have an opinion
to share?
--
Michael
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-15 15:37:36 |
Message-ID: | CABUevEwspcASScjnpK-Eb3fuP3fpxGFAzx3EhDE4nfG=xAsUUw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> > I was referring to the patch I sent on this thread that fixes the
> > detection of a corruption for the zero-only case and where pd_lsn
> > and/or pd_upper are trashed by a corruption of the page header. Both
> > cases allow a base backup to complete on HEAD, while sending pages
> > that could be corrupted, which is wrong. Once you make the page
> > verification rely only on pd_checksum, as the patch does because the
> > checksum is the only source of truth in the page header, corrupted
> > pages are correctly detected, causing pg_basebackup to complain as it
> > should. However, it also has the risk of causing pg_basebackup to fail
> > *and* to report as broken pages that are in the process of being
> > written, depending on how slowly a disk is able to finish an 8kB write.
> > That's a different kind of wrongness, and users have two more reasons
> > to be pissed. Note that if a page is found as torn we have a
> > consistent page header, meaning that on HEAD the PageIsNew() and
> > PageGetLSN() checks would pass, but the checksum verification would
> > fail as the contents at the end of the page do not match the checksum.
>
> Magnus, as the original committer of 4eb77d5, do you have an opinion
> to share?
>
I admit that I at some point lost track of the overlapping threads around
this, and just figured there were enough different checksum-involved
people on those threads to handle it :) Meaning the short answer is "no,
I don't really have one at this point".
A slightly longer comment is that it does seem reasonable, but I have not
read up on all the different issues discussed over the whole thread, so
take that as a weak-certainty comment.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-16 00:23:24 |
Message-ID: | 20201116002324.GB2656@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
>>> I was referring to the patch I sent on this thread that fixes the
>>> detection of a corruption for the zero-only case and where pd_lsn
>>> and/or pd_upper are trashed by a corruption of the page header. Both
>>> cases allow a base backup to complete on HEAD, while sending pages
>>> that could be corrupted, which is wrong. Once you make the page
>>> verification rely only on pd_checksum, as the patch does because the
>>> checksum is the only source of truth in the page header, corrupted
>>> pages are correctly detected, causing pg_basebackup to complain as it
>>> should. However, it also has the risk of causing pg_basebackup to fail
>>> *and* to report as broken pages that are in the process of being
>>> written, depending on how slowly a disk is able to finish an 8kB write.
>>> That's a different kind of wrongness, and users have two more reasons
>>> to be pissed. Note that if a page is found as torn we have a
>>> consistent page header, meaning that on HEAD the PageIsNew() and
>>> PageGetLSN() checks would pass, but the checksum verification would
>>> fail as the contents at the end of the page do not match the checksum.
>>
>> Magnus, as the original committer of 4eb77d5, do you have an opinion
>> to share?
>>
>
> I admit that I at some point lost track of the overlapping threads around
> this, and just figured there were enough different checksum-involved
> people on those threads to handle it :) Meaning the short answer is "no,
> I don't really have one at this point".
>
> A slightly longer comment is that it does seem reasonable, but I have not
> read up on all the different issues discussed over the whole thread, so
> take that as a weak-certainty comment.
Which part are you considering as reasonable? The removal-feature
part on a stable branch or perhaps something else?
--
Michael
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-16 10:41:51 |
Message-ID: | CABUevEw-3jqFFu6hbsai4-4w+Zr+C_rFOBmk4zyb006X_4a6nA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Nov 16, 2020 at 1:23 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Sun, Nov 15, 2020 at 04:37:36PM +0100, Magnus Hagander wrote:
> > On Tue, Nov 10, 2020 at 5:44 AM Michael Paquier <michael(at)paquier(dot)xyz>
> wrote:
> >> On Thu, Nov 05, 2020 at 10:57:16AM +0900, Michael Paquier wrote:
> >>> I was referring to the patch I sent on this thread that fixes the
> >>> detection of a corruption for the zero-only case and where pd_lsn
> >>> and/or pd_upper are trashed by a corruption of the page header. Both
> >>> cases allow a base backup to complete on HEAD, while sending pages
> >>> that could be corrupted, which is wrong. Once you make the page
> >>> verification rely only on pd_checksum, as the patch does because the
> >>> checksum is the only source of truth in the page header, corrupted
> >>> pages are correctly detected, causing pg_basebackup to complain as it
> >>> should. However, it also has the risk of causing pg_basebackup to fail
> >>> *and* to report as broken pages that are in the process of being
> >>> written, depending on how slowly a disk is able to finish an 8kB write.
> >>> That's a different kind of wrongness, and users have two more reasons
> >>> to be pissed. Note that if a page is found as torn we have a
> >>> consistent page header, meaning that on HEAD the PageIsNew() and
> >>> PageGetLSN() checks would pass, but the checksum verification would
> >>> fail as the contents at the end of the page do not match the checksum.
> >>
> >> Magnus, as the original committer of 4eb77d5, do you have an opinion
> >> to share?
> >>
> >
> > I admit that I at some point lost track of the overlapping threads around
> > this, and just figured there were enough different checksum-involved
> > people on those threads to handle it :) Meaning the short answer is "no,
> > I don't really have one at this point".
> >
> > A slightly longer comment is that it does seem reasonable, but I have not
> > read up on all the different issues discussed over the whole thread, so
> > take that as a weak-certainty comment.
>
> Which part are you considering as reasonable? The removal-feature
> part on a stable branch or perhaps something else?
>
I was referring to the latest patch on the thread. But as I said, I have
not read up on all the different issues raised in the thread, so take it
with a big grain of salt.
And I would also echo the previous comment that this code was adapted from
what the pgbackrest folks do. As such, it would be good to get a comment
from, for example, David on that -- I don't see any of them having commented
after that was mentioned?
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-20 07:28:01 |
Message-ID: | 20201120072801.GF8506@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> I was referring to the latest patch on the thread. But as I said, I have
> not read up on all the different issues raised in the thread, so take it
> with a big grain of salt.
>
> And I would also echo the previous comment that this code was adapted from
> what the pgbackrest folks do. As such, it would be good to get a comment
> from, for example, David on that -- I don't see any of them having commented
> after that was mentioned?
Agreed. I am adding Stephen as well in CC. From the code of
pgBackRest, the same logic happens in src/command/backup/pageChecksum.c
(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
happen before verifying the checksum. So, if the page header finishes
with random junk because of some kind of corruption, even corrupted
pages would be incorrectly considered correct if the random data
passes the pd_upper and pd_lsn checks :/
--
Michael
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz>, Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-20 14:50:33 |
Message-ID: | 6eb9697d-5eca-8b9a-6f1a-44e9f13054b7@pgmasters.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Michael,
On 11/20/20 2:28 AM, Michael Paquier wrote:
> On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
>> I was referring to the latest patch on the thread. But as I said, I have
>> not read up on all the different issues raised in the thread, so take it
>> with a big grain of salt.
>>
>> And I would also echo the previous comment that this code was adapted from
>> what the pgbackrest folks do. As such, it would be good to get a comment
>> from, for example, David on that -- I don't see any of them having commented
>> after that was mentioned?
>
> Agreed. I am adding Stephen as well in CC. From the code of
> pgBackRest, the same logic happens in src/command/backup/pageChecksum.c
> (see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> happen before verifying the checksum. So, if the page header finishes
> with random junk because of some kind of corruption, even corrupted
> pages would be incorrectly considered correct if the random data
> passes the pd_upper and pd_lsn checks :/
Indeed, this is not good, as Andres pointed out some time ago. My
apologies for not getting to this sooner.
Our current plan for pgBackRest:
1) Remove the LSN check as you have done in your patch and when
rechecking see if the page has become valid *or* the LSN is ascending.
2) Check the LSN against the max LSN reported by PostgreSQL to make sure
it is valid.
These do completely rule out any type of corruption, but they certainly
narrow the possibility by a lot.
In the future we would also like to scan the WAL to verify that the page
is definitely being written to.
As for your patch, it mostly looks good but my objection is that a page
may be reported as invalid after 5 retries when in fact it may just be
very hot.
Maybe checking for an ascending LSN is a good idea there as well? At
least in that case we could issue a different warning, instead of
"checksum verification failed" perhaps "checksum verification skipped
due to concurrent modifications".
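To make that concrete, here is a minimal sketch of how the retry
outcome could be classified and reported; the function and parameter
names are mine, purely for illustration, and not taken from the patch:

    /*
     * Hypothetical sketch only: classify a block that still fails
     * checksum verification after all retries.  If its LSN kept
     * ascending between reads, report it as skipped due to concurrent
     * modifications rather than as a verification failure.
     */
    static void
    report_failed_block(const char *path, BlockNumber blkno,
                        XLogRecPtr first_lsn, XLogRecPtr last_lsn)
    {
        if (last_lsn > first_lsn)
            ereport(WARNING,
                    (errmsg("checksum verification skipped in file \"%s\", block %u: "
                            "concurrent modifications",
                            path, blkno)));
        else
            ereport(WARNING,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("checksum verification failed in file \"%s\", block %u",
                            path, blkno)));
    }

Whether WARNING is the right level for the "skipped" case is of course
its own judgment call.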
Regards,
--
-David
david(at)pgmasters(dot)net
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-20 16:08:27 |
Message-ID: | 20201120160827.GY16415@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* David Steele (david(at)pgmasters(dot)net) wrote:
> On 11/20/20 2:28 AM, Michael Paquier wrote:
> >On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
> >>I was referring to the latest patch on the thread. But as I said, I have
> >>not read up on all the different issues raised in the thread, so take it
> >>with a big grain of salt.
> >>
> >>And I would also echo the previous comment that this code was adapted from
> >>what the pgbackrest folks do. As such, it would be good to get a comment
> >>from, for example, David on that -- I don't see any of them having commented
> >>after that was mentioned?
> >
> >Agreed. I am adding Stephen as well in CC. From the code of
> >pgBackRest, the same logic happens in src/command/backup/pageChecksum.c
> >(see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> >happen before verifying the checksum. So, if the page header finishes
> >with random junk because of some kind of corruption, even corrupted
> >pages would be incorrectly considered correct if the random data
> >passes the pd_upper and pd_lsn checks :/
>
> Indeed, this is not good, as Andres pointed out some time ago. My apologies
> for not getting to this sooner.
Yeah, it's been on our backlog to improve this.
> Our current plan for pgBackRest:
>
> 1) Remove the LSN check as you have done in your patch and when rechecking
> see if the page has become valid *or* the LSN is ascending.
> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> is valid.
Yup, that's my recollection also as to our plans for how to improve
things here.
> These do completely rule out any type of corruption, but they certainly
> narrow the possibility by a lot.
*don't :)
> In the future we would also like to scan the WAL to verify that the page is
> definitely being written to.
Yeah, that'd certainly be nice to do too.
> As for your patch, it mostly looks good but my objection is that a page may
> be reported as invalid after 5 retries when in fact it may just be very hot.
Yeah.. while unlikely that it'd actually get written out that much, it
does seem at least possible.
> Maybe checking for an ascending LSN is a good idea there as well? At least
> in that case we could issue a different warning, instead of "checksum
> verification failed" perhaps "checksum verification skipped due to
> concurrent modifications".
+1.
Thanks,
Stephen
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-21 01:30:03 |
Message-ID: | 20201121013003.GC6052@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> David Steele (david(at)pgmasters(dot)net) wrote:
>> Our current plan for pgBackRest:
>>
>> 1) Remove the LSN check as you have done in your patch and when rechecking
>> see if the page has become valid *or* the LSN is ascending.
>> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
>> is valid.
>
> Yup, that's my recollection also as to our plans for how to improve
> things here.
>
>> These do completely rule out any type of corruption, but they certainly
>> narrow the possibility by a lot.
>
> *don't :)
Have you considered the possibility of only using pd_checksum for the
validation? This is the only source of truth in the page header we
can rely on to validate the full contents of the page, so if the logic
relies on anything but the checksum then you expose it to the risk
of reporting pages as corrupted while they were just torn, or of
missing corrupted pages, which is what we should avoid for such things.
Both are bad.
>> As for your patch, it mostly looks good but my objection is that a page may
>> be reported as invalid after 5 retries when in fact it may just be very hot.
>
> Yeah.. while unlikely that it'd actually get written out that much, it
> does seem at least possible.
>
>> Maybe checking for an ascending LSN is a good idea there as well? At least
>> in that case we could issue a different warning, instead of "checksum
>> verification failed" perhaps "checksum verification skipped due to
>> concurrent modifications".
>
> +1.
I don't quite understand how you can make sure that the page is not
corrupted here? It could be possible that the last 4kB of an 8kB page
got corrupted, where the header had valid data but the checksum
verification failed. So if you are not careful you could have at
hand a corrupted page discarded because it failed the retry
multiple times in a row. The only method I can think of as being
really reliable is based on two facts:
- Do a check only on pd_checksum, as that validates the full contents
of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity in the shared buffers. This requires an API we don't have
yet.
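For illustration, a checksum-only validation could look like the sketch
below. pg_checksum_page() (from storage/checksum.h) computes the
checksum with the pd_checksum field itself masked out, so no other
header field needs to be trusted. This is my sketch of the idea, not
code from the patch:

    #include "storage/bufpage.h"
    #include "storage/checksum.h"

    /*
     * Sketch: trust nothing in the page header except pd_checksum.
     * New (all-zero) pages carry no checksum and would need the
     * separate zero-only check mentioned earlier in the thread.
     */
    static bool
    page_checksum_matches(char *page, BlockNumber blkno)
    {
        PageHeader  phdr = (PageHeader) page;

        return phdr->pd_checksum == pg_checksum_page(page, blkno);
    }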
--
Michael
From: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-23 12:53:42 |
Message-ID: | 196553ba-65ac-ce1b-acd9-24209d9ec9eb@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 21.11.2020 04:30, Michael Paquier wrote:
> The only method I can think of as being really
> reliable is based on two facts:
> - Do a check only on pd_checksum, as that validates the full contents
> of the page.
> - When doing a retry, make sure that there is no concurrent I/O
> activity in the shared buffers. This requires an API we don't have
> yet.
It seems reasonable to me to rely on checksums only.
As for retry, I think that an API for concurrent I/O will be complicated.
Instead, we can introduce a function to read the page directly from
shared buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a
bullet-proof solution to me. Do you see any possible problems with it?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-23 15:29:57 |
Message-ID: | 20201123152957.GH16415@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> > David Steele (david(at)pgmasters(dot)net) wrote:
> >> Our current plan for pgBackRest:
> >>
> >> 1) Remove the LSN check as you have done in your patch and when rechecking
> >> see if the page has become valid *or* the LSN is ascending.
> >> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> >> is valid.
> >
> > Yup, that's my recollection also as to our plans for how to improve
> > things here.
> >
> >> These do completely rule out any type of corruption, but they certainly
> >> narrow the possibility by a lot.
> >
> > *don't :)
>
> Have you considered the possibility of only using pd_checksum for the
> validation? This is the only source of truth in the page header we
> can rely on to validate the full contents of the page, so if the logic
> relies on anything but the checksum then you expose it to the risk
> of reporting pages as corrupted while they were just torn, or of
> missing corrupted pages, which is what we should avoid for such things.
> Both are bad.
There's no doubt that you'll get checksum failures from time to time,
and that it's an entirely valid case if the page is being concurrently
written, so we have to decide if we should be reporting those failures,
retrying, or what.
It's not at all clear what you're suggesting here as to how you can use
'only' the checksum.
> >> As for your patch, it mostly looks good but my objection is that a page may
> >> be reported as invalid after 5 retries when in fact it may just be very hot.
> >
> > Yeah.. while unlikely that it'd actually get written out that much, it
> > does seem at least possible.
> >
> >> Maybe checking for an ascending LSN is a good idea there as well? At least
> >> in that case we could issue a different warning, instead of "checksum
> >> verification failed" perhaps "checksum verification skipped due to
> >> concurrent modifications".
> >
> > +1.
>
> I don't quite understand how you can make sure that the page is not
> corrupted here? It could be possible that the last 4kB of an 8kB page
> got corrupted, where the header had valid data but the checksum
> verification failed.
Not sure that the proposed approach was really understood here.
Specifically what we're talking about is:
- read(), save the LSN seen
- calculate checksum, get a failure
- re-read(), compare LSN to prior LSN, maybe also re-check checksum
If the checksum fails again AND the LSN has changed and increased (and
perhaps otherwise seems reasonable) then we have at least a bit more
confidence that the failing checksum is due to the page being rewritten
concurrently and not due to latent storage corruption, which is the
specific distinction that we're trying to discern here.
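In code form, that sequence could look roughly like this; read_block()
and page_checksum_matches() are placeholder helpers of my own, not
actual APIs, and this is a sketch of the idea rather than the patch:

    /*
     * Sketch of the re-read protocol described above, with
     * hypothetical helpers.  Returns true when the block is either
     * valid or very likely just being rewritten concurrently.
     */
    static bool
    verify_block_with_reread(int fd, BlockNumber blkno, char *page)
    {
        XLogRecPtr  first_lsn;

        if (page_checksum_matches(page, blkno))
            return true;

        /* Save the LSN seen on the failing read, then re-read. */
        first_lsn = PageGetLSN(page);
        read_block(fd, blkno, page);

        if (page_checksum_matches(page, blkno))
            return true;

        /*
         * Still failing: if the LSN has changed and increased, the
         * page was most likely being rewritten concurrently rather
         * than corrupted.
         */
        return PageGetLSN(page) > first_lsn;
    }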
> So if you are not careful you could have at
> hand a corrupted page discarded because it failed the retry
> multiple times in a row.
The point of checking for an ascending LSN is to see if the page is
being concurrently modified. If it is, then we actually don't care if
the page is corrupted because it's going to be rewritten during WAL
replay as part of the restore process.
> The only method I can think of as being really
> reliable is based on two facts:
> - Do a check only on pd_checksum, as that validates the full contents
> of the page.
> - When doing a retry, make sure that there is no concurrent I/O
> activity in the shared buffers. This requires an API we don't have
> yet.
I don't think we actually want the backup process to start locking
pages, which it seems like is what you're suggesting here..? Trying to
do a check without a lock and without having PG end up reading the page
back in if it had been evicted due to pressure seems likely to be hard
to do reliably and without race conditions complicating things.
The other 100% reliable approach, as David discussed before, is to be
scanning the WAL at the same time and to ignore any checksum failures
for pages that we know are in the WAL with FPIs. Unfortunately, reading
WAL for all different versions of PG is a fair bit of work and we
haven't quite gotten to biting that off yet (though it's on the
roadmap), and the core code certainly doesn't help us in that regard
since any given version only supports the current major version WAL (an
issue pg_basebackup would also have to deal with, were it to be
modified to use such an approach and to continue working with older
versions of PG..). In a similar vein to what we do (in pgbackrest) with
pg_control, we expect to develop our own library basically vendorizing
WAL reading code from all the major versions of PG which we support in
order to track FPIs, restore points, all the kinds of potential recovery
targets, and other useful information.
Thanks,
Stephen
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-23 15:35:54 |
Message-ID: | 20201123153553.GI16415@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Anastasia Lubennikova (a(dot)lubennikova(at)postgrespro(dot)ru) wrote:
> On 21.11.2020 04:30, Michael Paquier wrote:
> >The only method I can think of as being really
> >reliable is based on two facts:
> >- Do a check only on pd_checksum, as that validates the full contents
> >of the page.
> >- When doing a retry, make sure that there is no concurrent I/O
> >activity in the shared buffers. This requires an API we don't have
> >yet.
>
> It seems reasonable to me to rely on checksums only.
>
> As for retry, I think that an API for concurrent I/O will be complicated.
> Instead, we can introduce a function to read the page directly from shared
> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> solution to me. Do you see any possible problems with it?
We might end up reading pages back in that have been evicted, for one
thing, which doesn't seem great, and this also seems likely to be
awkward for cases which aren't using the replication protocol, unless
every process maintains a connection to PG the entire time, which also
doesn't seem great.
Also- what is the point of reading the page from shared buffers
anyway..? All we need to do is prove that the page will be rewritten
during WAL replay. If we can prove that, we don't actually care what
the contents of the page are. We certainly can't calculate the
checksum on a page we plucked out of shared buffers since we only
calculate the checksum when we go to write the page out.
Thanks,
Stephen
From: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-23 21:17:54 |
Message-ID: | 97cbb4a6-ee0a-4cad-6b65-84e06d14dfe9@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 23.11.2020 18:35, Stephen Frost wrote:
> Greetings,
>
> * Anastasia Lubennikova (a(dot)lubennikova(at)postgrespro(dot)ru) wrote:
>> On 21.11.2020 04:30, Michael Paquier wrote:
>>> The only method I can think of as being really
>>> reliable is based on two facts:
>>> - Do a check only on pd_checksum, as that validates the full contents
>>> of the page.
>>> - When doing a retry, make sure that there is no concurrent I/O
>>> activity in the shared buffers. This requires an API we don't have
>>> yet.
>> It seems reasonable to me to rely on checksums only.
>>
>> As for retry, I think that an API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from shared
>> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
>> solution to me. Do you see any possible problems with it?
> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great,
TBH, I think it is highly unlikely that the page that was just updated
will be evicted.
> and this also seems likely to be
> awkward for cases which aren't using the replication protocol, unless
> every process maintains a connection to PG the entire time, which also
> doesn't seem great.
Have I missed something? Now pg_basebackup has only one process + one
child process for streaming. Anyway, I totally agree with your argument.
The need to maintain connection(s) to PG is the most unpleasant part of
the proposed approach.
> Also- what is the point of reading the page from shared buffers
> anyway..?
Well... Reading a page from shared buffers is a reliable way to get a
correct page from postgres under any concurrent load. So it just seems
natural to me.
> All we need to do is prove that the page will be rewritten
> during WAL replay.
Yes, and this is a tricky part. Until you explained it in your
latest message, I wasn't sure how we could distinguish a concurrent
update from a page header corruption. Now I agree that if the page LSN
updated and increased between rereads, it is safe enough to conclude
that we have some concurrent load.
> If we can prove that, we don't actually care what
> the contents of the page are. We certainly can't calculate the
> checksum on a page we plucked out of shared buffers since we only
> calculate the checksum when we go to write the page out.
Good point. I was thinking that we could recalculate the checksum. Or
even save the page without it, as we have checked the LSN and know for
sure that it will be rewritten by WAL replay.
To sum up, I agree with your proposal to reread the page and rely on
ascending LSNs. Can you submit a patch?
You can write it on top of the latest attachment in this thread,
v8-master-0001-Fix-page-verifications-in-base-backups.patch, from this
message: /message-id/20201030023028.GC1693@paquier.xyz
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-23 22:28:52 |
Message-ID: | 20201123222852.GP16415@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Anastasia Lubennikova (a(dot)lubennikova(at)postgrespro(dot)ru) wrote:
> On 23.11.2020 18:35, Stephen Frost wrote:
> >* Anastasia Lubennikova (a(dot)lubennikova(at)postgrespro(dot)ru) wrote:
> >>On 21.11.2020 04:30, Michael Paquier wrote:
> >>>The only method I can think of as being really
> >>>reliable is based on two facts:
> >>>- Do a check only on pd_checksum, as that validates the full contents
> >>>of the page.
> >>>- When doing a retry, make sure that there is no concurrent I/O
> >>>activity in the shared buffers. This requires an API we don't have
> >>>yet.
> >>It seems reasonable to me to rely on checksums only.
> >>
> >>As for retry, I think that an API for concurrent I/O will be complicated.
> >>Instead, we can introduce a function to read the page directly from shared
> >>buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> >>solution to me. Do you see any possible problems with it?
> >We might end up reading pages back in that have been evicted, for one
> >thing, which doesn't seem great,
> TBH, I think it is highly unlikely that the page that was just updated will
> be evicted.
Is it though..? Consider that the page which was being written out was
written specifically to free a buffer for use by another backend-
while perhaps that doesn't happen all the time, it certainly happens
enough on very busy systems.
> >and this also seems likely to be
> >awkward for cases which aren't using the replication protocol, unless
> >every process maintains a connection to PG the entire time, which also
> >doesn't seem great.
> Have I missed something? Now pg_basebackup has only one process + one child
> process for streaming. Anyway, I totally agree with your argument. The need
> to maintain connection(s) to PG is the most unpleasant part of the proposed
> approach.
I was thinking beyond pg_basebackup, yes; apologies for that not being
clear but that's what I was meaning when I said "aren't using the
replication protocol".
> >Also- what is the point of reading the page from shared buffers
> >anyway..?
> Well... Reading a page from shared buffers is a reliable way to get a
> correct page from postgres under any concurrent load. So it just seems
> natural to me.
Yes, that's true, but if a dirty page was just written out by a backend
in order to be able to evict it, so that the backend can then pull in a
new page, then having pg_basebackup pull that page back in really isn't
great.
> >All we need to do is prove that the page will be rewritten
> >during WAL replay.
> Yes, and this is a tricky part. Until you explained it in your latest
> message, I wasn't sure how we could distinguish a concurrent update from
> a page header corruption. Now I agree that if the page LSN updated and
> increased between rereads, it is safe enough to conclude that we have
> some concurrent load.
Even in this case, it's almost free to compare the LSN to the starting
backup LSN, and to the current LSN position, and make sure it's
somewhere between the two. While that doesn't entirely eliminate the
possibility that the page happened to get corrupted *and* return a
different result on subsequent reads *and* that it was corrupted in such
a way that the LSN ended up falling between the starting backup LSN and
the current LSN, it's certainly reducing the chances of a false negative
a fair bit.
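That range check is cheap to express. A minimal sketch, with names of
my own choosing and not from any patch on this thread:

    /*
     * Sketch: a re-read LSN only counts as evidence of a concurrent
     * write if it advanced past the first read and falls inside the
     * LSN range covered by the backup.
     */
    static bool
    lsn_plausibly_concurrent(XLogRecPtr first_lsn, XLogRecPtr reread_lsn,
                             XLogRecPtr backup_start_lsn,
                             XLogRecPtr current_lsn)
    {
        return reread_lsn > first_lsn &&
               reread_lsn >= backup_start_lsn &&
               reread_lsn <= current_lsn;
    }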
A concern here, however, is- can we be 100% sure that we'll get a
different result from the two subsequent reads? For my part, at least,
I've been doubtful that it's possible but it'd be nice to hear it from
someone who has really looked at the kernel side. To try and clarify,
let me illustrate:
pg_basebackup (the backend that's sending data to it anyway) starts
reading an 8K page, but gets interrupted halfway through, meaning that
it's read 4K and is now paused.
PG writes that same 8K page, and is able to successfully write the
entire block.
pg_basebackup then wakes up, reads the second half, computes a checksum
and gets a checksum failure.
At this point the question is: if pg_basebackup loops, seeks and
re-reads the same 8K block again, is it possible that pg_basebackup will
get the "old" starting 4K and the "new" ending 4K again? I'd like to
think that the answer is 'no' and that the kernel will guarantee that if
we managed to read a "new" ending 4K block then the following read of
the full 8K block would be guaranteed to give us the "new" starting 4K.
If that is truly guaranteed then we could be much more confident that
the idea here of simply checking for an ascending LSN, which falls
between the starting LSN of the backup and the current LSN (or perhaps
the final LSN for the backup) would be sufficient to detect this case.
I would also think that, if we can trust that, then there really isn't
any need for the delay in performing the re-read, which I have to admit
that I don't particularly care for.
> > If we can prove that, we don't actually care what
> >the contents of the page are. We certainly can't calculate the
> >checksum on a page we plucked out of shared buffers since we only
> >calculate the checksum when we go to write the page out.
> Good point. I was thinking that we could recalculate the checksum. Or
> even save the page without it, as we have checked the LSN and know for
> sure that it will be rewritten by WAL replay.
At the point that we know the page is in the WAL which must be replayed
to make this backup consistent, we could theoretically zero the page out
of the actual backup (or if we're doing some kind of incremental magic,
skip it entirely, as long as we zero-fill it on restore).
> To sum up, I agree with your proposal to reread the page and rely on
> ascending LSNs. Can you submit a patch?
Probably would make sense to give Michael an opportunity to comment and
get his thoughts on this, and for him to update the patch if he agrees.
As it relates to pgbackrest, we're currently contemplating having a
higher level loop which, upon detecting any page with an invalid
checksum, continues to scan to the end of that file and perform the
compression, encryption, et al, but then loops back after we've
completed that file and skips through the file again, re-reading those
pages which didn't have a valid checksum the first time to see if their
LSN has changed and is within the range of the backup. This will
certainly give more opportunity for the kernel to 'catch up', if needed,
and give us an updated page without a random 100ms delay, and will also
make it easier for us to, eventually, check and make sure the page was
in the WAL that was being produced as part of the backup, to give us a
complete guarantee that the contents of this page don't matter and that
the failed checksum isn't a sign of latent storage corruption.
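In outline, that two-pass scheme might look like the following; this is
my reading of the description, not pgBackRest code, and every helper
here (read_block, remember_suspect_block, and so on) is hypothetical:

    /*
     * Sketch of the higher-level loop described above.  First pass
     * copies the file while recording suspect blocks; second pass
     * re-reads only those blocks and checks for an ascending,
     * in-range LSN.
     */
    static void
    verify_file_two_pass(int fd, BlockNumber nblocks, const char *filename,
                         XLogRecPtr backup_start_lsn,
                         XLogRecPtr backup_stop_lsn)
    {
        char        page[BLCKSZ];
        BlockNumber blkno;
        XLogRecPtr  first_lsn;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            read_block(fd, blkno, page);
            if (!page_checksum_matches(page, blkno))
                remember_suspect_block(blkno, PageGetLSN(page));
            copy_to_backup(page);   /* compression, encryption, et al */
        }

        /* Second pass, after the whole file has been processed. */
        while (next_suspect_block(&blkno, &first_lsn))
        {
            read_block(fd, blkno, page);
            if (!page_checksum_matches(page, blkno) &&
                !lsn_plausibly_concurrent(first_lsn, PageGetLSN(page),
                                          backup_start_lsn,
                                          backup_stop_lsn))
                report_corruption(filename, blkno);
        }
    }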
Thanks,
Stephen
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-24 01:10:43 |
Message-ID: | 20201124011043.GA3046@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a(dot)lubennikova(at)postgrespro(dot)ru) wrote:
>> It seems reasonable to me to rely on checksums only.
>>
>> As for retry, I think that an API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from shared
>> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
>> solution to me. Do you see any possible problems with it?
It seems to me that you are missing the point here. It is not
necessary to read a page from shared buffers. What is necessary is to
make sure that there is zero concurrent I/O activity in shared buffers
while a page is getting checked on disk, giving the assurance that
there is zero risk of having a torn page during the check, for anything
working with shared buffers. You could do that only on a retry if we
found a page where there was a checksum mismatch, meaning that the
page was either torn or corrupted, but needs an extra verification
anyway.
> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great, and this also seems likely to be
> awkward for cases which aren't using the replication protocol, unless
> every process maintains a connection to PG the entire time, which also
> doesn't seem great.
I don't quite see a problem in checking pages that have been just
evicted if we are able to detect faster that a page is corrupted,
because the initial check may fail because a page was torn, meaning
that it was in the middle of an eviction, but the page could also be
corrupted, meaning also that it was *not* torn, and would fail a retry
where we should make sure that there is no concurrent s_b activity.
So in the worst case, you make the detection of a corrupted page
faster.
Please note that Andres also mentioned the potential need to
worry about table AMs that call smgrwrite() directly, bypassing shared
buffers. The only cases in-core where it is used are related to init
forks when an unlogged relation gets created, where it would not
matter if you are doing a page check while holding a database
transaction as the newly-created relation would not be visible yet,
but it would matter in the case of base backups doing direct page
lookups. Fun.
> Also- what is the point of reading the page from shared buffers
> anyway..? All we need to do is prove that the page will be rewritten
> during WAL replay. If we can prove that, we don't actually care what
> the contents of the page are. We certainly can't calculate the
> checksum on a page we plucked out of shared buffers since we only
> calculate the checksum when we go to write the page out.
An LSN-based check makes the thing tricky. How do you make sure that
pd_lsn is not itself broken? It could be perfectly possible that a
random on-disk corruption makes pd_lsn seem to have a correct value,
while the rest of the page is borked.
--
Michael
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-24 01:28:06 |
Message-ID: | 20201124012806.GB3046@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a(dot)lubennikova(at)postgrespro(dot)ru) wrote:
>> Yes, and this is a tricky part. Until you explained it in your latest
>> message, I wasn't sure how we could distinguish a concurrent update from
>> a page header corruption. Now I agree that if the page LSN updated and
>> increased between rereads, it is safe enough to conclude that we have
>> some concurrent load.
>
> Even in this case, it's almost free to compare the LSN to the starting
> backup LSN, and to the current LSN position, and make sure it's
> somewhere between the two. While that doesn't entirely eliminate the
> possibility that the page happened to get corrupted *and* return a
> different result on subsequent reads *and* that it was corrupted in such
> a way that the LSN ended up falling between the starting backup LSN and
> the current LSN, it's certainly reducing the chances of a false negative
> a fair bit.
FWIW, I am not much of a fan of designs that are not bullet-proof by
design. This reduces the odds of problems, sure, still it does not
discard the possibility of incorrect results, confusing users as well
as people looking at such reports.
>> To sum up, I agree with your proposal to reread the page and rely on
>> ascending LSNs. Can you submit a patch?
>
> Probably would make sense to give Michael an opportunity to comment and
> get his thoughts on this, and for him to update the patch if he agrees.
I think that an LSN check would be a safe thing to do iff pd_checksum
is already checked first to make sure that the page contents are fine
to use. Still, what's the point in doing an LSN check anyway if we
know that the checksum is valid? Then on a retry if the first attempt
failed you also need the guarantee that there is zero concurrent I/O
activity while a page is rechecked (no need to do that unless the
initial page check doing a checksum match failed). So the retry needs
to do some s_b interactions, but then comes the much trickier point of
concurrent smgrwrite() calls bypassing the shared buffers.
> As it relates to pgbackrest, we're currently contemplating having a
> higher level loop which, upon detecting any page with an invalid
> checksum, continues to scan to the end of that file and perform the
> compression, encryption, et al, but then loops back after we've
> completed that file and skips through the file again, re-reading those
> pages which didn't have a valid checksum the first time to see if their
> LSN has changed and is within the range of the backup. This will
> certainly give more opportunity for the kernel to 'catch up', if needed,
> and give us an updated page without a random 100ms delay, and will also
> make it easier for us to, eventually, check and make sure the page was
> in the WAL that was being produced as part of the backup, to give us a
> complete guarantee that the contents of this page don't matter and that
> the failed checksum isn't a sign of latent storage corruption.
That would reduce the likelihood of facing torn pages, still you
cannot fully discard the problem either as the same page may get changed
again once you loop over, no? And what if a corruption has updated
pd_lsn on-disk? Unlikely, but still possible.
--
Michael
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-24 01:54:41 |
Message-ID: | CAOuzzgpJ1X7oXZXGy+Rkv52M=ZiRRWUF5gkD99=cCswwt5idRA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
On Mon, Nov 23, 2020 at 20:28 Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> > * Anastasia Lubennikova (a(dot)lubennikova(at)postgrespro(dot)ru) wrote:
> >> Yes, and this is a tricky part. Until you explained it in your latest
> >> message, I wasn't sure how we could distinguish a concurrent update
> >> from a page header corruption. Now I agree that if the page LSN updated
> >> and increased between rereads, it is safe enough to conclude that we
> >> have some concurrent load.
> >
> > Even in this case, it's almost free to compare the LSN to the starting
> > backup LSN, and to the current LSN position, and make sure it's
> > somewhere between the two. While that doesn't entirely eliminate the
> > possibility that the page happened to get corrupted *and* return a
> > different result on subsequent reads *and* that it was corrupted in such
> > a way that the LSN ended up falling between the starting backup LSN and
> > the current LSN, it's certainly reducing the chances of a false negative
> > a fair bit.
>
> FWIW, I am not much of a fan of designs that are not bullet-proof by
> design. This reduces the odds of problems, sure, still it does not
> discard the possibility of incorrect results, confusing users as well
> as people looking at such reports.
Let’s be clear about this- our checksums are, themselves, far from
bulletproof, regardless of all of our other efforts. They are not
foolproof against any corruption, and certainly not even close to being
sufficient for guarantees you’d expect in, say, encryption integrity. We
cannot say with certainty that a page which passes checksum validation
isn’t corrupted in some way. A page which doesn’t pass checksum validation
may be corrupted or may be torn and we aren't 100% sure of that either, but
can work to try and make a sensible call about which it is.
>> To sum up, I agree with your proposal to reread the page and rely on
> >> ascending LSNs. Can you submit a patch?
> >
> > Probably would make sense to give Michael an opportunity to comment and
> > get his thoughts on this, and for him to update the patch if he agrees.
>
> I think that an LSN check would be a safe thing to do iff pd_checksum
> is already checked first to make sure that the page contents are fine
> to use. Still, what's the point in doing an LSN check anyway if we
> know that the checksum is valid? Then on a retry if the first attempt
> failed you also need the guarantee that there is zero concurrent I/O
> activity while a page is rechecked (no need to do that unless the
> initial page check doing a checksum match failed). So the retry needs
> to do some s_b interactions, but then comes the much trickier point of
> concurrent smgrwrite() calls bypassing the shared buffers.
I agree that the LSN check isn’t interesting if the page passes the
checksum validation. I do think we can look at the LSN and make reasonable
inferences based off of it even if the checksum doesn’t validate- in
particular, in my experience at least, the result of a read, without any
intervening write, is very likely to be the same if performed multiple
times quickly even if there is latent storage corruption- due to caching,
if nothing else. What’s interesting about the LSN check is that we are
specifically looking to see if it *changed* in a reasonable and predictable
manner, and that it was replaced with a new yet reasonable value. The
chances of that happening due to latent storage corruption is vanishingly
small.
> As it relates to pgbackrest, we're currently contemplating having a
> > higher level loop which, upon detecting any page with an invalid
> > checksum, continues to scan to the end of that file and perform the
> > compression, encryption, et al, but then loops back after we've
> > completed that file and skips through the file again, re-reading those
> > pages which didn't have a valid checksum the first time to see if their
> > LSN has changed and is within the range of the backup. This will
> > certainly give more opportunity for the kernel to 'catch up', if needed,
> > and give us an updated page without a random 100ms delay, and will also
> > make it easier for us to, eventually, check and make sure the page was
> > in the WAL that was produced as part of the backup, to give us a
> > complete guarantee that the contents of this page don't matter and that
> > the failed checksum isn't a sign of latent storage corruption.
>
> That would reduce the likelihood of facing torn pages, still you
> cannot fully discard the problem either, as the same page may get
> changed again once you loop over, no? And what if a corruption has
> updated pd_lsn on-disk? Unlikely, sure, but still possible.
We surely don’t care about a page which has been changed multiple times by
PG during the backup, since all those changes will be, by definition, in
the WAL, no? Therefore, one loop to see that the value of the LSN
*changed*, meaning something wrote something new there, with a cross-check
to see that the LSN was in the expected range, goes an awfully long way
toward assuring that this isn’t a case of latent storage corruption. If
there is an attacker who is not the PG process but who is modifying files
then, yes, that’s a risk, and it won’t be picked up by this, but why would
they create an invalid checksum in the first place?
We aren’t attempting to protect against a sophisticated attack, we are
trying to detect latent storage corruption.
I would also ask for clarification as to whether you feel that checking
the WAL for the page would be insufficient somehow, since I mentioned that
as also being on the roadmap. If there’s some reason that checking the WAL
for the page wouldn’t be sufficient, I am anxious to understand that
reasoning.
Thanks,
Stephen
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz>, Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-24 17:38:30 |
Message-ID: | 8a4df8ee-0381-26ad-d09a-0367f03914a1@pgmasters.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Michael,
On 11/23/20 8:10 PM, Michael Paquier wrote:
> On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
>
>> Also- what is the point of reading the page from shared buffers
>> anyway..? All we need to do is prove that the page will be rewritten
>> during WAL replay. If we can prove that, we don't actually care what
>> the contents of the page are. We certainly can't calculate the
>> checksum on a page we plucked out of shared buffers since we only
>> calculate the checksum when we go to write the page out.
>
> An LSN-based check makes the thing tricky. How do you make sure that
> pd_lsn is not itself broken? It could be perfectly possible that a
> random on-disk corruption makes pd_lsn appear to have a correct value,
> while the rest of the page is borked.
We are not just looking at one LSN value. Here are the steps we are
proposing (I'll skip checks for zero pages here):
1) Test the page checksum. If it passes the page is OK.
2) If the checksum does not pass then record the page offset and LSN and
continue.
3) After the file is copied, reopen and reread the file, seeking to
offsets where possibly invalid pages were recorded in the first pass.
a) If the page is now valid then it is OK.
b) If the page is not valid but the LSN has increased from the LSN
recorded in the previous pass then it is OK. We can infer this because
the LSN has been updated in a way that is not consistent with storage
corruption.
This is what we are planning for the first round of improving our page
checksum validation. We believe that doing the retry in a second pass
will be faster and more reliable because some time will have passed
since the first read without having to build in a delay for each page error.
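To make the two passes concrete, here is a minimal sketch in C. This is
not pgbackrest's actual code: read_block(), page_checksum_ok() and
page_lsn() are hypothetical stand-ins for the real I/O, checksum and
page-header routines.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>

    #define BLCKSZ 8192

    typedef struct
    {
        off_t    offset;        /* file offset of the suspect page */
        uint64_t lsn;           /* pd_lsn recorded during the first pass */
    } SuspectPage;

    /* Hypothetical helpers standing in for the real implementations. */
    extern bool read_block(FILE *f, off_t offset, char *buf);
    extern bool page_checksum_ok(const char *page, off_t offset);
    extern uint64_t page_lsn(const char *page);

    /* Second pass: revisit only the pages recorded as invalid earlier. */
    static void
    recheck_pages(FILE *f, const SuspectPage *suspects, int nsuspects)
    {
        char page[BLCKSZ];

        for (int i = 0; i < nsuspects; i++)
        {
            /* Short read: the file was truncated, nothing to report. */
            if (!read_block(f, suspects[i].offset, page))
                continue;

            /* 3a: the page is valid now, so the first read was torn. */
            if (page_checksum_ok(page, suspects[i].offset))
                continue;

            /* 3b: the LSN advanced, so PG rewrote the page and the new
             * contents are covered by WAL replay. */
            if (page_lsn(page) > suspects[i].lsn)
                continue;

            fprintf(stderr, "possible corruption at offset %lld\n",
                    (long long) suspects[i].offset);
        }
    }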
A further improvement is to check the ascending LSNs found in 3b against
PostgreSQL to be completely sure they are valid. We are planning this
for our second round of improvements.
Reopening the file for the second pass does require some additional
logic (a sketch follows below):
1) The file may have been deleted by PG since the first pass, and in
that case we won't report any page errors.
2) The file may have been truncated by PG since the first pass, so we
won't report any errors past the point of truncation.
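A sketch of that reopen logic, again with hypothetical names; a vanished
file is skipped entirely, and truncation shows up later as a short read
past the new end of file:

    #include <errno.h>
    #include <stdio.h>

    /* Returns NULL when the file was deleted since the first pass, which
     * is expected during an online backup and is not an error. */
    static FILE *
    reopen_for_recheck(const char *path)
    {
        FILE *f = fopen(path, "rb");

        if (f == NULL && errno != ENOENT)
            fprintf(stderr, "could not reopen %s\n", path);

        /* NULL with ENOENT: case 1, relation dropped, skip its suspects.
         * Case 2, truncation, surfaces as a short read past the new EOF. */
        return f;
    }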
A malicious attacker could easily trick these checks, but as Stephen
pointed out elsewhere they would likely make the checksums valid which
would escape detection anyway.
We believe that the chances of random storage corruption passing all
these checks are incredibly small, but eventually we'll also check
against the WAL to be completely sure.
Regards,
--
-David
david(at)pgmasters(dot)net
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-26 07:42:37 |
Message-ID: | X79cbS/zGQy8fWSu@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> We are not just looking at one LSN value. Here are the steps we are
> proposing (I'll skip checks for zero pages here):
>
> 1) Test the page checksum. If it passes the page is OK.
> 2) If the checksum does not pass then record the page offset and LSN and
> continue.
But here the checksum is broken, so while the offset is something we
can rely on, how do you make sure that the LSN is fine? A broken
checksum could perfectly mean that the LSN is actually *not* fine if
the page header got corrupted.
> 3) After the file is copied, reopen and reread the file, seeking to offsets
> where possibly invalid pages were recorded in the first pass.
> a) If the page is now valid then it is OK.
> b) If the page is not valid but the LSN has increased from the LSN
Per the previous point, the LSN value is something we cannot rely on.
> A malicious attacker could easily trick these checks, but as Stephen pointed
> out elsewhere they would likely make the checksums valid which would escape
> detection anyway.
>
> We believe that the chances of random storage corruption passing all these
> checks is incredibly small, but eventually we'll also check against the WAL
> to be completely sure.
The lack of a check for any concurrent I/O on the follow-up retries is
disturbing. How do you guarantee that on the second retry what you
have is a torn page and not something corrupted? Init forks, for
example, are made of up to 2 blocks, so the window would be short for
at least those. There are many instances with tables that have few
pages as well.
--
Michael
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-26 08:13:59 |
Message-ID: | CABUevEwS7o01gBDPya88Vf4RMjmEcTv66wN+vqJ=jK-amqrnEA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > We are not just looking at one LSN value. Here are the steps we are
> > proposing (I'll skip checks for zero pages here):
> >
> > 1) Test the page checksum. If it passes the page is OK.
> > 2) If the checksum does not pass then record the page offset and LSN and
> > continue.
>
> But here the checksum is broken, so while the offset is something we
> can rely on, how do you make sure that the LSN is fine? A broken
> checksum could perfectly mean that the LSN is actually *not* fine if
> the page header got corrupted.
>
> > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > where possibly invalid pages were recorded in the first pass.
> > a) If the page is now valid then it is OK.
> > b) If the page is not valid but the LSN has increased from the LSN
>
> Per the previous point, the LSN value is something we cannot rely on.
We cannot rely on the LSN itself. But it's a lot more likely that we
can rely on the LSN changing, and on the LSN changing in a "correct
way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
*doesn't change* we know it's corrupt. But if the LSN *increases* AND
the new page now has a correct checksum, it's most likely to be
correct. You could perhaps even put a cap on it saying "if the LSN
increased, but by less than <n>", where <n> is a sufficiently high number
that it's entirely unreasonable to advance that far between the
reading of two blocks. But it has to have a very high margin in that
case.
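Expressed as code (my sketch, not anything from the patch), that retry
decision reduces to something like this, with upper_bound playing the
role of <n> and both bounds being assumptions the caller must supply:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* a WAL position, as in PG itself */

    /* Decide whether a still-invalid page moved plausibly between two
     * reads; upper_bound must be chosen with a very high margin, per
     * the above. */
    static bool
    lsn_moved_plausibly(XLogRecPtr first_read, XLogRecPtr second_read,
                        XLogRecPtr upper_bound)
    {
        if (second_read < first_read)
            return false;       /* decreased: corrupt */
        if (second_read == first_read)
            return false;       /* unchanged: corrupt (or same torn page) */
        return second_read <= upper_bound;  /* advanced, within reason */
    }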
> > A malicious attacker could easily trick these checks, but as Stephen pointed
> > out elsewhere they would likely make the checksums valid which would escape
> > detection anyway.
> >
> > We believe that the chances of random storage corruption passing all these
> > checks are incredibly small, but eventually we'll also check against the WAL
> > to be completely sure.
>
> The lack of a check for any concurrent I/O on the follow-up retries is
> disturbing. How do you guarantee that on the second retry what you
> have is a torn page and not something corrupted? Init forks, for
> example, are made of up to 2 blocks, so the window would be short for
> at least those. There are many instances with tables that have few
> pages as well.
Here I was more worried that the window might get *too long* if tables
are large :)
The risk is certainly that you get a torn page *again* on the second
read. It could be the same torn page (if it hasn't changed), but you
can detect that (by the fact that it hasn't actually changed) and
possibly do a short delay before trying again if it gets that far.
That could happen if the process is too quick. It could also be that
you are unlucky and that you hit a *new* write, and you were so
unlucky that both times it happened to hit exactly when you were
reading the page the next time. I'm not sure the chance of that
happening is even big enough that we have to care about it, though?
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, David Steele <david(at)pgmasters(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-27 16:15:27 |
Message-ID: | 20201127161527.GF16415@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > > We are not just looking at one LSN value. Here are the steps we are
> > > proposing (I'll skip checks for zero pages here):
> > >
> > > 1) Test the page checksum. If it passes the page is OK.
> > > 2) If the checksum does not pass then record the page offset and LSN and
> > > continue.
> >
> > But here the checksum is broken, so while the offset is something we
> > can rely on, how do you make sure that the LSN is fine? A broken
> > checksum could perfectly mean that the LSN is actually *not* fine if
> > the page header got corrupted.
Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable
bounds.
> > > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > > where possibly invalid pages were recorded in the first pass.
> > > a) If the page is now valid then it is OK.
> > > b) If the page is not valid but the LSN has increased from the LSN
> >
> > Per the previous point, the LSN value is something we cannot rely on.
>
> We cannot rely on the LSN itself. But it's a lot more likely that we
> can rely on the LSN changing, and on the LSN changing in a "correct
> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
> the new page now has a correct checksum, it's most likely to be
> correct. You could perhaps even put a cap on it saying "if the LSN
> increased, but by less than <n>", where <n> is a sufficiently high number
> that it's entirely unreasonable to advance that far between the
> reading of two blocks. But it has to have a very high margin in that
> case.
This is, in fact, included in what was proposed- the "max increase"
would be "the ending LSN of the backup". I don't think we can make it
any tighter than that though without risking false positives, which is
surely worse than a false negative in this particular case- we already
risk false negatives due to the fact that our checksum isn't perfect, so
even a perfect check to make sure that the page will, in fact, be
replayed over during crash recovery doesn't guarantee that there's no
corruption.
> > > A malicious attacker could easily trick these checks, but as Stephen pointed
> > > out elsewhere they would likely make the checksums valid which would escape
> > > detection anyway.
> > >
> > > We believe that the chances of random storage corruption passing all these
> > > checks are incredibly small, but eventually we'll also check against the WAL
> > > to be completely sure.
> >
> > The lack of a check for any concurrent I/O on the follow-up retries is
> > disturbing. How do you guarantee that on the second retry what you
> > have is a torn page and not something corrupted? Init forks, for
> > example, are made of up to 2 blocks, so the window would be short for
> > at least those. There are many instances with tables that have few
> > pages as well.
If there's an easy and cheap way to see if there was concurrent I/O
happening for the page, then let's hear it. One idea that has occurred
to me which hasn't been discussed is checking the file's mtime to see if
it's changed since the backup started. In that case, I would think it'd
be something like:
- Checksum is invalid
- LSN is within range
- Close file
- Stat file
- If mtime is from before the backup then signal possible corruption
If the checksum is invalid and the LSN isn't in range, then signal
corruption.
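As a sketch of that sequence in C (backup_start_time is an assumed
parameter, captured when the backup began):

    #include <stdbool.h>
    #include <sys/stat.h>
    #include <time.h>

    /* Called only after the checksum failed and the LSN fell within
     * range.  If the file has not been written since the backup started,
     * a concurrent write cannot explain the bad checksum, so flag it. */
    static bool
    mtime_suggests_corruption(const char *path, time_t backup_start_time)
    {
        struct stat st;

        if (stat(path, &st) != 0)
            return false;   /* deleted or unreadable: handled elsewhere */

        return st.st_mtime < backup_start_time;
    }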
In general, however, I don't like the idea of reaching into PG and
asking PG for this page.
> Here I was more worried that the window might get *too long* if tables
> are large :)
I'm not sure that there's really a 'too long' possibility here.
> The risk is certainly that you get a torn page *again* on the second
> read. It could be the same torn page (if it hasn't changed), but you
> can detect that (by the fact that it hasn't actually changed) and
> possibly do a short delay before trying again if it gets that far.
I'm really not a fan of introducing these delays in the hopes that
they'll work.
> That could happen if the process is too quick. It could also be that
> you are unlucky and that you hit a *new* write, and you were so
> unlucky that both times it happened to hit exactly when you were
> reading the page the next time. I'm not sure the chance of that
> happening is even big enough we have to care about it, though?
If there's actually a new write, surely the LSN would be new? At the
least, it wouldn't be the same LSN as the first read that picked up a
torn page.
In general though, I agree, we are getting to the point here where the
chances of missing something with this approach seem extremely slim. I
do still like the idea of doing better by actually scanning the WAL, but
at least for now, this is far better than what we have today while not
introducing a huge amount of additional code or complexity.
Thanks,
Stephen
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, David Steele <david(at)pgmasters(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-29 06:06:51 |
Message-ID: | X8M6e4nhjMR2e8t6@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
> * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>>> But here the checksum is broken, so while the offset is something we
>>> can rely on, how do you make sure that the LSN is fine? A broken
>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>> the page header got corrupted.
>
> Of course that could be the case, but it gets to be a smaller and
> smaller chance by checking that the LSN read falls within reasonable
> bounds.
FWIW, I find that scary.
>> We cannot rely on the LSN itself. But it's a lot more likely that we
>> can rely on the LSN changing, and on the LSN changing in a "correct
>> way". That is, if the LSN *decreases* we know it's corrupt. If the LSN
>> *doesn't change* we know it's corrupt. But if the LSN *increases* AND
>> the new page now has a correct checksum, it's most likely to be
>> correct. You could perhaps even put a cap on it saying "if the LSN
>> increased, but by less than <n>", where <n> is a sufficiently high number
>> that it's entirely unreasonable to advance that far between the
>> reading of two blocks. But it has to have a very high margin in that
>> case.
>
> This is, in fact, included in what was proposed- the "max increase"
> would be "the ending LSN of the backup". I don't think we can make it
> any tighter than that though without risking false positives, which is
> surely worse than a false negative in this particular case- we already
> risk false negatives due to the fact that our checksum isn't perfect, so
> even a perfect check to make sure that the page will, in fact, be
> replayed over during crash recovery doesn't guarantee that there's no
> corruption.
>
> If there's an easy and cheap way to see if there was concurrent I/O
> happening for the page, then let's hear it.
This has been discussed for a couple of months now. I would recommend
going through this thread:
/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com
And this bit is interesting, because that would give the guarantees
you are looking for with a page retry (just grep for BM_IO_IN_PROGRESS
on the thread):
/message-id/20201102193457.fc2hoen7ahth4bbc@alap3.anarazel.de
> One idea that has occurred
> to me which hasn't been discussed is checking the file's mtime to see if
> it's changed since the backup started. In that case, I would think it'd
> be something like:
>
> - Checksum is invalid
> - LSN is within range
> - Close file
> - Stat file
> - If mtime is from before the backup then signal possible corruption
I suspect that relying on mtime may cause problems. One case coming
to my mind is NFS.
> In general, however, I don't like the idea of reaching into PG and
> asking PG for this page.
It seems to me that if we don't ask PG what it thinks about a page,
we will never have a fully bullet-proof design either.
--
Michael
From: | Stephen Frost <sfrost(at)snowman(dot)net> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, David Steele <david(at)pgmasters(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-30 14:27:12 |
Message-ID: | 20201130142712.GE16415@tamriel.snowman.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Greetings,
* Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
> > * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> >> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >>> But here the checksum is broken, so while the offset is something we
> >>> can rely on, how do you make sure that the LSN is fine? A broken
> >>> checksum could perfectly mean that the LSN is actually *not* fine if
> >>> the page header got corrupted.
> >
> > Of course that could be the case, but it gets to be a smaller and
> > smaller chance by checking that the LSN read falls within reasonable
> > bounds.
>
> FWIW, I find that scary.
There's ultimately different levels of 'scary' and the risk here that
something is actually wrong following these checks strikes me as being
on the same order as random bits being flipped in the page and still
getting a valid checksum (which is entirely possible with our current
checksum...), or maybe even less. Both cases would result in a false
negative, which is surely bad, though that strikes me as better than a
false positive, where we say there's corruption when there isn't.
> And this bit is interesting, because that would give the guarantees
> you are looking for with a page retry (just grep for BM_IO_IN_PROGRESS
> on the thread):
> /message-id/20201102193457.fc2hoen7ahth4bbc@alap3.anarazel.de
There's no guarantee that the page is still in shared buffers, or that we
still have a buffer descriptor for it, by the time we're doing this, as I
said up-thread. This approach requires that we reach into PG, acquire at
least a buffer descriptor, set BM_IO_IN_PROGRESS on it, and then read the
page again and checksum it again before finally looking at the now
'trusted' LSN (even though it might have had some bits flipped in it and
we wouldn't know) and seeing if it's higher than the start of the backup,
and maybe less than the current LSN. Maybe we can avoid actually pulling
the page into shared buffers (reading it into our own memory instead) and
just have the buffer descriptor, but none of this seems like it's going
to be very unobtrusive in either code or the running system, and it isn't
going to give us an actual guarantee that there's been no corruption. The
amount that it improves on the checks that I outline above seems to be
exceedingly small, and the question is whether it's worth it for, most
likely, exclusively pg_basebackup (unless we're going to figure out a way
to expose this via SQL, which seems unlikely).
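For illustration only, the sequence being discussed would look roughly
like the following; every helper here is hypothetical, since the real
BM_IO_IN_PROGRESS machinery is private to bufmgr.c and none of it is
exposed to frontend code:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    /* Hypothetical stand-ins for buffer-descriptor acquisition, the
     * BM_IO_IN_PROGRESS interlock, and a locked re-read plus checksum. */
    extern bool acquire_io_interlock(uint32_t blkno);
    extern void release_io_interlock(uint32_t blkno);
    extern bool reread_page_checksum_ok(uint32_t blkno, XLogRecPtr *lsn);

    static bool
    recheck_with_interlock(uint32_t blkno, XLogRecPtr backup_start,
                           XLogRecPtr current_lsn)
    {
        XLogRecPtr lsn;
        bool       valid;

        if (!acquire_io_interlock(blkno))
            return false;       /* no descriptor: this path is unusable */

        valid = reread_page_checksum_ok(blkno, &lsn);
        release_io_interlock(blkno);

        /* Even here, pd_lsn could itself carry flipped bits, as noted. */
        return valid || (lsn > backup_start && lsn <= current_lsn);
    }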
> > One idea that has occurred
> > to me which hasn't been discussed is checking the file's mtime to see if
> > it's changed since the backup started. In that case, I would think it'd
> > be something like:
> >
> > - Checksum is invalid
> > - LSN is within range
> > - Close file
> > - Stat file
> > - If mtime is from before the backup then signal possible corruption
>
> I suspect that relying on mtime may cause problems. One case coming
> to my mind is NFS.
I agree that it might not be perfect, but it also seems like something
which could be checked reasonably cheaply, and the window (between when
the backup started and the time we hit this torn page) is very likely to
be large enough that the mtime will have been updated and be different
(and forward, if it was modified) from what it was at the time the backup
started. It's also something that incremental backups may be looking
at, so if there are serious problems with it then there's a good chance
you've got bigger issues.
> > In general, however, I don't like the idea of reaching into PG and
> > asking PG for this page.
>
> It seems to me that if we don't ask PG what it thinks about a page,
> we will never have a fully bullet-proof design either.
None of this is bullet-proof; it's all trade-offs.
Thanks,
Stephen
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2020-11-30 23:38:07 |
Message-ID: | deb9ef07-d50a-7073-5eaa-9626a843bc70@pgmasters.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 11/30/20 9:27 AM, Stephen Frost wrote:
> Greetings,
>
> * Michael Paquier (michael(at)paquier(dot)xyz) wrote:
>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
>>> * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
>>>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>>>>> But here the checksum is broken, so while the offset is something we
>>>>> can rely on, how do you make sure that the LSN is fine? A broken
>>>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>>>> the page header got corrupted.
>>>
>>> Of course that could be the case, but it gets to be a smaller and
>>> smaller chance by checking that the LSN read falls within reasonable
>>> bounds.
>>
>> FWIW, I find that scary.
>
> There's ultimately different levels of 'scary' and the risk here that
> something is actually wrong following these checks strikes me as being
> on the same order as random bits being flipped in the page and still
> getting a valid checksum (which is entirely possible with our current
> checksum...), or maybe even less.
I would say a lot less. First you'd need to corrupt one of the eight
bytes that make up the LSN (pretty likely since corruption will probably
affect the entire block) and then it would need to be updated to a value
that falls within the current backup range, a 1 in 16 million chance if
a terabyte of WAL is generated during the backup. Plus, the corruption
needs to happen during the backup since we are going to check for that,
and the corrupted LSN needs to be ascending, and the LSN originally read
needs to be within the backup range (another 1 in 16 million chance)
since pages written before the start backup checkpoint should not be torn.
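(For concreteness, the arithmetic behind the 1-in-16-million figure,
assuming a corrupted pd_lsn is effectively a uniformly random 64-bit
value: a terabyte of WAL spans an LSN range of 2^40 bytes out of a 2^64
LSN space, so the chance of landing inside that range is 2^40 / 2^64 =
2^-24, roughly 1 in 16.8 million.)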
So as far as I can see there are more likely to be false negatives from
the checksum itself.
It would also be easy to add a few rounds of checks, i.e. test if the
LSN ascends but stays in the backup LSN range N times.
Honestly, I'm much more worried about corruption zeroing the entire
page. I don't know how likely that is, but I know none of our proposed
solutions would catch it.
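To illustrate why: a page of all zero bytes is exactly what a freshly
extended, never-written page looks like, so a verifier has to accept it.
A sketch of the test (my code, not any of the proposed patches):

    #include <stdbool.h>
    #include <stddef.h>

    #define BLCKSZ 8192

    /* An all-zeroes page is indistinguishable from a legitimately new
     * page, so it must be treated as valid; corruption that zeroes a
     * whole page is therefore invisible to this kind of check. */
    static bool
    page_is_all_zeroes(const char *page)
    {
        for (size_t i = 0; i < BLCKSZ; i++)
            if (page[i] != 0)
                return false;
        return true;
    }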
Andres, since you brought this issue up originally perhaps you'd like to
weigh in?
Regards,
--
-David
david(at)pgmasters(dot)net
From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2021-03-09 17:43:46 |
Message-ID: | 41799223-7d65-dd52-a3a7-182d04e515cb@pgmasters.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 11/30/20 6:38 PM, David Steele wrote:
> On 11/30/20 9:27 AM, Stephen Frost wrote:
>> * Michael Paquier (michael(at)paquier(dot)xyz) wrote:
>>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
>>>> * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
>>>>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier
>>>>> <michael(at)paquier(dot)xyz> wrote:
>>>>>> But here the checksum is broken, so while the offset is something we
>>>>>> can rely on, how do you make sure that the LSN is fine? A broken
>>>>>> checksum could perfectly mean that the LSN is actually *not* fine if
>>>>>> the page header got corrupted.
>>>>
>>>> Of course that could be the case, but it gets to be a smaller and
>>>> smaller chance by checking that the LSN read falls within reasonable
>>>> bounds.
>>>
>>> FWIW, I find that scary.
>>
>> There's ultimately different levels of 'scary' and the risk here that
>> something is actually wrong following these checks strikes me as being
>> on the same order as random bits being flipped in the page and still
>> getting a valid checksum (which is entirely possible with our current
>> checksum...), or maybe even less.
>
> I would say a lot less. First you'd need to corrupt one of the eight
> bytes that make up the LSN (pretty likely since corruption will probably
> affect the entire block) and then it would need to be updated to a value
> that falls within the current backup range, a 1 in 16 million chance if
> a terabyte of WAL is generated during the backup. Plus, the corruption
> needs to happen during the backup since we are going to check for that,
> and the corrupted LSN needs to be ascending, and the LSN originally read
> needs to be within the backup range (another 1 in 16 million chance)
> since pages written before the start backup checkpoint should not be torn.
>
> So as far as I can see there are more likely to be false negatives from
> the checksum itself.
>
> It would also be easy to add a few rounds of checks, i.e. test if the
> LSN ascends but stays in the backup LSN range N times.
>
> Honestly, I'm much more worried about corruption zeroing the entire
> page. I don't know how likely that is, but I know none of our proposed
> solutions would catch it.
>
> Andres, since you brought this issue up originally perhaps you'd like to
> weigh in?
I had another look at this patch and though I think my suggestions above
would improve the patch, I have no objections to going forward as is (if
that is the consensus) since this seems an improvement over what we have
now.
It comes down to whether you prefer false negatives or false positives.
With the LSN checking Stephen and I advocate, it is theoretically
possible to have a false negative, but the chances of the LSN ascending N
times while staying within the backup LSN range due to corruption seem to
be approaching zero.
I think Michael's method is unlikely to throw false positives, but it
seems at least possible that a block would be hot enough to appear
torn N times in a row. Torn pages themselves are really easy to reproduce.
If we do go forward with this method I would likely propose the
LSN-based approach as a future improvement.
Regards,
--
-David
david(at)pgmasters(dot)net
From: | Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com> |
---|---|
To: | David Steele <david(at)pgmasters(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Magnus Hagander <magnus(at)hagander(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2021-07-09 20:00:20 |
Message-ID: | CALtqXTeRKZCyHzJZjsA0xTNOWtJWKRoEQ=F-qwJ3+kGuAxW5Jw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Mar 9, 2021 at 10:43 PM David Steele <david(at)pgmasters(dot)net> wrote:
> On 11/30/20 6:38 PM, David Steele wrote:
> > On 11/30/20 9:27 AM, Stephen Frost wrote:
> >> * Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> >>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
> >>>> * Magnus Hagander (magnus(at)hagander(dot)net) wrote:
> >>>>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier
> >>>>> <michael(at)paquier(dot)xyz> wrote:
> >>>>>> But here the checksum is broken, so while the offset is something we
> >>>>>> can rely on, how do you make sure that the LSN is fine? A broken
> >>>>>> checksum could perfectly mean that the LSN is actually *not* fine if
> >>>>>> the page header got corrupted.
> >>>>
> >>>> Of course that could be the case, but it gets to be a smaller and
> >>>> smaller chance by checking that the LSN read falls within reasonable
> >>>> bounds.
> >>>
> >>> FWIW, I find that scary.
> >>
> >> There's ultimately different levels of 'scary' and the risk here that
> >> something is actually wrong following these checks strikes me as being
> >> on the same order as random bits being flipped in the page and still
> >> getting a valid checksum (which is entirely possible with our current
> >> checksum...), or maybe even less.
> >
> > I would say a lot less. First you'd need to corrupt one of the eight
> > bytes that make up the LSN (pretty likely since corruption will probably
> > affect the entire block) and then it would need to be updated to a value
> > that falls within the current backup range, a 1 in 16 million chance if
> > a terabyte of WAL is generated during the backup. Plus, the corruption
> > needs to happen during the backup since we are going to check for that,
> > and the corrupted LSN needs to be ascending, and the LSN originally read
> > needs to be within the backup range (another 1 in 16 million chance)
> > since pages written before the start backup checkpoint should not be
> torn.
> >
> > So as far as I can see there are more likely to be false negatives from
> > the checksum itself.
> >
> > It would also be easy to add a few rounds of checks, i.e. test if the
> > LSN ascends but stays in the backup LSN range N times.
> >
> > Honestly, I'm much more worried about corruption zeroing the entire
> > page. I don't know how likely that is, but I know none of our proposed
> > solutions would catch it.
> >
> > Andres, since you brought this issue up originally perhaps you'd like to
> > weigh in?
>
> I had another look at this patch and though I think my suggestions above
> would improve the patch, I have no objections to going forward as is (if
> that is the consensus) since this seems an improvement over what we have
> now.
>
> It comes down to whether you prefer false negatives or false positives.
> With the LSN checking Stephen and I advocate, it is theoretically
> possible to have a false negative, but the chances of the LSN ascending N
> times while staying within the backup LSN range due to corruption seem to
> be approaching zero.
>
> I think Michael's method is unlikely to throw false positives, but it
> seems at least possible that a block would be hot enough to appear
> torn N times in a row. Torn pages themselves are really easy to reproduce.
>
> If we do go forward with this method I would likely propose the
> LSN-based approach as a future improvement.
>
> Regards,
> --
> -David
> david(at)pgmasters(dot)net
I am changing the status to "Waiting on Author" based on the latest
comments of @David Steele <david(at)pgmasters(dot)net>,
and because the patch does not apply cleanly.
http://cfbot.cputube.org/patch_33_2719.log
--
Ibrar Ahmed
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Magnus Hagander <magnus(at)hagander(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2021-09-02 11:18:47 |
Message-ID: | B3E410E7-3FFA-4590-A73C-5B736307FB31@yesql.se |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 9 Jul 2021, at 22:00, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com> wrote:
> I am changing the status to "Waiting on Author" based on the latest comments of @David Steele,
> and because the patch does not apply cleanly.
This patch hasn’t moved since being marked as WoA in the last CF and still
doesn’t apply; unless there is a new version brewing, it seems apt to close
this as RwF and await a new entry in a future CF.
--
Daniel Gustafsson https://vmware.com/
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com> |
Cc: | David Steele <david(at)pgmasters(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Magnus Hagander <magnus(at)hagander(dot)net>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2021-09-13 11:45:08 |
Message-ID: | 17C8D292-E12E-4FDF-AF1A-2BF08AB3AFC1@yesql.se |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 2 Sep 2021, at 13:18, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
>
>> On 9 Jul 2021, at 22:00, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com> wrote:
>
>> I am changing the status to "Waiting on Author" based on the latest comments of @David Steele,
>> and because the patch does not apply cleanly.
>
> This patch hasn’t moved since being marked as WoA in the last CF and still
> doesn’t apply; unless there is a new version brewing, it seems apt to close
> this as RwF and await a new entry in a future CF.
As there has been no movement, I've marked this patch as RwF.
--
Daniel Gustafsson https://vmware.com/