Quick Links

pg_rewind race condition just after promotion

Lists:	pgsql-hackers

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Stephen Frost <sfrost(at)snowman(dot)net>, Oleksii Kliukin <alexk(at)hintbits(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject:	pg_rewind race condition just after promotion
Date:	2020-12-07 18:13:25
Message-ID:	9f568c97-87fe-a716-bd39-65299b8a60f4@iki.fi
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

There's a race condition between the checkpoint at promotion and
pg_rewind. When a server is promoted, the startup process writes an
end-of-recovery checkpoint that includes the new TLI, and the server is
immediate opened for business. The startup process requests the
checkpointer process to perform a checkpoint, but it can take a few
seconds or more to complete. If you run pg_rewind, using the just
promoted server as the source, pg_rewind will think that the server is
still on the old timeline, because it only looks at TLI in the control
file's copy of the checkpoint record. That's not updated until the
checkpoint is finished.

This isn't a new issue. Stephen Frost first reported it back 2015 [1].
Back then, it was deemed just a small annoyance, and we just worked
around it in the tests by issuing a checkpoint command after promotion,
to wait for the checkpoint to finish. I just ran into it again today,
with the new pg_rewind test, and silenced it in the similar way.

I think we should fix this properly. I'm not sure if it can lead to a
broken cluster, but at least it can cause pg_rewind to fail
unnecessarily and in a user-unfriendly way. But this is actually pretty
simple to fix. pg_rewind looks at the control file to find out the
timeline the server is on. When promotion happens, the startup process
updates minRecoveryPoint and minRecoveryPointTLI fields in the control
file. We just need to read it from there. Patch attached.

I think we should also backpatch this. Back in 2015, we decided that we
can live with this, but it's always been a bit bogus, and seems simple
enough to fix.

Thoughts?

[1]
/message-id/20150428180253.GU30322%40tamriel.snowman.net

- Heikki

Attachment	Content-Type	Size
0001-pg_rewind-Fix-determining-TLI-when-server-was-just-p.patch	text/x-patch	8.6 KB

From:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To:	hlinnaka(at)iki(dot)fi
Cc:	pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net, alexk(at)hintbits(dot)com, michael(dot)paquier(at)gmail(dot)com
Subject:	Re: pg_rewind race condition just after promotion
Date:	2020-12-08 04:45:33
Message-ID:	20201208.134533.202575181178574697.horikyota.ntt@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in
> There's a race condition between the checkpoint at promotion and
> pg_rewind. When a server is promoted, the startup process writes an
> end-of-recovery checkpoint that includes the new TLI, and the server
> is immediate opened for business. The startup process requests the
> checkpointer process to perform a checkpoint, but it can take a few
> seconds or more to complete. If you run pg_rewind, using the just
> promoted server as the source, pg_rewind will think that the server is
> still on the old timeline, because it only looks at TLI in the control
> file's copy of the checkpoint record. That's not updated until the
> checkpoint is finished.
>
> This isn't a new issue. Stephen Frost first reported it back 2015
> [1]. Back then, it was deemed just a small annoyance, and we just
> worked around it in the tests by issuing a checkpoint command after
> promotion, to wait for the checkpoint to finish. I just ran into it
> again today, with the new pg_rewind test, and silenced it in the
> similar way.

I (or we) faced that and avoided that by checking for history file on
the primary.

> I think we should fix this properly. I'm not sure if it can lead to a
> broken cluster, but at least it can cause pg_rewind to fail
> unnecessarily and in a user-unfriendly way. But this is actually
> pretty simple to fix. pg_rewind looks at the control file to find out
> the timeline the server is on. When promotion happens, the startup
> process updates minRecoveryPoint and minRecoveryPointTLI fields in the
> control file. We just need to read it from there. Patch attached.

Looks fine to me. A bit concerned about making sourceHistory
needlessly file-local but on the other hand unifying sourceHistory and
targetHistory looks better.

For the test part, that change doesn't necessariry catch the failure
of the current version, but I *believe* the prevous code is the result
of an actual failure in the past so the test probablistically (or
dependently on platforms?) hits the failure if it happned.

> I think we should also backpatch this. Back in 2015, we decided that
> we can live with this, but it's always been a bit bogus, and seems
> simple enough to fix.

I don't think this changes any successful behavior and it just saves
the failure case so +1 for back-patching.

> Thoughts?
>
> [1]
> /message-id/20150428180253.GU30322%40tamriel.snowman.net

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net, alexk(at)hintbits(dot)com, michael(dot)paquier(at)gmail(dot)com
Subject:	Re: pg_rewind race condition just after promotion
Date:	2020-12-09 13:35:18
Message-ID:	22b18d17-0556-da06-3e55-b7311ac39d3d@iki.fi
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 08/12/2020 06:45, Kyotaro Horiguchi wrote:
> At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in
>> I think we should fix this properly. I'm not sure if it can lead to a
>> broken cluster, but at least it can cause pg_rewind to fail
>> unnecessarily and in a user-unfriendly way. But this is actually
>> pretty simple to fix. pg_rewind looks at the control file to find out
>> the timeline the server is on. When promotion happens, the startup
>> process updates minRecoveryPoint and minRecoveryPointTLI fields in the
>> control file. We just need to read it from there. Patch attached.
>
> Looks fine to me. A bit concerned about making sourceHistory
> needlessly file-local but on the other hand unifying sourceHistory and
> targetHistory looks better.

Looking closer, findCommonAncestorTimeline() was freeing sourceHistory,
which was pretty horrible when it's a file-local variable. I changed it
so that both the source and target histories are passed to
findCommonAncestorTimeline() as arguments. That seems more clear.

> For the test part, that change doesn't necessariry catch the failure
> of the current version, but I *believe* the prevous code is the result
> of an actual failure in the past so the test probablistically (or
> dependently on platforms?) hits the failure if it happned.

Right. I think the current test coverage is good enough. We've been
bitten by this a few times by now, when we've forgotten to add the
manual checkpoint commands to new tests, and the buildfarm has caught it
pretty quickly.

>> I think we should also backpatch this. Back in 2015, we decided that
>> we can live with this, but it's always been a bit bogus, and seems
>> simple enough to fix.
>
> I don't think this changes any successful behavior and it just saves
> the failure case so +1 for back-patching.

Thanks for the review! New patch version attached.

- Heikki

Attachment	Content-Type	Size
v2-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patch	text/x-patch	11.0 KB

From:	Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>, alexk(at)hintbits(dot)com, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject:	Re: pg_rewind race condition just after promotion
Date:	2021-03-08 13:06:48
Message-ID:	CALtqXTdD_V1+NtC3CJG=mxph4MskRbu8gyASSMxGSBn76Sp7ng@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 9, 2020 at 6:35 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 08/12/2020 06:45, Kyotaro Horiguchi wrote:
> > At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
> wrote in
> >> I think we should fix this properly. I'm not sure if it can lead to a
> >> broken cluster, but at least it can cause pg_rewind to fail
> >> unnecessarily and in a user-unfriendly way. But this is actually
> >> pretty simple to fix. pg_rewind looks at the control file to find out
> >> the timeline the server is on. When promotion happens, the startup
> >> process updates minRecoveryPoint and minRecoveryPointTLI fields in the
> >> control file. We just need to read it from there. Patch attached.
> >
> > Looks fine to me. A bit concerned about making sourceHistory
> > needlessly file-local but on the other hand unifying sourceHistory and
> > targetHistory looks better.
>
> Looking closer, findCommonAncestorTimeline() was freeing sourceHistory,
> which was pretty horrible when it's a file-local variable. I changed it
> so that both the source and target histories are passed to
> findCommonAncestorTimeline() as arguments. That seems more clear.
>
> > For the test part, that change doesn't necessariry catch the failure
> > of the current version, but I *believe* the prevous code is the result
> > of an actual failure in the past so the test probablistically (or
> > dependently on platforms?) hits the failure if it happned.
>
> Right. I think the current test coverage is good enough. We've been
> bitten by this a few times by now, when we've forgotten to add the
> manual checkpoint commands to new tests, and the buildfarm has caught it
> pretty quickly.
>
> >> I think we should also backpatch this. Back in 2015, we decided that
> >> we can live with this, but it's always been a bit bogus, and seems
> >> simple enough to fix.
> >
> > I don't think this changes any successful behavior and it just saves
> > the failure case so +1 for back-patching.
>
> Thanks for the review! New patch version attached.
>
> - Heikki
>

The patch does not apply successfully

http://cfbot.cputube.org/patch_32_2864.log
1 out of 10 hunks FAILED -- saving rejects to file
src/bin/pg_rewind/pg_rewind.c.rej

There is a minor issue therefore I rebase the patch. Please take a look at
that.

--
Ibrar Ahmed

Attachment	Content-Type	Size
v3-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patch	application/octet-stream	11.0 KB

From:	Aleksander Alekseev <aleksander(at)timescale(dot)com>
To:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject:	Re: pg_rewind race condition just after promotion
Date:	2021-07-14 12:03:22
Message-ID:	162626420228.1167.13414583689741111142.pgcf@coridan.postgresql.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
coverage for this scenario. But I don't have a strong opinion on this one.

The new status of this patch is: Ready for Committer

From:	Daniel Gustafsson <daniel(at)yesql(dot)se>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Aleksander Alekseev <aleksander(at)timescale(dot)com>
Subject:	Re: pg_rewind race condition just after promotion
Date:	2021-11-09 11:31:51
Message-ID:	492590F3-9566-4AC2-986D-6CFB35B110CD@yesql.se
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> On 14 Jul 2021, at 14:03, Aleksander Alekseev <aleksander(at)timescale(dot)com> wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: tested, passed
> Documentation: tested, passed
>
> The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
> they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
> coverage for this scenario. But I don't have a strong opinion on this one.
>
> The new status of this patch is: Ready for Committer

Heikki, do you have plans to address this patch during this CF?

--
Daniel Gustafsson https://vmware.com/

From:	Ian Lawrence Barwick <barwick(at)gmail(dot)com>
To:	Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Aleksander Alekseev <aleksander(at)timescale(dot)com>
Subject:	Re: pg_rewind race condition just after promotion
Date:	2022-12-11 00:01:05
Message-ID:	CAB8KJ=iNTR6NOSd=EVmSqcGURNbyO9ah3tKZtz1p8KzNzc1ugA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

2021年11月9日(火) 20:31 Daniel Gustafsson <daniel(at)yesql(dot)se>:
>
> > On 14 Jul 2021, at 14:03, Aleksander Alekseev <aleksander(at)timescale(dot)com> wrote:
> >
> > The following review has been posted through the commitfest application:
> > make installcheck-world: tested, passed
> > Implements feature: tested, passed
> > Spec compliant: tested, passed
> > Documentation: tested, passed
> >
> > The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
> > they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
> > coverage for this scenario. But I don't have a strong opinion on this one.
> >
> > The new status of this patch is: Ready for Committer
>
> Heikki, do you have plans to address this patch during this CF?

Friendly reminder ping one year on; I haven't looked at this patch in
detail but going by the thread contents it seems it should be marked
"Ready for Committer"? Moved to the next CF anyway.

Regards

Ian Barwick

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc:	Aleksander Alekseev <aleksander(at)timescale(dot)com>, Ian Lawrence Barwick <barwick(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>
Subject:	Re: pg_rewind race condition just after promotion
Date:	2023-02-22 14:00:27
Message-ID:	848e03ce-0c16-2166-3994-f2c3dcf467d0@iki.fi
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11/12/2022 02:01, Ian Lawrence Barwick wrote:
> 2021年11月9日(火) 20:31 Daniel Gustafsson <daniel(at)yesql(dot)se>:
>>
>>> On 14 Jul 2021, at 14:03, Aleksander Alekseev <aleksander(at)timescale(dot)com> wrote:
>>>
>>> The following review has been posted through the commitfest application:
>>> make installcheck-world: tested, passed
>>> Implements feature: tested, passed
>>> Spec compliant: tested, passed
>>> Documentation: tested, passed
>>>
>>> The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
>>> they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
>>> coverage for this scenario. But I don't have a strong opinion on this one.
>>>
>>> The new status of this patch is: Ready for Committer
>>
>> Heikki, do you have plans to address this patch during this CF?
>
> Friendly reminder ping one year on; I haven't looked at this patch in
> detail but going by the thread contents it seems it should be marked
> "Ready for Committer"? Moved to the next CF anyway.

Here's an updated version of the patch.

I renamed the arguments to findCommonAncestorTimeline() so that the
'targetHistory' argument doesn't shadow the global 'targetHistory'
variable. No other changes, and this still looks good to me, so I'll
wait for the cfbot to run on this and commit in the next few days.

- Heikki

Attachment	Content-Type	Size
v4-0001-pg_rewind-Fix-determining-TLI-when-server-was-jus.patch	text/x-patch	11.5 KB

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc:	Aleksander Alekseev <aleksander(at)timescale(dot)com>, Ian Lawrence Barwick <barwick(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>
Subject:	Re: pg_rewind race condition just after promotion
Date:	2023-02-23 13:43:02
Message-ID:	55b992da-c362-7474-1251-e1fe34eb136b@iki.fi
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 22/02/2023 16:00, Heikki Linnakangas wrote:
> On 11/12/2022 02:01, Ian Lawrence Barwick wrote:
>> 2021年11月9日(火) 20:31 Daniel Gustafsson <daniel(at)yesql(dot)se>:
>>>
>>>> On 14 Jul 2021, at 14:03, Aleksander Alekseev <aleksander(at)timescale(dot)com> wrote:
>>>>
>>>> The following review has been posted through the commitfest application:
>>>> make installcheck-world: tested, passed
>>>> Implements feature: tested, passed
>>>> Spec compliant: tested, passed
>>>> Documentation: tested, passed
>>>>
>>>> The v3 patch LGTM. I wonder if we should explicitly say in pg_rewind tests that
>>>> they _don't_ have to call `checkpoint`, or otherwise, we will lose the test
>>>> coverage for this scenario. But I don't have a strong opinion on this one.
>>>>
>>>> The new status of this patch is: Ready for Committer
>>>
>>> Heikki, do you have plans to address this patch during this CF?
>>
>> Friendly reminder ping one year on; I haven't looked at this patch in
>> detail but going by the thread contents it seems it should be marked
>> "Ready for Committer"? Moved to the next CF anyway.
>
> Here's an updated version of the patch.
>
> I renamed the arguments to findCommonAncestorTimeline() so that the
> 'targetHistory' argument doesn't shadow the global 'targetHistory'
> variable. No other changes, and this still looks good to me, so I'll
> wait for the cfbot to run on this and commit in the next few days.

Pushed. I decided not to backpatch this, after all. We haven't really
been treating this as a bug so far, and the patch didn't apply cleanly
to v13 and before.

- Heikki