Re: BUG #17903: There is a bug in the KeepLogSeg()

Lists: pgsql-bugs
From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: xu(dot)xw2008(at)163(dot)com
Subject: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-19 10:26:13
Message-ID: 17903-4288d439dee856c6@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 17903
Logged by: xu xingwang
Email address: xu(dot)xw2008(at)163(dot)com
PostgreSQL version: 13.3
Operating system: openEuler
Description:

Hi,

I found that KeepLogSeg() has a piece of code that is not correctly.

segno may be larger than currSegNo, since the slot_keep_segs variable is of
type "uint64", in this case the code "if (currSegNo - segno >
slot_keep_segs)" is incorrect.

"if (currSegNo - segno < keep_segs)" is also the same.

Checkpoint calls the KeepLogSeg function, and there are many operations
between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
pg_control file, so segno may be larger than currSegNo.

KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
{
XLogSegNo currSegNo;
XLogSegNo segno;
XLogRecPtr keep;

XLByteToSeg(recptr, currSegNo, wal_segment_size);
segno = currSegNo;

/*
* Calculate how many segments are kept by slots first, adjusting for
* max_slot_wal_keep_size.
*/
keep = XLogGetReplicationSlotMinimumLSN();
if (keep != InvalidXLogRecPtr)
{
XLByteToSeg(keep, segno, wal_segment_size);

/* Cap by max_slot_wal_keep_size ... */
if (max_slot_wal_keep_size_mb >= 0)
{
uint64 slot_keep_segs;

slot_keep_segs =
ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);

if (currSegNo - segno > slot_keep_segs)
segno = currSegNo - slot_keep_segs;
}
}

/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
uint64 keep_segs;

keep_segs = ConvertToXSegs(wal_keep_size_mb, wal_segment_size);
if (currSegNo - segno < keep_segs)
{
/* avoid underflow, don't go below 1 */
if (currSegNo <= keep_segs)
segno = 1;
else
segno = currSegNo - keep_segs;
}
}

regards.

--
xu xingwang


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-20 03:04:17
Message-ID: 20230420.120417.1609083651022565895.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

At Wed, 19 Apr 2023 10:26:13 +0000, PG Bug reporting form <noreply(at)postgresql(dot)org> wrote in
> I found that KeepLogSeg() has a piece of code that is not correctly.
>
> segno may be larger than currSegNo, since the slot_keep_segs variable is of
> type "uint64", in this case the code "if (currSegNo - segno >
> slot_keep_segs)" is incorrect.
>
> "if (currSegNo - segno < keep_segs)" is also the same.
>
> Checkpoint calls the KeepLogSeg function, and there are many operations
> between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
> pg_control file, so segno may be larger than currSegNo.

Correct. Thanks for the report.

If checkpointer somehow takes a long time between inserting a
checkpoint record and removing WAL files, while replication advances a
certain distnace, it can actually happen. Although that behavior
doesn't directly affect max_slot_wal_keep_size, it does disrupt the
effect of wal_keep_size.

The thinko was that we incorrectly assumed the slot minimum LSN can't
be larger than the checkpoint record LSN. We don't need to consider
max_slot_wal_keep_size if the slot minimum LSN is already larger than
currSegNo.

The attached fix works. However, I can't come up with a reasonable
testing script.

This dates back to 13, where max_slot_wal_keep_size was introduced.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Fix-incorrect-calculation-regarding-max_slot_wal_kee.patch text/x-patch 1.2 KB

From: Junwang Zhao <zhjwpku(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-20 07:40:14
Message-ID: CAEG8a3L2BB5k9USO4fC_wQjULGX-1rqaX2JSHn1_0vco8KB28Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

If `segno` can be larger than `currSegNo`, your patch seems to miss
the following branch,
are you missing this for some particular reason?

```
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
uint64 keep_segs;

keep_segs = ConvertToXSegs(wal_keep_size_mb, wal_segment_size);
if (currSegNo - segno < keep_segs) <<<< see here
{
/* avoid underflow, don't go below 1 */
if (currSegNo <= keep_segs)
segno = 1;
else
segno = currSegNo - keep_segs;
}
}
```

On Thu, Apr 20, 2023 at 11:04 AM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Wed, 19 Apr 2023 10:26:13 +0000, PG Bug reporting form <noreply(at)postgresql(dot)org> wrote in
> > I found that KeepLogSeg() has a piece of code that is not correctly.
> >
> > segno may be larger than currSegNo, since the slot_keep_segs variable is of
> > type "uint64", in this case the code "if (currSegNo - segno >
> > slot_keep_segs)" is incorrect.
> >
> > "if (currSegNo - segno < keep_segs)" is also the same.
> >
> > Checkpoint calls the KeepLogSeg function, and there are many operations
> > between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
> > pg_control file, so segno may be larger than currSegNo.
>
> Correct. Thanks for the report.
>
> If checkpointer somehow takes a long time between inserting a
> checkpoint record and removing WAL files, while replication advances a
> certain distnace, it can actually happen. Although that behavior
> doesn't directly affect max_slot_wal_keep_size, it does disrupt the
> effect of wal_keep_size.
>
> The thinko was that we incorrectly assumed the slot minimum LSN can't
> be larger than the checkpoint record LSN. We don't need to consider
> max_slot_wal_keep_size if the slot minimum LSN is already larger than
> currSegNo.
>
> The attached fix works. However, I can't come up with a reasonable
> testing script.
>
> This dates back to 13, where max_slot_wal_keep_size was introduced.
>
> regards.
>
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

--
Regards
Junwang Zhao


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: zhjwpku(at)gmail(dot)com
Cc: xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-20 08:58:14
Message-ID: 20230420.175814.1157027125103997227.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

At Thu, 20 Apr 2023 15:40:14 +0800, Junwang Zhao <zhjwpku(at)gmail(dot)com> wrote in
> If `segno` can be larger than `currSegNo`, your patch seems to miss
> the following branch,
> are you missing this for some particular reason?

Oops! Sorry for the mistake and thanks for pointing it out.

I should have kept segno within the reasonable range.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v2-0001-Fix-incorrect-calculation-regarding-max_slot_wal_.patch text/x-patch 1.2 KB

From: Junwang Zhao <zhjwpku(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-20 11:17:02
Message-ID: CAEG8a3Kx2tO4d0kaDH8-75LthUtOc_A2HEZFpkgGcQ4NB1A=vA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

This patch looks reasonable.

+1

On Thu, Apr 20, 2023 at 4:58 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Thu, 20 Apr 2023 15:40:14 +0800, Junwang Zhao <zhjwpku(at)gmail(dot)com> wrote in
> > If `segno` can be larger than `currSegNo`, your patch seems to miss
> > the following branch,
> > are you missing this for some particular reason?
>
> Oops! Sorry for the mistake and thanks for pointing it out.
>
> I should have kept segno within the reasonable range.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

--
Regards
Junwang Zhao


From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: zhjwpku(at)gmail(dot)com, xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-20 22:02:42
Message-ID: 20230420220242.GB1435672@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Thu, Apr 20, 2023 at 05:58:14PM +0900, Kyotaro Horiguchi wrote:
> + /*
> + * Slot minimum LSN could be greater than recptr. In such cases, no
> + * segments are protected by slots but we still need to keep segno in a
> + * reasonable range for subsequent calculations to avoid underflow.
> + */
> + if (segno > currSegNo)
> + segno = currSegNo;
> +

I wonder if it'd be better to instead change

if (currSegNo - segno > slot_keep_segs)
to
if (currSegNo > segno + slot_keep_segs)

and

if (currSegNo - segno < keep_segs)
to
if (currSegNo < segNo + keep_segs)

If segno > currSegNo, the first conditional would be false as expected, and
the second would be true as expected. We could also use
pg_sub_u64_overflow() to detect underflow, but that might be excessive in
this case.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: nathandbossart(at)gmail(dot)com
Cc: zhjwpku(at)gmail(dot)com, xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-21 01:42:31
Message-ID: 20230421.104231.1594652765237022266.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

At Thu, 20 Apr 2023 15:02:42 -0700, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote in
> I wonder if it'd be better to instead change
>
> if (currSegNo - segno > slot_keep_segs)
> to
> if (currSegNo > segno + slot_keep_segs)
>
> and
>
> if (currSegNo - segno < keep_segs)
> to
> if (currSegNo < segNo + keep_segs)
>
> If segno > currSegNo, the first conditional would be false as expected, and
> the second would be true as expected. We could also use
> pg_sub_u64_overflow() to detect underflow, but that might be excessive in
> this case.

From what I understand, the XLogSegNo calculations are designed
without considering the actual value range. Basically, it assumes that
(XLogSegNo + <any positive int>) can overflow. If we take the actual
value range into account, we can make that change.

The choice lies on whether we assume the actual value range or not.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: zhjwpku(at)gmail(dot)com, xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-21 21:43:47
Message-ID: 20230421214347.GA1441786@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Fri, Apr 21, 2023 at 10:42:31AM +0900, Kyotaro Horiguchi wrote:
> From what I understand, the XLogSegNo calculations are designed
> without considering the actual value range. Basically, it assumes that
> (XLogSegNo + <any positive int>) can overflow. If we take the actual
> value range into account, we can make that change.
>
> The choice lies on whether we assume the actual value range or not.

Yeah, after staring at this some more, I think your proposed fix is the way
to go. Alternatively, we could adjust the conditional for the
max_slot_wal_keep_size block to

if (keep != InvalidXLogRecPtr && keep < recptr)

but AFAICT that yields the exact same behavior.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: zhjwpku(at)gmail(dot)com, xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-24 19:14:52
Message-ID: 20230424191452.GA1698829@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Here is a new version of the patch. It is fundamentally the same as v2,
but I've adjusted the comment and commit message a bit. Barring
objections, I am planning to commit this (and back-patch to v13) in the
near future.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v3-0001-Prevent-underflow-in-KeepLogSeg.patch text/x-diff 1.6 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: nathandbossart(at)gmail(dot)com
Cc: zhjwpku(at)gmail(dot)com, xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-25 04:14:52
Message-ID: 20230425.131452.517547289278432754.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

At Mon, 24 Apr 2023 12:14:52 -0700, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote in
> Here is a new version of the patch. It is fundamentally the same as v2,
> but I've adjusted the comment and commit message a bit. Barring
> objections, I am planning to commit this (and back-patch to v13) in the
> near future.

> When determining the oldest segment that must be kept for
> replication slots, KeepLogSeg() might calculate a segment number
> ahead of the one for the LSN given to the function. This causes

Maybe the KeepLogSeg() is a mistake of
XLogGetReplicationSlotMinimumLSN()?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: zhjwpku(at)gmail(dot)com, xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-27 21:47:46
Message-ID: 20230427214746.GA1986346@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Tue, Apr 25, 2023 at 01:14:52PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 24 Apr 2023 12:14:52 -0700, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote in
>> When determining the oldest segment that must be kept for
>> replication slots, KeepLogSeg() might calculate a segment number
>> ahead of the one for the LSN given to the function. This causes
>
> Maybe the KeepLogSeg() is a mistake of
> XLogGetReplicationSlotMinimumLSN()?

I adjusted the commit message to call out that function explicitly. Also,
I decided to go with the "keep < recptr" approach since there's no reason
to do anything in that block otherwise.

Thanks!

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, zhjwpku(at)gmail(dot)com, xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-27 23:08:01
Message-ID: ZEsAUeAsD6zvvKhc@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Thu, Apr 27, 2023 at 02:47:46PM -0700, Nathan Bossart wrote:
> I adjusted the commit message to call out that function explicitly. Also,
> I decided to go with the "keep < recptr" approach since there's no reason
> to do anything in that block otherwise.

b726236 exists in the tree, but it seems like the message of
pgsql-committers is held on moderation?
--
Michael


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: nathandbossart(at)gmail(dot)com
Cc: zhjwpku(at)gmail(dot)com, xu(dot)xw2008(at)163(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17903: There is a bug in the KeepLogSeg()
Date: 2023-04-28 06:21:49
Message-ID: 20230428.152149.1957059339218200205.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: PostgreSQL : PostgreSQL 메일 링리스트 : 2023-04-28 이후 PGSQL 윈 토토 06:21

At Thu, 27 Apr 2023 14:47:46 -0700, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote in
> On Tue, Apr 25, 2023 at 01:14:52PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 24 Apr 2023 12:14:52 -0700, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote in
> >> When determining the oldest segment that must be kept for
> >> replication slots, KeepLogSeg() might calculate a segment number
> >> ahead of the one for the LSN given to the function. This causes
> >
> > Maybe the KeepLogSeg() is a mistake of
> > XLogGetReplicationSlotMinimumLSN()?
>
> I adjusted the commit message to call out that function explicitly. Also,
> I decided to go with the "keep < recptr" approach since there's no reason
> to do anything in that block otherwise.

That works, too. Thanks!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center