Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From: Jonathon Nelson <jdnelson(at)dyn(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Bug in Physical Replication Slots (at least 9.5)?
Date: 2016-11-28 19:39:28
Message-ID: CACJqAM3xVz0JY1XFDKPP+JoJAjoGx=GNuOAshEDWCext7BFvCQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

We think we have discovered a bug in the physical replication slots
functionality in PostgreSQL 9.5.
We've seen the behavior across Operating Systems (CentOS-7 and openSUSE
LEAP 42.1), filesystems (ext4 and xfs), and versions (9.5.3 and 9.5.4). All
were on x86_64.

We notice that if we stop and then re-start the *standby*, upon restart it
will - sometimes - request a WAL file that the master no longer has.

First, the postgresql configuration differs only minimally from the stock
config:

Assume wal_keep_segments = 0.
Assume the use of physical replication slots.
Assume one master, one standby.

Lastly, we have observed the behavior "in the wild" at least twice and in
the lab a dozen or so times.

EXAMPLE #1 (logs are from the replica):

user=,db=,app=,client= DEBUG: creating and filling new WAL file
user=,db=,app=,client= DEBUG: done creating and filling new WAL file
user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8A000000
apply 5/748425A0
user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8B000000
apply 5/74843020
<control-c here>
user=,db=,app=,client= DEBUG: postmaster received signal 2
user=,db=,app=,client= LOG: received fast shutdown request
user=,db=,app=,client= LOG: aborting any active transactions

And, upon restart:

user=,db=,app=,client= LOG: restartpoint starting: xlog
user=,db=,app=,client= DEBUG: updated min recovery point to 6/67002390 on
timeline 1
user=,db=,app=,client= DEBUG: performing replication slot checkpoint
user=,db=,app=,client= DEBUG: updated min recovery point to 6/671768C0 on
timeline 1
user=,db=,app=,client= CONTEXT: writing block 589 of relation
base/13294/16501
user=,db=,app=,client= LOG: invalid magic number 0000 in log segment
00000001000000060000008B, offset 0
user=,db=,app=,client= DEBUG: switched WAL source from archive to stream
after failure
user=,db=,app=,client= LOG: started streaming WAL from primary at
6/8A000000 on timeline 1
user=,db=,app=,client= FATAL: could not receive data from WAL stream:
ERROR: requested WAL segment 00000001000000060000008A has already been
removed

A physical analysis shows that the WAL file 00000001000000060000008B is
100% zeroes (ASCII NUL).

The results of querying pg_replication_slots show a restart_lsn that
matches ….6/8B.

pg_controldata shows values like:
Minimum recovery ending location: 6/8Axxxxxx

How can the master show a position that is greater than the minimum
recovery ending location?

EXAMPLE #2:

Minimum recovery ending location: 19DD/73FFFFE0
Log segment 00000001000019DD00000073 was not available.
The restart LSN was 19DD/74000000.
The last few lines from pg_xlogdump 00000001000019DD00000073:

rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0:
rel 1663/16403/150017028 blk 1832
rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0:
rel 1663/16403/150017028 blk 11709

If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the offset of
the first byte of the last record in this file, and the record length is 64
bytes, which places the first byte of the next record at 16777184 (0xFFFFE0)
(logical position 0x73FFFFE0: this jibes with pg_controldata).

However, there are only 32 bytes of the file left:
0x73FFFFA0 - 0x73000000 + 64 => 16777184
16777216 - 16777184 => 32

Which means that the next record is in the WAL file
00000001000019DD00000074.
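
To make the arithmetic above easy to re-check, here is a small standalone C
sketch. It is not part of the original report; it merely redoes the
calculation, assuming the default 16MB WAL segment size:

#include <stdio.h>

#define WAL_SEG_SIZE (16 * 1024 * 1024)     /* default XLogSegSize */

int
main(void)
{
    unsigned seg_start = 0x73000000;        /* start of segment ...19DD00000073 */
    unsigned last_rec  = 0x73FFFFA0;        /* LSN of the last record shown by pg_xlogdump */
    unsigned rec_len   = 64;                /* total length of that record */

    unsigned next_off   = (last_rec - seg_start) + rec_len;  /* 16777184 = 0xFFFFE0 */
    unsigned bytes_left = WAL_SEG_SIZE - next_off;           /* 32 */

    printf("next record offset: %u (0x%X)\n", next_off, next_off);
    printf("bytes left in segment: %u\n", bytes_left);
    return 0;
}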

A possible theory:

Let us assume PG has applied 100% of the data in a given WAL file, and
let's assume (as in this case) that the WAL file is
00000001000019DD00000073. When it starts up again, it uses the control
data to start and says "The next record is at 19DD/0x73ffffe0", which it
truncates to 0x73000000. However, PG has *also* already told the master
that it has fully received, written, and flushed all of the data for that
WAL file, so the master has 0x74000000 as the start position (and has
consequently removed the WAL file for 0x73). The relationship between
pg_controldata and pg_replication_slots.restart_lsn seems to be very
slightly (but importantly) at odds.

Could it be this part of the code?

From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming (as
of a0aa358ca603d8189fe4be72f614cf7cf363d81a):

235     /*
236      * We always start at the beginning of the segment. That prevents a broken
237      * segment (i.e., with no records in the first half of a segment) from
238      * being created by XLOG streaming, which might cause trouble later on if
239      * the segment is e.g archived.
240      */
241     if (recptr % XLogSegSize != 0)
242         recptr -= recptr % XLogSegSize;
243

We start up with 19DD/0x73ffffe0 (but there would not be enough room in
that segment for any more records, so logically we'd have to go to
19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0 to
0x73000000, the start of a segment the master has already removed (and
which - technically - we don't actually need?).
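
For illustration, here is a minimal standalone sketch (not from the report)
of what that truncation does to the LSN in this example, again assuming the
default 16MB segment size:

#include <stdio.h>
#include <stdint.h>

#define XLOG_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)   /* assumed default */

int
main(void)
{
    uint64_t recptr = ((uint64_t) 0x19DD << 32) | 0x73FFFFE0;  /* 19DD/73FFFFE0 */

    if (recptr % XLOG_SEG_SIZE != 0)
        recptr -= recptr % XLOG_SEG_SIZE;   /* back to the segment start */

    /* prints 19DD/73000000: the start of a segment the master may have recycled */
    printf("%X/%08X\n", (unsigned) (recptr >> 32), (unsigned) recptr);
    return 0;
}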

--
Jon Nelson
Dyn / Principal Software Engineer


From: Jonathon Nelson <jdnelson(at)dyn(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-13 14:47:06
Message-ID: CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson(at)dyn(dot)com> wrote:

> We think we have discovered a bug in the physical replication slots
> functionality in PostgreSQL 9.5.
> We've seen the behavior across Operating Systems (CentOS-7 and openSUSE
> LEAP 42.1), filesystems (ext4 and xfs), and versions (9.5.3 and 9.5.4). All
> were on x86_64.
>
> We notice that if we stop and then re-start the *standby*, upon restart it
> will - sometimes - request a WAL file that the master no longer has.
>

I hate to largely re-quote my entire wall-of-text email/bug report, but
there were no responses to this (to be fair, it was at the end of a month
of US holidays, etc...).

Is there more information I should provide? Can I get this added to some
sort of official bug list (it doesn't have a bug number)?

Any help or advice here would be appreciated.

> First, the postgresql configuration differs only minimally from the stock
> config:
>
> Assume wal_keep_segments = 0.
> Assume the use of physical replication slots.
> Assume one master, one standby.
>
> Lastly, we have observed the behavior "in the wild" at least twice and in
> the lab a dozen or so times.
>
> EXAMPLE #1 (logs are from the replica):
>
> user=,db=,app=,client= DEBUG: creating and filling new WAL file
> user=,db=,app=,client= DEBUG: done creating and filling new WAL file
> user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8A000000
> apply 5/748425A0
> user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8B000000
> apply 5/74843020
> <control-c here>
> user=,db=,app=,client= DEBUG: postmaster received signal 2
> user=,db=,app=,client= LOG: received fast shutdown request
> user=,db=,app=,client= LOG: aborting any active transactions
>
> And, upon restart:
>
> user=,db=,app=,client= LOG: restartpoint starting: xlog
> user=,db=,app=,client= DEBUG: updated min recovery point to 6/67002390 on
> timeline 1
> user=,db=,app=,client= DEBUG: performing replication slot checkpoint
> user=,db=,app=,client= DEBUG: updated min recovery point to 6/671768C0 on
> timeline 1
> user=,db=,app=,client= CONTEXT: writing block 589 of relation
> base/13294/16501
> user=,db=,app=,client= LOG: invalid magic number 0000 in log segment
> 00000001000000060000008B, offset 0
> user=,db=,app=,client= DEBUG: switched WAL source from archive to stream
> after failure
> user=,db=,app=,client= LOG: started streaming WAL from primary at
> 6/8A000000 on timeline 1
> user=,db=,app=,client= FATAL: could not receive data from WAL stream:
> ERROR: requested WAL segment 00000001000000060000008A has already been
> removed
>
> A physical analysis shows that the WAL file 00000001000000060000008B is
> 100% zeroes (ASCII NUL).
>
> The results of querying pg_replication_slots shows a restart_lsn that
> matches ….6/8B.
>
> Pg_controldata shows values like:
> Minimum recovery ending location: 6/8Axxxxxx
>
> How can the master show a position that is greater than the minimum
> recovery ending location?
>
> EXAMPLE #2:
>
> Minimum recovery ending location: 19DD/73FFFFE0
> Log segment 00000001000019DD00000073 was not available.
> The restart LSN was 19DD/74000000.
> The last few lines from pg_xlogdump 00000001000019DD00000073:
>
>
> rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
> 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0:
> rel 1663/16403/150017028 blk 1832
> rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
> 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0:
> rel 1663/16403/150017028 blk 11709
>
> If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the first
> byte of the last record in this file, and the record length is 64 bytes
> which places the first byte of the next record at: 16777184 (0xffffe0)
> (logical position 0x73ffffe0: this jives with pg_controldata).
>
> However, there are only 32 bytes of file left:
> 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> 16777216 - 16777184 -=> 32
>
> Which means that the next record is in the WAL file
> 00000001000019DD00000074.
>
> A possible theory:
>
> Let us assume PG has applied 100% of the data in a given WAL file, and
> let’s assume (as in this case) that the WAL file is
> 00000001000019DD00000073. When it starts up again, it uses the control
> data to start and say “The next record is at 19DD/0x73ffffe0" which it
> truncates to 0x73000000. However, PG has *also* already told the master
> that is has fully received, written, and flushed all of the data for that
> WAL file, so the master has 0x74000000 as the start position (and has
> consequently removed the WAL file for 0x73). The relationship between
> pg_controldata and pg_replication_slots.restart_lsn seem to be very
> slightly (but importantly) at odds.
>
> Could it be this part of the code?
>
> From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
>
> 235 /*
> 236 * We always start at the beginning of the segment. That prevents
> a broken
> 237 * segment (i.e., with no records in the first half of a segment)
> from
> 238 * being created by XLOG streaming, which might cause trouble
> later on if
> 239 * the segment is e.g archived.
> 240 */
> 241 if (recptr % XLogSegSize != 0)
> 242 recptr -= recptr % XLogSegSize;
> 243
>
> We start up with 19DD/0x73ffffe0 (but there would not be enough room in
> that segment for any more records, so logically we'd have to go to
> 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0 to
> 0x73000000, which the master has already removed (and - technically - we
> don't actually need?).
>
>
> --
> Jon Nelson
> Dyn / Principal Software Engineer
>

--
Jon Nelson
Dyn / Principal Software Engineer


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: jdnelson(at)dyn(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-17 10:36:45
Message-ID: 20170117.193645.160386781.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello. I added pgsql-hackers.

This also occurs on git master and back to 9.4.

At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnelson(at)dyn(dot)com> wrote in <CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw(at)mail(dot)gmail(dot)com>
> On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson(at)dyn(dot)com> wrote:
> > First, the postgresql configuration differs only minimally from the stock
> > config:
> >
> > Assume wal_keep_segments = 0.
> > Assume the use of physical replication slots.
> > Assume one master, one standby.
> >
> > Lastly, we have observed the behavior "in the wild" at least twice and in
> > the lab a dozen or so times.
> >
> > EXAMPLE #1 (logs are from the replica):
> >
> > user=,db=,app=,client= DEBUG: creating and filling new WAL file
> > user=,db=,app=,client= DEBUG: done creating and filling new WAL file
> > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8A000000
> > apply 5/748425A0
> > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8B000000
> > apply 5/74843020
> > <control-c here>
> > user=,db=,app=,client= DEBUG: postmaster received signal 2
> > user=,db=,app=,client= LOG: received fast shutdown request
> > user=,db=,app=,client= LOG: aborting any active transactions
> >
> > And, upon restart:
> >
> > user=,db=,app=,client= LOG: restartpoint starting: xlog
> > user=,db=,app=,client= DEBUG: updated min recovery point to 6/67002390 on
> > timeline 1
> > user=,db=,app=,client= DEBUG: performing replication slot checkpoint
> > user=,db=,app=,client= DEBUG: updated min recovery point to 6/671768C0 on
> > timeline 1
> > user=,db=,app=,client= CONTEXT: writing block 589 of relation
> > base/13294/16501
> > user=,db=,app=,client= LOG: invalid magic number 0000 in log segment
> > 00000001000000060000008B, offset 0
> > user=,db=,app=,client= DEBUG: switched WAL source from archive to stream
> > after failure
> > user=,db=,app=,client= LOG: started streaming WAL from primary at
> > 6/8A000000 on timeline 1
> > user=,db=,app=,client= FATAL: could not receive data from WAL stream:
> > ERROR: requested WAL segment 00000001000000060000008A has already been
> > removed

I managed to reproduce this. A little tweak (the first attached patch)
makes the standby kill itself as soon as walreceiver sees a
contrecord at the beginning of a segment.

- M(aster): createdb as a master with wal_keep_segments = 0
  (default), log_min_messages = debug2
- M: Create a physical repslot.
- S(tandby): Set up a standby database.
- S: Edit recovery.conf to use the replication slot above, then
  start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
> LOG: #################### STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

> FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002B has already been removed

This problem occurs when only the earlier part of a continued
record has been replicated and then the segment is removed on the
master. In other words, the first half is only on the standby, and
the second half is only on the master.

I believe that a continuation record cannot span three or
more *segments* (is that right?), so keeping one spare segment
would be enough. The attached second patch does this.
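
For readers who do not open the attachment, here is a rough, hypothetical
sketch of the idea only; it is not the attached patch, and the helper name
oldest_segno_to_keep() is made up:

#include <stdint.h>

#define XLOG_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)   /* assume default 16MB segments */

typedef uint64_t XLogRecPtr;
typedef uint64_t XLogSegNo;

/*
 * Hold WAL removal back to one segment *before* the slot's restart_lsn, so
 * that the first half of a continuation record crossing the segment
 * boundary is still available to the standby.
 */
XLogSegNo
oldest_segno_to_keep(XLogRecPtr slot_restart_lsn)
{
    XLogSegNo   segno = slot_restart_lsn / XLOG_SEG_SIZE;

    if (segno > 1)
        segno--;            /* keep one spare segment behind restart_lsn */

    return segno;
}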

Other possible measures might be:

- Allowing switching of the WAL source while reading a continuation
  record. Currently ReadRecord assumes that a continuation record
  can be read from a single source. But this needs refactoring
  involving xlog.c, xlogreader.c and related files.

- Delaying recycling of a segment until the last partial record on it
  completes. This seems doable page-wise (coarse resolution)
  but would cost additional reading of past xlog files (the page
  headers of past pages are required).

- Delaying write/flush feedback until the current record is
  completed. walreceiver is not conscious of WAL record boundaries and
  this might break synchronous replication.

Any thoughts?

=========================================
> > A physical analysis shows that the WAL file 00000001000000060000008B is
> > 100% zeroes (ASCII NUL).

I suppose it is on the standby, so that segment file is the one
onto which the next transferred record will be written.

> > The results of querying pg_replication_slots shows a restart_lsn that
> > matches ….6/8B.

It is the beginning of the next record to be replicated, as
documented. In other words, just after the last transferred
record (including padding).

> > Pg_controldata shows values like:
> > Minimum recovery ending location: 6/8Axxxxxx

It is the beginning of the last applied record.

> > How can the master show a position that is greater than the minimum
> > recovery ending location?

So it is natural that the former is larger than the latter.

> > EXAMPLE #2:
> >
> > Minimum recovery ending location: 19DD/73FFFFE0
> > Log segment 00000001000019DD00000073 was not available.
> > The restart LSN was 19DD/74000000.
> > The last few lines from pg_xlogdump 00000001000019DD00000073:
> >
> >
> > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
> > 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0:
> > rel 1663/16403/150017028 blk 1832
> > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
> > 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0:
> > rel 1663/16403/150017028 blk 11709
> >
> > If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the first
> > byte of the last record in this file, and the record length is 64 bytes
> > which places the first byte of the next record at: 16777184 (0xffffe0)
> > (logical position 0x73ffffe0: this jives with pg_controldata).

Maybe right. pg_xlogdump skips partial records.

> > However, there are only 32 bytes of file left:
> > 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> > 16777216 - 16777184 -=> 32
> >
> > Which means that the next record is in the WAL file
> > 00000001000019DD00000074.

Maybe right.

> > A possible theory:
> >
> > Let us assume PG has applied 100% of the data in a given WAL file, and
> > let’s assume (as in this case) that the WAL file is
> > 00000001000019DD00000073. When it starts up again, it uses the control
> > data to start and say “The next record is at 19DD/0x73ffffe0" which it
> > truncates to 0x73000000. However, PG has *also* already told the master
> > that is has fully received, written, and flushed all of the data for that
> > WAL file, so the master has 0x74000000 as the start position (and has
> > consequently removed the WAL file for 0x73). The relationship between
> > pg_controldata and pg_replication_slots.restart_lsn seem to be very
> > slightly (but importantly) at odds.
> >
> > Could it be this part of the code?

No, the code does the right thing. The problem is that a
continuation record is assumed to come from a single WAL source, that
is, archive, local pg_xlog, or streaming. But here the continuation
record is distributed across two sources.

> > From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> > (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
> >
> > 235 /*
> > 236 * We always start at the beginning of the segment. That prevents
> > a broken
> > 237 * segment (i.e., with no records in the first half of a segment)
> > from
> > 238 * being created by XLOG streaming, which might cause trouble
> > later on if
> > 239 * the segment is e.g archived.
> > 240 */
> > 241 if (recptr % XLogSegSize != 0)
> > 242 recptr -= recptr % XLogSegSize;
> > 243
> >
> > We start up with 19DD/0x73ffffe0 (but there would not be enough room in
> > that segment for any more records, so logically we'd have to go to
> > 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0 to
> > 0x73000000, which the master has already removed (and - technically - we
> > don't actually need?).

0x73ffffe0 and 0x73000000 are in the same segment. The current
recovery mechanism requires reading the record that starts at
0x73ffffe0, and it is on the standby, so it can be read.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
repl_border_bug_test.patch text/x-patch 1.0 KB
keep_extra_one_seg_for_replslot.patch text/x-patch 601 bytes

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, jdnelson(at)dyn(dot)com
Cc: "<pgsql-bugs(at)postgresql(dot)org>" <pgsql-bugs(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-17 11:17:01
Message-ID: CAM103DsCyXsfwZou9xQcKW8ap7UcTFNP_GM0E1orRAHmeJgZoA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Ouch! That is wrong.

Not decrement keep, but decrement segno.

On Tue, Jan 17, 2017 at 19:37, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Hello. I added pgsql-hackers.
>
> This occurs also on git master and back to 9.4.
>
> At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnelson(at)dyn(dot)com>
> wrote in <
> CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw(at)mail(dot)gmail(dot)com>
> > On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson(at)dyn(dot)com>
> wrote:
> > > First, the postgresql configuration differs only minimally from the
> stock
> > > config:
> > >
> > > Assume wal_keep_segments = 0.
> > > Assume the use of physical replication slots.
> > > Assume one master, one standby.
> > >
> > > Lastly, we have observed the behavior "in the wild" at least twice and
> in
> > > the lab a dozen or so times.
> > >
> > > EXAMPLE #1 (logs are from the replica):
> > >
> > > user=,db=,app=,client= DEBUG: creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: done creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush
> 6/8A000000
> > > apply 5/748425A0
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush
> 6/8B000000
> > > apply 5/74843020
> > > <control-c here>
> > > user=,db=,app=,client= DEBUG: postmaster received signal 2
> > > user=,db=,app=,client= LOG: received fast shutdown request
> > > user=,db=,app=,client= LOG: aborting any active transactions
> > >
> > > And, upon restart:
> > >
> > > user=,db=,app=,client= LOG: restartpoint starting: xlog
> > > user=,db=,app=,client= DEBUG: updated min recovery point to
> 6/67002390 on
> > > timeline 1
> > > user=,db=,app=,client= DEBUG: performing replication slot checkpoint
> > > user=,db=,app=,client= DEBUG: updated min recovery point to
> 6/671768C0 on
> > > timeline 1
> > > user=,db=,app=,client= CONTEXT: writing block 589 of relation
> > > base/13294/16501
> > > user=,db=,app=,client= LOG: invalid magic number 0000 in log segment
> > > 00000001000000060000008B, offset 0
> > > user=,db=,app=,client= DEBUG: switched WAL source from archive to
> stream
> > > after failure
> > > user=,db=,app=,client= LOG: started streaming WAL from primary at
> > > 6/8A000000 on timeline 1
> > > user=,db=,app=,client= FATAL: could not receive data from WAL stream:
> > > ERROR: requested WAL segment 00000001000000060000008A has already been
> > > removed
>
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.
>
> - M(aster): createdb as a master with wal_keep_segments = 0
> (default), min_log_messages = debug2
> - M: Create a physical repslot.
> - S(tandby): Setup a standby database.
> - S: Edit recovery.conf to use the replication slot above then
> start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
> > LOG: #################### STOP THE SERVER
>
> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: Fails to catch up with the following error.
>
> > FATAL: could not receive data from WAL stream: ERROR: requested WAL
> segment 00000001000000000000002B has already been removed
>
>
> This problem occurs when only the earlier parts in a continued
> record is replicated then the segment is removed on the
> master. In other words, the first half is only on standby, and
> the second half is only on the master.
>
> I believe that a continuation record cannot be span over three or
> more *segments* (is it right?), so kepping one spare segment
> would be enough. The attached second patch does this.
>
>
> Other possible measures might be,
>
> - Allowing switching wal source while reading a continuation
> record. Currently ReadRecord assumes that a continuation record
> can be read from single source. But this needs refactoring
> involving xlog.c, xlogreader.c and relatives.
>
> - Delaying recycing a segment until the last partial record on it
> completes. This seems doable in page-wise (coarse resolution)
> but would cost additional reading of past xlog files (page
> header of past pages is required).
>
> - Delaying write/flush feedback until the current record is
> completed. walreceiver is not conscious of a WAL record and
> this might break synchronous replication.
>
> Any thoughts?
>
>
> =========================================
> > > A physical analysis shows that the WAL file 00000001000000060000008B is
> > > 100% zeroes (ASCII NUL).
>
> I suppose it is on the standby so the segment file is the one
> where the next transferred record will be written onto.
>
> > > The results of querying pg_replication_slots shows a restart_lsn that
> > > matches ….6/8B.
>
> It is the beginning of the next record to be replicatd as
> documentation. In other words, just after the last transferred
> record (containing padding).
>
> > > Pg_controldata shows values like:
> > > Minimum recovery ending location: 6/8Axxxxxx
>
> It is the beginning of the last applied record.
>
> > > How can the master show a position that is greater than the minimum
> > > recovery ending location?
>
> So it is natural that the former is larger than the latter.
>
> > > EXAMPLE #2:
> > >
> > > Minimum recovery ending location: 19DD/73FFFFE0
> > > Log segment 00000001000019DD00000073 was not available.
> > > The restart LSN was 19DD/74000000.
> > > The last few lines from pg_xlogdump 00000001000019DD00000073:
> > >
> > >
> > > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
> > > 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref
> #0:
> > > rel 1663/16403/150017028 blk 1832
> > > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
> > > 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref
> #0:
> > > rel 1663/16403/150017028 blk 11709
> > >
> > > If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the
> first
> > > byte of the last record in this file, and the record length is 64 bytes
> > > which places the first byte of the next record at: 16777184 (0xffffe0)
> > > (logical position 0x73ffffe0: this jives with pg_controldata).
>
> Maybe right. pg_xlogdump skips partial records.
>
> > > However, there are only 32 bytes of file left:
> > > 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> > > 16777216 - 16777184 -=> 32
> > >
> > > Which means that the next record is in the WAL file
> > > 00000001000019DD00000074.
>
> Maybe right.
>
> > > A possible theory:
> > >
> > > Let us assume PG has applied 100% of the data in a given WAL file, and
> > > let’s assume (as in this case) that the WAL file is
> > > 00000001000019DD00000073. When it starts up again, it uses the control
> > > data to start and say “The next record is at 19DD/0x73ffffe0" which it
> > > truncates to 0x73000000. However, PG has *also* already told the
> master
> > > that is has fully received, written, and flushed all of the data for
> that
> > > WAL file, so the master has 0x74000000 as the start position (and has
> > > consequently removed the WAL file for 0x73). The relationship between
> > > pg_controldata and pg_replication_slots.restart_lsn seem to be very
> > > slightly (but importantly) at odds.
> > >
> > > Could it be this part of the code?
>
> No. the code does the right thing. The problem is that a
> continuation record is assumed to be on the same wal source, that
> is, archive/wal and streaming. But a continueation record is
> distributed to two sources.
>
> > > From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> > > (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
> > >
> > > 235 /*
> > > 236 * We always start at the beginning of the segment. That
> prevents
> > > a broken
> > > 237 * segment (i.e., with no records in the first half of a
> segment)
> > > from
> > > 238 * being created by XLOG streaming, which might cause trouble
> > > later on if
> > > 239 * the segment is e.g archived.
> > > 240 */
> > > 241 if (recptr % XLogSegSize != 0)
> > > 242 recptr -= recptr % XLogSegSize;
> > > 243
> > >
> > > We start up with 19DD/0x73ffffe0 (but there would not be enough room in
> > > that segment for any more records, so logically we'd have to go to
> > > 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0
> to
> > > 0x73000000, which the master has already removed (and - technically -
> we
> > > don't actually need?).
>
> 0x73ffffe0 and 0x73000000 are on the same segment. Current
> recovery mechanism requires reading the record starts from
> 0x73ffffe0 and it is on the standby and it is read.
>
>
> regards,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-18 03:34:51
Message-ID: CAB7nPqQytF2giE7FD-4oJJpPVwiKJrDQPc24hLNGThX01SbSmA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.

Good idea.

> I believe that a continuation record cannot be span over three or
> more *segments* (is it right?), so keeping one spare segment
> would be enough. The attached second patch does this.

I have to admit that I have not thought about this problem much yet (I
bookmarked this report weeks ago, to be honest, as something to look
at), but that does not look right to me. Couldn't a record be spread
across even more segments? Take a random string longer than 64MB or
even longer, for example.

> Other possible measures might be,
>
> - Allowing switching wal source while reading a continuation
> record. Currently ReadRecord assumes that a continuation record
> can be read from single source. But this needs refactoring
> involving xlog.c, xlogreader.c and relatives.

This is scary thinking about back-branches.

> - Delaying recycling a segment until the last partial record on it
> completes. This seems doable in page-wise (coarse resolution)
> but would cost additional reading of past xlog files (page
> header of past pages is required).

Hm, yes. That looks like the least invasive way to go. At least that
looks more correct than the others.

> - Delaying write/flush feedback until the current record is
> completed. walreceiver is not conscious of a WAL record and
> this might break synchronous replication.

Not sure about this one yet.
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-19 09:37:31
Message-ID: 20170119.183731.223893446.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello,

At Wed, 18 Jan 2017 12:34:51 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqQytF2giE7FD-4oJJpPVwiKJrDQPc24hLNGThX01SbSmA(at)mail(dot)gmail(dot)com>
> On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
>
> Good idea.

Thanks. Fortunately(?), the problematic situation seems to happen
at almost every segment boundary.

> > I believe that a continuation record cannot be span over three or
> > more *segments* (is it right?), so keeping one spare segment
> > would be enough. The attached second patch does this.
>
> I have to admit that I did not think about this problem much yet (I
> bookmarked this report weeks ago to be honest as something to look
> at), but that does not look right to me. Couldn't a record be spawned
> across even more segments? Take a random string longer than 64MB or
> event longer for example.

I haven't looked closely at how a modification is split
into WAL records, but a tuple cannot be that long. As a simple test, I
observed rechdr->xl_tot_len at the end of XLogRecordAssemble while
inserting an about 400KB not-so-compressible string into a text
column, but I saw only a series of many records, each shorter than
several thousand bytes.

> > Other possible measures might be,
> >
> > - Allowing switching wal source while reading a continuation
> > record. Currently ReadRecord assumes that a continuation record
> > can be read from single source. But this needs refactoring
> > involving xlog.c, xlogreader.c and relatives.
>
> This is scary thinking about back-branches.

Yes. It would no longer be a bug fix. (Or it becomes quite an ugly hack..)

> > - Delaying recycling a segment until the last partial record on it
> > completes. This seems doable in page-wise (coarse resolution)
> > but would cost additional reading of past xlog files (page
> > header of past pages is required).
>
> Hm, yes. That looks like the least invasive way to go. At least that
> looks more correct than the others.

The attached patch does that. Usually it reads page headers only
at segment boundaries, but once a continuation record is found (or
reading the next page header fails, that is, the first record on
the first page of the next segment has not been replicated), this
happens at every page boundary until a non-continuation page comes.

I left debug output (at LOG level) in the attached patch, emitted on
every state change of the keep pointer. At least for pgbench, the
cost seems negligible.
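
As a rough illustration of the page-header check this relies on (this is
not the attached patch; the struct layout and flag follow
src/include/access/xlog_internal.h, while the helper function is invented):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

typedef struct XLogPageHeaderData
{
    uint16_t    xlp_magic;      /* magic value for correctness checks */
    uint16_t    xlp_info;       /* flag bits */
    TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
    XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */
    uint32_t    xlp_rem_len;    /* remaining bytes of a record spanning onto this page */
} XLogPageHeaderData;

#define XLP_FIRST_IS_CONTRECORD 0x0001

/*
 * A page whose header has XLP_FIRST_IS_CONTRECORD set begins with the tail
 * of a record that started on an earlier page, so the segment holding the
 * earlier part must not be recycled yet.
 */
bool
page_starts_with_contrecord(const XLogPageHeaderData *hdr)
{
    return (hdr->xlp_info & XLP_FIRST_IS_CONTRECORD) != 0;
}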

> > - Delaying write/flush feedback until the current record is
> > completed. walreceiver is not conscious of a WAL record and
> > this might break synchronous replication.
>
> Not sure about this one yet.

I'm not sure either :p

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
retard_keeplogseg.patch text/x-patch 6.1 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-20 02:07:29
Message-ID: 20170120.110729.107284864.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello,

At Thu, 19 Jan 2017 18:37:31 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170119(dot)183731(dot)223893446(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > > - Delaying recycling a segment until the last partial record on it
> > > completes. This seems doable in page-wise (coarse resolution)
> > > but would cost additional reading of past xlog files (page
> > > header of past pages is required).
> >
> > Hm, yes. That looks like the least invasive way to go. At least that
> > looks more correct than the others.
>
> The attached patch does that. Usually it reads page headers only
> on segment boundaries, but once continuation record found (or
> failed to read the next page header, that is, the first record on
> the first page in the next segment has not been replicated), it
> becomes to happen on every page boundary until non-continuation
> page comes.
>
> I leave a debug info (at LOG level) in the attached file shown on
> every state change of keep pointer. At least for pgbench, the
> cost seems ignorable.

I revised it. It became neater and less invasive.

- Removed the added keep member from struct WalSnd. It is never referenced
  from other processes; it is now a static variable.

- Restored keepPtr from the replication slot on startup.

- Moved the main part to a more appropriate position.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
retard_keeplogseg_v2.patch text/x-patch 6.4 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-02-01 07:41:38
Message-ID: 20170201.164138.24166847.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello, I'll add the rebased version to the next CF.

At Fri, 20 Jan 2017 11:07:29 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170120(dot)110729(dot)107284864(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > > > - Delaying recycling a segment until the last partial record on it
> > > > completes. This seems doable in page-wise (coarse resolution)
> > > > but would cost additional reading of past xlog files (page
> > > > header of past pages is required).
> > >
> > > Hm, yes. That looks like the least invasive way to go. At least that
> > > looks more correct than the others.
> >
> > The attached patch does that. Usually it reads page headers only
> > on segment boundaries, but once continuation record found (or
> > failed to read the next page header, that is, the first record on
> > the first page in the next segment has not been replicated), it
> > becomes to happen on every page boundary until non-continuation
> > page comes.
> >
> > I leave a debug info (at LOG level) in the attached file shown on
> > every state change of keep pointer. At least for pgbench, the
> > cost seems ignorable.
>
> I revised it. It became neater and less invasive.
>
> - Removed added keep from struct WalSnd. It is never referrenced
> from other processes. It is static variable now.
>
> - Restore keepPtr from replication slot on starting.

keepPtr is renamed to the more meaningful name restartLSN.

> - Moved the main part to more appropriate position.

- Removed the debug print code.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Fix-a-bug-of-physical-replication-slot.patch text/x-patch 7.3 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, jdnelson(at)dyn(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-02-01 16:26:03
Message-ID: CAHGQGwEET=QBA_jND=xhrXn+9ZreP4_qMBAqsBZg56beqxbveg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Jan 19, 2017 at 6:37 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Hello,
>
> At Wed, 18 Jan 2017 12:34:51 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqQytF2giE7FD-4oJJpPVwiKJrDQPc24hLNGThX01SbSmA(at)mail(dot)gmail(dot)com>
>> On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
>> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> > I managed to reproduce this. A little tweak as the first patch
>> > lets the standby to suicide as soon as walreceiver sees a
>> > contrecord at the beginning of a segment.
>>
>> Good idea.
>
> Thanks. Fortunately(?), the problematic situation seems to happen
> at almost all segment boundary.
>
>> > I believe that a continuation record cannot be span over three or
>> > more *segments* (is it right?), so keeping one spare segment
>> > would be enough. The attached second patch does this.
>>
>> I have to admit that I did not think about this problem much yet (I
>> bookmarked this report weeks ago to be honest as something to look
>> at), but that does not look right to me. Couldn't a record be spawned
>> across even more segments? Take a random string longer than 64MB or
>> event longer for example.
>
> Though I haven't look closer to how a modification is splitted
> into WAL records. A tuple cannot be so long. As a simple test, I
> observed rechder->xl_tot_len at the end of XLogRecordAssemble
> inserting an about 400KB not-so-compressable string into a text
> column, but I saw a series of many records with shorter than
> several thousand bytes.
>
>> > Other possible measures might be,
>> >
>> > - Allowing switching wal source while reading a continuation
>> > record. Currently ReadRecord assumes that a continuation record
>> > can be read from single source. But this needs refactoring
>> > involving xlog.c, xlogreader.c and relatives.
>>
>> This is scary thinking about back-branches.
>
> Yes. It would be no longer a bug fix. (Or becomes a quite ugly hack..)
>
>> > - Delaying recycling a segment until the last partial record on it
>> > completes. This seems doable in page-wise (coarse resolution)
>> > but would cost additional reading of past xlog files (page
>> > header of past pages is required).
>>
>> Hm, yes. That looks like the least invasive way to go. At least that
>> looks more correct than the others.
>
> The attached patch does that. Usually it reads page headers only
> on segment boundaries, but once continuation record found (or
> failed to read the next page header, that is, the first record on
> the first page in the next segment has not been replicated), it
> becomes to happen on every page boundary until non-continuation
> page comes.

I'm afraid that many WAL segments would start with a continuation record
under a workload of short transactions (e.g., pgbench), which
would make restart_lsn lag behind very much. No?

The discussion on this thread just makes me think that restart_lsn should
indicate the replay location instead of the flush location. This seems safer.

Regards,

--
Fujii Masao


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: masao(dot)fujii(at)gmail(dot)com
Cc: michael(dot)paquier(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-02-02 02:28:29
Message-ID: 20170202.112829.188781915.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Thank you for the comment.

At Thu, 2 Feb 2017 01:26:03 +0900, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote in <CAHGQGwEET=QBA_jND=xhrXn+9ZreP4_qMBAqsBZg56beqxbveg(at)mail(dot)gmail(dot)com>
> > The attached patch does that. Usually it reads page headers only
> > on segment boundaries, but once continuation record found (or
> > failed to read the next page header, that is, the first record on
> > the first page in the next segment has not been replicated), it
> > becomes to happen on every page boundary until non-continuation
> > page comes.
>
> I'm afraid that many WAL segments would start with a continuation record
> when there are the workload of short transactions (e.g., by pgbench), and
> which would make restart_lsn go behind very much. No?

Agreed. So the patch tries to release the lock at every page boundary,
but restart_lsn still falls far behind if that many contiguous pages are
CONTRECORD. I think the chance that such a situation sticks for
one or more segments is negligibly low. That being said, there
*is* a possibility of false continuation, anyway.

> The discussion on this thread just makes me think that restart_lsn should
> indicate the replay location instead of flush location. This seems safer.

The standby restarts from minRecoveryPoint, which is a copy of
XLogCtl->replayEndRecPtr and is updated by
UpdateMinRecoveryPoint(). Meanwhile, applyPtr in reply messages is a
copy of XLogCtl->lastReplayedEndRecPtr, which is updated after the
update of the on-disk minRecoveryPoint. It seems safe from that
viewpoint.

On the other hand, apply is pausable. Records are copied and
flushed on the standby, so the segments on the master that have already
been sent can safely be removed even in that case. In spite of that,
older segments on the master would be kept from being removed during
the pause if applyPtr were used as restart_lsn; this could be
another problem, and it is sure to happen.

I'm not sure how likely it is that several contiguous
segments are full of contpages. But I think it's worse that an apply
pause causes needless pg_wal flooding.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, jdnelson(at)dyn(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-02-02 06:34:33
Message-ID: CAB7nPqQ05G15JooRMEONgPkW0osot77yaFAUF9_6Q8G+v+2+xg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Feb 2, 2017 at 1:26 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> I'm afraid that many WAL segments would start with a continuation record
> when there are the workload of short transactions (e.g., by pgbench), and
> which would make restart_lsn go behind very much. No?

I don't quite understand this argument. Even if there are many small
transactions, that would cause restart_lsn to just be late by one
segment, all the time.

> The discussion on this thread just makes me think that restart_lsn should
> indicate the replay location instead of flush location. This seems safer.

That would penalize WAL retention on the primary with standbys using
recovery_min_apply_delay and a slot for example...

We can attempt to address this problem in two ways. The proposed patch
(ugly, by the way, and there are two typos!) does it in the WAL sender by
not making restart_lsn jump to the next segment if a continuation
record is found. Or we could have the standby request the next
segment instead if the record it wants to replay from is at a boundary
and it locally has the beginning of the record, which it has
because it already confirmed to the primary that it flushed up to the
next segment. Not sure which fix is better, though.
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: masao(dot)fujii(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-02-03 03:16:49
Message-ID: 20170203.121649.184184028.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

At Thu, 2 Feb 2017 15:34:33 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqQ05G15JooRMEONgPkW0osot77yaFAUF9_6Q8G+v+2+xg(at)mail(dot)gmail(dot)com>
> On Thu, Feb 2, 2017 at 1:26 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > I'm afraid that many WAL segments would start with a continuation record
> > when there are the workload of short transactions (e.g., by pgbench), and
> > which would make restart_lsn go behind very much. No?
>
> I don't quite understand this argument. Even if there are many small
> transactions, that would cause restart_lsn to just be late by one
> segment, all the time.
>
> > The discussion on this thread just makes me think that restart_lsn should
> > indicate the replay location instead of flush location. This seems safer.
>
> That would penalize WAL retention on the primary with standbys using
> recovery_min_apply_delay and a slot for example...
>
> We can attempt to address this problem two ways. The patch proposed
> (ugly btw and there are two typos!) is doing it in the WAL sender by
> not making restart_lsn jump to the next segment if a continuation
> record is found.

Sorry for the ug.. :p Anyway, the previous version was not the
latest. The attached one is the revised version. (Sorry, I
couldn't find the typos by myself..)

> Or we could have the standby request for the next
> segment instead if the record it wants to replay from is at a boundary
> and that it locally has the beginning of the record, and it has it
> because it already confirmed to the primary that it flushed to the
> next segment. Not sure which fix is better though.

We could do it as I said, with some refactoring of ReadRecord involving a
reader plugin mechanism..

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Fix-a-bug-of-physical-replication-slot.patch text/x-patch 7.3 KB

From: Venkata B Nagothi <nag1010(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: jdnelson(at)dyn(dot)com, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-13 00:06:00
Message-ID: CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Hello. I added pgsql-hackers.
>
> This occurs also on git master and back to 9.4.
>
> At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnelson(at)dyn(dot)com>
> wrote in <CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw(at)mail(dot)
> gmail.com>
> > On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson(at)dyn(dot)com>
> wrote:
> > > First, the postgresql configuration differs only minimally from the
> stock
> > > config:
> > >
> > > Assume wal_keep_segments = 0.
> > > Assume the use of physical replication slots.
> > > Assume one master, one standby.
> > >
> > > Lastly, we have observed the behavior "in the wild" at least twice and
> in
> > > the lab a dozen or so times.
> > >
> > > EXAMPLE #1 (logs are from the replica):
> > >
> > > user=,db=,app=,client= DEBUG: creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: done creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush
> 6/8A000000
> > > apply 5/748425A0
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush
> 6/8B000000
> > > apply 5/74843020
> > > <control-c here>
> > > user=,db=,app=,client= DEBUG: postmaster received signal 2
> > > user=,db=,app=,client= LOG: received fast shutdown request
> > > user=,db=,app=,client= LOG: aborting any active transactions
> > >
> > > And, upon restart:
> > >
> > > user=,db=,app=,client= LOG: restartpoint starting: xlog
> > > user=,db=,app=,client= DEBUG: updated min recovery point to
> 6/67002390 on
> > > timeline 1
> > > user=,db=,app=,client= DEBUG: performing replication slot checkpoint
> > > user=,db=,app=,client= DEBUG: updated min recovery point to
> 6/671768C0 on
> > > timeline 1
> > > user=,db=,app=,client= CONTEXT: writing block 589 of relation
> > > base/13294/16501
> > > user=,db=,app=,client= LOG: invalid magic number 0000 in log segment
> > > 00000001000000060000008B, offset 0
> > > user=,db=,app=,client= DEBUG: switched WAL source from archive to
> stream
> > > after failure
> > > user=,db=,app=,client= LOG: started streaming WAL from primary at
> > > 6/8A000000 on timeline 1
> > > user=,db=,app=,client= FATAL: could not receive data from WAL stream:
> > > ERROR: requested WAL segment 00000001000000060000008A has already been
> > > removed
>
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.
>
> - M(aster): createdb as a master with wal_keep_segments = 0
> (default), min_log_messages = debug2
> - M: Create a physical repslot.
> - S(tandby): Setup a standby database.
> - S: Edit recovery.conf to use the replication slot above then
> start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
> > LOG: #################### STOP THE SERVER
>
> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: Fails to catch up with the following error.
>
> > FATAL: could not receive data from WAL stream: ERROR: requested WAL
> segment 00000001000000000000002B has already been removed
>
>
I have been testing / reviewing the latest patch
"0001-Fix-a-bug-of-physical-replication-slot.patch" and I think I might
need some more clarification on this.

Before applying the patch, I tried reproducing the above error -

- I had master->standby in streaming replication
- Took the backup of master
- with a low max_wal_size and wal_keep_segments = 0
- Configured standby with recovery.conf
- Created replication slot on master
- Configured the replication slot on standby and started the standby
- I got the below error

>> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length at
0/F2000140: wanted 24, got 0
>> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL from
primary at 0/F2000000 on timeline 1
>> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive data
from WAL stream: ERROR: requested WAL segment 0000000100000000000000F2 has
already been removed

and I noticed that the file "0000000100000000000000F2" had been removed
from the master. This can easily be reproduced, and it occurs
irrespective of whether replication slots are configured.

As long as the file "0000000100000000000000F2" is available on the master,
the standby continues to stream WAL without any issues.

some more details -

Contents of the file "0000000100000000000000F2" on standby before
pg_stop_backup()

rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn:
0/F2000028, prev 0/F1000098, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn:
0/F2000060, prev 0/F2000028, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn:
0/F2000098, prev 0/F2000060, desc: CHECKPOINT_ONLINE redo 0/F2000060; tli
1; prev tli 1; fpw true; xid 0:638; oid 16487; multi 1; offset 0; oldest
xid 544 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp
xid: 0/0; oldest running xid 638; online
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn:
0/F2000108, prev 0/F2000098, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
pg_waldump: FATAL: error in WAL record at 0/F2000108: invalid record
length at 0/F2000140: wanted 24, got 0

Contents of the file on master after pg_stop_backup()

rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn:
0/F2000028, prev 0/F1000098, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn:
0/F2000060, prev 0/F2000028, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn:
0/F2000098, prev 0/F2000060, desc: CHECKPOINT_ONLINE redo 0/F2000060; tli
1; prev tli 1; fpw true; xid 0:638; oid 16487; multi 1; offset 0; oldest
xid 544 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp
xid: 0/0; oldest running xid 638; online
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn:
0/F2000108, prev 0/F2000098, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: Heap2 len (rec/tot): 8/ 7735, tx: 0, lsn:
0/F2000140, prev 0/F2000108, desc: CLEAN remxid 620, blkref #0: rel
1663/13179/2619 blk 2 FPW
rmgr: Heap2 len (rec/tot): 8/ 6863, tx: 0, lsn:
0/F2001F78, prev 0/F2000140, desc: CLEAN remxid 620, blkref #0: rel
1663/13179/2840 blk 0 FPW
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn:
0/F2003A60, prev 0/F2001F78, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn:
0/F2003A98, prev 0/F2003A60, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn:
0/F2003AD0, prev 0/F2003A98, desc: CHECKPOINT_ONLINE redo 0/F2003A98; tli
1; prev tli 1; fpw true; xid 0:638; oid 16487; multi 1; offset 0; oldest
xid 544 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp
xid: 0/0; oldest running xid 638; online
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn:
0/F2003B40, prev 0/F2003AD0, desc: RUNNING_XACTS nextXid 638
latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG len (rec/tot): 8/ 34, tx: 0, lsn:
0/F2003B78, prev 0/F2003B40, desc: BACKUP_END 0/F2000060
rmgr: XLOG len (rec/tot): 0/ 24, tx: 0, lsn:
0/F2003BA0, prev 0/F2003B78, desc: SWITCH

If the scenario I created to reproduce the error is correct, then applying
the patch does not make a difference.

I think I need help building a specific test case that reproduces the
specific bug related to physical replication slots as reported.

I will continue to review the patch and will follow up once I have any
comments on this.

Regards,
Venkata B N

Database Consultant


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: nag1010(at)gmail(dot)com
Cc: jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-17 07:48:27
Message-ID: 20170317.164827.46663014.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello,

At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg(at)mail(dot)gmail(dot)com>
> On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
> >
> > - M(aster): createdb as a master with wal_keep_segments = 0
> > (default), min_log_messages = debug2
> > - M: Create a physical repslot.
> > - S(tandby): Setup a standby database.
> > - S: Edit recovery.conf to use the replication slot above then
> > start it.
> > - S: touch /tmp/hoge
> > - M: Run pgbench ...
> > - S: After a while, the standby stops.
> > > LOG: #################### STOP THE SERVER
> >
> > - M: Stop pgbench.
> > - M: Do 'checkpoint;' twice.
> > - S: rm /tmp/hoge
> > - S: Fails to catch up with the following error.
> >
> > > FATAL: could not receive data from WAL stream: ERROR: requested WAL
> > segment 00000001000000000000002B has already been removed
> >
> >
> I have been testing / reviewing the latest patch
> "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> need some more clarification on this.
>
> Before applying the patch, I tried re-producing the above error -
>
> - I had master->standby in streaming replication
> - Took the backup of master
> - with a low max_wal_size and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
> - Configured the replication slot on standby and started the standby

I suppose "configured" here means setting primary_slot_name in recovery.conf.
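
On the standby side that would look roughly like the following (host, user
and the slot name 's1' are just placeholders; a sketch only):

# append the slot settings to the standby's recovery.conf
cat >> /path/to/standby_pgdata/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=master_host port=5432 user=repl'
primary_slot_name = 's1'
EOF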

> - I got the below error
>
> >> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length at
> 0/F2000140: wanted 24, got 0
> >> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL from
> primary at 0/F2000000 on timeline 1
> >> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive data
> from WAL stream: ERROR: requested WAL segment 0000000100000000000000F2 has
> already been removed

Maybe you created the slot on the master in non-reserve (default) mode
and paused for a few minutes after taking the backup and before
starting the standby. In that case the master's slot does not keep
WAL segments until the standby connects, so a couple of
checkpoints can blow away the first segment required by the
standby. This is quite reasonable behavior. The following steps
make this more reproducible.

> - Took the backup of master
> - with a low max_wal_size = 2 and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
+ - SELECT pg_switch_wal(); on master twice.
+ - checkpoint; on master twice.
> - Configured the replication slot on standby and started the standby

Creating the slot with the following command, which reserves WAL
immediately, avoids this case.

=# select pg_create_physical_replication_slot('s1', true);
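
Right after creating it that way, restart_lsn should already be set; a
minimal check (assuming the slot is named 's1'):

# restart_lsn should be non-NULL immediately when the slot is created
# with the second argument (reserve) set to true
psql -c "SELECT slot_name, restart_lsn FROM pg_replication_slots WHERE slot_name = 's1';"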

> and i could notice that the file "0000000100000000000000F2" was removed
> from the master. This can be easily re-produced and this occurs
> irrespective of configuring replication slots.
>
> As long as the file "0000000100000000000000F2" is available on the master,
> standby continues to stream WALs without any issues.
...
> If the scenario i created to reproduce the error is correct, then, applying
> the patch is not making a difference.

Yes, the patch is not meant to save this case. The patch addresses the
case where the segment just before the first segment required by the
standby has been removed, while it contains the first part of a record
that continues into that required segment. In your case, by contrast,
the segment at the standby's start point itself has simply been
removed.

> I think, i need help in building a specific test case which will re-produce
> the specific BUG related to physical replication slots as reported ?
>
> Will continue to review the patch, once i have any comments on this.

Thanks a lot!

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: nag1010(at)gmail(dot)com
Cc: jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-28 06:51:00
Message-ID: 20170328.155100.219725603.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

This conflicts with 6912acc (replication lag tracker), so I just
rebased it on a6f22e8.

At Fri, 17 Mar 2017 16:48:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170317(dot)164827(dot)46663014(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Hello,
>
> At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg(at)mail(dot)gmail(dot)com>
> > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > I managed to reproduce this. A little tweak as the first patch
> > > lets the standby to suicide as soon as walreceiver sees a
> > > contrecord at the beginning of a segment.
> > >
> > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > (default), min_log_messages = debug2
> > > - M: Create a physical repslot.
> > > - S(tandby): Setup a standby database.
> > > - S: Edit recovery.conf to use the replication slot above then
> > > start it.
> > > - S: touch /tmp/hoge
> > > - M: Run pgbench ...
> > > - S: After a while, the standby stops.
> > > > LOG: #################### STOP THE SERVER
> > >
> > > - M: Stop pgbench.
> > > - M: Do 'checkpoint;' twice.
> > > - S: rm /tmp/hoge
> > > - S: Fails to catch up with the following error.
> > >
> > > > FATAL: could not receive data from WAL stream: ERROR: requested WAL
> > > segment 00000001000000000000002B has already been removed
> > >
> > >
> > I have been testing / reviewing the latest patch
> > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > need some more clarification on this.
> >
> > Before applying the patch, I tried re-producing the above error -
> >
> > - I had master->standby in streaming replication
> > - Took the backup of master
> > - with a low max_wal_size and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> > - Configured the replication slot on standby and started the standby
>
> I suppose the "configure" means primary_slot_name in recovery.conf.
>
> > - I got the below error
> >
> > >> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length at
> > 0/F2000140: wanted 24, got 0
> > >> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL from
> > primary at 0/F2000000 on timeline 1
> > >> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive data
> > from WAL stream: ERROR: requested WAL segment 0000000100000000000000F2 has
> > already been removed
>
> Maybe you created the master slot with non-reserve (default) mode
> and put a some-minites pause after making the backup and before
> starting the standby. For the case the master slot doesn't keep
> WAL segments unless the standby connects so a couple of
> checkpoints can blow away the first segment required by the
> standby. This is quite reasonable behavior. The following steps
> makes this more sure.
>
> > - Took the backup of master
> > - with a low max_wal_size = 2 and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> + - SELECT pg_switch_wal(); on master twice.
> + - checkpoint; on master twice.
> > - Configured the replication slot on standby and started the standby
>
> Creating the slot with the following command will save it.
>
> =# select pg_create_physical_replication_slot('s1', true);
>
>
> > and i could notice that the file "0000000100000000000000F2" was removed
> > from the master. This can be easily re-produced and this occurs
> > irrespective of configuring replication slots.
> >
> > As long as the file "0000000100000000000000F2" is available on the master,
> > standby continues to stream WALs without any issues.
> ...
> > If the scenario i created to reproduce the error is correct, then, applying
> > the patch is not making a difference.
>
> Yes, the patch is not for saving this case. The patch saves the
> case where the previous segment to the first required segment by
> standby was removed and it contains the first part of a record
> continues to the first required segment. On the other hand this
> case is that the segment at the start point of standby is just
> removed.
>
> > I think, i need help in building a specific test case which will re-produce
> > the specific BUG related to physical replication slots as reported ?
> >
> > Will continue to review the patch, once i have any comments on this.
>
> Thaks a lot!

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch text/x-patch 7.2 KB

From: Venkata B Nagothi <nag1010(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jonathon Nelson <jdnelson(at)dyn(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-29 23:49:21
Message-ID: CAEyp7J_k-_qEE74B039XCK5UXVK_vqFZqdzx4GNpXLrdhAjzXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Regards,

Venkata B N
Database Consultant

On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> This conflicts with 6912acc (replication lag tracker) so just
> rebased on a6f22e8.
>

I tried applying this patch to the latest master; it does not apply:

[dba(at)buildhost postgresql]$ git apply
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
trailing whitespace.
/*
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
trailing whitespace.
* This variable corresponds to restart_lsn in pg_replication_slots for a
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
trailing whitespace.
* physical slot. This has a valid value only when it differs from the
current
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
trailing whitespace.
* flush pointer.
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
trailing whitespace.
*/
error: patch failed: src/backend/replication/walsender.c:210
error: src/backend/replication/walsender.c: patch does not apply

Regards,

Venkata Balaji N
Database Consultant


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Venkata B Nagothi <nag1010(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jonathon Nelson <jdnelson(at)dyn(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-29 23:55:08
Message-ID: CAB7nPqQjucAe7GqBHsYnjW+-68-Zh7Y0JBmRNGvwXFQVRkOFBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote:
> On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I tried applying this patch to latest master, it is not getting applied
>
> [dba(at)buildhost postgresql]$ git apply
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> trailing whitespace.
> /*
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> trailing whitespace.
> * This variable corresponds to restart_lsn in pg_replication_slots for a
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
> trailing whitespace.
> * physical slot. This has a valid value only when it differs from the
> current
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
> trailing whitespace.
> * flush pointer.
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
> trailing whitespace.
> */
> error: patch failed: src/backend/replication/walsender.c:210
> error: src/backend/replication/walsender.c: patch does not apply

git apply and git am can be very picky sometimes, so you may want to
fall back to patch -p1 if they don't work. In this case it does.
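For instance, from the top of the source tree, roughly:

cd /path/to/postgresql    # top of the git checkout
patch -p1 < /path/to/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch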
--
Michael


From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: nag1010(at)gmail(dot)com, horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-30 00:12:56
Message-ID: 20170330.091256.279023036219349264.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

> On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote:
>> On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
>> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> I tried applying this patch to latest master, it is not getting applied
>>
>> [dba(at)buildhost postgresql]$ git apply
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
>> trailing whitespace.
>> /*
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
>> trailing whitespace.
>> * This variable corresponds to restart_lsn in pg_replication_slots for a
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
>> trailing whitespace.
>> * physical slot. This has a valid value only when it differs from the
>> current
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
>> trailing whitespace.
>> * flush pointer.
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
>> trailing whitespace.
>> */
>> error: patch failed: src/backend/replication/walsender.c:210
>> error: src/backend/replication/walsender.c: patch does not apply
>
> git apply and git am can be very picky sometimes, so you may want to
> fallback to patch -p1 if things don't work. In this case it does.

Committers will not apply patches that have trailing whitespace
issues, so the patch submitter needs to fix them anyway.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


From: Venkata B Nagothi <nag1010(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jonathon Nelson <jdnelson(at)dyn(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-30 00:12:56
Message-ID: CAEyp7J9ZuuGc_Ndgw-rQF5oWUPH+p4=p7XE-B5cC3QRr6SUT7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Mar 30, 2017 at 10:55 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com
> wrote:

> On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010(at)gmail(dot)com>
> wrote:
> > On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> > <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > I tried applying this patch to latest master, it is not getting applied
> >
> > [dba(at)buildhost postgresql]$ git apply
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> > trailing whitespace.
> > /*
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> > trailing whitespace.
> > * This variable corresponds to restart_lsn in pg_replication_slots for a
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
> > trailing whitespace.
> > * physical slot. This has a valid value only when it differs from the
> > current
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
> > trailing whitespace.
> > * flush pointer.
> > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
> > trailing whitespace.
> > */
> > error: patch failed: src/backend/replication/walsender.c:210
> > error: src/backend/replication/walsender.c: patch does not apply
>
> git apply and git am can be very picky sometimes, so you may want to
> fallback to patch -p1 if things don't work. In this case it does.
>

patch -p1 seems to be working. Thanks !

Regards,

Venkata B N
Database Consultant


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
Cc: nag1010(at)gmail(dot)com, horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-30 00:41:30
Message-ID: CAB7nPqSfXcptn-V-KGvh1GShRF42Tdv9Nfdjcr1NdH8h1rxJVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Mar 30, 2017 at 9:12 AM, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> wrote:
> Committers will not apply patches which has trailing whitespace
> issues. So the patch submitter needs to fix them anyway.

I cannot comment on that point (committers are free to pick up things
the way they want), but merely using git commands to apply a patch
should not be an obstacle to a review, as long as the patch can be
applied easily and roughly respects GNU's diff format.
--
Michael


From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: ishii(at)sraoss(dot)co(dot)jp, nag1010(at)gmail(dot)com, horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-30 00:51:47
Message-ID: 20170330.095147.1660425449608313162.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

> On Thu, Mar 30, 2017 at 9:12 AM, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> wrote:
>> Committers will not apply patches which has trailing whitespace
>> issues. So the patch submitter needs to fix them anyway.
>
> I cannot comment on that point (committers are free to pick up things
> the way they want), but just using git commands to apply a patch
> should not be an obstacle for a review if a patch can be easily
> applied as long as they roughly respect GNU's diff format.

My point is the coding standard. Having trailing whitespace is
against our coding standard, and committers should not accept such
code, I believe.
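
As a hint, something like the following catches these issues before a
patch is sent out (just a rough sketch, not a policy):

# in the working tree, before generating the patch
git diff --check
# or check an existing patch file without applying it
git apply --check --whitespace=error /path/to/your.patch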

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: nag1010(at)gmail(dot)com
Cc: michael(dot)paquier(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-30 04:51:56
Message-ID: 20170330.135156.18332040.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

At Thu, 30 Mar 2017 11:12:56 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote in <CAEyp7J9ZuuGc_Ndgw-rQF5oWUPH+p4=p7XE-B5cC3QRr6SUT7w(at)mail(dot)gmail(dot)com>
> On Thu, Mar 30, 2017 at 10:55 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com
> > wrote:
>
> > On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010(at)gmail(dot)com>
> > wrote:
> > > On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> > > <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > I tried applying this patch to latest master, it is not getting applied
> > >
> > > [dba(at)buildhost postgresql]$ git apply
> > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> > > trailing whitespace.
> > > /*
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> > > trailing whitespace.
> > > * This variable corresponds to restart_lsn in pg_replication_slots for a
...
> > git apply and git am can be very picky sometimes, so you may want to
> > fallback to patch -p1 if things don't work. In this case it does.
> >
>
> patch -p1 seems to be working. Thanks !

That's quite strange. The patch I sent doesn't contain trailing
spaces at all. The cited lines don't seem to contain them. It
applied cleanly with "git am" for me.

The file restored from the attachment of the received mail doesn't
contain them either.

The original file contains the following:

0002440 66 6f 72 20 61 0a 2b 20 2a 20 70 68 79 73 69 63
f o r a \n + * p h y s i c

The corresponding part of the file restored from the mail on Windows
is the following:
0002460 63 61 74 69 6f 6e 5f 73 6c 6f 74 73 20 66 6f 72
c a t i o n _ s l o t s f o r
0002500 20 61 0d 0a 2b 20 2a 20 70 68 79 73 69 63 61 6c
a \r \n + * p h y s i c a l

Neither contains a space at the end of a line. How did you
retrieve the patch from the mail?
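
A quick way to check whether the copy you are applying picked up CRs
would be roughly:

file 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch    # reports "CRLF line terminators" if CRs crept in
grep -c $'\r' 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch    # counts lines containing a CR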

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Venkata B Nagothi <nag1010(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jonathon Nelson <jdnelson(at)dyn(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-30 04:59:14
Message-ID: CAEyp7J-TxoLZGh2qioA9nJJepiXUvvsAWrk=BxwU0J05X_j9_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Mar 30, 2017 at 3:51 PM, Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> At Thu, 30 Mar 2017 11:12:56 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com>
> wrote in <CAEyp7J9ZuuGc_Ndgw-rQF5oWUPH+p4=p7XE-B5cC3QRr6SUT7w(at)mail(dot)
> gmail.com>
> > On Thu, Mar 30, 2017 at 10:55 AM, Michael Paquier <
> michael(dot)paquier(at)gmail(dot)com
> > > wrote:
> >
> > > On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010(at)gmail(dot)com>
> > > wrote:
> > > > On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> > > > <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > > I tried applying this patch to latest master, it is not getting
> applied
> > > >
> > > > [dba(at)buildhost postgresql]$ git apply
> > > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> > > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> > > > trailing whitespace.
> > > > /*
> > > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> > > > trailing whitespace.
> > > > * This variable corresponds to restart_lsn in pg_replication_slots
> for a
> ...
> > > git apply and git am can be very picky sometimes, so you may want to
> > > fallback to patch -p1 if things don't work. In this case it does.
> > >
> >
> > patch -p1 seems to be working. Thanks !
>
> That's quite strange. The patch I sent doesn't cantain trailing
> spaces at all. The cited lines doesn't seem to contain them. It
> applied cleanly with "git am" for me.
>
> The file restored from the attachment of received mail also don't.
>
> The original files contains the following,
>
> 0002440 66 6f 72 20 61 0a 2b 20 2a 20 70 68 79 73 69 63
> f o r a \n + * p h y s i c
>
> The corresponding part of the file restored from mail on Windows
> is the following,
> 0002460 63 61 74 69 6f 6e 5f 73 6c 6f 74 73 20 66 6f 72
> c a t i o n _ s l o t s f o r
> 0002500 20 61 0d 0a 2b 20 2a 20 70 68 79 73 69 63 61 6c
> a \r \n + * p h y s i c a l
>
> Both doesn't contain a space at the end of a line. How did you
> retrieve the patch from the mail?
>

Yes, I downloaded it from the email on Windows, copied it across to
Linux, and did "git apply".

Regards,

Venkata B N
Database Consultant


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: nag1010(at)gmail(dot)com
Cc: michael(dot)paquier(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-30 05:46:35
Message-ID: 20170330.144635.187857856.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello,

At Thu, 30 Mar 2017 15:59:14 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote in <CAEyp7J-TxoLZGh2qioA9nJJepiXUvvsAWrk=BxwU0J05X_j9_A(at)mail(dot)gmail(dot)com>
> Yes, downloaded from the email on Windows and copied across to Linux and
> did "git apply".

The same works for me, but --keep-cr gave me the same result as
you:

> $ git am --keep-cr ~/work/patches/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> Applying: Fix a bug of physical replication slot.
> .git/rebase-apply/patch:13: trailing whitespace.
> /*

https://git-scm.com/docs/git-am

| --[no-]keep-cr
|
| With --keep-cr, call git mailsplit (see git-mailsplit[1]) with
| the same option, to prevent it from stripping CR at the end of
| lines. am.keepcr configuration variable can be used to specify
| the default behaviour. --no-keep-cr is useful to override
| am.keepcr.

I don't know why it preserves CRs only on those lines, but anyway,
don't you have am.keepcr in your configuration?
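
You can check for the setting and strip the CRs with something like this
(just a sketch):

git config --get am.keepcr    # shows whether am.keepcr is set in your config
sed -i 's/\r$//' 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch    # strip trailing CRs before applying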

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Venkata B Nagothi <nag1010(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jonathon Nelson <jdnelson(at)dyn(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-30 09:59:15
Message-ID: CAEyp7J-MZg1b3hehQYLsBGc-xeO5gptpCyvYMiUkDWzaTroKXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Mar 30, 2017 at 4:46 PM, Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Hello,
>
> At Thu, 30 Mar 2017 15:59:14 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com>
> wrote in <CAEyp7J-TxoLZGh2qioA9nJJepiXUvvsAWrk=B
> xwU0J05X_j9_A(at)mail(dot)gmail(dot)com>
> > Yes, downloaded from the email on Windows and copied across to Linux and
> > did "git apply".
>
> The same works for me. But --keep-cr gave me the same result with
> you.
>
> > $ git am --keep-cr ~/work/patches/0001-Fix-a-bug-
> of-physical-replication-slot_a6f22e8.patch
> > Applying: Fix a bug of physical replication slot.
> > .git/rebase-apply/patch:13: trailing whitespace.
> > /*
>

For me too:

[dba(at)buildhost postgresql]$ git am --keep-cr
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
Applying: Fix a bug of physical replication slot.
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:13: trailing
whitespace.
/*
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:14: trailing
whitespace.
* This variable corresponds to restart_lsn in pg_replication_slots for a
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:15: trailing
whitespace.
* physical slot. This has a valid value only when it differs from the
current
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:16: trailing
whitespace.
* flush pointer.
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:17: trailing
whitespace.
*/
error: patch failed: src/backend/replication/walsender.c:210
error: src/backend/replication/walsender.c: patch does not apply
Patch failed at 0001 Fix a bug of physical replication slot.
The copy of the patch that failed is found in:
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch
When you have resolved this problem, run "git am --resolved".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

> https://git-scm.com/docs/git-am
>
> | --[no-]keep-cr
> |
> | With --keep-cr, call git mailsplit (see git-mailsplit[1]) with
> | the same option, to prevent it from stripping CR at the end of
> | lines. am.keepcr configuration variable can be used to specify
> | the default behaviour. --no-keep-cr is useful to override
> | am.keepcr.
>
> I don't know why it preserves CRs only for the lines, but anyway,
> don't you have am.keepcr in you configuration?
>

Maybe; I do not think I have am.keepcr in my configuration, but I am not
100% sure of it.

I only did "git apply ...", which produced the whitespace errors.

Regards,

Venkata B N
Database Consultant


From: Venkata B Nagothi <nag1010(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jonathon Nelson <jdnelson(at)dyn(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-31 02:40:00
Message-ID: CAEyp7J-BmsMqCVO7zX5vpEoEuRiM+KVPZPJzodUqOdOTZsLpqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Hello,
>
> At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com>
> wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg(at)mail(dot)
> gmail.com>
> > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > I managed to reproduce this. A little tweak as the first patch
> > > lets the standby to suicide as soon as walreceiver sees a
> > > contrecord at the beginning of a segment.
> > >
> > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > (default), min_log_messages = debug2
> > > - M: Create a physical repslot.
> > > - S(tandby): Setup a standby database.
> > > - S: Edit recovery.conf to use the replication slot above then
> > > start it.
> > > - S: touch /tmp/hoge
> > > - M: Run pgbench ...
> > > - S: After a while, the standby stops.
> > > > LOG: #################### STOP THE SERVER
> > >
> > > - M: Stop pgbench.
> > > - M: Do 'checkpoint;' twice.
> > > - S: rm /tmp/hoge
> > > - S: Fails to catch up with the following error.
> > >
> > > > FATAL: could not receive data from WAL stream: ERROR: requested
> WAL
> > > segment 00000001000000000000002B has already been removed
> > >
> > >
> > I have been testing / reviewing the latest patch
> > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > need some more clarification on this.
> >
> > Before applying the patch, I tried re-producing the above error -
> >
> > - I had master->standby in streaming replication
> > - Took the backup of master
> > - with a low max_wal_size and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> > - Configured the replication slot on standby and started the standby
>
> I suppose the "configure" means primary_slot_name in recovery.conf.
>
> > - I got the below error
> >
> > >> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length at
> > 0/F2000140: wanted 24, got 0
> > >> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL from
> > primary at 0/F2000000 on timeline 1
> > >> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive data
> > from WAL stream: ERROR: requested WAL segment 0000000100000000000000F2
> has
> > already been removed
>
> Maybe you created the master slot with non-reserve (default) mode
> and put a some-minites pause after making the backup and before
> starting the standby. For the case the master slot doesn't keep
> WAL segments unless the standby connects so a couple of
> checkpoints can blow away the first segment required by the
> standby. This is quite reasonable behavior. The following steps
> makes this more sure.
>
> > - Took the backup of master
> > - with a low max_wal_size = 2 and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> + - SELECT pg_switch_wal(); on master twice.
> + - checkpoint; on master twice.
> > - Configured the replication slot on standby and started the standby
>
> Creating the slot with the following command will save it.
>
> =# select pg_create_physical_replication_slot('s1', true);
>

I did a test again after applying the patch, and I am not sure whether
the patch is doing the right thing.

Here is the test case:

- I ran pgbench
- I took the backup of the master first

- Below are the WAL files on the master after pg_stop_backup():

postgres=# select pg_stop_backup();

NOTICE: WAL archiving is not enabled; you must ensure that all required
WAL segments are copied through other means to complete the backup
pg_stop_backup
----------------
0/8C000130
(1 row)

postgres=# \q
[dba(at)buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
total 65M
drwx------. 2 dba dba 4.0K Mar 31 09:36 archive_status
-rw-------. 1 dba dba 16M Mar 31 11:09 00000001000000000000008E
-rw-------. 1 dba dba 16M Mar 31 11:17 00000001000000000000008F
-rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008C
-rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008D

- After the backup, I created the physical replication slot

postgres=# select pg_create_physical_replication_slot('repslot',true);

pg_create_physical_replication_slot
-------------------------------------
(repslot,0/8D000028)
(1 row)

postgres=# select pg_walfile_name('0/8D000028');

pg_walfile_name
---------------------------------------
00000001000000000000008D
(1 row)

Here, when the standby starts, it asks for the file
00000001000000000000008C, which is the first file the standby needs.
Since I applied your patch, I am assuming that the file
00000001000000000000008C should be retained rather than removed -
correct?
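
For reference, the first segment the standby needs can be read from the
backup_label included in the backup; a rough check (the path is a
placeholder):

grep 'START WAL LOCATION' /path/to/standby_pgdata/backup_label
# in this scenario it should name file 00000001000000000000008C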

- I started the standby and the error below occurred:

>> 2017-03-31 11:26:01.288 AEDT [17475] LOG: invalid record length at
0/8C000108: wanted 24, got 0
>> 2017-03-31 11:26:01.291 AEDT [17486] LOG: started streaming WAL from
primary at 0/8C000000 on timeline 1
>> 2017-03-31 11:26:01.291 AEDT [17486] FATAL: could not receive data from
WAL stream: ERROR: requested WAL segment 00000001000000000000008C has
already been removed

> and i could notice that the file "0000000100000000000000F2" was removed
> > from the master. This can be easily re-produced and this occurs
> > irrespective of configuring replication slots.
> >
> > As long as the file "0000000100000000000000F2" is available on the
> master,
> > standby continues to stream WALs without any issues.
> ...
> > If the scenario i created to reproduce the error is correct, then,
> applying
> > the patch is not making a difference.
>
> Yes, the patch is not for saving this case. The patch saves the
> case where the previous segment to the first required segment by
> standby was removed and it contains the first part of a record
> continues to the first required segment. On the other hand this
> case is that the segment at the start point of standby is just
> removed.
>

Which means the file 00000001000000000000008C must be retained, as it is
the first file the standby is looking for - correct?

Regards,
Venkata Balaji N

Database Consultant


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: nag1010(at)gmail(dot)com
Cc: jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-31 05:05:13
Message-ID: 20170331.140513.226934614.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Thank you for having a look at this.

# I removed -bugs in CC:.

At Fri, 31 Mar 2017 13:40:00 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote in <CAEyp7J-BmsMqCVO7zX5vpEoEuRiM+KVPZPJzodUqOdOTZsLpqQ(at)mail(dot)gmail(dot)com>
> On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <
> horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
> > Hello,
> >
> > At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com>
> > wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg(at)mail(dot)
> > gmail.com>
> > > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > > horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > > I managed to reproduce this. A little tweak as the first patch
> > > > lets the standby to suicide as soon as walreceiver sees a
> > > > contrecord at the beginning of a segment.
> > > >
> > > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > > (default), min_log_messages = debug2
> > > > - M: Create a physical repslot.
> > > > - S(tandby): Setup a standby database.
> > > > - S: Edit recovery.conf to use the replication slot above then
> > > > start it.
> > > > - S: touch /tmp/hoge
> > > > - M: Run pgbench ...
> > > > - S: After a while, the standby stops.
> > > > > LOG: #################### STOP THE SERVER
> > > >
> > > > - M: Stop pgbench.
> > > > - M: Do 'checkpoint;' twice.
> > > > - S: rm /tmp/hoge
> > > > - S: Fails to catch up with the following error.
> > > >
> > > > > FATAL: could not receive data from WAL stream: ERROR: requested
> > WAL
> > > > segment 00000001000000000000002B has already been removed
> > > >
> > > >
> > > I have been testing / reviewing the latest patch
> > > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > > need some more clarification on this.
> > >
> > > Before applying the patch, I tried re-producing the above error -
> > >
> > > - I had master->standby in streaming replication
> > > - Took the backup of master
> > > - with a low max_wal_size and wal_keep_segments = 0
> > > - Configured standby with recovery.conf
> > > - Created replication slot on master
> > > - Configured the replication slot on standby and started the standby
> >
> > I suppose the "configure" means primary_slot_name in recovery.conf.
> >
> > > - I got the below error
> > >
> > > >> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length at
> > > 0/F2000140: wanted 24, got 0
> > > >> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL from
> > > primary at 0/F2000000 on timeline 1
> > > >> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive data
> > > from WAL stream: ERROR: requested WAL segment 0000000100000000000000F2
> > has
> > > already been removed
> >
> > Maybe you created the master slot with non-reserve (default) mode
> > and put a some-minites pause after making the backup and before
> > starting the standby. For the case the master slot doesn't keep
> > WAL segments unless the standby connects so a couple of
> > checkpoints can blow away the first segment required by the
> > standby. This is quite reasonable behavior. The following steps
> > makes this more sure.
> >
> > > - Took the backup of master
> > > - with a low max_wal_size = 2 and wal_keep_segments = 0
> > > - Configured standby with recovery.conf
> > > - Created replication slot on master
> > + - SELECT pg_switch_wal(); on master twice.
> > + - checkpoint; on master twice.
> > > - Configured the replication slot on standby and started the standby
> >
> > Creating the slot with the following command will save it.
> >
> > =# select pg_create_physical_replication_slot('s1', true);
> >
>
> I did a test again, by applying the patch and I am not sure if the patch is
> doing the right thing ?

This is a bit difficult to make happen.

> Here is test case -
>
> - I ran pgbench
> - I took the backup of the master first
>
> - Below are the WALs on master after the stop backup -
>
> postgres=# select pg_stop_backup();
>
> NOTICE: WAL archiving is not enabled; you must ensure that all required
> WAL segments are copied through other means to complete the backup
> pg_stop_backup
> ----------------
> 0/8C000130
> (1 row)
>
> postgres=# \q
> [dba(at)buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
> total 65M
> drwx------. 2 dba dba 4.0K Mar 31 09:36 archive_status
> -rw-------. 1 dba dba 16M Mar 31 11:09 00000001000000000000008E
> -rw-------. 1 dba dba 16M Mar 31 11:17 00000001000000000000008F
> -rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008C
> -rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008D
>
> - After the backup, i created the physical replication slot
>
>
> postgres=# select pg_create_physical_replication_slot('repslot',true);
>
> pg_create_physical_replication_slot
> -------------------------------------
> (repslot,0/8D000028)
> (1 row)

At this point, segment 8C is not protected from removal. It's too
late if the first record in 8D is a continuation from 8C.

> postgres=# select pg_walfile_name('0/8D000028');
>
> pg_walfile_name
> ---------------------------------------
> 00000001000000000000008D
> (1 row)
>
> Here, When you start the standby, it would ask for the file
> 00000001000000000000008C, which is the first file needed for the standby
> and since i applied your patch, i am assuming that, the file
> 00000001000000000000008C should also be retained without being removed -
> correct ?

Yes, if the repslot were still in 8C. So this is not that case.

> - I started the standby and the below error occurs
>
> >> 2017-03-31 11:26:01.288 AEDT [17475] LOG: invalid record length at
> 0/8C000108: wanted 24, got 0
> >> 2017-03-31 11:26:01.291 AEDT [17486] LOG: started streaming WAL from
> primary at 0/8C000000 on timeline 1
> >> 2017-03-31 11:26:01.291 AEDT [17486] FATAL: could not receive data from
> WAL stream: ERROR: requested WAL segment 00000001000000000000008C has
> already been removed

LoL. You hit this in a single shot! However, this case cannot be
saved by this patch, since the repslot is already in 8D. In
other words, a newly created slot doesn't protect the segments
preceding its position at the time of creation. The existence of the
required WAL segments is not guaranteed without a repslot anyway.
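
A way to see which segment the slot is protecting at a given moment is
roughly the following (slot name taken from your test):

psql -c "SELECT slot_name, restart_lsn, pg_walfile_name(restart_lsn)
         FROM pg_replication_slots WHERE slot_name = 'repslot';"
# if this names a file in segment 8D, segment 8C is not protected by the slot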

> > and i could notice that the file "0000000100000000000000F2" was removed
> > > from the master. This can be easily re-produced and this occurs
> > > irrespective of configuring replication slots.
> > >
> > > As long as the file "0000000100000000000000F2" is available on the
> > master,
> > > standby continues to stream WALs without any issues.
> > ...
> > > If the scenario i created to reproduce the error is correct, then,
> > applying
> > > the patch is not making a difference.
> >
> > Yes, the patch is not for saving this case. The patch saves the
> > case where the previous segment to the first required segment by
> > standby was removed and it contains the first part of a record
> > continues to the first required segment. On the other hand this
> > case is that the segment at the start point of standby is just
> > removed.
> >
>
> Which means, the file 00000001000000000000008C must be retained as it is
> the first file standby is looking for - correct ?

Yes, but if and only if the repslot has been successfully connected to
at least once. So I attached an additional patch to make it easier to
cause the problematic situation. The attached patch makes the standby
kill itself as soon as the condition for this bug arises.

I wrote in <20170117(dot)193645(dot)160386781(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>:
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.
>
> - M(aster): createdb as a master with wal_keep_segments = 0
> (default), min_log_messages = debug2
> - M: Create a physical repslot.
> - S(tandby): Setup a standby database.
> - S: Edit recovery.conf to use the replication slot above then
> start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
> > LOG: #################### STOP THE SERVER

At this point, pg_replication_slots.restart_lsn should be in the
segment previous to the one containing the LSN shown in the last
"LOG: ####.. CHECK AT" log line. For example, in the case below:

> LOG: ############# CHECK AT 5000000 : 1
> LOG: #################### STOP THE SERVER

The master responds to the query "select restart_lsn from
pg_replication_slots" with '0/4F......' with this patch but
'0/50......' without this patch.

> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: Fails to catch up with the following error.
>
> > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002B has already been removed

One other point of this patch is that the repslot *should not* retain
needless segments. But this doesn't seem externally
confirmable...

regards,

Attachment Content-Type Size
repl_border_bug_test.patch text/x-patch 1.0 KB

From: Venkata B Nagothi <nag1010(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jonathon Nelson <jdnelson(at)dyn(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-04-02 22:19:20
Message-ID: CAEyp7J9A1-hzdLfoTKaKp=OO6ichifXaYzvOzAzP-NNfnTX27A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Fri, Mar 31, 2017 at 4:05 PM, Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Thank you having a look on this.
>
> # I removed -bugs in CC:.
>
> At Fri, 31 Mar 2017 13:40:00 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com>
> wrote in <CAEyp7J-BmsMqCVO7zX5vpEoEuRiM+KVPZPJzodUqOdOTZsLpqQ(at)mail(dot)
> gmail.com>
> > On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <
> > horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >
> > > Hello,
> > >
> > > At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <
> nag1010(at)gmail(dot)com>
> > > wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg(at)mail(dot)
> > > gmail.com>
> > > > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > > > horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > > > I managed to reproduce this. A little tweak as the first patch
> > > > > lets the standby to suicide as soon as walreceiver sees a
> > > > > contrecord at the beginning of a segment.
> > > > >
> > > > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > > > (default), min_log_messages = debug2
> > > > > - M: Create a physical repslot.
> > > > > - S(tandby): Setup a standby database.
> > > > > - S: Edit recovery.conf to use the replication slot above then
> > > > > start it.
> > > > > - S: touch /tmp/hoge
> > > > > - M: Run pgbench ...
> > > > > - S: After a while, the standby stops.
> > > > > > LOG: #################### STOP THE SERVER
> > > > >
> > > > > - M: Stop pgbench.
> > > > > - M: Do 'checkpoint;' twice.
> > > > > - S: rm /tmp/hoge
> > > > > - S: Fails to catch up with the following error.
> > > > >
> > > > > > FATAL: could not receive data from WAL stream: ERROR:
> requested
> > > WAL
> > > > > segment 00000001000000000000002B has already been removed
> > > > >
> > > > >
> > > > I have been testing / reviewing the latest patch
> > > > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i
> might
> > > > need some more clarification on this.
> > > >
> > > > Before applying the patch, I tried re-producing the above error -
> > > >
> > > > - I had master->standby in streaming replication
> > > > - Took the backup of master
> > > > - with a low max_wal_size and wal_keep_segments = 0
> > > > - Configured standby with recovery.conf
> > > > - Created replication slot on master
> > > > - Configured the replication slot on standby and started the standby
> > >
> > > I suppose the "configure" means primary_slot_name in recovery.conf.
> > >
> > > > - I got the below error
> > > >
> > > > >> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length
> at
> > > > 0/F2000140: wanted 24, got 0
> > > > >> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL
> from
> > > > primary at 0/F2000000 on timeline 1
> > > > >> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive
> data
> > > > from WAL stream: ERROR: requested WAL segment
> 0000000100000000000000F2
> > > has
> > > > already been removed
> > >
> > > Maybe you created the master slot with non-reserve (default) mode
> > > and put a some-minites pause after making the backup and before
> > > starting the standby. For the case the master slot doesn't keep
> > > WAL segments unless the standby connects so a couple of
> > > checkpoints can blow away the first segment required by the
> > > standby. This is quite reasonable behavior. The following steps
> > > makes this more sure.
> > >
> > > > - Took the backup of master
> > > > - with a low max_wal_size = 2 and wal_keep_segments = 0
> > > > - Configured standby with recovery.conf
> > > > - Created replication slot on master
> > > + - SELECT pg_switch_wal(); on master twice.
> > > + - checkpoint; on master twice.
> > > > - Configured the replication slot on standby and started the standby
> > >
> > > Creating the slot with the following command will save it.
> > >
> > > =# select pg_create_physical_replication_slot('s1', true);
> > >
> >
> > I did a test again, by applying the patch and I am not sure if the patch
> is
> > doing the right thing ?
>
> This is a bit difficult to make it happen.
>

Yes, it seems to be quite difficult to reproduce.

>
> > Here is test case -
> >
> > - I ran pgbench
> > - I took the backup of the master first
> >
> > - Below are the WALs on master after the stop backup -
> >
> > postgres=# select pg_stop_backup();
> >
> > NOTICE: WAL archiving is not enabled; you must ensure that all required
> > WAL segments are copied through other means to complete the backup
> > pg_stop_backup
> > ----------------
> > 0/8C000130
> > (1 row)
> >
> > postgres=# \q
> > [dba(at)buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
> > total 65M
> > drwx------. 2 dba dba 4.0K Mar 31 09:36 archive_status
> > -rw-------. 1 dba dba 16M Mar 31 11:09 00000001000000000000008E
> > -rw-------. 1 dba dba 16M Mar 31 11:17 00000001000000000000008F
> > -rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008C
> > -rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008D
> >
> > - After the backup, i created the physical replication slot
> >
> >
> > postgres=# select pg_create_physical_replication_slot('repslot',true);
> >
> > pg_create_physical_replication_slot
> > -------------------------------------
> > (repslot,0/8D000028)
> > (1 row)
>
> At this point, segment 8C is not protected from removal. It's too
> late if the first record in 8D is continued from 8C.
>
> > postgres=# select pg_walfile_name('0/8D000028');
> >
> > pg_walfile_name
> > ---------------------------------------
> > 00000001000000000000008D
> > (1 row)
> >
> > Here, When you start the standby, it would ask for the file
> > 00000001000000000000008C, which is the first file needed for the standby
> > and since i applied your patch, i am assuming that, the file
> > 00000001000000000000008C should also be retained without being removed -
> > correct ?
>
> Yes if repslot were staying in 8C. So this is not the case.
>
> > - I started the standby and the below error occurs
> >
> > >> 2017-03-31 11:26:01.288 AEDT [17475] LOG: invalid record length at
> > 0/8C000108: wanted 24, got 0
> > >> 2017-03-31 11:26:01.291 AEDT [17486] LOG: started streaming WAL from
> > primary at 0/8C000000 on timeline 1
> > >> 2017-03-31 11:26:01.291 AEDT [17486] FATAL: could not receive data
> from
> > WAL stream: ERROR: requested WAL segment 00000001000000000000008C has
> > already been removed
>
> LoL. You hit this by a single shot! However, the case cannot be
> saved with this patch since the repslot is already in 8D. In
> other words, newly created slot doesn't protect the previous
> segments at the time of creation. The existence of required WAL
> segments is not guranteed without repslot anyway.
>

Yeah, this can be reproduced easily.

>
> > > > and I could see that the file "0000000100000000000000F2" was removed
> > > > from the master. This can be easily reproduced and this occurs
> > > > irrespective of configuring replication slots.
> > > >
> > > > As long as the file "0000000100000000000000F2" is available on the master,
> > > > standby continues to stream WALs without any issues.
> > > ...
> > > > If the scenario I created to reproduce the error is correct, then applying
> > > > the patch is not making a difference.
> > >
> > > Yes, the patch is not meant to save this case. The patch saves the
> > > case where the segment preceding the first segment required by the
> > > standby has been removed while it contains the first part of a record
> > > that continues into that first required segment. This case, on the
> > > other hand, is one where the segment at the standby's start point has
> > > simply been removed.
> > >
> >
> > Which means the file 00000001000000000000008C must be retained, as it is
> > the first file the standby is looking for - correct?
>
> Yes, but if and only if the repslot has been successfully connected at
> least once. So I attached an additional patch to effectively
> cause the problematic situation. The attached patch makes the standby
> kill itself when the condition for this bug arises.
>
> I wrote in <20170117(dot)193645(dot)160386781(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>:
> > I managed to reproduce this. A little tweak, attached as the first patch,
> > lets the standby kill itself as soon as the walreceiver sees a
> > contrecord at the beginning of a segment.
> >
> > - M(aster): initdb the master with wal_keep_segments = 0
> >   (default), log_min_messages = debug2
> > - M: Create a physical repslot.
> > - S(tandby): Setup a standby database.
> > - S: Edit recovery.conf to use the replication slot above then
> > start it.
> > - S: touch /tmp/hoge
> > - M: Run pgbench ...
> > - S: After a while, the standby stops.
> > > LOG: #################### STOP THE SERVER
>
> At this point, pg_replication_slots.restart_lsn should be in the
> segment previous to the one containing the LSN shown in the last
> "LOG: ####.. CHECK AT" log. For example, in the case below.
>
> > LOG: ############# CHECK AT 5000000 : 1
> > LOG: #################### STOP THE SERVER
>
> The master responds to the query "select restart_lsn from
> pg_replication_slots" with '0/4F......' with this patch but
> '0/50......' without this patch.
>
> > - M: Stop pgbench.
> > - M: Do 'checkpoint;' twice.
> > - S: rm /tmp/hoge
> > - S: Fails to catch up with the following error.
> >
> > > FATAL: could not receive data from WAL stream: ERROR: requested
> WAL segment 00000001000000000000002B has already been removed
>
> One other point of this patch is that the repslot *should not* retain
> needless segments. But this doesn't seem externally
> confirmable...
>

I will need to use this patch and see if I can reproduce the case.

As we are already past the commitfest, I am not sure what I should change
the patch status to.

Regards,

Venkata B N
Database Consultant


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Venkata B Nagothi <nag1010(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jonathon Nelson <jdnelson(at)dyn(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-04-02 23:38:47
Message-ID: CAB7nPqT8dQk_Ce29YQ0CKAQ7htLDyUHNdFv6dELe4PkYr3SSjA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Mon, Apr 3, 2017 at 7:19 AM, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote:
> As we are already past the commitfest, I am not sure what I should change
> the patch status to.

The commit fest finishes on the 7th of April. Even with the deadline
passed, there is nothing preventing work on bug fixes. So this item
ought to be moved to the next CF with the same category.
--
Michael


From: Ryan Murphy <ryanfmurphy(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Kyotaro Horiguchi <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-07-06 15:00:36
Message-ID: 20170706150036.1381.32532.pgcf@coridan.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Poking this. Looking back through the discussion, this issue has been reproduced by multiple people. The patch still applies to HEAD without issues. I have no experience with PostgreSQL replication, so I'm not qualified to really review this. From what I can see with the patch, it's just a small block of code added to /src/backend/replication/walreceiver.c to handle some edge case where the WAL file no longer exists or something.

I think one thing that would help move this forward is if we edited the patch to include a comment explaining why this new code is necessary. There's lots of great discussion on this issue in the email list, so if a summary of that gets into the code I think it would make the patch easier to understand and make the new walreceiver.c less confusing.


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-08-28 11:02:40
Message-ID: 20170828.200240.73003255.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello,

This problem still occurs on the master.
I rebased this to the current master.

At Mon, 3 Apr 2017 08:38:47 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqT8dQk_Ce29YQ0CKAQ7htLDyUHNdFv6dELe4PkYr3SSjA(at)mail(dot)gmail(dot)com>
> On Mon, Apr 3, 2017 at 7:19 AM, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote:
> > As we are already past the commitfest, I am not sure what I should change
> > the patch status to.
>
> The commit fest finishes on the 7th of April. Even with the deadline
> passed, there is nothing preventing work on bug fixes. So this item
> ought to be moved to the next CF with the same category.

The steps to reproduce the problem follow.

- Apply the second patch (0002-) attached and recompile. It
  effectively reproduces the problematic state of the database.

- M(aster): initdb the master with wal_keep_segments = 0
(default), log_min_messages = debug2
- M: Create a physical repslot.
- S(tandby): Setup a standby database.
- S: Edit recovery.conf to use the replication slot above then
start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
> LOG: #################### STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

> FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002B has already been removed

The first patch (0001-) fixes this problem, preventing the
problematic state of WAL segments by retarding restart LSN of a
physical replication slot in a certain condition.
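
For reference, a rough sketch of what the debug-assistant patch (0002-) is
described as doing in the steps above; the trigger file /tmp/hoge and the log
text come from the steps, while the function name and its exact placement are
assumptions, not the actual patch:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

#define SKETCH_XLP_FIRST_IS_CONTRECORD 0x0001   /* stand-in for the real xlp_info flag */

/*
 * Stop the standby when the first page of a freshly received segment
 * carries a continuation record and the trigger file exists, so that
 * the narrow window for this bug can be hit deterministically.
 */
void
sketch_maybe_stop_standby(uint64_t segment_start_lsn, uint16_t first_page_xlp_info)
{
    struct stat st;

    if ((first_page_xlp_info & SKETCH_XLP_FIRST_IS_CONTRECORD) != 0 &&
        stat("/tmp/hoge", &st) == 0)
    {
        fprintf(stderr, "#################### STOP THE SERVER at %lX\n",
                (unsigned long) segment_start_lsn);
        exit(1);
    }
}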

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Retard-restart-LSN-of-a-slot-when-a-segment-starts-w.patch text/x-patch 7.3 KB
0002-Debug-assistant-code.patch text/x-patch 1.4 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-08-28 11:14:54
Message-ID: CAB7nPqT03+uaHXun3ft4LJWNDviKTgWSZDsXiqyNdtcCfeqcgg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Mon, Aug 28, 2017 at 8:02 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> The first patch (0001-) fixes this problem, preventing the
> problematic state of WAL segments by retarding restart LSN of a
> physical replication slot in a certain condition.

FWIW, I have this patch marked on my list of things to look at, so you
can count me as a reviewer. There are also some approaches that I
would like to test because I rely on replication slots for some
infrastructure. Still...

+ if (oldFlushPtr != InvalidXLogRecPtr &&
+ (restartLSN == InvalidXLogRecPtr ?
+ oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
+ restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ))
I find such code patterns not readable.
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-09-04 06:51:51
Message-ID: 20170904.155151.168399829.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello,

Thank you for reviewing this.

At Mon, 28 Aug 2017 20:14:54 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqT03+uaHXun3ft4LJWNDviKTgWSZDsXiqyNdtcCfeqcgg(at)mail(dot)gmail(dot)com>
> On Mon, Aug 28, 2017 at 8:02 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > The first patch (0001-) fixes this problem, preventing the
> > problematic state of WAL segments by retarding restart LSN of a
> > physical replication slot in a certain condition.
>
> FWIW, I have this patch marked on my list of things to look at, so you
> can count me as a reviewer. There are also some approaches that I
> would like to test because I rely on replication slots for some
> infrastructure. Still...

This test patch modifies the code for convenience. The window for
this bug to occur runs from receiving the first record of a segment
until, in most cases, receiving the second record, or sometimes until
several records later. Would intentionally emitting a record spanning
two or more segments work?

> + if (oldFlushPtr != InvalidXLogRecPtr &&
> + (restartLSN == InvalidXLogRecPtr ?
> + oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
> + restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ))
> I find such code patterns not readable.

Yeah, I agree. I rewrote that part, and the result in the attached
patch is far cleaner than the original blob.
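
For the archives, the untangled shape looks roughly like this (a standalone
sketch with stand-in sizes and an invalid-LSN value of 0, not the exact hunk
from the attached patch):

#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SEG_SIZE    ((uint64_t) 16 * 1024 * 1024)   /* stand-in for XLOG_SEG_SIZE */
#define SKETCH_BLCKSZ      ((uint64_t) 8192)               /* stand-in for XLOG_BLCKSZ */
#define SKETCH_INVALID_LSN ((uint64_t) 0)                  /* stand-in for InvalidXLogRecPtr */

/*
 * Decide whether to look for continuation records before moving the
 * slot's restart position: while not retarding restartLSN, check only
 * at segment boundaries; once retarding, check at every page boundary.
 */
bool
sketch_need_contrecord_check(uint64_t oldFlushPtr, uint64_t flushPtr, uint64_t restartLSN)
{
    if (restartLSN != SKETCH_INVALID_LSN)
        return restartLSN / SKETCH_BLCKSZ != flushPtr / SKETCH_BLCKSZ;

    if (oldFlushPtr != SKETCH_INVALID_LSN)
        return oldFlushPtr / SKETCH_SEG_SIZE != flushPtr / SKETCH_SEG_SIZE;

    return false;
}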

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Retard-restart-LSN-of-a-slot-when-a-segment-starts-w.patch text/x-patch 7.7 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: michael(dot)paquier(at)gmail(dot)com, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-09-04 07:04:34
Message-ID: 20170904070434.fngklyhuqg5lflcl@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hi,

On 2017-09-04 15:51:51 +0900, Kyotaro HORIGUCHI wrote:
> SpinLockAcquire(&walsnd->mutex);
> + oldFlushPtr = walsnd->flush;
> walsnd->write = writePtr;
> walsnd->flush = flushPtr;
> walsnd->apply = applyPtr;
> @@ -1821,7 +1836,93 @@ ProcessStandbyReplyMessage(void)
> if (SlotIsLogical(MyReplicationSlot))
> LogicalConfirmReceivedLocation(flushPtr);
> else
> - PhysicalConfirmReceivedLocation(flushPtr);
> + {
> + /*
> + * Recovery on standby requires that a continuation record is
> + * available from single WAL source. For the reason, physical
> + * replication slot should stay in the first segment of the
> + * multiple segments that a continued record is spanning
> + * over. Since we look pages and don't look into individual record
> + * here, restartLSN may stay a bit too behind but it doesn't
> + * matter.
> + *
> + * Since the objective is avoiding to remove required segments,
> + * checking at the beginning of every segment is enough. But once
> + * restartLSN goes behind, check every page for quick restoration.
> + *
> + * restartLSN has a valid value only when it is behind flushPtr.
> + */
> + bool check_contrecords = false;
> +
> + if (restartLSN != InvalidXLogRecPtr)
> + {
> + /*
> + * We're retarding restartLSN, check continuation records
> + * at every page boundary for quick restoration.
> + */
> + if (restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ)
> + check_contrecords = true;
> + }
> + else if (oldFlushPtr != InvalidXLogRecPtr)
> + {
> + /*
> + * We're not retarding restartLSN , check continuation records
> + * only at segment boundaries. No need to check if
> + */
> + if (oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE)
> + check_contrecords = true;
> + }
> +
> + if (check_contrecords)
> + {
> + XLogRecPtr rp;
> +
> + if (restartLSN == InvalidXLogRecPtr)
> + restartLSN = oldFlushPtr;
> +
> + rp = restartLSN - (restartLSN % XLOG_BLCKSZ);
> +
> + /*
> + * We may have let the record at flushPtr be sent, so it's
> + * worth looking
> + */
> + while (rp <= flushPtr)
> + {
> + XLogPageHeaderData header;
> +
> + /*
> + * If the page header is not available for now, don't move
> + * restartLSN forward. We can read it by the next chance.
> + */
> + if(sentPtr - rp >= sizeof(XLogPageHeaderData))
> + {
> + bool found;
> + /*
> + * Fetch the page header of the next page. Move
> + * restartLSN forward only if it is not a continuation
> + * page.
> + */
> + found = XLogRead((char *)&header, rp,
> + sizeof(XLogPageHeaderData), true);
> + if (found &&
> + (header.xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
> + restartLSN = rp;
> + }
> + rp += XLOG_BLCKSZ;
> + }
> +
> + /*
> + * If restartLSN is on the same page with flushPtr, it means
> + * that we are catching up.
> + */
> + if (restartLSN / XLOG_BLCKSZ == flushPtr / XLOG_BLCKSZ)
> + restartLSN = InvalidXLogRecPtr;
> + }
> +
> + /* restartLSN == InvalidXLogRecPtr means catching up */
> + PhysicalConfirmReceivedLocation(restartLSN != InvalidXLogRecPtr ?
> + restartLSN : flushPtr);
> + }

I've not read through the thread, but this seems like the wrong approach
to me. The receiving side should use a correct value, instead of putting
this complexity on the sender's side.

Don't we also use the value on the receiving side to delete old segments
and such?

- Andres


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-09-04 08:17:19
Message-ID: CAB7nPqSPf0qkq=DhSO-tAM9++LSA2aEYSVJ3oY_EdUdb=jKi1w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Mon, Sep 4, 2017 at 4:04 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I've not read through the thread, but this seems like the wrong approach
> to me. The receiving side should use a correct value, instead of putting
> this complexity on the sender's side.

Yes I agree with that. The current patch gives me a bad feeling to be
honest with the way it does things..
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: andres(at)anarazel(dot)de, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-09-06 08:36:02
Message-ID: 20170906.173602.195129146.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hi,

At Mon, 4 Sep 2017 17:17:19 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqSPf0qkq=DhSO-tAM9++LSA2aEYSVJ3oY_EdUdb=jKi1w(at)mail(dot)gmail(dot)com>
> On Mon, Sep 4, 2017 at 4:04 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I've not read through the thread, but this seems like the wrong approach
> > to me. The receiving side should use a correct value, instead of putting
> > this complexity on the sender's side.
>
> Yes I agree with that. The current patch gives me a bad feeling to be
> honest with the way it does things..

The problem is that the current ReadRecord needs to read the first part of
a series of continuation records from the same source as the rest, the
master in this case.

A (or the) solution confined to the standby side is to allow reading a
series of continuation records from multiple sources: in this case the
first part from the standby's pg_wal and the second part from the master
via streaming replication. ReadRecord needed refactoring to accomplish
this behavior, which (it seems to me) breaks the concept of the
XLogReader plug-in system.

If that is preferable for you, I'll retry it. Hints for other
solutions are also welcome.

Are there any suggestions?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Andres Freund <andres(at)anarazel(dot)de>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: michael(dot)paquier(at)gmail(dot)com, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-09-06 19:23:53
Message-ID: 20170906192353.ufp2dq7wm5fd6qa7@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On 2017-09-06 17:36:02 +0900, Kyotaro HORIGUCHI wrote:
> Hi,
>
> At Mon, 4 Sep 2017 17:17:19 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqSPf0qkq=DhSO-tAM9++LSA2aEYSVJ3oY_EdUdb=jKi1w(at)mail(dot)gmail(dot)com>
> > On Mon, Sep 4, 2017 at 4:04 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > I've not read through the thread, but this seems like the wrong approach
> > > to me. The receiving side should use a correct value, instead of putting
> > > this complexity on the sender's side.
> >
> > Yes I agree with that. The current patch gives me a bad feeling to be
> > honest with the way it does things..
>
> The problem is that the current ReadRecord needs to read the first part of
> a series of continuation records from the same source as the rest, the
> master in this case.

What's the problem with that? We can easily keep track of the beginning
of a record, and only confirm the address before that.

> A (or the) solution confined to the standby side is to allow reading a
> series of continuation records from multiple sources.

I'm not following. All we need to use is the beginning of the relevant
records, that's easy enough to keep track of. We don't need to read the
WAL or anything.
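
In other words, something along these lines on the receiving side (a sketch
only, with made-up names, to show the clamping idea rather than actual
walreceiver code):

#include <stdint.h>

#define SKETCH_INVALID_LSN ((uint64_t) 0)   /* stand-in for InvalidXLogRecPtr */

/*
 * Before acknowledging a flush position, clamp it to the start of any
 * record that has only partially arrived, so the sender keeps the
 * segment that holds the record's beginning.
 */
uint64_t
sketch_lsn_to_acknowledge(uint64_t flushed_up_to, uint64_t pending_record_start)
{
    if (pending_record_start != SKETCH_INVALID_LSN &&
        pending_record_start < flushed_up_to)
        return pending_record_start;

    return flushed_up_to;
}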

- Andres


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: andres(at)anarazel(dot)de
Cc: michael(dot)paquier(at)gmail(dot)com, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-09-07 03:33:47
Message-ID: 20170907.123347.101584520.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello,

At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in <20170906192353(dot)ufp2dq7wm5fd6qa7(at)alap3(dot)anarazel(dot)de>
> On 2017-09-06 17:36:02 +0900, Kyotaro HORIGUCHI wrote:
> > The problem is that the current ReadRecord needs to read the first part of
> > a series of continuation records from the same source as the rest, the
> > master in this case.
>
> What's the problem with that? We can easily keep track of the beginning
> of a record, and only confirm the address before that.

After failing to read a record locally, ReadRecord tries streaming
to read from the beginning of the record, which is not on the
master, then retries locally, and so on. This loops forever.

> > A (or the) solution confined to the standby side is to allow reading a
> > series of continuation records from multiple sources.
>
> I'm not following. All we need to use is the beginning of the relevant
> records, that's easy enough to keep track of. We don't need to read the
> WAL or anything.

The beginning is already tracked, and there is nothing more to do.

I reconsidered that approach and found that it doesn't need such
destructive refactoring.

The first *problem* was that WaitForWALToBecomeAvailable requests the
beginning of a record, which is not on the page the function has
been told to fetch. While tliRecPtr is still required to determine the
TLI to request, the function should request RecPtr to be streamed.

The rest is to let XLogPageRead retry other sources
immediately. To do this I made ValidXLogPageHeader(at)xlogreader(dot)c
public (and renamed it to XLogReaderValidatePageHeader).

The attached patch fixes the problem and passes the recovery
tests. However, a test for this problem is not added. It would need
to go to the last page in a segment, put a record that continues
into the next segment, then kill the standby after it has received the
previous segment but before it has received the whole record.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Allow-switch-WAL-source-midst-of-record.patch text/x-patch 3.9 KB
0002-Debug-assistant-code.patch text/x-patch 1.4 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Andres Freund <andres(at)anarazel(dot)de>, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-10-16 08:58:03
Message-ID: CAB7nPqR+J1Xw+yzfsrehiQ+rh3ac+n5sEUgP7UOQ4_ymFnO9wg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Sep 7, 2017 at 12:33 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in <20170906192353(dot)ufp2dq7wm5fd6qa7(at)alap3(dot)anarazel(dot)de>
>> I'm not following. All we need to use is the beginning of the relevant
>> records, that's easy enough to keep track of. We don't need to read the
>> WAL or anything.
>
> The beginning is already tracked, and there is nothing more to do.

I have finally allocated time to look at your newly-proposed patch,
sorry for the time it took. Patch 0002 forgot to include sys/stat.h to
allow the debugging tool to compile :)

> The first *problem* was WaitForWALToBecomeAvailable requests the
> beginning of a record, which is not on the page the function has
> been told to fetch. Still tliRecPtr is required to determine the
> TLI to request, it should request RecPtr to be streamed.

[...]

> The rest to do is let XLogPageRead retry other sources
> immediately. To do this I made ValidXLogPageHeader(at)xlogreader(dot)c
> public (and renamed to XLogReaderValidatePageHeader).
>
> The patch attached fixes the problem and passes recovery
> tests. However, the test for this problem is not added. It needs
> to go to the last page in a segment then put a record continues
> to the next segment, then kill the standby after receiving the
> previous segment but before receiving the whole record.

+typedef struct XLogPageHeaderData *XLogPageHeader;
[...]
+/* Validate a page */
+extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
+ XLogRecPtr recptr, XLogPageHeader hdr);
Instead of exposing XLogPageHeaderData, I would recommend just using
char* and remove this declaration. The comment on top of
XLogReaderValidatePageHeader needs to make clear what caller should
provide.

+ if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
+ (XLogPageHeader) readBuf))
+ goto next_record_is_invalid;
+
[...]
- ptr = tliRecPtr;
+ ptr = RecPtr;
tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);

if (curFileTLI > 0 && tli < curFileTLI)
The core of the patch is here (the patch has no comment so it is hard
to understand what's the point of what is being done), and if I
understand that correctly, you allow the receiver to fetch the
portions of a record spanning multiple segments from different
sources, and I am not sure that we'd want to break that promise.
Shouldn't we instead have the receiver side track the beginning of the
record and send that position for the physical slot's restart_lsn?
This way the receiver would retain WAL segments from the real
beginning of a record. restart_lsn for replication slots is currently set
when processing the standby message in ProcessStandbyReplyMessage() using
the flush LSN, so a more correct value should be provided using
that. Andres, what's your take on the matter?
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: andres(at)anarazel(dot)de, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-10-26 10:05:51
Message-ID: 20171026.190551.208996945.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello. Thank you for looking at this.

At Mon, 16 Oct 2017 17:58:03 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqR+J1Xw+yzfsrehiQ+rh3ac+n5sEUgP7UOQ4_ymFnO9wg(at)mail(dot)gmail(dot)com>
> On Thu, Sep 7, 2017 at 12:33 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in <20170906192353(dot)ufp2dq7wm5fd6qa7(at)alap3(dot)anarazel(dot)de>
> >> I'm not following. All we need to use is the beginning of the relevant
> >> records, that's easy enough to keep track of. We don't need to read the
> >> WAL or anything.
> >
> > The beginning is already tracked, and there is nothing more to do.
>
> I have finally allocated time to look at your newly-proposed patch,
> sorry for the time it took. Patch 0002 forgot to include sys/stat.h to
> allow the debugging tool to compile :)
>
> > The first *problem* was WaitForWALToBecomeAvailable requests the
> > beginning of a record, which is not on the page the function has
> > been told to fetch. Still tliRecPtr is required to determine the
> > TLI to request, it should request RecPtr to be streamed.
>
> [...]
>
> > The rest to do is let XLogPageRead retry other sources
> > immediately. To do this I made ValidXLogPageHeader(at)xlogreader(dot)c
> > public (and renamed to XLogReaderValidatePageHeader).
> >
> > The patch attached fixes the problem and passes recovery
> > tests. However, the test for this problem is not added. It needs
> > to go to the last page in a segment then put a record continues
> > to the next segment, then kill the standby after receiving the
> > previous segment but before receiving the whole record.
>
> +typedef struct XLogPageHeaderData *XLogPageHeader;
> [...]
> +/* Validate a page */
> +extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
> + XLogRecPtr recptr, XLogPageHeader hdr);
> Instead of exposing XLogPageHeaderData, I would recommend just using
> char* and remove this declaration. The comment on top of
> XLogReaderValidatePageHeader needs to make clear what caller should
> provide.

Seems reasonable. Added several lines in the comment for the
function.

> + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> + (XLogPageHeader) readBuf))
> + goto next_record_is_invalid;
> +
> [...]
> - ptr = tliRecPtr;
> + ptr = RecPtr;
> tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
>
> if (curFileTLI > 0 && tli < curFileTLI)
> The core of the patch is here (the patch has no comment so it is hard
> to understand what's the point of what is being done), and if I

Hmm, sorry. Added a brief comment there.

> understand that correctly, you allow the receiver to fetch the
> portions of a record spanning multiple segments from different
> sources, and I am not sure that we'd want to break that promise.

We are already allowing consecutive records at a segment boundary to
come from different sources within the same series of xlog records. A
continuation record never spans two TLIs, but I might be
missing something here. (I found that an error message shows an
incorrect record pointer. The error message still seems
useful.)

> Shouldn't we instead have the receiver side track the beginning of the
> record and send that position for the physical slot's restart_lsn?

The largest obstacle to doing that is that the walreceiver is not
concerned with record internals at all. In other words, it doesn't
know what a record is. Teaching it that introduces much complexity,
and that complexity slows down the walreceiver.

In addition to that, this "problem" also occurs with replication
without a slot. The latest patch also helps that case.

> This way the receiver would retain WAL segments from the real
> beginning of a record. restart_lsn for replication slots is set when
> processing the standby message in ProcessStandbyReplyMessage() using
> now the flush LSN, so a more correct value should be provided using
> that. Andres, what's your take on the matter?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Allow-switch-WAL-source-midst-of-record.patch text/x-patch 5.1 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Andres Freund <andres(at)anarazel(dot)de>, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-10-26 18:13:55
Message-ID: CAB7nPqSaWrJD0fVN4uX=e=D9FqEYqPq-S2X_TC60kJ0YX2-5eQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Oct 26, 2017 at 3:05 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> The largest obstacle to do that is that walreceiver is not
> utterly concerned to record internals. In other words, it doesn't
> know what a record is. Teaching that introduces much complexity
> and the complexity slows down the walreceiver.
>
> Addition to that, this "problem" occurs also on replication
> without a slot. The latest patch also help the case.

That's why replication slots have been introduced to begin with. The
WAL receiver gives no guarantee that a segment will be retained or not
based on the beginning of a record. It is sad that the WAL receiver
does not properly track the restart LSN and instead just uses the flush
LSN. I am beginning to think that a new message type used to report
the restart LSN when a replication slot is in use would just be a
better and more stable solution. I haven't looked at the
implementation details yet though.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Andres Freund <andres(at)anarazel(dot)de>, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-11-29 04:57:48
Message-ID: CAB7nPqT4kVLhPZSd5zC953jfhfexkZkMiVXKfxiBrSxoA4r5Eg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Fri, Oct 27, 2017 at 3:13 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Oct 26, 2017 at 3:05 AM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> The largest obstacle to do that is that walreceiver is not
>> utterly concerned to record internals. In other words, it doesn't
>> know what a record is. Teaching that introduces much complexity
>> and the complexity slows down the walreceiver.
>>
>> Addition to that, this "problem" occurs also on replication
>> without a slot. The latest patch also help the case.
>
> That's why replication slots have been introduced to begin with. The
> WAL receiver gives no guarantee that a segment will be retained or not
> based on the beginning of a record. That's sad that the WAL receiver
> does not track properly restart LSN and instead just uses the flush
> LSN. I am beginning to think that a new message type used to report
> the restart LSN when a replication slot is in use would just be a
> better and more stable solution. I haven't looked at the
> implementation details yet though.

Yeah, I am still on track with this idea. Splitting the flush LSN and
the restart LSN properly can allow for better handling on clients. For
now I am moving this patch to the next CF.
--
Michael


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)anarazel(dot)de>, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-17 12:38:32
Message-ID: 20180117123832.GB2416@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Michael, Andres,

* Michael Paquier (michael(dot)paquier(at)gmail(dot)com) wrote:
> On Fri, Oct 27, 2017 at 3:13 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> > On Thu, Oct 26, 2017 at 3:05 AM, Kyotaro HORIGUCHI
> > <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >> The largest obstacle to do that is that walreceiver is not
> >> utterly concerned to record internals. In other words, it doesn't
> >> know what a record is. Teaching that introduces much complexity
> >> and the complexity slows down the walreceiver.
> >>
> >> Addition to that, this "problem" occurs also on replication
> >> without a slot. The latest patch also help the case.
> >
> > That's why replication slots have been introduced to begin with. The
> > WAL receiver gives no guarantee that a segment will be retained or not
> > based on the beginning of a record. That's sad that the WAL receiver
> > does not track properly restart LSN and instead just uses the flush
> > LSN. I am beginning to think that a new message type used to report
> > the restart LSN when a replication slot is in use would just be a
> > better and more stable solution. I haven't looked at the
> > implementation details yet though.
>
> Yeah, I am still on track with this idea. Splitting the flush LSN and
> the restart LSN properly can allow for better handling on clients. For
> now I am moving this patch to the next CF.

Thanks for looking at this patch previously. This patch is still in
Needs Review but it's not clear if that's correct or not, but this seems
to be a bug-fix, so it would be great if we could make some progress on
it (particularly since it was reported over a year ago and goes back to
9.4, according to the thread, from what I can tell).

Any chance one of you can provide some further thoughts on it or make it
clear if we are waiting for a new patch or if the existing one is still
reasonable and just needs to be reviewed (in which case, perhaps one of
you could help with that..)?

Thanks again!

Stephen


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)anarazel(dot)de>, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-18 11:58:27
Message-ID: 20180118115827.GA12826@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> Thanks for looking at this patch previously. This patch is still in
> Needs Review but it's not clear if that's correct or not, but this seems
> to be a bug-fix, so it would be great if we could make some progress on
> it (particularly since it was reported over a year ago and goes back to
> 9.4, according to the thread, from what I can tell).
>
> Any chance one of you can provide some further thoughts on it or make it
> clear if we are waiting for a new patch or if the existing one is still
> reasonable and just needs to be reviewed (in which case, perhaps one of
> you could help with that..)?

Here is my input then with a summary regarding this thread. The base
problem here is that WAL segments are not retained on a primary when
using a replication slot if a record begins at a segment boundary. When
in recovery and trying to fetch a record from a source, the standby
tries to fetch a segment from its beginning, but if segments have
been recycled on the primary, the segment holding the record's beginning
may already be gone, which breaks the contract that the replication slot
provides. If we want to get things correctly addressed, I think that we
want two things:
1) On the primary side, track correctly the beginning of a record so that,
when a record runs across multiple segments, the slot does not recycle
them.
2) On the standby side, we need to track as well the beginning of the
record so that we don't risk recycling things because of restart points.

On this thread have been discussed a couple of approaches:
a) The first patch was doing things on the primary side by opening an
XLogReader which was in charge of checking if the record we are looking at
was at a segment boundary. On top of being costly, this basically
achieved 1) but failed on 2).
b) The second patch that I would like to mention is doing things on the
standby side by being able to change a WAL source when fetching a single
record, so that it is possible to fetch the beginning of a cross-segment
record from, say, an archive or the local xlog, and then request the rest
on the primary. This patch can be found in
/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
and the diff in WaitForWALToBecomeAvailable() scares me a lot because
it seems to me that this actually breaks timeline switch
consistency. The concept of switching a WAL source in the middle of a
WAL segment is risky anyway, because we perform sanity checks while
switching sources. The bug we are dealing with here is very rare, and
seeing a problem like that for a timeline switch is even more rare, but
the risk is not zero and we may end up with a corrupted instance, so we
should not in my opinion take this approach. Also this actually achieves
point 2) above (though not completely), but not 1).

An approach I have been thinking about to address points 1) and 2) is to
introduce a new message type in the replication protocol which can be
used to report as well the position at which a replication slot should be
held. It would have been tempting to extend the existing standby
feedback message but we cannot afford a protocol break either. So my
proposal is to create a new replication message which can be used in the
context of a replication slot instead of the usual standby feedback
message; let's call it the slot feedback message. This is similar to the
standby message, except that it also reports the LSN the slot should
wait for. This results in an addition of... 8 bytes to the message, which
is a lot. I have thought about the possibility of reducing it to 4 bytes,
but one record can span more than 4GB worth of segments, right?
(Note that the replication slot data should be updated with that instead
of the flush LSN.)
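
Roughly, such a message could carry something like the following (purely
illustrative field names and layout; the actual message byte, packing, and
wire format would have to follow the existing protocol conventions):

#include <stdint.h>

/* Sketch only: a "slot feedback" message next to the usual standby reply. */
typedef struct SketchSlotFeedbackMessage
{
    char     msgtype;           /* hypothetical message byte, distinct from 'r' */
    uint64_t write_lsn;         /* as in the standby reply message */
    uint64_t flush_lsn;
    uint64_t apply_lsn;
    uint64_t slot_restart_lsn;  /* the extra 8 bytes discussed above */
    int64_t  send_time;         /* timestamp, as in the standby reply */
    uint8_t  reply_requested;
} SketchSlotFeedbackMessage;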

Would love to hear thoughts from others. Of course, feel free to correct
me as needed.
--
Michael


From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-18 19:52:52
Message-ID: 20180118195252.hyxgkb3krcqjzfjm@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hi,

On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> > Thanks for looking at this patch previously. This patch is still in
> > Needs Review but it's not clear if that's correct or not, but this seems
> > to be a bug-fix, so it would be great if we could make some progress on
> > it (particularly since it was reported over a year ago and goes back to
> > 9.4, according to the thread, from what I can tell).
> >
> > Any chance one of you can provide some further thoughts on it or make it
> > clear if we are waiting for a new patch or if the existing one is still
> > reasonable and just needs to be reviewed (in which case, perhaps one of
> > you could help with that..)?
>
> Here is my input then with a summary regarding this thread. The base
> problem here is that WAL segments are not retained on a primary when
> using a replication slot if a record begins at a segment boundary.

> 1) On the primary side, track correctly the beginning of a record so as
> when a record runs across multiple segments, the slot does not recycle
> them.

I think that's a really bad idea. There's no bug on the sender side
here. The sender allows removal of WAL based on what the receiver
acks. This requires that all senders have a low-level understanding of
what's being sent - not a good idea, as we e.g. imo want to have tools
that serve WAL over the streaming protocol from an archive.

> 2) On the standby side, we need to track as well the beginning of the
> record so as we don't risk recycling things because of restart points.

> On this thread have been discussed a couple of approaches:
> a) The first patch was doing things on the primary side by opening an
> XLogReader which was in charge to check if the record we are looking at
> was at a segment boundary. On top of being costly, this basically
> achieved 1) but failed on 2).

I am very very strongly against this approach.

> b) The second patch that I would like to mention is doing things on the
> standby side by being able to change a WAL source when fetching a single
> record so as it is possible to fetch the beginning of a cross-segment
> record from say an archive or the local xlog, and then request the rest
> on the primary. This patch can be found in
> /message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
> and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
> it seems to me that this actually breaks timeline switch
> consistency. The concept of switching a WAL source in the middle of a
> WAL segment is risky anyway, because we perform sanity checks while
> switching sources. The bug we are dealing about here is very rare, and
> seeing a problem like that for a timeline switch is even more rare, but
> the risk is not zero and we may finish with a corrupted instance, so we
> should not in my opinion take this approach. Also this actually achieves
> point 2) above, not completely though, but not 1).

I don't agree with the sentiment about the approach, but I agree there
might be weaknesses in the implementation (which I have not reviewed). I
think it's perfectly sensible to require fetching one segment from
pg_xlog and the next one via another method. Currently the inability to
do so often leads to us constantly refetching the same segment.

> An approach I have been thinking about to address points 1) and 2) is to
> introduce a new messages type in the replication protocol which can be
> used report as well at which position a replication slot should hold its
> position. It would have been tempting to extend the existing standby
> feedback message but we cannot afford a protocol break either. So my
> proposal is to create a new replication message which can be used in the
> context of a replication slot instead of the usual standby feedback
> message, let's call it slot feedback message. This is similar to the
> standby message, except that it reports as well the LSN the slot should
> wait for. This results in an addition of... 8 bytes in the message which
> is a lot. I have thoughts about the possibility to reduce it to 4 bytes
> but one record can span more than 4GB worth of segments, right?
> (Note that replication slot data should updated with that instead of the
> flush LSN).

-0.5, I think this just adds complexity instead of fixing the underlying
problem.

Greetings,

Andres Freund


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: andres(at)anarazel(dot)de
Cc: michael(dot)paquier(at)gmail(dot)com, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-19 01:54:53
Message-ID: 20180119.105453.132600443.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello,

Thank you for the comprehensive explanation, Michael.

I happened to see another instance of this failure at one of our
client sites. The precise steps leading to the situation are not
available, but it is reported that it occurred without an
immediate shutdown. As far as I know, just starting the standby
after pg_basebackup caused it.

(The standby fetches archives of the primary server, so just
switching the WAL segment could break the loop in this case, but
that is not always so.)

At Thu, 18 Jan 2018 11:52:52 -0800, Andres Freund <andres(at)anarazel(dot)de> wrote in <20180118195252(dot)hyxgkb3krcqjzfjm(at)alap3(dot)anarazel(dot)de>
> Hi,
>
> On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> > On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> > > Thanks for looking at this patch previously. This patch is still in
> > > Needs Review but it's not clear if that's correct or not, but this seems
> > > to be a bug-fix, so it would be great if we could make some progress on
> > > it (particularly since it was reported over a year ago and goes back to
> > > 9.4, according to the thread, from what I can tell).
> > >
> > > Any chance one of you can provide some further thoughts on it or make it
> > > clear if we are waiting for a new patch or if the existing one is still
> > > reasonable and just needs to be reviewed (in which case, perhaps one of
> > > you could help with that..)?
> >
> > Here is my input then with a summary regarding this thread. The base
> > problem here is that WAL segments are not retained on a primary when
> > using a replication slot if a record begins at a segment boundary.
>
> > 1) On the primary side, track correctly the beginning of a record so as
> > when a record runs across multiple segments, the slot does not recycle
> > them.
>
> I think that's a really bad idea. There's no bug on the sender side
> here. The sender allows removal of WAL based on what the receiver
> acks. This requires that all senders have low-level understanding on
> what's being sent - not a good idea, as we e.g. imo want to have tools
> that serve WAL over the streaming protocol from an archive.
>
> > 2) On the standby side, we need to track as well the beginning of the
> > record so as we don't risk recycling things because of restart points.
>
>
> > On this thread have been discussed a couple of approaches:
> > a) The first patch was doing things on the primary side by opening an
> > XLogReader which was in charge to check if the record we are looking at
> > was at a segment boundary. On top of being costly, this basically
> > achieved 1) but failed on 2).
>
> I am very very strongly against this approach.
>
> > b) The second patch that I would like to mention is doing things on the
> > standby side by being able to change a WAL source when fetching a single
> > record so as it is possible to fetch the beginning of a cross-segment
> > record from say an archive or the local xlog, and then request the rest
> > on the primary. This patch can be found in
> > /message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
> > and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
> > it seems to me that this actually breaks timeline switch
> > consistency. The concept of switching a WAL source in the middle of a
> > WAL segment is risky anyway, because we perform sanity checks while
> > switching sources. The bug we are dealing about here is very rare, and
> > seeing a problem like that for a timeline switch is even more rare, but
> > the risk is not zero and we may finish with a corrupted instance, so we
> > should not in my opinion take this approach. Also this actually achieves
> > point 2) above, not completely though, but not 1).
>
> I don't agree with the sentiment about the approach, but I agree there
> might be weaknesses in the implementation (which I have not reviewed). I
> think it's perfectly sensible to require fetching one segment from
> pg_xlog and the next one via another method. Currently the inability to
> do so often leads to us constantly refetching the same segment.

Though I'm still not fully confident, if reading a set of
continuation records from two (or more) sources can lead to
inconsistency, two successive (complete) records are facing the
same risk.

> > An approach I have been thinking about to address points 1) and 2) is to
> > introduce a new messages type in the replication protocol which can be
> > used report as well at which position a replication slot should hold its
> > position. It would have been tempting to extend the existing standby
> > feedback message but we cannot afford a protocol break either. So my
> > proposal is to create a new replication message which can be used in the
> > context of a replication slot instead of the usual standby feedback
> > message, let's call it slot feedback message. This is similar to the
> > standby message, except that it reports as well the LSN the slot should
> > wait for. This results in an addition of... 8 bytes in the message which
> > is a lot. I have thoughts about the possibility to reduce it to 4 bytes
> > but one record can span more than 4GB worth of segments, right?
> > (Note that replication slot data should updated with that instead of the
> > flush LSN).
>
> -0.5, I think this just adds complexity instead of fixing the underlying
> problem.

On the other hand, if one logical record must be read from a single
source, we need some means to deter WAL recycling on the
primary side. Conducting that within the primary side has been rejected
by consensus. (There might be a smarter way to do that, though.) To
do it from the standby side, either retarding write feedback or
this kind of special feedback message would be needed. In any case it is
necessary to introduce WAL-record awareness into the WAL shipping
process, and that seems like a lot of complexity.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: andres(at)anarazel(dot)de, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-19 09:24:56
Message-ID: 20180119092456.GA1169@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Fri, Jan 19, 2018 at 10:54:53AM +0900, Kyotaro HORIGUCHI wrote:
> On the other hand if one logical record must be read from single
> source, we need any means to deterrent wal-recycling on the
> primary side. Conducting that within the primary side is rejected
> by consensus.

There is this approach of extending the message protocol as well, so that
the primary can retain the segments it needs to keep around...

> (There might be smarter way to do that, though.) To
> do that from the standby side, just retarding write feedbacks or
> this kind of special feedback would be needed. In any way it is
> necessary to introduce WAL-record awareness into WAL shipping
> process and it's seemingly large complexity.

Coming to logical slots, don't you need to be aware of the beginning of
the record on the primary to correctly perform decoding of tuples
through time travel? If the record is present across segments, it seems
to me that it matters. Andres knows the answer to that for sure, so I
would expect to be counter-argued in the next 12 hours or so.
--
Michael


From: Greg Stark <stark(at)mit(dot)edu>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jonathon Nelson <jdnelson(at)dyn(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-19 11:28:58
Message-ID: CAM-w4HMU==SkHHzS6KDSrNiKU9vk2R4TG73M4FJzA-8Yui34+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On 19 January 2017 at 09:37, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
> Though I haven't looked closely at how a modification is split
> into WAL records. A tuple cannot be so long. As a simple test, I
> observed rechdr->xl_tot_len at the end of XLogRecordAssemble while
> inserting an about-400KB not-so-compressible string into a text
> column, but I saw a series of many records, each shorter than
> several thousand bytes.

I think the case to check is a commit record with many thousands of
subtransactions. I'm not sure you can fill a whole segment though.

--
greg


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: stark(at)mit(dot)edu
Cc: michael(dot)paquier(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-23 08:05:11
Message-ID: 20180123.170511.161839081.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello,

At Fri, 19 Jan 2018 11:28:58 +0000, Greg Stark <stark(at)mit(dot)edu> wrote in <CAM-w4HMU==SkHHzS6KDSrNiKU9vk2R4TG73M4FJzA-8Yui34+g(at)mail(dot)gmail(dot)com>
> On 19 January 2017 at 09:37, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >
> > Though I haven't looked closely at how a modification is split
> > into WAL records. A tuple cannot be so long. As a simple test, I
> > observed rechdr->xl_tot_len at the end of XLogRecordAssemble while
> > inserting an about-400KB not-so-compressible string into a text
> > column, but I saw a series of many records, each shorter than
> > several thousand bytes.
>
> I think the case to check is a commit record with many thousands of
> subtransactions. I'm not sure you can fill a whole segment though.

Thanks, potentially it can. One subtransaction adds 4 bytes, so
roughly 4.2M subtransactions would fill a segment, but a
transaction with 100000 subtransactions didn't end, just returning
a pile of many, many command tags.

... Anyway, the current point of the discussion has, I think, moved to
the validity of taking a series of continuation records from
different WAL sources, or the acceptability of adding
record-awareness to the walreceiver side.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: andres(at)anarazel(dot)de, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-01-23 09:50:00
Message-ID: 20180123.185000.232069302.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

At Fri, 19 Jan 2018 18:24:56 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <20180119092456(dot)GA1169(at)paquier(dot)xyz>
> On Fri, Jan 19, 2018 at 10:54:53AM +0900, Kyotaro HORIGUCHI wrote:
> > On the other hand if one logical record must be read from single
> > source, we need any means to deterrent wal-recycling on the
> > primary side. Conducting that within the primary side is rejected
> > by consensus.
>
> There is this approach of extending the message protocol as well so as
> the primary can retain the segments it needs to keep around...

I haven't looked at it in detail, but it seems to mean notifying
something from the standby to the primary, without the standby
side knowing what a WAL record is.

> > (There might be smarter way to do that, though.) To
> > do that from the standby side, just retarding write feedbacks or
> > this kind of special feedback would be needed. In any way it is
> > necessary to introduce WAL-record awareness into WAL shipping
> > process and it's seemingly large complexity.
>
> Coming to logical slots, don't you need to be aware of the beginning of
> the record on the primary to perform correctly decoding of tuples
> through time travel? If the record is present across segments, it seems

I'm not sure what the time travel is, but the required segments
seem to be kept safely with logical replication, since the
publisher moves restart_lsn based not on the LSN of finished
commits, but on the beginning LSN of running transactions. In
other words, logical decoding doesn't need to track each record
for the purpose of keeping segments, since it already keeps track
of the beginning of a transaction. Physical replication is totally
unaware of WAL records; it just copies blobs in convenient units
and only cares about LSNs. But that ignorance seems required to
keep performance.

> to me that it matters. Andres knows the answer to that for sure, so I
> would expect to be counter-argued in the next 12 hours or so.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: andres(at)anarazel(dot)de, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-04-09 04:26:54
Message-ID: 20180409.132654.189070572.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hello, I added this as Older Bugs in Open items. (I believe it's legit.)

The latest patch still applies to HEAD, with some offsets.

At Tue, 23 Jan 2018 18:50:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20180123(dot)185000(dot)232069302(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> At Fri, 19 Jan 2018 18:24:56 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <20180119092456(dot)GA1169(at)paquier(dot)xyz>
> > On Fri, Jan 19, 2018 at 10:54:53AM +0900, Kyotaro HORIGUCHI wrote:
> > > On the other hand if one logical record must be read from single
> > > source, we need any means to deterrent wal-recycling on the
> > > primary side. Conducting that within the primary side is rejected
> > > by consensus.
> >
> > There is this approach of extending the message protocol as well so as
> > the primary can retain the segments it needs to keep around...
>
> I haven't looked at it in detail, but it seems to mean notifying
> something from the standby to the primary, without the standby
> side knowing what a WAL record is.
>
> > > (There might be smarter way to do that, though.) To
> > > do that from the standby side, just retarding write feedbacks or
> > > this kind of special feedback would be needed. In any way it is
> > > necessary to introduce WAL-record awareness into WAL shipping
> > > process and it's seemingly large complexity.
> >
> > Coming to logical slots, don't you need to be aware of the beginning of
> > the record on the primary to perform correctly decoding of tuples
> > through time travel? If the record is present across segments, it seems
>
> I'm not sure what the time travel is, but the required segments
> seem to be kept safely with logical replication, since the
> publisher moves restart_lsn based not on the LSN of finished
> commits, but on the beginning LSN of running transactions. In
> other words, logical decoding doesn't need to track each record
> for the purpose of keeping segments, since it already keeps track
> of the beginning of a transaction. Physical replication is totally
> unaware of WAL records; it just copies blobs in convenient units
> and only cares about LSNs. But that ignorance seems required to
> keep performance.
>
> > to me that it matters. Andres knows the answer to that for sure, so I
> > would expect to be counter-argued in the next 12 hours or so.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: michael(dot)paquier(at)gmail(dot)com, andres(at)anarazel(dot)de, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-04-09 04:59:45
Message-ID: 20180409045945.GB1740@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Mon, Apr 09, 2018 at 01:26:54PM +0900, Kyotaro HORIGUCHI wrote:
> Hello, I added this as Older Bugs in Open items. (I believe it's
> legit.)

Yes, I think that's appropriate. I am hesitant to do the same thing
with all the other patches marked as bug fixes.
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(at)paquier(dot)xyz
Cc: michael(dot)paquier(at)gmail(dot)com, andres(at)anarazel(dot)de, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-04-09 05:19:24
Message-ID: 20180409.141924.246254700.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

At Mon, 9 Apr 2018 13:59:45 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in <20180409045945(dot)GB1740(at)paquier(dot)xyz>
> On Mon, Apr 09, 2018 at 01:26:54PM +0900, Kyotaro HORIGUCHI wrote:
> > Hello, I added this as Older Bugs in Open items. (I believe it's
> > legit.)
>
> Yes, I think that's appropriate. I am hesitant to do the same thing
> with all the other patches marked as bug fixes.

Thanks. I'm also not going to do so for the other "bug fix"es. The
reasons for this patch are that I myself recently had a suspected
case (at a customer site) and that the patch is not so large. (This
doesn't mean it is easy, of course.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, andres(at)anarazel(dot)de
Cc: michael(dot)paquier(at)gmail(dot)com, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-04-23 07:41:47
Message-ID: 89e33d4f-5c75-0738-3dcb-44c4df59e920@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On 18/01/18 20:54, Kyotaro HORIGUCHI wrote:
> At Thu, 18 Jan 2018 11:52:52 -0800, Andres Freund <andres(at)anarazel(dot)de> wrote in <20180118195252(dot)hyxgkb3krcqjzfjm(at)alap3(dot)anarazel(dot)de>
>> On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
>>> b) The second patch that I would like to mention is doing things on the
>>> standby side by being able to change a WAL source when fetching a single
>>> record so as it is possible to fetch the beginning of a cross-segment
>>> record from say an archive or the local xlog, and then request the rest
>>> on the primary. This patch can be found in
>>> /message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
>>> and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
>>> it seems to me that this actually breaks timeline switch
>>> consistency. The concept of switching a WAL source in the middle of a
>>> WAL segment is risky anyway, because we perform sanity checks while
>>> switching sources. The bug we are dealing about here is very rare, and
>>> seeing a problem like that for a timeline switch is even more rare, but
>>> the risk is not zero and we may finish with a corrupted instance, so we
>>> should not in my opinion take this approach. Also this actually achieves
>>> point 2) above, not completely though, but not 1).
>>
>> I don't agree with the sentiment about the approach, but I agree there
>> might be weaknesses in the implementation (which I have not reviewed). I
>> think it's perfectly sensible to require fetching one segment from
>> pg_xlog and the next one via another method. Currently the inability to
>> do so often leads to us constantly refetching the same segment.
>
> Though I'm still not fully confident, if reading a set of
> continuation records from two (or more) sources can lead to
> inconsistency, two successve (complete) records are facing the
> same risk.

This seems like the best approach to me as well.

Looking at the patch linked above
(/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp)

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -11693,6 +11693,10 @@ retry:
> Assert(reqLen <= readLen);
>
> *readTLI = curFileTLI;
> +
> + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
> + goto next_record_is_invalid;
> +
> return readLen;
>
> next_record_is_invalid:

What is the point of adding this XLogReaderValidatePageHeader() call?
The caller of this callback function, ReadPageInternal(), checks the
page header anyway. Earlier in this thread, you said:

> The rest to do is let XLogPageRead retry other sources
> immediately. To do this I made ValidXLogPageHeader(at)xlogreader(dot)c
> public (and renamed to XLogReaderValidatePageHeader).

but I don't understand that.

- Heikki


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: andres(at)anarazel(dot)de, michael(dot)paquier(at)gmail(dot)com, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-04-24 10:57:12
Message-ID: 20180424.195712.56235724.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Thank you very much for looking this!

At Mon, 23 Apr 2018 03:41:47 -0400, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <89e33d4f-5c75-0738-3dcb-44c4df59e920(at)iki(dot)fi>
> On 18/01/18 20:54, Kyotaro HORIGUCHI wrote:
> > At Thu, 18 Jan 2018 11:52:52 -0800, Andres Freund <andres(at)anarazel(dot)de>
> > wrote in <20180118195252(dot)hyxgkb3krcqjzfjm(at)alap3(dot)anarazel(dot)de>
> >> On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> >>> b) The second patch that I would like to mention is doing things on
> >>> the
> >>> standby side by being able to change a WAL source when fetching a
> >>> single
> >>> record so as it is possible to fetch the beginning of a cross-segment
> >>> record from say an archive or the local xlog, and then request the
> >>> rest
> >>> on the primary. This patch can be found in
> >>> /message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
> >>> and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
> >>> it seems to me that this actually breaks timeline switch
> >>> consistency. The concept of switching a WAL source in the middle of a
> >>> WAL segment is risky anyway, because we perform sanity checks while
> >>> switching sources. The bug we are dealing about here is very rare, and
> >>> seeing a problem like that for a timeline switch is even more rare,
> >>> but
> >>> the risk is not zero and we may finish with a corrupted instance, so
> >>> we
> >>> should not in my opinion take this approach. Also this actually
> >>> achieves
> >>> point 2) above, not completely though, but not 1).
> >>
> >> I don't agree with the sentiment about the approach, but I agree there
> >> might be weaknesses in the implementation (which I have not
> >> reviewed). I
> >> think it's perfectly sensible to require fetching one segment from
> >> pg_xlog and the next one via another method. Currently the inability
> >> to
> >> do so often leads to us constantly refetching the same segment.
> > Though I'm still not fully confident, if reading a set of
> > continuation records from two (or more) sources can lead to
> > inconsistency, two successve (complete) records are facing the
> > same risk.
>
> This seems like the best approach to me as well.
>
> Looking at the patch linked above
> (/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp)
>
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -11693,6 +11693,10 @@ retry:
> > Assert(reqLen <= readLen);
> > *readTLI = curFileTLI;
> > +
> > + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> > readBuf))
> > + goto next_record_is_invalid;
> > +
> > return readLen;
> > next_record_is_invalid:
>
> What is the point of adding this XLogReaderValidatePageHeader() call?
> The caller of this callback function, ReadPageInternal(), checks the
> page header anyway. Earlier in this thread, you said:

Without those lines, the server actually fails to start replication.

(Let me try to remember the details...)

The difference is whether the function can cause a retry for the
same portion of a set of continued records (without changing the
plugin API). XLogPageRead can do that. On the other hand, all
ReadPageInternal can do is let its caller, XLogReadRecord, retry
from the beginning of the set of continued records, since the
caller side handles only complete records.

In the failure case, when read() fails in XLogPageRead, it can try
the next source in the midst of a continued record. On the other
hand, if the segment was read but it was a recycled one, it passes
"success" to ReadPageInternal, which leads to retrying from the
beginning of the record: an infinite loop.

Calling XLogReaderValidatePageHeader in ReadPageInternal is
redundant, but removing it would be an interface change to the
xlogreader plugin and I am not sure that is allowed.

> > The rest to do is let XLogPageRead retry other sources
> > immediately. To do this I made ValidXLogPageHeader(at)xlogreader(dot)c
> > public (and renamed to XLogReaderValidatePageHeader).
>
> but I don't understand that.

It is exposed so that XLogPageRead can validate the page header
using it.
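
To illustrate what that buys us, here is a rough sketch of the
surrounding control flow, paraphrased from the retry machinery around
XLogPageRead in xlog.c (variable names and details may differ between
branches); the added check routes a recycled or otherwise bogus page
down the same path as a failed read(), so the next attempt can choose
another WAL source:

	/* in XLogPageRead(), just before handing the page to the reader */
	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
		goto next_record_is_invalid;

	return readLen;

next_record_is_invalid:
	lastSourceFailed = true;	/* lets WaitForWALToBecomeAvailable switch source */

	if (readFile >= 0)
		close(readFile);
	readFile = -1;
	readLen = 0;
	readSource = 0;

	/* In standby mode, keep retrying, possibly from another source. */
	if (StandbyMode)
		goto retry;
	else
		return -1;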

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, hlinnaka(at)iki(dot)fi
Cc: andres(at)anarazel(dot)de, michael(dot)paquier(at)gmail(dot)com, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-05-04 07:05:41
Message-ID: 28f5a502-8f27-94af-6d5a-ed939be87125@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On 24/04/18 13:57, Kyotaro HORIGUCHI wrote:
> At Mon, 23 Apr 2018 03:41:47 -0400, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <89e33d4f-5c75-0738-3dcb-44c4df59e920(at)iki(dot)fi>
>> Looking at the patch linked above
>> (/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp)
>>
>>> --- a/src/backend/access/transam/xlog.c
>>> +++ b/src/backend/access/transam/xlog.c
>>> @@ -11693,6 +11693,10 @@ retry:
>>> Assert(reqLen <= readLen);
>>> *readTLI = curFileTLI;
>>> +
>>> + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
>>> readBuf))
>>> + goto next_record_is_invalid;
>>> +
>>> return readLen;
>>> next_record_is_invalid:
>>
>> What is the point of adding this XLogReaderValidatePageHeader() call?
>> The caller of this callback function, ReadPageInternal(), checks the
>> page header anyway. Earlier in this thread, you said:
>
> Without those lines, the server actually fails to start replication.
>
> (Let me try to remember the details...)
>
> The difference is whether the function can cause a retry for the
> same portion of a set of continued records (without changing the
> plugin API). XLogPageRead can do that. On the other hand, all
> ReadPageInternal can do is let its caller, XLogReadRecord, retry
> from the beginning of the set of continued records, since the
> caller side handles only complete records.
>
> In the failure case, when read() fails in XLogPageRead, it can try
> the next source in the midst of a continued record. On the other
> hand, if the segment was read but it was a recycled one, it passes
> "success" to ReadPageInternal, which leads to retrying from the
> beginning of the record: an infinite loop.

I see. You have the same problem if you have a WAL file that's corrupt
in some other way, but one of the sources would have a correct copy.
ValidXLogPageHeader() only checks the page header, after all. But unlike
a recycled WAL segment, that's not supposed to happen as part of normal
operation, so I guess we can live with that.

> Calling XLogReaderValidatePageHeader in ReadPageInternal is
> redundant, but removing it would be an interface change to the
> xlogreader plugin and I am not sure that is allowed.

We should avoid doing that in back-branches, at least. But in 'master',
I wouldn't mind redesigning the API. Dealing with all the retrying is
pretty complicated as it is, if we can simplify that somehow, I think
that'd be good.

- Heikki


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: andres(at)anarazel(dot)de, michael(dot)paquier(at)gmail(dot)com, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-05-04 22:49:31
Message-ID: 47215279-228d-f30d-35d1-16af695e53f3@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On 04/05/18 10:05, Heikki Linnakangas wrote:
> On 24/04/18 13:57, Kyotaro HORIGUCHI wrote:
>> At Mon, 23 Apr 2018 03:41:47 -0400, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <89e33d4f-5c75-0738-3dcb-44c4df59e920(at)iki(dot)fi>
>>> Looking at the patch linked above
>>> (/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp)
>>>
>>>> --- a/src/backend/access/transam/xlog.c
>>>> +++ b/src/backend/access/transam/xlog.c
>>>> @@ -11693,6 +11693,10 @@ retry:
>>>> Assert(reqLen <= readLen);
>>>> *readTLI = curFileTLI;
>>>> +
>>>> + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
>>>> readBuf))
>>>> + goto next_record_is_invalid;
>>>> +
>>>> return readLen;
>>>> next_record_is_invalid:
>>>
>>> What is the point of adding this XLogReaderValidatePageHeader() call?
>>> The caller of this callback function, ReadPageInternal(), checks the
>>> page header anyway. Earlier in this thread, you said:
>>
>> Without those lines, the server actually fails to start replication.
>>
>> (Let me try to remember the details...)
>>
>> The difference is whether the function can cause a retry for the
>> same portion of a set of continued records (without changing the
>> plugin API). XLogPageRead can do that. On the other hand, all
>> ReadPageInternal can do is let its caller, XLogReadRecord, retry
>> from the beginning of the set of continued records, since the
>> caller side handles only complete records.
>>
>> In the failure case, when read() fails in XLogPageRead, it can try
>> the next source in the midst of a continued record. On the other
>> hand, if the segment was read but it was a recycled one, it passes
>> "success" to ReadPageInternal, which leads to retrying from the
>> beginning of the record: an infinite loop.
>
> I see. You have the same problem if you have a WAL file that's corrupt
> in some other way, but one of the sources would have a correct copy.
> ValidXLogPageHeader() only checks the page header, after all. But unlike
> a recycled WAL segment, that's not supposed to happen as part of normal
> operation, so I guess we can live with that.

Pushed this now, after adding some comments. Thanks!

>> Calling XLogReaderValidatePageHeader in ReadPageInternal is
>> redundant, but removing it would be an interface change to the
>> xlogreader plugin and I am not sure that is allowed.
>
> We should avoid doing that in back-branches, at least. But in 'master',
> I wouldn't mind redesigning the API. Dealing with all the retrying is
> pretty complicated as it is, if we can simplify that somehow, I think
> that'd be good.

I spent some time musing on what a better API might look like. We could
remove the ReadPage callback, and instead have XLogReadRecord return a
special return code to mean "give me more data". I'm thinking of
something like:

/* return code of XLogReadRecord() */
typedef enum
{
	XLREAD_SUCCESS,
	XLREAD_INVALID_RECORD,	/* a record was read, but it was corrupt */
	XLREAD_INVALID_PAGE,	/* the page that was supplied looks invalid. */
	XLREAD_NEED_DATA,		/* caller should place more data in buffer,
							 * and retry */
} XLogReadRecord_Result;

And the calls to XLogReadRecord() would look something like this:

for (;;)
{
	rc = XLogReadRecord(reader, startptr, errormsg);

	if (rc == XLREAD_SUCCESS)
	{
		/* great, got record */
	}
	if (rc == XLREAD_INVALID_PAGE || rc == XLREAD_INVALID_RECORD)
	{
		elog(ERROR, "invalid record");
	}
	if (rc == XLREAD_NEED_DATA)
	{
		/*
		 * Read a page from disk, and place it into reader->readBuf
		 */
		XLogPageRead(reader->readPagePtr,	/* page to read */
					 reader->reqLen			/* # of bytes to read */ );

		/*
		 * Now that we have read the data that XLogReadRecord()
		 * requested, call it again.
		 */
		continue;
	}
}

So instead of having a callback, XLogReadRecord() would return
XLREAD_NEED_DATA. The caller would then need to place that data into the
buffer, and call it again. If a record spans multiple pages,
XLogReadRecord() would return with XLREAD_NEED_DATA multiple times, to
read each page.
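
For illustration only, such a design might expose just a few fields of
the reader state to the caller (the names follow the sketch above;
this is a hypothetical excerpt, not a committed API):

#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* as in PostgreSQL's xlogdefs.h */

/* Hypothetical subset of the reader state used by the caller loop above. */
typedef struct XLogReaderState
{
	XLogRecPtr	readPagePtr;	/* start of the page the reader wants next */
	uint32_t	reqLen;			/* minimum number of bytes needed from it */
	char	   *readBuf;		/* caller places the requested bytes here */
} XLogReaderState;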

The important difference for the bug we're discussing on this thread is
is that if you passed an invalid page to XLogReadRecord(), it would
return with XLREAD_INVALID_PAGE. You could then try reading the same
page from a different source, and call XLogReadRecord() again, and it
could continue where it was left off, even if it was in the middle of a
continuation record.
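
As a purely hypothetical extension of the loop above, the
XLREAD_INVALID_PAGE branch could then become a per-page source switch;
next_wal_source() is an invented placeholder for whatever source
bookkeeping the recovery code would do:

	if (rc == XLREAD_INVALID_PAGE)
	{
		/*
		 * The page we supplied was bad (for example, from a recycled
		 * segment). Refill reader->readBuf with the same page from the
		 * next WAL source and let the reader pick up where it left off,
		 * even in the middle of a continuation record.
		 */
		currentSource = next_wal_source(currentSource);	/* invented helper */
		XLogPageRead(reader->readPagePtr, reader->reqLen);
		continue;
	}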

This is clearly not backpatchable, but maybe something to think about
for v12.

- Heikki


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: hlinnaka(at)iki(dot)fi
Cc: andres(at)anarazel(dot)de, michael(dot)paquier(at)gmail(dot)com, sfrost(at)snowman(dot)net, nag1010(at)gmail(dot)com, jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2018-05-14 06:59:13
Message-ID: 20180514.155913.76739505.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Thank you for adding the detailed comments and committing.

At Sat, 5 May 2018 01:49:31 +0300, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in <47215279-228d-f30d-35d1-16af695e53f3(at)iki(dot)fi>
> On 04/05/18 10:05, Heikki Linnakangas wrote:
> > On 24/04/18 13:57, Kyotaro HORIGUCHI wrote:
> >> At Mon, 23 Apr 2018 03:41:47 -0400, Heikki Linnakangas
> >> <hlinnaka(at)iki(dot)fi> wrote in
> >> <89e33d4f-5c75-0738-3dcb-44c4df59e920(at)iki(dot)fi>
> >>> Looking at the patch linked above
> >>> (/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp)
> >>>
> >>>> --- a/src/backend/access/transam/xlog.c
> >>>> +++ b/src/backend/access/transam/xlog.c
> >>>> @@ -11693,6 +11693,10 @@ retry:
> >>>> Assert(reqLen <= readLen);
> >>>> *readTLI = curFileTLI;
> >>>> +
> >>>> + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> >>>> readBuf))
> >>>> + goto next_record_is_invalid;
> >>>> +
> >>>> return readLen;
> >>>> next_record_is_invalid:
> >>>
> >>> What is the point of adding this XLogReaderValidatePageHeader() call?
> >>> The caller of this callback function, ReadPageInternal(), checks the
> >>> page header anyway. Earlier in this thread, you said:
> >>
> >> Without those lines, the server actually fails to start replication.
> >>
> >> (Let me try to remember the details...)
> >>
> >> The difference is whether the function can cause a retry for the
> >> same portion of a set of continued records (without changing the
> >> plugin API). XLogPageRead can do that. On the other hand, all
> >> ReadPageInternal can do is let its caller, XLogReadRecord, retry
> >> from the beginning of the set of continued records, since the
> >> caller side handles only complete records.
> >>
> >> In the failure case, when read() fails in XLogPageRead, it can try
> >> the next source in the midst of a continued record. On the other
> >> hand, if the segment was read but it was a recycled one, it passes
> >> "success" to ReadPageInternal, which leads to retrying from the
> >> beginning of the record: an infinite loop.
> > I see. You have the same problem if you have a WAL file that's corrupt
> > in some other way, but one of the sources would have a correct copy.
> > ValidXLogPageHeader() only checks the page header, after all. But
> > unlike
> > a recycled WAL segment, that's not supposed to happen as part of
> > normal
> > operation, so I guess we can live with that.

Anyway, we read successive complete records from different sources,
so I think that if such corruption caused a problem, we would have
faced the same problem even without this.

> Pushed this now, after adding some comments. Thanks!

Thanks!

> >> Calling XLogReaderValidatePageHeader in ReadPageInternal is
> >> redundant, but removing it would be an interface change to the
> >> xlogreader plugin and I am not sure that is allowed.
> > We should avoid doing that in back-branches, at least. But in
> > 'master',
> > I wouldn't mind redesigning the API. Dealing with all the retrying is
> > pretty complicated as it is, if we can simplify that somehow, I think
> > that'd be good.

Agreed.

> I spent some time musing on what a better API might look like. We
> could remove the ReadPage callback, and instead have XLogReadRecord
> return a special return code to mean "give me more data". I'm thinking
> of something like:

Sounds reasonable. That adds an extra return/call iteration to
XLogReadRecord, but it would be negligible compared to the cost
of XLogPageRead.

> /* return code of XLogReadRecord() */
> typedef enum
> {
> XLREAD_SUCCESS,
> XLREAD_INVALID_RECORD, /* a record was read, but it was corrupt */
> XLREAD_INVALID_PAGE, /* the page that was supplied looks invalid. */
> XLREAD_NEED_DATA, /* caller should place more data in buffer, and
> retry */
> } XLogReadRecord_Result;
>
>
> And the calls to XLogReadRecord() would look something like this:
>
> for(;;)
> {
> rc = XLogReadRecord(reader, startptr, errormsg);
>
> if (rc == XLREAD_SUCCESS)
> {
> /* great, got record */
> }
> if (rc == XLREAD_INVALID_PAGE || rc == XLREAD_INVALID_RECORD)
> {
> elog(ERROR, "invalid record");
> }
> if (rc == XLREAD_NEED_DATA)
> {
> /*
> * Read a page from disk, and place it into reader->readBuf
> */
> XLogPageRead(reader->readPagePtr, /* page to read */
> reader->reqLen /* # of bytes to read */ );
> /*
> * Now that we have read the data that XLogReadRecord()
> * requested, call it again.
> */
> continue;
> }
> }
>
> So instead of having a callback, XLogReadRecord() would return
> XLREAD_NEED_DATA. The caller would then need to place that data into
> the buffer, and call it again. If a record spans multiple pages,
> XLogReadRecord() would return with XLREAD_NEED_DATA multiple times, to
> read each page.

It seems easier to understand at a glance. In short, I like it.

> The important difference for the bug we're discussing on this thread
> is is that if you passed an invalid page to XLogReadRecord(), it would
> return with XLREAD_INVALID_PAGE. You could then try reading the same
> page from a different source, and call XLogReadRecord() again, and it
> could continue where it was left off, even if it was in the middle of
> a continuation record.

We will have more control over how to continue reading WAL with
this than with the current code, and it seems preferable on the
whole.

> This is clearly not backpatchable, but maybe something to think about
> for v12.

Are you planning to work on this shortly?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center