Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date: 2024-11-11 22:08:49
Message-ID: abf794c4-c459-4fed-84d9-968c4f0e2052@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/11/24 15:17, Tomas Vondra wrote:
> On 11/11/24 14:51, Ashutosh Bapat wrote:
>> ...
>>
>> I think the problem is about processing older running transactions
>> record and setting data.restart_lsn based on the candidates those
>> records produce. But what is not clear to me is how come a newer
>> candidate_restart_lsn is available immediately upon WAL sender
>> restart. I.e. in the sequence of events you mentioned in your first
>> email
>> 1. 344.139 LOG: starting logical decoding for slot "s1"
>>
>> 2. 344.139 DETAIL: Streaming transactions committing after 1/E97EAC30,
>> reading WAL from 1/E96FB4D0.
>>
>> 3. 344.140 LOG: logical decoding found consistent point at 1/E96FB4D0
>>
>> 4. 344.140 DETAIL: Logical decoding will begin using saved snapshot.
>>
>> 5. 344.140 LOG: LogicalConfirmReceivedLocation 1/E9865398
>>
>> 6. 344.140 LOG: LogicalConfirmReceivedLocation updating
>> data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
>> candidate_restart_valid 0/0 (from 1/E9865398)
>> candidate_restart_lsn 0/0 (from 1/E979D4C8)
>>
>> how did candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
>> = 1/E9865398 were set in ReplicationSlot after WAL sender? It means it
>> must have read and processed running transaction record at 1/E9865398.
>> If that's true, how come it went back to a running transactions WAL
>> record at 1/E979D4C8? It should be reading WAL records sequentially,
>> hence read 1/E979D4C8 first then 1/E9865398.
>>
>> 344.145 LOG: LogicalIncreaseRestartDecodingForSlot s1
>> candidate_restart_valid_lsn 1/E979D4C8 (0/0)
>> candidate_restart_lsn 1/E96FB4D0 (0/0)
>>
>
> Those are good questions, but IIUC that's explained by this comment from
> Masahiko-san's analysis [1]:
>
> Thinking about the root cause more, it seems to me that the root cause
> is not the fact that candidate_xxx values are not cleared when being
> released.
>
> In the scenario I reproduced, after restarting the logical decoding,
> the walsender sets the restart_lsn to a candidate_restart_lsn left in
> the slot upon receiving the ack from the subscriber. ...
>
> If this is correct, then what happens is:
>
> 1) replication is running, at some point we set candidate LSN to B
>
> 2) something breaks, causing reconnect with restart LSN A (< B)
>
> 3) we still have the candidate LSN B in memory, and after receiving
> some confirmation we set it as restart_lsn
>
> 4) we get to decode the RUNNING_XACTS, which moves restart_lsn back
>
>
> If this analysis is correct, I think it's rather suspicious we don't
> reset the candidate fields on restart. Can those "old" values ever be
> valid? But I haven't tried resetting them.
>

To clarify this a bit, I mean something like in the attached 0003 patch.

The reasoning is that after ReplicationSlotAcquire() we should get the
slot in the same state as if we just read it from disk. Because why not?
Why should the result be different from what we'd get if the primary
restated right before the reconnect?

Parts 0001 and 0002 add a couple asserts to prevent backwards move for
both the restart_lsn and the various candidate LSN fields.

Both the 0003 and 0004 patches (applied separately) seems to fix crashes
in my stress test, and none of the asserts from 0001+0002 seem to fail.
I'm not sure if we need both fixes or just one of them.

But neither of those fixes prevents backwards move for confirmed_flush
LSN, as enforced by asserts in the 0005 patch. I don't know if this
assert is incorrect or now. It seems natural that once we get a
confirmation for some LSN, we can't move before that position, but I'm
not sure about that. Maybe it's too strict.

regards

--
Tomas Vondra

Attachment Content-Type Size
0001-asserts-on-restart_lsn-backwards-move.patch text/x-patch 2.0 KB
0002-asserts-for-candidate-lsn-fields.patch text/x-patch 1.9 KB
0003-reset-slot-in-acquire.patch text/x-patch 1.3 KB
0004-fix-LogicalIncreaseRestartDecodingForSlot.patch text/x-patch 2.3 KB
0005-confirmed_flush-asserts.patch text/x-patch 1.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2024-11-11 22:10:32 Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Previous Message Michel Pelletier 2024-11-11 21:34:06 Re: Using Expanded Objects other than Arrays from plpgsql