Re: Repeated requests for feedback in logical standby

Lists: pgsql-bugs
From: Magnus Hagander <magnus(at)hagander(dot)net>
To: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Cc: Marko Tiikkaja <marko(at)joh(dot)to>
Subject: Repeated requests for feedback in logical standby
Date: 2016-04-12 15:18:32
Message-ID: CABUevExsjROqDcD0A2rnJ6HK6FuKGyewJr3PL12pw85BHFGS2Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

I've been running some tests, and come across a weird behavior on shutdown
with logical replication, 9.5.2. It seems that on shutdown, a lot of extra
requests for feedback are sent from the walsender.

On my laptop I get about 3-15 requests (different each time), and on a
server with a bit more power but still very low activity, the number is
simply huge.

Basically what happens is:

* Replication works fine
* I have *zero* activity on the master since it got started, not even a
single psql
* I issue a fast shutdown on the master
* Nothing happens at all for a while
* Once pg_recvlogical times out and sends a status message things wake up
* The server sends a whole bunch of requests for status, which
pg_recvlogical responds to
* Eventually things shut down (on the proper server, I've had to kill
things because it takes too long - I'm unsure if it ever shuts down)

Log output from pg_recvlogical:

$ ./pg_recvlogical -v -p 5495 -S trivialslot -o include_transaction=on -v
-d postgres --start -f - -P test_decoding 2>&1 | ts '[%Y-%m-%d %H:%M:%S]'
[2016-04-12 17:14:31] pg_recvlogical: starting log streaming at 0/0 (slot
trivialslot)
[2016-04-12 17:14:31] pg_recvlogical: streaming initiated
[2016-04-12 17:14:31] pg_recvlogical: confirming write up to 0/0, flush to
0/0 (slot trivialslot)
[2016-04-12 17:14:41] pg_recvlogical: confirming write up to 0/18F9DE88,
flush to 0/18F9DE88 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DEF8,
flush to 0/18F9DEF8 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: confirming write up to 0/18F9DFA0,
flush to 0/18F9DFA0 (slot trivialslot)
[2016-04-12 17:14:51] pg_recvlogical: disconnected; waiting 5 seconds to
try again

In this example, the server has:
2016-04-12 17:14:45 CEST LOG: received fast shutdown request
2016-04-12 17:14:45 CEST LOG: aborting any active transactions
2016-04-12 17:14:45 CEST LOG: autovacuum launcher shutting down
2016-04-12 17:14:45 CEST LOG: shutting down
2016-04-12 17:14:45 CEST LOG: database system is shut down

Note how it took a while (6 seconds) for pg_recvlogical to pick up that
anything was happening - looks like something might not have been flushed?

When running on the server where there's more data it was very clear though
- if I break the walsender with gdb, pg_recvlogical stops spitting out
debugging messages, and as soon as I "cont" in gdb it starts spitting them
out again, so there's a clear connection between the two. And it takes a
*lot* more than a few seconds.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Nick Cleaton <nick(at)cleaton(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>, Marko Tiikkaja <marko(at)joh(dot)to>
Subject: Re: Repeated requests for feedback in logical standby
Date: 2016-04-14 05:34:03
Message-ID: CAFgz3kvwK0y-yjOk+WbNZ51nj0_jC5Q63NOW7q0Q03KoTUR+wg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 12 April 2016 at 16:18, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>
> I've been running some tests, and come across a weird behavior on shutdown with logical replication, 9.5.2. It seems that on shutdown, a lot of extra requests for feedback are sent from the walsender.
>
> On my laptop I get about 3-15 requests (different each time), and on a server with a bit more power but still very low activity, the number is simply huge.
>
> Basically what happens is:
>
> * Replication works fine
> * I have *zero* activity on the master since it got started, not even a single psql
> * I issue a fast shutdown on the master
> * Nothing happens at all for a while
> * Once pg_recvlogical times out and sends a status message things wake up
> * The server sends a whole bunch of requests for status, which pg_recvlogical responds to
> * Eventually things shut down (on the proper server, I've had to kill things because it takes too long - I'm unsure if it ever shuts down)

That sounds similar to something I've seen with wal-based replication:
http://www.postgresql.org/message-id/flat/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ(at)mail(dot)gmail(dot)com#CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com

Nick


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Nick Cleaton <nick(at)cleaton(dot)net>
Cc: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>, Marko Tiikkaja <marko(at)joh(dot)to>
Subject: Re: Repeated requests for feedback in logical standby
Date: 2016-04-14 10:30:55
Message-ID: CABUevEyi70rDiciofHJaPSO4tsXg27O8_RBNuDz8a9YUcrCvyw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Thu, Apr 14, 2016 at 7:34 AM, Nick Cleaton <nick(at)cleaton(dot)net> wrote:

> On 12 April 2016 at 16:18, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> >
> > I've been running some tests, and come across a weird behavior on
> shutdown with logical replication, 9.5.2. It seems that on shutdown, a lot
> of extra requests for feedback are sent from the walsender.
> >
> > On my laptop I get about 3-15 requests (different each time), and on a
> server with a bit more power but still very low activity, the number is
> simply huge.
> >
> > Basically what happens is:
> >
> > * Replication works fine
> > * I have *zero* activity on the master since it got started, not even a
> single psql
> > * I issue a fast shutdown on the master
> > * Nothing happens at all for a while
> > * Once pg_recvlogical times out and sends a status message things wake up
> > * The server sends a whole bunch of requests for status, which
> pg_recvlogical responds to
> > * Eventually things shut down (on the proper server, I've had to kill
> things because it takes too long - I'm unsure if it ever shuts down)
>
>
> That sounds similar to something I've seen with wal-based replication:
>
> http://www.postgresql.org/message-id/flat/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ(at)mail(dot)gmail(dot)com#CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com

Agreed, that sounds very similar.

I don't know the code enough to say whether that's a proper fix though, or
if it might have other sideeffects.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Nick Cleaton <nick(at)cleaton(dot)net>
Cc: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>, Marko Tiikkaja <marko(at)joh(dot)to>
Subject: Re: Repeated requests for feedback in logical standby
Date: 2016-04-14 10:51:34
Message-ID: CABUevEzVcyL_p+Ck10XsK120XszvfBmfhN=LQkh8zvj7=uHYfw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Thu, Apr 14, 2016 at 12:30 PM, Magnus Hagander <magnus(at)hagander(dot)net>
wrote:

>
>
> On Thu, Apr 14, 2016 at 7:34 AM, Nick Cleaton <nick(at)cleaton(dot)net> wrote:
>
>> On 12 April 2016 at 16:18, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>> >
>> > I've been running some tests, and come across a weird behavior on
>> shutdown with logical replication, 9.5.2. It seems that on shutdown, a lot
>> of extra requests for feedback are sent from the walsender.
>> >
>> > On my laptop I get about 3-15 requests (different each time), and on a
>> server with a bit more power but still very low activity, the number is
>> simply huge.
>> >
>> > Basically what happens is:
>> >
>> > * Replication works fine
>> > * I have *zero* activity on the master since it got started, not even a
>> single psql
>> > * I issue a fast shutdown on the master
>> > * Nothing happens at all for a while
>> > * Once pg_recvlogical times out and sends a status message things wake
>> up
>> > * The server sends a whole bunch of requests for status, which
>> pg_recvlogical responds to
>> > * Eventually things shut down (on the proper server, I've had to kill
>> things because it takes too long - I'm unsure if it ever shuts down)
>>
>>
>> That sounds similar to something I've seen with wal-based replication:
>>
>> http://www.postgresql.org/message-id/flat/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ(at)mail(dot)gmail(dot)com#CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com
>
>
> Agreed, that sounds very similar.
>
> I don't know the code enough to say whether that's a proper fix though, or
> if it might have other sideeffects.
>
>

I should add that yes, this patch does seem to remove the problem for me as
well. But I can't confirm if it breaks something else. Hopefully someone
who has worked more on that code can?

I also still see the problem that "pg_ctrl stop -m fast" on the master does
*nothing* until the client sends one message. I tried increasing the
feedback interval to 2 minutes, and that increased the shutdown time to 2
minutes as well. Which can't be right - clearly the shutdown is failing to
actually make the walsender send the request and shut down. This happens
both with and without the suggested patch.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Marko Tiikkaja <marko(at)joh(dot)to>
To: Magnus Hagander <magnus(at)hagander(dot)net>, Nick Cleaton <nick(at)cleaton(dot)net>
Cc: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Repeated requests for feedback in logical standby
Date: 2016-04-14 11:14:39
Message-ID: 570F7B9F.6040300@joh.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 14/04/16 12:51, Magnus Hagander wrote:
> I should add that yes, this patch does seem to remove the problem for me as
> well. But I can't confirm if it breaks something else.

Did you also see whether it fixes the problem on the bigger box which
seemed to not shut down at all?

.m


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Marko Tiikkaja <marko(at)joh(dot)to>
Cc: Nick Cleaton <nick(at)cleaton(dot)net>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Repeated requests for feedback in logical standby
Date: 2016-04-14 11:21:36
Message-ID: CABUevEz_9pvDfiJYoYzMRt9440tryQaTP4Zj7ff=ux1xidYA-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Thu, Apr 14, 2016 at 1:14 PM, Marko Tiikkaja <marko(at)joh(dot)to> wrote:

> On 14/04/16 12:51, Magnus Hagander wrote:
>
>> I should add that yes, this patch does seem to remove the problem for me
>> as
>> well. But I can't confirm if it breaks something else.
>>
>
> Did you also see whether it fixes the problem on the bigger box which
> seemed to not shut down at all?

Not yet, but those symptoms sound even more like those of Nick. I have it
fixing two different installations that I have locally here though.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Nick Cleaton <nick(at)cleaton(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>, Marko Tiikkaja <marko(at)joh(dot)to>
Subject: Re: Repeated requests for feedback in logical standby
Date: 2016-04-14 11:54:13
Message-ID: CAFgz3kupDmOm_2+pbG+MtD9caNLJ7ar+LD1id_okkOjkPm-FmA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 14 April 2016 at 11:30, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>> http://www.postgresql.org/message-id/flat/CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ(at)mail(dot)gmail(dot)com#CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com
>
>
> Agreed, that sounds very similar.
>
> I don't know the code enough to say whether that's a proper fix though, or
> if it might have other sideeffects.

Certainly. My "this fixes it for me" was intended more as a piece of
diagnostic information than as a proposed change to the code.


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Nick Cleaton <nick(at)cleaton(dot)net>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>, Marko Tiikkaja <marko(at)joh(dot)to>
Subject: Re: Repeated requests for feedback in logical standby
Date: 2016-05-09 07:42:31
Message-ID: CAB7nPqSPu+vhPLpjcWUxxGRfMbZiJUk3gCNp_13y5owOnXv0tQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Thu, Apr 14, 2016 at 8:54 PM, Nick Cleaton <nick(at)cleaton(dot)net> wrote:
> Certainly. My "this fixes it for me" was intended more as a piece of
> diagnostic information than as a proposed change to the code.

As I am just bumping on it. For the archive's sake, a fix has been
pushed as 5945501 (commit log references this thread as well).
--
Michael


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Marko Tiikkaja <marko(at)joh(dot)to>
Cc: Nick Cleaton <nick(at)cleaton(dot)net>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Repeated requests for feedback in logical standby
Date: 2016-06-16 13:13:49
Message-ID: CABUevEw82OzggLdiRa046AznxWE2U0PHLgMeQZJsNTLwrooUZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Thu, Apr 14, 2016 at 1:21 PM, Magnus Hagander <magnus(at)hagander(dot)net>
wrote:

>
>
> On Thu, Apr 14, 2016 at 1:14 PM, Marko Tiikkaja <marko(at)joh(dot)to> wrote:
>
>> On 14/04/16 12:51, Magnus Hagander wrote:
>>
>>> I should add that yes, this patch does seem to remove the problem for me
>>> as
>>> well. But I can't confirm if it breaks something else.
>>>
>>
>> Did you also see whether it fixes the problem on the bigger box which
>> seemed to not shut down at all?
>
>
> Not yet, but those symptoms sound even more like those of Nick. I have it
> fixing two different installations that I have locally here though.
>
>
Finally got around to test this properly. And no, it does not work - 9.5.3
with logical replication. I captured a couple of stack traces, and they all
look like this:

#0 0x00007f8fa933e6b0 in __read_nocancel () at
../sysdeps/unix/syscall-template.S:81
#1 0x00005641a49aded3 in read (__nbytes=8192, __buf=0x5641a53d9090,
__fd=6) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2 XLogRead (buf=buf(at)entry=0x5641a53d9090 "\207\320\002",
startptr=startptr(at)entry=1056964608, count=count(at)entry=8192)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:2124
#3 0x00005641a49af0db in logical_read_xlog_page (state=<optimized out>,
targetPagePtr=1056964608, reqLen=<optimized out>, targetRecPtr=<optimized
out>,
cur_page=0x5641a53d9090 "\207\320\002", pageTLI=<optimized out>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:772
#4 0x00005641a480f4d2 in ReadPageInternal (state=state(at)entry=0x5641a53d6b08,
pageptr=pageptr(at)entry=1062813696, reqLen=reqLen(at)entry=360)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/access/transam/xlogreader.c:524
#5 0x00005641a480fed1 in XLogReadRecord (state=0x5641a53d6b08,
RecPtr=1062814032, errormsg=errormsg(at)entry=0x7ffe3bd8ee48)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/access/transam/xlogreader.c:243
#6 0x00005641a49ae17b in XLogSendLogical () at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:2437
#7 0x00005641a49afc63 in WalSndLoop (send_data=send_data(at)entry=0x5641a49ae150
<XLogSendLogical>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:1852
#8 0x00005641a49b0aa2 in StartLogicalReplication (cmd=<optimized out>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:1015
#9 exec_replication_command (cmd_string=cmd_string(at)entry=0x5641a53a3c50
"START_REPLICATION SLOT \"nineone\" LOGICAL 0/3F5244F8
(\"include_transaction\" 'on')")
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/replication/walsender.c:1344
#10 0x00005641a49ed8b4 in PostgresMain (argc=<optimized out>,
argv=argv(at)entry=0x5641a5335118, dbname=0x5641a5335030 "db",
username=<optimized out>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/tcop/postgres.c:4028
#11 0x00005641a47925b8 in BackendRun (port=0x5641a5354bb0) at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:4239
#12 BackendStartup (port=0x5641a5354bb0) at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:3913
#13 ServerLoop () at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:1684
#14 0x00005641a4993e7b in PostmasterMain (argc=5, argv=<optimized out>)
at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/postmaster/postmaster.c:1292
#15 0x00005641a47932b2 in main (argc=5, argv=0x5641a53341f0) at
/build/postgresql-9.5-h9JXst/postgresql-9.5-9.5.3/build/../src/backend/main/main.c:228

They all come out at approximately the same place there.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/