Re: BUG #14416: checkpoints never completed

Lists: pgsql-bugs
From: jdnelson(at)dyn(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #14416: checkpoints never completed
Date: 2016-11-07 18:27:07
Message-ID: 20161107182707.1393.91785@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 14416
Logged by: Jon Nelson
Email address: jdnelson(at)dyn(dot)com
PostgreSQL version: 9.4.9
Operating system: Linux
Description:

PostgreSQL 9.4.9 on CentOS-7, x86_64, on ext4.

We encountered a problem where checkpoints appeared to queue up and never
finish.

The logs show that checkpoints were happening normally, once per minute
(which is our desired configuration in this case).

01:08:27 the system clock was synchronized. It was behind by about one
hour.
01:23:31 a backup (pg_start_backup) is started
03:40:14 we saw our first “out of disk space” message
03:47:51 we saw “checkpoint starting: time” (with no corresponding
“checkpoint complete” messages) *once per second*. We are also out of disk
space at this time.
04:00:01 we see a ‘checkpoint complete’ message (despite still being out of
disk space!)

Checkpoints appear normal until 4:07:01 at which point the the “checkpoint
start: time” message occurs once per *second*. Still out of disk space.

05:51:13 we see the first “checkpoint complete”.
05:51:22 (7 seconds later) we see a “checkpoint starting: time” message. No
messages containing "checkpoint" appear after this.

07:43:54 an error message indicating that “pg_stop_backup()” says a backup
isn’t in progress. NOTE: we do *not* see a successful pg_stop_backup(), but
it’s possible that it took less than 500ms which is our
log_min_duration_statement threshold.

23:19:32 our attempts at a manual CHECKPOINT all hang, and the only messages
with ‘checkpoint’ in them are notices of our cancellations.

We chose to shut the instance down. Over 400GB of WAL files had accrued due
to no checkpoint completing. The instance came up without issue and resumed
operations.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: jdnelson(at)dyn(dot)com
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #14416: checkpoints never completed
Date: 2016-11-07 19:22:15
Message-ID: 4037.1478546535@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

jdnelson(at)dyn(dot)com writes:
> We encountered a problem where checkpoints appeared to queue up and never
> finish.
> ...
> 01:08:27 the system clock was synchronized. It was behind by about one
> hour.

I wonder if this is a problem similar to the autovacuum issue we fixed
in da1a9d0f5, ie perhaps moving the system clock setting confuses the
checkpoint timing logic.

What checkpoint-related settings are you using? Can you be more specific
about the system clock change?

regards, tom lane


From: Jonathon Nelson <jdnelson(at)dyn(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #14416: checkpoints never completed
Date: 2016-11-07 19:48:52
Message-ID: CACJqAM0no5HUKc9GhoOBBV77rpcJSJ_cSs85vw_Zw7taZchu=A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Mon, Nov 7, 2016 at 1:22 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> jdnelson(at)dyn(dot)com writes:
> > We encountered a problem where checkpoints appeared to queue up and never
> > finish.
> > ...
> > 01:08:27 the system clock was synchronized. It was behind by about one
> > hour.
>
> I wonder if this is a problem similar to the autovacuum issue we fixed
> in da1a9d0f5, ie perhaps moving the system clock setting confuses the
> checkpoint timing logic.
>

That is more or less what we were thinking as well.

> What checkpoint-related settings are you using? Can you be more specific
> about the system clock change?
>

checkpoint_completion_target = 0.85
checkpoint_segments = 2048
checkpoint_timeout = 60

We have noticed that the system clock has not been the most reliable. In
this case, though, there is only evidence of a single clock correction at
the noted time. No other corrections or drift during the actual event were
detected, but that doesn't mean that there weren't any.

--
Jon Nelson


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jonathon Nelson <jdnelson(at)dyn(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #14416: checkpoints never completed
Date: 2016-11-07 22:20:08
Message-ID: 13353.1478557208@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: Postg토토 핫SQL : Postg토토 핫SQL 메일 링리스트 : 2016-11-07 이후 PGSQL-BUGS 22:20

Jonathon Nelson <jdnelson(at)dyn(dot)com> writes:
> On Mon, Nov 7, 2016 at 1:22 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I wonder if this is a problem similar to the autovacuum issue we fixed
>> in da1a9d0f5, ie perhaps moving the system clock setting confuses the
>> checkpoint timing logic.

> That is more or less what we were thinking as well.

Looking at the logic around this in checkpointer.c, it's pretty obvious
that it would not behave nicely if system time goes backwards after a
checkpoint starts; it would think it was ahead of schedule and would
just loaf, basically, until the clock catches up to where it had been.
There's no sanity check to notice a negative elapsed-time reading.
But if system time goes forwards, it would think it was very far behind
schedule and would do a burst of work, which doesn't seem to match your
symptom.

Please confirm the sign of the system clock correction that happened
on your machine?

regards, tom lane


From: Jonathon Nelson <jdnelson(at)dyn(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #14416: checkpoints never completed
Date: 2016-11-08 01:59:51
Message-ID: CACJqAM0cKT0JRcY_D+vgu99KAAtOHz8=z7ECWm=zxwtB=Bvrkw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Mon, Nov 7, 2016 at 4:20 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jonathon Nelson <jdnelson(at)dyn(dot)com> writes:
> > On Mon, Nov 7, 2016 at 1:22 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> I wonder if this is a problem similar to the autovacuum issue we fixed
> >> in da1a9d0f5, ie perhaps moving the system clock setting confuses the
> >> checkpoint timing logic.
>
> > That is more or less what we were thinking as well.
>
> Looking at the logic around this in checkpointer.c, it's pretty obvious
> that it would not behave nicely if system time goes backwards after a
> checkpoint starts; it would think it was ahead of schedule and would
> just loaf, basically, until the clock catches up to where it had been.
> There's no sanity check to notice a negative elapsed-time reading.
> But if system time goes forwards, it would think it was very far behind
> schedule and would do a burst of work, which doesn't seem to match your
> symptom.
>
> Please confirm the sign of the system clock correction that happened
> on your machine?
>

Before responding, I triple checked everything I have. I did make a
mistake, but it's one of scale: the time went forward 1d, 57m, and 1.7s
(earlier I said it was about an hour). Prior to the event, the system clock
was all over the place, however I cannot find evidence of any further time
corrections. This is a busy system and easily logs more than once a second,
so I chose to identify time jumps by taking the logs (in the order they
appeared in!) and subtracting the previous log's timestamp. If the
difference was either negative or greater than 2 seconds, I set it aside. I
did not find any such events during this timeframe.

--
Jon Nelson