From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | buschmann(at)nidsa(dot)net, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-02-20 02:21:05 |
Message-ID: | CA+hUKGKpQJCWcgyy3QTC9vdn6uKAR_8r__A-MMm2GYfj45caag@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
On Tue, Feb 19, 2019 at 7:31 AM PG Bug reporting form
<noreply(at)postgresql(dot)org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 15641
> Logged by: Hans Buschmann
> Email address: buschmann(at)nidsa(dot)net
> PostgreSQL version: 11.2
> Operating system: Windows Server 2019 Standard
> Description:
>
> I recently moved a production system from PG 10.7 to 11.2 on a different
> server.
>
> The configuration settings were mostly taken from the old system and
> enhanced with new features of PG 11.
>
> pg_prewarm was used for a long time (with no specific configuration).
>
> Now I have added huge page support for Windows in the OS and verified,
> with the vmmap tool from Sysinternals, that it is active
> (the shared buffers are locked in memory: Lock_WS is set).
>
> When pg_prewarm.autoprewarm is set to on (using the default after initial
> database import via pg_restore), the autoprewarm worker process
> terminates immediately and generates a huge number of logfile entries
> like:
>
> CPS PRD 2019-02-17 16:11:53 CET 00000 11:> LOG: background worker
> "autoprewarm worker" (PID 3996) exited with exit code 1
> CPS PRD 2019-02-17 16:11:53 CET 55000 1:> ERROR: could not map dynamic
> shared memory segment
Hmm. It's not clear to me how using large pages for the main
PostgreSQL shared memory region could have any impact on autoprewarm's
entirely separate DSM segment. I wonder if other DSM use cases are
impacted. Does parallel query work? For example, the following
produces a parallel query that uses a few DSM segments:
create table foo as select generate_series(1, 1000000)::int i;
analyze foo;
explain analyze select count(*) from foo f1 join foo f2 using (i);
Looking at the place where that error occurs, it seems like it simply
failed to find the handle, as if it didn't exist at all at the time
dsm_attach() was called. I'm not entirely sure how that could happen
just because you turned on huge pages. Is it possible that there is a
race where apw_load_buffers() manages to detach before the worker
attaches, and the timing has changed? At a glance, that shouldn't happen,
because apw_start_database_worker() waits for the worker to exit before
returning.
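For reference, the error is raised where the per-database worker tries
to attach to the segment. Here's a hedged paraphrase of that spot (from
memory of contrib/pg_prewarm/autoprewarm.c, not the verbatim source, so
details may differ):

/*
 * Sketch: the per-database worker receives the DSM handle as its start
 * argument; if the segment cannot be found, it raises exactly the error
 * seen in the log above.
 */
void
autoprewarm_database_main(Datum main_arg)
{
    dsm_segment *seg;

    seg = dsm_attach(DatumGetUInt32(main_arg));
    if (seg == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("could not map dynamic shared memory segment")));

    /* ... prewarm the previously dumped blocks for this database ... */
}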
I think we'll need one of our Windows-enabled hackers to take a look.
PS Sorry for breaking the thread. I wish our archives app had a
"[re]send me this email" button, for people who subscribed after the
message was sent...
--
Thomas Munro
https://enterprisedb.com
From: | "Hans Buschmann" <buschmann(at)nidsa(dot)net> |
---|---|
To: | "Thomas Munro" <thomas(dot)munro(at)gmail(dot)com>, "PostgreSQL mailing lists" <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | AW: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-02-20 16:17:08 |
Message-ID: | D2B9F2A20670C84685EF7D183F2949E202569F1C@gigant.nidsa.net |
Lists: | pgsql-bugs pgsql-hackers |
Thank you for taking a look.
I encountered this problem after switching the production system, and then found it also on the newly created replica.
I have no knowledge of the shared memory areas involved.
I did some further investigation and tried to reproduce it on the old system (WS2016, PG 11.2), but there it worked fine (both with and without huge pages activated!).
Even on a developer machine under WS2019, PG 11.2 the error did not occur (the two cases ran on different generations of Intel machines, Haswell and Nehalem, under different hypervisors, WS2012R2 and WS2019).
I am really confused at not being able to reproduce the error outside of the production and replica instances...
The error caused a massive flood in the logs (about 800 MB in about 1 day, on SSD).
I'll try to investigate further by configuring a second replica tomorrow, using the configuration of the production system as taken via pg_basebackup.
I looked at the non-default configuration settings but could not identify anything special.
Here is a current list of the production system, which has 4 GB of memory allocated to the VM
(all values with XXX are slightly obfuscated).
Here, to avoid the error, pg_prewarm.autoprewarm is off!
name | current_setting |
-------------------------------+--------------------------------------------+
application_name | psql |
archive_command | copy "xxxxxx" |
archive_mode | on |
auto_explain.log_analyze | off |
auto_explain.log_min_duration | -1 |
client_encoding | WIN1252 |
cluster_name | XXX_PROD |
data_checksums | on |
DateStyle | ISO, DMY |
default_text_search_config | pg_catalog.german |
dynamic_shared_memory_type | windows |
effective_cache_size | 8GB |
lc_collate | C |
lc_ctype | German_Germany.1252 |
lc_messages | C |
lc_monetary | German_Germany.1252 |
lc_numeric | German_Germany.1252 |
lc_time | German_Germany.1252 |
listen_addresses | * |
log_destination | stderr |
log_directory | <XXX_PATH_TO_LOG> |
log_file_mode | 0640 |
log_line_prefix | XXX PRD %t %i %e %2l:> |
log_statement | mod |
log_temp_files | 0 |
log_timezone | CET |
logging_collector | on |
maintenance_work_mem | 128MB |
max_connections | 200 |
max_stack_depth | 2MB |
max_wal_size | 1GB |
min_wal_size | 80MB |
pg_prewarm.autoprewarm | off |
pg_stat_statements.max | 8000 |
pg_stat_statements.track | all |
random_page_cost | 1 |
search_path | public, archiv, ablage, admin |
server_encoding | UTF8 |
server_version | 11.2 |
shared_buffers | 768MB |
shared_preload_libraries | auto_explain,pg_stat_statements,pg_prewarm |
temp_buffers | 32MB |
TimeZone | CET |
transaction_deferrable | off |
transaction_isolation | read committed |
transaction_read_only | off |
update_process_title | off |
wal_buffers | 16MB |
wal_compression | on |
wal_segment_size | 16MB |
work_mem | 64MB |
(51 rows)
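For reference, one way to produce such a list (a sketch, not necessarily
the query used above) is to ask pg_settings for everything that does not
come from the built-in defaults:

select name, current_setting(name) as current_setting
from pg_settings
where source not in ('default', 'override')
order by name;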
Thanks
Hans Buschmann
From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Hans Buschmann <buschmann(at)nidsa(dot)net> |
Cc: | PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, mithun(dot)cy(at)enterprisedb(dot)com |
Subject: | Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-02-20 21:41:11 |
Message-ID: | CA+hUKG+29N_CXU3sF3t2f_4V+JDUqs=2dxxn8DpmcQnyEmBgqw@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
On Thu, Feb 21, 2019 at 4:36 AM Hans Buschmann <buschmann(at)nidsa(dot)net> wrote:
> I encountered this problem after switching the production system, and then found it also on the newly created replica.
>
> I have no knowledge of the shared memory areas involved.
>
> I did some further investigation and tried to reproduce it on the old system (WS2016, PG 11.2), but there it worked fine (both with and without huge pages activated!).
>
> Even on a developer machine under WS2019, PG 11.2 the error did not occur (the two cases ran on different generations of Intel machines, Haswell and Nehalem, under different hypervisors, WS2012R2 and WS2019).
>
> I am really confused at not being able to reproduce the error outside of the production and replica instances...
>
> The error caused a massive flood in the logs (about 800 MB in about 1 day, on SSD).
>
> I'll try to investigate further by configuring a second replica tomorrow, using the configuration of the production system as taken via pg_basebackup.
Just to confirm: on the machines where it happens, does it happen on
every restart, and does it never happen if you set huge_pages = off?
CC'ing the authors of the auto-prewarm feature to see if they have ideas.
There is a known bug (fixed in commit 6c0fb941 for the next release)
that would cause spurious dsm_attach() failures that would look just
like this (dsm_attach() returns NULL), but that should be very rare
and couldn't cause the behaviour described here, because here the
background worker is repeatedly failing to attach in a loop (hence the
800 MB of logs).
--
Thomas Munro
https://enterprisedb.com
From: | "Hans Buschmann" <buschmann(at)nidsa(dot)net> |
---|---|
To: | "Thomas Munro" <thomas(dot)munro(at)gmail(dot)com> |
Cc: | "PostgreSQL mailing lists" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, <mithun(dot)cy(at)enterprisedb(dot)com> |
Subject: | AW: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-02-21 09:21:55 |
Message-ID: | D2B9F2A20670C84685EF7D183F2949E202569F1E@gigant.nidsa.net |
Lists: | pgsql-bugs pgsql-hackers |
Hello,
Since these are production systems, I didn't set huge_pages=off.
(Huge pages give a performance benefit; autoprewarm is not so necessary.)
I think it occurred on every start, but the systems were only started once or twice in this error mode.
In the other cases I tried yesterday, the results were very confusing (the error was not reproducible, with or without huge pages).
Hans Buschmann
From: | Mithun Cy <mithun(dot)cy(at)gmail(dot)com> |
---|---|
To: | Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Hans Buschmann <buschmann(at)nidsa(dot)net>, thomas(dot)munro(at)gmail(dot)com |
Cc: | pgsql-bugs(at)lists(dot)postgresql(dot)org, robertmhaas(at)gmail(dot)com |
Subject: | Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-02-21 12:53:29 |
Message-ID: | CADq3xVb_WUOz7RuhEEEKMnsHAx8hGzX74PKCwyJg57MOdEf=qA@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
Hi Thomas, Hans,
On Thu, Feb 21, 2019 at 2:16 PM Hans Buschmann <buschmann(at)nidsa(dot)net> wrote:
>
> Hello,
>
> Since these are production systems, I didn't set huge_pages=off.
> (Huge pages give a performance benefit; autoprewarm is not so necessary.)
I did turn autoprewarm on; on Windows Server 2019 and PostgreSQL 11.2 it
runs fine even with huge_pages=on (thanks to Neha Sharma). As Thomas
said, the error is coming from the per-database worker, and the main
worker waits until the per-database worker exits, so from code review I
do see an issue of having an invalid handle in the per-database worker.
A reproducible test case will really help. I shall recheck the code
again, but I am not very hopeful without a proper test case.
--
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com
From: | Mithun Cy <mithun(dot)cy(at)gmail(dot)com> |
---|---|
To: | Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Hans Buschmann <buschmann(at)nidsa(dot)net>, thomas(dot)munro(at)gmail(dot)com |
Cc: | pgsql-bugs(at)lists(dot)postgresql(dot)org, robertmhaas(at)gmail(dot)com |
Subject: | Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-02-21 12:58:28 |
Message-ID: | CADq3xVZVsDjkjnR3TRP95mYvz+0m8M8Fyk6+WAuRGoY-x8-0=Q@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
On Thu, Feb 21, 2019 at 6:23 PM Mithun Cy <mithun(dot)cy(at)gmail(dot)com> wrote:
> said, the error is coming from the per-database worker, and the main
> worker waits until the per-database worker exits, so from code review I
> do see an issue of having an invalid handle in the per-database worker.
Sorry, a typo: I meant I do *not* see an issue in the code.
From: | "Hans Buschmann" <buschmann(at)nidsa(dot)net> |
---|---|
To: | "Mithun Cy" <mithun(dot)cy(at)gmail(dot)com>, "Mithun Cy" <mithun(dot)cy(at)enterprisedb(dot)com>, <thomas(dot)munro(at)gmail(dot)com> |
Cc: | <pgsql-bugs(at)lists(dot)postgresql(dot)org>, <robertmhaas(at)gmail(dot)com> |
Subject: | AW: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-02-24 14:04:09 |
Message-ID: | D2B9F2A20670C84685EF7D183F2949E202569F21@gigant.nidsa.net |
Lists: | pgsql-bugs pgsql-hackers |
Over the weekend, I did some more investigation:
It seems that huge pages are NOT the cause of this problem.
The problem is only reproducible ONCE; after a database restart it disappears.
By reinstalling the original pg_basebackup on another test VM, the problem reappeared once.
Here is the start of the error log:
CPS PRD 2019-02-24 12:11:57 CET 00000 1:> LOG: database system was interrupted; last known up at 2019-02-17 16:14:05 CET
CPS PRD 2019-02-24 12:12:16 CET 00000 2:> LOG: entering standby mode
CPS PRD 2019-02-24 12:12:16 CET 00000 3:> LOG: redo starts at 0/23000028
CPS PRD 2019-02-24 12:12:16 CET 00000 4:> LOG: consistent recovery state reached at 0/23000168
CPS PRD 2019-02-24 12:12:16 CET 00000 5:> LOG: invalid record length at 0/24000060: wanted 24, got 0
CPS PRD 2019-02-24 12:12:16 CET 00000 9:> LOG: database system is ready to accept read only connections
CPS PRD 2019-02-24 12:12:16 CET 3D000 1:> FATAL: database 16384 does not exist
CPS PRD 2019-02-24 12:12:16 CET 00000 10:> LOG: background worker "autoprewarm worker" (PID 3968) exited with exit code 1
CPS PRD 2019-02-24 12:12:16 CET 00000 1:> LOG: autoprewarm successfully prewarmed 0 of 12402 previously-loaded blocks
CPS PRD 2019-02-24 12:12:17 CET XX000 1:> FATAL: could not connect to the primary server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.27.155", user "replicator", SSL off
CPS PRD 2019-02-24 12:12:17 CET 55000 1:> ERROR: could not map dynamic shared memory segment
CPS PRD 2019-02-24 12:12:17 CET 00000 11:> LOG: background worker "autoprewarm worker" (PID 3296) exited with exit code 1
CPS PRD 2019-02-24 12:12:17 CET XX000 1:> FATAL: could not connect to the primary server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.27.155", user "replicator", SSL off
CPS PRD 2019-02-24 12:12:17 CET 55000 1:> ERROR: could not map dynamic shared memory segment
CPS PRD 2019-02-24 12:12:17 CET 00000 12:> LOG: background worker "autoprewarm worker" (PID 2756) exited with exit code 1
CPS PRD 2019-02-24 12:12:17 CET 55000 1:> ERROR: could not map dynamic shared memory segment
...
(PS: the correct replication configuration was not set, thus causing the errors concerning replication)
It seems that an outdated autoprewarm.blocks causes the problem.
After a restart, the autoprewarm.blocks file seems to be rewritten, so the next start gives no error.
For a test, I copied the erroneous autoprewarm.blocks file over to the data directory, and the problem reappeared.
The autoprewarm.blocks file was not corrupted or moved around manually; it is rather a leftover from the preceding test installation.
On this instance I had installed a copy of the production database under 11.2.
For the production switch, I dropped the test database and pg_restored the current one.
This left the previous autoprewarm.blocks file in the data directory.
On the first start, the autoprewarm file does not match the newly restored database (perhaps the cause of the fatal error: database 16384 does not exist).
So the problem lies in the initial handling of the autoprewarm.blocks file.
This seems easy to reproduce (a scripted sketch follows the list):
- Install/create a database with autoprewarm on and pg_prewarm loaded.
- Fill the autoprewarm cache with some data
- pg_dump the database
- drop the database
- create the database and pg_restore it from the dump
- start the instance, and the logs are flooded
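A minimal scripted sketch of those steps, under stated assumptions: the
database name and paths are made up, and pg_prewarm must already be in
shared_preload_libraries with pg_prewarm.autoprewarm = on:

# Hypothetical repro sketch; 'testdb' and all paths are examples.
createdb testdb
psql -d testdb -c "create extension pg_prewarm;"
psql -d testdb -c "create table t as select generate_series(1, 1000000) i;"
psql -d testdb -c "select pg_prewarm('t');"          # pull t into shared buffers
psql -d testdb -c "select autoprewarm_dump_now();"   # write autoprewarm.blocks now
pg_dump -d testdb -f testdb.dump
dropdb testdb     # autoprewarm.blocks still names the dropped database's OID
createdb testdb   # the recreated database gets a new OID
psql -d testdb -f testdb.dump
pg_ctl restart -D "$PGDATA"   # on restart, the worker retries the stale OID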
I have not investigated further in the source code due to limited skills so far...
Thanks
Hans Buschmann
From: | Mithun Cy <mithun(dot)cy(at)gmail(dot)com> |
---|---|
To: | Hans Buschmann <buschmann(at)nidsa(dot)net> |
Cc: | Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, thomas(dot)munro(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-02-24 18:40:49 |
Message-ID: | CADq3xVY_DjvRMv2DpFugyaW+7ZJAqeEU6vcFPossXTEGSE=toA@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
Thanks Hans for the simple, reproducible test.
On Sun, Feb 24, 2019 at 6:54 PM Hans Buschmann <buschmann(at)nidsa(dot)net> wrote:
> Here is the start of the error log:
>
> CPS PRD 2019-02-24 12:11:57 CET 00000 1:> LOG: database system was interrupted; last known up at 2019-02-17 16:14:05 CET
> CPS PRD 2019-02-24 12:12:16 CET 00000 2:> LOG: entering standby mode
> CPS PRD 2019-02-24 12:12:16 CET 00000 3:> LOG: redo starts at 0/23000028
> CPS PRD 2019-02-24 12:12:16 CET 00000 4:> LOG: consistent recovery state reached at 0/23000168
> CPS PRD 2019-02-24 12:12:16 CET 00000 5:> LOG: invalid record length at 0/24000060: wanted 24, got 0
> CPS PRD 2019-02-24 12:12:16 CET 00000 9:> LOG: database system is ready to accept read only connections
> CPS PRD 2019-02-24 12:12:16 CET 3D000 1:> FATAL: database 16384 does not exist
> CPS PRD 2019-02-24 12:12:16 CET 00000 10:> LOG: background worker "autoprewarm worker" (PID 3968) exited with exit code 1
> CPS PRD 2019-02-24 12:12:16 CET 00000 1:> LOG: autoprewarm successfully prewarmed 0 of 12402 previously-loaded blocks
> CPS PRD 2019-02-24 12:12:17 CET XX000 1:> FATAL: could not connect to the primary server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.27.155", user "replicator", SSL off
> CPS PRD 2019-02-24 12:12:17 CET 55000 1:> ERROR: could not map dynamic shared memory segment
As per the log, the autoprewarm master did exit ("autoprewarm successfully
prewarmed 0 of 12402 previously-loaded blocks") first; only then did we
start getting "could not map dynamic shared memory segment".
That is, the master had done dsm_detach, and the workers started throwing
errors after that.
> This seems easy to reproduce:
>
> - Install/create a database with autoprewarm on and pg_prewarm loaded.
> - Fill the autoprewarm cache with some data
> - pg_dump the database
> - drop the database
> - create the database and pg_restore it from the dump
> - start the instance, and the logs are flooded
>
> I have not investigated further in the source code due to limited
skills so far...
I was able to reproduce the same.
The "worker.bgw_restart_time" is never set for autoprewarm workers, so on
error the worker gets restarted after some period of time (the default
behavior). Since the database itself was dropped, our attempt to connect
to that database failed and the worker exited. But it got restarted again
by the postmaster, and then we start seeing the DSM segment error above.
I think every autoprewarm worker should be set with
"worker.bgw_restart_time = BGW_NEVER_RESTART;" so that there will be no
repeated prewarm attempt for a dropped database. I will think about this
further and submit a patch for the same.
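For concreteness, here is a sketch of what such a fix would look like at
the spot where the per-database worker is registered (paraphrased from
memory of contrib/pg_prewarm/autoprewarm.c; every field value except the
bgw_restart_time line is an assumption, not the verbatim source):

/*
 * Hedged sketch: register the per-database worker so the postmaster
 * never restarts it. Only the bgw_restart_time assignment is the
 * proposed fix; the surrounding setup is paraphrased.
 */
static void
apw_register_database_worker_sketch(Datum dsm_handle_datum)
{
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
        BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = BGW_NEVER_RESTART;  /* the proposed one-line fix */
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "autoprewarm_database_main");
    snprintf(worker.bgw_name, BGW_MAXLEN, "autoprewarm worker");
    worker.bgw_main_arg = dsm_handle_datum;       /* assumed: DSM handle */
    worker.bgw_notify_pid = MyProcPid;

    (void) RegisterDynamicBackgroundWorker(&worker, NULL); /* error handling omitted */
}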
--
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com
From: | "Hans Buschmann" <buschmann(at)nidsa(dot)net> |
---|---|
To: | "Mithun Cy" <mithun(dot)cy(at)gmail(dot)com> |
Cc: | "Mithun Cy" <mithun(dot)cy(at)enterprisedb(dot)com>, <thomas(dot)munro(at)gmail(dot)com>, <pgsql-bugs(at)lists(dot)postgresql(dot)org>, "Robert Haas" <robertmhaas(at)gmail(dot)com> |
Subject: | AW: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-02-25 10:59:48 |
Message-ID: | D2B9F2A20670C84685EF7D183F2949E202569F22@gigant.nidsa.net |
Lists: | pgsql-bugs pgsql-hackers |
Glad to hear you could reproduce the case easily.
I wanted to add that the problem, as it seems now, shouldn't be restricted to Windows only.
Another thing is the semantic scope of pg_prewarm:
Prewarming affects the whole cluster, so at instance start we may encounter both active and dropped databases.
To not affect the other databases, prewarming should proceed for all non-dropped databases and skip only the dropped ones.
Hope your thinking gives a good patch... ;)
Hans Buschmann
From: | Mithun Cy <mithun(dot)cy(at)gmail(dot)com> |
---|---|
To: | Hans Buschmann <buschmann(at)nidsa(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Cc: | Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, thomas(dot)munro(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-03-18 07:04:18 |
Message-ID: | CADq3xVZ4oVE6pS_-Bww6OmiY+WeE96civ3POEqUKe0Oa1fJrpA@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
On Mon, Feb 25, 2019 at 12:10 AM Mithun Cy <mithun(dot)cy(at)gmail(dot)com> wrote:
> Thanks Hans for the simple, reproducible test.
>
> The "worker.bgw_restart_time" is never set for autoprewarm workers so on
> error it get restarted after some period of time (default behavior). Since
> database itself is dropped our attempt to connect to that database failed
> and then worker exited. But again got restated by postmaster then we start
> seeing above DSM segment error.
>
> I think every autoprewarm worker should be set with
> "worker.bgw_restart_time = BGW_NEVER_RESTART;" so that there shall not be
> repeated prewarm attempt of a dropped database. I will try to think further
> and submit a patch for same.
>
Here is the patch for the same; the autoprewarm worker should not be
restarted. As per the code in @apw_start_database_worker@, the master
starts a worker per database and waits until it exits by calling
WaitForBackgroundWorkerShutdown. That call cannot handle the case where
the worker was restarted: WaitForBackgroundWorkerShutdown() gets the
status BGWH_STOPPED from GetBackgroundWorkerPid() if the worker got
restarted. So the master will then detach the shared memory, and the
next restarted worker keeps failing, going into an unending loop.
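As a rough illustration, here is a hedged paraphrase of that wait loop
(reconstructed from memory, not the verbatim bgworker.c source), showing
why it cannot tell "exited for good" apart from "exited, restart pending":

/*
 * Paraphrased wait loop: BGWH_STOPPED is reported whenever the worker
 * currently has no PID, which is also true in the window between a
 * failure exit and the postmaster restarting the worker.
 */
for (;;)
{
    pid_t           pid;
    BgwHandleStatus status;

    CHECK_FOR_INTERRUPTS();

    status = GetBackgroundWorkerPid(handle, &pid);
    if (status == BGWH_STOPPED)
        break;          /* the master then proceeds to dsm_detach() */

    (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
                     PG_WAIT_EXTENSION);
    ResetLatch(MyLatch);
}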
I think there is no need to restart at all. The following are the normal
errors we might encounter:
1. The database being connected to was dropped -- we need to skip to the
next database, which the master will do by starting a new worker. So a
restart is not needed.
2. A relation was dropped -- try_relation_open(reloid, AccessShareLock) is
used, so an error due to a dropped relation is handled; this also avoids
concurrent truncation.
3. smgrexists is used before reading from a fork file. Again, the error is
handled.
4. Before reading a block we have the check below, so previously truncated
pages will not be read again:
/* Check whether blocknum is valid and within fork file size. */
if (blk->blocknum >= nblocks)
I think if any other unexpected error occurs, it should be fatal, so
restarting would not correct it. Hence there is no need to restart the
per-database worker process.
I tried to dig into why we did not set it earlier. It used to be
never-restart, but that changed while addressing review comments [1]. At
that time we did not make an explicit database connection per worker and
did not handle as many error cases as we do now, so it appeared fair.
But when the code changed to make a database connection per worker, we
should have set every worker to BGW_NEVER_RESTART. I think that was a
mistake.
NOTE: On a zero exit status we will not restart the bgworker (see
@CleanupBackgroundWorker@ and @maybe_start_bgworkers@).
[1]
/message-id/CA%2BTgmoYNF_wfdwQ3z3713zKy2j0Z9C32WJdtKjvRWzeY7JOL4g%40mail.gmail.com
--
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
never_restart_apw_worker_01.patch | application/octet-stream | 594 bytes |
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Mithun Cy <mithun(dot)cy(at)gmail(dot)com> |
Cc: | Hans Buschmann <buschmann(at)nidsa(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-03-18 15:31:26 |
Message-ID: | CA+TgmoZV=+=aY3OBFLwH5-Uke=FMS4eppDbjSJJrH5=r-Zh74w@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers
On Mon, Mar 18, 2019 at 3:04 AM Mithun Cy <mithun(dot)cy(at)gmail(dot)com> wrote:
> The autoprewarm worker should not be restarted. As per the code in @apw_start_database_worker@, the master starts a worker per database and waits until it exits by calling WaitForBackgroundWorkerShutdown. That call cannot handle the case where the worker was restarted: WaitForBackgroundWorkerShutdown() gets the status BGWH_STOPPED from GetBackgroundWorkerPid() if the worker got restarted. So the master will then detach the shared memory, and the next restarted worker keeps failing, going into an unending loop.
Ugh, that seems like a silly oversight. Does it fix the reported problem?
If I understand correctly, the commit message would be something like this:
==
Don't auto-restart per-database autoprewarm workers.
We should try to prewarm each database only once. Otherwise, if
prewarming fails for some reason, it will just keep retrying in an
infinite loop. The existing code was intended to implement this
behavior, but because it neglected to set worker.bgw_restart_time, the
per-database workers keep restarting, contrary to what was intended.
Mithun Cy, per a report from Hans Buschmann
==
Does that sound right?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Mithun Cy <mithun(dot)cy(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Hans Buschmann <buschmann(at)nidsa(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-03-18 17:42:24 |
Message-ID: | CADq3xVYcKxPantnV+HoHXER7Dg-n5ipsHKs11ibq0ueVYSG-7Q@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
Thanks Robert,
On Mon, Mar 18, 2019 at 9:01 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Mar 18, 2019 at 3:04 AM Mithun Cy <mithun(dot)cy(at)gmail(dot)com> wrote:
> > The autoprewarm worker should not be restarted. As per the code in
> > @apw_start_database_worker@, the master starts a worker per database
> > and waits until it exits by calling WaitForBackgroundWorkerShutdown.
> > That call cannot handle the case where the worker was restarted:
> > WaitForBackgroundWorkerShutdown() gets the status BGWH_STOPPED from
> > GetBackgroundWorkerPid() if the worker got restarted. So the master
> > will then detach the shared memory, and the next restarted worker
> > keeps failing, going into an unending loop.
>
> Ugh, that seems like a silly oversight. Does it fix the reported problem?
>
-- Yes, this fixes the reported issue; Hans Buschmann has given the steps
below to reproduce it.
> This seems easy to reproduce:
>
> - Install/create a database with autoprewarm on and pg_prewarm loaded.
> - Fill the autoprewarm cache with some data
> - pg_dump the database
> - drop the database
> - create the database and pg_restore it from the dump
> - start the instance, and the logs are flooded
-- It is explained earlier [1] that they used an older autoprewarm.blocks,
generated before the database was dropped. So on restart the autoprewarm
worker failed to connect to the dropped database, which then led to the
retry loop. This patch should fix the same.
NOTE: Another kind of error users might see because of the same bug is
the restarted worker connecting to the next database in
autoprewarm.blocks, because the autoprewarm master has updated the shared
data "apw_state->database = current_db;" to start a new worker for that
next database. Both the restarted worker and the newly created worker
will connect to the same (next) database and try to load the same pages,
and hence end up with spurious log messages like "LOG: autoprewarm
successfully prewarmed 13 of 11 previously-loaded blocks".
> If I understand correctly, the commit message would be something like this:
>
> ==
> Don't auto-restart per-database autoprewarm workers.
>
> We should try to prewarm each database only once. Otherwise, if
> prewarming fails for some reason, it will just keep retrying in an
> infinite loop. The existing code was intended to implement this
> behavior, but because it neglected to set worker.bgw_restart_time, the
> per-database workers keep restarting, contrary to what was intended.
>
> Mithun Cy, per a report from Hans Buschmann
> ==
>
> Does that sound right?
>
-- Yes, I agree.
[1]
/message-id/D2B9F2A20670C84685EF7D183F2949E202569F21%40gigant.nidsa.net
--
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Mithun Cy <mithun(dot)cy(at)gmail(dot)com> |
Cc: | Hans Buschmann <buschmann(at)nidsa(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: BUG #15641: Autoprewarm worker fails to start on Windows with huge pages in use |
Date: | 2019-03-18 19:35:39 |
Message-ID: | CA+TgmoZ27jXJDZEiY8z23xZ=3F2zzZxfHqE=Syc=PEWvtYE4ig@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
On Mon, Mar 18, 2019 at 1:43 PM Mithun Cy <mithun(dot)cy(at)gmail(dot)com> wrote:
>> Does that sound right?
>
> -- Yes I Agree.
Committed with a little more tweaking of the commit message, and
back-patched to v11.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company