Lists: | pgsql-hackers |
---|
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | windows shared memory error |
Date: | 2009-04-30 23:59:43 |
Message-ID: | 49FA3B6F.6080906@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
I am seeing Postgres 8.3.7 running as a service on Windows Server 2003
repeatedly fail to restart after a backend crash because of the
following code in port/win32_shmem.c:
/*
* If the segment already existed, CreateFileMapping() will return a
* handle to the existing one.
*/
if (GetLastError() == ERROR_ALREADY_EXISTS)
{
/*
* When recycling a shared memory segment, it may take a short while
* before it gets dropped from the global namespace. So re-try after
* sleeping for a second.
*/
CloseHandle(hmap); /* Close the old handle, since we got a
valid
* one to the previous segment. */
Sleep(1000);
hmap = CreateFileMapping((HANDLE) 0xFFFFFFFF, NULL,
PAGE_READWRITE, 0L, (DWORD) size, szShareMem);
if (!hmap)
ereport(FATAL,
(errmsg("could not create shared memory segment:
%lu", GetLastError()),
errdetail("Failed system call was
CreateFileMapping(size=%lu, name=%s).",
(unsigned long) size, szShareMem)));
if (GetLastError() == ERROR_ALREADY_EXISTS)
ereport(FATAL,
(errmsg("pre-existing shared memory block is still in
use"),
errhint("Check if there are any old server processes
still running, and terminate them.")));
}
It strikes me that we really need to try reconnecting to the shared
memory here several times, and maybe the backoff need to increase each
time. On a loaded server this cause postgres to fail to restart fairly
reliably.
thoughts?
cheers
andrew
From: | Dave Page <dpage(at)pgadmin(dot)org> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 07:42:00 |
Message-ID: | 937d27e10905010042l67edb4e1ld7734ddcfc330931@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
> It strikes me that we really need to try reconnecting to the shared memory
> here several times, and maybe the backoff need to increase each time. On a
> loaded server this cause postgres to fail to restart fairly reliably.
At the risk of sounding predictable, +1. Maybe try 5 times, repeating
at 1, 2, 4 & 8 seconds? Any longer seems like it will be a genuine
failure (so does 8 seconds in fact, but I don't suppose it'll hurt to
try).
--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
From: | Greg Stark <stark(at)enterprisedb(dot)com> |
---|---|
To: | Dave Page <dpage(at)pgadmin(dot)org> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 10:05:53 |
Message-ID: | 4136ffa0905010305p43bad60cm12f42ef431a85b7a@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 1, 2009 at 8:42 AM, Dave Page <dpage(at)pgadmin(dot)org> wrote:
> On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>>
>> It strikes me that we really need to try reconnecting to the shared memory
>> here several times, and maybe the backoff need to increase each time. On a
>> loaded server this cause postgres to fail to restart fairly reliably.
>
> At the risk of sounding predictable, +1. Maybe try 5 times, repeating
> at 1, 2, 4 & 8 seconds? Any longer seems like it will be a genuine
> failure (so does 8 seconds in fact, but I don't suppose it'll hurt to
> try).
Do we have any idea why "it may take a short while before it gets
dropped from the global namespace"? Is there some demon running which
only wakes up periodically? Or any specific reason it takes so long?
That might give us a clue exactly how long to sleep for.
Is there any way to tell Windows to wake up and do its job? Or to
block until its done?
--
greg
From: | Dave Page <dpage(at)pgadmin(dot)org> |
---|---|
To: | Greg Stark <stark(at)enterprisedb(dot)com> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 10:23:18 |
Message-ID: | 937d27e10905010323y5e21eca2l8e0fc798bd22194b@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 1, 2009 at 11:05 AM, Greg Stark <stark(at)enterprisedb(dot)com> wrote:
> Do we have any idea why "it may take a short while before it gets
> dropped from the global namespace"? Is there some demon running which
> only wakes up periodically? Or any specific reason it takes so long?
> That might give us a clue exactly how long to sleep for.
I can't find any info in a quick search. I don't see offhand why it
would take time - it's just reference counting handles held by
different processes.
> Is there any way to tell Windows to wake up and do its job? Or to
> block until its done?
Not that I'm aware of.
--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
From: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
---|---|
To: | Dave Page <dpage(at)pgadmin(dot)org> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 15:10:11 |
Message-ID: | 49FB10D3.1040908@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Dave Page wrote:
> On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>> It strikes me that we really need to try reconnecting to the shared memory
>> here several times, and maybe the backoff need to increase each time. On a
>> loaded server this cause postgres to fail to restart fairly reliably.
>
> At the risk of sounding predictable, +1. Maybe try 5 times, repeating
> at 1, 2, 4 & 8 seconds? Any longer seems like it will be a genuine
> failure (so does 8 seconds in fact, but I don't suppose it'll hurt to
> try).
1+2+4+8 = 15 seconds
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From: | Dave Page <dpage(at)pgadmin(dot)org> |
---|---|
To: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 15:11:52 |
Message-ID: | 937d27e10905010811k4a58d4cvd75fd2d73bd1c6b4@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, May 1, 2009 at 4:10 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Dave Page wrote:
>>
>> On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan <andrew(at)dunslane(dot)net>
>> wrote:
>>>
>>> It strikes me that we really need to try reconnecting to the shared
>>> memory
>>> here several times, and maybe the backoff need to increase each time. On
>>> a
>>> loaded server this cause postgres to fail to restart fairly reliably.
>>
>> At the risk of sounding predictable, +1. Maybe try 5 times, repeating
>> at 1, 2, 4 & 8 seconds? Any longer seems like it will be a genuine
>> failure (so does 8 seconds in fact, but I don't suppose it'll hurt to
>> try).
>
> 1+2+4+8 = 15 seconds
Err, yes. What's your point?
--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
Cc: | Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 15:29:37 |
Message-ID: | 49FB1561.3060304@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Heikki Linnakangas wrote:
> Dave Page wrote:
>> On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan <andrew(at)dunslane(dot)net>
>> wrote:
>>> It strikes me that we really need to try reconnecting to the shared
>>> memory
>>> here several times, and maybe the backoff need to increase each
>>> time. On a
>>> loaded server this cause postgres to fail to restart fairly reliably.
>>
>> At the risk of sounding predictable, +1. Maybe try 5 times, repeating
>> at 1, 2, 4 & 8 seconds? Any longer seems like it will be a genuine
>> failure (so does 8 seconds in fact, but I don't suppose it'll hurt to
>> try).
>
> 1+2+4+8 = 15 seconds
>
15 seconds beats the hell out of not restarting at all.
cheers
andrew
From: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
---|---|
To: | Dave Page <dpage(at)pgadmin(dot)org> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 15:30:07 |
Message-ID: | 49FB157F.10105@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Dave Page wrote:
> On Fri, May 1, 2009 at 4:10 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Dave Page wrote:
>>> On Fri, May 1, 2009 at 12:59 AM, Andrew Dunstan <andrew(at)dunslane(dot)net>
>>> wrote:
>>>> It strikes me that we really need to try reconnecting to the shared
>>>> memory
>>>> here several times, and maybe the backoff need to increase each time. On
>>>> a
>>>> loaded server this cause postgres to fail to restart fairly reliably.
>>> At the risk of sounding predictable, +1. Maybe try 5 times, repeating
>>> at 1, 2, 4 & 8 seconds? Any longer seems like it will be a genuine
>>> failure (so does 8 seconds in fact, but I don't suppose it'll hurt to
>>> try).
>> 1+2+4+8 = 15 seconds
>
> Err, yes. What's your point?
If 8 seconds already seems like it's a genuine failure, then perhaps
retrying at 1, 2 and 4 seconds giving a total delay of 7 seconds is
enough. Maybe you meant that 15 s seems like a genuine failure? Well,
either way, never mind :-)
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 22:24:16 |
Message-ID: | 26468.1241216656@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> It strikes me that we really need to try reconnecting to the shared
> memory here several times, and maybe the backoff need to increase each
> time.
Adding a backoff would make the code significantly more complex, with
no gain that I can see. Just loop a few times around the
one-second-sleep-and-retry logic.
I concur with Greg's opinion that the need for a sleep here at all
is pretty fishy, but I doubt anyone really cares enough to find out
exactly what's happening (and it being Windows, there may be no better
solution anyway ...)
regards, tom lane
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 22:42:35 |
Message-ID: | 49FB7ADB.9040701@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> It strikes me that we really need to try reconnecting to the shared
>> memory here several times, and maybe the backoff need to increase each
>> time.
>>
>
> Adding a backoff would make the code significantly more complex, with
> no gain that I can see. Just loop a few times around the
> one-second-sleep-and-retry logic.
>
> I concur with Greg's opinion that the need for a sleep here at all
> is pretty fishy, but I doubt anyone really cares enough to find out
> exactly what's happening (and it being Windows, there may be no better
> solution anyway ...)
>
>
>
We've seen similar things with other Windows file operations, IIRC. What
bothers me is that the problem might be precisely because the 1 second
sleep between the CloseHandle() call and the CreateFileMapping() call
might not be enough due to system load, so repeating the cycle without
increasing the sleep period will just repeat the error.
cheers
andrew
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 22:46:32 |
Message-ID: | 26866.1241217992@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | Postg토토 커뮤니티SQL |
Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> We've seen similar things with other Windows file operations, IIRC. What
> bothers me is that the problem might be precisely because the 1 second
> sleep between the CloseHandle() call and the CreateFileMapping() call
> might not be enough due to system load, so repeating the cycle without
> increasing the sleep period will just repeat the error.
What system load? This is only called after all the backends are dead.
And surely one CreateFileMapping syscall per second does not materially
contribute to any load that is being caused by something else.
regards, tom lane
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-01 23:05:46 |
Message-ID: | 49FB804A.2020001@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> We've seen similar things with other Windows file operations, IIRC. What
>> bothers me is that the problem might be precisely because the 1 second
>> sleep between the CloseHandle() call and the CreateFileMapping() call
>> might not be enough due to system load, so repeating the cycle without
>> increasing the sleep period will just repeat the error.
>>
>
> What system load? This is only called after all the backends are dead.
> And surely one CreateFileMapping syscall per second does not materially
> contribute to any load that is being caused by something else.
>
>
>
I didn't say Postgres was creating the load. In the case where this has
been happening for my client, there is an Apache server which can chew
up the machine mightily. I don't have any evidence that just repeating
the cycle a few times won't work, but neither do you have any that it
will, and I don't think the extra code complexity will be terribly
great. If it were more than a few extra lines I'd probably agree with you.
cheers
abdrew
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-02 04:41:16 |
Message-ID: | 3590.1241239276@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> I am seeing Postgres 8.3.7 running as a service on Windows Server 2003
> repeatedly fail to restart after a backend crash because of the
> following code in port/win32_shmem.c:
On further review, I see an entirely different explanation for possible
failures of that code.
It says here:
http://msdn.microsoft.com/en-us/library/ms885627.aspx
that GetLastError() continues to return the same error code until
someone calls SetLastError() to change it. It further says that
only a few operating system functions call SetLastError(0) on success,
and that it is explicitly documented whenever a function does so.
I see no such statement for CreateFileMapping:
http://msdn.microsoft.com/en-us/library/aa366537(VS.85).aspx
This leads me to conclude that after a successful creation,
GetLastError will return whatever the errno previously was,
meaning that you cannot reliably distinguish creation from non
creation unless you do SetLastError(0) beforehand. Which we don't.
Now this would only explain problems if there were some code path
through the postmaster that could leave the errno set to
ERROR_ALREADY_EXISTS (a/k/a EEXIST) when this code is reached. I'm not
sure there is one, and I have even less of a theory as to why system
load might make it more probable to happen. Still, this looks like a
bug from here, and repeating the create call won't fix it.
regards, tom lane
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-02 14:57:00 |
Message-ID: | 49FC5F3C.7000203@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
>
> Now this would only explain problems if there were some code path
> through the postmaster that could leave the errno set to
> ERROR_ALREADY_EXISTS (a/k/a EEXIST) when this code is reached. I'm not
> sure there is one, and I have even less of a theory as to why system
> load might make it more probable to happen. Still, this looks like a
> bug from here, and repeating the create call won't fix it.
>
>
>
Oh, I think that this code has such a path. We already know that the
code I showed is entered when that error is set. So the solution would
be to put SetError(0) before the call to CreateFileMapping(), possibly
before both such calls.
Maybe we need to look at all the places we call GetLastError(). There
are quite a few of them.
Good catch, BTW.
cheers
andrew
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-02 15:14:10 |
Message-ID: | 15265.1241277250@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Maybe we need to look at all the places we call GetLastError(). There
> are quite a few of them.
It would only be an issue with syscalls that have badly designed APIs
like this one. Most of the time you know that the function has failed
and is supposed to have set the error code.
What I'm wondering about right now is whether the sleep-and-retry
logic is needed at all, if we get the error code handling straight.
A look in the archives says that Magnus added it between these two
versions of his draft patch:
http://archives.postgresql.org//pgsql-patches/2007-03/msg00250.php
http://archives.postgresql.org//pgsql-patches/2007-03/msg00263.php
without any indication of why, except that I had complained that
some error checks seemed to be missing in the original. I wonder
if it was a misguided attempt to fix a stale-error-code problem.
regards, tom lane
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-03 09:28:45 |
Message-ID: | 49FD63CD.1030805@hagander.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> I am seeing Postgres 8.3.7 running as a service on Windows Server 2003
>> repeatedly fail to restart after a backend crash because of the
>> following code in port/win32_shmem.c:
>
> On further review, I see an entirely different explanation for possible
> failures of that code.
>
> It says here:
> http://msdn.microsoft.com/en-us/library/ms885627.aspx
FWIW, this is the Windows CE documentation. The one for win32 is at:
http://msdn.microsoft.com/en-us/library/ms679360(VS.85).aspx
> that GetLastError() continues to return the same error code until
> someone calls SetLastError() to change it. It further says that
> only a few operating system functions call SetLastError(0) on success,
> and that it is explicitly documented whenever a function does so.
> I see no such statement for CreateFileMapping:
> http://msdn.microsoft.com/en-us/library/aa366537(VS.85).aspx
>
> This leads me to conclude that after a successful creation,
> GetLastError will return whatever the errno previously was,
> meaning that you cannot reliably distinguish creation from non
> creation unless you do SetLastError(0) beforehand. Which we don't.
>
> Now this would only explain problems if there were some code path
> through the postmaster that could leave the errno set to
> ERROR_ALREADY_EXISTS (a/k/a EEXIST) when this code is reached. I'm not
> sure there is one, and I have even less of a theory as to why system
> load might make it more probable to happen. Still, this looks like a
> bug from here, and repeating the create call won't fix it.
The ref page for CreateFileMapping you linked has:
"If the object exists before the function call, the function returns a
handle to the existing object (with its current size, not the specified
size), and GetLastError returns ERROR_ALREADY_EXISTS. "
I think that qualifies as it documenting that it's setting the return
value, no? That would never work if it isn't set to something other than
ERROR_ALREADY_EXISTS (probably zero) when it *didn't* already exist.
The quick try would be to stick a SetLastError(0) in there, just to be
sure... Could be worth a try?
Andrew, just to confirm: you've found a case where this happens
*repeatably*? That's what we've failed to do before - it's happened now
and then, but never during testing...
//Magnus
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-03 09:40:27 |
Message-ID: | 49FD668B.7080209@hagander.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Andrew Dunstan wrote:
>
>
> Tom Lane wrote:
>>
>> Now this would only explain problems if there were some code path
>> through the postmaster that could leave the errno set to
>> ERROR_ALREADY_EXISTS (a/k/a EEXIST) when this code is reached. I'm not
>> sure there is one, and I have even less of a theory as to why system
>> load might make it more probable to happen. Still, this looks like a
>> bug from here, and repeating the create call won't fix it.
>>
>>
>>
>
> Oh, I think that this code has such a path. We already know that the
> code I showed is entered when that error is set. So the solution would
> be to put SetError(0) before the call to CreateFileMapping(), possibly
> before both such calls.
>
> Maybe we need to look at all the places we call GetLastError(). There
> are quite a few of them.
A quick look shows that all of these except the one in
pgwin32_get_dynamic_tokeninfo() (which uses a documented way to check
the return code in the case of success) are only called after an API
function fails, so we should be safe there.
//Magnus
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-03 12:11:59 |
Message-ID: | 49FD8A0F.6020304@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Magnus Hagander wrote:
>
> Andrew, just to confirm: you've found a case where this happens
> *repeatably*? That's what we've failed to do before - it's happened now
> and then, but never during testing...
>
>
>
Well, it happened several times to my client within a matter of hours. I
didn't see any successful restarts on the log.
Unfortunately, I can't use this system for experimentation - it's doing
extremely urgent production work.
cheers
andrew
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-03 14:34:28 |
Message-ID: | 11254.1241361268@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Magnus Hagander <magnus(at)hagander(dot)net> writes:
> Tom Lane wrote:
>> It says here:
>> http://msdn.microsoft.com/en-us/library/ms885627.aspx
> FWIW, this is the Windows CE documentation. The one for win32 is at:
> http://msdn.microsoft.com/en-us/library/ms679360(VS.85).aspx
Sorry, that was the one that came up first in a Google search ...
> The ref page for CreateFileMapping you linked has:
> "If the object exists before the function call, the function returns a
> handle to the existing object (with its current size, not the specified
> size), and GetLastError returns ERROR_ALREADY_EXISTS. "
> I think that qualifies as it documenting that it's setting the return
> value, no?
The question is what it does when creating a new object. To be sure
that our existing code isn't misled, it'd be necessary for
CreateFileMapping to do SetLastError(0) in the successful-creation
code path. What I read the GetLastError page to be saying is that
most functions do *not* do SetLastError(0) on success, and that it
is always documented if they do.
> The quick try would be to stick a SetLastError(0) in there, just to be
> sure... Could be worth a try?
I kinda think we should do that whether or not it can be proven to
have anything to do with Andrew's report. It's just like "errno = 0"
for Unix --- sometimes you have to do it to be sure of whether a
particular function has thrown an error.
regards, tom lane
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 01:41:01 |
Message-ID: | 49FE47AD.6080005@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
>> The quick try would be to stick a SetLastError(0) in there, just to be
>> sure... Could be worth a try?
>>
>
> I kinda think we should do that whether or not it can be proven to
> have anything to do with Andrew's report. It's just like "errno = 0"
> for Unix --- sometimes you have to do it to be sure of whether a
> particular function has thrown an error.
>
>
>
I suspect it has little or nothing to do with it in fact. On my (very
lightly loaded) Vista box a crash with exit code 9 seems to result in a
consistently problem free restart. I did 200 iterations of the test.
Now presumably we sleep for 1 sec between the CloseHandle() call and the
CreateFileMapping() call in that code for a reason. We have seen in
other cases that Windows can take some time after a call has returned
for some operations to actually complete, and I assume we have a similar
case here. So, my question is: how do we know that 1 second is enough?
Was that a wild guess?
I confess I don't have much confidence that just repeating it a few
times without increasing the sleep interval will necessarily solve the
problem.
cheers
andrew
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 01:47:46 |
Message-ID: | 7374.1241401666@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Now presumably we sleep for 1 sec between the CloseHandle() call and the
> CreateFileMapping() call in that code for a reason.
I'm not sure. Magnus never did answer my question about why the sleep
and retry was put in at all; it seems not unlikely from here that it
was mere speculation.
regards, tom lane
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 08:13:18 |
Message-ID: | 49FEA39E.70801@hagander.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> Now presumably we sleep for 1 sec between the CloseHandle() call and the
>> CreateFileMapping() call in that code for a reason.
>
> I'm not sure. Magnus never did answer my question about why the sleep
> and retry was put in at all; it seems not unlikely from here that it
> was mere speculation.
It was necessary at the time.
The actual 1 second value was completely random - it fixed all the
issues on my test VM at the time. I don't recall exactly the details,
but I do recall having to run a lot of tests before I managed to provoke
an error, and that with the 1 sec thing i could run it for a day of
repeated restarts without any errors.
//Magnus
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 08:36:45 |
Message-ID: | 49FEA91D.40408@hagander.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Magnus Hagander <magnus(at)hagander(dot)net> writes:
>> Tom Lane wrote:
>>> It says here:
>>> http://msdn.microsoft.com/en-us/library/ms885627.aspx
>
>> FWIW, this is the Windows CE documentation. The one for win32 is at:
>> http://msdn.microsoft.com/en-us/library/ms679360(VS.85).aspx
>
> Sorry, that was the one that came up first in a Google search ...
Yeah, it's annoying, but that often happens. One has to be careful to
check though - the wince stuff is usually just a subset of the full
win32, but sometimes there can be subtle differences. So I recommend
always making sure you look at the win32 docs, not wince.
>> The quick try would be to stick a SetLastError(0) in there, just to be
>> sure... Could be worth a try?
>
> I kinda think we should do that whether or not it can be proven to
> have anything to do with Andrew's report. It's just like "errno = 0"
> for Unix --- sometimes you have to do it to be sure of whether a
> particular function has thrown an error.
Right.
Ok, I've applied a patch that does this. Given that it's certainly not
in a performance critical path, the overhead shouldn't be noticeable,
and it's certainly not wrong to do it :)
//Magnus
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 12:37:53 |
Message-ID: | 49FEE1A1.4070903@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Magnus Hagander wrote:
> Tom Lane wrote:
>
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>
>>> Now presumably we sleep for 1 sec between the CloseHandle() call and the
>>> CreateFileMapping() call in that code for a reason.
>>>
>> I'm not sure. Magnus never did answer my question about why the sleep
>> and retry was put in at all; it seems not unlikely from here that it
>> was mere speculation.
>>
>
> It was necessary at the time.
>
> The actual 1 second value was completely random - it fixed all the
> issues on my test VM at the time. I don't recall exactly the details,
> but I do recall having to run a lot of tests before I managed to provoke
> an error, and that with the 1 sec thing i could run it for a day of
> repeated restarts without any errors.
>
>
>
Well, my untested hypothesis is that the actual time required is
variable, depending on environmental factors such as machine load. So
testing repeatedly where such factors are constant might not be good
enough. That's why I suggested some sort of increasing backoff, in an
attempt to be adaptive.
cheers
andrew
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 12:45:34 |
Message-ID: | 27417.1241441134@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Magnus Hagander wrote:
>> The actual 1 second value was completely random - it fixed all the
>> issues on my test VM at the time. I don't recall exactly the details,
>> but I do recall having to run a lot of tests before I managed to provoke
>> an error, and that with the 1 sec thing i could run it for a day of
>> repeated restarts without any errors.
> Well, my untested hypothesis is that the actual time required is
> variable, depending on environmental factors such as machine load.
Seems reasonable.
> So testing repeatedly where such factors are constant might not be good
> enough. That's why I suggested some sort of increasing backoff, in an
> attempt to be adaptive.
I still think there's absolutely no evidence suggesting that a variable
backoff is necessary. Given how little this code is going to be
exercised in the real world, how long will it take till we find out
if you get it wrong? Use a simple retry loop and be done with it.
regards, tom lane
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 14:14:10 |
Message-ID: | 49FEF832.30802@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
> I still think there's absolutely no evidence suggesting that a variable
> backoff is necessary. Given how little this code is going to be
> exercised in the real world, how long will it take till we find out
> if you get it wrong? Use a simple retry loop and be done with it.
>
>
>
Why should a 1 second delay between CloseHandle() and
CreateFileMapping() be enough now when it was not enough 1 second ago?
If the event we needed an offset from were outside the loop I'd agree
with you.
cheers
andrew
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 14:14:59 |
Message-ID: | 49FEF863.5040407@hagander.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> Magnus Hagander wrote:
>>> The actual 1 second value was completely random - it fixed all the
>>> issues on my test VM at the time. I don't recall exactly the details,
>>> but I do recall having to run a lot of tests before I managed to provoke
>>> an error, and that with the 1 sec thing i could run it for a day of
>>> repeated restarts without any errors.
>
>> Well, my untested hypothesis is that the actual time required is
>> variable, depending on environmental factors such as machine load.
>
> Seems reasonable.
>
>> So testing repeatedly where such factors are constant might not be good
>> enough. That's why I suggested some sort of increasing backoff, in an
>> attempt to be adaptive.
>
> I still think there's absolutely no evidence suggesting that a variable
> backoff is necessary. Given how little this code is going to be
> exercised in the real world, how long will it take till we find out
> if you get it wrong? Use a simple retry loop and be done with it.
+1. Let's keep it as simple as possible for now. I doubt it's actually
dependent on the *failed* call.
Andrew, you want to write up a patch or do you want me to do it?
//Magnus
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 14:18:39 |
Message-ID: | 49FEF93F.2000807@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Magnus Hagander wrote:
>
> Andrew, you want to write up a patch or do you want me to do it?
>
>
>
Go for it.
cheers
andrew
From: | Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 16:07:59 |
Message-ID: | 20090504160758.GC4476@alvh.no-ip.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Magnus Hagander wrote:
>
> Andrew, you want to write up a patch or do you want me to do it?
This is going to be backpatched, I assume?
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 16:40:59 |
Message-ID: | 2057.1241455259@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> This is going to be backpatched, I assume?
Yeah, back to 8.2 I suppose.
regards, tom lane
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 16:42:20 |
Message-ID: | 2098.1241455340@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Magnus Hagander <magnus(at)hagander(dot)net> writes:
> Tom Lane wrote:
>> I still think there's absolutely no evidence suggesting that a variable
>> backoff is necessary. Given how little this code is going to be
>> exercised in the real world, how long will it take till we find out
>> if you get it wrong? Use a simple retry loop and be done with it.
> +1. Let's keep it as simple as possible for now. I doubt it's actually
> dependent on the *failed* call.
Exactly. Presumably we're waiting for some system bookkeeping to
finish. Maybe it will take more than 1 second, but we're not going
to be slowing it noticeably by trying once a second.
regards, tom lane
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 20:29:48 |
Message-ID: | 49FF503C.8040702@hagander.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Andrew Dunstan wrote:
>
>
> Magnus Hagander wrote:
>>
>> Andrew, you want to write up a patch or do you want me to do it?
>>
>>
>>
>
> Go for it.
How does this look?
Passes my tests, but I can't really reproduce the requirement to retry,
so I haven't been able to test that part :(
//Magnus
Attachment | Content-Type | Size |
---|---|---|
win32shmem.patch | text/x-diff | 4.0 KB |
From: | Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 20:32:58 |
Message-ID: | 20090504203258.GI4476@alvh.no-ip.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Magnus Hagander wrote:
> How does this look?
>
> Passes my tests, but I can't really reproduce the requirement to retry,
> so I haven't been able to test that part :(
I'm disappointed :-( I thought this thread (without reading it too
deeply) was about fixing the problem that backends sometimes fail to
connect to shmem, on a system that's been running for a while. But I
see that it's about recycling the segment after a crash (and/or
restart?), so it has no relationship to the other case at all :-(
I can't comment on the patch itself.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 23:56:22 |
Message-ID: | 20942.1241481382@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> I'm disappointed :-( I thought this thread (without reading it too
> deeply) was about fixing the problem that backends sometimes fail to
> connect to shmem, on a system that's been running for a while.
Nobody knows yet what's wrong there or how to fix it. This thread
is about Andrew's complaint here:
http://archives.postgresql.org//pgsql-hackers/2009-05/msg00003.php
which seems relatively straightforward to fix, or at least reduce
the probability of trouble to the vanishing point.
regards, tom lane
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-04 23:57:59 |
Message-ID: | 20982.1241481479@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Magnus Hagander <magnus(at)hagander(dot)net> writes:
> Passes my tests, but I can't really reproduce the requirement to retry,
> so I haven't been able to test that part :(
The patch looks sane to me. If you want to test, perhaps reducing the
sleep to 1 msec or so would reproduce the need to go around the loop
more than once. (Don't forget to put the machine under additional
load, too.)
regards, tom lane
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: windows shared memory error |
Date: | 2009-05-05 09:50:28 |
Message-ID: | 4A000BE4.6080702@hagander.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane wrote:
> Magnus Hagander <magnus(at)hagander(dot)net> writes:
>> Passes my tests, but I can't really reproduce the requirement to retry,
>> so I haven't been able to test that part :(
>
> The patch looks sane to me. If you want to test, perhaps reducing the
> sleep to 1 msec or so would reproduce the need to go around the loop
> more than once. (Don't forget to put the machine under additional
> load, too.)
I've applied this to HEAD and 8.3 so we can get some buildfarm testing
on it as well.
Andrew, any chance you can get 8.3-tip tested with your client? Or at
least in your own reproducable-environment?
I didn't backpatch to 8.2, because the code is completely different
there. We should probably consider doing it once we know if this fixes
the actual issue, but I don't want to spend the effort on backporting it
until we know it works.
//Magnus