Lists: | pgsql-ports |
---|
From: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-03-29 21:55:17 |
Message-ID: | 20010329165517.G209@dothill.com |
Tom,
On Thu, Mar 29, 2001 at 01:00:44PM -0500, Tom Lane wrote:
> Not sure why this guy only responded to me and not the list, but here's
> a lead you might want to follow up ...
>
> On Thu, 29 Mar 2001 10:49:16 -0700, Scott Ribe wrote:
> > On Thu, Mar 29, 2001, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >Oh-ho, that's interesting! If you look at fe-connect.c you'll see that
> > >CONNECTION_STARTED must indicate that connect() returned EINPROGRESS
> > >rather than a success indication. The socket is supposed to go
> > >write-ready when the connection is finished...
> >
> > Uhm, generally speaking I am not qualified to participate in this
> > discussion...
> >
> > BUT I am pretty sure that some time past while searching for some other
> > network-related info on the MS web site I came across a document
> > describing bugs (or unique MS "features") in non-blocking IO and
> > particularly discussed the EINPROGRESS return value.
> >
> > I don't know what I'm talking about, I could be wrong, but I think you
> > should search on the MS web site for nonblocking IO and EINPROGRESS and you
> > might find the exact info that you need to discuss with the Cygwin folks.
I quickly searched the MSDN and could not find anything explicitly
mentioning problems with non-blocking I/O and EINPROGRESS. Nevertheless,
in src/interfaces/libpq/fe-connect.c, I found the following comment:
/* ----------
* Since I have no idea whether this is a valid thing to do under Windows
* before a connection is made, and since I have no way of testing it, I
* leave the code looking as below. When someone decides that they want
* non-blocking connections under Windows, they can define
* WIN32_NON_BLOCKING_CONNECTIONS before compilation. If it works, then
* this code can be cleaned up.
Cygwin is essentially Windows in this regard since Cygwin uses Windows
sockets to implement Posix sockets. My WAG is that if EINPROGRESS is
returned during a connect attempt then the regression test hangs;
otherwise, the regression test runs to completion.
So, I applied the attached patch so that non-blocking I/O is not enabled
until after the connection has been established (just like with Win32
and SSL). I have the regression test running in a forever loop. So far
it has succeeded 10 times without a hang. On this machine, I have never
been able to get more than three in a row to succeed before.
I am going to run the regression tests all night. I will report back
tomorrow to let the list know whether or not I got any hangs.
Would the PostgreSQL team be willing to accept this patch? At least,
until I determine whether or not I can get Cygwin "fixed?" I will post
to the Cygwin list tomorrow (when/if they are back up).
BTW, Cygwin did not support non-blocking (socket) I/O until 1.1.5
which is in the November 2000 time frame. So another WAG is that this
problem started to occur then, but I don't really remember that well.
Thanks,
Jason
--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com
Attachment | Content-Type | Size |
---|---|---|
fe-connect.c.patch | text/plain | 895 bytes |
From: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-03-30 14:25:47 |
Message-ID: | 20010330092547.K209@dothill.com |
Lists: | pgsql-ports |
Tom,
On Thu, Mar 29, 2001 at 04:55:17PM -0500, Jason Tishler wrote:
> Cygwin is essentially Windows in this regard since Cygwin uses Windows
> sockets to implement Posix sockets. My WAG is that if EINPROGRESS is
> returned during a connect attempt then the regression test hangs;
> otherwise, the regression test runs to completion.
>
> So, I applied the attached patch so that non-blocking I/O is not enabled
> until after the connection has been established (just like with Win32
> and SSL). I have the regression test running in a forever loop. So far
> it has succeeded 10 times without a hang. On this machine, I have never
> been able to get more than three in a row to succeed before.
>
> I am going to run the regression tests all night. I will report back
> tomorrow to let the list know whether or not I got any hangs.
The regression test forever loop ran all night without a hang -- 150+
successes in a row. So, I feel that it is safe to say that Cygwin (or
Windows Sockets) has problems with nonblocking connects.
> Would the PostgreSQL team be willing to accept this patch?
Any feedback on my patch (reattached for convenience)? I would hate to
see 7.1 go out the door with this issue.
BTW, why is libpq's connection policy currently nonblocking for all
platforms except (straight) Win32? Do people try to connect to multiple
postmasters concurrently? If not, then what is the benefit over a
blocking connect?
> At least,
> until I determine whether or not I can get Cygwin "fixed?" I will post
> to the Cygwin list tomorrow (when/if they are back up).
I will post to the Cygwin list regarding this problem. Just to make sure
that I have my story straight: psql is hanging while trying a nonblocking
connect to postmaster (not one of the backends). Correct?
If anyone has any nonblocking socket client code with a corresponding
server lying around please let me know. I would like to post this to
the Cygwin list to facilitate their debugging.
Thanks,
Jason
--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com
Attachment | Content-Type | Size |
---|---|---|
fe-connect.c.patch | text/plain | 895 bytes |
From: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-03-30 22:20:26 |
Message-ID: | 20010330172026.L209@dothill.com |
Lists: | pgsql-ports |
Tom,
On Fri, Mar 30, 2001 at 09:25:47AM -0500, Jason Tishler wrote:
> Any feedback on my patch (reattached for convenience)? I would hate to
> see 7.1 go out the door with this issue.
I believe that I have finally found the root cause of the psql hangs.
IMO, Cygwin is functioning properly and the issue lies in the libpq's
pqWait() use of select().
The MSDN states the following for select():
..
Summary:
A socket will be identified in a particular set when select returns if:
..
exceptfds:
If processing a connect call (nonblocking), connection attempt failed.
..
In libpq's pqWait(), we have the following:
if (select(conn->sock + 1, &input_mask, &output_mask, (fd_set *) NULL,
(struct timeval *) NULL) < 0)
After reading the above code, I hypothesized that select() was hanging
because the exceptfds was NULL.
Sure enough, if I apply the attached (nasty, hacky) patch, then the
regression test does *not* hang anymore -- even with nonblocking connects.
Although some tests will fail due to a connection refused condition --
which is not unreasonable since postmaster is very busy.
IMO, pqWait() should be enhanced to check the exceptfds too -- at least
for Cygwin. If it is not too late in the release cycle to consider such
a change, then someone with much more intimate knowledge of libpq should
only use my patch as a starting point and then do the right thing.
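To make the proposed enhancement concrete, here is a minimal sketch of a wait routine that passes an exception set to select() in addition to the read and write sets. It is an assumption-laden illustration, not the attached fe-misc.c patch and not the real pqWait(): the name wait_socket(), its parameters, and its return convention are all hypothetical.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical stand-in for pqWait() (not the actual libpq code).
 * Returns -1 on select() failure, 0 on timeout, 1 if the socket is
 * ready for the requested read/write, 2 if an exceptional condition
 * was reported.  A null timeout means wait indefinitely.
 */
int
wait_socket(int sock, int for_read, int for_write, struct timeval *timeout)
{
    fd_set input_mask;
    fd_set output_mask;
    fd_set except_mask;
    int rc;

    FD_ZERO(&input_mask);
    FD_ZERO(&output_mask);
    FD_ZERO(&except_mask);
    if (for_read)
        FD_SET(sock, &input_mask);
    if (for_write)
        FD_SET(sock, &output_mask);
    /* Always watch for exceptions, whichever direction we wait on. */
    FD_SET(sock, &except_mask);

    rc = select(sock + 1, &input_mask, &output_mask, &except_mask, timeout);
    if (rc < 0)
        return -1;              /* caller inspects errno (e.g., EINTR) */
    if (rc == 0)
        return 0;               /* timed out */
    if (FD_ISSET(sock, &except_mask))
        return 2;               /* e.g., a failed non-blocking connect */
    return 1;
}
```

With a NULL exception set, as in the original call, a platform that reports a failed non-blocking connect only through exceptfds gives select() nothing to wake up on, which is consistent with the hang described above.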
If the above enhancement is deemed too risky, then I implore the
PostgreSQL team to accept my previous patch that just makes connects
blocking for Cygwin. Note with this patch applied, I did see some
regression test failures due to a connection refused condition -- for
the same reasons as above.
Thanks,
Jason
--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com
Attachment | Content-Type | Size |
---|---|---|
fe-misc.c.patch | text/plain | 1.1 KB |
From: | "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp> |
---|---|
To: | <Jason(dot)Tishler(at)dothill(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | <pgsql-ports(at)postgresql(dot)org> |
Subject: | RE: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-03-31 01:15:08 |
Message-ID: | EKEJJICOHDIEMGPNIFIJOEBIEAAA.Inoue@tpf.co.jp |
Lists: | pgsql-ports |
> -----Original Message-----
> From: Jason Tishler
>
> Tom,
>
> On Fri, Mar 30, 2001 at 09:25:47AM -0500, Jason Tishler wrote:
> > Any feedback on my patch (reattached for convenience)? I would hate to
> > see 7.1 go out the door with this issue.
>
> I believe that I have finally found the root cause of the psql hangs.
> IMO, Cygwin is functioning properly and the issue lies in the libpq's
> pqWait() use of select().
>
> The MSDN states the following for select():
>
> ..
> Summary:
> A socket will be identified in a particular set when select
> returns if:
>
> ..
>
> exceptfds:
> If processing a connect call (nonblocking), connection
> attempt failed.
> ..
>
Oh I found the same description yesterday though I've had no time
to test it. If your patch resolves *hang*, it may be the right solution
at least for cygwin port.
BTW I've never passed the parallel regression test without a hang or
a refusal (with your previous patch applied) under my Cygwin environment.
I added one more connect() call after the refusal and passed
all regression tests successfully. Hmm, it may be a preferable
solution.
regards,
Hiroshi Inoue
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp> |
Cc: | Jason(dot)Tishler(at)dothill(dot)com, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-03-31 22:45:45 |
Message-ID: | 24157.986078745@sss.pgh.pa.us |
Lists: | pgsql-ports |
"Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp> writes:
> Oh I found the same description yesterday though I've had no time
> to test it. If your patch resolves *hang*, it may be the right solution
> at least for cygwin port.
It seems clear that it's a good idea for fe-misc.c to check the
exceptfds bit as well as read/write ready --- I'm surprised we have not
seen problems associated with that on other platforms. I think it
should check exceptfds all the time, regardless of whether we are
waiting to read or to write.
I'm inclined to also accept Jason's change to do the connect() in
blocking mode on Cygwin. If we do both of those things, have we
resolved the issue on Cygwin, or is there still a problem?
regards, tom lane
From: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-01 03:07:22 |
Message-ID: | 20010331220722.A2591@dothill.com |
Lists: | pgsql-ports |
Tom,
On Sat, Mar 31, 2001 at 05:45:45PM -0500, Tom Lane wrote:
> "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp> writes:
> > Oh I found the same description yesterday though I've had no time
> > to test it. If your patch resolves *hang*, it may be the right solution
> > at least for cygwin port.
>
> It seems clear that it's a good idea for fe-misc.c to check the
> exceptfds bit as well as read/write ready --- I'm surprised we have not
> seen problems associated with that on other platforms. I think it
> should check exceptfds all the time, regardless of whether we are
> waiting to read or to write.
I'm glad that you agree. Please post to the list when the change is in
CVS and I will test that this solves the Cygwin regression test (i.e.,
psql) hangs.
BTW, this will also solve the problem of Cygwin psql hanging when no
postmaster is running which I stumbled across when enabling Unix domain
socket support. Previously, I thought that it was a Cygwin problem but
now I know that it is caused by the same pqWait() problem.
> I'm inclined to also accept Jason's change to do the connect() in
> blocking mode on Cygwin.
Actually, the blocking connect() change for Cygwin is obviated by the
pqWait() fix. So, I no longer recommend making the blocking
connect() change for Cygwin, unless you do so for other Unixes too.
> If we do both of those things, have we
> resolved the issue on Cygwin, or is there still a problem?
If you do both of these changes, then the pqWait() fix will never be
triggered under Cygwin. When I tested my hacky patch to pqWait(), I had
to back out my blocking connect() patch in order for the pqWait() changes
to take effect. The regression test still did not hang -- although, I
continued to have spurious failures due to connection refused conditions.
On Sat, Mar 31, 2001 at 10:15:08AM +0900, Hiroshi Inoue wrote:
> BTW I've never passed the parallel regression test without a hang or
> a refusal (with your previous patch applied) under my Cygwin environment.
> I added one more connect() call after the refusal and passed
> all regression tests successfully. Hmm, it may be a preferable
> solution.
I'm wondering whether it makes sense to add a simple connection retry
policy as suggested above by Hiroshi? Otherwise, make check will
generate false negatives due to connection refused conditions.
If it is considered too late in the release cycle for such a change,
then I offer the following suggestions:
1. Change make check to use the serial_schedule or at least allow it to
be easily selected via a make variable (e.g., make schedule=serial_schedule
check).
2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
to a number that will "ensure" that the parallel_schedule version of the
regression test does not generate connection refused conditions. Note
that I'm not even sure this will really work on all (or any) platforms.
Thanks,
Jason
--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com
From: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-01 03:20:56 |
Message-ID: | 20010331222056.B2591@dothill.com |
Lists: | pgsql-ports |
Tom,
On Sat, Mar 31, 2001 at 10:07:22PM -0500, Jason Tishler wrote:
> BTW, this will also solve the problem of Cygwin psql hanging when no
> postmaster is running which I stumbled across when enabling Unix domain
> socket support. Previously, I thought that it was a Cygwin problem but
> now I know that it is caused by the same pqWait() problem.
Oops, I meant an unconnected socket file (e.g., /tmp/.s.PGSQL.5432) above
-- not no postmaster is running.
That's the problem with taking notes (which I rarely do but did in this
case) -- you actually have to review the notes for them to be useful... :,)
Jason
--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com
From: | "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp> |
---|---|
To: | "Jason Tishler" <Jason(dot)Tishler(at)dothill(dot)com> |
Cc: | <pgsql-ports(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | RE: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-01 14:45:04 |
Message-ID: | EKEJJICOHDIEMGPNIFIJIECLEAAA.Inoue@tpf.co.jp |
Lists: | pgsql-ports |
> -----Original Message-----
> From: Jason Tishler [mailto:Jason(dot)Tishler(at)dothill(dot)com]
>
> 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
> to a number that will "ensure" that the parallel_schedule version of the
> regression test does not generate connection refused conditions. Note
> that I'm not even sure this will really work on all (or any) platforms.
>
Hmm, I changed the backlog parameter on trial but I wasn't able
to see any improvements.
regards,
Hiroshi Inoue
From: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
---|---|
To: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp> |
Cc: | pgsql-ports(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-01 15:04:46 |
Message-ID: | 20010401110446.K2591@dothill.com |
Lists: | pgsql-ports |
Hiroshi,
On Sun, Apr 01, 2001 at 11:45:04PM +0900, Hiroshi Inoue wrote:
> > From: Jason Tishler [mailto:Jason(dot)Tishler(at)dothill(dot)com]
> >
> > 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
> > to a number that will "ensure" that the parallel_schedule version of the
> > regression test does not generate connection refused conditions. Note
> > that I'm not even sure this will really work on all (or any) platforms.
> >
>
> Hmm, I changed the backlog parameter on trial but I wasn't able
> to see any improvements.
That is what I kind of expected. Even if it worked, it would not have
been a foolproof solution anyway.
Thanks,
Jason
--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
Cc: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-01 17:57:35 |
Message-ID: | 180.986147855@sss.pgh.pa.us |
Lists: | pgsql-ports |
Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
>> It seems clear that it's a good idea for fe-misc.c to check the
>> exceptfds bit as well as read/write ready --- I'm surprised we have not
>> seen problems associated with that on other platforms. I think it
>> should check exceptfds all the time, regardless of whether we are
>> waiting to read or to write.
> I'm glad that you agree. Please post to the list when the change is in
> CVS and I will test that this solves the Cygwin regression test (i.e.,
> psql) hangs.
Done as of yesterday; should be in this morning's snapshot.
> Actually, the blocking connect() change for Cygwin is obviated by the
> pqWait() fix. So, I no longer recommend making the blocking
> connect() change for Cygwin, unless you do so for other Unixes too.
I made both changes in the hope that the blocking connect change would
suppress your problem with connection-refused failures. If it does not,
then we may as well reverse out the fe-connect.c change. Let me know.
>> If we do both of those things, have we
>> resolved the issue on Cygwin, or is there still a problem?
> I'm wondering whether it makes sense to add a simple connection retry
> policy as suggested above by Hiroshi?
I do not think it is appropriate for libpq to do that. For one thing,
where would you stop --- why exactly two tries?
> 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
> to a number that will "ensure" that the parallel_schedule version of the
> regression test does not generate connection refused conditions. Note
> that I'm not even sure this will really work on all (or any) platforms.
We already use SOMAXCONN which is supposed to be defined by the system
as the maximum allowed queue depth. If Cygwin fails to define it, or
defines it as something less than it should be, then we might consider
installing a Cygwin-specific hack to redefine SOMAXCONN. However
Hiroshi says later that he already tried this. I'm inclined to think
that Cygwin simply has a problem with servicing concurrent connection
requests, perhaps even before the alleged SOMAXCONN value is reached.
regards, tom lane
From: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-02 17:19:17 |
Message-ID: | 20010402131917.C798@dothill.com |
Lists: | pgsql-ports |
Tom,
On Sun, Apr 01, 2001 at 01:57:35PM -0400, Tom Lane wrote:
> Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> > I'm glad that you agree. Please post to the list when the change is in
> > CVS and I will test that this solves the Cygwin regression test (i.e.,
> > psql) hangs.
>
> Done as of yesterday; should be in this morning's snapshot.
Thanks.
> > Actually, the blocking connect() change for Cygwin is obviated by the
> > pqWait() fix. So, I no longer recommend making the blocking
> > connect() change for Cygwin, unless you do so for other Unixes too.
>
> I made both changes in the hope that the blocking connect change would
> suppress your problem with connection-refused failures. If it does not,
> then we may as well reverse out the fe-connect.c change. Let me know.
With both changes or only the fe-connect.c one, psql does not hang and
displays the following error message when the connection is refused:
psql: connectDBStart() -- connect() failed: Connection refused
Is the postmaster running locally
and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?
With only the fe-misc.c change, psql does not hang and displays the
following error message when the connection is refused:
psql: PQconnectPoll() -- connect() failed: error 10061
Is the postmaster running locally
and accepting connections on Unix socket '/tmp/.s.PGSQL.65432'?
In both cases there are no hangs, just the error messages are different.
Unfortunately, for the non-blocking case the error message is cryptic.
I tried tracking down error "10061" which comes from getsockopt(), but
I was unsuccessful. Is there any way to improve the readability of this
error message?
Also, the blocking connect change did *not* fix the connection refused
(spurious) regression test failures. So this change should probably be
backed out.
> > I'm wondering whether it makes sense to add a simple connection retry
> > policy as suggested above by Hiroshi?
>
> I do not think it is appropriate for libpq to do that.
When I made my suggestion above, I was concerned that maybe libpq was not
the right layer to be implementing connection policies and that possibly
psql was the better place.
> For one thing, where would you stop --- why exactly two tries?
This was another one of my concerns too.
> > 2. Change the backlog parameter to listen() in src/backend/libpq/pqcomm.c
> > to a number that will "ensure" that the parallel_schedule version of the
> > regression test does not generate connection refused conditions. Note
> > that I'm not even sure this will really work on all (or any) platforms.
>
> We already use SOMAXCONN which is supposed to be defined by the system
> as the maximum allowed queue depth. If Cygwin fails to define it, or
> defines it as something less than it should be, then we might consider
> installing a Cygwin-specific hack to redefine SOMAXCONN.
Cygwin defines SOMAXCONN to be 5. However, winsock.h defines it to be 5
while winsock2.h defines it to be 0x7fffffff. So, I'm not sure what
the real Cygwin (i.e., Windows) maximum is.
> However Hiroshi says later that he already tried this.
Even if it worked, this would have just pushed the problem instead of
really fixing it.
> I'm inclined to think
> that Cygwin simply has a problem with servicing concurrent connection
> requests, perhaps even before the alleged SOMAXCONN value is reached.
You meant Windows. Right? :,)
In summary, I feel that the fe-connect.c change should be backed out so
that Cygwin will be consistent with other UNIXes. I also hope that the
non-blocking connection failure message can be made more readable and
that make check will not generate spurious failure messages under Cygwin
on slow machines.
Thanks,
Jason
--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Jason(dot)Tishler(at)dothill(dot)com |
Cc: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-02 17:44:14 |
Message-ID: | 8565.986233454@sss.pgh.pa.us |
Lists: | pgsql-ports |
Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> In both cases there are no hangs, just the error messages are different.
> Unfortunately, for the non-blocking case the error message is cryptic.
> I tried tracking down error "10061" which comes from getsockopt(), but
> I was unsuccessful. Is there any way to improve the readability of this
> error message?
I'm inclined to leave the blocking-connect change in there just to
suppress that peculiar error message. No, I have no idea where it's
coming from, either.
>> However Hiroshi says later that he already tried [ raising SOMAXCONN ]
> Even if it worked, this would have just pushed the problem instead of
> really fixing it.
If the problem were overflow of the accept queue, then raising the
listen() parameter ought to fix it, assuming that Windows does actually
allow larger values for the parameter. Given that we are only hearing
this problem reported on Windows, I have a sneaking suspicion that the
effective queue length limit is 1 on that platform no matter what we
pass to listen(). Is there anyone we might ask about concurrent
connection-request handling on Windows?
regards, tom lane
From: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-02 19:32:52 |
Message-ID: | 20010402153252.H798@dothill.com |
Lists: | pgsql-ports |
Tom,
On Mon, Apr 02, 2001 at 01:44:14PM -0400, Tom Lane wrote:
> Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> > In both cases there are no hangs, just the error messages are different.
> > Unfortunately, for the non-blocking case the error message is cryptic.
> > I tried tracking down error "10061" which comes from getsockopt(), but
> > I was unsuccessful. Is there any way to improve the readability of this
> > error message?
>
> I'm inclined to leave the blocking-connect change in there just to
> suppress that peculiar error message. No, I have no idea where it's
> coming from, either.
I just figured out what error 10061 is -- it is WSAECONNREFUSED, Winsock's
version of ECONNREFUSED. I just submitted a patch to Cygwin that maps
getsockopt optval's from the Winsock versions to their corresponding
errno values. I just tried psql with an unconnected socket file and
psql displayed:
psql: PQconnectPoll() -- connect() failed: Connection refused
Is the postmaster running locally
and accepting connections on Unix socket '/tmp/.s.PGSQL.5432'?
as desired.
If interested, see the following for details:
http://www.cygwin.com/ml/cygwin-patches/2001-q2/msg00003.html
If my Cygwin patch is accepted, I'll let the list know. At that time, I
think that the fe-connect.c change should be backed out.
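The kind of translation described above can be sketched as a small lookup from Winsock error numbers to errno values. This is a hedged illustration only, not the patch submitted to Cygwin: the function name map_winsock_error() and the SKETCH_ constants are hypothetical, and only a handful of codes are covered. The numeric values are the standard Winsock ones, written out so the sketch compiles without Windows headers.

```c
#include <errno.h>

/* Standard Winsock error numbers, spelled out to avoid <winsock.h>. */
enum
{
    SKETCH_WSAECONNRESET = 10054,
    SKETCH_WSAETIMEDOUT = 10060,
    SKETCH_WSAECONNREFUSED = 10061,
    SKETCH_WSAEHOSTUNREACH = 10065
};

/*
 * Hypothetical mapping (not the actual Cygwin patch): translate a
 * Winsock error number into the corresponding errno value.  Codes
 * that are not in the table are returned unchanged.
 */
int
map_winsock_error(int wsa_error)
{
    switch (wsa_error)
    {
        case SKETCH_WSAECONNRESET:
            return ECONNRESET;
        case SKETCH_WSAETIMEDOUT:
            return ETIMEDOUT;
        case SKETCH_WSAECONNREFUSED:
            return ECONNREFUSED;
        case SKETCH_WSAEHOSTUNREACH:
            return EHOSTUNREACH;
        default:
            return wsa_error;
    }
}
```

Once the value is a real errno, strerror() can produce the familiar "Connection refused" text instead of a bare number.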
> >> However Hiroshi says later that he already tried [ raising SOMAXCONN ]
>
> > Even if it worked, this would have just pushed the problem instead of
> > really fixing it.
>
> If the problem were overflow of the accept queue, then raising the
> listen() parameter ought to fix it, assuming that Windows does actually
> allow larger values for the parameter. Given that we are only hearing
> this problem reported on Windows, I have a sneaking suspicion that the
> effective queue length limit is 1 on that platform no matter what we
> pass to listen(). Is there anyone we might ask about concurrent
> connection-request handling on Windows?
In digging some more through the MSDN, I found out the backlog limit
on NT 4.0 Workstation and Server is 5 and 200, respectively. So, it
would appear that NT is really using this parameter. If interested,
see the following for more details:
http://support.microsoft.com/support/kb/articles/Q127/1/44.asp
When running the parallel_schedule, as many as 18 psql's are trying to
connect to postmaster. Isn't it conceivable that more than 6 are trying
to connect concurrently?
Thanks,
Jason
--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Jason(dot)Tishler(at)dothill(dot)com |
Cc: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-02 19:50:55 |
Message-ID: | 15286.986241055@sss.pgh.pa.us |
Lists: | pgsql-ports |
Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> I just figured out what error 10061 is -- it is WSAECONNREFUSED, Winsock's
> version of ECONNREFUSED. I just submitted a patch to Cygwin that maps
> getsockopt optval's from the Winsock versions to their corresponding
> errno values.
Ah so. Sounds good.
> If my Cygwin patch is accepted, I'll let the list know. At that time, I
> think that the fe-connect.c change should be backed out.
My feeling is that we should leave it in place for 7.1 in any case.
Once there's a shipping Cygwin version that maps the error number
correctly, we can back out the patch so that Cygwin is treated more
like other platforms.
> In digging some more through the MSDN, I found out the backlog limit
> on NT 4.0 Workstation and Server is 5 and 200, respectively.
This page only talks about NT; what of other flavors of Windows? Cygwin
runs on more than NT, doesn't it?
Interesting point here: a copy of Postgres compiled on NT WS would
presumably see SOMAXCONN = 5 in the system headers. If the executable
is then moved to NT Server, it would fail to take advantage of the
higher queue limit. Do we need to hardwire a hack to use the larger
value always on Windows?
> When running the parallel_schedule, as many as 18 psql's are trying to
> connect to postmaster. Isn't it conceivable that more than 6 are trying
> to connect concurrently?
Yes (although that's still hypothesis, not the proven cause of failure).
I still suspect there's something else going on here, anyway. SOMAXCONN
is nominally 5 on quite a lot of Unixen, but we've only heard reports of
transient "make check" connect failures on Windows. Why is Windows so
much more prone to show this problem?
regards, tom lane
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Jason(dot)Tishler(at)dothill(dot)com |
Cc: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-02 20:32:16 |
Message-ID: | 22364.986243536@sss.pgh.pa.us |
Lists: | pgsql-ports |
I wrote:
> I still suspect there's something else going on here, anyway. SOMAXCONN
> is nominally 5 on quite a lot of Unixen, but we've only heard reports of
> transient "make check" connect failures on Windows. Why is Windows so
> much more prone to show this problem?
Hm, maybe I need to take this back. Some poking around shows that
SOMAXCONN is defined as 128 on Linux, 20 on HPUX, which are the
platforms I've tested most. As an experiment I reduced the listen()
parameter to 5 on HPUX, and bingo: I get connection-refused failures
in "make check". So it seems that Windows' behavior is not so out of
line after all. We would probably see similar failures on BSD-derived
systems, since BSD systems traditionally set SOMAXCONN to 5. (Any
BSD partisans able to check this?)
I do not think that we should change "make check" to avoid this issue.
If you are on a platform that has a problem with supporting lots of
parallel connection requests, it seems to me that you'd best know about
that limitation, and "make check" is doing you a service by pointing
out the problem.
What I do think we should consider is whether to believe SOMAXCONN
unconditionally, or to use a large value in the listen() call no matter
what the system headers claim SOMAXCONN is. This would avoid
sillinesses such as using an NT-Workstation limit on an NT-Server
machine. The only risk I can see is that some platforms might reject
as erroneous a listen() parameter that's more than they are prepared to
support. The Unix man pages I have access to claim that a too-large
listen() parameter will be clamped to the kernel's SOMAXCONN without
raising an error, but does anyone have an idea whether that behavior
is universal?
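
The idea of not trusting SOMAXCONN could be sketched roughly as follows. This is only an illustration under stated assumptions: the floor value SKETCH_MIN_BACKLOG and the helper names choose_backlog and open_listener are hypothetical, and the sketch relies on the kernel silently clamping an oversized backlog, which is exactly the behavior the paragraph above asks about.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical floor, chosen only for illustration. */
#define SKETCH_MIN_BACKLOG 10000

/* Pick the listen() backlog: never less than the floor, whatever the
 * compile-time SOMAXCONN in the system headers happened to be. */
int choose_backlog(int header_somaxconn)
{
    return header_somaxconn > SKETCH_MIN_BACKLOG
        ? header_somaxconn
        : SKETCH_MIN_BACKLOG;
}

/* Open a TCP listener on the loopback interface.
 * Returns the listening descriptor, or -1 on failure. */
int open_listener(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    /* Assumes a too-large backlog is clamped by the kernel, not rejected. */
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, choose_backlog(SOMAXCONN)) < 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}
```

On a platform that rejects an oversized backlog, listen() would fail here and open_listener would return -1, so a real implementation might want a fallback that retries with SOMAXCONN.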
In the longer term, we should think about whether we can reduce the
postmaster's connection service delay. Someone recently suggested
that the postmaster should fork a child immediately upon receiving
a connection, and let the child work on the authentication process
while the parent goes right back to accept(). I'm not sure if that
would help "make check" very much, since it's presumably not running
anything more complex than "trust" authentication anyway. But it
should eliminate auth delays caused by SSL, malfunctioning ident
daemons, and sundry other problems.
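
The fork-immediately scheme described above might look something like the following sketch. It is not the postmaster's code: hand_off_to_child, accept_loop, and authenticate_client are hypothetical names, and authenticate_client is an empty placeholder standing in for whatever slow work (SSL handshake, ident lookup) the child would do.

```c
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>

/* Hypothetical placeholder for the slow part of connection startup. */
static int authenticate_client(int client_fd)
{
    (void) client_fd;
    return 0;                   /* 0 means "authenticated" in this sketch */
}

/* Fork a child to handle an already-accepted connection.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
pid_t hand_off_to_child(int client_fd, int listen_fd)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: it never accepts, so drop the listening socket. */
        int ok;

        if (listen_fd >= 0)
            close(listen_fd);
        ok = (authenticate_client(client_fd) == 0);
        close(client_fd);
        _exit(ok ? 0 : 1);
    }
    /* Parent (or fork failure): the child owns the connection now. */
    close(client_fd);
    return pid;
}

/* Parent loop: accept, hand off, and go straight back to accept(). */
int accept_loop(int listen_fd)
{
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);

        if (client_fd < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        hand_off_to_child(client_fd, listen_fd);
        /* Reap any finished children without blocking. */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;
    }
}
```

Because the parent does nothing between accept() calls except fork(), a slow authentication in one child does not keep other pending connections waiting in the listen queue.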
regards, tom lane
From: | Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-02 20:34:26 |
Message-ID: | 20010402163426.J798@dothill.com |
Lists: | pgsql-ports |
Tom,
On Mon, Apr 02, 2001 at 03:50:55PM -0400, Tom Lane wrote:
> Jason Tishler <Jason(dot)Tishler(at)dothill(dot)com> writes:
> > If my Cygwin patch is accepted, I'll let the list know. At that time, I
> > think that the fe-connect.c change should be backed out.
>
> My feeling is that we should leave it in place for 7.1 in any case.
> Once there's a shipping Cygwin version that maps the error number
> correctly, we can back out the patch so that Cygwin is treated more
> like other platforms.
OK, the above plan is reasonable.
> > In digging some more through the MSDN, I found out the backlog limit
> > on NT 4.0 Workstation and Server is 5 and 200, respectively.
>
> This page only talks about NT; what of other flavors of Windows? Cygwin
> runs on more than NT, doesn't it?
Yes, it runs on 2000, 9X/Me, and even XP. Unfortunately, I couldn't
(easily) find the limits for these versions. My WAG is that 2000 and
XP will be the same or similar to NT. I am not concerned about 9X/Me
because I find them unusable for other reasons.
> Interesting point here: a copy of Postgres compiled on NT WS would
> presumably see SOMAXCONN = 5 in the system headers. If the executable
> is then moved to NT Server, it would fail to take advantage of the
> higher queue limit.
Actually, even if compiled on NT Server, SOMAXCONN is set to 5 due to
Cygwin's socket.h.
> Do we need to hardwire a hack to use the larger
> value always on Windows?
Sounds like a good idea, but the effort only seems reasonable if we can
conclude that Windows will really take advantage of it.
> > When running the parallel_schedule, as many as 18 psql's are trying to
> > connect to postmaster. Isn't it conceivable that more than 6 are trying
> > to connect concurrently?
>
> Yes (although that's still hypothesis, not the proven cause of failure).
>
> I still suspect there's something else going on here, anyway. SOMAXCONN
> is nominally 5 on quite a lot of Unixen, but we've only heard reports of
> transient "make check" connect failures on Windows. Why is Windows so
> much more prone to show this problem?
I don't know! I've been banging my head to find out why and my head is
starting to hurt... :,)
Jason
--
Jason Tishler
Director, Software Engineering Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp. Fax: +1 (732) 264-8798
82 Bethany Road, Suite 7 Email: Jason(dot)Tishler(at)dothill(dot)com
Hazlet, NJ 07730 USA WWW: http://www.dothill.com
From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Jason(dot)Tishler(at)dothill(dot)com, Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-02 22:01:48 |
Message-ID: | 200104022201.SAA28391@candle.pha.pa.us |
Lists: | pgsql-ports |
> I wrote:
> > I still suspect there's something else going on here, anyway. SOMAXCONN
> > is nominally 5 on quite a lot of Unixen, but we've only heard reports of
> > transient "make check" connect failures on Windows. Why is Windows so
> > much more prone to show this problem?
>
> Hm, maybe I need to take this back. Some poking around shows that
> SOMAXCONN is defined as 128 on Linux, 20 on HPUX, which are the
> platforms I've tested most. As an experiment I reduced the listen()
> parameter to 5 on HPUX, and bingo: I get connection-refused failures
> in "make check". So it seems that Windows' behavior is not so out of
> line after all. We would probably see similar failures on BSD-derived
> systems, since BSD systems traditionally set SOMAXCONN to 5. (Any
> BSD partisans able to check this?)
BSDi 4.01 has:
/*
* Maximum queue length specifiable by listen.
* The kernel has a configurable limit;
* the non-kernel value is the traditional one.
*/
#ifndef KERNEL
#define SOMAXCONN 64 /* XXX, really run-time settable */
#else
#ifndef _POSIX_SOURCE
#define SOMAXCONN_DFLT 64
#endif
#endif
and sysctl has:
net.socket.maxconn = 64
that can be easily changed.
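
The header comment above distinguishes the compile-time constant from the kernel's run-time limit. A minimal sketch for checking what a given build actually sees in its headers follows; the function names are hypothetical, and the value reported is only the compile-time one, not whatever sysctl currently allows.

```c
#include <stdio.h>
#include <sys/socket.h>

/* Return the SOMAXCONN value baked in when this file was compiled. */
int compiled_somaxconn(void)
{
    return SOMAXCONN;
}

/* Print the compile-time value; the kernel's run-time limit may differ. */
void report_somaxconn(void)
{
    printf("compile-time SOMAXCONN = %d\n", compiled_somaxconn());
}
```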
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Jason(dot)Tishler(at)dothill(dot)com, Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>, pgsql-ports(at)postgresql(dot)org |
Subject: | Re: Cygwin PostgreSQL Regression Test Problems (Revisited) |
Date: | 2001-04-02 22:03:28 |
Message-ID: | 200104022203.SAA28474@candle.pha.pa.us |
Lists: | pgsql-ports |
> In the longer term, we should think about whether we can reduce the
> postmaster's connection service delay. Someone recently suggested
> that the postmaster should fork a child immediately upon receiving
> a connection, and let the child work on the authentication process
> while the parent goes right back to accept(). I'm not sure if that
> would help "make check" very much, since it's presumably not running
> anything more complex than "trust" authentication anyway. But it
> should eliminate auth delays caused by SSL, malfunctioning ident
> daemons, and sundry other problems.
I think the trust for SSL/ident would be a good idea.
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026