A couple of random BF failures in kerberosCheck

Lists: pgsql-hackers
From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: A couple of random BF failures in kerberosCheck
Date: 2019-08-02 09:32:43
Message-ID: CA+hUKG+JcvT5K_rd6teGzNeMSv8vxB3pzFFYFHyZe4gR2AWZdg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2019-07-24%2003%3A22%3A17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-02%2007%3A17%3A25

I wondered if this might be like the recently fixed problem with slapd
not being ready to handle requests yet, since we start up krb5kdc
first and then don't do anything explicit to wait for it, but it
doesn't look like an obvious failure to reach it. It looks like test
3 on elver connected successfully but didn't like the answer it got
for this query:

SELECT gss_authenticated AND encrypted from pg_stat_gssapi where pid =
pg_backend_pid();

--
Thomas Munro
https://enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A couple of random BF failures in kerberosCheck
Date: 2019-08-03 21:04:16
Message-ID: 21009.1564866256@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2019-07-24%2003%3A22%3A17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-02%2007%3A17%3A25

> I wondered if this might be like the recently fixed problem with slapd
> not being ready to handle requests yet, since we start up krb5kdc
> first and then don't do anything explicit to wait for it, but it
> doesn't look like an obvious failure to reach it.

I spent a bit of time trying to reproduce these failures, using a
Fedora 29 install that should pretty nearly match what crake is
running. I didn't yet match the shown failures, but I did get this
after a bunch of attempts:

# Running: /usr/sbin/krb5kdc -P /home/tgl/pgsql/src/test/kerberos/tmp_check/krb5kdc.pid
Bail out! system /usr/sbin/krb5kdc failed

Looking into the tmp_check/krb5kdc.log file finds

Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](info): setting up network...
krb5kdc: Address already in use - Cannot bind server socket on 127.0.0.1.55324
Aug 03 16:04:01 mini12.sss.pgh.pa.us krb5kdc[14340](Error): Failed setting up a TCP socket (for 127.0.0.1.55324)
krb5kdc: Address already in use - Error setting up network

So this leads to two points:

* kerberos/t/001_auth.pl just blithely assumes that it can pick
any random port above 48K and that's guaranteed to be free.
Maybe we should split out the code in get_new_node for finding
a free TCP port, so we can call it here?

* AFAICS, the only provision for shutting down krb5kdc at the end of
the test run is

END
{
kill 'INT', `cat $kdc_pidfile` if -f $kdc_pidfile;
}

I wonder how reliable that is, especially in contexts where the calling
script might do "rm -rf tmp_check" shortly afterwards. Maybe it'd be
better to try to shut down krb5kdc explicitly before we exit the test
script. I'd suggest waiting for krb5kdc to remove its pidfile, except
it seems not to do so :-(

Despite my suspicions about the shutdown provisions, I found that this is
somewhat reproducible and the problematic port is *not* the one assigned
in the previous iteration of 001_auth.pl. However, I notice that after
running this test in a loop for awhile, there are an awful lot of local
loopback connections in TIME_WAIT state. I hypothesize that the failures
correspond to cases where we try to re-use a port number that some
previous test iteration used, possibly on the client side not the server
side of GSS. I wonder whether we are doing something that keeps those
GSS query connections from being closed more cleanly/rapidly. (The
sockets do go away after a minute or so, but why are they in TIME_WAIT
at all?)

None of these points seem to explain the buildfarm failures, though,
especially not elver's where only one connection attempt failed.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A couple of random BF failures in kerberosCheck
Date: 2019-08-03 22:42:48
Message-ID: 3397.1564872168@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> * kerberos/t/001_auth.pl just blithely assumes that it can pick
> any random port above 48K and that's guaranteed to be free.
> Maybe we should split out the code in get_new_node for finding
> a free TCP port, so we can call it here?

I've confirmed that the reason it's failing on my machine is exactly
that krb5kdc tries to bind to a socket that is still in TIME_WAIT state.
Also, it looks like the socket is typically one that was used by the
GSSAPI client side (no surprise, the test leaves a lot more of those
than the one server socket), so we'd have no record of it even if we
were somehow saving state from prior runs.

So I propose the attached patch, which seems to fix this for me.

The particular case I'm looking at (running these tests in a tight
loop) is of course not that interesting, but I argue that it's just
increasing the odds of failure enough that I can isolate the cause.
A buildfarm animal running both kerberos and ldap tests is almost
certainly at risk of such a failure with low probability.

(Still don't know what actually happened in those two buildfarm
failures, though.)

regards, tom lane

Attachment Content-Type Size
select-unused-port-1.patch text/x-diff 6.3 KB

From: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A couple of random BF failures in kerberosCheck
Date: 2019-08-04 12:24:25
Message-ID: 680020e8-b659-2405-1d94-76696ba7747f@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 8/3/19 6:42 PM, Tom Lane wrote:
> I wrote:
>> * kerberos/t/001_auth.pl just blithely assumes that it can pick
>> any random port above 48K and that's guaranteed to be free.
>> Maybe we should split out the code in get_new_node for finding
>> a free TCP port, so we can call it here?
> I've confirmed that the reason it's failing on my machine is exactly
> that krb5kdc tries to bind to a socket that is still in TIME_WAIT state.
> Also, it looks like the socket is typically one that was used by the
> GSSAPI client side (no surprise, the test leaves a lot more of those
> than the one server socket), so we'd have no record of it even if we
> were somehow saving state from prior runs.
>
> So I propose the attached patch, which seems to fix this for me.
>
> The particular case I'm looking at (running these tests in a tight
> loop) is of course not that interesting, but I argue that it's just
> increasing the odds of failure enough that I can isolate the cause.
> A buildfarm animal running both kerberos and ldap tests is almost
> certainly at risk of such a failure with low probability.
>
> (Still don't know what actually happened in those two buildfarm
> failures, though.)
>
>

Looks good. A couple of minor nits:

. since we're exporting the name there's no need to document it as a
class method. I'd remove the "PostgresNode->" from the couple of places
you have it in the docco. You're not actually calling it that way
anywhere, and indeed doing so ends up passing 'PostgresNode' as a
useless parameter to the subroutine. This is different from calling it
with a qualified name (PostgresNode::get_free_port()).

. in the inner loop we should probably exit the loop if we set found to
0. There's no point testing other addresses in that case. Something like
"last unless found;" would do the trick.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A couple of random BF failures in kerberosCheck
Date: 2019-08-04 15:59:52
Message-ID: 22053.1564934392@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
> On 8/3/19 6:42 PM, Tom Lane wrote:
>> So I propose the attached patch, which seems to fix this for me.

> Looks good. A couple of minor nits:

Will fix, thanks for the review!

regards, tom lane