Lists: | buildfarm-members |
---|
From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | buildfarm-members(at)lists(dot)postgresql(dot)org |
Subject: | strange git problems on turaco |
Date: | 2024-12-02 01:20:35 |
Message-ID: | 6a705172-5b28-4023-a40e-fb7805c717c4@vondra.me |
Hi,
turaco seems to be having some strange git issues - some of the
buildfarm runs fail like this:
turaco:REL_16_STABLE [22:41:11] OK
Sun Dec 1 22:41:27 2024: buildfarm run for turaco:REL_17_STABLE starting
turaco:REL_17_STABLE [22:41:27] checking out source ...
Missing checked out branch bf_REL_17_STABLE:
fatal: not a git repository (or any parent up to mount point /mnt)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
turaco:REL_17_STABLE [22:41:32] failed at stage pgsql-Git
Sun Dec 1 22:41:33 2024: buildfarm run for turaco:HEAD starting
turaco:HEAD [22:41:33] checking out source ...
I initially suspected this might be due to aging storage (SD card on
rpi), but I replaced that, and there's nothing strange in dmesg. Also,
other branches seem to be working fine ...
Any ideas what could be causing this?
regards
--
Tomas Vondra
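One way to spot an affected checkout before the next run trips over it is to look for a `.git` directory that lacks the basics. This is only a rough sketch: the function name `find_broken_checkouts` is invented, and the `<buildroot>/<branch>/pgsql` layout is an assumption that may not match every animal's configuration. It would also misreport setups where `.git` is a plain file rather than a directory.

```shell
#!/bin/sh
# Sketch: list per-branch checkouts whose .git lacks HEAD, objects/ or refs/.
# Assumes a layout of $buildroot/<branch>/pgsql (an assumption, not verified
# against the buildfarm client); find_broken_checkouts is a made-up name.
find_broken_checkouts() {
    buildroot=$1
    for dir in "$buildroot"/*/pgsql; do
        # Skip the unexpanded glob when no branch directories exist.
        [ -d "$dir" ] || continue
        gitdir=$dir/.git
        # Git treats a directory as a repository only if these are present.
        if [ ! -e "$gitdir/HEAD" ] || [ ! -d "$gitdir/objects" ] || [ ! -d "$gitdir/refs" ]; then
            echo "$dir"
        fi
    done
}
```

Calling `find_broken_checkouts /path/to/buildroot` prints one path per suspicious checkout and prints nothing when every checkout looks intact.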
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | buildfarm-members(at)lists(dot)postgresql(dot)org |
Subject: | Re: strange git problems on turaco |
Date: | 2024-12-02 01:56:30 |
Message-ID: | 1515248.1733104590@sss.pgh.pa.us |
Tomas Vondra <tomas(at)vondra(dot)me> writes:
> turaco seems to be having some strange git issues - some of the
> buildfarm runs fail like this:
Have you tried rm -rf'ing its git repo and letting the script
check that out from scratch? The fact that it's just the 17
branch has a whiff of repo corruption.
Andrew might correct me, but I think you have to remove
both the pgmirror.git directory and the per-branch pgsql
subdirectories to be clean. Don't remove the various
<animal>.* status files.
regards, tom lane
From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | buildfarm-members(at)lists(dot)postgresql(dot)org |
Subject: | Re: strange git problems on turaco |
Date: | 2024-12-02 02:23:10 |
Message-ID: | ed54ba1c-bc7e-4dd8-ba93-078775daa5ff@vondra.me |
On 12/2/24 02:56, Tom Lane wrote:
> Tomas Vondra <tomas(at)vondra(dot)me> writes:
>> turaco seems to be having some strange git issues - some of the
>> buildfarm runs fail like this:
>
> Have you tried rm -rf'ing its git repo and letting the script
> check that out from scratch? The fact that it's just the 17
> branch has a whiff of repo corruption.
>
I actually nuked the whole buildroot, because the old SD card was having
issues and I wasn't sure what might be corrupted. So it's all fresh. But
I also first ran
./run_branches.pl --run-all --nosend --nostatus
just to make sure everything works fine, and it did ...
> Andrew might correct me, but I think you have to remove
> both the pgmirror.git directory and the per-branch pgsql
> subdirectories to be clean. Don't remove the various
> <animal>.* status files.
>
Done. Let's see how quickly it breaks again.
regards
--
Tomas Vondra
From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | buildfarm-members(at)lists(dot)postgresql(dot)org |
Subject: | Re: strange git problems on turaco |
Date: | 2024-12-02 03:46:23 |
Message-ID: | 20241202034623.39@rfd.leadboat.com |
On Mon, Dec 02, 2024 at 02:20:35AM +0100, Tomas Vondra wrote:
> turaco seems to be having some strange git issues - some of the
> buildfarm runs fail like this:
>
>
> turaco:REL_16_STABLE [22:41:11] OK
> Sun Dec 1 22:41:27 2024: buildfarm run for turaco:REL_17_STABLE starting
> turaco:REL_17_STABLE [22:41:27] checking out source ...
> Missing checked out branch bf_REL_17_STABLE:
> fatal: not a git repository (or any parent up to mount point /mnt)
> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
> turaco:REL_17_STABLE [22:41:32] failed at stage pgsql-Git
> Sun Dec 1 22:41:33 2024: buildfarm run for turaco:HEAD starting
> turaco:HEAD [22:41:33] checking out source ...
>
>
> I initially suspected this might be due to aging storage (SD card on
> rpi), but I replaced that, and there's nothing strange in dmesg. Also,
> other branches seem to be working fine ...
>
> Any ideas what could be causing this?
I had this happen ~9 times on the host of my AIX buildfarm members. Example:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-07-10%2019%3A51%3A28
I figured it was some system problem, so I didn't root-cause it. I carry the
following workaround in my fork of the buildfarm client code. The unknown
problem caused failure reports and work stoppage ~4 times before I installed
this workaround, then logs show the workaround prevented damage 5 times. The
last "removed intruder .git" log message appeared on 2024-07-23. There was no
kernel reboot, and logs don't point to buildfarm client processes getting
involuntary termination, either.
diff --git a/PGBuild/SCM.pm b/PGBuild/SCM.pm
index dcfd180..2cd610a 100644
--- a/PGBuild/SCM.pm
+++ b/PGBuild/SCM.pm
@@ -1059,9 +1059,19 @@ sub _update_target
     my @gitlog;

     # If a run crashed during copy_source(), repair.
-    if (-d "./git-save" && !-d "$target/.git")
+    if (-d "./git-save")
     {
+        # As of 2024-07-13, the following has happened about four times in the
+        # last month, to different gcc111 animals. Despite no known crash,
+        # there's a git-save directory containing the proper git repo, and
+        # there's a bogus .git missing most content. Remove the bogus one.
+        # This is deeply hacky, but it beats buildfarm report noise and manual
+        # intervention.
+        if (rmtree("$target/.git") > 0) {
+            print "removed intruder .git\n" if $verbose;
+        }
         move "./git-save", "$target/.git";
+        print "restored git-save\n" if $verbose;
     }

     chdir $target;
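For anyone who would rather not read Perl, the repair step in the patch can be approximated in shell. This is a hypothetical rendering for illustration only, not code from the buildfarm client: `repair_git_dir` is an invented name, and it always prints its messages instead of honoring `$verbose`.

```shell
#!/bin/sh
# Hypothetical shell rendering of the patched repair step.
# repair_git_dir is an invented name; the real logic lives in PGBuild/SCM.pm.
repair_git_dir() {
    target=$1
    # Act whenever a saved copy of the repository metadata exists,
    # even if the target already has a .git of its own.
    if [ -d ./git-save ]; then
        # A .git that coexists with git-save is assumed to be the bogus,
        # mostly empty one, so it is removed first.
        if [ -e "$target/.git" ]; then
            rm -rf "$target/.git"
            echo "removed intruder .git"
        fi
        # Put the saved repository metadata back in place.
        mv ./git-save "$target/.git"
        echo "restored git-save"
    fi
}
```

The only behavioral difference from the unpatched check is that the presence of `$target/.git` no longer blocks the restore; when there is no `git-save` directory the function does nothing.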
From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Noah Misch <noah(at)leadboat(dot)com> |
Cc: | buildfarm-members(at)lists(dot)postgresql(dot)org |
Subject: | Re: strange git problems on turaco |
Date: | 2024-12-02 13:51:29 |
Message-ID: | df59e428-05f5-4850-bbe7-e94181883811@vondra.me |
On 12/2/24 04:46, Noah Misch wrote:
> On Mon, Dec 02, 2024 at 02:20:35AM +0100, Tomas Vondra wrote:
>> turaco seems to be having some strange git issues - some of the
>> buildfarm runs fail like this:
>>
>>
>> turaco:REL_16_STABLE [22:41:11] OK
>> Sun Dec 1 22:41:27 2024: buildfarm run for turaco:REL_17_STABLE starting
>> turaco:REL_17_STABLE [22:41:27] checking out source ...
>> Missing checked out branch bf_REL_17_STABLE:
>> fatal: not a git repository (or any parent up to mount point /mnt)
>> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
>> turaco:REL_17_STABLE [22:41:32] failed at stage pgsql-Git
>> Sun Dec 1 22:41:33 2024: buildfarm run for turaco:HEAD starting
>> turaco:HEAD [22:41:33] checking out source ...
>>
>>
>> I initially suspected this might be due to aging storage (SD card on
>> rpi), but I replaced that, and there's nothing strange in dmesg. Also,
>> other branches seem to be working fine ...
>>
>> Any ideas what could be causing this?
>
> I had this happen ~9 times on the host of my AIX buildfarm members. Example:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-07-10%2019%3A51%3A28
>
> I figured it was some system problem, so I didn't root-cause it. I carry the
> following workaround in my fork of the buildfarm client code. The unknown
> problem caused failure reports and work stoppage ~4 times before I installed
> this workaround, then logs show the workaround prevented damage 5 times. The
> last "removed intruder .git" log message appeared on 2024-07-23. There was no
> kernel reboot, and logs don't point to buildfarm client processes getting
> involuntary termination, either.
>
Thanks. I suspect some system issue too, but I didn't want to blame the
system without some kind of proof. I applied your patch; let's see whether
it helps after a couple of runs.
regards
--
Tomas Vondra
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Noah Misch <noah(at)leadboat(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | buildfarm-members(at)lists(dot)postgresql(dot)org |
Subject: | Re: strange git problems on turaco |
Date: | 2024-12-12 15:43:58 |
Message-ID: | 7762aec1-971d-4121-83df-57bb5836ef10@dunslane.net |
On 2024-12-01 Su 10:46 PM, Noah Misch wrote:
> On Mon, Dec 02, 2024 at 02:20:35AM +0100, Tomas Vondra wrote:
>> turaco seems to be having some strange git issues - some of the
>> buildfarm runs fail like this:
>>
>>
>> turaco:REL_16_STABLE [22:41:11] OK
>> Sun Dec 1 22:41:27 2024: buildfarm run for turaco:REL_17_STABLE starting
>> turaco:REL_17_STABLE [22:41:27] checking out source ...
>> Missing checked out branch bf_REL_17_STABLE:
>> fatal: not a git repository (or any parent up to mount point /mnt)
>> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
>> turaco:REL_17_STABLE [22:41:32] failed at stage pgsql-Git
>> Sun Dec 1 22:41:33 2024: buildfarm run for turaco:HEAD starting
>> turaco:HEAD [22:41:33] checking out source ...
>>
>>
>> I initially suspected this might be due to aging storage (SD card on
>> rpi), but I replaced that, and there's nothing strange in dmesg. Also,
>> other branches seem to be working fine ...
>>
>> Any ideas what could be causing this?
> I had this happen ~9 times on the host of my AIX buildfarm members. Example:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2024-07-10%2019%3A51%3A28
>
> I figured it was some system problem, so I didn't root-cause it. I carry the
> following workaround in my fork of the buildfarm client code. The unknown
> problem caused failure reports and work stoppage ~4 times before I installed
> this workaround, then logs show the workaround prevented damage 5 times. The
> last "removed intruder .git" log message appeared on 2024-07-23. There was no
> kernel reboot, and logs don't point to buildfarm client processes getting
> involuntary termination, either.
>
> diff --git a/PGBuild/SCM.pm b/PGBuild/SCM.pm
> index dcfd180..2cd610a 100644
> --- a/PGBuild/SCM.pm
> +++ b/PGBuild/SCM.pm
> @@ -1059,9 +1059,19 @@ sub _update_target
> my @gitlog;
>
> # If a run crashed during copy_source(), repair.
> - if (-d "./git-save" && !-d "$target/.git")
> + if (-d "./git-save")
> {
> + # As of 2024-07-13, the following has happened about four times in the
> + # last month, to different gcc111 animals. Despite no known crash,
> + # there's a git-save directory containing the proper git repo, and
> + # there's a bogus .git missing most content. Remove the bogus one.
> + # This is deeply hacky, but it beats buildfarm report noise and manual
> + # intervention.
> + if (rmtree("$target/.git") > 0) {
> + print "removed intruder .git\n" if $verbose;
> + }
> move "./git-save", "$target/.git";
> + print "restored git-save\n" if $verbose;
> }
>
> chdir $target;
>
>
[catching up a huge email backlog]
That's kinda weird. The .git directory doesn't get moved at all if you
have vpath turned on or you're building with meson (which always does
vpath). So that's one possible workaround.
I guess I should put something like this in the next release ... will go
and do that.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | buildfarm-members(at)lists(dot)postgresql(dot)org |
Subject: | Re: strange git problems on turaco |
Date: | 2024-12-12 15:46:49 |
Message-ID: | ece4d286-89cf-42a9-9dbc-549017131a8d@dunslane.net |
On 2024-12-01 Su 8:56 PM, Tom Lane wrote:
> Tomas Vondra <tomas(at)vondra(dot)me> writes:
>> turaco seems to be having some strange git issues - some of the
>> buildfarm runs fail like this:
> Have you tried rm -rf'ing its git repo and letting the script
> check that out from scratch? The fact that it's just the 17
> branch has a whiff of repo corruption.
>
> Andrew might correct me, but I think you have to remove
> both the pgmirror.git directory and the per-branch pgsql
> subdirectories to be clean. Don't remove the various
> <animal>.* status files.
>
>
In most cases you only have to remove the per-branch pgsql directory.
I've only very occasionally seen corruption of the mirror.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
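The cleanup advice in this thread could be scripted roughly as follows. Treat it as a sketch under assumptions: the function name `reset_checkout` is made up, and the layout (`<buildroot>/<branch>/pgsql` plus `<buildroot>/pgmirror.git`) is inferred from the messages above and should be checked against the animal's own configuration before deleting anything.

```shell
#!/bin/sh
# Sketch only: assumes $buildroot/<branch>/pgsql and $buildroot/pgmirror.git.
# reset_checkout is an invented name, not part of the buildfarm client.
reset_checkout() {
    buildroot=$1
    branch=$2
    also_mirror=${3:-no}
    # Refuse empty arguments so the paths below cannot collapse to "/pgsql".
    [ -n "$buildroot" ] && [ -n "$branch" ] || return 1
    # In most cases removing the per-branch checkout is enough.
    rm -rf "$buildroot/$branch/pgsql"
    # Remove the mirror only on request, since it is rarely the corrupt part.
    if [ "$also_mirror" = "yes" ]; then
        rm -rf "$buildroot/pgmirror.git"
    fi
    # The <animal>.* status files are deliberately left untouched.
}
```

For example, `reset_checkout /path/to/buildroot REL_17_STABLE` removes only that branch's checkout, and passing `yes` as a third argument also removes the mirror so the next run clones it again.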