From: | Sudalai <sudalait2(at)gmail(dot)com> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Segmentation Fault in logical decoding get/peek API |
Date: | 2018-02-19 12:20:14 |
Message-ID: | 1519042814659-0.post@n3.nabble.com |
Lists: | pgsql-bugs |
Hi everyone,
A PostgreSQL backend is crashing with a segmentation fault (signal 11) while consuming
the changes of a big transaction using pg_logical_slot_(peek|get)_changes. We are using
PostgreSQL version 9.5.10.
About the big transaction: pg_xlog_dump shows 161426 records, and it is spread
across 3 WAL files.
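For context, the failing call has the following shape (a minimal sketch; the output
plugin named here is only an illustrative assumption, while the slot name and change
limit match the query visible in the backtrace below):

-- illustrative only: the output plugin is an assumption, not necessarily ours
SELECT * FROM pg_create_logical_replication_slot('pg_slotnew', 'test_decoding');
-- consuming the big transaction; both peek and get crash the backend
SELECT count(*) FROM pg_logical_slot_peek_changes('pg_slotnew', NULL, 1000);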
Backtrace of the backend:
Program received signal SIGSEGV, Segmentation fault.
ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
1380 change->action = REORDER_BUFFER_CHANGE_INSERT;
#0 ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
#1 0x0000000000675cd1 in DecodeCommit (ctx=0xcffdb8, record=<value
optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/decode.c:549
#2 DecodeXactOp (ctx=0xcffdb8, record=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/decode.c:234
#3 LogicalDecodingProcessRecord (ctx=0xcffdb8, record=<value optimized
out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/decode.c:111
#4 0x00000000006772e5 in pg_logical_slot_get_changes_guts
(fcinfo=0x7ffea1ef0b90, confirm=0 '\000', binary=0 '\000') at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/logicalfuncs.c:442
#5 0x00000000005c4074 in ExecMakeTableFunctionResult (funcexpr=0xce55a0,
econtext=0xce5190, argContext=<value optimized out>, expectedDesc=0xce6360,
randomAccess=0 '\000') at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execQual.c:2189
#6 0x00000000005d8652 in FunctionNext (node=0xce5080) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeFunctionscan.c:95
#7 0x00000000005c4d73 in ExecScanFetch (node=0xce5080, accessMtd=0x5d83a0
<FunctionNext>, recheckMtd=0x5d7cd0 <FunctionRecheck>) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execScan.c:95
#8 ExecScan (node=0xce5080, accessMtd=0x5d83a0 <FunctionNext>,
recheckMtd=0x5d7cd0 <FunctionRecheck>) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execScan.c:145
#9 0x00000000005bd808 in ExecProcNode (node=0xce5080) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execProcnode.c:442
#10 0x00000000005caca6 in fetch_input_tuple (aggstate=0xce4808) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeAgg.c:487
#11 0x00000000005cc286 in agg_retrieve_direct (node=<value optimized out>)
at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeAgg.c:1683
#12 ExecAgg (node=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeAgg.c:1508
#13 0x00000000005bd748 in ExecProcNode (node=0xce4808) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execProcnode.c:496
#14 0x00000000005bbfdf in ExecutePlan (queryDesc=0xce93a8, direction=<value
optimized out>, count=0) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execMain.c:1549
#15 standard_ExecutorRun (queryDesc=0xce93a8, direction=<value optimized
out>, count=0) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execMain.c:337
#16 0x00000000006c1f57 in PortalRunSelect (portal=0xd2b9c8, forward=<value
optimized out>, count=0, dest=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/tcop/pquery.c:942
#17 0x00000000006c31c7 in PortalRun (portal=0xd2b9c8,
count=9223372036854775807, isTopLevel=1 '\001', dest=0xced478,
altdest=0xced478, completionTag=0x7ffea1ef1440 "") at
/home/test/compile/../postgresql-9.5.10/src/backend/tcop/pquery.c:786
#18 0x00000000006bff50 in exec_simple_query (query_string=0xccd7e8 "select
count(*) from pg_logical_slot_peek_changes('pg_slotnew',null,1000);") at
/home/test/compile/../postgresql-9.5.10/src/backend/tcop/postgres.c:1096
#19 0x00000000006c157c in PostgresMain (argc=<value optimized out>,
argv=<value optimized out>, dbname=0xc36100 "sasdb", username=<value
optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/tcop/postgres.c:4049
#20 0x0000000000665d1c in BackendRun (argc=<value optimized out>,
argv=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:4264
#21 BackendStartup (argc=<value optimized out>, argv=<value optimized out>)
at
/home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:3938
#22 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:1703
#23 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at
/home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:1311
#24 0x00000000005f4309 in main (argc=3, argv=0xc34b60) at
/home/test/compile/../postgresql-9.5.10/src/backend/main/main.c:228
Is this a bug? Is there any way I could avoid it?
If you need any more information for debugging, kindly let me know.
Thanks in advance,
Sudalai
-----
sudalai
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html
From: | Sudalai <sudalait2(at)gmail(dot)com> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2018-02-20 11:00:07 |
Message-ID: | 1519124407799-0.post@n3.nabble.com |
Lists: | pgsql-bugs |
Kindly help me resolve the issue.
-----
sudalai
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html
From: | Sudalai <sudalait2(at)gmail(dot)com> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2018-02-20 11:01:04 |
Message-ID: | 1519124464740-0.post@n3.nabble.com |
Lists: | pgsql-bugs |
Kindly help me resolve this issue.
-----
sudalai
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Sudalai <sudalait2(at)gmail(dot)com> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2018-02-20 17:53:26 |
Message-ID: | 20180220175326.kzy4tdqaevb6r5ec@alap3.anarazel.de |
Lists: | pgsql-bugs |
On 2018-02-20 04:00:07 -0700, Sudalai wrote:
> Kindly help me resolve the issue.
Please don't spam the list with daily requests for status
updates. Especially not two within minutes, less than a day after the
original report.
Greetings,
Andres Freund
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Sudalai <sudalait2(at)gmail(dot)com> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2018-02-20 17:55:30 |
Message-ID: | 20180220175530.6ib4b3ngzkipekgs@alap3.anarazel.de |
Lists: | pgsql-bugs |
Hi,
On 2018-02-19 05:20:14 -0700, Sudalai wrote:
> PostgreSQL backend crashing with segmentation fault 11, while consuming big
> transaction changes using pg_logical_slot_(peek|get)_changes. We are using
> postgresql version 9.5.10.
Just to be sure, could you check that 9.5.11 shows the same problem?
> About the big transaction : pg_xlog_dump shows 161426 records, and it spread
> across 3 WAL files.
>
> BackTace of the backend :
>
> Program received signal SIGSEGV, Segmentation fault.
> ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
> commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
> optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
> /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
> 1380 change->action = REORDER_BUFFER_CHANGE_INSERT;
>
> #0 ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
> commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
> optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
> /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
Unfortunately this does not seem to provide enough information to debug
the problem. Could you develop a reproducer we can run locally?
- Andres
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Sudalai <sudalait2(at)gmail(dot)com> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2018-02-20 21:35:50 |
Message-ID: | d1354af3-d34a-75a2-a171-3ac981bce639@2ndquadrant.com |
Lists: | pgsql-bugs |
On 02/20/2018 06:55 PM, Andres Freund wrote:
> Hi,
>
> On 2018-02-19 05:20:14 -0700, Sudalai wrote:
>> PostgreSQL backend crashing with segmentation fault 11, while consuming big
>> transaction changes using pg_logical_slot_(peek|get)_changes. We are using
>> postgresql version 9.5.10.
>
> Just to be sure, could you check that 9.5.11 shows the same problem?
>
>> About the big transaction : pg_xlog_dump shows 161426 records, and it spread
>> across 3 WAL files.
>>
>> BackTace of the backend :
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
>> commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
>> optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
>> /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
This very much seems like a custom PostgreSQL build (judging by the
paths in the backtrace), so I wonder what configure options were used.
It's not clear whether this is a production or testing environment, but if you
could rebuild with debugging symbols / without optimizations, that would
be helpful (it would get rid of some of the "optimized out" stuff).
>> 1380 change->action = REORDER_BUFFER_CHANGE_INSERT;
>>
Hmmm, this would mean it's failing here:
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
/*
* Confirmation for speculative insertion arrived. Simply
* use as a normal record. It'll be cleaned up at the end
* of INSERT processing.
*/
Assert(specinsert->data.tp.oldtuple == NULL);
change = specinsert;
change->action = REORDER_BUFFER_CHANGE_INSERT;
And if it's segfaulting, it has to mean specinsert is NULL. So either we
never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it
away in the "change_done" part. Seems strange in both cases.
Sudalai, are you using speculative inserts in the transaction?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Sudalai <sudalait2(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org> |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2018-02-21 01:20:40 |
Message-ID: | CAH2-WzkXxqtEH8CpUrpSewFfiZFqyjrhjQ1nLeYb_n-kR_uU0Q@mail.gmail.com |
Lists: | pgsql-bugs |
On Tue, Feb 20, 2018 at 1:35 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> And if it's segfaulting, it has to mean specinsert is NULL. So either we
> never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it
> away in the "change_done" part. Seems strange in both cases.
>
> Sudalai, are you using speculative inserts in the transaction?
He means ON CONFLICT DO UPDATE. :-)
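For anyone unfamiliar, the speculative insertion path is reached by statements of
this shape (a made-up sketch, not taken from the reporter's workload):

-- any INSERT ... ON CONFLICT inserts the tuple speculatively first; the
-- insertion is confirmed once no conflicting row is found
CREATE TABLE upsert_demo (id int PRIMARY KEY, val text);
INSERT INTO upsert_demo (id, val) VALUES (1, 'x')
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;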
The speculative insertion case handling in reorderbuffer.c is likely
something that is hardly ever used, since speculative insertion
conflicts are presumably generally very rare. Though at the same time,
they're surely very common for certain workloads (such as my testing
script for ON CONFLICT DO UPDATE [1]).
[1] https://github.com/petergeoghegan/jjanes_upsert
--
Peter Geoghegan
From: | Sudalai <sudalait2(at)gmail(dot)com> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2018-02-21 07:18:58 |
Message-ID: | 1519197538224-0.post@n3.nabble.com |
Lists: | pgsql-bugs |
>>And if it's segfaulting, it has to mean specinsert is NULL. So either we
>>never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it
>>away in the "change_done" part. Seems strange in both cases.
Yes, specinsert is NULL.
>> Sudalai, are you using speculative inserts in the transaction?
Yes. We have used ON CONFLICT DO NOTHING.
Is it correct to add the check below,
if (specinsert == NULL) {
    goto change_done;
}
before Assert(specinsert->data.tp.oldtuple == NULL); to fix the segfault?
Is specinsert NULL because of ON CONFLICT DO NOTHING?
Thanks for the response, Tomas and Peter.
-----
sudalai
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Sudalai <sudalait2(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2018-02-21 14:31:04 |
Message-ID: | 9fe0a6de-d204-9067-4eba-dd686077dd89@2ndquadrant.com |
Lists: | pgsql-bugs |
On 02/21/2018 08:18 AM, Sudalai wrote:
>
>
>>> And if it's segfaulting, it has to mean specinsert is NULL. So either we
>>> never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it
>>> away in the "change_done" part. Seems strange in both cases.
>
> Yes, specinsert is NULL.
>
>>> Sudalai, are you using speculative inserts in the transaction?
> Yes . We have done ON CONFLICT DO NOTHING .
>
>
> Is it correct to add below check ,
> if(specinsert == NULL ){
> goto change_done;
> }
> before , Assert(specinsert->data.tp.oldtuple == NULL); to fix segfault ?
>
That seems unlikely to fix the root cause - more likely it's going to
mask it in some strange way. We need to understand why it does happen
(i.e. specinsert is not expected to be NULL here).
It would really help if you could provide a self-contained test case
demonstrating the issue, as mentioned by Andres.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | jfinzel <finzelj(at)gmail(dot)com> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-14 20:33:43 |
Message-ID: | 1550176423593-0.post@n3.nabble.com |
Lists: | pgsql-bugs |
We ran into a segmentation fault using peek_changes that appears identical
to what is described in this unfinished thread. We are running pg 10.3,
and the segfault was fixed for us by upgrading to 10.6. However, we could not
find any clearly related fix in the release notes for 10.4 through 10.6.
I did find this commit that I believe fixes the issue:
https://github.com/postgres/postgres/commit/bba8c612117416907f332fce8b9e80b748e0b798
If this indeed fixes a critical issue as we think it does, could someone
please add it in the right place to the release notes?
Would it be useful for us to provide any more diagnostics on how
we encountered this bug? I have a core dump and stack trace. We
encountered it while a peek_changes call was being run and fetched via a cursor.
The same peek_changes from the same LSN actually succeeds after recovery.
specinsert is null here:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000084a914 in ReorderBufferCommit (rb=0x1e236c0, xid=316524727,
commit_lsn=9825880784664, end_lsn=9825880787976,
commit_time=603381553004372, origin_id=0, origin_lsn=0) at
reorderbuffer.c:1395
1395 Assert(specinsert->data.tp.oldtuple == NULL);
Some interesting notes: we are doing INSERT ... ON CONFLICT DO UPDATE (we
are not using DO NOTHING), and the table of interest has a btree_gist index.
I am certain that the decoding of this SQL is somehow related to
this segfault.
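To give a rough picture, the workload looks something like this sketch (table,
column, and constraint names are invented for illustration, not our real schema):

CREATE EXTENSION IF NOT EXISTS btree_gist;
-- an upsert target that also carries a gist exclusion constraint
CREATE TABLE upsert_target (
    id   int PRIMARY KEY,
    span tsrange,
    val  text,
    EXCLUDE USING gist (id WITH =, span WITH &&)
);
INSERT INTO upsert_target (id, span, val)
VALUES (1, tsrange(now()::timestamp, NULL), 'x')
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;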
Another note of interest is that we have been decoding using pglogical from
this system for some time with no issues, so this is definitely unique to
the use of the peek_changes function.
Please advise, thanks!
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | jfinzel <finzelj(at)gmail(dot)com> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-14 22:26:55 |
Message-ID: | 2527.1550183215@sss.pgh.pa.us |
Lists: | pgsql-bugs |
jfinzel <finzelj(at)gmail(dot)com> writes:
> We ran into a segmentation fault using peek_changes that appears identical
> with what is described in this unfinished thread. We are running pg 10.3,
> and the segfault was fixed by us upgrading to 10.6. However, we could not
> find any clearly related fixes in any of the release notes for 10.4-6
> relating to this bug fix. I did find this commit that I believe fixes the
> issue:
> https://github.com/postgres/postgres/commit/bba8c612117416907f332fce8b9e80b748e0b798
> If this indeed fixes a critical issue as we think it does, could someone
> please add it in the right place to the release notes?
You'd have to provide some evidence that that commit actually had
something to do with a user-visible problem. It certainly wouldn't,
by itself, have completely fixed any problem: at best it'd have
transformed it from a crash into an elog(ERROR). More to the point,
that commit was intended to silence a probably-hypothetical-anyway
warning about the prior commit, so it wasn't fixing any issue that
ever saw the light of day in a PG release.
So my bet is that your problem was fixed by some other commit between
10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
without more investigation than seems warranted, if the bug's gone.
regards, tom lane
From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | jfinzel <finzelj(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org> |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-14 22:28:54 |
Message-ID: | CAH2-Wz=2huFPbkmmMqyJfoB-tXbADdBEqs8byoQDSHDWzH65wg@mail.gmail.com |
Lists: | pgsql-bugs |
On Thu, Feb 14, 2019 at 2:27 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> So my bet is that your problem was fixed by some other commit between
> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
> without more investigation than seems warranted, if the bug's gone.
I wonder if jfinzel is in a position to do a git-bisect. That might be
interesting.
--
Peter Geoghegan
From: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-14 22:42:04 |
Message-ID: | CAMa1XUiPZVyxYe2FTr4RVd8DPrb-h1_2CLBG1ND82wKk4eSvPg@mail.gmail.com |
Lists: | pgsql-bugs |
On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> jfinzel <finzelj(at)gmail(dot)com> writes:
>
> You'd have to provide some evidence that that commit actually had
> something to do with a user-visible problem. It certainly wouldn't,
> by itself, have completely fixed any problem: at best it'd have
> transformed it from a crash into an elog(ERROR). More to the point,
> that commit was intended to silence a probably-hypothetical-anyway
> warning about the prior commit, so it wasn't fixing any issue that
> ever saw the light of day in a PG release.
>
And indeed I have not seen such error messages so I suspect you are right
there.
> So my bet is that your problem was fixed by some other commit between
> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
> without more investigation than seems warranted, if the bug's gone.
>
> regards, tom lane
>
I am willing to put in more time debugging this because we want to know
which clusters are actually susceptible to the bug. Any suggestions to
proceed are welcome.
Thanks,
Jeremy
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-14 23:05:00 |
Message-ID: | 4354.1550185500@sss.pgh.pa.us |
Lists: | pgsql-bugs |
Jeremy Finzel <finzelj(at)gmail(dot)com> writes:
> On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> So my bet is that your problem was fixed by some other commit between
>> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
>> without more investigation than seems warranted, if the bug's gone.
> I am willing to put in more time debugging this because we want to know
> which clusters are actually susceptible to the bug. Any suggestions to
> proceed are welcome.
Well, as Peter said, "git bisect" and trying to reproduce the problem
at each step would be the way to prove it definitively. Seems mighty
tedious though. Possibly you could shave some time off the process
by assuming it must have been one of the commits that touched
reorderbuffer.c ... a quick check says there have been ten of those
in the v10 branch since 10.3.
regards, tom lane
From: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-18 20:06:50 |
Message-ID: | CAMa1XUjoL_DhD4AWz+MRZYYqjXt=vNcUhuy=JPK2LKS3boVqGQ@mail.gmail.com |
Lists: | pgsql-bugs |
>
> Well, as Peter said, "git bisect" and trying to reproduce the problem
> at each step would be the way to prove it definitively. Seems mighty
> tedious though. Possibly you could shave some time off the process
> by assuming it must have been one of the commits that touched
> reorderbuffer.c ... a quick check says there have been ten of those
> in the v10 branch since 10.3.
>
Update:
- I definitely got the same segfault on a commit after 10.4 - 0bb28ca
- I am now getting a different segfault on 10.5 - but I need another set
of eyes to verify I am not compiling it wrong
After decoding successfully for a while, I now get an immediate segfault
upon peek_changes. First of all, here is the backtrace:
$ sudo -u postgres gdb -q -c /san/<cluster>/pgdata/core
/usr/lib/postgresql/10.5/bin/postgres
Reading symbols from /usr/lib/postgresql/10.5/bin/postgres...done.
[New LWP 22699]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: <cluster>: jfinzel foo_db
10.7.111.37(52316) FETCH'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007eff42d5602a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x0000000000a45f9c in ExceptionalCondition (conditionName=0xc2d688
"!(prev_first_lsn < cur_txn->first_lsn)", errorType=0xc2d404
"FailedAssertion", fileName=0xc2d478 "reorderbuffer.c", lineNumber=688) at
assert.c:54
#3 0x000000000084b0ac in AssertTXNLsnOrder (rb=0x28ed790) at
reorderbuffer.c:688
#4 0x000000000084ab97 in ReorderBufferTXNByXid (rb=0x28ed790,
xid=319299822, create=1 '\001', is_new=0x0, lsn=9888781386112,
create_as_top=1 '\001') at reorderbuffer.c:567
#5 0x000000000084d86c in ReorderBufferAddNewTupleCids (rb=0x28ed790,
xid=319299822, lsn=9888781386112, node=..., tid=..., cmin=2,
cmax=4294967295, combocid=4294967295) at reorderbuffer.c:2053
#6 0x00000000008522b6 in SnapBuildProcessNewCid (builder=0x28f57c0,
xid=319299827, lsn=9888781386112, xlrec=0x2821c08) at snapbuild.c:780
#7 0x000000000083f280 in DecodeHeap2Op (ctx=0x28dd720, buf=0x7ffc5b73e2d0)
at decode.c:371
#8 0x000000000083ebb1 in LogicalDecodingProcessRecord (ctx=0x28dd720,
record=0x28dd9e0) at decode.c:121
#9 0x0000000000844f86 in pg_logical_slot_get_changes_guts
(fcinfo=0x7ffc5b73e600, confirm=0 '\000', binary=0 '\000') at
logicalfuncs.c:308
#10 0x000000000084514d in pg_logical_slot_peek_changes
(fcinfo=0x7ffc5b73e600) at logicalfuncs.c:381
#11 0x00000000006f7973 in ExecMakeTableFunctionResult (setexpr=0x28265b8,
econtext=0x28262b0, argContext=0x28b4af0, expectedDesc=0x28d1d20,
randomAccess=4 '\004') at execSRF.c:231
#12 0x000000000070a870 in FunctionNext (node=0x2826198) at
nodeFunctionscan.c:94
#13 0x00000000006f6f6e in ExecScanFetch (node=0x2826198, accessMtd=0x70a7b9
<FunctionNext>, recheckMtd=0x70aba1 <FunctionRecheck>) at execScan.c:97
#14 0x00000000006f6fdd in ExecScan (node=0x2826198, accessMtd=0x70a7b9
<FunctionNext>, recheckMtd=0x70aba1 <FunctionRecheck>) at execScan.c:147
#15 0x000000000070abef in ExecFunctionScan (pstate=0x2826198) at
nodeFunctionscan.c:270
#16 0x00000000006f541a in ExecProcNodeFirst (node=0x2826198) at
execProcnode.c:430
#17 0x00000000006ed5af in ExecProcNode (node=0x2826198) at
../../../src/include/executor/executor.h:250
#18 0x00000000006effaf in ExecutePlan (estate=0x2825f80,
planstate=0x2826198, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001', numberTuples=2000, direction=ForwardScanDirection,
dest=0x27ffc78, execute_once=0 '\000') at execMain.c:1722
#19 0x00000000006edbc4 in standard_ExecutorRun (queryDesc=0x2825130,
direction=ForwardScanDirection, count=2000, execute_once=0 '\000') at
execMain.c:363
#20 0x00000000006ed9de in ExecutorRun (queryDesc=0x2825130,
direction=ForwardScanDirection, count=2000, execute_once=0 '\000') at
execMain.c:306
#21 0x00000000008d0dd7 in PortalRunSelect (portal=0x27f70a8, forward=1
'\001', count=2000, dest=0x27ffc78) at pquery.c:932
#22 0x00000000008d200f in DoPortalRunFetch (portal=0x27f70a8,
fdirection=FETCH_FORWARD, count=2000, dest=0x27ffc78) at pquery.c:1675
#23 0x00000000008d19df in PortalRunFetch (portal=0x27f70a8,
fdirection=FETCH_FORWARD, count=2000, dest=0x27ffc78) at pquery.c:1434
#24 0x00000000006833bb in PerformPortalFetch (stmt=0x2888570,
dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at portalcmds.c:199
#25 0x00000000008d2ab6 in standard_ProcessUtility (pstmt=0x28888d0,
queryString=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x27ffc78,
completionTag=0x7ffc5b73f0f0 "") at utility.c:527
#26 0x00007eff42829eb6 in pglogical_ProcessUtility (pstmt=0x28888d0,
queryString=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x27ffc78,
completionTag=0x7ffc5b73f0f0 "") at pglogical_executor.c:279
#27 0x00000000008d2547 in ProcessUtility (pstmt=0x28888d0,
queryString=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x27ffc78,
completionTag=0x7ffc5b73f0f0 "") at utility.c:353
#28 0x00000000008d141b in PortalRunUtility (portal=0x27f6f90,
pstmt=0x28888d0, isTopLevel=1 '\001', setHoldSnapshot=1 '\001',
dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at pquery.c:1178
#29 0x00000000008d1119 in FillPortalStore (portal=0x27f6f90, isTopLevel=1
'\001') at pquery.c:1038
#30 0x00000000008d09a1 in PortalRun (portal=0x27f6f90,
count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001',
dest=0x28889c8, altdest=0x28889c8, completionTag=0x7ffc5b73f350 "") at
pquery.c:768
#31 0x00000000008c9f67 in exec_simple_query (query_string=0x2887b30 "FETCH
FORWARD 2000 FROM crash_dude;") at postgres.c:1099
#32 0x00000000008cea3c in PostgresMain (argc=1, argv=0x2804e50,
dbname=0x2804e28 "<foo_db>", username=0x2804e08 "jfinzel") at
postgres.c:4088
#33 0x000000000082369b in BackendRun (port=0x2801170) at postmaster.c:4405
#34 0x0000000000822d02 in BackendStartup (port=0x2801170) at
postmaster.c:4077
#35 0x000000000081ee31 in ServerLoop () at postmaster.c:1755
#36 0x000000000081e2d9 in PostmasterMain (argc=3, argv=0x27d79a0) at
postmaster.c:1363
#37 0x0000000000751669 in main (argc=3, argv=0x27d79a0) at main.c:228
Here is the compile script I used to build and run 10.5 (at commit
4191e37a9a1fb598267c445c717914012d9bc423). The cluster with said
issue uses the extensions compiled below as well:
$ cat make_postgres
#!/bin/bash
set -eu
dirname=$1
instdir=/usr/lib/postgresql/$dirname
# Install Postgres
export PATH=$instdir/bin:/home/jfinzel/bin:/home/jfinzel/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
sudo mkdir $instdir
# This is my directory with source code from git commit 4191e37a9a1fb598267c445c717914012d9bc423
cd ~/postgres_source/postgres
./configure --prefix=$instdir --enable-cassert --enable-debug CFLAGS="-ggdb -g3 -fno-omit-frame-pointer -fPIC"
make
sudo "PATH=$PATH" make install
# Contrib
cd contrib/btree_gist/
sudo "PATH=$PATH" make install
cd ../test_decoding/
sudo "PATH=$PATH" make install
# Install Pglogical
cd /usr/src/pglogical-2.2.1
sudo "PATH=$PATH" make clean
sudo "PATH=$PATH" make install
# Install Extensions
cd $HOME/pgl_ddl_deploy
make clean
sudo "PATH=$PATH" make install
cd $HOME/pglogical_ticker
make clean
sudo "PATH=$PATH" make install
cd $HOME/pg_fact_loader
make clean
sudo "PATH=$PATH" make install
$ ./make_postgres 10.5
Thanks!
Jeremy
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-18 20:15:47 |
Message-ID: | 20190218201547.pv53o6aqeq675ehf@alap3.anarazel.de |
Lists: | pgsql-bugs |
Hi,
> - I definitely got the same segfault on a commit after 10.4 - 0bb28ca
> - I am now getting a different segfault on 10.5 - but I need another set
> of eyes to verify I am not compiling it wrong
>
> After decoding successfully for awhile, now I get an immediate segfault
> upon peek_changes. First of all, here is the backtrace:
But to confirm, with 10.6 that's gone?
> $ sudo -u postgres gdb -q -c /san/<cluster>/pgdata/core
> /usr/lib/postgresql/10.5/bin/postgres
> Reading symbols from /usr/lib/postgresql/10.5/bin/postgres...done.
> [New LWP 22699]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `postgres: <cluster>: jfinzel foo_db
> 10.7.111.37(52316) FETCH'.
> Program terminated with signal SIGABRT, Aborted.
Note that this is not a segfault, it's an assertion that failed.
> #0 0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> (gdb) bt
> #0 0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1 0x00007eff42d5602a in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2 0x0000000000a45f9c in ExceptionalCondition (conditionName=0xc2d688
> "!(prev_first_lsn < cur_txn->first_lsn)", errorType=0xc2d404
> "FailedAssertion", fileName=0xc2d478 "reorderbuffer.c", lineNumber=688) at
> assert.c:54
If it's not gone with 10.6, could you print the two variables in that
expression?
> # Install Extensions
> cd $HOME/pgl_ddl_deploy
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pglogical_ticker
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pg_fact_loader
> make clean
> sudo "PATH=$PATH" make install
It's entirely possible that one of those extensions does something to
violate WAL logging rules :/
Greetings,
Andres Freund
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-18 20:19:40 |
Message-ID: | 20190218201940.ykvbkutewsirxtcf@alap3.anarazel.de |
Lists: | pgsql-bugs |
On 2019-02-18 12:15:47 -0800, Andres Freund wrote:
> But to confirm, with 10.6 that's gone?
*10.7
From: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-18 20:35:52 |
Message-ID: | CAMa1XUj4jUAxOMS=VNunhWuzo68CvPnW1fUhR+OrHh1omgCeyg@mail.gmail.com |
Lists: | pgsql-bugs |
>
> But to confirm, with 10.6 that's gone?
>
We now have production running on 10.6 (including all those extensions),
and have seen no more issues at all. So I am assuming it must be fixed in
10.6 at this point.
> If it's not gone with 10.6, could you print the two variables in that
> expression?
>
Well it is gone with 10.6 I think, but here is what you requested on my
10.5 version:
(gdb) frame 2
#2 0x0000000000a45f9c in ExceptionalCondition (conditionName=0xc2d688
"!(prev_first_lsn < cur_txn->first_lsn)", errorType=0xc2d404
"FailedAssertion", fileName=0xc2d478 "reorderbuffer.c", lineNumber=688) at
assert.c:54
54 abort();
(gdb) up
#3 0x000000000084b0ac in AssertTXNLsnOrder (rb=0x2775790) at
reorderbuffer.c:688
688 Assert(prev_first_lsn < cur_txn->first_lsn);
(gdb) print prev_first_lsn
$2 = 9888781386112
(gdb) print cur_txn->first_lsn
$3 = 9888781386112
> > # Install Extensions
> > cd $HOME/pgl_ddl_deploy
> > make clean
> > sudo "PATH=$PATH" make install
> > cd $HOME/pglogical_ticker
> > make clean
> > sudo "PATH=$PATH" make install
> > cd $HOME/pg_fact_loader
> > make clean
> > sudo "PATH=$PATH" make install
>
> It's entirely possible that one of those extensions does something to
> violate WAL logging rules :/
>
But then I would expect to still have a problem in our production instance
running 10.6, which I don't.
Thanks,
Jeremy
From: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-18 21:16:44 |
Message-ID: | CAMa1XUizNiEKxrvrhuJP2E3zn9b6WpsquoBLAXO6oQ752ySULA@mail.gmail.com |
Lists: | pgsql-bugs |
>
> We now have production running on 10.6 (including all those extensions),
> and have seen no more issues at all. So I am assuming it must be fixed in
> 10.6 at this point.
>
Somehow I am getting this crash on my compiled 10.6, but I don't get it on
the packaged 10.6. :(
I've tried this a number of times, so I'm not sure how worthwhile it is to try
to get to the bottom of the compile difference. But I would like to
understand how the compiled version is different. Really, the only
difference should be the compile options used for postgres.
Thanks,
Jeremy
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-18 21:19:33 |
Message-ID: | 20190218211933.GA32698@alvherre.pgsql |
Lists: | pgsql-bugs |
On 2019-Feb-18, Jeremy Finzel wrote:
> >
> > We now have production running on 10.6 (including all those extensions),
> > and have seen no more issues at all. So I am assuming it must be fixed in
> > 10.6 at this point.
>
> Somehow I am getting this crash on my compiled 10.6, but I don't get it on
> packaged 10.6. :(
Maybe because the packaged version doesn't have assertions enabled?
(configure --enable-cassert)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-18 21:48:05 |
Message-ID: | CAMa1XUiT10038jS8hYcjMtxkXzcC783G2VoWetUGOWjsv=P+qg@mail.gmail.com |
Lists: | pgsql-bugs |
>
> Maybe because the packaged version doesn't have assertions enabled?
> (configure --enable-cassert)
>
Of course! I am getting a failed assertion on 10.6 then. Is that a
concern? Should I try 10.7?
Thanks,
Jeremy
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-18 21:51:39 |
Message-ID: | 20190218215139.j35vdbh2leio2txz@alap3.anarazel.de |
Lists: | pgsql-bugs |
Hi,
On 2019-02-18 15:48:05 -0600, Jeremy Finzel wrote:
> >
> > Maybe because the packaged version doesn't have assertions enabled?
> > (configure --enable-cassert)
> >
>
> Of course! I am getting a failed assertion on 10.6 then. Is that a
> concern? Should I try 10.7?
Yes, please do. And if that fails, is it realistic to test without those
additional extensions? If not, where's their source?
Greetings,
Andres Freund
From: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-19 15:32:15 |
Message-ID: | CAMa1XUiAGtcmbAWJNXutbfeJUNSF+X9jYhuNjxev3PDhUDbvUw@mail.gmail.com |
Lists: | pgsql-bugs |
>
> Yes, please do. And if that fails, is is realistic to test without those
> additional extensions? If not, where's their source?
>
So yes, I just reproduced the same assert abort on 10.7, so the issue is
still there.
I still need pg_fact_loader, but I was able to strip out any custom C code
from the test and I am still able to produce the crash. I dropped
extensions pgl_ddl_deploy and pglogical_ticker. pg_fact_loader can
optionally use a pg background worker, but I am bypassing that and thus
only running plpgsql functions, and still producing the crash. This makes
me doubt that it could have anything to do with these extensions.
If interested, here is the source for them:
github.com/enova/pgl_ddl_deploy
github.com/enova/pglogical_ticker
github.com/enova/pg_fact_loader
I want to see if I can reproduce the crash on a fresh cluster. Is there
anything else you want me to try at this point?
Thanks,
Jeremy
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-20 02:09:11 |
Message-ID: | 20190220020911.GA15532@paquier.xyz |
Lists: | pgsql-bugs |
On Tue, Feb 19, 2019 at 09:32:15AM -0600, Jeremy Finzel wrote:
> I still need pg_fact_loader, but I was able to strip out any custom C code
> from the test and I am still able to produce the crash. I dropped
> extensions pgl_ddl_deploy and pglogical_ticker. pg_fact_loader can
> optionally use a pg background worker, but I am bypassing that and thus
> only running plpgsql functions, and still producing the crash. This makes
> me doubt that it could have anything to do with these extensions.
The best thing to do would be to post on this list a set of SQL queries
that allows the problem to be reproduced.
--
Michael
From: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-20 14:24:59 |
Message-ID: | CAMa1XUgC8WrsSSBj_C6W3J+EC5PTnPEB+jNf4wzmDLT0xywWSw@mail.gmail.com |
Lists: | pgsql-bugs |
>
> The best thing which could be done is to post a set of SQL queries
> on this list which allows to reproduce the problem.
>
I am trying, but I am not sure how to do that just yet. However, I can
reliably reproduce the error by letting this system run, which means I have the
ability to attach a debugger to see what is happening more deeply. I can
also provide exactly what data in the slot is leading to the failed assert.
Would any of this help to narrow down the problem?
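For instance, I could capture the slot's position right before the failing FETCH
with something along these lines (a sketch; the slot name is just a placeholder):

SELECT slot_name, plugin, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'my_slot';  -- placeholder name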
Thanks,
Jeremy
From: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-02-27 14:34:22 |
Message-ID: | CAMa1XUgMLqVEf_5U-CWRdbGGKy4W_=CBDzgMsRmGpP4wbey_hg@mail.gmail.com |
Lists: | pgsql-bugs |
On Thu, Feb 14, 2019 at 5:05 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeremy Finzel <finzelj(at)gmail(dot)com> writes:
> > On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> So my bet is that your problem was fixed by some other commit between
> >> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
> >> without more investigation than seems warranted, if the bug's gone.
>
> > I am willing to put in more time debugging this because we want to know
> > which clusters are actually susceptible to the bug. Any suggestions to
> > proceed are welcome.
>
> Well, as Peter said, "git bisect" and trying to reproduce the problem
> at each step would be the way to prove it definitively. Seems mighty
> tedious though. Possibly you could shave some time off the process
> by assuming it must have been one of the commits that touched
> reorderbuffer.c ... a quick check says there have been ten of those
> in the v10 branch since 10.3.
>
> regards, tom lane
>
I am slowly continuing to pursue this, and I hit a slightly different
actual segfault on commit 4346794, still in reorderbuffer.c, this time just
2 lines below the original.
The first segfault I was getting was on:
Assert(specinsert->data.tp.oldtuple == NULL);
Now it's on this:
change->action = REORDER_BUFFER_CHANGE_INSERT;
I only have 3 more steps to test, and then I should know exactly which commit
fixed the segfault possibility. I still don't know precisely what can
reproduce the crash.
Thanks,
Jeremy
From: | Jeremy Finzel <finzelj(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Segmentation Fault in logical decoding get/peek API |
Date: | 2019-03-04 14:28:36 |
Message-ID: | CAMa1XUivSJ2idopJUhMi-CWundycnH4OdwWOiZ_D+12BvMLObg@mail.gmail.com |
Lists: | pgsql-bugs |
On Wed, Feb 27, 2019 at 8:34 AM Jeremy Finzel <finzelj(at)gmail(dot)com> wrote:
> On Thu, Feb 14, 2019 at 5:05 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Jeremy Finzel <finzelj(at)gmail(dot)com> writes:
>> > On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >> So my bet is that your problem was fixed by some other commit between
>> >> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
>> >> without more investigation than seems warranted, if the bug's gone.
>>
>> > I am willing to put in more time debugging this because we want to know
>> > which clusters are actually susceptible to the bug. Any suggestions to
>> > proceed are welcome.
>>
>> Well, as Peter said, "git bisect" and trying to reproduce the problem
>> at each step would be the way to prove it definitively. Seems mighty
>> tedious though. Possibly you could shave some time off the process
>> by assuming it must have been one of the commits that touched
>> reorderbuffer.c ... a quick check says there have been ten of those
>> in the v10 branch since 10.3.
>>
>> regards, tom lane
>
>
I have confirmed which commit fixed the segfault:
cee1dd1eeda1e7b86b78f240d24bbfde21d75928 is the first fixed commit
commit cee1dd1eeda1e7b86b78f240d24bbfde21d75928
Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Date: Tue Mar 6 15:57:20 2018 -0300
Refrain from duplicating data in reorderbuffers
If a walsender exits leaving data in reorderbuffers, the next walsender
that tries to decode the same transaction would append its decoded data
in the same spill files without truncating it first, which effectively
duplicate the data. Avoid that by removing any leftover reorderbuffer
spill files when a walsender starts.
Backpatch to 9.4; this bug has been there from the very beginning of
logical decoding.
Author: Craig Ringer, revised by me
Reviewed by: Álvaro Herrera, Petr Jelínek, Masahiko Sawada
I'm very curious whether anyone more familiar with the code base understands how
the old code was vulnerable to a segfault but the new code is not. I don't
think a walsender was exiting at all in my situation, so I suspect this
code change accidentally fixed a potential segfault unrelated to its stated
purpose of handling an exiting walsender that leaves data in reorderbuffers.
I still can't explain precisely what reproduces it. As I have said before,
some unique things about my setup are lots of INSERT ... ON CONFLICT DO UPDATE
queries, as well as btree_gist exclusion indexes, but I suspect this has to
do with the former somehow, although I haven't been able to create a
reliable reproduction of the crash.
I suggest at least adding something about this to the 10.4 release notes
(and the corresponding releases on the other branches).
Here is the current release note:
- In logical decoding, avoid possible double processing of WAL data when
a walsender restarts (Craig Ringer)
Perhaps some addition like this:
- In logical decoding, avoid possible double processing of WAL data when
a walsender restarts (Craig Ringer)
This change also had the effect of fixing a potential segfault
vulnerability in the logical decoding peek/get_changes API, in
ReorderBufferCommit.
Thanks,
Jeremy