Quick Links

Re: TOAST table repeatedly corrupted

Lists:	pgsql-bugs

From:	Niles Oien <noien(at)nso(dot)edu>
To:	pgsql-bugs(at)postgresql(dot)org
Subject:	TOAST table repeatedly corrupted
Date:	2018-05-09 18:54:17
Message-ID:	CANQ3m6Nhq8VqtZ8SnC3ZLVdQf_1Upb1R+hsPpbFsS7fJOw+Jqg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	Postg토토 커뮤니티SQL : Postg토토 커뮤니티SQL 메일 링리스트 : 2018-05-09 이후 PGSQL-BUGS 18:54

I have what I think is a bug in postgres. I'd like to help track it down,
if that
is indeed what's going on.

I am running a reasonably recent version of postgres :

data=# select version();
version

----------------------------------------------------------------------------------------------------------
PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-17), 64-bit
(1 row)

About every six months, the logs show that vacuuming of
one of our tables (the same table every time) starts to fail :

grep ERROR postgresql-09.csv

2018-05-09 16:14:03.834 GMT,,,27018,,5af31e4b.698a,1,,2018-05-09 16:14:03
GMT,12/611211,0,ERROR,XX001,"invalid page in block 1374551 of relation
base/16384/36298640",,,,,"automatic vacuum of table
""data.pg_toast.pg_toast_36298637""",,,,""
2018-05-09 16:15:03.865 GMT,,,27696,,5af31e87.6c30,1,,2018-05-09 16:15:03
GMT,12/611219,0,ERROR,XX001,"invalid page in block 1374551 of relation
base/16384/36298640",,,,,"automatic vacuum of table
""data.pg_toast.pg_toast_36298637""",,,,""
2018-05-09 16:16:03.897 GMT,,,31666,,5af31ec3.7bb2,1,,2018-05-09 16:16:03
GMT,10/1330957,0,ERROR,XX001,"invalid page in block 1374551 of relation
base/16384/36298640",,,,,"automatic vacuum of table
""data.pg_toast.pg_toast_36298637""",,,,""

So it looks like a TOAST table has been corrupted. I can see which
table the TOAST table is supporting :

data=# select relname from pg_class where reltoastrelid =
'pg_toast.pg_toast_36298637'::regclass;
relname
---------------
rdvtrack_fd05
(1 row)

And sure enough, I now cannot dump that table :

linux> pg_dump -p5432 -Upostgres -t hmi.rdvtrack_fd05 data >
hmi_rdvtrack_fd05.sql
pg_dump: Dumping the contents of table "rdvtrack_fd05" failed:
PQgetResult() failed.
pg_dump: Error message from server: ERROR: compressed data is corrupted
pg_dump: The command was: COPY hmi.rdvtrack_fd05 (recnum, sunum, slotnum,
sessionid, sessionns, cparms_sg000, v_bzero, v_bscale, carrrot, cmlon,
lonhg, lathg, loncm, midtime, crval1, crval2, crval3, t_start, t_stop,
lonspan, coverage, zonalvel, mai, module, source, input, created, bld_vers,
backgrnd, rejectlist, cmlon_index, loncm_index, lathg_index, lonhg_index,
sg_000_file, sg_001_file, history, datavals, missvals, datamin, datamax,
datamean, datarms, dataskew, datakurt, meanmu, meanlat) TO stdout;

I think I can see the disk files where the tables are stored :

data=# select pg_relation_filepath('hmi.rdvtrack_fd05');
pg_relation_filepath
----------------------
base/16384/36298637
(1 row)

data=# select pg_relation_filepath('pg_toast.pg_toast_36298637');
pg_relation_filepath
----------------------
base/16384/36298640
(1 row)

I can run pg_filedump on those two files, in both cases all
the flags seem unremarkable (although I may not be knowing
what to look for) and it runs to the end of the file, giving
me a message like :

*** End of File Encountered. Last Block Read: 131071 ***

I think this is probably a bug? Every time it happens
it affects the same table, hmi.rdvtrack_fd05. The hardware
we're running on is fairly new, disks and memory are
not throwing errors, and we've never had problems with any
other tables. We're running on CentOS 6.8 - again, not brand
new but hardly decrepit, either. I've rebuilt the table
several times. This time I'd like to give the PG community
a chance to chase it down before I do that.

I've put the two data files, the postgres config and
essentially what I've said here online at :

http://netdrms01.nispdc.nso.edu/pgIssue/36298640
http://netdrms01.nispdc.nso.edu/pgIssue/36298637
http://netdrms01.nispdc.nso.edu/pgIssue/postgresql.conf
http://netdrms01.nispdc.nso.edu/pgIssue/PG_ISSUE.txt

If it helps, the TOAST table looks like this :

data=# \d pg_toast.pg_toast_36298637
TOAST table "pg_toast.pg_toast_36298637"
Column | Type
------------+---------
chunk_id | oid
chunk_seq | integer
chunk_data | bytea

And the table the TOAST table is supporting looks like this :

Niles Oien, noien(at)nso(dot)edu

--
Niles Oien, National Solar Observatory, Boulder Colorado USA

From:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To:	Niles Oien <noien(at)nso(dot)edu>
Cc:	"pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject:	TOAST table repeatedly corrupted
Date:	2018-05-09 19:31:54
Message-ID:	CAKFQuwZh5wxzhmq_fD4Q8kV8w=rm8UUTJAuvOCSrqjwSqZyGAQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	Postg젠 토토SQL : Postg젠 토토SQL 메일 링리스트 : 2018-05-09 이후 PGSQL-BUGS 19:31

On Wednesday, May 9, 2018, Niles Oien <noien(at)nso(dot)edu> wrote:

>
> I have what I think is a bug in postgres. I'd like to help track it down,
> if that
> is indeed what's going on.
>
> I am running a reasonably recent version of postgres :
>
> data=# select version();
> version
>
> ------------------------------------------------------------
> ----------------------------------------------
> PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
> 20120313 (Red Hat 4.4.7-17), 64-bit
> (1 row)
>
>
9.5.12 is reasonably recent. 9.5.5 is not. Even if you found a live bug
you'd have to upgrade to fix it so you should just do that now and see if
that works. You can scan the release notes for 9.5.6 and newer to see
whether any vacuum related fixes seem to apply.

David J.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Niles Oien <noien(at)nso(dot)edu>
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: TOAST table repeatedly corrupted
Date:	2018-05-09 20:47:11
Message-ID:	15116.1525898831@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

Niles Oien <noien(at)nso(dot)edu> writes:
> I am running a reasonably recent version of postgres :
> PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
> 20120313 (Red Hat 4.4.7-17), 64-bit

As David said, that's not terribly recent. If you are going to upgrade,
I'd suggest waiting till tomorrow and grabbing 9.5.13, because we fixed
a pretty serious TOAST data corruption bug in this week's batch of
releases. The expected symptoms of it don't match what you're seeing,
unfortunately, but nonetheless you ought to be using the latest, just
in case this is an already-fixed issue.

> 2018-05-09 16:14:03.834 GMT,,,27018,,5af31e4b.698a,1,,2018-05-09 16:14:03
> GMT,12/611211,0,ERROR,XX001,"invalid page in block 1374551 of relation
> base/16384/36298640",,,,,"automatic vacuum of table
> ""data.pg_toast.pg_toast_36298637""",,,,""

Block 1374551 would be well past the first segment of the file, since
in a standard build (1GB segments, 8K blocks) there are only 131072
pages per segment. This explains why you didn't see any complaints
from pg_filedump, if you only ran it over the first segment.

If you've not clobbered the DB yet, file 36298640.10 would be what
to look at, I believe.

> And sure enough, I now cannot dump that table :
> pg_dump: Error message from server: ERROR: compressed data is corrupted

That's interesting, because it seems to indicate an independent problem.
The "invalid page" error indicates a bad page header, or possibly a
page checksum failure; either way the page would not have been allowed
into the buffer pool. But "compressed data is corrupted" implies that
we did read a page but the data in it seems messed up. So this evidence
says you have at least two different corrupted places in that table.

Do you have checksums enabled in this installation? If you're going
to have to rebuild it, you should probably turn those on (use
initdb --data-checksums), in hopes of narrowing down what's happening.

> I think this is probably a bug? Every time it happens
> it affects the same table, hmi.rdvtrack_fd05.

That's mighty suggestive all right, but unfortunately it doesn't
do much to narrow down the problem :-(

regards, tom lane

From:	Niles Oien <noien(at)nso(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: TOAST table repeatedly corrupted
Date:	2018-05-09 21:34:45
Message-ID:	CANQ3m6OKUsPZ9c-=5hRBi7_CCVNDy+ED+mJOdwFCPq8u=2nNGA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

Thanks. I don't have checksums on. I'll look into it on the next build.

The file 36298640.10 didn't show anything unusual under pg_filedump.

I'm betting that we are suffering from a now-fixed TOAST issue, if not the
recently fixed one you mentioned. That's probably all the chasing that's
worth doing here given the dated nature of our production box. On our
development box, where we have some room to move, we're running something a
bit newer :

PostgreSQL 10.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-16), 64-bit

Does that need an upgrade to get this week's TOAST fixes, too? I'm not sure
if CentOS's 'yum upgrade' will pick it up - I have the repo pgdg10/7/x86_64
enabled, will the update show up that way?

Thanks,

Niles.

On Wed, May 9, 2018 at 2:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Niles Oien <noien(at)nso(dot)edu> writes:
> > I am running a reasonably recent version of postgres :
> > PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
> > 20120313 (Red Hat 4.4.7-17), 64-bit
>
> As David said, that's not terribly recent. If you are going to upgrade,
> I'd suggest waiting till tomorrow and grabbing 9.5.13, because we fixed
> a pretty serious TOAST data corruption bug in this week's batch of
> releases. The expected symptoms of it don't match what you're seeing,
> unfortunately, but nonetheless you ought to be using the latest, just
> in case this is an already-fixed issue.
>
> > 2018-05-09 16:14:03.834 GMT,,,27018,,5af31e4b.698a,1,,2018-05-09
> 16:14:03
> > GMT,12/611211,0,ERROR,XX001,"invalid page in block 1374551 of relation
> > base/16384/36298640",,,,,"automatic vacuum of table
> > ""data.pg_toast.pg_toast_36298637""",,,,""
>
> Block 1374551 would be well past the first segment of the file, since
> in a standard build (1GB segments, 8K blocks) there are only 131072
> pages per segment. This explains why you didn't see any complaints
> from pg_filedump, if you only ran it over the first segment.
>
> If you've not clobbered the DB yet, file 36298640.10 would be what
> to look at, I believe.
>
> > And sure enough, I now cannot dump that table :
> > pg_dump: Error message from server: ERROR: compressed data is corrupted
>
> That's interesting, because it seems to indicate an independent problem.
> The "invalid page" error indicates a bad page header, or possibly a
> page checksum failure; either way the page would not have been allowed
> into the buffer pool. But "compressed data is corrupted" implies that
> we did read a page but the data in it seems messed up. So this evidence
> says you have at least two different corrupted places in that table.
>
> Do you have checksums enabled in this installation? If you're going
> to have to rebuild it, you should probably turn those on (use
> initdb --data-checksums), in hopes of narrowing down what's happening.
>
> > I think this is probably a bug? Every time it happens
> > it affects the same table, hmi.rdvtrack_fd05.
>
> That's mighty suggestive all right, but unfortunately it doesn't
> do much to narrow down the problem :-(
>
> regards, tom lane
>

--
Niles Oien, National Solar Observatory, Boulder Colorado USA

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Niles Oien <noien(at)nso(dot)edu>
Cc:	PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject:	Re: TOAST table repeatedly corrupted
Date:	2018-05-09 22:15:07
Message-ID:	CAH2-Wz=NhNb4oEPa8BjVME6HkFcFk-3b8nc_QPwyBsX8aE=j1Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

On Wed, May 9, 2018 at 11:54 AM, Niles Oien <noien(at)nso(dot)edu> wrote:
> I think this is probably a bug? Every time it happens
> it affects the same table, hmi.rdvtrack_fd05. The hardware
> we're running on is fairly new, disks and memory are
> not throwing errors, and we've never had problems with any
> other tables. We're running on CentOS 6.8 - again, not brand
> new but hardly decrepit, either. I've rebuilt the table
> several times. This time I'd like to give the PG community
> a chance to chase it down before I do that.

amcheck could potentially reveal more information about the problem.
The version on Github works with 9.5, and there are Redhat/CentOS pgdg
packages you can use. See:

https://github.com/petergeoghegan/amcheck

The "heapallindexed" option may be of particular interest to you. It
verifies that the table has matching rows for a target index, in
addition to testing the structure of the target B-Tree index itself.
It doesn't do anything special with TOAST tables just yet, but it's
still the best general corruption test we have.

--
Peter Geoghegan

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Niles Oien <noien(at)nso(dot)edu>
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: TOAST table repeatedly corrupted
Date:	2018-05-09 22:46:04
Message-ID:	20811.1525905964@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	Postg젠 토토SQL : Postg젠 토토SQL 메일 링리스트 : 2018-05-09 이후 PGSQL-BUGS 22:46

Niles Oien <noien(at)nso(dot)edu> writes:
> PostgreSQL 10.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-16), 64-bit
>
> Does that need an upgrade to get this week's TOAST fixes, too?

Yes.

> I'm not sure
> if CentOS's 'yum upgrade' will pick it up - I have the repo pgdg10/7/x86_64
> enabled, will the update show up that way?

I believe so.

regards, tom lane