From: Pavel Suderevsky <psuderevsky(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Aleksey Romanov <drednout(dot)by(at)gmail(dot)com>
Subject: Re: PosgreSQL backend process crashed with signal 9
Date: 2016-04-06 14:36:31
Message-ID: CAEBTBzuuSR7X70sK_iVmVLg-yhOAABG3Yrbdj+jhN=-yvx9eYQ@mail.gmail.com
Lists: pgsql-bugs
I've faced similar behaviour in a much simpler situation. It crashed on a
simple nested loop, and all other child postmaster processes were
terminated as well.
Reproduced on:
1. 9.5.1 + Ubuntu 14.04.4 LTS
2. 9.5.1 + CentOS 7.2.1511
3. 9.5.2 + CentOS 7.2.1511
DDL:
> platform_db=# \d node_statuses
> Table "platform_db.node_statuses"
> Column | Type | Modifiers
> -----------------------+-------------------------+---------------------------------
> node_id | integer | not null
> status_id | integer | not null
> msg | character varying(1024) | default NULL::character varying
> node_status_timestamp | bigint | not null
> Indexes:
> "NODE_STATUSES_NODE_ID_FK" btree (node_id)
> "NODE_STATUSES_NODE_ID_TIMESTAMP_IDX" btree (node_id,
> node_status_timestamp)
> "NODE_STATUSES_TIMESTAMP_IDX" btree (node_status_timestamp)
>
> platform_db=# \d node_status_values
> Table "platform_db.node_status_values"
> Column | Type | Modifiers
> ----------------------+---------+-----------
> node_id | integer | not null
> status_id | integer | not null
> node_value_timestamp | bigint | not null
> Indexes:
> "NODE_STATUS_VALUES_NODE_ID_FK" btree (node_id)
> "NODE_STATUS_VALUES_NODE_ID_TIMESTAMP_IDX" btree (node_id,
> node_value_timestamp)
SQL:
> SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON
> ns.node_id = nsv.node_id where ns.node_id = 1147;
Postgres log:
> 2016-04-05 18:37:31 MSK LOG: server process (PID 2848) was terminated by
> signal 9: Killed
> 2016-04-05 18:37:31 MSK DETAIL: Failed process was running: COMMIT
> 2016-04-05 18:37:32 MSK LOG: terminating any other active server
> processes
> 2016-04-05 18:37:32 MSK WARNING: terminating connection because of crash
> of another server process
> 2016-04-05 18:37:32 MSK DETAIL: The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK HINT: In a moment you should be able to
> reconnect to the database and repeat your command.
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db(at)platform_db [unknown] %
> WARNING: terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db(at)platform_db [unknown] %
> DETAIL: The postmaster has commanded this server process to roll back the
> current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db(at)platform_db [unknown] %
> HINT: In a moment you should be able to reconnect to the database and
> repeat your command.
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % WARNING:
> terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % DETAIL: The
> postmaster has commanded this server process to roll back the current
> transaction and exit, because another server process
> exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % HINT: In a moment
> you should be able to reconnect to the database and repeat your command.
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % WARNING:
> terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % DETAIL: The
> postmaster has commanded this server process to roll back the current
> transaction and exit, because another server process
> exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:37 MSK LOG: statistics collector process (PID 1012) was
> terminated by signal 9: Killed
> 2016-04-05 18:37:37 MSK LOG: all server processes terminated;
> reinitializing
> 2016-04-05 18:37:39 MSK LOG: database system was interrupted; last known
> up at 2016-04-05 18:31:44 MSK
> 2016-04-05 18:37:39 MSK ::1 zabbix(at)zabbix [unknown] % FATAL: the
> database system is in recovery mode
> 2016-04-05 18:37:41 MSK LOG: database system was not properly shut down;
> automatic recovery in progress
> 2016-04-05 18:37:41 MSK LOG: redo starts at 298/AB146828
> 2016-04-05 18:37:41 MSK ::1 zabbix(at)zabbix [unknown] % FATAL: the
> database system is in recovery mode
> 2016-04-05 18:37:44 MSK [local] postgres(at)platform_db [unknown] % FATAL:
> the database system is in recovery mode
> 2016-04-05 18:37:46 MSK LOG: invalid record length at 298/C07ACA78
> 2016-04-05 18:37:46 MSK LOG: redo done at 298/C07ACA50
> 2016-04-05 18:37:46 MSK LOG: last completed transaction was at log time
> 2016-04-05 18:37:28.455638+03
> 2016-04-05 18:37:48 MSK ::1 zabbix(at)zabbix [unknown] % FATAL: the
> database system is in recovery mode
> 2016-04-05 18:38:08 MSK LOG: MultiXact member wraparound protections are
> now enabled
> 2016-04-05 18:38:08 MSK LOG: autovacuum launcher started
> 2016-04-05 18:38:08 MSK LOG: database system is ready to accept
> connections
On CentOS, the log shows just:
> 2016-04-06 13:58:58 GMT [local] postgres(at)platform_db psql % LOG: could
> not send data to client: Broken pipe
> 2016-04-06 13:58:58 GMT [local] postgres(at)platform_db psql % STATEMENT:
> SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON
> ns.node_id = nsv.node_id where ns.node_id = 1147;
> 2016-04-06 13:58:58 GMT [local] postgres(at)platform_db psql % FATAL:
> connection to client lost
System log (Ubuntu):
> Apr 5 18:37:30 db_server kernel: [3905852.822122] postgres invoked
> oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> Apr 5 18:37:30 db_server kernel: [3905852.822131] postgres cpuset=/
> mems_allowed=0-1
> Apr 5 18:37:30 db_server kernel: [3905852.822136] CPU: 3 PID: 9191 Comm:
> postgres Not tainted 3.13.0-77-generic #121-Ubuntu
> Apr 5 18:37:30 db_server kernel: [3905852.822139] Hardware name:
> Supermicro H8DMR-82/H8DMR-82, BIOS 080014 12/15/2006
> ...
> Apr 5 18:37:31 db_server kernel: [3905854.860083] Out of memory: Kill
> process 2847 (postgres) score 391 or sacrifice child
> Apr 5 18:37:31 db_server kernel: [3905854.860201] Killed process 2847
> (postgres) total-vm:6613344kB, anon-rss:34224kB, file-rss:6387036kB
Counts (the two tables join only on node_id, so for node_id = 1147 the result is effectively a cross product of the ~6.8k matching rows in each table, i.e. ~46M rows):
> platform_db=# SELECT count(*) FROM node_statuses ns where node_id = 1147;
> count
> -------
> 6819
> (1 row)
> platform_db=# SELECT count(*) FROM node_status_values nsv where node_id =
> 1147;
> count
> -------
> 6814
> (1 row)
> platform_db=# SELECT count(*) FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
>
> count
> ----------
> 46423776
> (1 row)
Query plan:
> Nested Loop (cost=477.14..570040.76 rows=45261752 width=53) (actual
> time=3.223..252614.947 rows=46382904 loops=1)
> -> Bitmap Heap Scan on node_statuses ns (cost=241.60..2336.38
> rows=6862 width=37) (actual time=1.724..21.248 rows=6813 loops=1)
> Recheck Cond: (node_id = 1147)
> Heap Blocks: exact=1397
> -> Bitmap Index Scan on "NODE_STATUSES_NODE_ID_FK"
> (cost=0.00..239.89 rows=6862 width=0) (actual time=1.472..1.472 rows=6813
> loops=1)
> Index Cond: (node_id = 1147)
> -> Materialize (cost=235.54..1948.97 rows=6596 width=16) (actual
> time=0.002..12.092 rows=6808 loops=6813)
> -> Bitmap Heap Scan on node_status_values nsv
> (cost=235.54..1915.99 rows=6596 width=16) (actual time=1.471..17.282
> rows=6808 loops=1)
> Recheck Cond: (node_id = 1147)
> Heap Blocks: exact=1357
> -> Bitmap Index Scan on "NODE_STATUS_VALUES_NODE_ID_FK"
> (cost=0.00..233.89 rows=6596 width=0) (actual time=1.238..1.238 rows=6808
> loops=1)
> Index Cond: (node_id = 1147)
> Planning time: 0.210 ms
> Execution time: 334897.948 ms
Memory on the host:
> total used free shared buffers cached
> Mem: 15G 15G 453M 2.7G 195M 12G
> -/+ buffers/cache: 2.2G 13G
> Swap: 0B 0B 0B
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Suderevsky <psuderevsky(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Aleksey Romanov <drednout(dot)by(at)gmail(dot)com>
Subject: Re: PosgreSQL backend process crashed with signal 9
Date: 2016-04-06 14:46:53
Message-ID: 30504.1459954013@sss.pgh.pa.us
Lists: pgsql-bugs
Pavel Suderevsky <psuderevsky(at)gmail(dot)com> writes:
> I've faced similar behaviour in a much simpler situation. It crashed on
> a simple nested loop, and all other child postmaster processes were
> terminated as well.
>> 2016-04-05 18:37:31 MSK LOG: server process (PID 2848) was terminated by
>> signal 9: Killed
Well, signal 9 is not an internal-to-Postgres error; that's something
outside the database killing the process. If you're on a Linux machine
it's most likely the dreaded OOM killer doing it. The usual
recommendation is to configure your system to reduce or eliminate
memory overcommit:
http://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
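(The main knobs there are a couple of sysctls; the exact ratio depends on
the machine's RAM and swap, so the values below are only illustrative:)
    # Illustrative only -- see the page above for how to size these.
    sysctl -w vm.overcommit_memory=2    # fail allocations up front instead of overcommitting
    sysctl -w vm.overcommit_ratio=80    # percent of RAM usable on top of swap
    # Persist across reboots:
    echo 'vm.overcommit_memory = 2' >> /etc/sysctl.conf
    echo 'vm.overcommit_ratio = 80' >> /etc/sysctl.conf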
The OP's log did not indicate a signal 9, though --- that was SIGSEGV
(sig 11, typically).
regards, tom lane
From: Pavel Suderevsky <psuderevsky(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org, Aleksey Romanov <drednout(dot)by(at)gmail(dot)com>
Subject: Re: PosgreSQL backend process crashed with signal 9
Date: 2016-04-06 14:53:25
Message-ID: CAEBTBzt7u-+U_aOXiWSiKzWntGN2iaJg=rwztbEqS66gB=hNAg@mail.gmail.com
Lists: pgsql-bugs
Yes, the OOM killer did the job, but is it normal that such a lightweight
query consumes so much memory that the OOM killer gets invoked?
> [dbtest3] mars_gedik=# select
> pg_size_pretty(pg_relation_size('node_statuses'));
> pg_size_pretty
> ----------------
> 11 MB
> (1 row)
> [dbtest3] mars_gedik=# select
> pg_size_pretty(pg_relation_size('node_status_values'));
> pg_size_pretty
> ----------------
> 11 MB
> (1 row)
2016-04-06 17:46 GMT+03:00 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Pavel Suderevsky <psuderevsky(at)gmail(dot)com> writes:
> > I've faced similar behaviour in a much simpler situation. It crashed on
> > a simple nested loop, and all other child postmaster processes were
> > terminated as well.
>
> >> 2016-04-05 18:37:31 MSK LOG: server process (PID 2848) was terminated by
> >> signal 9: Killed
>
> Well, signal 9 is not an internal-to-Postgres error; that's something
> outside the database killing the process. If you're on a Linux machine
> it's most likely the dreaded OOM killer doing it. The usual
> recommendation is to configure your system to reduce or eliminate
> memory overcommit:
>
>
> http://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>
> The OP's log did not indicate a signal 9, though --- that was SIGSEGV
> (sig 11, typically).
>
> regards, tom lane
>
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Suderevsky <psuderevsky(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Aleksey Romanov <drednout(dot)by(at)gmail(dot)com>
Subject: Re: PosgreSQL backend process crashed with signal 9
Date: 2016-04-06 15:00:16
Message-ID: 32700.1459954816@sss.pgh.pa.us
Lists: pgsql-bugs
Pavel Suderevsky <psuderevsky(at)gmail(dot)com> writes:
> Yes, the OOM killer did the job, but is it normal that such a lightweight
> query consumes so much memory that the OOM killer gets invoked?
[ shrug... ] The OOM killer is widely considered broken. Its heuristics
don't interact terribly well with processes using large amounts of shared
memory.
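(The kernel-resources page linked earlier also describes exempting the
postmaster itself from the OOM killer while leaving individual backends
killable; roughly, as root, in whatever script starts the server:)
    # Sketch only, assuming a script-started postmaster and a kernel with oom_score_adj.
    echo -1000 > /proc/self/oom_score_adj               # postmaster becomes an unattractive victim
    export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
    export PG_OOM_ADJUST_VALUE=0                         # backends revert to the default score
    pg_ctl start -D "$PGDATA"                            # or however the server is normally launched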
regards, tom lane
From: Pavel Suderevsky <psuderevsky(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org, Aleksey Romanov <drednout(dot)by(at)gmail(dot)com>
Subject: Re: PosgreSQL backend process crashed with signal 9
Date: 2016-04-06 18:22:23
Message-ID: CAEBTBzsmAfQ-FvgY2SNAGrKTqyW6+X6uiQ-LbusNv0AW18r_wg@mail.gmail.com
Lists: pgsql-bugs
>
> [ shrug... ] The OOM killer is widely considered broken. Its heuristics
> don't interact terribly well with processes using large amounts of shared
> memory.
Actually the issue is not the OOM killer invocation but the amount of memory
that PostgreSQL consumes for a not very heavy operation.
Observing memory:
> [root(at)dbtest3 ~]# for i in `seq 25`; do free -m && sleep 1 ; done
> total used free shared buff/cache available
> Mem: 993 64 831 3 97 811
> Swap: 1231 181 1050
Starting SQL:
> [dbtest3] mars_gedik=# SELECT * FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
>
Total memory consumed by the process, including shared memory, according to
the pmap utility (a private-memory cross-check is sketched after the output):
> [root(at)dbtest3 ~]# for i in `seq 25`; do pmap 16227 | tail -1 && sleep 1;
> done
> total 189172K
> total 220264K
> total 438276K
> total 657148K
> total 861352K
> total 1137396K
> total 1320612K
> total 1564020K
> total 1809472K
> total 2038492K
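(Note: pmap's total also counts shared memory the process has merely
touched, so as a cross-check one can sum only the private mappings from
/proc/<pid>/smaps, for example:)
    # Private (non-shared) portion of the same process's mappings, in kB:
    awk '/^Private/ {sum += $2} END {print sum " kB"}' /proc/16227/smaps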
Observing memory once again:
> total used free shared buff/cache available
> Mem: 993 881 62 2 49 18
> Swap: 1231 1113 118
And the OOM killer kicks in.
If I set vm.overcommit_memory=2, then I get:
> [dbtest3] mars_gedik=# SELECT * FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
> out of memory for query result
But the process stays alive and the OOM killer isn't invoked.
> [root(at)dbtest3 ~]# for i in `seq 25`; do pmap 16479 | tail -1 && sleep 1;
> done
> total 189172K
> total 205776K
> total 419796K
> total 644212K
> total 860560K
> total 1110732K
> total 1320480K
> total 484948K
> total 400404K
> total 313236K
> total 189172K
> total 189172K
> total 189172K
> total 189172K
> total 189172K
2016-04-06 18:00 GMT+03:00 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Pavel Suderevsky <psuderevsky(at)gmail(dot)com> writes:
> > Yes, the OOM killer did the job, but is it normal that such a lightweight
> > query consumes so much memory that the OOM killer gets invoked?
>
> [ shrug... ] The OOM killer is widely considered broken. Its heuristics
> don't interact terribly well with processes using large amounts of shared
> memory.
>
> regards, tom lane
>
From: David Gould <daveg(at)sonic(dot)net>
To: Pavel Suderevsky <psuderevsky(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org, Aleksey Romanov <drednout(dot)by(at)gmail(dot)com>
Subject: Re: PosgreSQL backend process crashed with signal 9
Date: 2016-04-06 20:35:43
Message-ID: 20160406133543.0809152d@engels
Lists: pgsql-bugs
On Wed, 6 Apr 2016 21:22:23 +0300
Pavel Suderevsky <psuderevsky(at)gmail(dot)com> wrote:
> >
> > [ shrug... ] The OOM killer is widely considered broken. Its heuristics
> > don't interact terribly well with processes using large amounts of shared
> > memory.
>
>
> Actually the issue is not the OOM killer invocation but the amount of memory
> that PostgreSQL consumes for a not very heavy operation.
I suggest you open a new bug for this. The original report was for a SEGV,
which should never happen. This report is for excessive memory growth leading
to OOM. Both cause a crash, but they are different issues and tying them
together may delay solving either of them.
-dg
--
David Gould 510 282 0869 daveg(at)sonic(dot)net
If simplicity worked, the world would be overrun with insects.