Possible bug with SKIP LOCKED behaviour

Lists: pgsql-bugs
From: Glen Mailer <glen(at)geckoboard(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Possible bug with SKIP LOCKED behaviour
Date: 2022-09-28 16:28:06
Message-ID: CAHvdy4VdisRrJ9mYyaVo++S+q2PuXn2K-yU_AHXQ_Z+UgdyFcA@mail.gmail.com

Hello everyone

I believe I've run into a bug in the behaviour of SKIP LOCKED, where I have
a program that implements a queue with concurrent workers SELECTing work
from some shared tables.

The code in question does a LEFT JOIN across two tables with a FOR UPDATE
on the left table and a SKIP LOCKED clause, and then UPDATEs or INSERTs
rows into the table on the right side of the JOIN in a way that causes
subsequent executions of the same query to no longer match those rows.
However, when run concurrently I'm seeing the same row be selected by
multiple workers - which shouldn't be possible based on my understanding of
the relevant semantics of these operations. Perhaps I'm just holding it
wrong, but I would have expected the FOR UPDATE lock on the left table to
be sufficient to avoid overlapping results.

I have extracted a fairly minimal reproducing case from our production
code, which includes some Go code as a test harness to run the queries
concurrently enough to demonstrate the problem - this can be found at
https://github.com/glenjamin/postgres-skip-locked-surprise

I wasn't sure how much detail from that reproducing case to repeat in this
email, so I've only gone with an outline of the observed and expected
behaviour - but I can add more detail to this thread if desired.

Cheers
Glen


From: Zhang Mingli <zmlpostgres(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org, Glen Mailer <glen(at)geckoboard(dot)com>
Subject: Re: Possible bug with SKIP LOCKED behaviour
Date: 2022-09-29 02:41:42
Message-ID: 33029e69-53fd-44b7-81be-4bd3891cf23b@Spark

Hi,

According to the documentation:

With SKIP LOCKED, any selected rows that cannot be immediately locked are skipped. Skipping locked rows provides an inconsistent view of the data, so this is not suitable for general purpose work, but can be used to avoid lock contention with multiple consumers accessing a queue-like table.

> this can be found at https://github.com/glenjamin/postgres-skip-locked-surprise

Also, a Go script is not convenient for hackers to reproduce with. Could you provide steps that reproduce the bug reliably, if it really is one?

Regards,
Zhang Mingli


From: Glen Mailer <glen(at)geckoboard(dot)com>
To: Zhang Mingli <zmlpostgres(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Possible bug with SKIP LOCKED behaviour
Date: 2022-09-29 08:50:50
Message-ID: CAHvdy4VGD+Jk2hc=m=iYL4amAohcUBoPzYcChM1mfHtPQ9AbWw@mail.gmail.com

Hello

> With SKIP LOCKED, any selected rows that cannot be immediately locked are
> skipped. Skipping locked rows provides an inconsistent view of the data, so
> this is not suitable for general purpose work, but can be used to avoid
> lock contention with multiple consumers accessing a queue-like table.

Yes, I am specifically aiming to avoid lock contention with multiple
consumers accessing a queue-like table, and I'm seeing the same row being
retrieved by multiple workers.

> Also, a Go script is not convenient for hackers to reproduce with. Could you
> provide steps that reproduce the bug reliably, if it really is one?

Reproducing requires running a transaction with queries dependent on the
results of earlier queries, and then running a number of these transactions
concurrently, and then repeating the test until the unexpected result
happens. Currently I'm doing 20 concurrent transactions, and I find that if
I repeat the test 100 times I tend to get between zero and 3 failures.
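
The procedure above can be sketched in Go roughly as follows. This is a minimal sketch under stated assumptions, not the harness from the linked repository: `claimFunc` is a hypothetical hook standing in for one transaction (the SKIP LOCKED query plus the follow-up write), and `safeQueue` is an in-memory stand-in included only so the sketch runs without a database.

```go
package main

import (
	"fmt"
	"sync"
)

// claimFunc stands in for one transaction: it returns the ID of the row
// the worker claimed, or ok=false if no row was available. In a real
// harness it would run the queries against PostgreSQL (hypothetical hook).
type claimFunc func(worker int) (id int, ok bool)

// runTrial starts `workers` goroutines together, lets each claim one row,
// and returns how many claims hit an ID another worker also claimed.
func runTrial(workers int, claim claimFunc) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		seen  = make(map[int]int)
		start = make(chan struct{})
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			<-start // release all workers at once to maximise overlap
			id, ok := claim(w)
			if !ok {
				return
			}
			mu.Lock()
			seen[id]++
			mu.Unlock()
		}(w)
	}
	close(start)
	wg.Wait()

	dups := 0
	for _, n := range seen {
		if n > 1 {
			dups += n - 1
		}
	}
	return dups
}

// countFailures repeats the trial and counts runs with any duplicate claim.
func countFailures(runs, workers int, newClaim func() claimFunc) int {
	failures := 0
	for i := 0; i < runs; i++ {
		if runTrial(workers, newClaim()) > 0 {
			failures++
		}
	}
	return failures
}

// safeQueue is an in-memory stand-in that never hands out an ID twice.
func safeQueue() claimFunc {
	var mu sync.Mutex
	next := 0
	return func(int) (int, bool) {
		mu.Lock()
		defer mu.Unlock()
		next++
		return next, true
	}
}

func main() {
	// 20 workers and 100 runs, matching the numbers quoted above.
	fmt.Println("failing runs:", countFailures(100, 20, safeQueue))
}
```

Swapping `safeQueue` for a function that runs the real transaction would turn this into a database-backed reproduction; a non-zero count would then correspond to the failures described above.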

What would be a more convenient way for me to provide this for reproduction?

Thanks
Glen
