Re: Causal reads take II

Lists:	pgsql-hackers

From:	Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Causal reads take II
Date:	2017-01-19 07:11:02
Message-ID:	CA+CSw_tz0q+FQsqh7Zx7xxF99Jm98VaAWGdEP592e7a+zkD_Mw@mail.gmail.com

On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Here is a new version of my "causal reads" patch (see the earlier
> thread from the 9.6 development cycle[1]), which provides a way to
> avoid stale reads when load balancing with streaming replication.

Thanks for working on this. It will let us do something a lot of
people have been asking for.

> Long term, I think it would be pretty cool if we could develop a set
> of features that give you distributed sequential consistency on top of
> streaming replication. Something like (this | causality-tokens) +
> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
> distributed-dirty-read-prevention[4].

Is it necessary that causal writes wait for replication before making
the transaction visible on the master? I'm asking because the
per-transaction variable wait time between logging the commit record
and making the transaction visible makes it really hard to provide
matching visibility order on master and standby. In the CSN-based
snapshot discussions we came to the conclusion that, to make standby
visibility order match the master while still allowing async
transactions to become visible before they are durable, we need to make
the commit sequence a vector clock and transmit extra visibility
ordering information to standbys. Having one more level of delay
between WAL logging of a commit and making it visible would make the
problem even worse.

One other thing that might be an issue for some users is that this
patch only ensures that clients observe forward progress of database
state after a writing transaction. With two consecutive read-only
transactions that go to different servers, a client could still observe
database state going backwards. It seems that fixing that would
require either keeping some per-client state or a global agreement on
what snapshots are safe to provide, both of which you tried to avoid
for this feature.

Regards,
Ants Aasma

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Causal reads take II
Date:	2017-01-19 12:22:39
Message-ID:	CAEepm=15WC7A9Zdj2Qbw3CUDXWHe69d=nBpf+jXui7OYXXq11w@mail.gmail.com

On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> Long term, I think it would be pretty cool if we could develop a set
>> of features that give you distributed sequential consistency on top of
>> streaming replication. Something like (this | causality-tokens) +
>> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
>> distributed-dirty-read-prevention[4].
>
> Is it necessary that causal writes wait for replication before making
> the transaction visible on the master? I'm asking because the per tx
> variable wait time between logging commit record and making
> transaction visible makes it really hard to provide matching
> visibility order on master and standby.

Yeah, that does seem problematic. Even with async replication or no
replication, isn't there already a race in CommitTransaction() where
two backends could reach RecordTransactionCommit() in one order but
ProcArrayEndTransaction() in the other order? AFAICS using
synchronous replication in one of the transactions just makes it more
likely you'll experience such a visibility difference between the DO
and REDO histories (!), by making RecordTransactionCommit() wait.
Nothing prevents you getting a snapshot that can see t2 but not t1 in
the DO history, while someone doing PITR or querying an asynchronous
standby gets a snapshot that can see t1 but not t2 because those
replay the REDO history.
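
To make the race concrete, here is a minimal Python sketch. It is a toy
model, not PostgreSQL code: the function run() and the step names "log"
and "show" are made up for illustration, standing in loosely for
RecordTransactionCommit() and ProcArrayEndTransaction(). It only shows
that when the two steps are not atomic, the order of commit records
(REDO) can differ from the order in which commits become visible (DO).

```python
def run(schedule):
    """Execute (txn, step) pairs and return (redo_order, do_order).

    "log"  appends the commit record to the WAL (hypothetical stand-in
           for RecordTransactionCommit()).
    "show" makes the commit visible to new snapshots (hypothetical
           stand-in for ProcArrayEndTransaction()).
    """
    wal = []       # commit record order: what a standby or PITR replays
    visible = []   # order in which commits became visible on the primary
    for txn, step in schedule:
        if step == "log":
            wal.append(txn)
        elif step == "show":
            if txn not in wal:
                raise ValueError(f"{txn} made visible before being logged")
            visible.append(txn)
        else:
            raise ValueError(f"unknown step {step!r}")
    return wal, visible


# t1 logs its commit first but is then delayed (for example, by waiting
# for synchronous replication), so t2 becomes visible before it.
schedule = [("t1", "log"), ("t2", "log"), ("t2", "show"), ("t1", "show")]
redo_order, do_order = run(schedule)

print("REDO order:", redo_order)
print("DO order:  ", do_order)
```

A snapshot taken on the primary between the two "show" steps sees t2 but
not t1, while a replay that stops between the two commit records sees t1
but not t2.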

> In CSN based snapshot
> discussions we came to the conclusion that to make standby visibility
> order match master while still allowing for async transactions to
> become visible before they are durable we need to make the commit
> sequence a vector clock and transmit extra visibility ordering
> information to standby's. Having one more level of delay between wal
> logging of commit and making it visible would make the problem even
> worse.

I'd like to read that... could you please point me at the right bit of
that discussion?

> One other thing that might be an issue for some users is that this
> patch only ensures that clients observe forwards progress of database
> state after a writing transaction. With two consecutive read only
> transactions that go to different servers a client could still observe
> database state going backwards.

True. This patch is about "read your writes", not, erm, "read your
reads". That may indeed be problematic for some users. It's not a
very satisfying answer but I guess you could run a dummy write query
on the primary every time you switch between standbys, or before
telling any other client to run read-only queries after you have done
so, in order to convert your "r r" sequence into a "r w r" sequence...

> It seems that fixing that would
> require either keeping some per client state or a global agreement on
> what snapshots are safe to provide, both of which you tried to avoid
> for this feature.

Agreed. You briefly mentioned this problem in the context of pairs of
read-only transactions a while ago[1]. As you said then, it does seem
plausible to do that with a token system that gives clients the last
commit LSN from the snapshot used by a read-only query, so that you
can ask another standby to make sure that LSN has been replayed before
running another read-only transaction. This could be handled
explicitly by a motivated client that is talking to multiple nodes. A
more general problem is client A telling client B to go and run
queries and expecting B to see all transactions that A has seen; it
now has to pass the LSN along with that communication, or rely on some
kind of magic proxy that sees all transactions, or a radically
different system with a GTM.
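
A rough Python sketch of the token idea, under loud assumptions: the
Standby and Client classes, their methods, and the use of plain integers
as LSNs are all hypothetical, and nothing here is an existing PostgreSQL
or driver API. The client remembers the highest LSN it has observed and
each standby refuses to answer until it has replayed at least that far.

```python
import threading


class Standby:
    """Toy standby that tracks how far it has replayed (integer LSNs)."""

    def __init__(self):
        self._replayed_lsn = 0
        self._cond = threading.Condition()

    def replay_up_to(self, lsn):
        # Called as WAL is applied; wakes any reader waiting on this LSN.
        with self._cond:
            self._replayed_lsn = max(self._replayed_lsn, lsn)
            self._cond.notify_all()

    def read(self, min_lsn, timeout=1.0):
        # Wait until min_lsn has been replayed, then "run the query" and
        # return the LSN of the snapshot used, which serves as the token.
        with self._cond:
            caught_up = self._cond.wait_for(
                lambda: self._replayed_lsn >= min_lsn, timeout)
            if not caught_up:
                raise TimeoutError("standby has not replayed the token LSN")
            return self._replayed_lsn


class Client:
    """Toy client that carries its causality token between nodes."""

    def __init__(self):
        self.token = 0

    def read_from(self, standby, timeout=1.0):
        self.token = max(self.token, standby.read(self.token, timeout))
        return self.token


a, b = Standby(), Standby()
a.replay_up_to(100)
b.replay_up_to(50)

client = Client()
client.read_from(a)                    # token is now 100
try:
    client.read_from(b, timeout=0.01)  # b is behind the token
    served_stale = True
except TimeoutError:
    served_stale = False               # b declined to go backwards
b.replay_up_to(120)
final_token = client.read_from(b)      # b has caught up past 100
```

The "client A tells client B" case is the same thing with A handing
client.token to B out of band before B runs its first query.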

[1] /message-id/CA%2BCSw_u4Vy5FSbjVc7qms6PuZL7QV90%2BonBEtK9PFqOsNj0Uhw@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

From:	Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Causal reads take II
Date:	2017-01-19 14:01:28
Message-ID:	CA+CSw_txMN-FxKM_sAHeXs+fXynfQ5uW4O=xgOYnPHOQ9c0_kQ@mail.gmail.com

On Thu, Jan 19, 2017 at 2:22 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
>> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro
>> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>>> Long term, I think it would be pretty cool if we could develop a set
>>> of features that give you distributed sequential consistency on top of
>>> streaming replication. Something like (this | causality-tokens) +
>>> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
>>> distributed-dirty-read-prevention[4].
>>
>> Is it necessary that causal writes wait for replication before making
>> the transaction visible on the master? I'm asking because the per tx
>> variable wait time between logging commit record and making
>> transaction visible makes it really hard to provide matching
>> visibility order on master and standby.
>
> Yeah, that does seem problematic. Even with async replication or no
> replication, isn't there already a race in CommitTransaction() where
> two backends could reach RecordTransactionCommit() in one order but
> ProcArrayEndTransaction() in the other order? AFAICS using
> synchronous replication in one of the transactions just makes it more
> likely you'll experience such a visibility difference between the DO
> and REDO histories (!), by making RecordTransactionCommit() wait.
> Nothing prevents you getting a snapshot that can see t2 but not t1 in
> the DO history, while someone doing PITR or querying an asynchronous
> standby gets a snapshot that can see t1 but not t2 because those
> replay the REDO history.

Yes, there is a race even with all transactions having the same
synchronization level. But nobody will mind if we some day fix that
race. :) With different synchronization levels it is much trickier to
fix, as one of three things must hold: async commits wait behind sync
commits before becoming visible, commits return without becoming
visible, or visibility order differs from commit record LSN order. The
first makes the async commit feature useless, the second seems a no-go
for semantic reasons, and the third requires extra information to be
sent to standbys so they know the actual commit order.

>> In CSN based snapshot
>> discussions we came to the conclusion that to make standby visibility
>> order match master while still allowing for async transactions to
>> become visible before they are durable we need to make the commit
>> sequence a vector clock and transmit extra visibility ordering
>> information to standby's. Having one more level of delay between wal
>> logging of commit and making it visible would make the problem even
>> worse.
>
> I'd like to read that... could you please point me at the right bit of
> that discussion?

Some of that discussion was face to face at pgconf.eu, some of it is
here: /message-id/CA+CSw_vbt=CwLuOgR7gXdpnho_Y4Cz7X97+o_bH-RFo7keNO8Q@mail.gmail.com

Let me know if you have any questions.

>> It seems that fixing that would
>> require either keeping some per client state or a global agreement on
>> what snapshots are safe to provide, both of which you tried to avoid
>> for this feature.
>
> Agreed. You briefly mentioned this problem in the context of pairs of
> read-only transactions a while ago[1]. As you said then, it does seem
> plausible to do that with a token system that gives clients the last
> commit LSN from the snapshot used by a read only query, so that you
> can ask another standby to make sure that LSN has been replayed before
> running another read-only transaction. This could be handled
> explicitly by a motivated client that is talking to multiple nodes. A
> more general problem is client A telling client B to go and run
> queries and expecting B to see all transactions that A has seen; it
> now has to pass the LSN along with that communication, or rely on some
> kind of magic proxy that sees all transactions, or a radically
> different system with a GTM.

If/when we do CSN-based snapshots, adding a GTM could be relatively
straightforward. It's basically not all that far from what Spanner is
doing by using a timestamp as the snapshot. But this is all relatively
independent of this patch.

Regards,
Ants Aasma

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Causal reads take II
Date:	2017-01-19 22:11:58
Message-ID:	CAEepm=3SsvP4HQz0twk+6_G74H4oLG9hfg6DQvKkqX5_8a8XNQ@mail.gmail.com

On Fri, Jan 20, 2017 at 3:01 AM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
> Yes there is a race even with all transactions having the same
> synchronization level. But nobody will mind if we some day fix that
> race. :)

We really should fix that!

> With different synchronization levels it is much trickier to
> fix as either async commits must wait behind sync commits before
> becoming visible, return without becoming visible or visibility order
> must differ from commit record LSN order. The first makes the async
> commit feature useless, second seems a no-go for semantic reasons,
> third requires extra information sent to standby's so they know the
> actual commit order.

Thought experiment:

1. Log commit and make visible atomically (so DO and REDO agree on
visibility order).
2. Introduce flag 'not yet visible' to commit record for sync rep commits.
3. Introduce a new log record 'make all invisible commits up to LSN X
visible', which is inserted when enough sync rep standbys reply. Log
this + make visible on primary atomically (again, so DO and REDO agree
on visibility order).
4. Teach GetSnapshotData to deal with this using <insert magic here>.

Now standby and primary agree on visibility order of async and sync
transactions, and no standby will allow you to see transactions that
the primary doesn't yet consider to be durable (i.e. flushed on a quorum
of standbys etc). But... sync rep has to flush xlog twice on primary,
and standby has to wait to make things visible, and remote_apply would
either need to be changed or supplemented with a new level
remote_apply_and_visible, and it's not obvious how to actually do
atomic visibility + logging (I heard ProcArrayLock is kinda hot...).
Hmm. Doesn't sound too popular...
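
The thought experiment can be sketched as a toy WAL replay in Python.
Everything here is an assumption made for illustration: the record
shapes ("commit", xid, not_yet_visible) and ("make_visible", upto_lsn),
the use of list indexes as LSNs, and the function name. It skips step 4
entirely and only shows that, if visibility is derived from the log
alone, primary and standby compute the same visibility order.

```python
def visibility_order(wal):
    """Replay a toy WAL; return xids in the order they become visible."""
    pending = []   # (lsn, xid) of commits flagged 'not yet visible'
    order = []
    for lsn, record in enumerate(wal):
        kind = record[0]
        if kind == "commit":
            _, xid, not_yet_visible = record
            if not_yet_visible:
                pending.append((lsn, xid))   # sync rep commit: hold back
            else:
                order.append(xid)            # async commit: visible at once
        elif kind == "make_visible":
            _, upto_lsn = record
            # Release held-back commits up to and including upto_lsn,
            # in the order their commit records appear in the log.
            order.extend(xid for l, xid in pending if l <= upto_lsn)
            pending = [(l, xid) for l, xid in pending if l > upto_lsn]
        else:
            raise ValueError(f"unknown record kind {kind!r}")
    return order


wal = [
    ("commit", "s1", True),    # LSN 0: sync rep commit, not yet visible
    ("commit", "a1", False),   # LSN 1: async commit
    ("commit", "s2", True),    # LSN 2: sync rep commit, not yet visible
    ("make_visible", 0),       # LSN 3: enough standbys acknowledged s1
    ("commit", "a2", False),   # LSN 4: async commit
    ("make_visible", 2),       # LSN 5: enough standbys acknowledged s2
]

primary_order = visibility_order(wal)
standby_order = visibility_order(list(wal))   # standby replays the same log
print(primary_order)
```

Note that a1 becomes visible before s1 even though s1's commit record
comes first, which is the "visibility order differs from commit record
LSN order" option, with the extra ordering information carried by the
new record type.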

>>> In CSN based snapshot
>>> discussions we came to the conclusion that to make standby visibility
>>> order match master while still allowing for async transactions to
>>> become visible before they are durable we need to make the commit
>>> sequence a vector clock and transmit extra visibility ordering
>>> information to standby's. Having one more level of delay between wal
>>> logging of commit and making it visible would make the problem even
>>> worse.
>>
>> I'd like to read that... could you please point me at the right bit of
>> that discussion?
>
> Some of that discussion was face to face at pgconf.eu, some of it is
> here: /message-id/CA+CSw_vbt=CwLuOgR7gXdpnho_Y4Cz7X97+o_bH-RFo7keNO8Q@mail.gmail.com
>
> Let me know if you have any questions.

Thanks! That may take me some time...

--
Thomas Munro
http://www.enterprisedb.com