From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Global snapshots |
Date: | 2018-05-01 16:27:22 |
Message-ID: | 21BC916B-80A1-43BF-8650-3363CCDAE09C@postgrespro.ru |
Lists: | pgsql-hackers |
Global snapshots
================
Proposed here is a set of patches that allows achieving proper snapshot
isolation semantics in the case of cross-node transactions. It provides
infrastructure to synchronize snapshots between different Postgres instances
and a way to atomically commit such transactions with respect to other
concurrent global and local transactions. Such global transactions can be
coordinated outside of Postgres by using the provided SQL functions or
through postgres_fdw, which makes use of these functions on remote nodes
transparently.
Background
----------
Several years ago an extensible transaction manager (XTM) patch [1,2,3] was
proposed, which allowed extensions to hook into and override
transaction-related functions like xid assignment, snapshot acquisition,
transaction tree commit/abort, etc. Two transaction management extensions
were also created on top of that API: pg_dtm and pg_tsdtm [4].
The first one, pg_dtm, was inspired by Postgres-XL and represents a direct
generalization of Postgres MVCC to the cross-node scenario: a standalone
service (DTM) maintains an xid counter and an array of running backends on
all nodes. Every node keeps an open connection to the DTM and asks for xids
and Postgres-style snapshots via network calls. While this approach is
reasonable for low transaction rates (which is common for OLAP-like loads),
it quickly runs out of breath in the OLTP case.
The second one, pg_tsdtm, based on the Clock-SI paper [5], implements a
distributed snapshot synchronization protocol without the need for a central
service like the DTM. It makes use of CSN-based visibility and requires two
rounds of communication during transaction commit. Since a commit across
nodes, each of which can abort the transaction, is usually done via the 2PC
protocol, those two rounds of communication are needed anyway.
The current patch is a reimplementation of pg_tsdtm moved directly into core,
without using any API. The decision to drop the XTM API was based on two
considerations. First, the XTM API itself needed a huge pile of work to
become an actual API instead of a set of hooks over the current
implementation of MVCC, with extensions responsible for handling random
static variables like RecentGlobalXmin and friends, which other guts make use
of. Second, in all of our tests pg_tsdtm was better and faster, but that
wasn't obvious at the beginning of the work on XTM/DTM. So we decided that it
is better to focus on a good implementation of the Clock-SI approach.
Algorithm
---------
Clock-SI is described in [5]; here I provide a small overview, which should
be enough to catch the idea. Assume that each node runs Commit Sequence
Number (CSN) based visibility: the database tracks one counter for each
transaction start (xid) and another counter for each transaction commit
(csn). In such a setting, a snapshot is just a single number -- a copy of the
current CSN at the moment when the snapshot was taken. Visibility rules boil
down to checking whether the current tuple's CSN is less than our snapshot's
csn. It is also worth mentioning that for the last 5 years there has been an
active proposal to switch Postgres to CSN-based visibility [6].
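For illustration, the whole visibility machinery then degenerates into
something like the following toy model (names are invented here and do not
come from the patch):

/* Illustration only: with CSN-based visibility a snapshot is a single
 * number and the visibility check is a single comparison. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CSN;

typedef struct { CSN snapshot_csn; } CSNSnapshot;   /* the whole snapshot */

static bool
tuple_visible(CSN tuple_commit_csn, const CSNSnapshot *snap)
{
    /* 0 stands for "writer not committed yet" in this toy model */
    return tuple_commit_csn != 0 && tuple_commit_csn < snap->snapshot_csn;
}

int main(void)
{
    CSNSnapshot snap = { 42 };
    return tuple_visible(41, &snap) ? 0 : 1;
}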
Next, let's assume that the CSN is the current physical time on the node and
call it GlobalCSN. If the physical time on different nodes were perfectly
synchronized, then such a snapshot exported on one node could be used on
other nodes. But unfortunately physical clocks are never perfectly in sync
and can drift, so that fact should be taken into account. Also, there is no
easy notion of a lock or atomic operation in a distributed environment, so
commit atomicity on different nodes with respect to concurrent snapshot
acquisition must be handled somehow. Clock-SI addresses that in the following
way:
1) To achieve commit atomicity across nodes, an intermediate step is
   introduced: first, the running transaction is marked as InDoubt on all
   nodes, and only after that does each node commit it and stamp it with the
   given GlobalCSN. All readers who run into tuples of an InDoubt transaction
   should wait until it ends and recheck visibility. The same technique was
   used in [6] to achieve atomicity of subtransaction tree commit.
2) When the coordinator is marking the transaction as InDoubt on other nodes,
   it collects a ProposedGlobalCSN from each participant, which is the local
   time on that node. Next, it selects the maximal value of all
   ProposedGlobalCSNs and commits the transaction on all nodes with that
   maximal GlobalCSN, even if that value is greater than the current time on
   a given node due to clock drift. So the GlobalCSN for the given
   transaction will be the same on all nodes.
3) When a local transaction imports a foreign global snapshot with some
   GlobalCSN and the current time on this node is smaller than the incoming
   GlobalCSN, the transaction should wait until this GlobalCSN time comes on
   the local clock.
Rules 2) and 3) provide protection against time drift. Paper [5] proves that
this is enough to guarantee proper Snapshot Isolation.
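To make these three rules a bit more tangible, here is a rough single-process
simulation of the protocol. This is only an illustration; all structures and
names are invented and do not come from the patch:

/*
 * Rough illustration of the three Clock-SI rules above, as a single-process
 * simulation (not the patch code).
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t GlobalCSN;
#define InDoubtGlobalCSN UINT64_MAX     /* special marker: outcome unknown */

typedef struct { GlobalCSN clock; GlobalCSN commit_csn; } Node;

/* Rule 1: before commit, mark the transaction InDoubt on every node. */
static GlobalCSN prepare_on_node(Node *n)
{
    n->commit_csn = InDoubtGlobalCSN;
    return n->clock;                    /* ProposedGlobalCSN = local time */
}

/* Rule 2: commit everywhere with the maximum of the proposed CSNs. */
static void commit_everywhere(Node *nodes, int nnodes)
{
    GlobalCSN max_csn = 0;
    for (int i = 0; i < nnodes; i++)
    {
        GlobalCSN proposed = prepare_on_node(&nodes[i]);
        if (proposed > max_csn)
            max_csn = proposed;
    }
    for (int i = 0; i < nnodes; i++)
        nodes[i].commit_csn = max_csn;  /* may be ahead of the local clock */
}

/* Rule 3: an imported snapshot must not be "from the future" locally. */
static void import_snapshot(Node *n, GlobalCSN snapshot_csn)
{
    while (n->clock < snapshot_csn)
        n->clock++;                     /* stands in for waiting on the clock */
}

int main(void)
{
    Node nodes[2] = {{100, 0}, {97, 0}};   /* second node lags by 3 "ticks" */
    commit_everywhere(nodes, 2);
    printf("commit CSN on both nodes: %llu %llu\n",
           (unsigned long long) nodes[0].commit_csn,
           (unsigned long long) nodes[1].commit_csn);
    import_snapshot(&nodes[1], nodes[0].clock);
    return 0;
}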
Implementation
--------------
The main design decision of this patch is to affect the performance of local
transactions as little as possible while providing a way to run global
transactions. The GUC variable track_global_snapshots enables/disables this
feature.
Patch 0001-GlobalCSNLog introduces a new SLRU instance that maps xids to
GlobalCSNs. The GlobalCSNLog code is pretty straightforward and more or less
copied from the SUBTRANS log, which is also not persistent. I also kept an
eye on the corresponding part of Heikki's original patch for CSNs in [6] and
on commit_ts.c.
Patch 0002-Global-snapshots provides the infrastructure for snapshot sync
functions and global commit functions. It consists of several parts which are
enabled when the GUC track_global_snapshots is on:
* Each Postgres snapshot acquisition is accompanied by taking the current
  GlobalCSN under the same shared ProcArray lock.
* Each transaction commit also writes the current GlobalCSN into
  GlobalCSNLog. To avoid writes to the SLRU under an exclusive ProcArray lock
  (which would be a major hit on commit performance), a trick with an
  intermediate InDoubt state is used: before calling
  ProcArrayEndTransaction() the backend writes the InDoubt state into the
  SLRU, then inside ProcArrayEndTransaction() under the ProcArray lock the
  GlobalCSN is assigned, and after the lock is released the assigned
  GlobalCSN value is written to the GlobalCSNLog SLRU. This approach ensures
  that XIP-based and GlobalCSN-based snapshots are going to see the same
  subset of tuple versions without putting too much extra contention on the
  ProcArray lock.
* XidInMVCCSnapshot() can use both XIP-based and GlobalCSN-based snapshots.
  If the current transaction is a local one, the XIP-based check is performed
  first; if the tuple is visible, no further checks are done; if the xid is
  in-progress, we need to fetch its GlobalCSN from the SLRU and recheck it
  with GlobalCSN-based visibility rules, as it may be part of a global
  InDoubt transaction. So, for local transactions XidInMVCCSnapshot() will
  fetch an SLRU entry only for in-progress transactions. This can be
  optimized further: it is possible to store a flag in the snapshot which
  indicates whether there were any active global transactions when this
  snapshot was taken. Then, if there were no global transactions during
  snapshot creation, the SLRU access in XidInMVCCSnapshot() can be skipped
  altogether. Otherwise, if the current backend has a snapshot that was
  imported from a foreign node, then only GlobalCSN-based visibility rules
  are used (as we don't have any information about what the XIP array looked
  like when that GlobalCSN was taken). A sketch of this check order is given
  after this patch description below.
* Import/export routines are provided for global snapshots. Export just
  returns the current snapshot's global_csn. Import, on the other hand, is
  more complicated. Since an imported GlobalCSN usually points into the past,
  we should hold back OldestXmin and ensure that tuple versions for the given
  GlobalCSN do not pass the OldestXmin boundary. To achieve that, a mechanism
  like the one in SnapshotTooOld is created: on each snapshot creation the
  current oldestXmin is written to a sparse ring buffer which holds
  oldestXmin entries for the last global_snapshot_defer_time seconds.
  GetOldestXmin() delays its result to the oldest entry in this ring buffer.
  If we are asked to import a snapshot that is older than current_time -
  global_snapshot_defer_time, we just error out with a "global snapshot too
  old" message. Otherwise, we have enough info to set a proper xmin in our
  proc entry to defuse GetOldestXmin().
* The following routines for commit are provided:
  * pg_global_snaphot_prepare(gid) sets the InDoubt state for the given
    global tx and returns the proposed GlobalCSN.
  * pg_global_snaphot_assign(gid, global_csn) assigns the given global_csn to
    the given global tx. The subsequent COMMIT PREPARED will use it.
The import/export and commit routines are implemented as SQL functions. IMO
it would be better to turn them into IMPORT GLOBAL SNAPSHOT / EXPORT GLOBAL
SNAPSHOT / PREPARE GLOBAL / COMMIT GLOBAL PREPARED statements, but at this
stage I don't want to clutter this patch with new grammar.
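For clarity, here is a sketch of the check order described in the
XidInMVCCSnapshot() bullet above, with stand-in helpers instead of the real
ProcArray/SLRU machinery (all names here are invented; the real code of
course works on PostgreSQL's Snapshot struct and the GlobalCSNLog SLRU):

/* Sketch only: the visibility check order, not the patch code. */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t GlobalCSN;

#define FrozenGlobalCSN   0          /* stand-in: no CSN recorded yet */
#define InDoubtGlobalCSN  UINT64_MAX /* stand-in: commit outcome unknown */

/* Stand-in for the GlobalCSNLog SLRU lookup. */
static GlobalCSN
global_csn_log_get(TransactionId xid)
{
    (void) xid;
    return FrozenGlobalCSN;
}

/* Stand-in for waiting until an InDoubt transaction finishes. */
static GlobalCSN
wait_for_in_doubt(TransactionId xid)
{
    return global_csn_log_get(xid);
}

static bool
xid_visible_by_csn(TransactionId xid, GlobalCSN snapshot_csn)
{
    GlobalCSN csn = global_csn_log_get(xid);

    if (csn == InDoubtGlobalCSN)
        csn = wait_for_in_doubt(xid);          /* wait, then recheck */
    return csn != FrozenGlobalCSN && csn < snapshot_csn;
}

/*
 * xip_says_running: result of the ordinary XIP-array check;
 * imported: snapshot was imported from another node.
 */
static bool
xid_visible(TransactionId xid, bool xip_says_running, bool imported,
            GlobalCSN snapshot_csn)
{
    if (imported)
        return xid_visible_by_csn(xid, snapshot_csn);  /* CSN rules only */

    if (!xip_says_running)
        return true;            /* ordinary local case: no SLRU access */

    /* In-progress per XIP: may be a global InDoubt transaction, recheck. */
    return xid_visible_by_csn(xid, snapshot_csn);
}

int main(void) { return !xid_visible(1000, false, false, 42); }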
Patch 0003-postgres_fdw-support-for-global-snapshots uses the previous
infrastructure to achieve isolation for the generated transactions. Right now
it is a minimalistic implementation of 2PC like the one in [7], but it
doesn't care about writing anything about remote prepares to WAL on the
coordinator and doesn't care about any recovery. Its main purpose is to test
things in the global snapshot patch, as that is easier to do with TAP tests
over several postgres_fdw-connected instances. Tests are included.
Usage
-----
A distributed transaction can be coordinated by an external coordinator. In
this case the normal scenario would be the following:
1) Start a transaction on the first node, do some queries if needed, call
   pg_global_snaphot_export().
2) On the other node open a transaction and call
   pg_global_snaphot_import(global_csn), where global_csn is from the
   previous export.
   ... do some useful work ...
3) Issue PREPARE TRANSACTION on all participants.
4) Call pg_global_snaphot_prepare(gid) on all participants and store the
   returned global_csn's; select the maximum of the returned global_csn's.
5) Call pg_global_snaphot_assign(gid, max_global_csn) on all participants.
6) Issue COMMIT PREPARED on all participants.
As was said earlier, steps 4) and 5) can be melded into 3) and 6)
respectively, but let's go for grammar changes only after (and if) there is
agreement on the overall concept.
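For concreteness, below is a sketch of such an external coordinator written
against libpq. It is only an illustration: error handling is omitted, the
connection strings are placeholders, and it assumes the pg_global_snaphot_*
functions accept and return the global_csn as a plain number, as described
above.

/* Sketch of an external coordinator for two nodes (illustration only). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

static char *
scalar(PGconn *c, const char *sql)     /* run a query, return first column */
{
    PGresult *r = PQexec(c, sql);
    char *v = strdup(PQgetvalue(r, 0, 0));
    PQclear(r);
    return v;
}

int main(void)
{
    PGconn *a = PQconnectdb("dbname=postgres host=node-a");  /* placeholder */
    PGconn *b = PQconnectdb("dbname=postgres host=node-b");  /* placeholder */
    char sql[256];

    /* 1) and 2): share one global snapshot between the two transactions */
    PQclear(PQexec(a, "BEGIN ISOLATION LEVEL REPEATABLE READ"));
    char *snap = scalar(a, "SELECT pg_global_snaphot_export()");
    PQclear(PQexec(b, "BEGIN ISOLATION LEVEL REPEATABLE READ"));
    snprintf(sql, sizeof(sql), "SELECT pg_global_snaphot_import(%s)", snap);
    PQclear(PQexec(b, sql));

    /* ... do some useful work on both connections ... */

    /* 3) PREPARE everywhere */
    PQclear(PQexec(a, "PREPARE TRANSACTION 'gtx1'"));
    PQclear(PQexec(b, "PREPARE TRANSACTION 'gtx1'"));

    /* 4) collect proposed CSNs and take the maximum */
    char *csn_a = scalar(a, "SELECT pg_global_snaphot_prepare('gtx1')");
    char *csn_b = scalar(b, "SELECT pg_global_snaphot_prepare('gtx1')");
    const char *max_csn =
        strtoull(csn_a, NULL, 10) >= strtoull(csn_b, NULL, 10) ? csn_a : csn_b;

    /* 5) assign the chosen CSN on all participants */
    snprintf(sql, sizeof(sql),
             "SELECT pg_global_snaphot_assign('gtx1', %s)", max_csn);
    PQclear(PQexec(a, sql));
    PQclear(PQexec(b, sql));

    /* 6) COMMIT PREPARED everywhere */
    PQclear(PQexec(a, "COMMIT PREPARED 'gtx1'"));
    PQclear(PQexec(b, "COMMIT PREPARED 'gtx1'"));

    PQfinish(a);
    PQfinish(b);
    free(snap); free(csn_a); free(csn_b);
    return 0;
}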
postgres_fdw in 003-GlobalSnapshotsForPostgresFdw does the same thing, but
transparently to the user.
Tests
-----
Each XidInMVCCSnapshot() call in this patch is wrapped with an assert that
checks that the same things are visible under the XIP-based check and the
GlobalCSN-based one.
Patch 003-GlobalSnapshotsForPostgresFdw includes numerous variants of a
banking test. It spins off several Postgres instances and several concurrent
pgbench runs which simulate cross-node bank account transfers, while the test
constantly checks that the total balance is correct. There are also tests
that check that an imported snapshot sees the same checksum of the data as it
did during import.
I think this feature also deserves a separate test module that wobbles the
clock time, but that isn't done yet.
Attributions
------------
The original XTM, pg_dtm, and pg_tsdtm were written by Konstantin Knizhnik,
Constantin Pan, and me.
This version is mostly mine, with some input from Konstantin Knizhnik and
Arseny Sher.
[1] /message-id/flat/20160223164335.GA11285%40momjian.us
(originally the thread was about fdw-based sharding, but it later got "hijacked" by xtm)
[2] /message-id/flat/F2766B97-555D-424F-B29F-E0CA0F6D1D74%40postgrespro.ru
[3] /message-id/flat/56E3CAAE.6060407%40postgrespro.ru
[4] https://wiki.postgresql.org/wiki/DTM
[5] https://dl.acm.org/citation.cfm?id=2553434
[6] /message-id/flat/CA%2BCSw_tEpJ%3Dmd1zgxPkjH6CWDnTDft4gBi%3D%2BP9SnoC%2BWy3pKdA%40mail.gmail.com
[7] /message-id/flat/CAFjFpRfPoDAzf3x-fs86nDwJ4FAwn2cZ%2BxdmbdDPSChU-kt7%2BQ%40mail.gmail.com
Attachment | Content-Type | Size |
---|---|---|
0001-GlobalCSNLog-SLRU.patch | application/octet-stream | 24.0 KB |
0002-Global-snapshots.patch | application/octet-stream | 63.6 KB |
0003-postgres_fdw-support-for-global-snapshots.patch | application/octet-stream | 35.9 KB |
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-01 19:43:43 |
Message-ID: | CA+TgmoazjBL1D-hSV-pMYK=g9L-Zc3v0Gh7TiAhMYRzo6nXdSg@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 1, 2018 at 12:27 PM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
> Here proposed a set of patches that allow achieving proper snapshot isolation
> semantics in the case of cross-node transactions. Provided infrastructure to
> synchronize snapshots between different Postgres instances and a way to
> atomically commit such transaction with respect to other concurrent global and
> local transactions. Such global transactions can be coordinated outside of
> Postgres by using provided SQL functions or through postgres_fdw, which make use
> of this functions on remote nodes transparently.
I'm concerned about the provisioning aspect of this problem. Suppose
I have two existing database systems with, perhaps, wildly different
XID counters. On a certain date, I want to start using this system.
Or conversely, I have two systems that are bonded together using this
system from the beginning, and then, as of a certain date, I want to
break them apart into two standalone systems. In your proposed
design, are things like this possible? Can you describe the setup
involved?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-01 21:02:47 |
Message-ID: | 5A30884C-F446-40FD-9D86-F4046F29F9F6@postgrespro.ru |
Lists: | pgsql-hackers |
> On 1 May 2018, at 22:43, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> I'm concerned about the provisioning aspect of this problem. Suppose
> I have two existing database systems with, perhaps, wildly different
> XID counters. On a certain date, I want to start using this system.
Yes, that is totally possible. On both systems you need to:
* set track_global_snapshots='on' -- this will start writing each
transaction's commit sequence number to the SLRU.
* set global_snapshot_defer_time to 30 seconds, for example -- this
will delay oldestXmin advancement for the specified amount of time,
preserving old tuples.
* restart the database
* optionally enable NTPd if it wasn't enabled.
It is also possible to avoid the restart, but that will require some careful
work: after enabling track_global_snapshots it will be safe to start global
transactions only when all concurrently running transactions have finished.
A more or less equivalent thing happens during logical slot creation.
> Or conversely, I have two systems that are bonded together using this
> system from the beginning, and then, as of a certain date, I want to
> break them apart into two standalone systems. In your proposed
> design, are things like this possible? Can you describe the setup
> involved?
Well, they are not actually "bonded" in any persistent way. If there are no
distributed transactions, there will be no logical or physical connection
between those nodes.
And returning to your original concern about "wildly different XID counters",
I want to emphasise that the only thing that floats between nodes is
GlobalCSNs, during the start and commit of a distributed transaction. And
that GlobalCSN is actually the timestamp of the commit, the real one, from
clock_gettime(). And clock time is supposedly more or less the same on
different nodes under normal conditions. But correctness here does not depend
on the degree of clock synchronisation, only the performance of global
transactions does.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-02 02:58:37 |
Message-ID: | 0cec343f-a50c-a40c-a299-a4043960e26c@2ndquadrant.com |
Lists: | pgsql-hackers |
On 5/1/18 12:27, Stas Kelvich wrote:
> Clock-SI is described in [5] and here I provide a small overview, which
> supposedly should be enough to catch the idea. Assume that each node runs Commit
> Sequence Number (CSN) based visibility: database tracks one counter for each
> transaction start (xid) and another counter for each transaction commit (csn).
> In such setting, a snapshot is just a single number -- a copy of current CSN at
> the moment when the snapshot was taken. Visibility rules are boiled down to
> checking whether current tuple's CSN is less than our snapshot's csn. Also it
> worth of mentioning that for the last 5 years there is an active proposal to
> switch Postgres to CSN-based visibility [6].
But that proposal has so far not succeeded. How are you overcoming the
reasons for that?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-02 07:47:45 |
Message-ID: | C1A2A06C-BAF6-4DCE-BBDC-5CF68718B864@postgrespro.ru |
Lists: | pgsql-hackers |
> On 2 May 2018, at 05:58, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>
> On 5/1/18 12:27, Stas Kelvich wrote:
>> Clock-SI is described in [5] and here I provide a small overview, which
>> supposedly should be enough to catch the idea. Assume that each node runs Commit
>> Sequence Number (CSN) based visibility: database tracks one counter for each
>> transaction start (xid) and another counter for each transaction commit (csn).
>> In such setting, a snapshot is just a single number -- a copy of current CSN at
>> the moment when the snapshot was taken. Visibility rules are boiled down to
>> checking whether current tuple's CSN is less than our snapshot's csn. Also it
>> worth of mentioning that for the last 5 years there is an active proposal to
>> switch Postgres to CSN-based visibility [6].
>
> But that proposal has so far not succeeded. How are you overcoming the
> reasons for that?
Well, the CSN proposal aims to switch all Postgres visibility machinery to
CSNs. That proposal is far more ambitious; here the original Postgres
visibility, with snapshots being arrays of XIDs, is preserved. In this patch
CSNs are written to the SLRU during commit (in a way similar to commit_ts)
but are read only in two cases:
1) When a local transaction faces an XID that is in progress according to the
XIP-based snapshot, its CSN needs to be checked, as it may already be
InDoubt. XIDs that are viewed as committed don't need that check (in [6] they
also need to be checked through the SLRU).
2) If we are in a backend that imported a global snapshot, then only
CSN-based visibility can be used. But that happens only for global
transactions.
So I hope that local transaction performance will be affected only by the
in-progress check, and there are ways to circumvent this check.
Also, all of this behaviour is optional and can be switched off by not
enabling track_global_snapshots.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-03 15:28:52 |
Message-ID: | CAD21AoBph=12dbyH6KJJ--UkBKtzugT3iJb8CPL5fY7tdpLW8g@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, May 2, 2018 at 1:27 AM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
> 1) To achieve commit atomicity of different nodes intermediate step is
> introduced: at first running transaction is marked as InDoubt on all nodes,
> and only after that each node commit it and stamps with a given GlobalCSN.
> All readers who ran into tuples of an InDoubt transaction should wait until
> it ends and recheck visibility.
I'm concerned that a long-running transaction could keep other transactions
waiting and then the system gets stuck. Can this happen? And is there any
workaround?
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-03 17:11:26 |
Message-ID: | EDC74981-C4DD-4FA9-9E89-DFD659117149@postgrespro.ru |
Lists: | pgsql-hackers |
> On 3 May 2018, at 18:28, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, May 2, 2018 at 1:27 AM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
>> 1) To achieve commit atomicity of different nodes intermediate step is
>> introduced: at first running transaction is marked as InDoubt on all nodes,
>> and only after that each node commit it and stamps with a given GlobalCSN.
>> All readers who ran into tuples of an InDoubt transaction should wait until
>> it ends and recheck visibility.
>
> I'm concerned that long-running transaction could keep other
> transactions waiting and then the system gets stuck. Can this happen?
> and is there any workaround?
The InDoubt state is set just before commit, so it is a short-lived state.
During transaction execution a global tx looks like an ordinary running
transaction. A failure during 2PC in which the coordinator is not able to
commit/abort the prepared transaction can result in a situation where InDoubt
tuples stay locked for reading, but in such a situation the coordinator
should be blamed. The same problems arise if a prepared transaction holds
locks, for example.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-04 19:09:19 |
Message-ID: | CA+Tgmob=fJo-pwGsnQT+PBoHyhNM4Giw3y5LXBm0YLhWvKEm1g@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 1, 2018 at 5:02 PM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
> Yes, that totally possible. On both systems you need:
Cool.
> * set track_global_snapshots='on' -- this will start writing each
> transaction commit sequence number to SRLU.
> * set global_snapshot_defer_time to 30 seconds, for example -- this
> will delay oldestXmin advancement for specified amount of time,
> preserving old tuples.
So, is the idea that we'll definitely find out about any remote
transactions within 30 seconds, and then after we know about remote
transactions, we'll hold back OldestXmin some other way?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-06 10:22:57 |
Message-ID: | D963AFAB-1FA5-4FE7-A5A2-AC37D728564C@postgrespro.ru |
Lists: | pgsql-hackers |
> On 4 May 2018, at 22:09, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> So, is the idea that we'll definitely find out about any remote
> transactions within 30 seconds, and then after we know about remote
> transactions, we'll hold back OldestXmin some other way?
Yes, kind of. There is a procArray->global_snapshot_xmin variable which acts
as a barrier to xmin calculations in GetOldestXmin and GetSnapshotData, when
set.
Also, each second GetSnapshotData writes globalxmin (as it was before
procArray->global_snapshot_xmin was taken into account) into a circular
buffer with a size equal to the global_snapshot_defer_time value. That is
more or less the same thing as in the Snapshot Too Old feature, but with a
bucket size of 1 second instead of 1 minute.
procArray->global_snapshot_xmin is always set to the oldest value in the
circular buffer.
This way the xmin calculation always gives a value as it was
global_snapshot_defer_time seconds ago, and we have a mapping from time (or
GlobalCSN) to globalxmin for each second in this range. So when some backend
imports a global snapshot with some GlobalCSN, that GlobalCSN is mapped to an
xmin and this xmin is set as Proc->xmin.
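To illustrate, here is a toy model of that circular buffer. This is not the
patch code; the size and names are invented, the time unit is seconds, and
xid wraparound is ignored:

/* Toy model of the once-per-second xmin ring buffer described above. */
#include <stdint.h>

typedef uint32_t TransactionId;

#define DEFER_TIME 30                     /* global_snapshot_defer_time, sec */

typedef struct
{
    TransactionId xmin[DEFER_TIME];       /* one slot per second */
    int64_t       last_tick;              /* unix time of the last update */
} XminRing;

/* Called roughly once a second with the freshly computed globalxmin. */
static void
ring_update(XminRing *ring, int64_t now, TransactionId globalxmin)
{
    ring->xmin[now % DEFER_TIME] = globalxmin;
    ring->last_tick = now;
}

/* Barrier used by GetOldestXmin()/GetSnapshotData(): the oldest slot. */
static TransactionId
ring_global_snapshot_xmin(const XminRing *ring)
{
    TransactionId oldest = ring->xmin[0];
    for (int i = 1; i < DEFER_TIME; i++)
        if (ring->xmin[i] < oldest)       /* ignoring xid wraparound here */
            oldest = ring->xmin[i];
    return oldest;
}

/* Map an imported GlobalCSN (seconds granularity) to a proc xmin. */
static TransactionId
ring_xmin_for_snapshot(const XminRing *ring, int64_t snapshot_time)
{
    if (ring->last_tick - snapshot_time >= DEFER_TIME)
        return 0;                         /* "global snapshot too old" */
    return ring->xmin[snapshot_time % DEFER_TIME];
}

int main(void)
{
    XminRing ring = {{0}, 0};
    for (int64_t t = 1000; t < 1000 + DEFER_TIME; t++)
        ring_update(&ring, t, (TransactionId)(t - 900));
    (void) ring_global_snapshot_xmin(&ring);
    return ring_xmin_for_snapshot(&ring, 1010) == 0;   /* expect nonzero */
}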
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-07 17:04:18 |
Message-ID: | CA+TgmoZfLj=nMWBOvELzCz-97YX4KrR_P3MZKHvgJstNDG9uNA@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, May 6, 2018 at 6:22 AM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
> Also each second GetSnapshotData writes globalxmin (as it was before
> procArray->global_snapshot_xmin was taken into account) into a circle
> buffer with a size equal to global_snapshot_defer_time value. That more
> or less the same thing as with Snapshot Too Old feature, but with a
> bucket size of 1 second instead of 1 minute.
> procArray->global_snapshot_xmin is always set to oldest
> value in circle buffer.
>
> This way xmin calculation is always gives a value that were
> global_snapshot_xmin seconds ago and we have mapping from time (or
> GlobalCSN) to globalxmin for each second in this range. So when
> some backends imports global snapshot with some GlobalCSN, that
> GlobalCSN is mapped to a xmin and this xmin is set as a Proc->xmin.
But what happens if a transaction starts on node A at time T0 but
first touches node B at a much later time T1, such that T1 - T0 >
global_snapshot_defer_time?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-08 20:51:01 |
Message-ID: | B3B6D908-206E-4D54-AA25-B272B1DD64B6@postgrespro.ru |
Lists: | pgsql-hackers |
> On 7 May 2018, at 20:04, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> But what happens if a transaction starts on node A at time T0 but
> first touches node B at a much later time T1, such that T1 - T0 >
> global_snapshot_defer_time?
>
Such a transaction will get a "global snapshot too old" error.
In principle such behaviour can be avoided by calculating the oldest global
csn among all cluster nodes, so that the oldest xmin on a particular node is
held back only when there is some open old transaction on another node. That
is easy to do from the global snapshot point of view, but it's not obvious
how to integrate it into postgres_fdw. Probably that will require a
bi-directional connection between postgres_fdw nodes (distributed deadlock
detection would also be easy with such a connection).
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-09 14:51:25 |
Message-ID: | CA+TgmoZHQtCAd+QNg0DXKS0RCqALz9Cc7mXkp7+r45j-U2XM7Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 8, 2018 at 4:51 PM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
>> On 7 May 2018, at 20:04, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> But what happens if a transaction starts on node A at time T0 but
>> first touches node B at a much later time T1, such that T1 - T0 >
>> global_snapshot_defer_time?
>
> Such transaction will get "global snapshot too old" error.
Ouch. That's not so bad at READ COMMITTED, but at higher isolation
levels failure becomes extremely likely. Any multi-statement
transaction that lasts longer than global_snapshot_defer_time is
pretty much doomed.
> In principle such behaviour can be avoided by calculating oldest
> global csn among all cluster nodes and oldest xmin on particular
> node will be held only when there is some open old transaction on
> other node. It's easy to do from global snapshot point of view,
> but it's not obvious how to integrate that into postgres_fdw. Probably
> that will require bi-derectional connection between postgres_fdw nodes
> (also distributed deadlock detection will be easy with such connection).
I don't think holding back xmin is a very good strategy. Maybe it
won't be so bad if and when we get zheap, since only the undo log will
bloat rather than the table. But as it stands, holding back xmin
means everything bloats and you have to CLUSTER or VACUUM FULL the
table in order to fix it.
If the behavior were really analogous to our existing "snapshot too
old" feature, it wouldn't be so bad. Old snapshots continue to work
without error so long as they only read unmodified data, and only
error out if they hit modified pages. SERIALIZABLE works according to
a similar principle: it worries about data that is written by one
transaction and read by another, but if there's a portion of the data
that is only read and not written, or at least not written by any
transactions that were active around the same time, then it's fine.
While the details aren't really clear to me, I'm inclined to think
that any solution we adopt for global snapshots ought to leverage this
same principle in some way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-11 01:05:44 |
Message-ID: | CAD21AoAp4DWQUCBH5Gp1XHe22ivjxCkjtEc-=0BFCfUzJz=7ag@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 4, 2018 at 2:11 AM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
>
>
>> On 3 May 2018, at 18:28, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>
>> On Wed, May 2, 2018 at 1:27 AM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
>>> 1) To achieve commit atomicity of different nodes intermediate step is
>>> introduced: at first running transaction is marked as InDoubt on all nodes,
>>> and only after that each node commit it and stamps with a given GlobalCSN.
>>> All readers who ran into tuples of an InDoubt transaction should wait until
>>> it ends and recheck visibility.
>>
>> I'm concerned that long-running transaction could keep other
>> transactions waiting and then the system gets stuck. Can this happen?
>> and is there any workaround?
>
> InDoubt state is set just before commit, so it is short-lived state.
> During transaction execution global tx looks like an ordinary running
> transaction. Failure during 2PC with coordinator not being able to
> commit/abort this prepared transaction can result in situation where
> InDoubt tuples will be locked for reading, but in such situation
> coordinator should be blamed. Same problems will arise if prepared
> transaction holds locks, for example.
Thank you for the explanation! I understood the algorithm. I have two
questions.
If I understand correctly, simple writes with ordinary 2PC don't block
readers of those writes. For example, an insertion on a node doesn't block
readers who read the inserted tuple. But with this global transaction, reads
will be blocked while the global transaction is in the InDoubt state. Is that
right? The InDoubt state would be a short-lived state for a local
transaction, but I'm not sure about a global transaction. Because during the
InDoubt state the coordinator has to prepare on all participant nodes and
assign the global csn to them (and end the global transaction), the global
transaction could be in the InDoubt state for a relatively long time. Also,
it could be even longer if the commit/rollback prepared is never performed
due to a failure of any of those nodes.
With this patch, can we start a remote transaction at READ COMMITTED with an
imported global snapshot if the local transaction started at READ COMMITTED?
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-14 11:20:59 |
Message-ID: | 25966165-21FE-4C9C-A5DD-4E31D82487FB@postgrespro.ru |
Lists: | pgsql-hackers |
> On 9 May 2018, at 17:51, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Ouch. That's not so bad at READ COMMITTED, but at higher isolation
> levels failure becomes extremely likely. Any multi-statement
> transaction that lasts longer than global_snapshot_defer_time is
> pretty much doomed.
Ouch indeed. The current xmin holding scheme has two major drawbacks: it
introduces a timeout between export/import of a snapshot and holds xmin in a
pessimistic way, so old versions will be preserved even when there were no
global transactions. On the positive side is simplicity: that is the only way
I can think of that doesn't require a distributed calculation of the global
xmin, which, in turn, will probably require a permanent connection to the
remote postgres_fdw node. It is not hard to add some background worker to
postgres_fdw that holds a permanent connection, but I'm afraid that it is a
very discussion-prone topic and that's why I tried to avoid it.
> I don't think holding back xmin is a very good strategy. Maybe it
> won't be so bad if and when we get zheap, since only the undo log will
> bloat rather than the table. But as it stands, holding back xmin
> means everything bloats and you have to CLUSTER or VACUUM FULL the
> table in order to fix it.
Well, an open local transaction in Postgres holds the globalXmin for the
whole Postgres instance (with the exception of STO). Likewise, an active
global transaction should hold the globalXmin of participating nodes to be
able to read the right versions (again, with the exception of STO).
However, the xmin holding scheme itself can be different. For example, we can
periodically check (let's say every 1-2 seconds) the oldest GlobalCSN on each
node and delay globalXmin advancement only if some long transaction really
exists. So the period of bloat will be limited by those 1-2 seconds, and it
will not impose a timeout between export/import.
Also, I want to note that global_snapshot_defer_time values of about tens of
seconds don't change much in terms of bloat compared to logical replication.
An active logical slot holds the globalXmin by setting replication_slot_xmin,
which is advanced on every RunningXacts record, which in turn is logged every
15 seconds (hardcoded in LOG_SNAPSHOT_INTERVAL_MS).
> If the behavior were really analogous to our existing "snapshot too
> old" feature, it wouldn't be so bad. Old snapshots continue to work
> without error so long as they only read unmodified data, and only
> error out if they hit modified pages.
That is actually a good idea that I missed, thanks. Since all the logic for
checking modified pages is already present, it is possible to reuse it and
not raise the "Global STO" error right when an old snapshot is imported, but
only when the global transaction reads a modified page. I will implement that
and update the patch set.
Summarising, I think that introducing some permanent connections to the
postgres_fdw node would put too much burden on this patch set and that it
will be possible to address that later (in the long run such a connection
will be needed anyway, at least for deadlock detection). However, if you
think that the current behavior + the STO analog isn't good enough, then I'm
ready to pursue that track.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-14 12:39:10 |
Message-ID: | BB9A190A-7A8C-45B0-88DE-E385CFBD0479@postgrespro.ru |
Lists: | pgsql-hackers |
> On 11 May 2018, at 04:05, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> If I understand correctly, simple writes with ordinary 2PC doesn't
> block read who reads that writes. For example, an insertion on a node
> doesn't block readers who reads the inserted tuple. But in this global
> transaction, the read will be blocked during the global transaction is
> InDoubt state. Is that right? InDoubt state will be short-live state
> if it's local transaction but I'm not sure in global transaction.
> Because during InDoubt state the coordinator has to prepare on all
> participant nodes and to assign the global csn to them (and end global
> transaction) the global transaction could be in InDoubt state for a
> relatively long time.
What I meant by "short-lived" is that InDoubt is set after the transaction is
prepared, so it doesn't depend on the size of the transaction, only on
network/commit latency. So you can have a transaction that did a bulk load
for several hours and still the InDoubt state will last only for a network
round-trip and possibly an fsync that can happen during logging of the commit
record.
> Also, it could be more longer if the
> commit/rollback prepared never be performed due to a failure of any
> nodes of them.
In this case it definitely can. An individual node cannot know whether that
InDoubt transaction was committed or aborted somewhere, so it should wait
until somebody finishes this tx. The particular time to recover depends on
how failures are handled.
Speaking more generally, in the presence of failures we can unlock tuples
only when consensus on the transaction commit is reached. And the FLP theorem
states that this can take an indefinitely long time in a fully asynchronous
network. However, from a more practical point of view, the probability of
such behaviour in a real network becomes negligible after several iterations
of the voting process (some evaluations can be found in
https://ieeexplore.ieee.org/document/1352999/).
So several round-trips can be a decent approximation of how long it should
take to recover from the InDoubt state in case of failure.
> With this patch, can we start a remote transaction at READ COMMITTED
> with imported a global snapshot if the local transaction started at
> READ COMMITTED?
In theory it is possible; one just needs to send a new snapshot before each
statement. With some amount of careful work it is possible to achieve READ
COMMITTED with postgres_fdw using global snapshots.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-15 12:53:52 |
Message-ID: | CA+Tgmobcd5u4nj-BXFsMx7PrhscHho1z9926y1Xz_og8V1hi2A@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, May 14, 2018 at 7:20 AM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
> Summarising, I think, that introducing some permanent connections to
> postgres_fdw node will put too much burden on this patch set and that it will
> be possible to address that later (in a long run such connection will be anyway
> needed at least for a deadlock detection). However, if you think that current
> behavior + STO analog isn't good enough, then I'm ready to pursue that track.
I don't think I'd be willing to commit to a particular approach at
this point. I think the STO analog is an interesting idea and worth
more investigation, and I think the idea of a permanent connection
with chatter that can be used to resolve deadlocks, coordinate shared
state, etc. is also an interesting idea. But there are probably lots
of ideas that somebody could come up with in this area that would
sound interesting but ultimately not work out. Also, an awful lot
depends on quality of implementation. If you come up with an
implementation of a permanent connection for coordination "chatter",
and the patch gets rejected, it's almost certainly not a sign that we
don't want that thing in general. It means we don't want yours. :-)
Actually, I think if we're going to pursue that approach, we ought to
back off a bit from thinking about global snapshots and think about
what kind of general mechanism we want. For example, maybe you can
imagine it like a message bus, where there are a bunch of named
channels on which the server publishes messages and you can listen to
the ones you care about. There could, for example, be a channel that
publishes the new system-wide globalxmin every time it changes, and
another channel that publishes the wait graph every time the deadlock
detector runs, and so on. In fact, perhaps we should consider
implementing it using the existing LISTEN/NOTIFY framework: have a
bunch of channels that are predefined by PostgreSQL itself, and set
things up so that the server automatically begins publishing to those
channels as soon as anybody starts listening to them. I have to
imagine that if we had a good mechanism for this, we'd get all sorts
of proposals for things to publish. As long as they don't impose
overhead when nobody's listening, we should be able to be fairly
accommodating of such requests.
Or maybe that model is too limiting, either because we don't want to
broadcast to everyone but rather send specific messages to specific
connections, or else because we need a request-and-response mechanism
rather than what is in some sense a one-way communication channel.
Regardless, we should start by coming up with the right model for the
protocol first, bearing in mind how it's going to be used and other
things for which somebody might want to use it (deadlock detection,
failover, leader election), and then implement whatever we need for
global snapshots on top of it. I don't think that writing the code
here is going to be hugely difficult, but coming up with a good design
is going to require some thought and discussion.
And, for that matter, I think the same thing is true for global
snapshots. The coding is a lot harder for that than it is for some
new subprotocol, I'd imagine, but it's still easier than coming up
with a good design. As far as I can see, and everybody can decide for
themselves how far they think that is, the proposal you're making now
sounds like a significant improvement over the XTM proposal. In
particular, the provisioning and deprovisioning issues sound like they
have been thought through a lot more. I'm happy to call that
progress. At the same time, progress on a journey is not synonymous
with arrival at the destination, and I guess it seems to me that you
have some further research to do along the lines you've described:
1. Can we hold back xmin only when necessary and to the extent
necessary instead of all the time?
2. Can we use something like an STO analog, maybe as an optional
feature, rather than actually holding back xmin?
And I'd add:
3. Is there another approach altogether that doesn't rely on holding
back xmin at all?
For example, if you constructed the happens-after graph between
transactions in shared memory, including actions on all nodes, and
looked for cycles, you could abort transactions that would complete a
cycle. (We say A happens-after B if A reads or writes data previously
written by B.) If no cycle exists then all is well. I'm pretty sure
it's been well-established that a naive implementation of this
algorithm is terribly unperformant, but for example SSI works on this
principle. It reduces the bookkeeping involved by being willing to
abort transactions that aren't really creating a cycle if they look
like they *might* create a cycle. Now that's an implementation *on
top of* snapshots for the purpose of getting true serializability
rather than a way of getting global snapshots per se, so it's not
suitable for what you're trying do here, but I think it shows that
algorithms based on cycle detection can be made practical in some
cases, and so maybe this is another such case. On the other hand,
this whole line of thinking could also be a dead end...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-05-16 12:02:02 |
Message-ID: | 26E16795-5EE2-4BF0-A23A-C3E827959541@postgrespro.ru |
Lists: | pgsql-hackers |
> On 15 May 2018, at 15:53, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Actually, I think if we're going to pursue that approach, we ought to
> back off a bit from thinking about global snapshots and think about
> what kind of general mechanism we want. For example, maybe you can
> imagine it like a message bus, where there are a bunch of named
> channels on which the server publishes messages and you can listen to
> the ones you care about. There could, for example, be a channel that
> publishes the new system-wide globalxmin every time it changes, and
> another channel that publishes the wait graph every time the deadlock
> detector runs, and so on. In fact, perhaps we should consider
> implementing it using the existing LISTEN/NOTIFY framework: have a
> bunch of channels that are predefined by PostgreSQL itself, and set
> things up so that the server automatically begins publishing to those
> channels as soon as anybody starts listening to them. I have to
> imagine that if we had a good mechanism for this, we'd get all sorts
> of proposals for things to publish. As long as they don't impose
> overhead when nobody's listening, we should be able to be fairly
> accommodating of such requests.
>
> Or maybe that model is too limiting, either because we don't want to
> broadcast to everyone but rather send specific messages to specific
> connections, or else because we need a request-and-response mechanism
> rather than what is in some sense a one-way communication channel.
> Regardless, we should start by coming up with the right model for the
> protocol first, bearing in mind how it's going to be used and other
> things for which somebody might want to use it (deadlock detection,
> failover, leader election), and then implement whatever we need for
> global snapshots on top of it. I don't think that writing the code
> here is going to be hugely difficult, but coming up with a good design
> is going to require some thought and discussion.
Well, it would be cool to have some general mechanism to unreliably send
messages between Postgres instances. I was thinking about the same thing,
mostly in the context of our multimaster, where we have an arbiter bgworker
which collects 2PC responses and heartbeats from other nodes on a different
TCP port. It used to have some logic inside but evolved into just sending
messages from a shared-memory out-queue and waking backends upon message
arrival. But the necessity of managing a second port is painful and
error-prone, at least from the configuration point of view. So it would be
nice to have a more general mechanism to exchange messages via the Postgres
port, ideally with an interface like shm_mq: send some messages in one queue,
subscribe to responses in a different one. Among the other things that were
mentioned (xmin, deadlocks, elections/heartbeats), I am especially interested
in some multiplexing for postgres_fdw, to save on context switches of
individual backends while sending statements.
Talking about the model, I think it would be cool to have some primitives
like the ones provided by ZeroMQ (message push/subscribe/pop) and then
implement on top of them some more complex ones like scatter/gather.
However, that's probably a topic for a very important, but different, thread.
For the needs of global snapshots something less ambitious will be suitable.
> And, for that matter, I think the same thing is true for global
> snapshots. The coding is a lot harder for that than it is for some
> new subprotocol, I'd imagine, but it's still easier than coming up
> with a good design.
Sure. This whole global snapshot thing went through several internal
redesigns before becoming satisfactory from our standpoint. However, nothing
prevents us from doing further iterations. In this regard, it would also be
interesting to hear comments from the Postgres-XL team -- from my experience
with the XL code, these patches in core could help XL drop a lot of
visibility-related ifdefs and seriously offload the GTM. But maybe I'm
missing something.
> I guess it seems to me that you
> have some further research to do along the lines you've described:
>
> 1. Can we hold back xmin only when necessary and to the extent
> necessary instead of all the time?
> 2. Can we use something like an STO analog, maybe as an optional
> feature, rather than actually holding back xmin?
Yes, to both questions. I'll implement that and share results.
> And I'd add:
>
> 3. Is there another approach altogether that doesn't rely on holding
> back xmin at all?
And for that question I believe the answer is no. If we want to keep
MVCC-like behaviour where read transactions aren't randomly aborted, we will
need to keep old versions, regardless of whether it is a local or a global
transaction. And to keep old versions we need to hold back xmin to defuse
HOT, microvacuum, macrovacuum, visibility maps, etc. At some point we can
switch to STO-like behaviour, but that should probably be used as protection
from unusually long transactions rather than as the standard behavior.
> For example, if you constructed the happens-after graph between
> transactions in shared memory, including actions on all nodes, and
> looked for cycles, you could abort transactions that would complete a
> cycle. (We say A happens-after B if A reads or writes data previously
> written by B.) If no cycle exists then all is well.
Well, again, it seems to me that any kind of transaction scheduler that
guarantees that an RO transaction will not abort (even if it is a special
kind of RO like read-only deferrable) needs to keep old versions.
Speaking about alternative approaches, a good evaluation of algorithms can be
found in [HARD17]. The Postgres model is close to the MVCC described in that
article, and if we enable STO with a small timeout then it will be close to
the TIMESTAMP algorithm there. The results show that both MVCC and TIMESTAMP
are less performant than the CALVIN approach =) But that one is quite
different from what has been done in Postgres (and probably in all other
databases except Calvin/Fauna itself) in the last 20-30 years.
Also, looking through a bunch of articles I found that one of the first
articles about MVCC [REED78] (I thought the first was [BERN83], but actually
it references a bunch of previous articles and [REED78] is one of them) was
actually about distributed transactions and uses more or less the same
approach, with pseudo-time in their terminology, to order transactions and
assign snapshots.
[HARD17] https://dl.acm.org/citation.cfm?id=3055548
[REED78] https://dl.acm.org/citation.cfm?id=889815
[BERN83] https://dl.acm.org/citation.cfm?id=319998
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-07-25 11:35:02 |
Message-ID: | 874lgna6qh.fsf@ars-thinkpad |
Lists: | pgsql-hackers |
Hello,
I have looked through the patches and found them pretty accurate. I fixed a
lot of small issues here and there; the updated patchset is attached. But
first, some high-level notes:
* I agree that it would be cool to implement functionality like the current
  "snapshot too old": that is, abort a transaction with an old global
  snapshot only if it really attempted to touch modified data.
* I also agree with Stas that any attempts to trade oldestxmin in
  gossip between the nodes would drastically complicate this patch and
  make it discussion-prone; it would be nice first to get some feedback
  on the general approach, especially from people trying to distribute
  Postgres.
* One drawback of these patches is that only REPEATABLE READ is
  supported. For READ COMMITTED, we must export every new snapshot
  generated on the coordinator to all nodes, which is fairly easy to
  do. SERIALIZABLE will definitely require chattering between nodes,
  but that is a much less demanded isolation level (e.g. we still don't
  support it on replicas).
* Another somewhat serious issue is that there is a risk of recency
  guarantee violation. If a client starts a transaction at a node with
  lagging clocks, its snapshot might not include some recently
  committed transactions; if the client works with different nodes, she
  might not even see her own changes. CockroachDB describes at [1] how
  they and Google Spanner overcome this problem. In short, both set a
  hard limit on the maximum allowed clock skew. Spanner uses atomic
  clocks, so this skew is small and they just wait it out at the end of
  each transaction before acknowledging the client. In CockroachDB, if
  a tuple is not visible but we are unsure whether it is truly invisible
  or it's just the skew (the difference between the snapshot and the
  tuple's csn is less than the skew), the transaction is restarted with
  an advanced snapshot. This process is not infinite because the upper
  border (initial snapshot + max skew) stays the same; this is correct
  as we just want to ensure that our xact sees all the ones committed
  before it started. We could implement the same thing; a rough sketch
  of that rule is given right after these notes.
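To sketch what such a restart rule could look like on the read side (this is
not part of the attached patches; the uncertainty window, units, and names
are assumptions made purely for illustration):

/* Sketch of the "restart within the uncertainty window" rule. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t GlobalCSN;

typedef struct
{
    GlobalCSN read_csn;     /* snapshot we are currently reading with */
    GlobalCSN upper_bound;  /* initial snapshot + max allowed clock skew */
} UncertainSnapshot;

/*
 * Returns true if the transaction must be restarted with an advanced
 * snapshot: the tuple is invisible, but its CSN falls inside the
 * uncertainty window, so it may in fact have committed "before" us.
 */
static bool
must_restart(UncertainSnapshot *snap, GlobalCSN tuple_csn)
{
    if (tuple_csn < snap->read_csn)
        return false;                   /* plainly visible, no problem   */
    if (tuple_csn > snap->upper_bound)
        return false;                   /* plainly invisible, no problem */

    /* Inside the skew window: advance the snapshot and retry.  The upper
     * bound stays fixed, so the number of restarts is bounded. */
    snap->read_csn = tuple_csn + 1;
    return true;
}

int main(void)
{
    UncertainSnapshot s = { .read_csn = 100, .upper_bound = 105 };
    return must_restart(&s, 103) ? 0 : 1;   /* 103 is in the window */
}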
Now, the description of my mostly cosmetical changes:
* Don't ERROR in BroadcastStmt to allow us to handle failure manually;
* Check global_snapshot_defer_time in ImportGlobalSnapshot instead of
  failing on an assert;
* (Arguably) improved comments around locking at circular buffer
maintenance; also, don't lock procarray during global_snapshot_xmin
bump.
* s/snaphot/snapshot, other typos.
* Don't enable track_global_snapshots by default -- while handy for testing,
  it doesn't look generally good.
* Set track_global_snapshots = true in tests everywhere.
* GUC renamed from postgres_fdw.use_tsdtm to
postgres_fdw.use_global_snapshots for consistency.
* 003_bank_shared.pl test is removed. In its current shape (loading one
  node) it is useless, and if we bombard both nodes, a deadlock surely
  appears. In general, global snapshots are not needed for such a
  multimaster-like setup -- either there are no conflicts and we are
  fine, or there is a conflict, in which case we get a deadlock.
* Fix initdb failure with non-zero global_snapshot_defer_time.
* Enforce REPEATABLE READ since currently we export snap only once in
xact.
* Remove assertion that circular buffer entries are monotonic, as
GetOldestXmin *can* go backwards.
[1] https://www.cockroachlabs.com/blog/living-without-atomic-clocks/
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
0001-GlobalCSNLog-SLRU-v2.patch | text/x-diff | 24.0 KB |
0002-Global-snapshots-v2.patch | text/x-diff | 64.6 KB |
0003-postgres_fdw-support-for-global-snapshots-v2.patch | text/x-diff | 32.0 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
Cc: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-09-24 08:58:48 |
Message-ID: | 19F4097D-0525-4C4E-A40F-2C8C2B0787CF@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
I want to review this patch set, though I understand that it will probably be quite a long process.
I like the idea that with this patch set universally all postgres instances are bound into single distributed DB, even if they never heard about each other before :) This is just amazing. Or do I get something wrong?
I've got few questions:
1. If we coordinate HA-clusters with replicas, can replicas participate if their part of transaction is read-only?
2. How does InDoubt transaction behave when we add or subtract leap seconds?
Also, I could not understand some notes from Arseny:
> On 25 July 2018, at 16:35, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote:
>
> * One drawback of these patches is that only REPEATABLE READ is
> supported. For READ COMMITTED, we must export every new snapshot
> generated on coordinator to all nodes, which is fairly easy to
> do. SERIALIZABLE will definitely require chattering between nodes,
> but that's much less demanded isolevel (e.g. we still don't support
> it on replicas).
If all shards are executing transaction in SERIALIZABLE, what anomalies does it permit?
If you have transactions on server A and server B, there are transactions 1 and 2, transaction A1 is serialized before A2, but B1 is after B2, right?
Maybe we can somehow abort 1 or 2?
>
> * Another somewhat serious issue is that there is a risk of recency
> guarantee violation. If client starts transaction at node with
> lagging clocks, its snapshot might not include some recently
> committed transactions; if client works with different nodes, she
> might not even see her own changes. CockroachDB describes at [1] how
> they and Google Spanner overcome this problem. In short, both set
> hard limit on maximum allowed clock skew. Spanner uses atomic
> clocks, so this skew is small and they just wait it at the end of
> each transaction before acknowledging the client. In CockroachDB, if
> tuple is not visible but we are unsure whether it is truly invisible
> or it's just the skew (the difference between snapshot and tuple's
> csn is less than the skew), transaction is restarted with advanced
> snapshot. This process is not infinite because the upper border
> (initial snapshot + max skew) stays the same; this is correct as we
> just want to ensure that our xact sees all the committed ones before
> it started. We can implement the same thing.
I think that this situation is also covered in Clock-SI since transactions will not exit InDoubt state before we can see them. But I'm not sure, chances are that I get something wrong, I'll think more about it. I'd be happy to hear comments from Stas about this.
>
>
> * 003_bank_shared.pl test is removed. In current shape (loading one
> node) it is useless, and if we bombard both nodes, deadlock surely
> appears. In general, global snapshots are not needed for such
> multimaster-like setup -- either there are no conflicts and we are
> fine, or there is a conflict, in which case we get a deadlock.
Can we do something with this deadlock? Will placing an upper limit on time of InDoubt state fix the issue? I understand that aborting automatically is kind of dangerous...
Also, currently hanging 2pc transaction can cause a lot of headache for DBA. Can we have some kind of protection for the case when one node is gone permanently during transaction?
Thanks!
Best regards, Andrey Borodin.
From: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-09-26 15:02:47 |
Message-ID: | 87pnx0fgiw.fsf@ars-thinkpad |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hello,
Andrey Borodin <x4mmm(at)yandex-team(dot)ru> writes:
> I like the idea that with this patch set universally all postgres
> instances are bound into single distributed DB, even if they never
> heard about each other before :) This is just amazing. Or do I get
> something wrong?
Yeah, in the sense of xact visibility we can view it like this.
> I've got few questions:
> 1. If we coordinate HA-clusters with replicas, can replicas
> participate if their part of transaction is read-only?
Ok, there are several things to consider. Clock-SI as described in the
paper technically boils down to three things. The first two assume a
CSN-based implementation of MVCC where local time acts as the
CSN/snapshot source, and they impose the following additional rules:
1) When an xact expands to some node and imports its snapshot, it must
be blocked until the clocks on this node show time >= the snapshot
being imported: a node never processes xacts with a snap 'from the
future'.
2) When we choose the CSN to commit an xact with, we must read the
clocks on all the nodes that participated in it and set the CSN to the
max among the read values.
These rules ensure *integrity* of the snapshot in the face of clock
skews regardless of which node we access: that is, snapshots are stable
(no non-repeatable reads) and no xact is considered half-committed: they
prevent the situation where the snapshot sees some xact as committed on
one node and as still running on another.
(Actually, this is only true under the assumption that any distributed
xact is committed at all nodes instantly at the same time; this is
obviously not true, see 3rd point below.)
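To make these two rules a bit more concrete, here is a minimal
standalone C sketch; read_local_clock, import_global_snapshot and
choose_commit_csn are illustrative names, not functions from the
patchset:

#include <stdint.h>
#include <time.h>
#include <unistd.h>

typedef uint64_t GlobalCSN;

/* Current physical time on this node, used directly as a CSN. */
static GlobalCSN
read_local_clock(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    return (GlobalCSN) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Rule 1: never work under a snapshot 'from the future'. */
static void
import_global_snapshot(GlobalCSN snapshot_csn)
{
    while (read_local_clock() < snapshot_csn)
        usleep(100);            /* wait out the clock skew */
    /* ... install snapshot_csn as this transaction's snapshot ... */
}

/* Rule 2: the commit CSN is the max of the participants' clock readings. */
static GlobalCSN
choose_commit_csn(const GlobalCSN *node_clocks, int nnodes)
{
    GlobalCSN   csn = 0;

    for (int i = 0; i < nnodes; i++)
        if (node_clocks[i] > csn)
            csn = node_clocks[i];
    return csn;
}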
If we are speaking about e.g. traditional usage of hot standby, where a
client in one xact accesses either the primary or some standby, but not
several nodes at once, we just don't need this stuff because the usual
MVCC in Postgres already provides you a consistent snapshot. The same is
true for multimaster-like setups, where each node accepts writes but a
client still accesses a single node; if there is a write conflict (e.g.
the same row updated on different nodes), one of the xacts must be
aborted; the snapshot is still good.
However, if you really intend to read data from multiple nodes in one
xact (e.g. read the primary and then a replica), then yes, these
problems arise and Clock-SI helps with them. Honestly, though, it is
hard for me to come up with a reason why you would want to do that:
reading local data is always more efficient than visiting several
nodes. It would make sense if we could read the primary and a replica
in parallel, but that is currently impossible in core Postgres. A more
straightforward application of the patchset is sharding, where data is
split and you might need to go to several nodes in one xact to collect
the needed data.
Also, this patchset adds the core algorithm and makes use of it only in
postgres_fdw; you would need to adapt replication (the import/export
global snapshot API) to make it work there.
3) The third rule of Clock-SI deals with the following problem. A
distributed (writing to several nodes) xact doesn't commit (i.e. become
visible) instantly at all nodes. That means there is a time hole in
which we can see the xact as committed on some node and still running
on another. To mitigate this, Clock-SI adds a kind of two-phase commit
on visibility: an additional state, InDoubt, which blocks all attempts
to read this xact's changes until the xact's fate (commit/abort) is
determined.
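A rough sketch of that "two-phase commit on visibility" from the
reader's side, with purely illustrative names (lookup_xact_state,
wait_for_xact_resolution) that do not come from the patches:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t GlobalCSN;

typedef enum { XACT_RUNNING, XACT_INDOUBT, XACT_COMMITTED, XACT_ABORTED } XactState;

/* Assumed helpers: look up an xact's state/commit CSN, and block until
 * an InDoubt xact is resolved to committed or aborted. */
extern XactState lookup_xact_state(uint32_t xid, GlobalCSN *commit_csn);
extern void wait_for_xact_resolution(uint32_t xid);

static bool
xact_visible_in_snapshot(uint32_t xid, GlobalCSN snapshot_csn)
{
    GlobalCSN   commit_csn;
    XactState   state = lookup_xact_state(xid, &commit_csn);

    /* Fate not yet decided on all nodes: wait instead of guessing. */
    if (state == XACT_INDOUBT)
    {
        wait_for_xact_resolution(xid);
        state = lookup_xact_state(xid, &commit_csn);
    }

    return state == XACT_COMMITTED && commit_csn <= snapshot_csn;
}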
Unlike the previous problem, this issue exists in all replicated
setups. Even if we just have a primary streaming data to one hot
standby, xacts are not committed on both instantly, and we might observe
an xact as committed on the primary, then quickly switch to the standby
and find that the data we have just seen has disappeared. remote_apply
mode partially alleviates this problem (apparently to the degree
comfortable for most application developers) by switching positions:
with it an xact always commits on replicas earlier than on the master.
At least this guarantees that whoever wrote the xact will definitely see
it on the replica unless they drop the connection to the master before
the commit ack. Still, the problem is not fully solved: only the
addition of the InDoubt state can fix this.
While Clock-SI (and this patchset) certainly addresses the issue, as it
becomes even more serious in sharded setups (it makes it possible to see
/parts/ of transactions), there is nothing CSN- or clock-specific
here. In theory, we could implement the same two-phase commit on
visibility without switching to timestamp-based CSN MVCC.
Aside from the paper, you can have a look at the Clock-SI explanation in
these slides [1] from PGCon.
> 2. How does InDoubt transaction behave when we add or subtract leap seconds?
Good question! In Clock-SI, time can be arbitrarily desynchronized and
might go forward at arbitrary speed (e.g. clocks can be stopped), but
it must never go backwards. So if leap second correction is implemented
by doubling the duration of a certain second (as it usually seems to
be), we are fine.
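For example, the local CSN source only has to be monotonic, which can be
enforced by clamping the raw clock reading to the last value handed out;
a sketch under that assumption (monotonic_csn is not a real function
from the patches, and real code would need locking around last_csn):

#include <stdint.h>
#include <time.h>

typedef uint64_t GlobalCSN;

static GlobalCSN last_csn;      /* would be shared and lock-protected */

static GlobalCSN
monotonic_csn(void)
{
    struct timespec ts;
    GlobalCSN   now;

    clock_gettime(CLOCK_REALTIME, &ts);
    now = (GlobalCSN) ts.tv_sec * 1000000000 + ts.tv_nsec;

    /* If the system clock stepped backwards, keep returning the last
     * value until real time catches up again. */
    if (now < last_csn)
        now = last_csn;
    last_csn = now;
    return now;
}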
> Also, I could not understand some notes from Arseny:
>
>> On 25 July 2018, at 16:35, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote:
>>
>> * One drawback of these patches is that only REPEATABLE READ is
>> supported. For READ COMMITTED, we must export every new snapshot
>> generated on coordinator to all nodes, which is fairly easy to
>> do. SERIALIZABLE will definitely require chattering between nodes,
>> but that's much less demanded isolevel (e.g. we still don't support
>> it on replicas).
>
> If all shards are executing transaction in SERIALIZABLE, what anomalies does it permit?
>
> If you have transactions on server A and server B, there are
> transactions 1 and 2, transaction A1 is serialized before A2, but B1
> is after B2, right?
>
> Maybe we can somehow abort 1 or 2?
Yes, your explanation is concise and correct.
To put it another way, ensuring SERIALIZABLE in MVCC requires
tracking reads, and there is no infrastructure for doing it
globally. Classical write skew is possible: you have node A holding x
and node B holding y, initially x = y = 30 and there is a constraint x +
y > 0.
Two concurrent xacts start:
T1: x = x - 42;
T2: y = y - 42;
They don't see each other, so both commit successfully and the
constraint is violated. We need to transfer info about reads between
nodes to know when we need to abort someone.
>>
>> * Another somewhat serious issue is that there is a risk of recency
>> guarantee violation. If client starts transaction at node with
>> lagging clocks, its snapshot might not include some recently
>> committed transactions; if client works with different nodes, she
>> might not even see her own changes. CockroachDB describes at [1] how
>> they and Google Spanner overcome this problem. In short, both set
>> hard limit on maximum allowed clock skew. Spanner uses atomic
>> clocks, so this skew is small and they just wait it at the end of
>> each transaction before acknowledging the client. In CockroachDB, if
>> tuple is not visible but we are unsure whether it is truly invisible
>> or it's just the skew (the difference between snapshot and tuple's
>> csn is less than the skew), transaction is restarted with advanced
>> snapshot. This process is not infinite because the upper border
>> (initial snapshot + max skew) stays the same; this is correct as we
>> just want to ensure that our xact sees all the committed ones before
>> it started. We can implement the same thing.
> I think that this situation is also covered in Clock-SI since
> transactions will not exit InDoubt state before we can see them. But
> I'm not sure, chances are that I get something wrong, I'll think more
> about it. I'd be happy to hear comments from Stas about this.
The InDoubt state protects us from seeing an xact that is not yet
committed everywhere, but it doesn't protect us from starting an xact on
a node with lagging clocks and obtaining a plainly old snapshot. We
won't see any half-committed data with it (InDoubt covers us here), but
some recently committed xacts might not get into our old snapshot at
all.
>> * 003_bank_shared.pl test is removed. In current shape (loading one
>> node) it is useless, and if we bombard both nodes, deadlock surely
>> appears. In general, global snapshots are not needed for such
>> multimaster-like setup -- either there are no conflicts and we are
>> fine, or there is a conflict, in which case we get a deadlock.
> Can we do something with this deadlock? Will placing an upper limit on
> time of InDoubt state fix the issue? I understand that aborting
> automatically is kind of dangerous...
Sure, this is just a generalization of the basic deadlock problem to a
distributed system. To deal with it, someone must periodically collect
locks across the nodes, build a wait-for graph, check it for loops and
punish (abort) one of the loop's creators if a loop exists. InDoubt (and
global snapshots in general) is unrelated to this; we hung on a usual
row-level lock in this test. BTW, our pg_shardman extension has a
primitive deadlock detector [2], and I suppose Citus [3] also has one.
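For reference, the core of such a detector is just a cycle check over
the merged wait-for graph; here is a toy standalone C sketch (nothing in
it is taken from pg_shardman or Citus):

#include <stdbool.h>

#define MAX_XACTS 64

/* waits_for[a][b]: global xact a waits for a lock held by global xact b. */
static bool waits_for[MAX_XACTS][MAX_XACTS];

static bool
has_cycle_from(int v, bool *on_stack, bool *visited, int nxacts)
{
    visited[v] = on_stack[v] = true;
    for (int w = 0; w < nxacts; w++)
    {
        if (!waits_for[v][w])
            continue;
        if (on_stack[w])
            return true;        /* loop found: deadlock, abort someone */
        if (!visited[w] && has_cycle_from(w, on_stack, visited, nxacts))
            return true;
    }
    on_stack[v] = false;
    return false;
}

static bool
deadlock_exists(int nxacts)
{
    bool    visited[MAX_XACTS] = {false};
    bool    on_stack[MAX_XACTS] = {false};

    for (int v = 0; v < nxacts; v++)
        if (!visited[v] && has_cycle_from(v, on_stack, visited, nxacts))
            return true;
    return false;
}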
> Also, currently hanging 2pc transaction can cause a lot of headache
> for DBA. Can we have some kind of protection for the case when one
> node is gone permanently during transaction?
Oh, the subject of automatic 2PC xact resolution is a matter for
another (probably many-miles-long) thread, largely unrelated to global
snapshots/visibility. In general, the problem of distributed transaction
commit that doesn't block while a majority of nodes is alive requires
implementing a distributed consensus algorithm like Paxos or Raft. You
might also find thread [4] interesting.
Thank you for your interest in the topic!
[1] https://yadi.sk/i/qgmFeICvuRwYNA
[2] https://github.com/postgrespro/pg_shardman/blob/broadcast/pg_shardman--0.0.3.sql#L2497
[3] https://www.citusdata.com/
[4] /message-id/flat/CAFjFpRc5Eo%3DGqgQBa1F%2B_VQ-q_76B-d5-Pt0DWANT2QS24WE7w%40mail.gmail.com
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
---|---|
To: | a(dot)sher(at)postgrespro(dot)ru |
Cc: | x4mmm(at)yandex-team(dot)ru, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-11-29 15:21:21 |
Message-ID: | CA+q6zcWk-JP+ZEmqPe5Uaaqg39a8w8VueBFw+5z+Jq_iWV5TYA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers
> On Wed, Jul 25, 2018 at 1:35 PM Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote:
>
> Hello,
>
> I have looked through the patches and found them pretty accurate. I'd
> fixed a lot of small issues here and there; updated patchset is
> attached.
Hi,
Thank you for working on this patch. Unfortunately, the patch has some
conflicts, could you please rebase it? Also I wonder if you or Stas can shed
some light on this:
> On Wed, May 16, 2018 at 2:02 PM Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
> > On 15 May 2018, at 15:53, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > I guess it seems to me that you
> > have some further research to do along the lines you've described:
> >
> > 1. Can we hold back xmin only when necessary and to the extent
> > necessary instead of all the time?
> > 2. Can we use something like an STO analog, maybe as an optional
> > feature, rather than actually holding back xmin?
>
> Yes, to both questions. I'll implement that and share results.
Is there any resulting patch where the ideas how to implement this are outlined?
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
Cc: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>, x4mmm(at)yandex-team(dot)ru, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2018-11-30 13:00:17 |
Message-ID: | 153A5F03-F6B4-42AE-B8ED-632F7A2570C6@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 29 Nov 2018, at 18:21, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
>> On Wed, Jul 25, 2018 at 1:35 PM Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote:
>>
>> Hello,
>>
>> I have looked through the patches and found them pretty accurate. I'd
>> fixed a lot of small issues here and there; updated patchset is
>> attached.
>
> Hi,
>
> Thank you for working on this patch. Unfortunately, the patch has some
> conflicts, could you please rebase it?
Rebased onto current master (dcfdf56e89a). Also I corrected a few formatting
issues and worked around the new pgbench return codes policy in the tests.
> Also I wonder if you or Stas can shed
> some light about this:
>
>> On Wed, May 16, 2018 at 2:02 PM Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
>>> On 15 May 2018, at 15:53, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>
>>> I guess it seems to me that you
>>> have some further research to do along the lines you've described:
>>>
>>> 1. Can we hold back xmin only when necessary and to the extent
>>> necessary instead of all the time?
>>> 2. Can we use something like an STO analog, maybe as an optional
>>> feature, rather than actually holding back xmin?
>>
>> Yes, to both questions. I'll implement that and share results.
>
> Is there any resulting patch where the ideas how to implement this are outlined?
Not yet. I’m going to continue work on this in January. And probably try to
force some of nearby committers to make a line by line review.
Attachment | Content-Type | Size |
---|---|---|
0001-GlobalCSNLog-SLRU-v3.patch | application/octet-stream | 24.0 KB |
0002-Global-snapshots-v3.patch | application/octet-stream | 64.2 KB |
0003-postgres_fdw-support-for-global-snapshots-v3.patch | application/octet-stream | 32.1 KB |
unknown_filename | text/plain | 94 bytes |
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>, x4mmm(at)yandex-team(dot)ru, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2019-01-31 15:42:52 |
Message-ID: | 20190131154252.cryp7edozpupl76r@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers
Hi,
On 2018-11-30 16:00:17 +0300, Stas Kelvich wrote:
> > On 29 Nov 2018, at 18:21, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> > Is there any resulting patch where the ideas how to implement this are outlined?
>
> Not yet. I’m going to continue work on this in January. And probably try to
> force some of nearby committers to make a line by line review.
This hasn't happened yet, so I think this ought to be marked as returned
with feedback?
- Andres
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>, x4mmm(at)yandex-team(dot)ru, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2019-01-31 16:46:45 |
Message-ID: | 71DE41C2-799B-4308-91D5-CEA03122AC7C@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers
> On 31 Jan 2019, at 18:42, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2018-11-30 16:00:17 +0300, Stas Kelvich wrote:
>>> On 29 Nov 2018, at 18:21, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>>> Is there any resulting patch where the ideas how to implement this are outlined?
>>
>> Not yet. I’m going to continue work on this in January. And probably try to
>> force some of nearby committers to make a line by line review.
>
> This hasn't happened yet, so I think this ought to be marked as returned
> with feedback?
>
No objections. I don't think this will realistically go in during the last CF,
so I will open it during the next release cycle.
--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Global snapshots |
Date: | 2019-04-21 17:13:21 |
Message-ID: | CEB5C3A6-3CD1-404B-A827-12B6A87D8804@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
> On 30 Nov 2018, at 18:00, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
>
>
> <0001-GlobalCSNLog-SLRU-v3.patch><0002-Global-snapshots-v3.patch><0003-postgres_fdw-support-for-global-snapshots-v3.patch>
In view of the recent backup discussions I realized that we need to back up clusters even if they provide global snapshot capabilities.
I think we can have a pretty elegant Point-in-CSN-Recovery here, right? That is, if we want a group of clusters to recover to a globally consistent state.
Best regards, Andrey Borodin.