Re: Global snapshots

Lists:	pgsql-hackers

From:	Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-05-12 10:24:18
Message-ID:	07b2c899-4ed0-4c87-1327-23c750311248@postgrespro.ru

Rebased onto current master (fb544735f1).

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
0001-GlobalCSNLog-SLRU-v3.patch	text/x-patch	24.1 KB
0002-Global-snapshots-v3.patch	text/x-patch	63.5 KB
0003-postgres_fdw-support-for-global-snapshots-v3.patch	text/x-patch	31.9 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-09 06:41:56
Message-ID:	4d13207c-43ba-1db3-1459-e7d4bc0e47ad@oss.nttdata.com
Lists:	pgsql-hackers

On 2020/05/12 19:24, Andrey Lepikhov wrote:
> Rebased onto current master (fb544735f1).

Thanks for the patches!

These patches no longer apply cleanly and cause a compilation failure.
So could you rebase and update them?

The patches seem not to be registered in CommitFest yet.
Are you planning to do that?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-10 03:05:47
Message-ID:	9964cf46-9294-34b9-4858-971e9029f5c7@postgrespro.ru
Lists:	pgsql-hackers

On 09.06.2020 11:41, Fujii Masao wrote:
>
>
> On 2020/05/12 19:24, Andrey Lepikhov wrote:
>> Rebased onto current master (fb544735f1).
>
> Thanks for the patches!
>
> These patches are no longer applied cleanly and caused the compilation
> failure.
> So could you rebase and update them?
Rebased onto 57cb806308 (see attachment).
>
> The patches seem not to be registered in CommitFest yet.
> Are you planning to do that?
Not now. It is a sharding-related feature. I'm not sure that this
approach is fully consistent with the sharding way now.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com

Attachment	Content-Type	Size
0001-GlobalCSNLog-SLRU.patch	text/x-patch	24.1 KB
0002-Global-snapshots.patch	text/x-patch	63.5 KB
0003-postgres_fdw-support-for-global-snapshots.patch	text/x-patch	31.9 KB

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-19 06:48:30
Message-ID:	CAA4eK1Jo=1261+qEKmm69Bm9wv6fvR7kVbqDo7EFUw3vkQtC7w@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
<a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
>
>
> On 09.06.2020 11:41, Fujii Masao wrote:
> >
> >
> > The patches seem not to be registered in CommitFest yet.
> > Are you planning to do that?
> Not now. It is a sharding-related feature. I'm not sure that this
> approach is fully consistent with the sharding way now.
>

Can you please explain in detail why you think so? There is no
commit message explaining what each patch does, so it is difficult to
understand why you said so. Also, can you let us know if this
supports 2PC in some way and, if so, how it is different from what the
other thread on the same topic [1] is trying to achieve? I would also
like to know if the patch related to CSN-based snapshots [2] is a
precursor for this and, if not, whether it is in any way related to
this patch. The latest reply on that thread [2] says it is an
infrastructure for the sharding feature, but I don't completely
understand whether these patches are related.

Basically, there seem to be three threads: this one, [1], and [2].
All of them seem to be doing work for the sharding feature, but
there is no clear explanation anywhere of whether they are related or
whether, by combining all three, we are aiming for a solution for
atomic commit and atomic visibility.

I am not sure if you know the answers to all these questions, so I added
the people who seem to be working on the other two patches. I am also
afraid that there may be duplicate or conflicting work going on in
these threads, so we should try to find that as well.

[1] - /message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
[2] - /message-id/2020061911294657960322%40highgo.ca

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, movead(dot)li(at)highgo(dot)ca, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-19 08:11:59
Message-ID:	f23083b9-38d0-6126-eb6e-091816a78585@postgrespro.ru
Lists:	pgsql-hackers

On 6/19/20 11:48 AM, Amit Kapila wrote:
> On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
> <a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
>> On 09.06.2020 11:41, Fujii Masao wrote:
>>> The patches seem not to be registered in CommitFest yet.
>>> Are you planning to do that?
>> Not now. It is a sharding-related feature. I'm not sure that this
>> approach is fully consistent with the sharding way now.
> Can you please explain in detail, why you think so? There is no
> commit message explaining what each patch does so it is difficult to
> understand why you said so?
For now I have used this patch set to provide correct visibility when
a table with foreign partitions is accessed from many nodes in
parallel. So I looked at this patch set as a sharding-related feature, but
[1] shows another useful application.
The CSN-based approach has weak points such as:
1. Dependency on clock synchronization.
2. It needs guarantees that the CSN increases monotonically across an
instance restart, crash, etc.
3. We need to delay advancing OldestXmin because it can be needed
for a transaction snapshot at another node.
So I am not fully convinced that it will be better than a single
distributed transaction manager.
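
To illustrate weak point 2, here is a minimal Python sketch of a clock-based CSN source that never moves backwards. It is not code from the patch set: `CsnGenerator`, `persisted_high_water`, and the injectable `clock` are hypothetical names, and the persisted value only stands in for whatever a node would recover from durable storage after a restart.

```python
import time


class CsnGenerator:
    """Hypothetical clock-based CSN source that never goes backwards."""

    def __init__(self, persisted_high_water=0, clock=time.time_ns):
        # persisted_high_water stands in for a value recovered from durable
        # storage after a restart; how it is persisted is an assumption.
        self._last = persisted_high_water
        self._clock = clock

    def next_csn(self):
        now = self._clock()
        # If the clock stepped backwards, or is behind the recovered value,
        # fall back to last + 1 so CSNs stay strictly increasing.
        self._last = max(now, self._last + 1)
        return self._last
```

With a recovered high-water mark of 150 and a clock that reads 100, the first CSN handed out would be 151 rather than 100, so ordering is preserved even when the wall clock lags.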
> Also, can you let us know if this
> supports 2PC in some way and if so how is it different from what the
> other thread on the same topic [1] is trying to achieve?
Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
2PC machinery. For now I would not judge which approach is better.
> Also, I
> would like to know if the patch related to CSN based snapshot [2] is a
> precursor for this, if not, then is it any way related to this patch
> because I see the latest reply on that thread [2] which says it is an
> infrastructure of sharding feature but I don't understand completely
> whether these patches are related?
I need some time to study this patch. At first sight it is different.
>
> Basically, there seem to be three threads, first, this one and then
> [1] and [2] which seems to be doing the work for sharding feature but
> there is no clear explanation anywhere if these are anyway related or
> whether combining all these three we are aiming for a solution for
> atomic commit and atomic visibility.
It can be useful to study all approaches.
>
> I am not sure if you know answers to all these questions so I added
> the people who seem to be working on the other two patches. I am also
> afraid that if there is any duplicate or conflicting work going on in
> these threads so we should try to find that as well.
Ok
>
>
> [1] - /message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
> [2] - /message-id/2020061911294657960322%40highgo.ca
>

[1]
/message-id/flat/20200301083601.ews6hz5dduc3w2se%40alap3.anarazel.de

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com

From:	"movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
To:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "Amit Kapila" <amit(dot)kapila16(at)gmail(dot)com>, "Masahiko Sawada" <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	"Fujii Masao" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-19 09:03:20
Message-ID:	2020061917031834329360@highgo.ca
Lists:	pgsql-hackers

>> would like to know if the patch related to CSN based snapshot [2] is a
>> precursor for this, if not, then is it any way related to this patch
>> because I see the latest reply on that thread [2] which says it is an
>> infrastructure of sharding feature but I don't understand completely
>> whether these patches are related?
>I need some time to study this patch. At first sight it is different.

This patch [2] is mostly based on [3]. Because I think [1] is about 2PC
and FDW, this patch focuses on CSN only, and I detached the global snapshot
part and the FDW part from the [1] patch.

I noticed that the CSN does not survive a restart in the [1] patch. I think
that may not be the right way; maybe it is what the last mail meant by
"Needs guarantees of monotonically increasing of the CSN in the case of an
instance restart/crash etc", so I tried to add WAL support for CSN in this
patch.

That's why this thread exists.

> [1] - /message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
> [2] - /message-id/2020061911294657960322%40highgo.ca
[3]/message-id/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru

Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-19 13:02:57
Message-ID:	20200619130257.GA17183@momjian.us
Lists:	pgsql-hackers

On Fri, Jun 19, 2020 at 05:03:20PM +0800, movead(dot)li(at)highgo(dot)ca wrote:
>
> >> would like to know if the patch related to CSN based snapshot [2] is a
> >> precursor for this, if not, then is it any way related to this patch
> >> because I see the latest reply on that thread [2] which says it is an
> >> infrastructure of sharding feature but I don't understand completely
> >> whether these patches are related?
> >I need some time to study this patch.. At first sight it is different.
>
> This patch[2] is almost base on [3], because I think [1] is talking about 2PC
> and FDW, so this patch focus on CSN only and I detach the global snapshot
> part and FDW part from the [1] patch.
>
> I notice CSN will not survival after a restart in [1] patch, I think it may not
> the
> right way, may be it is what in last mail "Needs guarantees of monotonically
> increasing of the CSN in the case of an instance restart/crash etc" so I try to
> add wal support for CSN on this patch.
>
> That's why this thread exist.

I was certainly missing how these items fit together. Sharding needs
parallel FDWs, atomic commits, and atomic snapshots. To get atomic
snapshots, we need CSN. This new sharding wiki page has more details:

https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding

After all that is done, we will need optimizer improvements and shard
management tooling.

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc:	movead(dot)li(at)highgo(dot)ca, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-20 12:21:21
Message-ID:	CAA4eK1K8ibXN8vknTYvjz1sd+JaDDohvV9jBxSs+=1=KJz=u=w@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
<a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
>
> On 6/19/20 11:48 AM, Amit Kapila wrote:
> > On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
> > <a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
> >> On 09.06.2020 11:41, Fujii Masao wrote:
> >>> The patches seem not to be registered in CommitFest yet.
> >>> Are you planning to do that?
> >> Not now. It is a sharding-related feature. I'm not sure that this
> >> approach is fully consistent with the sharding way now.
> > Can you please explain in detail, why you think so? There is no
> > commit message explaining what each patch does so it is difficult to
> > understand why you said so?
> For now I used this patch set for providing correct visibility in the
> case of access to the table with foreign partitions from many nodes in
> parallel. So I saw at this patch set as a sharding-related feature, but
> [1] shows another useful application.
> CSN-based approach has weak points such as:
> 1. Dependency on clocks synchronization
> 2. Needs guarantees of monotonically increasing of the CSN in the case
> of an instance restart/crash etc.
> 3. We need to delay increasing of OldestXmin because it can be needed
> for a transaction snapshot at another node.
>

So, is anyone working on improving these parts of the patch? AFAICS
from what Bruce has shared [1], some people from HighGo are working on
it, but I don't see any discussion of that yet.

> So I do not have full conviction that it will be better than a single
> distributed transaction manager.
>

When you say "single distributed transaction manager" do you mean
something like pg_dtm which is inspired by Postgres-XL?

> Also, can you let us know if this
> > supports 2PC in some way and if so how is it different from what the
> > other thread on the same topic [1] is trying to achieve?
> Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> 2PC machinery. Now I'd not judge which approach is better.
>

Yeah, I have studied both approaches a little, and I feel the main
difference is that in this patch atomicity is tightly coupled
with how we achieve global visibility. Basically, in this patch "all
running transactions are marked as InDoubt on all nodes in prepare
phase, and after that, each node commit it and stamps each xid with a
given GlobalCSN.". There are no separate APIs for
prepare/commit/rollback exposed by postgres_fdw, as there are in the
approach followed by Sawada-San's patch. It seems to me that in the patch
in this email one of the postgres_fdw nodes can be a sort of coordinator
that prepares and commits the transaction on all other nodes, whereas
that is not true in Sawada-San's patch (where the coordinator is a
local Postgres node, am I right Sawada-San?). OTOH, Sawada-San's
patch has advanced concepts like a resolver process that can
commit/abort the transactions later. I still don't have a complete
grip on both patches, so it is difficult to say which is the better
approach, but I think at the least we should have some discussion.

I feel that if Sawada-San or someone involved in the other patch also
studies this approach and tries to come up with some form of comparison,
then we might be able to make a better decision. It is possible that
there are a few good things in each approach which we can use.

> Also, I
> > would like to know if the patch related to CSN based snapshot [2] is a
> > precursor for this, if not, then is it any way related to this patch
> > because I see the latest reply on that thread [2] which says it is an
> > infrastructure of sharding feature but I don't understand completely
> > whether these patches are related?
> I need some time to study this patch. At first sight it is different.
>

I feel the opposite. I think it has extracted some stuff from this
patch series and extended the same.

Thanks for the inputs. I feel inputs from you and others who were
involved in this project will be really helpful to move this project
forward.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc:	movead(dot)li(at)highgo(dot)ca, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-20 12:22:22
Message-ID:	CAA4eK1JvP+n6tQ6LktT1wKi-YLf683gRE-QpfsCZUFTDAf_LPw@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Jun 20, 2020 at 5:51 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
>
> So, is anyone working on improving these parts of the patch. AFAICS
> from what Bruce has shared [1],
>

oops, forgot to share the link [1] -
https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	"movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-20 12:24:18
Message-ID:	CAA4eK1+dav1r_dOmXNC8Gzw2+xF4_4-mpRaeSoXhmc_36wNDkA@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Jun 19, 2020 at 6:33 PM Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
> On Fri, Jun 19, 2020 at 05:03:20PM +0800, movead(dot)li(at)highgo(dot)ca wrote:
> >
> > >> would like to know if the patch related to CSN based snapshot [2] is a
> > >> precursor for this, if not, then is it any way related to this patch
> > >> because I see the latest reply on that thread [2] which says it is an
> > >> infrastructure of sharding feature but I don't understand completely
> > >> whether these patches are related?
> > >I need some time to study this patch.. At first sight it is different.
> >
> > This patch[2] is almost base on [3], because I think [1] is talking about 2PC
> > and FDW, so this patch focus on CSN only and I detach the global snapshot
> > part and FDW part from the [1] patch.
> >
> > I notice CSN will not survival after a restart in [1] patch, I think it may not
> > the
> > right way, may be it is what in last mail "Needs guarantees of monotonically
> > increasing of the CSN in the case of an instance restart/crash etc" so I try to
> > add wal support for CSN on this patch.
> >
> > That's why this thread exist.
>
> I was certainly missing how these items fit together. Sharding needs
> parallel FDWs, atomic commits, and atomic snapshots. To get atomic
> snapshots, we need CSN. This new sharding wiki pages has more details:
>
> https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding
>

Thanks for maintaining this page. It is quite helpful!

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	"movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ahsan Hadi <ahsan(dot)hadi(at)highgo(dot)ca>
Subject:	Re: Global snapshots
Date:	2020-06-22 15:00:38
Message-ID:	20200622150038.GA28999@momjian.us
Lists:	pgsql-hackers

On Sat, Jun 20, 2020 at 05:54:18PM +0530, Amit Kapila wrote:
> On Fri, Jun 19, 2020 at 6:33 PM Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> >
> > On Fri, Jun 19, 2020 at 05:03:20PM +0800, movead(dot)li(at)highgo(dot)ca wrote:
> > >
> > > >> would like to know if the patch related to CSN based snapshot [2] is a
> > > >> precursor for this, if not, then is it any way related to this patch
> > > >> because I see the latest reply on that thread [2] which says it is an
> > > >> infrastructure of sharding feature but I don't understand completely
> > > >> whether these patches are related?
> > > >I need some time to study this patch.. At first sight it is different.
> > >
> > > This patch[2] is almost base on [3], because I think [1] is talking about 2PC
> > > and FDW, so this patch focus on CSN only and I detach the global snapshot
> > > part and FDW part from the [1] patch.
> > >
> > > I notice CSN will not survival after a restart in [1] patch, I think it may not
> > > the
> > > right way, may be it is what in last mail "Needs guarantees of monotonically
> > > increasing of the CSN in the case of an instance restart/crash etc" so I try to
> > > add wal support for CSN on this patch.
> > >
> > > That's why this thread exist.
> >
> > I was certainly missing how these items fit together. Sharding needs
> > parallel FDWs, atomic commits, and atomic snapshots. To get atomic
> > snapshots, we need CSN. This new sharding wiki pages has more details:
> >
> > https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding
> >
>
> Thanks for maintaining this page. It is quite helpful!

Ahsan Hadi <ahsan(dot)hadi(at)highgo(dot)ca> created that page, and I just made a
few wording edits. Ahsan is copying information from this older
sharding wiki page:

https://wiki.postgresql.org/wiki/Built-in_Sharding

to the new one you listed above.

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-22 15:06:36
Message-ID:	20200622150636.GB28999@momjian.us
Lists:	pgsql-hackers

On Sat, Jun 20, 2020 at 05:51:21PM +0530, Amit Kapila wrote:
> I feel if Sawada-San or someone involved in another patch also once
> studies this approach and try to come up with some form of comparison
> then we might be able to make better decision. It is possible that
> there are few good things in each approach which we can use.

Agreed. Postgres-XL code is under the Postgres license:

Postgres-XL is released under the PostgreSQL License, a liberal Open
Source license, similar to the BSD or MIT licenses.

and even says they want it moved into Postgres core:

https://www.postgres-xl.org/2017/08/postgres-xl-9-5-r1-6-announced/

Postgres-XL is a massively parallel database built on top of,
and very closely compatible with PostgreSQL 9.5 and its set of advanced
features. Postgres-XL is fully open source and many parts of it will
feed back directly or indirectly into later releases of PostgreSQL, as
we begin to move towards a fully parallel sharded version of core PostgreSQL.

so we should understand what can be used from it.

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-06-23 04:12:09
Message-ID:	CAA4eK1LqkOM0SH-Xk2veY4NNNiOGz+kbAiRPLkcjuzpinzwXSg@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jun 22, 2020 at 8:36 PM Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
> On Sat, Jun 20, 2020 at 05:51:21PM +0530, Amit Kapila wrote:
> > I feel if Sawada-San or someone involved in another patch also once
> > studies this approach and try to come up with some form of comparison
> > then we might be able to make better decision. It is possible that
> > there are few good things in each approach which we can use.
>
> Agreed. Postgres-XL code is under the Postgres license:
>
> Postgres-XL is released under the PostgreSQL License, a liberal Open
> Source license, similar to the BSD or MIT licenses.
>
> and even says they want it moved into Postgres core:
>
> https://www.postgres-xl.org/2017/08/postgres-xl-9-5-r1-6-announced/
>
> Postgres-XL is a massively parallel database built on top of,
> and very closely compatible with PostgreSQL 9.5 and its set of advanced
> features. Postgres-XL is fully open source and many parts of it will
> feed back directly or indirectly into later releases of PostgreSQL, as
> we begin to move towards a fully parallel sharded version of core PostgreSQL.
>
> so we should understand what can be used from it.
>

+1. I think that will be quite useful.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-07-03 06:48:16
Message-ID:	CA+fd4k6oZtO-MFYmunHVecGaTWre8YKDNTSfX9hZhQh6Kui1kA@mail.gmail.com
Lists:	pgsql-hackers

On Sat, 20 Jun 2020 at 21:21, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
> <a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
> >
> > On 6/19/20 11:48 AM, Amit Kapila wrote:
> > > On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
> > > <a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
> > >> On 09.06.2020 11:41, Fujii Masao wrote:
> > >>> The patches seem not to be registered in CommitFest yet.
> > >>> Are you planning to do that?
> > >> Not now. It is a sharding-related feature. I'm not sure that this
> > >> approach is fully consistent with the sharding way now.
> > > Can you please explain in detail, why you think so? There is no
> > > commit message explaining what each patch does so it is difficult to
> > > understand why you said so?
> > For now I used this patch set for providing correct visibility in the
> > case of access to the table with foreign partitions from many nodes in
> > parallel. So I saw at this patch set as a sharding-related feature, but
> > [1] shows another useful application.
> > CSN-based approach has weak points such as:
> > 1. Dependency on clocks synchronization
> > 2. Needs guarantees of monotonically increasing of the CSN in the case
> > of an instance restart/crash etc.
> > 3. We need to delay increasing of OldestXmin because it can be needed
> > for a transaction snapshot at another node.
> >
>
> So, is anyone working on improving these parts of the patch. AFAICS
> from what Bruce has shared [1], some people from HighGo are working on
> it but I don't see any discussion of that yet.
>
> > So I do not have full conviction that it will be better than a single
> > distributed transaction manager.
> >
>
> When you say "single distributed transaction manager" do you mean
> something like pg_dtm which is inspired by Postgres-XL?
>
> > Also, can you let us know if this
> > > supports 2PC in some way and if so how is it different from what the
> > > other thread on the same topic [1] is trying to achieve?
> > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > 2PC machinery. Now I'd not judge which approach is better.
> >
>

Sorry for being late.

> Yeah, I have studied both the approaches a little and I feel the main
> difference seems to be that in this patch atomicity is tightly coupled
> with how we achieve global visibility, basically in this patch "all
> running transactions are marked as InDoubt on all nodes in prepare
> phase, and after that, each node commit it and stamps each xid with a
> given GlobalCSN.". There are no separate APIs for
> prepare/commit/rollback exposed by postgres_fdw as we do it in the
> approach followed by Sawada-San's patch. It seems to me in the patch
> in this email one of postgres_fdw node can be a sort of coordinator
> which prepares and commit the transaction on all other nodes whereas
> that is not true in Sawada-San's patch (where the coordinator is a
> local Postgres node, am I right Sawada-San?).

Yeah, where to manage foreign transactions is different: postgres_fdw
manages foreign transactions in this patch whereas the PostgreSQL core
does that in that 2PC patch.

>
> I feel if Sawada-San or someone involved in another patch also once
> studies this approach and try to come up with some form of comparison
> then we might be able to make better decision. It is possible that
> there are few good things in each approach which we can use.
>

I studied this patch and did a simple comparison between this patch
(0002 patch) and my 2PC patch.

In terms of atomic commit, the features that are not implemented in
this patch but in the 2PC patch are:

* Crash safety.
* PREPARE TRANSACTION command support.
* Query cancellation while waiting for the commit.
* Automatic in-doubt transaction resolution.

On the other hand, the feature that is implemented in this patch but
not in the 2PC patch is:

* Executing PREPARE TRANSACTION (and other commands) in parallel

When the 2PC patch was first proposed, IIRC it was like this patch (0002
patch). I mean, it changed only postgres_fdw to support 2PC. But after
discussion, we changed the approach to have the core manage foreign
transactions for crash safety. From my perspective, this patch has a
minimal implementation of 2PC, enough to make the global snapshot feature
work, and is missing some features important for supporting crash-safe
atomic commit. So I personally think we should consider how to integrate
this global snapshot feature with the 2PC patch, rather than improving this
patch, if we want crash-safe atomic commit.

Looking at the commit procedure with this patch:

When starting a new transaction on a foreign server, postgres_fdw
executes pg_global_snapshot_import() to import the global snapshot.
After some work, in pre-commit phase we do:

1. generate global transaction id, say 'gid'
2. execute PREPARE TRANSACTION 'gid' on all participants.
3. prepare global snapshot locally, if the local node is also involved
in the transaction
4. execute pg_global_snapshot_prepare('gid') for all participants

During steps 2 to 4, we calculate the maximum CSN from the CSNs
returned by each pg_global_snapshot_prepare() execution.

5. assign global snapshot locally, if the local node is also involved in
the transaction
6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.

Then, we commit locally (i.e., mark the current transaction as
committed in clog).

After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
participants.
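
The sequence above can be sketched as a small simulation. This is only
an illustration of the ordering and of the max-CSN calculation: the
Participant class, its methods, and the integer CSNs are hypothetical
stand-ins, not postgres_fdw code or a real API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Participant:
    """Hypothetical stand-in for one foreign server (not a real API)."""
    name: str
    clock: int                                  # this node's CSN source
    log: List[str] = field(default_factory=list)
    assigned_csn: Optional[int] = None

    def execute(self, command: str) -> None:
        self.log.append(command)

    def global_snapshot_prepare(self, gid: str) -> int:
        # Models pg_global_snapshot_prepare(): returns this node's CSN.
        self.log.append(f"pg_global_snapshot_prepare('{gid}')")
        return self.clock

    def global_snapshot_assign(self, gid: str, csn: int) -> None:
        # Models pg_global_snapshot_assign(): stores the agreed CSN.
        self.log.append(f"pg_global_snapshot_assign('{gid}', {csn})")
        self.assigned_csn = csn


def commit_global(gid: str, local_csn: int,
                  participants: List[Participant]) -> int:
    # Step 2: PREPARE TRANSACTION on every participant.
    for p in participants:
        p.execute(f"PREPARE TRANSACTION '{gid}'")
    # Steps 3-4: start from the local CSN and keep the maximum.
    max_csn = local_csn
    for p in participants:
        max_csn = max(max_csn, p.global_snapshot_prepare(gid))
    # Steps 5-6: hand the maximum back to every participant.
    for p in participants:
        p.global_snapshot_assign(gid, max_csn)
    # The local commit would happen here, then the post-commit phase.
    for p in participants:
        p.execute(f"COMMIT PREPARED '{gid}'")
    return max_csn
```

Every participant ends up with the same CSN for the transaction, which
is the property the later visibility discussion relies on.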

Considering how to integrate this global snapshot feature with the 2PC
patch, what the 2PC patch needs to change, at a minimum, is to allow
an FDW to store FDW-private data that is passed to subsequent FDW
transaction API calls. In the current 2PC patch, we call the Prepare
API for each participant server one by one, and the core passes only
metadata such as ForeignServer, UserMapping, and the global transaction
identifier. So it's not easy to calculate the maximum CSN across
multiple transaction API calls. I think we can change the 2PC patch to
add a void pointer to FdwXactRslvState, the struct passed from the
core, in order to store FDW-private data. It's going to be the maximum
CSN in this case. That way, at the first Prepare API call postgres_fdw
allocates the space and stores the CSN there. At subsequent Prepare
API calls it can calculate the maximum CSN, and then it can perform
steps 3 to 6 when preparing the transaction on the last participant.
Another idea would be to change the 2PC patch so that the core passes
a group of participants, grouped by FDW.
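
A rough model of the private-data idea, assuming the core carries an
opaque slot from one Prepare call to the next. FdwXactState,
fdw_prepare, and core_prepare_all are hypothetical names made up for
this sketch; only the fdw_private field corresponds to the proposed
void pointer.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FdwXactState:
    """Stand-in for the state the core passes to each FDW callback."""
    server: str
    gid: str
    fdw_private: Any = None     # models the proposed void pointer


def fdw_prepare(state: FdwXactState, node_csn: int) -> None:
    """Hypothetical Prepare callback: fold this node's CSN into the max."""
    if state.fdw_private is None:
        # First call: allocate the space and store the CSN.
        state.fdw_private = {"max_csn": node_csn}
    else:
        # Later calls: keep the running maximum.
        private = state.fdw_private
        private["max_csn"] = max(private["max_csn"], node_csn)


def core_prepare_all(gid: str, node_csns: Dict[str, int]) -> int:
    """Hypothetical core loop: one Prepare call per participant.

    Assumes at least one participant.
    """
    private = None
    for server, csn in node_csns.items():
        state = FdwXactState(server=server, gid=gid, fdw_private=private)
        fdw_prepare(state, csn)
        private = state.fdw_private     # carried to the next call
    return private["max_csn"]
```

After the last call the slot holds the maximum CSN, which is the point
at which steps 3 to 6 could run.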

I've not read this patch deeply yet and have considered this without
any coding, but my first impression is that it would not be hard to
integrate this feature with the 2PC patch.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-07-07 06:40:20
Message-ID:	CAA4eK1JVsMWUD4q-b+vawehFzJb6Qg0AOGq7qOGL_gm6EEhdJg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jul 3, 2020 at 12:18 PM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> On Sat, 20 Jun 2020 at 21:21, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
> > <a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
> >
> > > Also, can you let us know if this
> > > > supports 2PC in some way and if so how is it different from what the
> > > > other thread on the same topic [1] is trying to achieve?
> > > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > > 2PC machinery. Now I'd not judge which approach is better.
> > >
> >
>
> Sorry for being late.
>

No problem. Your summary and comparison of both approaches are quite
helpful.

>
> I studied this patch and did a simple comparison between this patch
> (0002 patch) and my 2PC patch.
>
> In terms of atomic commit, the features that are not implemented in
> this patch but in the 2PC patch are:
>
> * Crash safe.
> * PREPARE TRANSACTION command support.
> * Query cancel during waiting for the commit.
> * Automatically in-doubt transaction resolution.
>
> On the other hand, the feature that is implemented in this patch but
> not in the 2PC patch is:
>
> * Executing PREPARE TRANSACTION (and other commands) in parallel
>
> When the 2PC patch was proposed, IIRC it was like this patch (0002
> patch). I mean, it changed only postgres_fdw to support 2PC. But after
> discussion, we changed the approach to have the core manage foreign
> transaction for crash-safe. From my perspective, this patch has a
> minimum implementation of 2PC to work the global snapshot feature and
> has some missing features important for supporting crash-safe atomic
> commit. So I personally think we should consider how to integrate this
> global snapshot feature with the 2PC patch, rather than improving this
> patch if we want crash-safe atomic commit.
>

Okay, but isn't there also some advantage to this approach (managing
2PC at the postgres_fdw level), namely that any node will be capable
of handling global transactions rather than doing them via a central
coordinator? I mean any node can do writes or reads rather than
probably routing them (at least writes) via a coordinator node. Now, I
agree that even if this advantage is there in the current approach, we
can't lose the crash-safety aspect of the other approach. Will you be
able to summarize what the problem was w.r.t. crash safety and how your
patch has dealt with it?

> Looking at the commit procedure with this patch:
>
> When starting a new transaction on a foreign server, postgres_fdw
> executes pg_global_snapshot_import() to import the global snapshot.
> After some work, in pre-commit phase we do:
>
> 1. generate global transaction id, say 'gid'
> 2. execute PREPARE TRANSACTION 'gid' on all participants.
> 3. prepare global snapshot locally, if the local node also involves
> the transaction
> 4. execute pg_global_snapshot_prepare('gid') for all participants
>
> During step 2 to 4, we calculate the maximum CSN from the CSNs
> returned from each pg_global_snapshot_prepare() executions.
>
> 5. assign global snapshot locally, if the local node also involves the
> transaction
> 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
>
> Then, we commit locally (i.g. mark the current transaction as
> committed in clog).
>
> After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> participants.
>

As per my current understanding, the overall idea is as follows. For
global transactions, pg_global_snapshot_prepare('gid') will set the
transaction status to InDoubt and generate a CSN (let's call it
NodeCSN) at the node where that function is executed; it also returns
the NodeCSN to the coordinator. Then the coordinator (the current
postgres_fdw node on which the write transaction is being executed)
computes MaxCSN based on the return value (NodeCSN) of prepare
(pg_global_snapshot_prepare) from all nodes. It then assigns MaxCSN
to each node. Finally, when Commit Prepared is issued for each node,
that MaxCSN will be written to each node including the current node.
So, with this idea, each node will have the same view of the CSN value
corresponding to any particular transaction.

For snapshot management, the node which receives the query generates a
CSN (CurrentCSN) and follows the simple rule that a tuple whose xid
has a CSN less than CurrentCSN will be visible. Now, it is possible
that when we are examining a tuple, the CSN corresponding to the xid
that has written the tuple has the value INDOUBT, which indicates that
the transaction is not yet committed on all nodes. In that case we
wait till we get a valid CSN value corresponding to the xid and then
use it to check if the tuple is visible.
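
The rule can be written down as a few lines of illustrative code. This
is a sketch of the rule as described here, not the patch's visibility
code: the dictionary standing in for the CSN log, the INDOUBT sentinel,
and the wait_for_csn callback are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional, Union

INDOUBT = "indoubt"     # sentinel: prepared, final CSN not yet assigned

CsnEntry = Union[int, str]


def tuple_visible(xid: int,
                  current_csn: int,
                  csn_log: Dict[int, CsnEntry],
                  wait_for_csn: Callable[[int], Optional[int]]) -> bool:
    """Return True if the tuple written by xid is visible at current_csn."""
    csn = csn_log.get(xid)
    if csn is None:
        # No CSN recorded: treated as not committed in this sketch.
        return False
    if csn == INDOUBT:
        # Wait until the transaction gets a valid CSN (or goes away).
        csn = wait_for_csn(xid)
        if csn is None:
            return False
    return csn < current_csn
```

Here wait_for_csn stands for whatever blocking mechanism the real code
uses; in the sketch it returns None if the transaction ends up aborted.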

Now, one thing to note here is that for global transactions we
primarily rely on CSN value corresponding to a transaction for its
visibility even though we still maintain CLOG for local transaction
status.

Leaving aside the incomplete parts and/or flaws of the current patch,
does the above match the top-level idea of this patch? I am not sure
if my understanding of this patch at this stage is completely correct
or whether we want to follow the approach of this patch, but I think
we should at least first be sure whether such a top-level idea can
achieve what we want to do here.

> Considering how to integrate this global snapshot feature with the 2PC
> patch, what the 2PC patch needs to at least change is to allow FDW to
> store an FDW-private data that is passed to subsequent FDW transaction
> API calls. Currently, in the current 2PC patch, we call Prepare API
> for each participant servers one by one, and the core pass only
> metadata such as ForeignServer, UserMapping, and global transaction
> identifier. So it's not easy to calculate the maximum CSN across
> multiple transaction API calls. I think we can change the 2PC patch to
> add a void pointer into FdwXactRslvState, struct passed from the core,
> in order to store FDW-private data. It's going to be the maximum CSN
> in this case. That way, at the first Prepare API calls postgres_fdw
> allocates the space and stores CSN to that space. And at subsequent
> Prepare API calls it can calculate the maximum of csn, and then is
> able to the step 3 to 6 when preparing the transaction on the last
> participant. Another idea would be to change 2PC patch so that the
> core passes a bunch of participants grouped by FDW.
>

IIUC, with this the coordinator needs to communicate with the nodes
twice at the prepare stage: once to prepare the transaction on each
node and get the CSN from each node, and then again to communicate
MaxCSN to each node? Also, we probably need the InDoubt CSN status at
the prepare phase to make snapshots and global visibility work.

> I’ve not read this patch deeply yet and have considered it without any
> coding but my first feeling is not hard to integrate this feature with
> the 2PC patch.
>

Okay.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-07-08 05:46:02
Message-ID:	CA+fd4k6HE8xLGEvqWzABEg8kkju5MxU+if7bf-md0_2pjzXp9Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 7 Jul 2020 at 15:40, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jul 3, 2020 at 12:18 PM Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >
> > On Sat, 20 Jun 2020 at 21:21, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
> > > <a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
> > >
> > > > Also, can you let us know if this
> > > > > supports 2PC in some way and if so how is it different from what the
> > > > > other thread on the same topic [1] is trying to achieve?
> > > > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > > > 2PC machinery. Now I'd not judge which approach is better.
> > > >
> > >
> >
> > Sorry for being late.
> >
>
> No problem, your summarization, and comparisons of both approaches are
> quite helpful.
>
> >
> > I studied this patch and did a simple comparison between this patch
> > (0002 patch) and my 2PC patch.
> >
> > In terms of atomic commit, the features that are not implemented in
> > this patch but in the 2PC patch are:
> >
> > * Crash safe.
> > * PREPARE TRANSACTION command support.
> > * Query cancel during waiting for the commit.
> > * Automatically in-doubt transaction resolution.
> >
> > On the other hand, the feature that is implemented in this patch but
> > not in the 2PC patch is:
> >
> > * Executing PREPARE TRANSACTION (and other commands) in parallel
> >
> > When the 2PC patch was proposed, IIRC it was like this patch (0002
> > patch). I mean, it changed only postgres_fdw to support 2PC. But after
> > discussion, we changed the approach to have the core manage foreign
> > transaction for crash-safe. From my perspective, this patch has a
> > minimum implementation of 2PC to work the global snapshot feature and
> > has some missing features important for supporting crash-safe atomic
> > commit. So I personally think we should consider how to integrate this
> > global snapshot feature with the 2PC patch, rather than improving this
> > patch if we want crash-safe atomic commit.
> >
>
> Okay, but isn't there some advantage with this approach (manage 2PC at
> postgres_fdw level) as well which is that any node will be capable of
> handling global transactions rather than doing them via central
> coordinator? I mean any node can do writes or reads rather than
> probably routing them (at least writes) via coordinator node.

The postgres server where the client started the transaction works as
the coordinator node. I think this is true for both this patch and
that 2PC patch. From the perspective of atomic commit, any node will
be capable of handling global transactions in both approaches.

> Now, I
> agree that even if this advantage is there in the current approach, we
> can't lose the crash-safety aspect of other approach. Will you be
> able to summarize what was the problem w.r.t crash-safety and how your
> patch has dealt it?

Since this patch performs 2PC without any logging, foreign
transactions prepared on foreign servers are left over without any
clues if the coordinator crashes during commit. Therefore, after
restart, the user will need to find and resolve in-doubt foreign
transactions manually.

In that 2PC patch, the information about foreign transactions is WAL
logged before PREPARE TRANSACTION. So even if the coordinator crashes
after preparing some foreign transactions, the prepared foreign
transactions are recovered during crash recovery, and then the
transaction resolver resolves them automatically, or the user can
resolve them. The user doesn't need to check the other participant
nodes to resolve in-doubt foreign transactions. Also, since the
foreign transaction information is replicated to physical standbys,
the new master can take over resolving in-doubt transactions.
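
The log-before-prepare ordering can be illustrated with a toy model. A
plain list stands in for WAL and a set for the remote servers' prepared
transactions; prepare_all and recover are hypothetical names, and the
real patch's record format and resolver are not modelled here.

```python
from typing import List, Optional, Set, Tuple

Entry = Tuple[str, str]     # (gid, server)


def prepare_all(gid: str, servers: List[str], wal: List[Entry],
                remote_prepared: Set[Entry],
                crash_before: Optional[str] = None) -> None:
    """Log each foreign transaction, then prepare it remotely."""
    for server in servers:
        wal.append((gid, server))       # durable record comes first
        if server == crash_before:
            raise RuntimeError("coordinator crashed")
        # Stands in for sending PREPARE TRANSACTION 'gid' to the server.
        remote_prepared.add((gid, server))


def recover(wal: List[Entry]) -> List[str]:
    """After restart, every logged entry is a candidate for resolution."""
    return [f"ROLLBACK PREPARED '{gid}' on {server}"
            for gid, server in wal]
```

Because the record is written first, recovery can find an entry whose
remote PREPARE never happened. Treating every logged entry as prepared
and sending ROLLBACK PREPARED covers that case too.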

>
> > Looking at the commit procedure with this patch:
> >
> > When starting a new transaction on a foreign server, postgres_fdw
> > executes pg_global_snapshot_import() to import the global snapshot.
> > After some work, in pre-commit phase we do:
> >
> > 1. generate global transaction id, say 'gid'
> > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > 3. prepare global snapshot locally, if the local node also involves
> > the transaction
> > 4. execute pg_global_snapshot_prepare('gid') for all participants
> >
> > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > returned from each pg_global_snapshot_prepare() executions.
> >
> > 5. assign global snapshot locally, if the local node also involves the
> > transaction
> > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> >
> > Then, we commit locally (i.g. mark the current transaction as
> > committed in clog).
> >
> > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > participants.
> >
>
> As per my current understanding, the overall idea is as follows. For
> global transactions, pg_global_snapshot_prepare('gid') will set the
> transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> at the node where that function is executed, it also returns the
> NodeCSN to the coordinator. Then the coordinator (the current
> postgres_fdw node on which write transaction is being executed)
> computes MaxCSN based on the return value (NodeCSN) of prepare
> (pg_global_snapshot_prepare) from all nodes. It then assigns MaxCSN
> to each node. Finally, when Commit Prepared is issued for each node
> that MaxCSN will be written to each node including the current node.
> So, with this idea, each node will have the same view of CSN value
> corresponding to any particular transaction.
>
> For Snapshot management, the node which receives the query generates a
> CSN (CurrentCSN) and follows the simple rule that the tuple having a
> xid with CSN lesser than CurrentCSN will be visible. Now, it is
> possible that when we are examining a tuple, the CSN corresponding to
> xid that has written the tuple has a value as INDOUBT which will
> indicate that the transaction is yet not committed on all nodes. And
> we wait till we get the valid CSN value corresponding to xid and then
> use it to check if the tuple is visible.
>
> Now, one thing to note here is that for global transactions we
> primarily rely on CSN value corresponding to a transaction for its
> visibility even though we still maintain CLOG for local transaction
> status.
>
> Leaving aside the incomplete parts and or flaws of the current patch,
> does the above match the top-level idea of this patch?

I'm still studying this patch but your understanding seems right to me.

> I am not sure
> if my understanding of this patch at this stage is completely correct
> or whether we want to follow the approach of this patch but I think at
> least lets first be sure if such a top-level idea can achieve what we
> want to do here.
>
> > Considering how to integrate this global snapshot feature with the 2PC
> > patch, what the 2PC patch needs to at least change is to allow FDW to
> > store an FDW-private data that is passed to subsequent FDW transaction
> > API calls. Currently, in the current 2PC patch, we call Prepare API
> > for each participant servers one by one, and the core pass only
> > metadata such as ForeignServer, UserMapping, and global transaction
> > identifier. So it's not easy to calculate the maximum CSN across
> > multiple transaction API calls. I think we can change the 2PC patch to
> > add a void pointer into FdwXactRslvState, struct passed from the core,
> > in order to store FDW-private data. It's going to be the maximum CSN
> > in this case. That way, at the first Prepare API calls postgres_fdw
> > allocates the space and stores CSN to that space. And at subsequent
> > Prepare API calls it can calculate the maximum of csn, and then is
> > able to the step 3 to 6 when preparing the transaction on the last
> > participant. Another idea would be to change 2PC patch so that the
> > core passes a bunch of participants grouped by FDW.
> >
>
> IIUC with this the coordinator needs the communication with the nodes
> twice at the prepare stage, once to prepare the transaction in each
> node and get CSN from each node and then to communicate MaxCSN to each
> node?

Yes, I think so too.

> Also, we probably need InDoubt CSN status at prepare phase to
> make snapshots and global visibility work.

I think it depends on how the global CSN feature works.

For instance, in that 2PC patch, if the coordinator crashes while
preparing a foreign transaction, the global transaction manager
recovers and regards it as "prepared" regardless of whether the
foreign transaction has actually been prepared. And it sends ROLLBACK
PREPARED after recovery has completed. With the global CSN patch, as
you mentioned, at the prepare phase the coordinator needs to
communicate with participants twice in addition to sending PREPARE
TRANSACTION: pg_global_snapshot_prepare() and
pg_global_snapshot_assign().

If global CSN patch needs different cleanup work depending on the CSN
status, we will need InDoubt CSN status so that the global transaction
manager can distinguish between a foreign transaction that has
executed pg_global_snapshot_prepare() and the one that has executed
pg_global_snapshot_assign().

On the other hand, if it's enough to just send ROLLBACK or ROLLBACK
PREPARED in that case, I think we don't need the InDoubt CSN status.
There is no difference between those foreign transactions from the
global transaction manager's perspective.

As far as I read the patch, on failure postgres_fdw simply sends
ROLLBACK PREPARED to participants, and there seems to be no additional
work other than that. I might be missing something.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-07-08 12:34:54
Message-ID:	CAA4eK1+tMDXU9WtQ0kRgyk1jPr+K-7FowMdFB8=t-ptUEQa-mQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jul 8, 2020 at 11:16 AM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> On Tue, 7 Jul 2020 at 15:40, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > Okay, but isn't there some advantage with this approach (manage 2PC at
> > postgres_fdw level) as well which is that any node will be capable of
> > handling global transactions rather than doing them via central
> > coordinator? I mean any node can do writes or reads rather than
> > probably routing them (at least writes) via coordinator node.
>
> The postgres server where the client started the transaction works as
> the coordinator node. I think this is true for both this patch and
> that 2PC patch. From the perspective of atomic commit, any node will
> be capable of handling global transactions in both approaches.
>

Okay, but then we probably need to ensure that the GID is unique even
if it gets generated on different nodes? I don't know if that is
ensured.

> > Now, I
> > agree that even if this advantage is there in the current approach, we
> > can't lose the crash-safety aspect of other approach. Will you be
> > able to summarize what was the problem w.r.t crash-safety and how your
> > patch has dealt it?
>
> Since this patch proceeds 2PC without any logging, foreign
> transactions prepared on foreign servers are left over without any
> clues if the coordinator crashes during commit. Therefore, after
> restart, the user will need to find and resolve in-doubt foreign
> transactions manually.
>

Okay, but is it because we can't directly WAL log in postgres_fdw or
there is some other reason for not doing so?

>
> >
> > > Looking at the commit procedure with this patch:
> > >
> > > When starting a new transaction on a foreign server, postgres_fdw
> > > executes pg_global_snapshot_import() to import the global snapshot.
> > > After some work, in pre-commit phase we do:
> > >
> > > 1. generate global transaction id, say 'gid'
> > > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > > 3. prepare global snapshot locally, if the local node also involves
> > > the transaction
> > > 4. execute pg_global_snapshot_prepare('gid') for all participants
> > >
> > > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > > returned from each pg_global_snapshot_prepare() executions.
> > >
> > > 5. assign global snapshot locally, if the local node also involves the
> > > transaction
> > > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> > >
> > > Then, we commit locally (i.g. mark the current transaction as
> > > committed in clog).
> > >
> > > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > > participants.
> > >
> >
> > As per my current understanding, the overall idea is as follows. For
> > global transactions, pg_global_snapshot_prepare('gid') will set the
> > transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> > at the node where that function is executed, it also returns the
> > NodeCSN to the coordinator. Then the coordinator (the current
> > postgres_fdw node on which write transaction is being executed)
> > computes MaxCSN based on the return value (NodeCSN) of prepare
> > (pg_global_snapshot_prepare) from all nodes. It then assigns MaxCSN
> > to each node. Finally, when Commit Prepared is issued for each node
> > that MaxCSN will be written to each node including the current node.
> > So, with this idea, each node will have the same view of CSN value
> > corresponding to any particular transaction.
> >
> > For Snapshot management, the node which receives the query generates a
> > CSN (CurrentCSN) and follows the simple rule that the tuple having a
> > xid with CSN lesser than CurrentCSN will be visible. Now, it is
> > possible that when we are examining a tuple, the CSN corresponding to
> > xid that has written the tuple has a value as INDOUBT which will
> > indicate that the transaction is yet not committed on all nodes. And
> > we wait till we get the valid CSN value corresponding to xid and then
> > use it to check if the tuple is visible.
> >
> > Now, one thing to note here is that for global transactions we
> > primarily rely on CSN value corresponding to a transaction for its
> > visibility even though we still maintain CLOG for local transaction
> > status.
> >
> > Leaving aside the incomplete parts and or flaws of the current patch,
> > does the above match the top-level idea of this patch?
>
> I'm still studying this patch but your understanding seems right to me.
>

Cool. While studying, it would be great if you could think about
whether this approach is different from the global-coordinator-based
approach. Here is my initial thought: apart from other reasons, the
global-coordinator-based design can help us do global transaction
management and snapshots. It can allocate xids for each transaction,
then collect the list of running xacts (or CSNs) from each node, and
then prepare a global snapshot that can be used to perform any
transaction.

OTOH, in the design proposed in this patch, we don't need any
coordinator to manage transactions and snapshots because each node's
current CSN will be sufficient for snapshots and visibility as
explained above. Now, sure, this assumes that there is no clock skew
between nodes or that we somehow take care of it (note that in the
proposed patch the CSN is a timestamp).
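
To make the clock-skew concern concrete, here is a toy calculation. It
assumes CSNs are taken straight from each node's local clock and
ignores any skew handling the patch may contain; the function names and
the numbers are invented for the illustration.

```python
def local_csn(true_time: int, clock_offset: int) -> int:
    """CSN a node would generate: real time plus that node's clock error."""
    return true_time + clock_offset


def visible(commit_csn: int, snapshot_csn: int) -> bool:
    """The simple rule: visible if the commit CSN is below the snapshot CSN."""
    return commit_csn < snapshot_csn


# A transaction commits at real time 100 on a node whose clock runs 5 fast.
commit_csn = local_csn(100, +5)

# A query starts later, at real time 102, on a node whose clock runs 5 slow.
skewed_snapshot = local_csn(102, -5)

# With perfectly synchronized clocks the same events would give 100 and 102.
synced_result = visible(local_csn(100, 0), local_csn(102, 0))
skewed_result = visible(commit_csn, skewed_snapshot)
```

With synchronized clocks the later query sees the commit; with the skew
above it does not, even though the commit happened first in real time.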

> > I am not sure
> > if my understanding of this patch at this stage is completely correct
> > or whether we want to follow the approach of this patch but I think at
> > least lets first be sure if such a top-level idea can achieve what we
> > want to do here.
> >
> > > Considering how to integrate this global snapshot feature with the 2PC
> > > patch, what the 2PC patch needs to at least change is to allow FDW to
> > > store an FDW-private data that is passed to subsequent FDW transaction
> > > API calls. Currently, in the current 2PC patch, we call Prepare API
> > > for each participant servers one by one, and the core pass only
> > > metadata such as ForeignServer, UserMapping, and global transaction
> > > identifier. So it's not easy to calculate the maximum CSN across
> > > multiple transaction API calls. I think we can change the 2PC patch to
> > > add a void pointer into FdwXactRslvState, struct passed from the core,
> > > in order to store FDW-private data. It's going to be the maximum CSN
> > > in this case. That way, at the first Prepare API calls postgres_fdw
> > > allocates the space and stores CSN to that space. And at subsequent
> > > Prepare API calls it can calculate the maximum of csn, and then is
> > > able to the step 3 to 6 when preparing the transaction on the last
> > > participant. Another idea would be to change 2PC patch so that the
> > > core passes a bunch of participants grouped by FDW.
> > >
> >
> > IIUC with this the coordinator needs the communication with the nodes
> > twice at the prepare stage, once to prepare the transaction in each
> > node and get CSN from each node and then to communicate MaxCSN to each
> > node?
>
> Yes, I think so too.
>
> > Also, we probably need InDoubt CSN status at prepare phase to
> > make snapshots and global visibility work.
>
> I think it depends on how global CSN feature works.
>
> For instance, in that 2PC patch, if the coordinator crashes during
> preparing a foreign transaction, the global transaction manager
> recovers and regards it as "prepared" regardless of the foreign
> transaction actually having been prepared. And it sends ROLLBACK
> PREPARED after recovery completed. With global CSN patch, as you
> mentioned, at prepare phase the coordinator needs to communicate
> participants twice other than sending PREPARE TRANSACTION:
> pg_global_snapshot_prepare() and pg_global_snapshot_assign().
>
> If global CSN patch needs different cleanup work depending on the CSN
> status, we will need InDoubt CSN status so that the global transaction
> manager can distinguish between a foreign transaction that has
> executed pg_global_snapshot_prepare() and the one that has executed
> pg_global_snapshot_assign().
>
> On the other hand, if it's enough to just send ROLLBACK or ROLLBACK
> PREPARED in that case, I think we don't need InDoubt CSN status. There
> is no difference between those foreign transactions from the global
> transaction manager perspective.
>

I think the InDoubt status helps in checking visibility in the
proposed patch: if we find the status of the transaction to be
InDoubt, we wait till we get a valid CSN for it, as explained in my
previous email. So whether or not we use it for Rollback/Rollback
Prepared, it is required for this design.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-07-10 03:16:11
Message-ID:	CA+fd4k6YTa25VxpsqvpvqZmsCErT5T8=MYNNDXt7Wn0DsJvSrg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 8 Jul 2020 at 21:35, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Jul 8, 2020 at 11:16 AM Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >
> > On Tue, 7 Jul 2020 at 15:40, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > Okay, but isn't there some advantage with this approach (manage 2PC at
> > > postgres_fdw level) as well which is that any node will be capable of
> > > handling global transactions rather than doing them via central
> > > coordinator? I mean any node can do writes or reads rather than
> > > probably routing them (at least writes) via coordinator node.
> >
> > The postgres server where the client started the transaction works as
> > the coordinator node. I think this is true for both this patch and
> > that 2PC patch. From the perspective of atomic commit, any node will
> > be capable of handling global transactions in both approaches.
> >
>
> Okay, but then probably we need to ensure that GID has to be unique
> even if that gets generated on different nodes? I don't know if that
> is ensured.

Yes, if you mean GID is the global transaction id specified to PREPARE
TRANSACTION, it has to be unique. In that 2PC patch, the GID is
generated in the form 'fx_<random string>_<server oid>_<user oid>'. I
believe it can ensure uniqueness in most cases. In addition, there is
an FDW API to generate an arbitrary identifier.
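
For illustration, an identifier of that shape could be built as below.
The make_gid name, the hex encoding, and the length of the random part
are assumptions for the sketch; the 2PC patch's actual generator may
differ.

```python
import secrets


def make_gid(server_oid: int, user_oid: int, nbytes: int = 8) -> str:
    """Build 'fx_<random string>_<server oid>_<user oid>'.

    The random part is hex here (an assumption), so it never contains
    an underscore and the four fields stay unambiguous.
    """
    random_part = secrets.token_hex(nbytes)
    return f"fx_{random_part}_{server_oid}_{user_oid}"
```

Uniqueness across nodes then rests on the random part, since two nodes
can use the same server and user OIDs; a longer random string lowers
the collision odds but the result must stay within PREPARE
TRANSACTION's identifier length limit.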

>
> > > Now, I
> > > agree that even if this advantage is there in the current approach, we
> > > can't lose the crash-safety aspect of other approach. Will you be
> > > able to summarize what was the problem w.r.t crash-safety and how your
> > > patch has dealt it?
> >
> > Since this patch proceeds 2PC without any logging, foreign
> > transactions prepared on foreign servers are left over without any
> > clues if the coordinator crashes during commit. Therefore, after
> > restart, the user will need to find and resolve in-doubt foreign
> > transactions manually.
> >
>
> Okay, but is it because we can't directly WAL log in postgres_fdw or
> there is some other reason for not doing so?

Yes, I think it is because we cannot WAL log in postgres_fdw. Maybe I
missed the point in your question. Please correct me if I missed
something.

>
> >
> > >
> > > > Looking at the commit procedure with this patch:
> > > >
> > > > When starting a new transaction on a foreign server, postgres_fdw
> > > > executes pg_global_snapshot_import() to import the global snapshot.
> > > > After some work, in pre-commit phase we do:
> > > >
> > > > 1. generate global transaction id, say 'gid'
> > > > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > > > 3. prepare global snapshot locally, if the local node also involves
> > > > the transaction
> > > > 4. execute pg_global_snapshot_prepare('gid') for all participants
> > > >
> > > > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > > > returned from each pg_global_snapshot_prepare() executions.
> > > >
> > > > 5. assign global snapshot locally, if the local node also involves the
> > > > transaction
> > > > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> > > >
> > > > Then, we commit locally (i.g. mark the current transaction as
> > > > committed in clog).
> > > >
> > > > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > > > participants.
> > > >
> > >
> > > As per my current understanding, the overall idea is as follows. For
> > > global transactions, pg_global_snapshot_prepare('gid') will set the
> > > transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> > > at the node where that function is executed, it also returns the
> > > NodeCSN to the coordinator. Then the coordinator (the current
> > > postgres_fdw node on which write transaction is being executed)
> > > computes MaxCSN based on the return value (NodeCSN) of prepare
> > > (pg_global_snapshot_prepare) from all nodes. It then assigns MaxCSN
> > > to each node. Finally, when Commit Prepared is issued for each node
> > > that MaxCSN will be written to each node including the current node.
> > > So, with this idea, each node will have the same view of CSN value
> > > corresponding to any particular transaction.
> > >
> > > For Snapshot management, the node which receives the query generates a
> > > CSN (CurrentCSN) and follows the simple rule that the tuple having a
> > > xid with CSN lesser than CurrentCSN will be visible. Now, it is
> > > possible that when we are examining a tuple, the CSN corresponding to
> > > xid that has written the tuple has a value as INDOUBT which will
> > > indicate that the transaction is yet not committed on all nodes. And
> > > we wait till we get the valid CSN value corresponding to xid and then
> > > use it to check if the tuple is visible.
> > >
> > > Now, one thing to note here is that for global transactions we
> > > primarily rely on CSN value corresponding to a transaction for its
> > > visibility even though we still maintain CLOG for local transaction
> > > status.
> > >
> > > Leaving aside the incomplete parts and or flaws of the current patch,
> > > does the above match the top-level idea of this patch?
> >
> > I'm still studying this patch but your understanding seems right to me.
> >
>
> Cool. While studying, if you can try to think whether this approach is
> different from the global coordinator based approach then it would be
> great. Here is my initial thought apart from other reasons the global
> coordinator based design can help us to do the global transaction
> management and snapshots. It can allocate xids for each transaction
> and then collect the list of running xacts (or CSN) from each node and
> then prepare a global snapshot that can be used to perform any
> transaction. OTOH, in the design proposed in this patch, we don't need any
> coordinator to manage transactions and snapshots because each node's
> current CSN will be sufficient for snapshot and visibility as
> explained above.

Yeah, my thinking is the same as yours. Since both approaches have
strengths and weaknesses I cannot say which is better, but that 2PC
patch would go well together with the design proposed in this patch.
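
For reference, the coordinator-side part of the prepare/assign exchange discussed above (collect a CSN from each participant, then hand the maximum back to all of them) can be sketched as follows. This is a simplified illustration: GlobalCSN is assumed to be a 64-bit integer, and max_prepared_csn is a hypothetical name, not a function from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a CSN is a 64-bit integer (the thread notes it is a timestamp). */
typedef uint64_t GlobalCSN;

/*
 * Hypothetical helper: given the CSNs returned by each participant's
 * prepare call and the CSN prepared locally, return the maximum. That
 * value is what the coordinator would then assign on every participant.
 */
GlobalCSN
max_prepared_csn(const GlobalCSN *node_csns, size_t nnodes, GlobalCSN local_csn)
{
	GlobalCSN	max = local_csn;

	for (size_t i = 0; i < nnodes; i++)
	{
		if (node_csns[i] > max)
			max = node_csns[i];
	}
	return max;
}
```

In the 2PC patch the Prepare API is called once per participant, so the running maximum would have to be kept in FDW-private state between calls rather than computed over an array as done here.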

> Now, sure this assumes that there is no clock skew
> on different nodes or somehow we take care of the same (Note that in
> the proposed patch the CSN is a timestamp.).

As far as I understand the Clock-SI paper, clock skew is handled by
adding waits at transaction start and when reading tuples on the
remote node.

>
> > > I am not sure
> > > if my understanding of this patch at this stage is completely correct
> > > or whether we want to follow the approach of this patch but I think at
> > > least lets first be sure if such a top-level idea can achieve what we
> > > want to do here.
> > >
> > > > Considering how to integrate this global snapshot feature with the 2PC
> > > > patch, what the 2PC patch needs to at least change is to allow FDW to
> > > > store an FDW-private data that is passed to subsequent FDW transaction
> > > > API calls. Currently, in the current 2PC patch, we call Prepare API
> > > > for each participant servers one by one, and the core pass only
> > > > metadata such as ForeignServer, UserMapping, and global transaction
> > > > identifier. So it's not easy to calculate the maximum CSN across
> > > > multiple transaction API calls. I think we can change the 2PC patch to
> > > > add a void pointer into FdwXactRslvState, struct passed from the core,
> > > > in order to store FDW-private data. It's going to be the maximum CSN
> > > > in this case. That way, at the first Prepare API calls postgres_fdw
> > > > allocates the space and stores CSN to that space. And at subsequent
> > > > Prepare API calls it can calculate the maximum of csn, and then is
> > > > able to the step 3 to 6 when preparing the transaction on the last
> > > > participant. Another idea would be to change 2PC patch so that the
> > > > core passes a bunch of participants grouped by FDW.
> > > >
> > >
> > > IIUC with this the coordinator needs the communication with the nodes
> > > twice at the prepare stage, once to prepare the transaction in each
> > > node and get CSN from each node and then to communicate MaxCSN to each
> > > node?
> >
> > Yes, I think so too.
> >
> > > Also, we probably need InDoubt CSN status at prepare phase to
> > > make snapshots and global visibility work.
> >
> > I think it depends on how global CSN feature works.
> >
> > For instance, in that 2PC patch, if the coordinator crashes during
> > preparing a foreign transaction, the global transaction manager
> > recovers and regards it as "prepared" regardless of the foreign
> > transaction actually having been prepared. And it sends ROLLBACK
> > PREPARED after recovery completed. With global CSN patch, as you
> > mentioned, at prepare phase the coordinator needs to communicate
> > participants twice other than sending PREPARE TRANSACTION:
> > pg_global_snapshot_prepare() and pg_global_snapshot_assign().
> >
> > If global CSN patch needs different cleanup work depending on the CSN
> > status, we will need InDoubt CSN status so that the global transaction
> > manager can distinguish between a foreign transaction that has
> > executed pg_global_snapshot_prepare() and the one that has executed
> > pg_global_snapshot_assign().
> >
> > On the other hand, if it's enough to just send ROLLBACK or ROLLBACK
> > PREPARED in that case, I think we don't need InDoubt CSN status. There
> > is no difference between those foreign transactions from the global
> > transaction manager perspective.
> >
>
> I think InDoubt status helps in checking visibility in the proposed
> patch wherein if we find the status of the transaction as InDoubt, we
> wait till we get some valid CSN for it as explained in my previous
> email. So whether we use it for Rollback/Rollback Prepared, it is
> required for this design.

Yes, InDoubt status is required for checking visibility. My point
was that it's not necessary from the perspective of atomic commit.
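
To make the role of InDoubt in visibility concrete, here is a minimal sketch of the rule described earlier in this thread: a tuple is visible if the CSN of the writing xid is less than the snapshot CSN, and a reader that finds InDoubt has to wait for a real CSN. The special values, the enum, and the function name are assumptions for illustration and do not mirror the patch's definitions.

```c
#include <stdint.h>

typedef uint64_t GlobalCSN;

/* Hypothetical reserved values; the real patch may encode these differently. */
#define CSN_INPROGRESS	((GlobalCSN) 0)
#define CSN_ABORTED		((GlobalCSN) 1)
#define CSN_INDOUBT		((GlobalCSN) 2)

typedef enum
{
	VIS_INVISIBLE,
	VIS_VISIBLE,
	VIS_MUST_WAIT				/* writer is InDoubt: retry once its CSN is known */
} Visibility;

/*
 * Decide visibility of a tuple written by a transaction whose CSN is
 * xid_csn, for a reader holding a snapshot taken at snapshot_csn.
 */
Visibility
csn_visibility(GlobalCSN xid_csn, GlobalCSN snapshot_csn)
{
	if (xid_csn == CSN_INDOUBT)
		return VIS_MUST_WAIT;
	if (xid_csn == CSN_INPROGRESS || xid_csn == CSN_ABORTED)
		return VIS_INVISIBLE;
	return (xid_csn < snapshot_csn) ? VIS_VISIBLE : VIS_INVISIBLE;
}
```

The VIS_MUST_WAIT result is the part that atomic commit alone does not need: it only matters to readers.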

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-07-13 11:18:07
Message-ID:	CAA4eK1KAB=YbvDd3a8g2qNj2QNrQANnrpNmoaNCtEMLfaiFjrg@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Jul 10, 2020 at 8:46 AM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>
> On Wed, 8 Jul 2020 at 21:35, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> >
> > Cool. While studying, if you can try to think whether this approach is
> > different from the global coordinator based approach then it would be
> > great. Here is my initial thought apart from other reasons the global
> > coordinator based design can help us to do the global transaction
> > management and snapshots. It can allocate xids for each transaction
> > and then collect the list of running xacts (or CSN) from each node and
> > then prepare a global snapshot that can be used to perform any
> > transaction. OTOH, in the design proposed in this patch, we don't need any
> > coordinator to manage transactions and snapshots because each node's
> > current CSN will be sufficient for snapshot and visibility as
> > explained above.
>
> Yeah, my thought is the same as you. Since both approaches have strong
> points and weak points I cannot mention which is a better approach,
> but that 2PC patch would go well together with the design proposed in
> this patch.
>

I also think that with some modifications we might be able to integrate
your 2PC patch with the patches proposed here. However, if we decide
not to pursue this approach then it is uncertain whether your proposed
patch can be further enhanced for global visibility. Does it make
sense to dig into the design of this approach a bit further so that we
can be somewhat more sure that pursuing your 2PC patch would be a good
idea and that we can, in fact, enhance it later for global visibility?
AFAICS, Andrey has mentioned a couple of problems with this approach
[1]. I am also not sure of the details at this stage, but if we can
dig into those it would be really great.

> > Now, sure this assumes that there is no clock skew
> > on different nodes or somehow we take care of the same (Note that in
> > the proposed patch the CSN is a timestamp.).
>
> As far as I read Clock-SI paper, we take care of the clock skew by
> putting some waits on the transaction start and reading tuples on the
> remote node.
>

Oh, but I am not sure if this patch is able to solve that, and if so, how?

> >
> > I think InDoubt status helps in checking visibility in the proposed
> > patch wherein if we find the status of the transaction as InDoubt, we
> > wait till we get some valid CSN for it as explained in my previous
> > email. So whether we use it for Rollback/Rollback Prepared, it is
> > required for this design.
>
> Yes, InDoubt status is required for checking visibility. My comment
> was it's not necessary from the perspective of atomic commit.
>

True and probably we can enhance your patch for InDoubt status if required.

Thanks for moving this work forward. I know the progress is a bit
slow due to various reasons but I think it is important to keep making
some progress.

[1] - /message-id/f23083b9-38d0-6126-eb6e-091816a78585%40postgrespro.ru

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Subject:	RE: Global snapshots
Date:	2020-07-23 04:46:30
Message-ID:	TYAPR01MB2990E83F9E886A9C8AB8E269FE760@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

Hello,

While I'm thinking of the following issues of the current approach Andrey raised, I'm getting puzzled and can't help asking certain things. Please forgive me if I'm missing some discussions in the past.

> 1. Dependency on clocks synchronization
> 2. Needs guarantees of monotonically increasing of the CSN in the case
> of an instance restart/crash etc.
> 3. We need to delay increasing of OldestXmin because it can be needed
> for a transaction snapshot at another node.

While Clock-SI seems to be considered the most promising for global serializability here,

* Why does Clock-SI get so much attention? How did Clock-SI become the only choice?

* Clock-SI was devised in Microsoft Research. Does Microsoft or some other organization use Clock-SI?

Has anyone examined the following Multiversion Commitment Ordering (MVCO)? Although I haven't understood this yet, it claims that no concurrency control information, including timestamps, needs to be exchanged among the cluster nodes. I'd appreciate it if someone could give an opinion.

Commitment Ordering Based Distributed Concurrency Control for Bridging Single and Multi Version Resources.
Proceedings of the Third IEEE International Workshop on Research Issues on Data Engineering: Interoperability in Multidatabase Systems (RIDE-IMS), Vienna, Austria, pp. 189-198, April 1993. (also DEC-TR 853, July 1992)
https://ieeexplore.ieee.org/document/281924?arnumber=281924

The author of the above paper, Yoav Raz, seems to have been passionate, at least until 2011, about convincing people of the power of Commitment Ordering (CO) for global serializability. However, he complains (sadly) that almost all researchers ignore his theory, as written on his site below and on the Wikipedia page for Commitment Ordering. Does anyone know why CO is ignored?

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co

FWIW, some researchers including Michael Stonebraker evaluated the performance of various distributed concurrency control methods in 2017. Has anyone looked at this? (I don't mean there was some promising method that we might want to adopt.)

An Evaluation of Distributed Concurrency Control
Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017.
Proc. VLDB Endow. 10, 5 (January 2017), 553-564.
https://doi.org/10.14778/3055540.3055548

Regards
Takayuki Tsunakawa

From:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, movead(dot)li(at)highgo(dot)ca, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-07-23 06:13:12
Message-ID:	CA+fd4k4rAoBcd92eVr94gNynLRxpqAOdd8+SKdcw1Avb2-q8tQ@mail.gmail.com
Lists:	pgsql-hackers

On Mon, 13 Jul 2020 at 20:18, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jul 10, 2020 at 8:46 AM Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >
> > On Wed, 8 Jul 2020 at 21:35, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > >
> > > Cool. While studying, if you can try to think whether this approach is
> > > different from the global coordinator based approach then it would be
> > > great. Here is my initial thought apart from other reasons the global
> > > coordinator based design can help us to do the global transaction
> > > management and snapshots. It can allocate xids for each transaction
> > > and then collect the list of running xacts (or CSN) from each node and
> > > then prepare a global snapshot that can be used to perform any
> > > transaction. OTOH, in the design proposed in this patch, we don't need any
> > > coordinator to manage transactions and snapshots because each node's
> > > current CSN will be sufficient for snapshot and visibility as
> > > explained above.
> >
> > Yeah, my thought is the same as you. Since both approaches have strong
> > points and weak points I cannot mention which is a better approach,
> > but that 2PC patch would go well together with the design proposed in
> > this patch.
> >
>
> I also think with some modifications we might be able to integrate
> your 2PC patch with the patches proposed here. However, if we decide
> not to pursue this approach then it is uncertain whether your proposed
> patch can be further enhanced for global visibility.

Yes. I think that even if we decide not to pursue this approach, that
is not a reason to drop the 2PC patch. In that case we would need to
reconsider the design of the 2PC patch so that it resolves the atomic
commit problem generically.

> Does it make
> sense to dig the design of this approach a bit further so that we can
> be somewhat more sure that pursuing your 2PC patch would be a good
> idea and we can, in fact, enhance it later for global visibility?

Agreed.

> AFAICS, Andrey has mentioned couple of problems with this approach
> [1], the details of which I am also not sure at this stage but if we
> can dig those it would be really great.
>
> > > Now, sure this assumes that there is no clock skew
> > > on different nodes or somehow we take care of the same (Note that in
> > > the proposed patch the CSN is a timestamp.).
> >
> > As far as I read Clock-SI paper, we take care of the clock skew by
> > putting some waits on the transaction start and reading tuples on the
> > remote node.
> >
>
> Oh, but I am not sure if this patch is able to solve that, and if so, how?

I'm not sure of the details, but as far as I can tell from the patch,
the transaction will sleep in GlobalSnapshotSync() when the received
global CSN is greater than the local global CSN.
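
The waiting rule described above can be sketched as a small delay computation. This is only an illustration of the idea, not the body of GlobalSnapshotSync(): the function name csn_sync_delay is hypothetical, and the CSN is assumed to be a 64-bit timestamp in some fixed unit.

```c
#include <stdint.h>

/* Assumption: a CSN is a 64-bit timestamp in some fixed unit. */
typedef uint64_t GlobalCSN;

/*
 * Hypothetical helper: how long the local node should wait before using a
 * snapshot whose CSN was produced by a remote clock. If the remote CSN is
 * ahead of the local clock, wait until the local clock catches up;
 * otherwise no wait is needed.
 */
uint64_t
csn_sync_delay(GlobalCSN received_csn, GlobalCSN local_csn)
{
	return (received_csn > local_csn) ? received_csn - local_csn : 0;
}
```

With this rule the wait is bounded by the clock skew between the two nodes, which is why the approach depends on reasonably well synchronized clocks.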

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Global snapshots
Date:	2020-07-27 06:22:45
Message-ID:	TYAPR01MB29902D8B671A5A6946F5AEA8FE720@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

Hi Andrey san, Movead san,

From: tsunakawa(dot)takay(at)fujitsu(dot)com <tsunakawa(dot)takay(at)fujitsu(dot)com>
> While Clock-SI seems to be considered the best promising for global
> serializability here,
>
> * Why does Clock-SI gets so much attention? How did Clock-SI become the
> only choice?
>
> * Clock-SI was devised in Microsoft Research. Does Microsoft or some other
> organization use Clock-SI?

Could you take a look at this patent? I'm afraid this is the Clock-SI for MVCC. Microsoft holds this until 2031. I couldn't find this with the keyword "Clock-SI."

US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
https://patents.google.com/patent/US8356007

If it is, can we circumvent this patent?

Regards
Takayuki Tsunakawa

From:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-07-27 06:44:54
Message-ID:	806c3708-26ea-b94d-87b1-d48e41e93726@postgrespro.ru
Lists:	pgsql-hackers

On 7/27/20 11:22 AM, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
> Hi Andrey san, Movead san,
>
>
> From: tsunakawa(dot)takay(at)fujitsu(dot)com <tsunakawa(dot)takay(at)fujitsu(dot)com>
>> While Clock-SI seems to be considered the best promising for global
>> serializability here,
>>
>> * Why does Clock-SI gets so much attention? How did Clock-SI become the
>> only choice?
>>
>> * Clock-SI was devised in Microsoft Research. Does Microsoft or some other
>> organization use Clock-SI?
>
> Could you take a look at this patent? I'm afraid this is the Clock-SI for MVCC. Microsoft holds this until 2031. I couldn't find this with the keyword "Clock-SI.""
>
>
> US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
> https://patents.google.com/patent/US8356007
>
>
> If it is, can we circumvent this patent?
>
>
> Regards
> Takayuki Tsunakawa
>
>

Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.

--
regards,
Andrey Lepikhov
Postgres Professional

From:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
To:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc:	tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-04 18:31:14
Message-ID:	3ef7877bfed0582019eab3d462a43275@postgrespro.ru
Lists:	pgsql-hackers

Hi,

On 2020-07-27 09:44, Andrey V. Lepikhov wrote:
> On 7/27/20 11:22 AM, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
>>
>> US8356007B2 - Distributed transaction management for database systems
>> with multiversioning - Google Patents
>> https://patents.google.com/patent/US8356007
>>
>>
>> If it is, can we circumvent this patent?
>>
>
> Thank you for the research (and previous links too).
> I haven't seen this patent before. This should be carefully studied.

I had a look at the patch set, although it is quite outdated, especially
0003.

Two thoughts about 0003:

First, IIUC atomicity of the distributed transaction in the postgres_fdw
is achieved by the usage of 2PC. I think that this postgres_fdw 2PC
support should be separated from global snapshots. It could be useful to
have such atomic distributed transactions even without a proper
visibility, which is guaranteed by the global snapshot. Especially
taking into account the doubts about Clock-SI and general questions
about algorithm choosing criteria above in the thread.

Thus, I propose to split 0003 into two parts and add a separate GUC
'postgres_fdw.use_twophase', which could be turned on independently of
'postgres_fdw.use_global_snapshots'. Of course, if the latter is enabled,
then 2PC should be forcibly turned on as well.

Second, there are some problems with error handling in 0003 (thanks
to Arseny Sher for the review).

+error:
+            if (!res)
+            {
+                sql = psprintf("ABORT PREPARED '%s'", fdwTransState->gid);
+                BroadcastCmd(sql);
+                elog(ERROR, "Failed to PREPARE transaction on remote node");
+            }

It seems that we should never reach this point, just because
BroadcastStmt will throw an ERROR if it fails to prepare transaction on
the foreign server:

+            if (PQresultStatus(result) != expectedStatus ||
+                (handler && !handler(result, arg)))
+            {
+                elog(WARNING, "Failed command %s: status=%d, expected status=%d", sql, PQresultStatus(result), expectedStatus);
+                pgfdw_report_error(ERROR, result, entry->conn, true, sql);
+                allOk = false;
+            }

Moreover, it doesn't make much sense to try to abort prepared xacts:
if we failed to prepare the transaction somewhere, then some foreign
servers may already be unavailable, so this doesn't give us a 100%
guarantee of cleanup.

+    /* COMMIT open transaction of we were doing 2PC */
+    if (fdwTransState->two_phase_commit &&
+        (event == XACT_EVENT_PARALLEL_COMMIT || event == XACT_EVENT_COMMIT))
+    {
+        BroadcastCmd(psprintf("COMMIT PREPARED '%s'", fdwTransState->gid));
+    }

At this point, the host (local) transaction is already committed and
there is no way to abort it gracefully. However, BroadcastCmd may raise
an ERROR that will cause a PANIC, since this is a non-recoverable state:

PANIC: cannot abort transaction 487, it was already committed
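
One way to avoid this class of failure is to treat post-commit errors as warnings and leave the affected prepared transactions for later resolution. The following standalone sketch illustrates that idea outside of PostgreSQL; it is not necessarily what the attached patch does. The callback type send_cmd_fn, the function commit_prepared_everywhere, and the stub are hypothetical names introduced for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical callback: send one SQL command to a participant and
 * report whether it succeeded. */
typedef bool (*send_cmd_fn) (int participant, const char *sql);

/*
 * Post-commit phase: the local transaction is already committed, so a
 * failure here must not be raised as an error. Report it and move on; the
 * prepared transaction stays on that participant until it is resolved.
 * Returns the number of participants on which COMMIT PREPARED failed.
 */
int
commit_prepared_everywhere(int nparticipants, const char *gid,
						   send_cmd_fn send_cmd)
{
	char		sql[256];
	int			failed = 0;

	/* A real implementation must quote gid as an SQL literal. */
	snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);

	for (int i = 0; i < nparticipants; i++)
	{
		if (!send_cmd(i, sql))
		{
			fprintf(stderr, "WARNING: participant %d: \"%s\" failed\n", i, sql);
			failed++;
		}
	}
	return failed;
}

/* Stub for illustration only: pretends participant 1 is unreachable. */
bool
stub_send_cmd(int participant, const char *sql)
{
	(void) sql;
	return participant != 1;
}
```

Calling commit_prepared_everywhere(3, "fx_abc_16384_10", stub_send_cmd) reports one failure and still attempts the remaining participant. The price of this design is that the prepared transaction on the failed participant has to be resolved manually or by an external resolver.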

Attached is a patch that implements plain 2PC in postgres_fdw and adds
a GUC 'postgres_fdw.use_twophase'. It also fixes the error handling
issues above and tries to add proper comments everywhere. I think that
0003 should be rebased on top of it, or it could be the first patch in
the set, since it may be used independently. What do you think?

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Attachment	Content-Type	Size
0001-Add-postgres_fdw.use_twophase-GUC-to-use-2PC.patch	text/x-diff	12.8 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc:	tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-08 02:49:36
Message-ID:	057004b6-68bf-98ee-7c57-91c418f6e221@oss.nttdata.com
Lists:	pgsql-hackers

On 2020/09/05 3:31, Alexey Kondratov wrote:
> Hi,
>
> On 2020-07-27 09:44, Andrey V. Lepikhov wrote:
>> On 7/27/20 11:22 AM, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
>>>
>>> US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
>>> https://patents.google.com/patent/US8356007
>>>
>>>
>>> If it is, can we circumvent this patent?
>>>
>>
>> Thank you for the research (and previous links too).
>> I haven't seen this patent before. This should be carefully studied.
>
> I had a look on the patch set, although it is quite outdated, especially on 0003.
>
> Two thoughts about 0003:
>
> First, IIUC atomicity of the distributed transaction in the postgres_fdw is achieved by the usage of 2PC. I think that this postgres_fdw 2PC support should be separated from global snapshots.

Agreed.

> It could be useful to have such atomic distributed transactions even without a proper visibility, which is guaranteed by the global snapshot. Especially taking into account the doubts about Clock-SI and general questions about algorithm choosing criteria above in the thread.
>
> Thus, I propose to split 0003 into two parts and add a separate GUC 'postgres_fdw.use_twophase', which could be turned on independently from 'postgres_fdw.use_global_snapshots'. Of course if the latter is enabled, then 2PC should be forcedly turned on as well.
>
> Second, there are some problems with errors handling in the 0003 (thanks to Arseny Sher for review).
>
> +error:
> +            if (!res)
> +            {
> +                sql = psprintf("ABORT PREPARED '%s'", fdwTransState->gid);
> +                BroadcastCmd(sql);
> +                elog(ERROR, "Failed to PREPARE transaction on remote node");
> +            }
>
> It seems that we should never reach this point, just because BroadcastStmt will throw an ERROR if it fails to prepare transaction on the foreign server:
>
> +            if (PQresultStatus(result) != expectedStatus ||
> +                (handler && !handler(result, arg)))
> +            {
> +                elog(WARNING, "Failed command %s: status=%d, expected status=%d", sql, PQresultStatus(result), expectedStatus);
> +                pgfdw_report_error(ERROR, result, entry->conn, true, sql);
> +                allOk = false;
> +            }
>
> Moreover, It doesn't make much sense to try to abort prepared xacts, since if we failed to prepare it somewhere, then some foreign servers may become unavailable already and this doesn't provide us a 100% guarantee of clean up.
>
> +    /* COMMIT open transaction of we were doing 2PC */
> +    if (fdwTransState->two_phase_commit &&
> +        (event == XACT_EVENT_PARALLEL_COMMIT || event == XACT_EVENT_COMMIT))
> +    {
> +        BroadcastCmd(psprintf("COMMIT PREPARED '%s'", fdwTransState->gid));
> +    }
>
> At this point, the host (local) transaction is already committed and there is no way to abort it gracefully. However, BroadcastCmd may rise an ERROR that will cause a PANIC, since it is non-recoverable state:
>
> PANIC: cannot abort transaction 487, it was already committed
>
> Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues above and tries to add proper comments everywhere. I think, that 0003 should be rebased on the top of it, or it could be a first patch in the set, since it may be used independently. What do you think?

Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
about pros and cons between your patch and Sawada-san's?

Regards,

[1]
/message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-08 10:36:16
Message-ID:	0e51d0298eed7588664e3c67a4fb15c9@postgrespro.ru
Lists:	pgsql-hackers

On 2020-09-08 05:49, Fujii Masao wrote:
> On 2020/09/05 3:31, Alexey Kondratov wrote:
>>
>> Attached is a patch, which implements a plain 2PC in the postgres_fdw
>> and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these
>> errors handling issues above and tries to add proper comments
>> everywhere. I think, that 0003 should be rebased on the top of it, or
>> it could be a first patch in the set, since it may be used
>> independently. What do you think?
>
> Thanks for the patch!
>
> Sawada-san was proposing another 2PC patch at [1]. Do you have any
> thoughts
> about pros and cons between your patch and Sawada-san's?
>
> [1]
> /message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com

Thank you for the link!

After a quick look at Sawada-san's patch set I think that there are
two major differences:

1. There is a built-in foreign xacts resolver in [1], which should
be much more convenient from the end-user perspective. It involves huge
in-core changes and additional complexity, which are of course worth it.

However, it's still not clear to me that it is possible to resolve all
foreign prepared xacts on Postgres' own side with a 100% guarantee.
Imagine a situation where the coordinator node is actually an HA cluster
group (primary + sync + async replica) and it fails just after the
PREPARE stage or after the local COMMIT. In that case all foreign xacts
will be left in the prepared state. After the failover process completes,
the synchronous replica will become the new primary. Would it have all
the info required to properly resolve orphaned prepared xacts?

Probably, this situation is handled properly in the [1], but I've not
yet finished a thorough reading of the patch set, though it has a great
doc!

On the other hand, the previous 0003 and my proposed patch rely on
either manual resolution of hung prepared xacts or the use of an
external monitor/resolver. This approach is much simpler from the
in-core perspective, but it doesn't look as complete as [1].

2. In the patch from this thread all 2PC logic sits in postgres_fdw,
while [1] tries to put it into the generic FDW core, which also feels
like a more general and architecturally correct way. However, how many
of the dozens of currently available FDWs are capable of performing
2PC? And how many of them are maintained well enough to adopt this new
API? This is not actually an argument against [1], since postgres_fdw
is known to be the most advanced FDW and an early adopter of new
features; it is just a little doubt about the usefulness of this
preliminary generalisation.

Anyway, I think that [1] is a great work and really hope to find more
time to investigate it deeper later this year.

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-08 11:48:16
Message-ID:	a86f7722-48b3-cd82-d491-86272a7bf8f0@oss.nttdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2020/09/08 19:36, Alexey Kondratov wrote:
> On 2020-09-08 05:49, Fujii Masao wrote:
>> On 2020/09/05 3:31, Alexey Kondratov wrote:
>>>
>>> Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues above and tries to add proper comments everywhere. I think, that 0003 should be rebased on the top of it, or it could be a first patch in the set, since it may be used independently. What do you think?
>>
>> Thanks for the patch!
>>
>> Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
>> about pros and cons between your patch and Sawada-san's?
>>
>> [1]
>> /message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com
>
> Thank you for the link!
>
> After a quick look on the Sawada-san's patch set I think that there are two major differences:

Thanks for sharing your thoughts! As far as I can tell from a quick read
of your patch, I basically agree with this view.

>
> 1. There is a built-in foreign xacts resolver in the [1], which should be much more convenient from the end-user perspective. It involves huge in-core changes and additional complexity that is of course worth of.
>
> However, it's still not clear for me that it is possible to resolve all foreign prepared xacts on the Postgres' own side with a 100% guarantee. Imagine a situation when the coordinator node is actually a HA cluster group (primary + sync + async replica) and it failed just after PREPARE stage of after local COMMIT. In that case all foreign xacts will be left in the prepared state. After failover process complete synchronous replica will become a new primary. Would it have all required info to properly resolve orphan prepared xacts?

IIUC, yes, the information required for automatic resolution is
WAL-logged and the standby tries to resolve those orphan transactions
from WAL after the failover. But Sawada-san's patch provides
the special function for manual resolution, so there may be some cases
where manual resolution is necessary.

>
> Probably, this situation is handled properly in the [1], but I've not yet finished a thorough reading of the patch set, though it has a great doc!
>
> On the other hand, previous 0003 and my proposed patch rely on either manual resolution of hung prepared xacts or usage of external monitor/resolver. This approach is much simpler from the in-core perspective, but doesn't look as complete as [1] though.
>
> 2. In the patch from this thread all 2PC logic sit in the postgres_fdw, while [1] tries to put it into the generic fdw core, which also feels like a more general and architecturally correct way. However, how many from the currently available dozens of various FDWs are capable to perform 2PC? And how many of them are maintained well enough to adopt this new API? This is not an argument against [1] actually, since postgres_fdw is known to be the most advanced FDW and an early adopter of new feature, just a little doubt about a usefulness of this preliminary generalisation.

If we implement the 2PC feature only for PostgreSQL sharding using
postgres_fdw, IMO it's ok to support only postgres_fdw.
But if we implement 2PC as an improvement to FDW that is independent
of PostgreSQL sharding and global visibility, I think that it's
necessary to support other FDWs. I'm not sure how many FDWs
will actually support this new 2PC interface. But if the interface is
not so complicated, I *guess* some FDWs will support it in the near future.

Implementing the 2PC feature only inside postgres_fdw seems to cause
another issue: COMMIT PREPARED is issued to the remote servers
after marking the local transaction as committed
(i.e., ProcArrayEndTransaction()). Is this safe? This issue happens
because COMMIT PREPARED is issued via
CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
is called after ProcArrayEndTransaction().

>
> Anyway, I think that [1] is a great work and really hope to find more time to investigate it deeper later this year.

I'm sure your work is also great! I hope we can discuss the design
of 2PC feature together!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-08 17:00:44
Message-ID:	53b1586317ae98ecd8c3383f2c9e7c16@postgrespro.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2020-09-08 14:48, Fujii Masao wrote:
> On 2020/09/08 19:36, Alexey Kondratov wrote:
>> On 2020-09-08 05:49, Fujii Masao wrote:
>>> On 2020/09/05 3:31, Alexey Kondratov wrote:
>>>>
>>>> Attached is a patch, which implements a plain 2PC in the
>>>> postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it
>>>> solves these errors handling issues above and tries to add proper
>>>> comments everywhere. I think, that 0003 should be rebased on the top
>>>> of it, or it could be a first patch in the set, since it may be used
>>>> independently. What do you think?
>>>
>>> Thanks for the patch!
>>>
>>> Sawada-san was proposing another 2PC patch at [1]. Do you have any
>>> thoughts
>>> about pros and cons between your patch and Sawada-san's?
>>>
>>> [1]
>>> /message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com
>>
>> Thank you for the link!
>>
>> After a quick look on the Sawada-san's patch set I think that there
>> are two major differences:
>
> Thanks for sharing your thought! As far as I read your patch quickly,
> I basically agree with your this view.
>
>
>>
>> 1. There is a built-in foreign xacts resolver in the [1], which should
>> be much more convenient from the end-user perspective. It involves
>> huge in-core changes and additional complexity that is of course worth
>> of.
>>
>> However, it's still not clear for me that it is possible to resolve
>> all foreign prepared xacts on the Postgres' own side with a 100%
>> guarantee. Imagine a situation when the coordinator node is actually a
>> HA cluster group (primary + sync + async replica) and it failed just
>> after PREPARE stage of after local COMMIT. In that case all foreign
>> xacts will be left in the prepared state. After failover process
>> complete synchronous replica will become a new primary. Would it have
>> all required info to properly resolve orphan prepared xacts?
>
> IIUC, yes, the information required for automatic resolution is
> WAL-logged and the standby tries to resolve those orphan transactions
> from WAL after the failover. But Sawada-san's patch provides
> the special function for manual resolution, so there may be some cases
> where manual resolution is necessary.
>

I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change to
+rollback, therefore at this time some participants might be prepared whereas
+some are not prepared. The former foreign transactions need to be resolved
+using pg_resolve_foreign_xact() manually and the latter ends transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear for me.

>
> Implementing 2PC feature only inside postgres_fdw seems to cause
> another issue; COMMIT PREPARED is issued to the remote servers
> after marking the local transaction as committed
> (i.e., ProcArrayEndTransaction()).
>

According to the Sawada-san's v25 0002 the logic is pretty much the same
there:

+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

A brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction()
in CommitTransaction(). Thus, I don't see much difference between
this approach and the CallXactCallbacks() usage regarding this point.

> Is this safe? This issue happens
> because COMMIT PREPARED is issued via
> CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
> is called after ProcArrayEndTransaction().
>

Once the transaction is committed locally, any ERROR (or higher-level
message) will be escalated to PANIC. And I do see possible ERROR-level
messages in postgresCommitForeignTransaction(), for example:

+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ ereport(ERROR, (errmsg("could not commit transaction on server %s",
+ frstate->server->servername)));

I don't think that it's very convenient to get a PANIC every time we
fail to commit one of the prepared foreign xacts, since that may not be
so rare in a distributed system. That's why I tried to get rid of
possible ERRORs as far as possible in my proposed patch.
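
To illustrate the idea of avoiding ERRORs after the local commit, here
is a minimal sketch in plain C. It is not code from the proposed patch:
SketchRemote and sketch_finish_commit() are hypothetical names, and the
"reachable" flag merely simulates whether COMMIT PREPARED would succeed.
A failure is reported as a warning and counted, so the caller never
enters the abort path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical remote server state; none of these names come from the patch. */
typedef struct {
    const char *name;
    bool reachable;   /* false simulates a failed COMMIT PREPARED */
    bool committed;
} SketchRemote;

/*
 * Try COMMIT PREPARED on every remote after the local commit. A failure is
 * reported as a warning and counted instead of being raised as an error.
 * Returns the number of remotes left in the prepared state, which a manual
 * or external resolver would have to finish later.
 */
int sketch_finish_commit(SketchRemote *remotes, int n)
{
    int unresolved = 0;

    assert(remotes != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        if (remotes[i].reachable) {
            remotes[i].committed = true;
        } else {
            fprintf(stderr,
                    "WARNING: could not commit prepared transaction on server %s\n",
                    remotes[i].name);
            unresolved++;
        }
    }
    return unresolved;
}
```

With three remotes of which one is unreachable, the function returns 1
and leaves only that remote uncommitted, which is the state a manual or
external resolver would then have to clean up.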

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
Cc:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-09 05:35:04
Message-ID:	CA+fd4k4vhUon3wz+GGoRU_f+ja-U7PYPUE07fg33BxsSuqP0Bw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
<a(dot)kondratov(at)postgrespro(dot)ru> wrote:
>
> On 2020-09-08 14:48, Fujii Masao wrote:
> > On 2020/09/08 19:36, Alexey Kondratov wrote:
> >> On 2020-09-08 05:49, Fujii Masao wrote:
> >>> On 2020/09/05 3:31, Alexey Kondratov wrote:
> >>>>
> >>>> Attached is a patch, which implements a plain 2PC in the
> >>>> postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it
> >>>> solves these errors handling issues above and tries to add proper
> >>>> comments everywhere. I think, that 0003 should be rebased on the top
> >>>> of it, or it could be a first patch in the set, since it may be used
> >>>> independently. What do you think?
> >>>
> >>> Thanks for the patch!
> >>>
> >>> Sawada-san was proposing another 2PC patch at [1]. Do you have any
> >>> thoughts
> >>> about pros and cons between your patch and Sawada-san's?
> >>>
> >>> [1]
> >>> /message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com
> >>
> >> Thank you for the link!
> >>
> >> After a quick look on the Sawada-san's patch set I think that there
> >> are two major differences:
> >
> > Thanks for sharing your thought! As far as I read your patch quickly,
> > I basically agree with your this view.
> >
> >
> >>
> >> 1. There is a built-in foreign xacts resolver in the [1], which should
> >> be much more convenient from the end-user perspective. It involves
> >> huge in-core changes and additional complexity that is of course worth
> >> of.
> >>
> >> However, it's still not clear for me that it is possible to resolve
> >> all foreign prepared xacts on the Postgres' own side with a 100%
> >> guarantee. Imagine a situation when the coordinator node is actually a
> >> HA cluster group (primary + sync + async replica) and it failed just
> >> after PREPARE stage of after local COMMIT. In that case all foreign
> >> xacts will be left in the prepared state. After failover process
> >> complete synchronous replica will become a new primary. Would it have
> >> all required info to properly resolve orphan prepared xacts?
> >
> > IIUC, yes, the information required for automatic resolution is
> > WAL-logged and the standby tries to resolve those orphan transactions
> > from WAL after the failover. But Sawada-san's patch provides
> > the special function for manual resolution, so there may be some cases
> > where manual resolution is necessary.
> >
>
> I've found a note about manual resolution in the v25 0002:
>
> +After that we prepare all foreign transactions by calling
> +PrepareForeignTransaction() API. If we failed on any of them we change
> to
> +rollback, therefore at this time some participants might be prepared
> whereas
> +some are not prepared. The former foreign transactions need to be
> resolved
> +using pg_resolve_foreign_xact() manually and the latter ends
> transaction
> +in one-phase by calling RollbackForeignTransaction() API.
>
> but it's not yet clear for me.

Sorry, the above description in the README is out of date. In the v25
patch, it's true that if a backend fails to prepare a transaction on a
foreign server, it’s possible that some foreign transactions are
prepared whereas others are not. But at the end of the transaction,
after changing to rollback, the process rolls back (or rolls back
prepared) all of them. So the use case of pg_resolve_foreign_xact() is
to resolve orphaned foreign prepared transactions or to resolve a
foreign transaction that is not resolved for some reason (bugs, etc.).

>
> >
> > Implementing 2PC feature only inside postgres_fdw seems to cause
> > another issue; COMMIT PREPARED is issued to the remote servers
> > after marking the local transaction as committed
> > (i.e., ProcArrayEndTransaction()).
> >
>
> According to the Sawada-san's v25 0002 the logic is pretty much the same
> there:
>
> +2. Pre-Commit phase (1st phase of two-phase commit)
>
> +3. Commit locally
> +Once we've prepared all of them, commit the transaction locally.
>
> +4. Post-Commit Phase (2nd phase of two-phase commit)
>
> Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
> FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction()
> in the CommitTransaction(). Thus, I don't see many difference between
> these approach and CallXactCallbacks() usage regarding this point.
>
> > Is this safe? This issue happens
> > because COMMIT PREPARED is issued via
> > CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
> > is called after ProcArrayEndTransaction().
> >
>
> Once the transaction is committed locally any ERROR (or higher level
> message) will be escalated to PANIC.

I think this is true only inside the critical section and it's not
necessarily true for all errors happening after the local commit,
right?

> And I do see possible ERROR level
> messages in the postgresCommitForeignTransaction() for example:
>
> + if (PQresultStatus(res) != PGRES_COMMAND_OK)
> + ereport(ERROR, (errmsg("could not commit transaction on server %s",
> + frstate->server->servername)));
>
> I don't think that it's very convenient to get a PANIC every time we
> failed to commit one of the prepared foreign xacts, since it could be
> not so rare in the distributed system. That's why I tried to get rid of
> possible ERRORs as far as possible in my proposed patch.
>

In my patch, the second phase of 2PC is executed only by the resolver
process. Therefore, even if an error happens while committing a
foreign prepared transaction, we just need to relaunch the resolver
process and try again. During that time, the backend process will
just be waiting. If a backend process raises an error after the local
commit, the client will see a transaction failure despite the local
transaction having been committed. An error could be raised even by
palloc. So the patch uses a background worker, rather than the backend
itself, to commit prepared foreign transactions.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
To:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-09 10:45:02
Message-ID:	36887e82ad3f6c5254de71382a672c61@postgrespro.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2020-09-09 08:35, Masahiko Sawada wrote:
> On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
> <a(dot)kondratov(at)postgrespro(dot)ru> wrote:
>>
>> On 2020-09-08 14:48, Fujii Masao wrote:
>> >
>> > IIUC, yes, the information required for automatic resolution is
>> > WAL-logged and the standby tries to resolve those orphan transactions
>> > from WAL after the failover. But Sawada-san's patch provides
>> > the special function for manual resolution, so there may be some cases
>> > where manual resolution is necessary.
>> >
>>
>> I've found a note about manual resolution in the v25 0002:
>>
>> +After that we prepare all foreign transactions by calling
>> +PrepareForeignTransaction() API. If we failed on any of them we
>> change
>> to
>> +rollback, therefore at this time some participants might be prepared
>> whereas
>> +some are not prepared. The former foreign transactions need to be
>> resolved
>> +using pg_resolve_foreign_xact() manually and the latter ends
>> transaction
>> +in one-phase by calling RollbackForeignTransaction() API.
>>
>> but it's not yet clear for me.
>
> Sorry, the above description in README is out of date. In the v25
> patch, it's true that if a backend fails to prepare a transaction on a
> foreign server, it’s possible that some foreign transactions are
> prepared whereas others are not. But at the end of the transaction
> after changing to rollback, the process does rollback (or rollback
> prepared) all of them. So the use case of pg_resolve_foreign_xact() is
> to resolve orphaned foreign prepared transactions or to resolve a
> foreign transaction that is not resolved for some reasons, bugs etc.
>

OK, thank you for the explanation!

>>
>> Once the transaction is committed locally any ERROR (or higher level
>> message) will be escalated to PANIC.
>
> I think this is true only inside the critical section and it's not
> necessarily true for all errors happening after the local commit,
> right?
>

It's not actually related to the escalation of errors in a critical
section. Any error in the backend after the local commit and
ProcArrayEndTransaction() will try to abort the current transaction and
do RecordTransactionAbort(), but it's too late to do so and a PANIC
will be raised:

/*
 * Check that we haven't aborted halfway through RecordTransactionCommit.
 */
if (TransactionIdDidCommit(xid))
    elog(PANIC, "cannot abort transaction %u, it was already committed",
         xid);

At least that's how I understand it.

>> And I do see possible ERROR level
>> messages in the postgresCommitForeignTransaction() for example:
>>
>> + if (PQresultStatus(res) != PGRES_COMMAND_OK)
>> + ereport(ERROR, (errmsg("could not commit transaction
>> on server %s",
>> +
>> frstate->server->servername)));
>>
>> I don't think that it's very convenient to get a PANIC every time we
>> failed to commit one of the prepared foreign xacts, since it could be
>> not so rare in the distributed system. That's why I tried to get rid
>> of
>> possible ERRORs as far as possible in my proposed patch.
>>
>
> In my patch, the second phase of 2PC is executed only by the resolver
> process. Therefore, even if an error would happen during committing a
> foreign prepared transaction, we just need to relaunch the resolver
> process and trying again. During that, the backend process will be
> just waiting. If a backend process raises an error after the local
> commit, the client will see transaction failure despite the local
> transaction having been committed. An error could happen even by
> palloc. So the patch uses a background worker to commit prepared
> foreign transactions, not by backend itself.
>

Yes, if it's a background process, then it seems to be safe.

BTW, it seems that I've chosen the wrong thread for posting my patch and
starting a discussion :) Activity from this thread moved to [1], and
your solution with the built-in resolver is discussed in [2]. I'll try
to take a closer look at v25 and write to [2] instead.

[1]
/message-id/2020081009525213277261%40highgo.ca

[2]
/message-id/CAExHW5uBy9QwjdSO4j82WC4aeW-Q4n2ouoZ1z70o%3D8Vb0skqYQ%40mail.gmail.com

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-09 17:29:23
Message-ID:	b5ea3797-0bcc-7288-ba76-119a423dd693@oss.nttdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2020/09/09 2:00, Alexey Kondratov wrote:
> On 2020-09-08 14:48, Fujii Masao wrote:
>> On 2020/09/08 19:36, Alexey Kondratov wrote:
>>> On 2020-09-08 05:49, Fujii Masao wrote:
>>>> On 2020/09/05 3:31, Alexey Kondratov wrote:
>>>>>
>>>>> Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues above and tries to add proper comments everywhere. I think, that 0003 should be rebased on the top of it, or it could be a first patch in the set, since it may be used independently. What do you think?
>>>>
>>>> Thanks for the patch!
>>>>
>>>> Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
>>>> about pros and cons between your patch and Sawada-san's?
>>>>
>>>> [1]
>>>> /message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com
>>>
>>> Thank you for the link!
>>>
>>> After a quick look on the Sawada-san's patch set I think that there are two major differences:
>>
>> Thanks for sharing your thought! As far as I read your patch quickly,
>> I basically agree with your this view.
>>
>>
>>>
>>> 1. There is a built-in foreign xacts resolver in the [1], which should be much more convenient from the end-user perspective. It involves huge in-core changes and additional complexity that is of course worth of.
>>>
>>> However, it's still not clear for me that it is possible to resolve all foreign prepared xacts on the Postgres' own side with a 100% guarantee. Imagine a situation when the coordinator node is actually a HA cluster group (primary + sync + async replica) and it failed just after PREPARE stage of after local COMMIT. In that case all foreign xacts will be left in the prepared state. After failover process complete synchronous replica will become a new primary. Would it have all required info to properly resolve orphan prepared xacts?
>>
>> IIUC, yes, the information required for automatic resolution is
>> WAL-logged and the standby tries to resolve those orphan transactions
>> from WAL after the failover. But Sawada-san's patch provides
>> the special function for manual resolution, so there may be some cases
>> where manual resolution is necessary.
>>
>
> I've found a note about manual resolution in the v25 0002:
>
> +After that we prepare all foreign transactions by calling
> +PrepareForeignTransaction() API. If we failed on any of them we change to
> +rollback, therefore at this time some participants might be prepared whereas
> +some are not prepared. The former foreign transactions need to be resolved
> +using pg_resolve_foreign_xact() manually and the latter ends transaction
> +in one-phase by calling RollbackForeignTransaction() API.
>
> but it's not yet clear for me.
>
>>
>> Implementing 2PC feature only inside postgres_fdw seems to cause
>> another issue; COMMIT PREPARED is issued to the remote servers
>> after marking the local transaction as committed
>> (i.e., ProcArrayEndTransaction()).
>>
>
> According to the Sawada-san's v25 0002 the logic is pretty much the same there:
>
> +2. Pre-Commit phase (1st phase of two-phase commit)
>
> +3. Commit locally
> +Once we've prepared all of them, commit the transaction locally.
>
> +4. Post-Commit Phase (2nd phase of two-phase commit)
>
> Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction() in the CommitTransaction(). Thus, I don't see many difference between these approach and CallXactCallbacks() usage regarding this point.

IIUC the commit logic in Sawada-san's patch looks like

1. PreCommit_FdwXact()
    PREPARE TRANSACTION command is issued

2. RecordTransactionCommit()
    2-1. WAL-log the commit record
    2-2. Update CLOG
    2-3. Wait for sync rep
    2-4. FdwXactWaitForResolution()
        Wait until COMMIT PREPARED commands are issued to the remote
        servers and completed.

3. ProcArrayEndTransaction()
4. AtEOXact_FdwXact(true)

So ISTM that the timing of when COMMIT PREPARED is issued
to the remote server is different between the patches.
Am I missing something?
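
To make the timing difference concrete, here is a small C sketch that
encodes both orderings as arrays and checks where COMMIT PREPARED falls
relative to ProcArrayEndTransaction(). The step names and arrays are
illustrative only. The resolver-based order follows the list above; the
callback-based order assumes that PREPARE is issued before the local
commit, which is not stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    ORD_PREPARE_REMOTE,
    ORD_WAL_COMMIT_RECORD,
    ORD_CLOG_UPDATE,
    ORD_SYNC_REP_WAIT,
    ORD_COMMIT_PREPARED_REMOTE,
    ORD_PROCARRAY_END,
    ORD_END_OF_XACT_CLEANUP
} OrdStep;

/* Order as listed above for the resolver-based patch. */
const OrdStep ord_resolver_based[] = {
    ORD_PREPARE_REMOTE, ORD_WAL_COMMIT_RECORD, ORD_CLOG_UPDATE,
    ORD_SYNC_REP_WAIT, ORD_COMMIT_PREPARED_REMOTE, ORD_PROCARRAY_END,
    ORD_END_OF_XACT_CLEANUP
};

/* Assumed order when COMMIT PREPARED runs from a commit callback. */
const OrdStep ord_callback_based[] = {
    ORD_PREPARE_REMOTE, ORD_WAL_COMMIT_RECORD, ORD_CLOG_UPDATE,
    ORD_SYNC_REP_WAIT, ORD_PROCARRAY_END, ORD_COMMIT_PREPARED_REMOTE
};

#define ORD_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* True if step a occurs in seq and comes before the first occurrence of b. */
bool ord_before(const OrdStep *seq, size_t n, OrdStep a, OrdStep b)
{
    size_t pos_a = n, pos_b = n;

    assert(seq != NULL);
    for (size_t i = 0; i < n; i++) {
        if (seq[i] == a && pos_a == n)
            pos_a = i;
        if (seq[i] == b && pos_b == n)
            pos_b = i;
    }
    return pos_a < n && pos_b < n && pos_a < pos_b;
}
```

Calling ord_before() with ORD_COMMIT_PREPARED_REMOTE and
ORD_PROCARRAY_END returns true for the resolver-based sequence and
false for the callback-based one.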

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	"'Andrey V(dot) Lepikhov'" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Global snapshots
Date:	2020-09-10 01:38:15
Message-ID:	TYAPR01MB2990F6F2A75657899A7E3A30FE270@TYAPR01MB2990.jpnprd01.prod.outlook.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Andrey san,

From: Andrey V. Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>> > From: tsunakawa(dot)takay(at)fujitsu(dot)com <tsunakawa(dot)takay(at)fujitsu(dot)com>
> >> While Clock-SI seems to be considered the best promising for global
> >>> > Could you take a look at this patent? I'm afraid this is the Clock-SI for MVCC.
> Microsoft holds this until 2031. I couldn't find this with the keyword
> "Clock-SI.""
> >
> >
> > US8356007B2 - Distributed transaction management for database systems
> with multiversioning - Google Patents
> > https://patents.google.com/patent/US8356007
> >
> >
> > If it is, can we circumvent this patent?
> >>
> Thank you for the research (and previous links too).
> I haven't seen this patent before. This should be carefully studied.

I wanted to ask about this after publishing the revised scale-out design wiki, but that is taking me too long, so could you share your study results? I think we need to clarify the patent situation before discussing the code. After we hear your opinion, we also have to check whether Clock-SI is patented, or avoid the patent by modifying part of the algorithm. In case we cannot use it, we have to start thinking about alternatives.

Regards
Takayuki Tsunakawa

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "'Andrey V(dot) Lepikhov'" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-10 08:16:53
Message-ID:	d17da3c0-54bb-6221-7ec7-151f6be39e5f@oss.nttdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2020/09/10 10:38, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
> Hi Andrey san,
>
> From: Andrey V. Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>> > From: tsunakawa(dot)takay(at)fujitsu(dot)com <tsunakawa(dot)takay(at)fujitsu(dot)com>
>>>> While Clock-SI seems to be considered the best promising for global
>>>>>> Could you take a look at this patent? I'm afraid this is the Clock-SI for MVCC.
>> Microsoft holds this until 2031. I couldn't find this with the keyword
>> "Clock-SI.""
>>>
>>>
>>> US8356007B2 - Distributed transaction management for database systems
>> with multiversioning - Google Patents
>>> https://patents.google.com/patent/US8356007
>>>
>>>
>>> If it is, can we circumvent this patent?
>>>>
>> Thank you for the research (and previous links too).
>> I haven't seen this patent before. This should be carefully studied.
>
> I wanted to ask about this after I've published the revised scale-out design wiki, but I'm taking too long, so could you share your study results? I think we need to make it clear about the patent before discussing the code.

Yes.

But I'm concerned that it's really hard to say there is no patent risk
around that. I'm not sure who in the community can judge that there is
no patent risk. Maybe no one? Anyway, I was thinking that Google Spanner,
YugabyteDB, etc. use a clock-based global transaction approach similar
to Clock-SI. Since I've never heard that they have patent issues, I was
just assuming that Clock-SI doesn't have any either. No? This type of
*guess* is not safe, though...

> After we hear your opinion, we also have to check to see if Clock-SI is patented or avoid it by modifying part of the algorithm. Just in case we cannot use it, we have to proceed with thinking about alternatives.

One alternative is to add only hooks to PostgreSQL core so that we can
implement the global transaction management outside. This idea was
discussed before under the title "eXtensible Transaction Manager API".

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "'Andrey V(dot) Lepikhov'" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Global snapshots
Date:	2020-09-10 09:01:44
Message-ID:	TYAPR01MB299093D3F5E8BC2D80D5DFD6FE270@TYAPR01MB2990.jpnprd01.prod.outlook.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
> But I'm concerned about that it's really hard to say there is no patent risk
> around that. I'm not sure who can judge there is no patent risk,
> in the community. Maybe no one? Anyway, I was thinking that Google Spanner,
> YugabyteDB, etc use the global transaction approach based on the clock
> similar to Clock-SI. Since I've never heard they have the patent issues,
> I was just thinking Clock-SI doesn't have. No? This type of *guess* is not
> safe, though...

Hm, it may be difficult to be sure that the algorithm does not violate a patent. But it may not be difficult to know if the algorithm apparently violates a patent or is highly likely (for those who know Clock-SI well.) At least, Andrey-san seems to have felt that it needs careful study, so I guess he had some hunch.

I understand this community is sensitive to patents. After the discussions at and after PGCon 2018, the community concluded that it won't accept patented technology. In the distant past, the community released Postgres 8.0 containing ARC, which was covered by a pending IBM patent, and removed it in 8.0.2. I wonder how this could have been detected, and how hard it was to cope with the patent issue. Bruce warned that we should be careful not to violate Greenplum's patents.

E.25. Release 8.0.2
/docs/8.0/release-8-0-2.html
--------------------------------------------------
New cache management algorithm 2Q replaces ARC (Tom)
This was done to avoid a pending US patent on ARC. The 2Q code might be a few percentage points slower than ARC for some work loads. A better cache management algorithm will appear in 8.1.
--------------------------------------------------

I think I'll try to contact the people listed in Clock-SI paper and the Microsoft patent to ask about this. I'm going to have a late summer vacation next week, so this is my summer homework?

> One alternative is to add only hooks into PostgreSQL core so that we can
> implement the global transaction management outside. This idea was
> discussed before as the title "eXtensible Transaction Manager API".

Yeah, I read that discussion. And I remember Robert Haas and Postgres Pro people said it's not good...

Regards
Takayuki Tsunakawa

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "'Andrey V(dot) Lepikhov'" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-10 10:50:08
Message-ID:	88a2642f-1ce6-a9d2-5d26-1221a2c4e072@oss.nttdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2020/09/10 18:01, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
> From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
>> But I'm concerned about that it's really hard to say there is no patent risk
>> around that. I'm not sure who can judge there is no patent risk,
>> in the community. Maybe no one? Anyway, I was thinking that Google Spanner,
>> YugabyteDB, etc use the global transaction approach based on the clock
>> similar to Clock-SI. Since I've never heard they have the patent issues,
>> I was just thinking Clock-SI doesn't have. No? This type of *guess* is not
>> safe, though...
>
> Hm, it may be difficult to be sure that the algorithm does not violate a patent. But it may not be difficult to know if the algorithm apparently violates a patent or is highly likely (for those who know Clock-SI well.) At least, Andrey-san seems to have felt that it needs careful study, so I guess he had some hunch.
>
> I understand this community is sensitive to patents. After the discussions at and after PGCon 2018, the community concluded that it won't accept patented technology. In the distant past, the community released Postgres 8.0 that contains an IBM's pending patent ARC, and removed it in 8.0.2. I wonder how could this could be detected, and how hard to cope with the patent issue. Bruce warned that we should be careful not to violate Greenplum's patents.
>
> E.25. Release 8.0.2
> /docs/8.0/release-8-0-2.html
> --------------------------------------------------
> New cache management algorithm 2Q replaces ARC (Tom)
> This was done to avoid a pending US patent on ARC. The 2Q code might be a few percentage points slower than ARC for some work loads. A better cache management algorithm will appear in 8.1.
> --------------------------------------------------
>
>
> I think I'll try to contact the people listed in Clock-SI paper and the Microsoft patent to ask about this.

Thanks!

> I'm going to have a late summer vacation next week, so this is my summer homework?
>
>
>> One alternative is to add only hooks into PostgreSQL core so that we can
>> implement the global transaction management outside. This idea was
>> discussed before as the title "eXtensible Transaction Manager API".
>
> Yeah, I read that discussion. And I remember Robert Haas and Postgres Pro people said it's not good...

But it may be worth revisiting this idea if we cannot avoid the patent issue.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-10 14:22:27
Message-ID:	c9a5c774eb9b5fb98d9a07c1ef1490b7@postgrespro.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2020-09-09 20:29, Fujii Masao wrote:
> On 2020/09/09 2:00, Alexey Kondratov wrote:
>>
>> According to the Sawada-san's v25 0002 the logic is pretty much the
>> same there:
>>
>> +2. Pre-Commit phase (1st phase of two-phase commit)
>>
>> +3. Commit locally
>> +Once we've prepared all of them, commit the transaction locally.
>>
>> +4. Post-Commit Phase (2nd phase of two-phase commit)
>>
>> Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
>> FdwXactParticipantEndTransaction happens after
>> ProcArrayEndTransaction() in the CommitTransaction(). Thus, I don't
>> see many difference between these approach and CallXactCallbacks()
>> usage regarding this point.
>
> IIUC the commit logic in Sawada-san's patch looks like
>
> 1. PreCommit_FdwXact()
> PREPARE TRANSACTION command is issued
>
> 2. RecordTransactionCommit()
> 2-1. WAL-log the commit record
> 2-2. Update CLOG
> 2-3. Wait for sync rep
> 2-4. FdwXactWaitForResolution()
> Wait until COMMIT PREPARED commands are issued to the
> remote servers and completed.
>
> 3. ProcArrayEndTransaction()
> 4. AtEOXact_FdwXact(true)
>
> So ISTM that the timing of when COMMIT PREPARED is issued
> to the remote server is different between the patches.
> Am I missing something?
>

No, you are right, sorry. At first glance I thought that
AtEOXact_FdwXact is responsible for COMMIT PREPARED as well, but it
only calls FdwXactParticipantEndTransaction in the abort case.

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-17 06:56:47
Message-ID:	CAA4eK1LMQYtY1kJX5RpxkUCnrFnGqj--zcAGVH7Sp3mcPRRuvg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>
> >> One alternative is to add only hooks into PostgreSQL core so that we can
> >> implement the global transaction management outside. This idea was
> >> discussed before as the title "eXtensible Transaction Manager API".
> >
> > Yeah, I read that discussion. And I remember Robert Haas and Postgres Pro people said it's not good...
>
> But it may be worth revisiting this idea if we cannot avoid the patent issue.
>

It is not very clear what exactly we can do about the point raised by
Tsunakawa-San related to the patent on this technology, as I haven't seen
that discussed during other development, but maybe we can try to study
it a bit. One more thing I would like to bring up here is that there seem
to have been some concerns about this idea when it was originally
discussed [1]. It is not very clear to me whether all the concerns have
been addressed. If someone can summarize the concerns discussed and how
the latest patch addresses them, that would be great.

Also, I am not sure, but maybe global deadlock detection also needs to
be considered, as it seems to be related: it depends on
how we manage global transactions. We need to prevent deadlocks among
transaction operations spanning multiple nodes. Say a
transaction T-1 has updated row r-1 of tbl-1 on node-1 and tries to
update row r-1 of tbl-2 on node-2. Similarly, a transaction T-2
tries to perform those two operations in reverse order. Now, this will
lead to a deadlock that spans multiple nodes, and our current
deadlock detector doesn't have that capability. Having some form of
global/distributed transaction id might help to resolve it, but I am not
sure how it can be solved with this Clock-SI based algorithm.
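
To make the scenario above concrete, here is a minimal sketch (not taken from any patch in this thread) of why a global transaction identifier helps. It assumes each node can report its local wait-for edges as (waiter, holder) pairs keyed by a shared global transaction id; the function name and the input layout are hypothetical. Neither node sees a cycle on its own, but the union of the edges does contain one.

```python
from collections import defaultdict

def find_global_deadlock(local_wait_edges):
    """Return one cycle of global transaction ids, or None.

    local_wait_edges: dict mapping node name -> list of (waiter, holder)
    pairs, where both are *global* transaction ids (an assumption).
    """
    # Merge the per-node wait-for edges into one global graph.
    graph = defaultdict(set)
    for edges in local_wait_edges.values():
        for waiter, holder in edges:
            graph[waiter].add(holder)

    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(txn, path):
        # Depth-first search; reaching a GREY vertex means a cycle.
        color[txn] = GREY
        path.append(txn)
        for nxt in sorted(graph[txn]):
            if color[nxt] == GREY:
                return path[path.index(nxt):]
            if color[nxt] == WHITE:
                cycle = visit(nxt, path)
                if cycle:
                    return cycle
        path.pop()
        color[txn] = BLACK
        return None

    for txn in sorted(graph):
        if color[txn] == WHITE:
            cycle = visit(txn, [])
            if cycle:
                return cycle
    return None

# T1 holds the row on node-1 and waits on node-2; T2 is the reverse.
edges = {
    "node-1": [("T2", "T1")],  # on node-1, T2 waits for T1
    "node-2": [("T1", "T2")],  # on node-2, T1 waits for T2
}
cycle = find_global_deadlock(edges)
```

With only one node's edges the function returns None, which mirrors what a purely local deadlock detector would conclude.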

As all these problems are related, that is why I am insisting on this
thread and other thread "Transactions involving multiple postgres
foreign servers" [2] to have a high-level idea on how the distributed
transaction management will work before we decide on a particular
approach and commit one part of that patch.

[1] - /message-id/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru
[2] - /message-id/CAA4eK1J86S%3DmeivVsH%2Boy%3DTwUC%2Byr9jj2VtmmqMfYRmgs2JzUA%40mail.gmail.com

--
With Regards,
Amit Kapila.

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
Cc:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-17 21:54:32
Message-ID:	20200917215432.GA24265@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Sep 8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote:
> Thank you for the link!
>
> After a quick look on the Sawada-san's patch set I think that there are two
> major differences:
>
> 1. There is a built-in foreign xacts resolver in the [1], which should be
> much more convenient from the end-user perspective. It involves huge in-core
> changes and additional complexity that is of course worth of.
>
> However, it's still not clear for me that it is possible to resolve all
> foreign prepared xacts on the Postgres' own side with a 100% guarantee.
> Imagine a situation when the coordinator node is actually a HA cluster group
> (primary + sync + async replica) and it failed just after PREPARE stage of
> after local COMMIT. In that case all foreign xacts will be left in the
> prepared state. After failover process complete synchronous replica will
> become a new primary. Would it have all required info to properly resolve
> orphan prepared xacts?
>
> Probably, this situation is handled properly in the [1], but I've not yet
> finished a thorough reading of the patch set, though it has a great doc!
>
> On the other hand, previous 0003 and my proposed patch rely on either manual
> resolution of hung prepared xacts or usage of external monitor/resolver.
> This approach is much simpler from the in-core perspective, but doesn't look
> as complete as [1] though.

Have we considered how someone would clean up foreign transactions if the
coordinating server dies? Could it be done manually? Would an external
resolver, rather than an internal one, make this easier?

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee

From:	Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, tsunakawa(dot)takay(at)fujitsu(dot)com, movead(dot)li(at)highgo(dot)ca, 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-21 14:24:22
Message-ID:	3cdde64facc19a2c1baad1a3993a3b96@postgrespro.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2020-09-18 00:54, Bruce Momjian wrote:
> On Tue, Sep 8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote:
>> Thank you for the link!
>>
>> After a quick look on the Sawada-san's patch set I think that there
>> are two
>> major differences:
>>
>> 1. There is a built-in foreign xacts resolver in the [1], which should
>> be
>> much more convenient from the end-user perspective. It involves huge
>> in-core
>> changes and additional complexity that is of course worth of.
>>
>> However, it's still not clear for me that it is possible to resolve
>> all
>> foreign prepared xacts on the Postgres' own side with a 100%
>> guarantee.
>> Imagine a situation when the coordinator node is actually a HA cluster
>> group
>> (primary + sync + async replica) and it failed just after PREPARE
>> stage of
>> after local COMMIT. In that case all foreign xacts will be left in the
>> prepared state. After failover process complete synchronous replica
>> will
>> become a new primary. Would it have all required info to properly
>> resolve
>> orphan prepared xacts?
>>
>> Probably, this situation is handled properly in the [1], but I've not
>> yet
>> finished a thorough reading of the patch set, though it has a great
>> doc!
>>
>> On the other hand, previous 0003 and my proposed patch rely on either
>> manual
>> resolution of hung prepared xacts or usage of external
>> monitor/resolver.
>> This approach is much simpler from the in-core perspective, but
>> doesn't look
>> as complete as [1] though.
>
> Have we considered how someone would clean up foreign transactions if
> the
> coordinating server dies? Could it be done manually? Would an
> external
> resolver, rather than an internal one, make this easier?

Both Sawada-san's patch [1] and the patches in this thread (e.g. mine [2])
use 2PC with a special gid format that includes an xid plus server
identification info. Thus, one can select from pg_prepared_xacts, get the
xid and coordinator info, then use txid_status() on the coordinator (or
ex-coordinator) to get the transaction status, and finally either commit
or abort these stale prepared xacts. Of course this could be wrapped into
some user-level support routines, as is done in [1].
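
As a rough illustration of that procedure, the decision step of such a resolver could look like the sketch below. The gid layout `fx_<xid>_<server id>` is purely hypothetical (the real formats are defined by the patches), and the helper names are made up for this example. The status strings are the ones txid_status() returns: 'committed', 'aborted', 'in progress', or NULL.

```python
import re

# Hypothetical gid layout, for illustration only: "fx_<xid>_<server id>".
GID_PATTERN = re.compile(r"^fx_(\d+)_(\w+)$")

def parse_gid(gid):
    """Return (xid, server_id), or None if the gid is not one of ours."""
    m = GID_PATTERN.match(gid)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

def resolution_command(gid, coordinator_status):
    """Map the coordinator's txid_status() result to a 2PC command.

    Returns None when nothing can be decided yet: a foreign gid,
    a transaction still in progress, or an unknown (NULL) status.
    """
    if parse_gid(gid) is None:
        return None
    if coordinator_status == "committed":
        return f"COMMIT PREPARED '{gid}'"
    if coordinator_status == "aborted":
        return f"ROLLBACK PREPARED '{gid}'"
    return None

cmd = resolution_command("fx_1234_node1", "committed")
```

Because the function only needs the gid from pg_prepared_xacts and the status from the coordinator, it keeps no state of its own.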

As for the benefits of using an external resolver, I think that there
are some of them from the whole system perspective:

1) If one follows the logic above, then this resolver could be
stateless, it takes all the required info from the Postgres nodes
themselves.

2) Then you can easily put it into a container, which makes it easier to
deploy to all this 'cloud' stuff like Kubernetes.

3) Also you can scale resolvers independently from Postgres nodes.

I do not think that any of these points is a game changer, but we use
a very simple external resolver together with [2] in our sharding
prototype and it works just fine so far.

[1]
/message-id/CA%2Bfd4k4HOVqqC5QR4H984qvD0Ca9g%3D1oLYdrJT_18zP9t%2BUsJg%40mail.gmail.com

[2]
/message-id/3ef7877bfed0582019eab3d462a43275%40postgrespro.ru

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	"'Andrey V(dot) Lepikhov'" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Global snapshots
Date:	2020-09-22 00:47:52
Message-ID:	TYAPR01MB29903E52A9410C9061DB44E8FE3B0@TYAPR01MB2990.jpnprd01.prod.outlook.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Andrey-san, all,

From: Andrey V. Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
> On 7/27/20 11:22 AM, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
> > Could you take a look at this patent? I'm afraid this is the Clock-SI for MVCC.
> Microsoft holds this until 2031. I couldn't find this with the keyword
> "Clock-SI.""
> >
> >
> > US8356007B2 - Distributed transaction management for database systems
> with multiversioning - Google Patents
> > https://patents.google.com/patent/US8356007
> >
> >
> > If it is, can we circumvent this patent?

> I haven't seen this patent before. This should be carefully studied.

I contacted six people individually: three holders of the patent and three different authors of the Clock-SI paper. I got replies from two people. (Regrettably, I couldn't get a reply from the main author of the Clock-SI paper.)

[Reply from the patent holder Per-Ake Larson]
--------------------------------------------------
Thanks for your interest in my patent.

The answer to your question is: No, Clock-SI is not based on the patent - it was an entirely independent development. The two approaches are similar in the sense that there is no global clock, the commit time of a distributed transaction is the same in every partition where it modified data, and a transaction gets it snapshot timestamp from a local clock. The difference is whether a distributed transaction gets its commit timestamp before or after the prepare phase in 2PC.

Hope this helpful.

Best regards,
Per-Ake
--------------------------------------------------

[Reply from the Clock-SI author Willy Zwaenepoel]
--------------------------------------------------
Thank you for your kind words about our work.

I was unaware of this patent at the time I wrote the paper. The two came out more or less at the same time.

I am not a lawyer, so I cannot tell you if something based on Clock-SI would infringe on the Microsoft patent. The main distinction to me seems to be that Clock-SI is based on physical clocks, while the Microsoft patent talks about logical clocks, but again I am not a lawyer.

Best regards,

Willy.
--------------------------------------------------

Does this make sense from your viewpoint, and can we think that we can use Clock-SI without infringing on the patent? According to the patent holder, the differences between Clock-SI and the patent seem to be fewer than the similarities.

Regards
Takayuki Tsunakawa

From:	Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-09-22 07:41:11
Message-ID:	9b36d84a-daac-be1a-df38-67a3317fc209@postgrespro.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 22.09.2020 03:47, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
> Does this make sense from your viewpoint, and can we think that we can use Clock-SI without infringing on the patent? According to the patent holder, the differences between Clock-SI and the patent seem to be fewer than the similarities.
Thank you for this work!
As far as I can see, the main development difficulties lie in other areas:
CSN, resolver, global deadlocks, 2PC commit... I'm not a lawyer either.
But if we get remarks from the patent holders, we can rewrite our Clock-SI
implementation.

--
regards,
Andrey Lepikhov
Postgres Professional

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	'Andrey Lepikhov' <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Global snapshots
Date:	2020-09-23 00:44:22
Message-ID:	TYAPR01MB29909DAA3C7B9CC1ADD2F844FE380@TYAPR01MB2990.jpnprd01.prod.outlook.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

From: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
> Thank you for this work!
> As I can see, main development difficulties placed in other areas: CSN, resolver,
> global deadlocks, 2PC commit... I'm not lawyer too. But if we get remarks from
> the patent holders, we can rewrite our Clock-SI implementation.

Yeah, I understand your feeling. I personally don't like patents, and don't want to be disturbed by them. But the world is not friendly... We are not lawyers, but we have to do our best to make sure PostgreSQL will be patent-free by checking the technologies as engineers.

Among the above items, CSN is the only concerning one. Other items are written in textbooks, well-known, and used in other DBMSs, so they should be free from patents. However, CSN is not (at least to me.) Have you checked if CSN is not related to some patent? Or is CSN or similar technology already widely used in famous software and we can regard it as patent-free?

And please wait. As quoted below, the patent holder just says that Clock-SI is not based on the patent and is an independent development. He doesn't say that Clock-SI does not overlap with the patent or that implementing Clock-SI does not infringe on the patent. Rather, he suggests that Clock-SI has many similarities, and thus those may match the claims of the patent (unintentionally?). I felt this is a sign of infringement risk.

"The answer to your question is: No, Clock-SI is not based on the patent - it was an entirely independent development. The two approaches are similar in the sense that there is no global clock, the commit time of a distributed transaction is the same in every partition where it modified data, and a transaction gets it snapshot timestamp from a local clock. The difference is whether a distributed transaction gets its commit timestamp before or after the prepare phase in 2PC."

The timeline of events also worries me. It seems unnatural to consider that Clock-SI and the patent are independent.

2010/6 - 2010/9 One Clock-SI author worked for Microsoft Research as a research intern
2010/10 Microsoft filed the patent
2011/9 - 2011/12 The same Clock-SI author worked for Microsoft Research as a research intern
2013 The same author moved to EPFL and published the Clock-SI paper with another author who has worked for Microsoft Research since then.

So, could you give your opinion whether we can use Clock-SI without overlapping with the patent claims? I also will try to check and see, so that I can understand your technical analysis.

And I've just noticed that I got in touch with another author of Clock-SI via SNS, and sent an inquiry to him. I'll report again when I have a reply.

Regards
Takayuki Tsunakawa

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	'Andrey Lepikhov' <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Cc:	'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Global snapshots
Date:	2020-09-28 01:36:19
Message-ID:	TYAPR01MB2990E0C9ABC1BCE5FF92DC69FE350@TYAPR01MB2990.jpnprd01.prod.outlook.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Andrey san, all,

From: tsunakawa(dot)takay(at)fujitsu(dot)com <tsunakawa(dot)takay(at)fujitsu(dot)com>
> And please wait. As below, the patent holder just says that Clock-SI is not
> based on the patent and an independent development. He doesn't say
> Clock-SI does not overlap with the patent or implementing Clock-SI does not
> infringe on the patent. Rather, he suggests that Clock-SI has many
> similarities and thus those may match the claims of the patent
> (unintentionally?) I felt this is a sign of risking infringement.
>
> "The answer to your question is: No, Clock-SI is not based on the patent - it
> was an entirely independent development. The two approaches are similar in
> the sense that there is no global clock, the commit time of a distributed
> transaction is the same in every partition where it modified data, and a
> transaction gets it snapshot timestamp from a local clock. The difference is
> whether a distributed transaction gets its commit timestamp before or after the
> prepare phase in 2PC."
>
> The timeline of events also worries me. It seems unnatural to consider that
> Clock-SI and the patent are independent.
>
> 2010/6 - 2010/9 One Clock-SI author worked for Microsoft Research as
> an research intern
> 2010/10 Microsoft filed the patent
> 2011/9 - 2011/12 The same Clock-SI author worked for Microsoft
> Research as an research intern
> 2013 The same author moved to EPFL and published the Clock-SI paper
> with another author who has worked for Microsoft Research since then.
>
> So, could you give your opinion whether we can use Clock-SI without
> overlapping with the patent claims? I also will try to check and see, so that I
> can understand your technical analysis.
>
> And I've just noticed that I got in touch with another author of Clock-SI via SNS,
> and sent an inquiry to him. I'll report again when I have a reply.

I got a reply from the main author of the Clock-SI paper:

[Reply from the Clock-SI author Jiaqing Du]
--------------------------------------------------
Thanks for reaching out.

I actually did not know that Microsoft wrote a patent which is similar to the ideas in my paper. I worked there as an intern. My Clock-SI paper was done at my school (EPFL) after my internships at Microsoft. The paper was very loosely related to my internship project at Microsoft. In a sense, the internship project at Microsoft inspired me to work on Clock-SI after I finished the internship. As you see in the paper, my coauthor, who is my internship host, is also from Microsoft, but interestingly he is not on the patent :)

Cheers,
Jiaqing
--------------------------------------------------

Unfortunately, he also did not assert that Clock-SI does not infringe on the patent. Rather, worrying words are mixed: "similar to my ideas", "loosely related", "inspired".

Also, his internship host is a co-author of the Clock-SI paper. That person should be Sameh Elnikety, who has been working for Microsoft Research. I also asked him the same question, but he has been silent for about 10 days.

When I had a quick look, the patent appeared to be broader than Clock-SI, with Clock-SI being a concrete application of the patent. This is just my guess, but Sameh Elnikety may have known the patent and set an internship theme at Microsoft or the research subject at EPFL based on it, whether he was aware of doing so or not.

As of now, it seems that Clock-SI needs to be evaluated against the patent claims by two or more persons: someone who knows Clock-SI well and implemented it for Postgres (Andrey-san?), and someone else who shares little benefit with the former person and can look at it objectively.

Regards
Takayuki Tsunakawa

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-10-14 16:41:34
Message-ID:	dc6d5fe2-7bdf-1b3f-1027-e152b979c12f@oss.nttdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2020/09/17 15:56, Amit Kapila wrote:
> On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>
>>>> One alternative is to add only hooks into PostgreSQL core so that we can
>>>> implement the global transaction management outside. This idea was
>>>> discussed before as the title "eXtensible Transaction Manager API".
>>>
>>> Yeah, I read that discussion. And I remember Robert Haas and Postgres Pro people said it's not good...
>>
>> But it may be worth revisiting this idea if we cannot avoid the patent issue.
>>
>
> It is not very clear what exactly we can do about the point raised by
> Tsunakawa-San related to patent in this technology as I haven't seen
> that discussed during other development but maybe we can try to study
> a bit. One more thing I would like to bring here is that it seems to
> be there have been some concerns about this idea when originally
> discussed [1]. It is not very clear to me if all the concerns are
> addressed or not. If one can summarize the concerns discussed and how
> the latest patch is able to address those then it will be great.

I have one concern about Clock-SI (sorry if this concern was already
discussed in the past). As far as I read the paper about Clock-SI, ISTM that
Tx2, which starts after Tx1's commit, can fail to see the results of Tx1
due to clock skew. Please see the following example:

1. Tx1 starts at the server A.

2. Tx1 writes some records at the server A.

3. Tx1 gets the local clock 20, uses 20 as CommitTime, then completes
the commit at the server A.
This means that Tx1 is a local transaction, not a distributed one.

4. Tx2 starts at the server B, i.e., the server B works as
the coordinator node for Tx2.

5. Tx2 gets the local clock 10 (i.e., it's delayed behind the server A
due to clock skew) and uses 10 as SnapshotTime at the server B.

6. Tx2 starts the remote transaction at the server A with SnapshotTime 10.

7. Tx2 doesn't need to wait due to clock skew because the imported
SnapshotTime 10 is smaller than the local clock at the server A.

8. Tx2 fails to see the records written by Tx1 at the server A because
Tx1's CommitTime 20 is larger than SnapshotTime 10.

So Tx1 was successfully committed before Tx2 started. But in the above example,
the subsequent transaction Tx2 fails to see the committed results.

A single PostgreSQL instance seems to guarantee linearizability of
transactions, but Clock-SI doesn't in a distributed environment. Is my
understanding right? Or am I missing something?

If my understanding is right, shouldn't we address that issue when using
Clock-SI? Or has the patch already addressed the issue?
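
The eight steps above can be replayed with a small toy model. This is only a sketch under stated assumptions, not code from the patch: the Server class, its method names, and the visibility rule "CommitTime <= SnapshotTime" are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy server with its own, possibly skewed, local clock (hypothetical model)."""
    name: str
    clock: int
    versions: dict = field(default_factory=dict)  # key -> [(commit_time, value)]

    def commit_local(self, key, value):
        """Steps 1-3: commit a local transaction stamped with the local clock."""
        commit_time = self.clock
        self.versions.setdefault(key, []).append((commit_time, value))
        return commit_time

    def read(self, key, snapshot_time):
        """Steps 6-8: read at this server with an imported SnapshotTime."""
        if snapshot_time > self.clock:
            # Clock-SI would delay the reader here; this never triggers below.
            raise RuntimeError("reader must wait until the local clock catches up")
        # Assumed visibility rule: visible if CommitTime <= SnapshotTime.
        visible = [(t, v) for t, v in self.versions.get(key, []) if t <= snapshot_time]
        return max(visible, key=lambda tv: tv[0])[1] if visible else None


server_a = Server("A", clock=20)
server_b = Server("B", clock=10)  # B's clock lags behind A's

tx1_commit_time = server_a.commit_local("x", "written by Tx1")  # steps 1-3
tx2_snapshot_time = server_b.clock                               # steps 4-5
tx2_result = server_a.read("x", tx2_snapshot_time)               # steps 6-8

print(tx1_commit_time, tx2_snapshot_time, tx2_result)
```

In this model Tx2 gets no row back, although Tx1 had already committed when Tx2 started. A reader whose snapshot is 20 or later would see the row.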

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-10-23 02:58:16
Message-ID:	CA+fd4k5d9Jjt2zwCFwm9FmFF1m6J1_ixaU_X7noM9qkDWCbAzA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, 15 Oct 2020 at 01:41, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> I have one concern about Clock-SI (sorry if this concern was already
> discussed in the past). As far as I understand the Clock-SI paper, ISTM that
> Tx2, which starts after Tx1's commit, can fail to see Tx1's results
> because of clock skew. Please see the following example:
>
> 1. Tx1 starts at the server A.
>
> 2. Tx1 writes some records at the server A.
>
> 3. Tx1 gets the local clock 20, uses 20 as CommitTime, then completes
> the commit at the server A.
> This means that Tx1 is a local transaction, not a distributed one.
>
> 4. Tx2 starts at the server B, i.e., the server B works as
> the coordinator node for Tx2.
>
> 5. Tx2 gets the local clock 10 (i.e., it's delayed behind the server A
> due to clock skew) and uses 10 as SnapshotTime at the server B.
>
> 6. Tx2 starts the remote transaction at the server A with SnapshotTime 10.
>
> 7. Tx2 doesn't need to wait due to clock skew because the imported
> SnapshotTime 10 is smaller than the local clock at the server A.
>
> 8. Tx2 fails to see the records written by Tx1 at the server A because
> Tx1's CommitTime 20 is larger than SnapshotTime 10.
>
> So Tx1 was successfully committed before Tx2 started. But in the above example,
> the subsequent transaction Tx2 fails to see the committed results.
>
> A single PostgreSQL instance seems to guarantee linearizability of
> transactions, but Clock-SI doesn't in a distributed environment. Is my
> understanding right? Or am I missing something?
>
> If my understanding is right, shouldn't we address that issue when using
> Clock-SI? Or has the patch already addressed the issue?

As far as I understand the paper, the above scenario can happen. I could
reproduce it with the patch. Moreover, a stale read
could happen even if Tx1 was initiated at server B (i.e., both
transactions started at the same server in sequence). In this case,
Tx1's commit timestamp would be 20, taken from server A's local clock,
whereas Tx2's snapshot timestamp would be 10, the same as in the above case.
Therefore, even though both transactions were initiated at the same
server, linearizability is not provided.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2020-10-28 07:00:27
Message-ID:	f59e3cf8-84d7-4a2b-218d-48ffac488566@oss.nttdata.com
Lists:	pgsql-hackers

On 2020/10/23 11:58, Masahiko Sawada wrote:
> As far as I understand the paper, the above scenario can happen. I could
> reproduce it with the patch. Moreover, a stale read
> could happen even if Tx1 was initiated at server B (i.e., both
> transactions started at the same server in sequence). In this case,
> Tx1's commit timestamp would be 20, taken from server A's local clock,
> whereas Tx2's snapshot timestamp would be 10, the same as in the above case.
> Therefore, even though both transactions were initiated at the same
> server, linearizability is not provided.

Yeah, so if we need to guarantee transaction linearizability even
in a distributed environment (probably we do, right?), using only Clock-SI
is not enough. We would need to implement something
in addition to Clock-SI or adopt a different approach.
Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Global snapshots
Date:	2020-10-28 07:20:16
Message-ID:	TYAPR01MB299083299DA94BA849358BEEFE170@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

Fujii-san, Sawada-san, all,

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
> Yeah, so if we need to guarantee transaction linearizability even
> in a distributed environment (probably we do, right?), using only Clock-SI
> is not enough. We would need to implement something
> in addition to Clock-SI or adopt a different approach.
> Thoughts?

Could you please try interpreting MVCO and see if we have any hope in this? This doesn't fit in my small brain. I'll catch up with understanding this when I have time.

MVCO - Technical report - IEEE RIDE-IMS 93 (PDF; revised version of DEC-TR 853)
https://sites.google.com/site/yoavraz2/MVCO-WDE.pdf

MVCO is a multiversion member of Commitment Ordering algorithms described below:

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co

Commitment ordering - Wikipedia
https://en.wikipedia.org/wiki/Commitment_ordering

Related patents are as follows. The last one is MVCO.

US5504900A - Commitment ordering for guaranteeing serializability across distributed transactions
https://patents.google.com/patent/US5504900A/en?oq=US5504900

US5504899A - Guaranteeing global serializability by applying commitment ordering selectively to global transactions
https://patents.google.com/patent/US5504899A/en?oq=US5504899

US5701480A - Distributed multi-version commitment ordering protocols for guaranteeing serializability during transaction processing
https://patents.google.com/patent/US5701480A/en?oq=US5701480

Regards
Takayuki Tsunakawa

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Cc:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Subject:	RE: Global snapshots
Date:	2021-01-01 03:14:07
Message-ID:	TYAPR01MB2990CAC9CE3D0AA63B45053CFED50@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

Hello,

Fujii-san and I discussed how to move the scale-out development forward. We are both worried that Clock-SI is (highly?) likely to infringe the said Microsoft patent. So we agreed to investigate Clock-SI and the patent, and if we have to conclude that we cannot embrace Clock-SI, we will explore other possibilities.

IMO, it seems that Clock-SI overlaps with the patent and we can't use it. First, regarding how to interpret a patent document, the patent "claims" are what we should pay the greatest attention to. According to the following citation from the IP guide by the Software Freedom Law Center (SFLC) [1], software infringes a patent if it implements everything in any one claim; it does not have to implement all claims.

--------------------------------------------------
4.2 Patent Infringement
To prove that you infringe a patent, the patent holder must show that you make, use, offer to sell, or sell the invention as it is defined in at least one claim of the patent.

For software to infringe a patent, the software essentially must implement everything recited in one of the patent's claims. It is crucial to recognize that infringement is based directly on the claims of the patent, and not on what is stated or described in other parts of the patent document.
--------------------------------------------------

And Clock-SI implements at least claims 11 and 20 of the patent [2], cited below. It doesn't matter whether Clock-SI uses a physical clock or a logical one.

--------------------------------------------------
11. A method comprising:
receiving information relating to a distributed database transaction operating on data in data stores associated with respective participating nodes associated with the distributed database transaction;
requesting commit time votes from the respective participating nodes, the commit time votes reflecting local clock values of the respective participating nodes;
receiving the commit time votes from the respective participating nodes in response to the requesting;
computing a global commit timestamp for the distributed database transaction based at least in part on the commit time votes, the global commit timestamp reflecting a maximum value of the commit time votes received from the respective participating nodes; and
synchronizing commitment of the distributed database transaction at the respective participating nodes to the global commit timestamp,
wherein at least the computing is performed by a computing device.

20. A method for managing a distributed database transaction, the method comprising:
receiving information relating to the distributed database transaction from a transaction coordinator associated with the distributed database transaction;
determining a commit time vote for the distributed database transaction based at least in part on a local clock;
communicating the commit time vote for the distributed database transaction to the transaction coordinator;
receiving a global commit timestamp from the transaction coordinator;
synchronizing commitment of the distributed database transaction to the global commit timestamp;
receiving a remote request from a requesting database node corresponding to the distributed database transaction;
creating a local transaction corresponding to the distributed database transaction;
compiling a list of database nodes involved in generating a result of the local transaction and access types utilized by respective database nodes in the list of database nodes; and
returning the list of database nodes and the access types to the requesting database node in response to the remote request,
wherein at least the compiling is performed by a computing device.
--------------------------------------------------
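
To make the steps of claim 11 easier to follow, here is a minimal sketch of the flow it recites: collect commit time votes from the participants, take the maximum as the global commit timestamp, and commit everywhere at that timestamp. The Participant class and coordinate_commit function are hypothetical names for illustration only. This is neither code from the patch set nor a legal reading of the claim.

```python
class Participant:
    """Toy participating node (hypothetical; names are not from the patch)."""

    def __init__(self, name, local_clock):
        self.name = name
        self.local_clock = local_clock
        self.committed_at = None

    def vote_commit_time(self):
        """Return a commit time vote that reflects the node's local clock."""
        return self.local_clock

    def commit_at(self, global_commit_ts):
        """Commit the local part of the transaction at the agreed timestamp."""
        self.committed_at = global_commit_ts


def coordinate_commit(participants):
    """Request votes, compute the maximum, and commit on every participant."""
    votes = [p.vote_commit_time() for p in participants]
    global_commit_ts = max(votes)
    for p in participants:
        p.commit_at(global_commit_ts)
    return global_commit_ts


nodes = [Participant("A", 20), Participant("B", 10), Participant("C", 15)]
global_ts = coordinate_commit(nodes)
print(global_ts, [n.committed_at for n in nodes])
```

Nothing in the sketch depends on whether local_clock comes from a physical or a logical clock, which is why that distinction does not seem to help here.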

My concern is that the above claims appear to cover a somewhat broad range. I wonder if other patents or unpatented technologies overlap with this kind of description.

Thoughts?

[1]
A Legal Issues Primer for Open Source and Free Software Projects
https://www.softwarefreedom.org/resources/2008/foss-primer.pdf

[2]
US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
https://patents.google.com/patent/US8356007

Regards
Takayuki Tsunakawa

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Cc:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Subject:	Re: Global snapshots
Date:	2021-01-08 12:21:53
Message-ID:	36d657e4-74aa-4f9c-3d6e-30c8a4a5a55f@oss.nttdata.com
Lists:	pgsql-hackers

On 2021/01/01 12:14, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
> Hello,
>
>
> Fujii-san and I discussed how to move the scale-out development forward. We are both worried that Clock-SI is (highly?) likely to infringe the said Microsoft patent. So we agreed to investigate Clock-SI and the patent, and if we have to conclude that we cannot embrace Clock-SI, we will explore other possibilities.

Yes.

>
> IMO, it seems that Clock-SI overlaps with the patent and we can't use it. First, regarding how to interpret a patent document, the patent "claims" are what we should pay the greatest attention to. According to the following citation from the IP guide by the Software Freedom Law Center (SFLC) [1], software infringes a patent if it implements everything in any one claim; it does not have to implement all claims.
>
>
> --------------------------------------------------
> 4.2 Patent Infringement
> To prove that you infringe a patent, the patent holder must show that you make, use, offer to sell, or sell the invention as it is defined in at least one claim of the patent.
>
> For software to infringe a patent, the software essentially must implement everything recited in one of the patent's claims. It is crucial to recognize that infringement is based directly on the claims of the patent, and not on what is stated or described in other parts of the patent document.
> --------------------------------------------------
>
>
> And Clock-SI implements at least claims 11 and 20 cited below. It doesn't matter whether Clock-SI uses a physical clock or a logical one.

Thanks for sharing the result of your investigation!

Regarding at least claim 11, I reached the same conclusion. As far as
I understand, Clock-SI actually performs the method described
in claim 11 when determining the commit time and doing the commit
on each node.

I don't intend to disparage Clock-SI or any work based on it. OTOH,
I'm now wondering if it's worth considering another approach for global
transaction support, while I'm still interested in Clock-SI technically.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Subject:	RE: Global snapshots
Date:	2021-01-19 06:32:56
Message-ID:	TYAPR01MB2990F14722494266C833245FFEA30@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

Hello, Andrey-san, all,

Based on the request at HighGo's sharding meeting, I'm re-sending the information on Commitment Ordering that could be used for global visibility. The related patents have already expired.

--------------------------------------------------
Has anyone examined the following Multiversion Commitment Ordering (MVCO)? Although I haven't understood it yet, it claims that no concurrency control information, including timestamps, needs to be exchanged among the cluster nodes. I'd appreciate it if someone could give an opinion.

--------------------------------------------------
* Or, maybe we can use the following Commitment Ordering, which doesn't require timestamps or any other information to be transferred among the cluster nodes. However, it seems to have to track the order of read and write operations among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance. The MVCO paper seems to present the information we need, but I haven't understood it well yet (it's difficult). Could anybody kindly interpret this?

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co

--------------------------------------------------
Could you please try interpreting MVCO and see if we have any hope in this? This doesn't fit in my small brain. I'll catch up with understanding this when I have time.

MVCO - Technical report - IEEE RIDE-IMS 93 (PDF; revised version of DEC-TR 853)
https://sites.google.com/site/yoavraz2/MVCO-WDE.pdf

MVCO is a multiversion member of Commitment Ordering algorithms described below:

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co

Commitment ordering - Wikipedia
https://en.wikipedia.org/wiki/Commitment_ordering

Related patents are as follows. The last one is MVCO.

US5504900A - Commitment ordering for guaranteeing serializability across distributed transactions
https://patents.google.com/patent/US5504900A/en?oq=US5504900

US5504899A - Guaranteeing global serializability by applying commitment ordering selectively to global transactions
https://patents.google.com/patent/US5504899A/en?oq=US5504899

US5701480A - Distributed multi-version commitment ordering protocols for guaranteeing serializability during transaction processing
https://patents.google.com/patent/US5701480A/en?oq=US5701480

Regards
Takayuki Tsunakawa

From:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Cc:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>
Subject:	Re: Global snapshots
Date:	2021-02-26 06:20:49
Message-ID:	a0503174-9ac9-fe2c-1788-1a1337d72453@postgrespro.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 1/1/21 8:14 AM, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:
> --------------------------------------------------
> 11. A method comprising:
> receiving information relating to a distributed database transaction operating on data in data stores associated with respective participating nodes associated with the distributed database transaction;
> requesting commit time votes from the respective participating nodes, the commit time votes reflecting local clock values of the respective participating nodes;
> receiving the commit time votes from the respective participating nodes in response to the requesting;
> computing a global commit timestamp for the distributed database transaction based at least in part on the commit time votes, the global commit timestamp reflecting a maximum value of the commit time votes received from the respective participating nodes; and
> synchronizing commitment of the distributed database transaction at the respective participating nodes to the global commit timestamp,
> wherein at least the computing is performed by a computing device.

Thank you for this analysis of the patent.
After researching it in depth, I think this is a real problem.
My idea was that we are not using real clocks; we only use clock ticks
to measure time intervals. But that can also be interpreted as a kind of clock.

What we can do:
1. Use global clocks at the start of a transaction.
2. Use the CSN-based snapshot as the machinery and create an extension to
allow user-defined commit protocols.
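
To illustrate what "CSN-based snapshot as a machinery" could mean, here is a minimal sketch of a CSN visibility check. It is a toy model and not the code in the patches: the CSNLog class, its methods, and the rule "visible if the writer's CSN is smaller than the snapshot CSN" are assumptions for illustration. Where the CSN values come from (a local counter as here, a clock, or a user-defined commit protocol) is left open.

```python
class CSNLog:
    """Toy map from transaction id to commit sequence number (hypothetical)."""

    def __init__(self):
        self._csn_by_xid = {}
        self._next_csn = 1

    def begin(self, xid):
        """Register a transaction that is in progress and has no CSN yet."""
        self._csn_by_xid[xid] = None

    def commit(self, xid):
        """Assign the next CSN to the committing transaction."""
        csn = self._next_csn
        self._next_csn += 1
        self._csn_by_xid[xid] = csn
        return csn

    def take_snapshot(self):
        """A snapshot is represented by the next CSN to be assigned."""
        return self._next_csn

    def is_visible(self, xid, snapshot_csn):
        """Assumed rule: visible if committed with a CSN below the snapshot."""
        csn = self._csn_by_xid.get(xid)
        return csn is not None and csn < snapshot_csn


log = CSNLog()
log.begin(100)
log.begin(101)
log.commit(100)                 # committed before the snapshot
snapshot = log.take_snapshot()
log.commit(101)                 # committed after the snapshot

sees_100 = log.is_visible(100, snapshot)
sees_101 = log.is_visible(101, snapshot)
print(sees_100, sees_101)
```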

--
regards,
Andrey Lepikhov
Postgres Professional

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	"'Andrey V(dot) Lepikhov'" <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Global snapshots
Date:	2021-02-26 06:34:09
Message-ID:	TYAPR01MB299079A57E722E71869DBE0CFE9D9@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

From: Andrey V. Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
> After researching it in depth, I think this is a real problem.
> My idea was that we are not using real clocks; we only use clock ticks to
> measure time intervals. But that can also be interpreted as a kind of clock.

Yes, patent claims tend to be written to cover a broad interpretation. That's too bad.

> What we can do:
> 1. Use global clocks at the start of a transaction.
> 2. Use the CSN-based snapshot as the machinery and create an extension to allow
> user-defined commit protocols.

Is this your suggestion to circumvent the patent? Sorry, I'm afraid I can't understand it yet (I have to study more.) I hope others will comment on this.

Regards
Takayuki Tsunakawa

From:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
Cc:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "movead(dot)li(at)highgo(dot)ca" <movead(dot)li(at)highgo(dot)ca>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2021-03-23 07:54:57
Message-ID:	0e65f96f-bfd6-fb09-89b7-e07854c215e6@postgrespro.ru
Lists:	pgsql-hackers

Current state of the patch set, rebased on master (5aed6a1fc2).

This is a development version. Some visibility problems are still
detected in two tests:
1. CSN Snapshot module - TAP test on time skew.
2. Clock SI implementation - TAP test on emulation of bank transaction.

--
regards,
Andrey Lepikhov
Postgres Professional

Attachment	Content-Type	Size
0001-CSN-Log.patch	text/x-patch	44.1 KB
0002-CSN-Snapshot.patch	text/x-patch	63.2 KB
0003-CSN-Snapshot-Tests.patch	text/x-patch	21.9 KB
0004-Clock-SI-implementation.patch	text/x-patch	33.5 KB

From:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To:	"'Andrey V(dot) Lepikhov'" <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Global snapshots
Date:	2021-03-25 05:06:11
Message-ID:	TYAPR01MB299037242B6A5C0B6F9C84AEFE629@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

From: Andrey V. Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
> Current state of the patch set, rebased on master (5aed6a1fc2).
>
> This is a development version. Some visibility problems are still detected in
> two tests:
> 1. CSN Snapshot module - TAP test on time skew.
> 2. Clock SI implementation - TAP test on emulation of bank transaction.

I'm sorry to be late to respond. Thank you for the update.

As discussed at the HighGo meeting, what do you think we should do about this patch set, now that we have agreed that Clock-SI is covered by Microsoft's patent? I'd appreciate it if you could share any ideas for changing part of the algorithm to circumvent the patent.

Otherwise, why don't we discuss alternatives, such as Commitment Ordering?

I have a hunch that YugabyteDB's method, which I described in the following wiki, is promising. Of course, we should make efforts to see if it's patented before diving deeper into the design or implementation.

Scaleout Design - PostgreSQL wiki
https://wiki.postgresql.org/wiki/Scaleout_Design

Regards
Takayuki Tsunakawa

From:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	"tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
Cc:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2021-11-18 04:04:29
Message-ID:	c9786865-a9ed-a79d-256a-fa16671d117b@postgrespro.ru
Lists:	pgsql-hackers

Here is the next version of the CSN implementation in snapshots, which aims to achieve
proper snapshot isolation for cross-instance distributed transactions.

--
regards,
Andrey Lepikhov
Postgres Professional

Attachment	Content-Type	Size
0001-Add-Commit-Sequence-Number-CSN-machinery-into-MVCC.patch	text/x-patch	131.9 KB

From:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Cc:	'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject:	Re: Global snapshots
Date:	2021-11-19 11:10:59
Message-ID:	2983d675-7076-083a-aac7-b35d84c7d6ff@postgrespro.ru
Lists:	pgsql-hackers

The patch in the previous message is full of faults. Please use the new version.
Also, we fixed the problem of losing the CSN value in a parallel
worker (TAP test 003_parallel_safe.pl). Thanks to a.pyhalov for
detecting the problem and providing a bugfix.

--
regards,
Andrey Lepikhov
Postgres Professional

Attachment	Content-Type	Size
v2-0001-Add-Commit-Sequence-Number-CSN-machinery-into-MVCC.patch	text/x-patch	118.5 KB