From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: CSN snapshots in hot standby
Date: 2024-04-04 17:21:18
Message-ID: 08da26cc-95ef-4c0e-9573-8b930f80ce27@iki.fi
Lists: pgsql-hackers
You cannot run queries on a Hot Standby server until the standby has
seen a running-xacts record. Furthermore if the subxids cache had
overflowed, you also need to wait for those transactions to finish. That
is usually not a problem, because we write a running-xacts record after
each checkpoint, and most systems don't use so many subtransactions that
the cache would overflow. Still, you can run into it if you're unlucky,
and it's annoying when you do.
It occurred to me that we could replace the known-assigned-xids
machinery with CSN snapshots. We've talked about CSN snapshots many
times in the past, and I think it would make sense on the primary too,
but for starters, we could use it just during Hot Standby.
With CSN-based snapshots, you don't have the limitation with the
fixed-size known-assigned-xids array, and overflowed sub-XIDs are not a
problem either. You can always enter Hot Standby and start accepting
queries as soon as the standby is in a physically consistent state.
I dusted off and rebased the last CSN patch that I found on the mailing
list [1], and modified it so that it's only used during recovery. That
makes some things simpler and less scary. There are no changes to how
transaction commit happens in the primary, the CSN log is only kept
up-to-date in the standby, when commit/abort records are replayed. The
CSN of each transaction is the LSN of its commit record.
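To make the visibility rule concrete, here is a minimal sketch of how an XID
check against such a snapshot works, assuming the snapshot carries the LSN it
was taken at and a CSN log lookup helper. The names here are illustrative,
not the patch's actual ones:

#include "postgres.h"
#include "access/transam.h"
#include "access/xlogdefs.h"

/*
 * Assumed helper: returns the commit LSN recorded for an XID in the CSN
 * log, or InvalidXLogRecPtr if no commit has been recorded (still in
 * progress, or aborted).
 */
extern XLogRecPtr CSNLogGetCSN(TransactionId xid);

static bool
XidVisibleInCSNSnapshot(TransactionId xid, XLogRecPtr snapshotCsn)
{
    XLogRecPtr  csn = CSNLogGetCSN(xid);

    /* No CSN recorded: not committed, so invisible to every snapshot. */
    if (csn == InvalidXLogRecPtr)
        return false;

    /* Visible only if it committed before the snapshot was taken. */
    return csn <= snapshotCsn;
}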
The CSN approach is much simpler than the existing known-assigned-XIDs
machinery, as you can see from "git diff --stat" with this patch:
32 files changed, 773 insertions(+), 1711 deletions(-)
With CSN snapshots, we don't need the known-assigned-XIDs machinery, and
we can get rid of the xact-assignment records altogether. We no longer
need the running-xacts records for Hot Standby either, but I wasn't able
to remove that because it's still used by logical replication, in
snapbuild.c. I have a feeling that that could somehow be simplified too,
but didn't look into it.
This is obviously v18 material, so I'll park this at the July commitfest
for now. There are a bunch of little FIXMEs in the code, and it needs
performance testing, but overall I was surprised how easy this was.
(We ran into this issue particularly hard with Neon, because with Neon
you don't need to perform WAL replay at standby startup. However, when
you don't perform WAL replay, you don't get to see the running-xact
record after the checkpoint either. If the primary is idle, it doesn't
generate new running-xact records, and the standby cannot start Hot
Standby until the next time something happens in the primary. It's
always a potential problem with overflowed sub-XIDs cache, but the lack
of WAL replay made it happen even when there are no subtransactions
involved.)
[1] /message-id/2020081009525213277261%40highgo.ca
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Use-CSN-snapshots-during-Hot-Standby.patch | text/x-patch | 120.8 KB |
From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-04-04 21:08:37
Message-ID: CALdSSPgbnsq_ZOLd59tORg0-onQ0RVrMP_gQ_HudfJS14ecytQ@mail.gmail.com
Lists: pgsql-hackers
Hi,
On Thu, 4 Apr 2024 at 22:21, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> You cannot run queries on a Hot Standby server until the standby has
> seen a running-xacts record. Furthermore if the subxids cache had
> overflowed, you also need to wait for those transactions to finish. That
> is usually not a problem, because we write a running-xacts record after
> each checkpoint, and most systems don't use so many subtransactions that
> the cache would overflow. Still, you can run into it if you're unlucky,
> and it's annoying when you do.
>
> It occurred to me that we could replace the known-assigned-xids
> machinery with CSN snapshots. We've talked about CSN snapshots many
> times in the past, and I think it would make sense on the primary too,
> but for starters, we could use it just during Hot Standby.
>
> With CSN-based snapshots, you don't have the limitation with the
> fixed-size known-assigned-xids array, and overflowed sub-XIDs are not a
> problem either. You can always enter Hot Standby and start accepting
> queries as soon as the standby is in a physically consistent state.
>
> I dusted off and rebased the last CSN patch that I found on the mailing
> list [1], and modified it so that it's only used during recovery. That
> makes some things simpler and less scary. There are no changes to how
> transaction commit happens in the primary, the CSN log is only kept
> up-to-date in the standby, when commit/abort records are replayed. The
> CSN of each transaction is the LSN of its commit record.
>
> The CSN approach is much simpler than the existing known-assigned-XIDs
> machinery, as you can see from "git diff --stat" with this patch:
>
> 32 files changed, 773 insertions(+), 1711 deletions(-)
>
> With CSN snapshots, we don't need the known-assigned-XIDs machinery, and
> we can get rid of the xact-assignment records altogether. We no longer
> need the running-xacts records for Hot Standby either, but I wasn't able
> to remove that because it's still used by logical replication, in
> snapbuild.c. I have a feeling that that could somehow be simplified too,
> but didn't look into it.
>
> This is obviously v18 material, so I'll park this at the July commitfest
> for now. There are a bunch of little FIXMEs in the code, and it needs
> performance testing, but overall I was surprised how easy this was.
>
> (We ran into this issue particularly hard with Neon, because with Neon
> you don't need to perform WAL replay at standby startup. However, when
> you don't perform WAL replay, you don't get to see the running-xact
> record after the checkpoint either. If the primary is idle, it doesn't
> generate new running-xact records, and the standby cannot start Hot
> Standby until the next time something happens in the primary. It's
> always a potential problem with overflowed sub-XIDs cache, but the lack
> of WAL replay made it happen even when there are no subtransactions
> involved.)
>
> [1]
> /message-id/2020081009525213277261%40highgo.ca
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
Great. I really like the idea of getting rid of KnownAssignedXids instead of
optimizing it (if optimizations are even possible).
> + /*
> + * TODO: We must mark CSNLOG first
> + */
> + CSNLogSetCSN(xid, parsed->nsubxacts, parsed->subxacts, lsn);
> +
As far as I understand, we simply use the commit WAL record's LSN as the
transaction's CSN. OK.
This seems to work for standby snapshots, but this patch may also be really
useful for distributed PostgreSQL solutions that use a CSN for distributed
database snapshots (across multiple shards). Those solutions need to set the
CSN to some other value (a timestamp from TrueTime/Clock-SI or whatever).
So, maybe we need some hooks here? Or maybe we can take the CSN from an
extension somehow. For example, we could define some interface and extend it.
Does this sound reasonable to you?
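To illustrate what I mean, the hook could be as small as the sketch below.
None of these names exist in the patch; they are only an assumption of what
such an interface might look like:

#include "postgres.h"
#include "access/transam.h"
#include "access/xlogdefs.h"

/*
 * Hypothetical hook type: lets an extension supply the CSN for a commit
 * instead of the commit record's LSN.
 */
typedef XLogRecPtr (*csn_assign_hook_type) (TransactionId xid,
                                            XLogRecPtr commitLsn);

csn_assign_hook_type csn_assign_hook = NULL;

static XLogRecPtr
ChooseCSN(TransactionId xid, XLogRecPtr commitLsn)
{
    if (csn_assign_hook == NULL)
        return commitLsn;       /* default: LSN-as-CSN, as in the patch */

    /* extension-provided CSN, e.g. derived from TrueTime/Clock-SI */
    return csn_assign_hook(xid, commitLsn);
}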
Also, I attached a patch which adds some more todos.
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Point-comments-needed-to-be-edited.patch | application/octet-stream | 2.3 KB |
From: | "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Kirill Reshke <reshkekirill(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: CSN snapshots in hot standby |
Date: | 2024-04-05 10:49:24 |
Message-ID: | 53C9CD5A-4403-4FF0-9F00-8B8A00F922A7@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 5 Apr 2024, at 02:08, Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
>
> maybe we need some hooks here? Or maybe we can take the CSN from an extension somehow.
I really like the idea of CSN-provider-as-extension.
But it's very important to move on with CSN, at least on standby, to make CSN actually happen some day.
So, from my perspective, having LSN-as-CSN is already a huge step forward.
Best regards, Andrey Borodin.
From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-08-13 20:13:39
Message-ID: b439edfc-c5e5-43a9-802d-4cb51ec20646@iki.fi
Lists: pgsql-hackers
On 05/04/2024 13:49, Andrey M. Borodin wrote:
>> On 5 Apr 2024, at 02:08, Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
Thanks for taking a look, Kirill!
>> maybe we need some hooks here? Or maybe we can take the CSN from an extension somehow.
>
> I really like the idea of CSN-provider-as-extension.
> But it's very important to move on with CSN, at least on standby, to make CSN actually happen some day.
> So, from my perspective, having LSN-as-CSN is already a huge step forward.
Yeah, I really don't want to expand the scope of this.
Here's a new version. Rebased, and lots of comments updated.
I added a tiny cache of the CSN lookups into SnapshotData, which can
hold the values of 4 XIDs that are known to be visible to the snapshot,
and 4 invisible XIDs. This is pretty arbitrary, but the idea is to have
something very small to speed up the common cases that 1-2 XIDs are
repeatedly looked up, without adding too much overhead.
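Roughly, the cache amounts to something like the sketch below (simplified;
the actual field and function names in the patch may differ). Anything not
found in the cache goes through the normal CSN lookup path and is then
remembered:

#include "postgres.h"
#include "access/transam.h"

#define XID_CACHE_SIZE 4

/* Two tiny rings of XIDs whose visibility for this snapshot is known. */
typedef struct XidVisibilityCache
{
    TransactionId visible_xids[XID_CACHE_SIZE];
    TransactionId invisible_xids[XID_CACHE_SIZE];
    int         next_visible;   /* next slot to overwrite */
    int         next_invisible;
} XidVisibilityCache;

/* Returns true and sets *visible if the XID is found in the cache. */
static bool
XidCacheLookup(XidVisibilityCache *cache, TransactionId xid, bool *visible)
{
    for (int i = 0; i < XID_CACHE_SIZE; i++)
    {
        if (TransactionIdEquals(cache->visible_xids[i], xid))
        {
            *visible = true;
            return true;
        }
        if (TransactionIdEquals(cache->invisible_xids[i], xid))
        {
            *visible = false;
            return true;
        }
    }
    return false;               /* caller falls back to a CSN lookup */
}

/* Remember the result of a CSN lookup, overwriting the oldest entry. */
static void
XidCacheRemember(XidVisibilityCache *cache, TransactionId xid, bool visible)
{
    if (visible)
    {
        cache->visible_xids[cache->next_visible] = xid;
        cache->next_visible = (cache->next_visible + 1) % XID_CACHE_SIZE;
    }
    else
    {
        cache->invisible_xids[cache->next_invisible] = xid;
        cache->next_invisible = (cache->next_invisible + 1) % XID_CACHE_SIZE;
    }
}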
I did some performance testing of the visibility checks using these CSN
snapshots. The tests run SELECTs with a SeqScan in a standby, over a
table where all the rows have xmin/xmax values that are still
in-progress in the primary.
Three test scenarios:
1. large-xact: one large transaction inserted all the rows. All rows
have the same XMIN, which is still in progress
2. many-subxacts: one large transaction inserted each row in a separate
subtransaction. All rows have a different XMIN, but they're all
subtransactions of the same top-level transaction. (This causes the
subxids cache in the proc array to overflow)
3. few-subxacts: All rows are inserted, committed, and vacuum frozen.
Then, using 10 separate subtransactions, DELETE the rows in an
interleaved fashion. The XMAX values cycle like this "1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 1, 2, 3, 4, 5, ...". The point of this is that these
sub-XIDs fit in the subxids cache in the procarray, but the pattern
defeats the simple 4-element cache that I added.
The test script I used is attached. I repeated it a few times with
master and the patches here, and picked the fastest runs for each. Just
eyeballing the results, there's about ~10% variance in these numbers.
Smaller is better.
Master:
large-xact: 4.57732510566711
many-subxacts: 18.6958119869232
few-subxacts: 16.467698097229
Patched:
large-xact: 10.2999930381775
many-subxacts: 11.6501438617706
few-subxacts: 19.8457028865814
With cache:
large-xact: 3.68792295455933
many-subxacts: 13.3662350177765
few-subxacts: 21.4426419734955
The 'large-xacts' results show that the CSN lookups are slower than the
binary search on the 'xids' array. Not a surprise. The 4-element cache
fixes the regression, which is also not a surprise.
The 'many-subxacts' results show that the CSN lookups are faster than
the current method in master, when the subxids cache has overflowed.
That makes sense: on master, we always perform a lookup in pg_subtrans,
if the subxids cache has overflowed, which is more or less the same
overhead as the CSN lookup. But we avoid the binary search on the xids
array after that.
The 'few-subxacts' shows a regression, when the 4-element cache is not
effective. I think that's acceptable, the CSN approach has many
benefits, and I don't think this is a very common scenario. But if
necessary, it could perhaps be alleviated with more caching, or by
trying to compensate by optimizing elsewhere.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Update-outdated-comment-on-WAL-logged-locks-with-.patch | text/x-patch | 1.7 KB |
v2-0002-XXX-add-perf-test.patch | text/x-patch | 5.6 KB |
v2-0003-Use-CSN-snapshots-during-Hot-Standby.patch | text/x-patch | 128.8 KB |
v2-0004-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch | text/x-patch | 6.2 KB |
v2-0005-Remove-the-now-unused-xids-array-from-xl_running_.patch | text/x-patch | 7.0 KB |
v2-0006-Add-a-small-cache-to-Snapshot-to-avoid-CSN-lookup.patch | text/x-patch | 2.5 KB |
From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-08-13 21:15:01
Message-ID: CALdSSPjC+YvJe-YnC2K-7SdT3vE8T7MASRfG9XshDfoOXVANwQ@mail.gmail.com
Lists: pgsql-hackers
On Wed, 14 Aug 2024 at 01:13, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 05/04/2024 13:49, Andrey M. Borodin wrote:
> >> On 5 Apr 2024, at 02:08, Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
>
> Thanks for taking a look, Kirill!
>
> >> maybe we need some hooks here? Or maybe, we can take CSN here from extension somehow.
> >
> > I really like the idea of CSN-provider-as-extension.
> > But it's very important to move on with CSN, at least on standby, to make CSN actually happen some day.
> > So, from my perspective, having LSN-as-CSN is already a huge step forward.
>
> Yeah, I really don't want to expand the scope of this.
>
> Here's a new version. Rebased, and lots of comments updated.
>
> I added a tiny cache of the CSN lookups into SnapshotData, which can
> hold the values of 4 XIDs that are known to be visible to the snapshot,
> and 4 invisible XIDs. This is pretty arbitrary, but the idea is to have
> something very small to speed up the common cases that 1-2 XIDs are
> repeatedly looked up, without adding too much overhead.
>
>
> I did some performance testing of the visibility checks using these CSN
> snapshots. The tests run SELECTs with a SeqScan in a standby, over a
> table where all the rows have xmin/xmax values that are still
> in-progress in the primary.
>
> Three test scenarios:
>
> 1. large-xact: one large transaction inserted all the rows. All rows
> have the same XMIN, which is still in progress
>
> 2. many-subxacts: one large transaction inserted each row in a separate
> subtransaction. All rows have a different XMIN, but they're all
> subtransactions of the same top-level transaction. (This causes the
> subxids cache in the proc array to overflow)
>
> 3. few-subxacts: All rows are inserted, committed, and vacuum frozen.
> Then, using 10 separate subtransactions, DELETE the rows in an
> interleaved fashion. The XMAX values cycle like this "1, 2, 3, 4, 5, 6,
> 7, 8, 9, 10, 1, 2, 3, 4, 5, ...". The point of this is that these
> sub-XIDs fit in the subxids cache in the procarray, but the pattern
> defeats the simple 4-element cache that I added.
>
> The test script I used is attached. I repeated it a few times with
> master and the patches here, and picked the fastest runs for each. Just
> eyeballing the results, there's about ~10% variance in these numbers.
> Smaller is better.
>
> Master:
>
> large-xact: 4.57732510566711
> many-subxacts: 18.6958119869232
> few-subxacts: 16.467698097229
>
> Patched:
>
> large-xact: 10.2999930381775
> many-subxacts: 11.6501438617706
> few-subxacts: 19.8457028865814
>
> With cache:
>
> large-xact: 3.68792295455933
> many-subxacts: 13.3662350177765
> few-subxacts: 21.4426419734955
>
> The 'large-xacts' results show that the CSN lookups are slower than the
> binary search on the 'xids' array. Not a surprise. The 4-element cache
> fixes the regression, which is also not a surprise.
>
> The 'many-subxacts' results show that the CSN lookups are faster than
> the current method in master, when the subxids cache has overflowed.
> That makes sense: on master, we always perform a lookup in pg_subtrans,
> if the subxids cache has overflowed, which is more or less the same
> overhead as the CSN lookup. But we avoid the binary search on the xids
> array after that.
>
> The 'few-subxacts' shows a regression, when the 4-element cache is not
> effective. I think that's acceptable, the CSN approach has many
> benefits, and I don't think this is a very common scenario. But if
> necessary, it could perhaps be alleviated with more caching, or by
> trying to compensate by optimizing elsewhere.
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
Thanks for the update. I will try to find time to perf-test this.
First, some random suggestions. Sorry for being too nit-picky.
1) in 0002
> +/*
> + * Number of shared CSNLog buffers.
> + */
> +static Size
> +CSNLogShmemBuffers(void)
> +{
> + return Min(32, Max(16, NBuffers / 512));
> +}
Should we GUC this? (A rough sketch of what that could look like follows after these comments.)
2) In 0002 CSNLogShmemInit:
> + //SlruPagePrecedesUnitTests(CsnlogCtl, SUBTRANS_XACTS_PER_PAGE);
remove this?
3) In 0002 InitCSNLogPage:
> + SimpleLruZeroPage(CsnlogCtl, pageno);
We can use ZeroCSNLogPage here. That would justify the existence of this
function a little bit more.
4) In 0002:
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -27,7 +27,7 @@
> * removed. This is achieved by using the replication slot mechanism.
> *
> * As the percentage of transactions modifying the catalog normally is fairly
> - * small in comparisons to ones only manipulating user data, we keep track of
> + * small in comparison to ones only manipulating user data, we keep track of
> * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
> * track of all running transactions like it's done in a normal snapshot. Note
> * that we're generally only looking at transactions that have acquired an
This change is unrelated to the 0002 patch; let's just push it as a separate change.
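Coming back to point 1), a GUC-backed version could look roughly like the
sketch below. "csnlog_buffers" is a made-up name, the registration in
guc_tables.c is omitted, and 0 is assumed to mean "autotune with the patch's
current formula"; this is only meant as a drop-in for the quoted function:

int         csnlog_buffers = 0;     /* hypothetical GUC, 0 = autotune */

static Size
CSNLogShmemBuffers(void)
{
    if (csnlog_buffers > 0)
        return (Size) csnlog_buffers;

    /* fall back to the formula in the patch */
    return Min(32, Max(16, NBuffers / 512));
}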
Overall, 0002 looks straightforward, though big. I wonder, however, how
we can test that this change does not lead to any unpleasant problems,
like observing uncommitted changes on replicas, corruption, and so on.
Maybe some basic injection-point-based TAP test is desirable here?
--
Best regards,
Kirill Reshke
From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-09-24 18:08:27
Message-ID: swzptfywx4i2jj65oi3v3atsiywho3qzr5rj4rjr7ktzpxinmh@nwoaps6qvuqy
Lists: pgsql-hackers
Hi,
On 2024-08-13 23:13:39 +0300, Heikki Linnakangas wrote:
> I added a tiny cache of the CSN lookups into SnapshotData, which can hold
> the values of 4 XIDs that are known to be visible to the snapshot, and 4
> invisible XIDs. This is pretty arbitrary, but the idea is to have something
> very small to speed up the common cases that 1-2 XIDs are repeatedly looked
> up, without adding too much overhead.
>
>
> I did some performance testing of the visibility checks using these CSN
> snapshots. The tests run SELECTs with a SeqScan in a standby, over a table
> where all the rows have xmin/xmax values that are still in-progress in the
> primary.
>
> Three test scenarios:
>
> 1. large-xact: one large transaction inserted all the rows. All rows have
> the same XMIN, which is still in progress
>
> 2. many-subxacts: one large transaction inserted each row in a separate
> subtransaction. All rows have a different XMIN, but they're all
> subtransactions of the same top-level transaction. (This causes the subxids
> cache in the proc array to overflow)
>
> 3. few-subxacts: All rows are inserted, committed, and vacuum frozen. Then,
> using 10 separate subtransactions, DELETE the rows in an interleaved
> fashion. The XMAX values cycle like this "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1,
> 2, 3, 4, 5, ...". The point of this is that these sub-XIDs fit in the
> subxids cache in the procarray, but the pattern defeats the simple 4-element
> cache that I added.
I'd like to see some numbers for a workload with many overlapping top-level
transactions. In contrast to 2), HEAD wouldn't need to do subtrans lookups,
whereas this patch would need to do csn lookups. And a four entry cache
probably wouldn't help very much.
> +/*
> + * Record commit LSN of a transaction and its subtransaction tree.
> + *
> + * xid is a single xid to set status for. This will typically be the top level
> + * transaction ID for a top level commit.
> + *
> + * subxids is an array of xids of length nsubxids, representing subtransactions
> + * in the tree of xid. In various cases nsubxids may be zero.
> + *
> + * commitLsn is the LSN of the commit record. This is currently never called
> + * for aborted transactions.
> + */
> +void
> +CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
> + XLogRecPtr commitLsn)
> +{
> + int pageno;
> + int i = 0;
> + int offset = 0;
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + pageno = TransactionIdToPage(xid); /* get page of parent */
> + for (;;)
> + {
> + int num_on_page = 0;
> +
> + while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
> + {
> + num_on_page++;
> + i++;
> + }
Hm - is there any guarantee / documented requirement that subxids is sorted?
> + CSNLogSetPageStatus(xid,
> + num_on_page, subxids + offset,
> + commitLsn, pageno);
> + if (i >= nsubxids)
> + break;
> +
> + offset = i;
> + pageno = TransactionIdToPage(subxids[offset]);
> + xid = InvalidTransactionId;
> + }
> +}
Hm. Maybe I'm missing something, but what prevents a concurrent transaction to
check the visibility of a subtransaction between marking the subtransaction
committed and marking the main transaction committed? If subtransaction and
main transaction are on the same page that won't be possible, but if they are
on different ones it does seem possible?
Today XidInMVCCSnapshot() will use pg_subtrans to find the top transaction in
case of a suboverflowed snapshot, but with this patch that's not the case
anymore. Which afaict will mean that repeated snapshot computations could
give different results for the same query?
Greetings,
Andres Freund
From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-10-21 17:32:33
Message-ID: 7cd97248-fd04-4fb2-9f7f-096c1d7660e8@iki.fi
Lists: pgsql-hackers
On 24/09/2024 21:08, Andres Freund wrote:
> I'd like to see some numbers for a workload with many overlapping top-level
> transactions. In contrast to 2), HEAD wouldn't need to do subtrans lookups,
> whereas this patch would need to do csn lookups. And a four entry cache
> probably wouldn't help very much.
I spent some more time on the tests. Here is a better set of adversarial
tests, which hit the worst case scenarios for this patch.
All the test scenarios have this high-level shape:
1. Create a table with 100000 rows, vacuum freeze it.
2. In primary, open transactions or subtransactions, and DELETE all rows
using the different (sub)transactions, to set the xmax of every row on
the test table. Leave the transactions open.
3. In standby, SELECT COUNT(*) all rows in the table, and measure how
long it takes.
The difference between the test scenarios is in the pattern of xmax
values, i.e. how many transactions or subtransactions were used. All the
rows are visible, the performance differences come just from how
expensive the visibility checks are in different cases.
First, the results on 'master' without patches (smaller is better):
few-xacts: 0.0041 s / iteration
many-xacts: 0.0042 s / iteration
many-xacts-wide-apart: 0.0042 s / iteration
few-subxacts: 0.0042 s / iteration
many-subxacts: 0.0073 s / iteration
many-subxacts-wide-apart: 0.10 s / iteration
So even on master, there are significant differences depending on
whether the sub-XIDs fit in the in-memory caches, or if you need to do
lookups in pg_subtrans. That's not surprising. Note how bad the
"many-subxacts-wide-apart" scenario is, though. It's over 20x slower
than the best case scenario! I was a little taken aback by that. More on
that later.
Descriptions of the test scenarios:
few-xacts: The xmax values on the rows cycle through four different
XIDs, like this: 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...
many-xacts: like 'few-xacts', but cycle through 100 different XIDs.
many-xacts-wide-apart: like 'many-xacts', but the XIDs used are spread
out, so that there are 1000 unrelated committed XIDs in between each XID
used in the test table. I.e. "1000, 2000, 3000, 4000, 5000, ...". It
doesn't make a difference in the 'many-xacts-wide-apart' test, but in
the many-subxacts-wide-apart variant it does. It makes the XIDs fall on
different SLRU pages so that there are not enough SLRU buffers to hold
them all.
few-subxacts, many-subxacts, many-subxacts-wide-apart: Same tests, but
instead of using different top-level XIDs, all the XIDs are
subtransactions belonging to a single top-level XID.
Now, with the patch (the unpatched numbers are repeated here for
comparison):
master patched
few-xacts: 0.0041 0.0040 s / iteration
many-xacts: 0.0042 0.0053 s / iteration
many-xacts-wide-apart: 0.0042 0.17 s / iteration
few-subxacts: 0.0042 0.0040 s / iteration
many-subxacts: 0.0073 0.0052 s / iteration
many-subxacts-wide-apart: 0.10 0.22 s / iteration
So when the 4-element cache is effective, in the 'few-xacts' case, the
patch performs well. In the 'many-xacts' case, it needs to perform CSN
lookups, making it a little slower. The 'many-xacts-wide-apart'
regresses badly, showing the same SLRU thrashing effect on CSN lookups as
the 'many-subxacts-wide-apart' case does on 'master' on pg_subtrans lookups.
Some thoughts on all this:
1. The many-subxacts-wide-apart performance is horrible even on master.
'perf' shows that about half of the CPU time is spent in open() and
close(). We open and close the SLRU file every time we need to read a
page! That's obviously silly, but also shouldn't be hard to fix.
2. Even if we fix the open/close issue and make the worst case 2x
faster, the worst case is still bad. We could call this a tuning issue;
more SLRU buffers helps. But that's not very satisfactory. I really wish
we could make SLRU buffers auto-tuning. Move them to the main buffer
pool. Or something. And I wish SLRU lookups were faster even in the case
that the SLRU page is already in memory. The LWLock acquire+release
shows up in profiles, maybe we could do some kind of optimistic locking
instead.
3. Aside from making SLRUs faster, we could also mask its slowness in
the CSN patch by caching. The 4-element cache in Snapshot that I
implemented is fast when it's sufficient, but we could make it larger to
cover more cases. At the extreme, we could never remove elements from
it, and just let it grow as large as needed.
4. Currently on 'master', the XID list in a snapshot is an array of XIDs
that is binary searched. A different data structure might be better.
When the difference between xmin and xmax is small, a bitmap would be
compact and fast to look up, for example. Or maybe a radix tree or
something. This is an independent optimization that might make
XidInMVCCSnapshot() faster even without the CSN stuff, but if we decide
to go with a large cache (see previous paragraph), it would be nice to
reduce the worst case memory usage of the cache with something like this.
5. I'm not sure how much any of this matters in practice. Performance is
obviously important, but we don't get too many complaints about these
things even though the 'many-subxacts-wide-apart' case is pretty bad
already. It was not that easy to construct these adversarial scenarios.
If we were implementing this from scratch, I think we could easily
accept the performance with the patch. Regressions can be very
unpleasant for existing users, however..
Thoughts? I think the fastest way to make progress with the CSN patch is
to make the cache larger, to hide the SLRU lookups. But those other
things would be interesting to explore too.
>> +/*
>> + * Record commit LSN of a transaction and its subtransaction tree.
>> + *
>> + * xid is a single xid to set status for. This will typically be the top level
>> + * transaction ID for a top level commit.
>> + *
>> + * subxids is an array of xids of length nsubxids, representing subtransactions
>> + * in the tree of xid. In various cases nsubxids may be zero.
>> + *
>> + * commitLsn is the LSN of the commit record. This is currently never called
>> + * for aborted transactions.
>> + */
>> +void
>> +CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
>> + XLogRecPtr commitLsn)
>> +{
>> + int pageno;
>> + int i = 0;
>> + int offset = 0;
>> +
>> + Assert(TransactionIdIsValid(xid));
>> +
>> + pageno = TransactionIdToPage(xid); /* get page of parent */
>> + for (;;)
>> + {
>> + int num_on_page = 0;
>> +
>> + while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
>> + {
>> + num_on_page++;
>> + i++;
>> + }
>
>
> Hm - is there any guarantee / documented requirement that subxids is sorted?
Yes, subtransaction XIDs are assigned in order. But point taken; I'll
add a comment about that here too.
>
>> + CSNLogSetPageStatus(xid,
>> + num_on_page, subxids + offset,
>> + commitLsn, pageno);
>> + if (i >= nsubxids)
>> + break;
>> +
>> + offset = i;
>> + pageno = TransactionIdToPage(subxids[offset]);
>> + xid = InvalidTransactionId;
>> + }
>> +}
>
>
> Hm. Maybe I'm missing something, but what prevents a concurrent transaction to
> check the visibility of a subtransaction between marking the subtransaction
> committed and marking the main transaction committed? If subtransaction and
> main transaction are on the same page that won't be possible, but if they are
> on different ones it does seem possible?
>
> Today XidInMVCCSnapshot() will use pg_subtrans to find the top transaction in
> case of a suboverflowed snapshot, but with this patch that's not the case
> anymore. Which afaict will mean that repeated snapshot computations could
> give different results for the same query?
The concurrent transaction's snapshot would consider the transaction and
all its subtransactions as still in-progress. Depending on the timing,
when XidInMVCCSnapshot() looks up the CSN of the subtransaction XID, it
will either see that it has no CSN which means it's still in progress
and thus invisible, or it has a CSN that is greater than the snapshot's
CSN, and invisible because of that. The global latestCommitLSN value is
advanced only after CSNLogSetCSN() has finished setting the CSN on all
the subtransactions, so before that, there cannot be any snapshots that
would see it as visible yet.
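To spell out the ordering that relies on, replay of a commit record roughly
does the following. This is a condensed sketch with assumed names
(SetLatestCommitLSN is made up, CSNLogSetCSN is from the patch); the real
code also deals with locks and shared memory:

#include "postgres.h"
#include "access/transam.h"
#include "access/xlogdefs.h"
#include "port/atomics.h"

/* From the patch: stamps the CSN on the XID and its subtransactions. */
extern void CSNLogSetCSN(TransactionId xid, int nsubxids,
                         TransactionId *subxids, XLogRecPtr commitLsn);
/* Assumed helper: publishes the global latestCommitLSN. */
extern void SetLatestCommitLSN(XLogRecPtr lsn);

static void
ReplayCommitRecord(TransactionId xid, int nsubxids, TransactionId *subxids,
                   XLogRecPtr commitLsn)
{
    /* 1. Record the CSN for the top-level XID and all subtransactions. */
    CSNLogSetCSN(xid, nsubxids, subxids, commitLsn);

    /* 2. Only then advance the value new snapshots use as their CSN. */
    pg_write_barrier();
    SetLatestCommitLSN(commitLsn);

    /*
     * A snapshot taken before step 2 has a CSN < commitLsn, so even if it
     * sees the values written in step 1, those XIDs stay invisible to it.
     */
}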
No code changes since v2, except the tests and minor comments, but I'm
including all the patches here again for convenience.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment | Content-Type | Size |
---|---|---|
v3-0001-XXX-add-perf-test.patch | text/x-patch | 10.4 KB |
v3-0002-Use-CSN-snapshots-during-Hot-Standby.patch | text/x-patch | 128.8 KB |
v3-0003-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch | text/x-patch | 6.2 KB |
v3-0004-Remove-the-now-unused-xids-array-from-xl_running_.patch | text/x-patch | 7.0 KB |
v3-0005-Add-a-small-cache-to-Snapshot-to-avoid-CSN-lookup.patch | text/x-patch | 2.5 KB |
From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-10-29 16:33:53
Message-ID: b1955fa6-14b3-46cb-8796-6a2373b7220d@iki.fi
Lists: pgsql-hackers
On 21/10/2024 20:32, Heikki Linnakangas wrote:
> On 24/09/2024 21:08, Andres Freund wrote:
>> I'd like to see some numbers for a workload with many overlapping
>> top-level
>> transactions. I contrast to 2) HEAD wouldn't need to do subtrans lookups,
>> whereas this patch would need to do csn lookups. And a four entry cache
>> probably wouldn't help very much.
>
> I spent some more time on the tests. Here is a better set of adversarial
> tests, which hit the worst case scenarios for this patch.
>
> All the test scenarios have this high-level shape:
>
> 1. Create a table with 100000 rows, vacuum freeze it.
>
> 2. In primary, open transactions or subtransactions, and DELETE all rows
> using the different (sub)transactions, to set the xmax of every row on
> the test table. Leave the transactions open.
>
> 3. In standby, SELECT COUNT(*) all rows in the table, and measure how
> long it takes.
>
> The difference between the test scenarios is in the pattern of xmax
> values, i.e. how many transactions or subtransactions were used. All the
> rows are visible, the performance differences come just from how
> expensive the visibility checks are in different cases.
>
> First, the results on 'master' without patches (smaller is better):
>
> few-xacts: 0.0041 s / iteration
> many-xacts: 0.0042 s / iteration
> many-xacts-wide-apart: 0.0042 s / iteration
> few-subxacts: 0.0042 s / iteration
> many-subxacts: 0.0073 s / iteration
> many-subxacts-wide-apart: 0.10 s / iteration
>
> So even on master, there are significant differences depending on
> whether the sub-XIDs fit in the in-memory caches, or if you need to do
> lookups in pg_subtrans. That's not surprising. Note how bad the
> "many-subxacts-wide-apart" scenario is, though. It's over 20x slower
> than the best case scenario! I was a little taken aback by that. More on
> that later.
>
> Descriptions of the test scenarios:
>
> few-xacts: The xmax values on the rows cycle through four different
> XIDs, like this: 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...
>
> many-xacts: like 'few-xacts', but cycle through 100 different XIDs.
>
> many-xacts-wide-apart: like 'many-xacts', but the XIDs used are spread
> out, so that there are 1000 unrelated committed XIDs in between each XID
> used in the test table. I.e. "1000, 2000, 3000, 4000, 5000, ...". It
> doesn't make a difference in the 'many-xacts-wide-apart' test, but in
> the many-subxacts-wide-apart variant it does. It makes the XIDs fall on
> different SLRU pages so that there are not enough SLRU buffers to hold
> them all.
>
> few-subxacts, many-subxacts, many-subxacts-wide-apart: Same tests, but
> instead of using different top-level XIDs, all the XIDs are
> subtransactions belonging to a single top-level XID.
>
>
> Now, with the patch (the unpatched numbers are repeated here for
> comparison):
>
> master patched
> few-xacts: 0.0041 0.0040 s / iteration
> many-xacts: 0.0042 0.0053 s / iteration
> many-xacts-wide-apart: 0.0042 0.17 s / iteration
> few-subxacts: 0.0042 0.0040 s / iteration
> many-subxacts: 0.0073 0.0052 s / iteration
> many-subxacts-wide-apart: 0.10 0.22 s / iteration
>
>
> So when the 4-element cache is effective, in the 'few-xacts' case, the
> patch performs well. In the 'many-xacts' case, it needs to perform CSN
> lookups, making it a little slower. The 'many-xacts-wide-apart'
> regresses badly, showing the same SLRU thrashing effect on CSN lookups as
> the 'many-subxacts-wide-apart' case does on 'master' on pg_subtrans
> lookups.
Here's another version, which replaces the small 4-element cache with a
cache with no size limit. It's implemented as a radix tree and entries
are never removed, so it can grow to hold the status of all XIDs between
the snapshot's xmin and xmax at most.
This new cache solves the performance issue with the earlier tests:
master patched
few-xacts: 0.0041 0.0041 s / iteration
many-xacts: 0.0042 0.0042 s / iteration
many-xacts-wide-apart: 0.0043 0.0045 s / iteration
few-subxacts: 0.0043 0.0042 s / iteration
many-subxacts: 0.0076 0.0042 s / iteration
many-subxacts-wide-apart: 0.11 0.0070 s / iteration
The new cache also elides the slow pg_subtrans lookups that make the
many-subxacts-wide-apart case slow on 'master', which is nice.
I added two tests to the test suite:
master patched
insert-all-different-xids: 0.00027 0.00019 s / iteration
insert-all-different-subxids: 0.00023 0.00020 s / iteration
insert-all-different-xids: Open 1000 connections, insert one row in
each, and leave the transactions open. In the replica, select all the rows
insert-all-different-subxids: The same, but with 1 transaction with 1000
subxids.
The point of these new tests is to test the scenario where the cache
doesn't help and just adds overhead, because each XID is looked up only
once. Seems to be fine. Surprisingly good actually; I'll do some more
profiling on that to understand why it's even faster than 'master'.
Now the downside of this new cache: Since it has no size limit, if you
keep looking up different XIDs, it will keep growing until it holds all
the XIDs between the snapshot's xmin and xmax. That can take a lot of
memory in the worst case. Radix tree is pretty memory efficient, but
holding, say 1 billion XIDs would probably take something like 500 MB of
RAM (the radix tree stores 64-bit words with 2 bits per XID, plus the
radix tree nodes). That's per snapshot, so if you have a lot of
connections, maybe even with multiple snapshots each, that can add up.
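For reference, that estimate comes from packing two bits per XID into the
64-bit values stored in the radix tree, roughly as in the sketch below
(made-up names, not the patch's code): one billion XIDs at two bits each is
about 250 MB of values, and the tree nodes roughly double that.

#include "postgres.h"

#define XIDS_PER_CACHE_WORD 32  /* 64-bit word, 2 bits per XID */

static inline uint64
csn_cache_key(TransactionId xid)
{
    return (uint64) xid / XIDS_PER_CACHE_WORD;
}

static inline int
csn_cache_bitpos(TransactionId xid)
{
    return (xid % XIDS_PER_CACHE_WORD) * 2;
}

/* Within a cache word: bit "pos" = lookup done, bit "pos + 1" = visible. */
#define CSN_CACHE_DONE(word, pos)       (((word) >> (pos)) & UINT64CONST(1))
#define CSN_CACHE_VISIBLE(word, pos)    (((word) >> ((pos) + 1)) & UINT64CONST(1))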
I'm inclined to accept that memory usage. If we wanted to limit the size
of the cache, we would need to choose a policy on how to truncate it
(delete random nodes?), what the limit should be etc. But I think it'd
be rare to hit those cases in practice. If you have a one billion XID
old transaction running in the primary, you probably have bigger
problems already.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment | Content-Type | Size |
---|---|---|
0001-XXX-add-perf-test.patch | text/x-patch | 12.0 KB |
0002-Use-CSN-snapshots-during-Hot-Standby.patch | text/x-patch | 128.8 KB |
0003-Make-SnapBuildWaitSnapshot-work-without-xl_running_x.patch | text/x-patch | 6.2 KB |
0004-Remove-the-now-unused-xids-array-from-xl_running_xac.patch | text/x-patch | 7.0 KB |
0005-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-lookup.patch | text/x-patch | 5.7 KB |
From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-11-15 19:16:13
Message-ID: 39f9fe48-4db9-4daa-b4c5-c6f46ac92597@iki.fi
Lists: pgsql-hackers
On 29/10/2024 18:33, Heikki Linnakangas wrote:
> I added two tests to the test suite:
> master patched
> insert-all-different-xids: 0.00027 0.00019 s / iteration
> insert-all-different-subxids: 0.00023 0.00020 s / iteration
>
> insert-all-different-xids: Open 1000 connections, insert one row in
> each, and leave the transactions open. In the replica, select all the rows
>
> insert-all-different-subxids: The same, but with 1 transaction with 1000
> subxids.
>
> The point of these new tests is to test the scenario where the cache
> doesn't help and just adds overhead, because each XID is looked up only
> once. Seems to be fine. Surprisingly good actually; I'll do some more
> profiling on that to understand why it's even faster than 'master'.
Ok, I did some profiling and it makes sense:
In the insert-all-different-xids test on 'master', we spend about 60% of
CPU time in XidInMVCCSnapshot(), doing pg_lfind32() over the subxip
array. We should probably sort the array and use a binary search if it's
large or something...
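For the record, that sort-plus-binary-search idea would look roughly like the
sketch below (not part of the patches here; it assumes the subxip array has
been sorted once in plain uint32 order when it was first found to be large):

#include "postgres.h"
#include "access/transam.h"

static bool
XidInSortedSubxip(const TransactionId *subxip, int subxcnt, TransactionId xid)
{
    int         lo = 0;
    int         hi = subxcnt - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (subxip[mid] == xid)
            return true;
        if (subxip[mid] < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}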
With these patches, instead of the pg_lfind32() over the subxip array, we
perform one CSN SLRU lookup, and the page is cached. There's
locking overhead etc. with that, but it's still cheaper than the
pg_lfind32().
In the insert-all-different-subxids test on 'master', the subxip array
is overflowed, so we call SubTransGetTopmostTransaction() on each XID.
That performs two pg_subtrans lookups for each XID, first for the
subxid, then for the parent. With these patches, we perform just one
SLRU lookup, in pg_csnlog, which is faster.
> Now the downside of this new cache: Since it has no size limit, if you
> keep looking up different XIDs, it will keep growing until it holds all
> the XIDs between the snapshot's xmin and xmax. That can take a lot of
> memory in the worst case. Radix tree is pretty memory efficient, but
> holding, say 1 billion XIDs would probably take something like 500 MB of
> RAM (the radix tree stores 64-bit words with 2 bits per XID, plus the
> radix tree nodes). That's per snapshot, so if you have a lot of
> connections, maybe even with multiple snapshots each, that can add up.
>
> I'm inclined to accept that memory usage. If we wanted to limit the size
> of the cache, we would need to choose a policy on how to truncate it
> (delete random nodes?), what the limit should be etc. But I think it'd
> be rare to hit those cases in practice. If you have a one billion XID
> old transaction running in the primary, you probably have bigger
> problems already.
I'd love to hear some thoughts on this caching behavior. Is it
acceptable to let the cache grow, potentially to very large sizes in the
worst cases? Or do we need to make it more complicated and implement
some eviction policy?
--
Heikki Linnakangas
Neon (https://neon.tech)
From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-11-20 13:33:01
Message-ID: CANWCAZaxwGfZxD1KQb88_eLUPj_yx7Sc3B1y6-kjVN2g7wy1iw@mail.gmail.com
Lists: pgsql-hackers
On Tue, Oct 29, 2024 at 11:34 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> master patched
> few-xacts: 0.0041 0.0041 s / iteration
> many-xacts: 0.0042 0.0042 s / iteration
> many-xacts-wide-apart: 0.0043 0.0045 s / iteration
Hi Heikki,
I have some thoughts about the behavior of the cache that might not be
apparent in this test:
The tree is only as tall as need be to store the highest non-zero
byte. On a newly initialized cluster, the current txid is small. The
first two test cases here will result in a tree with height of 2. The
last one will have a height of 3, and its runtime looks a bit higher,
although that could be just noise or touching more cache lines. It
might be worth it to try a test run while forcing the upper byte of
the keys to be non-zero (something like "key | (1<<30)"), so that the
tree always has a height of 4. That would match real-world conditions
more closely. If need be, there are a couple things we can do to
optimize node dispatch and touch fewer cache lines.
> I added two tests to the test suite:
> master patched
> insert-all-different-xids: 0.00027 0.00019 s / iteration
> insert-all-different-subxids: 0.00023 0.00020 s / iteration
> The point of these new tests is to test the scenario where the cache
> doesn't help and just adds overhead, because each XID is looked up only
> once. Seems to be fine. Surprisingly good actually; I'll do some more
> profiling on that to understand why it's even faster than 'master'.
These tests use a sequential scan. For things like primary key
lookups, I wonder if the overhead of creating and destroying the
tree's memory contexts for the (not used again) cache would be
noticeable. If so, it wouldn't be too difficult to teach radix tree to
create the larger contexts lazily.
> Now the downside of this new cache: Since it has no size limit, if you
> keep looking up different XIDs, it will keep growing until it holds all
> the XIDs between the snapshot's xmin and xmax. That can take a lot of
> memory in the worst case. Radix tree is pretty memory efficient, but
> holding, say 1 billion XIDs would probably take something like 500 MB of
> RAM (the radix tree stores 64-bit words with 2 bits per XID, plus the
> radix tree nodes). That's per snapshot, so if you have a lot of
> connections, maybe even with multiple snapshots each, that can add up.
>
> I'm inclined to accept that memory usage. If we wanted to limit the size
> of the cache, we would need to choose a policy on how to truncate it
> (delete random nodes?), what the limit should be etc. But I think it'd
> be rare to hit those cases in practice. If you have a one billion XID
> old transaction running in the primary, you probably have bigger
> problems already.
I don't have a good sense of whether it needs a limit or not, but if
we decide to add one as a precaution, maybe it's enough to just blow
the cache away when reaching some limit? Being smarter than that would
need some work.
--
John Naylor
Amazon Web Services
From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-12-03 14:25:13
Message-ID: 718d1788-b058-40e6-bc37-8f15612b5646@iki.fi
Lists: pgsql-hackers
On 20/11/2024 15:33, John Naylor wrote:
> The tree is only as tall as need be to store the highest non-zero
> byte. On a newly initialized cluster, the current txid is small. The
> first two test cases here will result in a tree with height of 2. The
> last one will have a height of 3, and its runtime looks a bit higher,
> although that could be just noise or touching more cache lines. It
> might be worth it to try a test run while forcing the upper byte of
> the keys to be non-zero (something like "key | (1<<30), so that the
> tree always has a height of 4. That would match real-world conditions
> more closely. If need be, there are a couple things we can do to
> optimize node dispatch and touch fewer cache lines.
Good point. In some quick testing with the 'few-xacts' test, the
difference between the worst case and best case is about 10%. That can
be avoided very easily: instead of using the xid as the key, use (xmax -
xid). That way, the highest key used is the distance between the
snapshot's xmin and xmax rather than the absolute xid values.
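Concretely, the change is just in how the cache key is derived, along these
lines (a sketch with illustrative names, not the patch's code):

#include "postgres.h"
#include "utils/snapshot.h"

/*
 * Sketch of the key change: use the distance from the snapshot's xmax
 * rather than the raw XID as the radix tree key, so keys start near zero
 * regardless of how many XIDs the cluster has consumed and the tree stays
 * shallow. The distance is then split into cache words as before.
 */
static inline uint64
csn_cache_key(Snapshot snapshot, TransactionId xid)
{
    uint32      distance = (uint32) (snapshot->xmax - xid);

    return (uint64) distance;
}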
>> I added two tests to the test suite:
>> master patched
>> insert-all-different-xids: 0.00027 0.00019 s / iteration
>> insert-all-different-subxids: 0.00023 0.00020 s / iteration
>
>> The point of these new tests is to test the scenario where the cache
>> doesn't help and just adds overhead, because each XID is looked up only
>> once. Seems to be fine. Surprisingly good actually; I'll do some more
>> profiling on that to understand why it's even faster than 'master'.
>
> These tests use a sequential scan. For things like primary key
> lookups, I wonder if the overhead of creating and destroying the
> tree's memory contexts for the (not used again) cache would be
> noticeable. If so, it wouldn't be too difficult to teach radix tree to
> create the larger contexts lazily.
I played with that a little, but it doesn't make much difference. It is
measurable when you look at the profile with 'perf': XidInMVCCSnapshot()
takes about 4% of CPU time in the worst case, where you perform a very
simple SeqScan over a table with just one row, and you perform only one
visibility check in each scan. I could squeeze that down to around 3% by
allocating the contexts lazily. But that's not significant in the grand
scheme of things. Might still be worth doing, but it's not a blocker for
this work, it can be discussed separately.
I did find one weird thing that makes a big difference: I originally
used AllocSetContextCreate(..., ALLOCSET_DEFAULT_SIZES) for the radix
tree's memory context. With that, XidInMVCCSnapshot() takes about 19% of
the CPU time in that test. When I changed that to ALLOCSET_SMALL_SIZES,
it falls down to the 4% figure. And weird enough, in both cases the time
seems to be spent in the malloc() call from SlabContextCreate(), not
AllocSetContextCreate(). I think doing this particular mix of large and
small allocations with malloc() somehow poisons its free list or
something. So this is probably heavily dependent on the malloc()
implementation. In any case, ALLOCSET_SMALL_SIZES is clearly a better
choice here, even without that effect.
One way to eliminate all that would be to start with a tiny cache with
e.g. 4 elements with linear probing, and switch to the radix tree only
when that fills up. But it doesn't seem worth the extra code right now.
Here's a new version with those small changes:
- use xmax - xid as the cache key
- use ALLOCSET_SMALL_SIZES for the radix tree's memory context
Thanks for looking into this!
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment | Content-Type | Size |
---|---|---|
v5-0001-XXX-add-perf-test.patch | text/x-patch | 12.0 KB |
v5-0002-Use-CSN-snapshots-during-Hot-Standby.patch | text/x-patch | 128.8 KB |
v5-0003-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch | text/x-patch | 6.2 KB |
v5-0004-Remove-the-now-unused-xids-array-from-xl_running_.patch | text/x-patch | 7.0 KB |
v5-0005-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-loo.patch | text/x-patch | 6.6 KB |
From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-12-05 10:36:43
Message-ID: CANWCAZZOfxuCxbBMr7AiAaEezsb+_5-p5_aYaP=6BeSPeQWBbw@mail.gmail.com
Lists: pgsql-hackers
On Tue, Dec 3, 2024 at 9:25 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 20/11/2024 15:33, John Naylor wrote:
> I did find one weird thing that makes a big difference: I originally
> used AllocSetContextCreate(..., ALLOCSET_DEFAULT_SIZES) for the radix
> tree's memory context. With that, XidInMVCCSnapshot() takes about 19% of
> the CPU time in that test. When I changed that to ALLOCSET_SMALL_SIZES,
> it falls down to the 4% figure. And weird enough, in both cases the time
> seems to be spent in the malloc() call from SlabContextCreate(), not
> AllocSetContextCreate(). I think doing this particular mix of large and
> small allocations with malloc() somehow poisons its free list or
> something. So this is probably heavily dependent on the malloc()
> implementation. In any case, ALLOCSET_SMALL_SIZES is clearly a better
> choice here, even without that effect.
Hmm, interesting. That passed context is needed for 4 things:
1. allocated values (not used here for 64-bit, and 32-bit could be
made to work the same way)
2. iteration state (not used here)
3. a convenient place to put slab child contexts so we can free them easily
4. a place to put the "control object" -- this is really only needed
for shared memory and I have a personal todo to embed it rather than
allocate it for the local memory case.
Removing the need for a passed context for callers that don't need it
is additional possible future work.
Anyway, 0005 looks good to me.
--
John Naylor
Amazon Web Services
From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <johncnaylorls(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>
Subject: Re: CSN snapshots in hot standby
Date: 2025-03-31 21:31:43
Message-ID: 80f254d3-8ee9-4cde-a7e3-ee99998154da@iki.fi
Lists: pgsql-hackers
Here's a new patchset version. Not much has changed in the actual CSN
patches. But I spent a lot of time refactoring the snapshot management
code, so that there is a simple place to add the "inprogress XID cache"
for the CSN snapshots, in a way that avoids duplicating the cache if a
snapshot is copied around.
Patches 0001-0002 are the patches I posted on a separate thread earlier.
See
/message-id/ec10d398-c9b3-4542-8095-5fc6408b17d1%40iki.fi.
Patches 0003-0006 contain more snapshot manager changes. The end state
is that an MVCC snapshot consists of two structs: a shared "inner"
struct that contains xmin, xmax and the XID lists, and an "outer" struct
that contains a pointer to the shared struct and the current command ID.
As a snapshot is copied around, all the copies share the same shared,
reference-counted struct.
The rest of the patches are the same CSN patches I posted before,
rebased over the snapshot manager changes.
There's one thing that hasn't been discussed yet: The
ProcArrayRecoveryEndTransaction() function, which replaces
ExpireTreeKnownAssignedTransactionIds() and is called on replay of every
commit/abort record, does this:
> /*
> * If this was the oldest XID that was still running, advance it. This is
> * important for advancing the global xmin, which avoids unnecessary
> * recovery conflicts
> *
> * No locking required because this runs in the startup process.
> *
> * XXX: the caller actually has a list of XIDs that just committed. We
> * could save some clog lookups by taking advantage of that list.
> */
> oldest_running_primary_xid = procArray->oldest_running_primary_xid;
> while (oldest_running_primary_xid < max_xid)
> {
> if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
> !TransactionIdDidAbort(oldest_running_primary_xid))
> {
> break;
> }
> TransactionIdAdvance(oldest_running_primary_xid);
> }
> if (max_xid == oldest_running_primary_xid)
> TransactionIdAdvance(oldest_running_primary_xid);
The point is to maintain an "oldest xmin" value based on the WAL records
that are being replayed. Whenever the currently oldest running XID
finishes, we scan the CLOG to find the next oldest XID that hasn't
completed yet.
That adds approximately one or two CLOG lookups to every commit record
replay on average. I haven't tried measuring that, but it seems like it
could slow down recovery. There are ways that could be improved. For
example, do it in larger batches.
A bunch of other small XXX comments remain, but they're just markers for
comments that need to be adjusted, or for further cleanups that are now
possible.
There are also several ways the inprogress cache could be made more
efficient, which I haven't explored:
- For each XID in the cache radix tree, we store one bit to indicate
whether the lookup has been performed, i.e. if the cache is valid for
the XID, and another bit to indicate if the XID is visible or not. With
64-bit cache words stored in the radix tree, each cache word can store
the status of 32 transactions. It would probably be better to work in
bigger chunks. For example, when doing a lookup in the cache, check the
status of 64 transactions at once. Assuming they're all stored on the
same CSN page, it would not be much more expensive than a single XID
lookup. That would make the cache 2x more compact, and save on future
lookups of XIDS falling on the same cache word.
- Initializing the radix tree cache is fairly expensive, with several
memory allocations. Many of those allocations could be done lazily with
some effort in radixtree.h.
- Or start the cache as a small array of XIDs, and switch to the radix
tree only after it fills up.
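For reference, the current two-bits-per-XID packing looks conceptually
like this (a simplified sketch; the names are made up, and the real
cache words live in the radix tree, keyed by the XID's word number):

#define XIDS_PER_CACHE_WORD 32

typedef uint64 XidCacheWord;    /* two bits per XID: "cached" and "visible" */

static inline bool
xid_cache_lookup(XidCacheWord word, TransactionId xid, bool *visible)
{
    int         bitno = (xid % XIDS_PER_CACHE_WORD) * 2;

    if ((word & (UINT64CONST(1) << bitno)) == 0)
        return false;           /* no cached result for this XID yet */

    *visible = (word & (UINT64CONST(1) << (bitno + 1))) != 0;
    return true;
}

Filling a whole word in one go would make the separate "cached" bit
unnecessary, which is where the 2x compaction would come from.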
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment | Content-Type | Size |
---|---|---|
v6-0001-Split-SnapshotData-into-separate-structs-for-each.patch | text/x-patch | 85.4 KB |
v6-0002-Simplify-historic-snapshot-refcounting.patch | text/x-patch | 13.2 KB |
v6-0003-Add-an-explicit-valid-flag-to-MVCCSnapshotData.patch | text/x-patch | 4.0 KB |
v6-0004-Replace-static-snapshot-pointers-with-the-valid-f.patch | text/x-patch | 13.9 KB |
v6-0005-Make-RestoreSnapshot-register-the-snapshot-with-c.patch | text/x-patch | 3.1 KB |
v6-0006-Replace-the-RegisteredSnapshot-pairing-heap-with-.patch | text/x-patch | 21.6 KB |
v6-0007-Split-MVCCSnapshot-into-inner-and-outer-parts.patch | text/x-patch | 94.6 KB |
v6-0008-XXX-add-perf-test.patch | text/x-patch | 12.0 KB |
v6-0009-Use-CSN-snapshots-during-Hot-Standby.patch | text/x-patch | 129.5 KB |
v6-0010-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch | text/x-patch | 6.2 KB |
v6-0011-Remove-the-now-unused-xids-array-from-xl_running_.patch | text/x-patch | 6.9 KB |
v6-0012-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-loo.patch | text/x-patch | 7.5 KB |
From: | 贾明伟 <wei19860922(at)163(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | John Naylor <johncnaylorls(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com> |
Subject: | Re: CSN snapshots in hot standby |
Date: | 2025-04-12 15:18:06 |
Message-ID: | 355C0849-820C-4373-9CC9-9A257C661C3A@163.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Thanks for the proposal; it's an interesting approach.
I have a question regarding the xid visibility during standby startup.
If the checkpoint’s `oldestActiveXid` is smaller than `nextXid`, then there may be already-committed transactions
in that range which will not be replayed on standby. In that case, I believe clog needs to be used for visibility
checks within that xid range — is that correct?
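Conceptually, what I mean is something like this (just a sketch of the
idea, with a made-up helper name and arguments, not code taken from any
of the patches):

static bool
XidInCheckpointWindowIsVisible(TransactionId xid,
                               TransactionId oldest_active_xid,
                               TransactionId next_xid)
{
    /*
     * XIDs in [oldestActiveXid, nextXid) may have committed before the
     * checkpoint, so the standby never replays their commit records and
     * the CSN log has no entry for them.  Fall back to the CLOG.
     */
    Assert(TransactionIdFollowsOrEquals(xid, oldest_active_xid));
    Assert(TransactionIdPrecedes(xid, next_xid));

    return TransactionIdDidCommit(xid);
}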
On top of your previous discussion, I wrote a test case and attempted to fix the issue.
Patches 0001–0012 are your original commits, unchanged. Patch 0013 contains my own ideas.
Since this is my first time replying to the mailing list, I was worried about breaking the thread, so I’ve included everything as attachments instead.
Looking forward to your thoughts.
Best regards,
Mingwei Jia
From: | 贾明伟 <wei19860922(at)163(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | John Naylor <johncnaylorls(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com> |
Subject: | Re: CSN snapshots in hot standby |
Date: | 2025-04-13 13:17:36 |
Message-ID: | CD0FE043-0A6E-4D7F-9508-6BF6EA528438@163.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi all,
Apologies — the patch I sent earlier did not appear as expected in the mailing list archives because it was sent with the wrong attachment type.
I'll resend it properly as an inline patch shortly.
Thanks for your understanding!
Best regards,
Mingwei Jia
Attachment | Content-Type | Size |
---|---|---|
v7-0013-use-clog-in-XidInMVCCSnapshot.patch | application/octet-stream | 5.3 KB |