From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | GiST VACUUM |
Date: | 2018-03-07 06:50:55 |
Message-ID: | 1B9FAC6F-FA19-4A24-8C1B-F4F574844892@yandex-team.ru |
Lists: | pgsql-hackers |
Hi, hackers!
Here's a new version of the GiST VACUUM patch set, aimed at CF 2018-09.
The original thread can be found at [0].
==Purpose==
Currently, GiST never reuses pages, even if they are completely empty. This can lead to bloat if many index tuples are deleted from a part of the key space that does not receive newly inserted tuples.
The first patch fixes that issue: empty pages are collected and reused for new page splits.
The second patch improves the scanning algorithm to read GiST blocks in physical order. This can dramatically increase the speed of scanning, especially on HDDs.
This scan uses a relatively big chunk of memory to build a map of the whole GiST graph. If there is not enough maintenance memory, the patch falls back to the old GiST VACUUM (from the first patch).
==How logical scan works==
GiST VACUUM scans the graph depth-first (DFS) to find removed tuples on leaf pages. It remembers the internal pages that reference completely empty leaf pages.
Then, in an additional step, those pages are rescanned to delete the references and mark the leaf pages as free.
==How physical scan works==
The scan builds an array of GistPSItem (physical scan items), with one item for each GiST block. A GistPSItem contains a list of offsets pointing to potentially empty leaf pages, plus the information necessary to collect that list in one sequential read of the file.
When we encounter a leaf page, if it is empty, we mark it as empty and jump to its parent (if known) to update the parent's list.
When we encounter an internal page, we check the GistPSItem of every child block, to see whether it is an empty leaf and to record the parent pointer there.
==Limitations==
At least one reference on each internal page is left undeleted, to preserve the balance of the tree.
Pages that have the FOLLOW-RIGHT flag set are also not deleted, even if empty.
Thank you for your attention, any thoughts are welcome.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0001-Delete-pages-during-GiST-VACUUM-v3.patch | application/octet-stream | 13.9 KB |
0002-Physical-GiST-scan-during-VACUUM-v3.patch | application/octet-stream | 15.3 KB |
From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-05-17 04:40:13 |
Message-ID: | CAEepm=1N3qPJqP81U+YwKZYwM53dDU1O-PZ4JmFy03EgGWR+Ng@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> Here's new version of GiST VACUUM patch set aimed at CF 2018-09.
Hi Andrey,
FYI Windows doesn't like this patch[1].
+ uint16_t flags;
I think that needs to be uint16?
[1] https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.16
--
Thomas Munro
http://www.enterprisedb.com
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-05-17 05:33:26 |
Message-ID: | DFECE345-BEA2-4844-947D-F5875F8E4B06@yandex-team.ru |
Lists: | pgsql-hackers |
Hi, Thomas!
> On 17 May 2018, at 9:40, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>> Here's new version of GiST VACUUM patch set aimed at CF 2018-09.
>
> Hi Andrey,
>
> FYI Windows doesn't like this patch[1].
>
> + uint16_t flags;
>
> I think that needs to be uint16?
Thanks! Fixed.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0001-Delete-pages-during-GiST-VACUUM-v4.patch | application/octet-stream | 13.9 KB |
0002-Physical-GiST-scan-during-VACUUM-v4.patch | application/octet-stream | 15.3 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-10 20:07:44 |
Message-ID: | cd18b07b-f74c-34b2-313d-64ea7dc45f77@iki.fi |
Lists: | pgsql-hackers |
I'm now looking at the first patch in this series, to allow completely
empty GiST pages to be recycled. I've got some questions:
> --- a/src/backend/access/gist/gist.c
> +++ b/src/backend/access/gist/gist.c
> @@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
> GISTInsertStack *item;
> OffsetNumber downlinkoffnum;
>
> + if(GistPageIsDeleted(stack->page))
> + {
> + UnlockReleaseBuffer(stack->buffer);
> + xlocked = false;
> + state.stack = stack = stack->parent;
> + continue;
> + }
> downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
> iid = PageGetItemId(stack->page, downlinkoffnum);
> idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
This seems misplaced. This code deals with internal pages, and as far as
I can see, this patch never marks internal pages as deleted, only leaf
pages. However, we should have something like this in the leaf-page
branch, to deal with the case that an insertion lands on a page that was
concurrently deleted. Did you have any tests, where an insertion runs
concurrently with vacuum, that would exercise this?
The code in gistbulkdelete() seems pretty expensive. In the first phase,
it records the parent of every empty leaf page it encounters. In the
second phase, it scans every leaf page of that parent, not only those
leaves that were seen as empty.
I'm a bit wary of using pd_prune_xid for the checks to determine if a
deleted page can be recycled yet. In heap pages, pd_prune_xid is just a
hint, but here it's used for a critical check. This seems to be the same
mechanism we use in B-trees, but in B-trees, we store the XID in
BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use
ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See
comments in _bt_unlink_halfdead_page() for explanation. This patch is
missing any comments to explain how this works in GiST.
If you crash in the middle of gistbulkdelete(), after it has removed the
downlink from the parent, but before it has marked the leaf page as
deleted, the leaf page is "leaked". I think that's acceptable, but a
comment at least would be good.
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-11 16:04:44 |
Message-ID: | 77FB49FA-E718-4540-ADC9-0BDCA4114900@yandex-team.ru |
Lists: | pgsql-hackers |
Hi, Heikki!
Thanks for looking into the patch!
> On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> I'm now looking at the first patch in this series, to allow completely empty GiST pages to be recycled. I've got some questions:
>
>> --- a/src/backend/access/gist/gist.c
>> +++ b/src/backend/access/gist/gist.c
>> @@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
>> GISTInsertStack *item;
>> OffsetNumber downlinkoffnum;
>> + if(GistPageIsDeleted(stack->page))
>> + {
>> + UnlockReleaseBuffer(stack->buffer);
>> + xlocked = false;
>> + state.stack = stack = stack->parent;
>> + continue;
>> + }
>> downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
>> iid = PageGetItemId(stack->page, downlinkoffnum);
>> idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
>
> This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with the case that an insertion lands on a page that was concurrently deleted.
That's a bug. Will fix this.
> Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this?
Yes, I've tried to test this, but, obviously, not enough. I'll think more about how to deal with it.
>
> The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen as empty.
Yes, in the first patch there is the simplest gistbulkdelete(); the second patch remembers the line pointers of downlinks to empty leaves.
>
> I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for explanation. This patch is missing any comments to explain how this works in GiST.
Will look into this. I remember it was OK half a year ago, but now it is clear to me that I should have documented that part when I understood it....
>
> If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be good.
I was considering doing reverse-split (page merge) concurrency as in Lanin and Shasha's paper, but it is just too complex for too little benefit. Will add comments on possible orphan pages.
Many thanks! I hope to post updated patch series this week.
Best regards, Andrey Borodin.
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-12 16:06:38 |
Message-ID: | 21F3E668-4570-4D98-B96D-66ED63A5BCBD@yandex-team.ru |
Lists: | pgsql-hackers |
Hi!
PFA v5 of the patch series.
> On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with the case that an insertion lands on a page that was concurrently deleted. Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this?
That bug could manifest only in case of a crash between removing the downlinks and marking the pages deleted. I do not know how to test this reliably.
Internal pages are locked before leaves, and the locks are coupled. No concurrent backend can see downlinks to pages being deleted, unless a crash happens.
I've moved the code covering this situation into the leaf code path and added a comment.
>
> The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen as empty.
It is fixed in the second patch of the series.
>
> I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for explanation. This patch is missing any comments to explain how this works in GiST.
I've replaced the usage of GetCurrentTransactionId() with ReadNewTransactionId() and added an explanation of what is going on. Also, I've added comments about that pd_prune_xid usage. There is no other use for this field in GiST. And there is no other room to place this xid on the page without pg_upgrade.
>
> If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be good.
Added explanatory comment between WAL-logging downlink removal and marking pages deleted.
Thank you for reviewing the patch!
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Physical-GiST-scan-during-VACUUM-v5.patch | application/octet-stream | 15.3 KB |
0001-Delete-pages-during-GiST-VACUUM-v5.patch | application/octet-stream | 15.0 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-12 16:40:33 |
Message-ID: | 9cc18cc9-08d0-9838-7297-57500311d57b@iki.fi |
Lists: | pgsql-hackers |
On 12/07/18 19:06, Andrey Borodin wrote:
>> On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>
>> This seems misplaced. This code deals with internal pages, and as
>> far as I can see, this patch never marks internal pages as deleted,
>> only leaf pages. However, we should have something like this in the
>> leaf-page branch, to deal with the case that an insertion lands on
>> a page that was concurrently deleted. Did you have any tests, where
>> an insertion runs concurrently with vacuum, that would exercise
>> this?
>
> That bug could manifest only in case of crash between removing
> downlinks and marking pages deleted.
Hmm. The downlink is removed first, so I don't think you can see that
situation after a crash. After a crash, you might have some empty,
orphaned, pages that have already been unlinked from the parent, but a
search/insert should never encounter them.
Actually, now that I think about it more, I'm not happy with leaving
orphaned pages like that behind. Let's WAL-log the removal of the
downlink, and marking the leaf pages as deleted, in one WAL record, to
avoid that.
But the situation in gistdoinsert(), where you encounter a deleted leaf
page, could happen during normal operation, if vacuum runs concurrently
with an insert. Insertion locks only one page at a time, as it descends
the tree, so after it has released the lock on the parent, but before it
has locked the child, vacuum might have deleted the page. In the latest
patch, you're checking for that just before swapping the shared lock for
an exclusive one, but I think that's wrong; you need to check for that
after swapping the lock, because otherwise vacuum might delete the page
while you're not holding the lock.
> I do not know how to test this
> reliably. Internal pages are locked before leaves and locks are
> coupled. No concurrent backend can see downlinks to pages being
> deleted, unless a crash happens.
Are you sure? At a quick glance, I don't think the locks are coupled.
We do need some way of testing this..
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-12 17:07:00 |
Message-ID: | 89E1C149-8E24-4FF6-BF77-6119606AFCA9@yandex-team.ru |
Lists: | pgsql-hackers |
> On 12 July 2018, at 20:40, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 12/07/18 19:06, Andrey Borodin wrote:
>>> On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> This seems misplaced. This code deals with internal pages, and as
>>> far as I can see, this patch never marks internal pages as deleted,
>>> only leaf pages. However, we should have something like this in the
>>> leaf-page branch, to deal with the case that an insertion lands on
>>> a page that was concurrently deleted. Did you have any tests, where
>>> an insertion runs concurrently with vacuum, that would exercise
>>> this?
>> That bug could manifest only in case of crash between removing
>> downlinks and marking pages deleted.
>
> Hmm. The downlink is removed first, so I don't think you can see that situation after a crash. After a crash, you might have some empty, orphaned, pages that have already been unlinked from the parent, but a search/insert should never encounter them.
>
> Actually, now that I think about it more, I'm not happy with leaving orphaned pages like that behind. Let's WAL-log the removal of the downlink, and marking the leaf pages as deleted, in one WAL record, to avoid that.
OK, will do this. But this will complicate WAL replay seriously, and I do not know a proper way to test that (BTW, there is a GiST amcheck in progress, but I decided to set it aside for a while).
>
> But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal operation, if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the tree, so after it has released the lock on the parent, but before it has locked the child, vacuum might have deleted the page. In the latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but I think that's wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the page while you're not holding the lock.
Looks like a valid concern, I'll move that code again.
>
>> I do not know how to test this
>> reliably. Internal pages are locked before leaves and locks are
>> coupled. No concurrent backend can see downlinks to pages being
>> deleted, unless a crash happens.
>
> Are you sure? At a quick glance, I don't think the locks are coupled.
Sorry for overquoting.
+ /* rescan inner pages that had empty child pages */
+ foreach(cell,rescanList)
+ {
+ Buffer buffer;
+ Page page;
+ OffsetNumber i,
+ maxoff;
+ IndexTuple idxtuple;
+ ItemId iid;
+ OffsetNumber todelete[MaxOffsetNumber];
+ Buffer buftodelete[MaxOffsetNumber];
+ int ntodelete = 0;
+
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+ RBM_NORMAL, info->strategy);
+ LockBuffer(buffer, GIST_EXCLUSIVE);
Here's the first lock:
+ gistcheckpage(rel, buffer);
+ page = (Page) BufferGetPage(buffer);
+
+ Assert(!GistPageIsLeaf(page));
+
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+ {
+ Buffer leafBuffer;
+ Page leafPage;
+
+ iid = PageGetItemId(page, i);
+ idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+ leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+ RBM_NORMAL, info->strategy);
+ LockBuffer(leafBuffer, GIST_EXCLUSIVE);
Now the locks are coupled in a top-down descent.
>
> We do need some way of testing this..
Can we test replication of concurrent VACUUM and inserts with the existing test suite? I just do not know.
I can do these tests manually, if that is enough.
Best regards, Andrey Borodin.
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-13 13:41:39 |
Message-ID: | 1E8C1FFE-D547-427F-BF9E-AD674F5C5DED@yandex-team.ru |
Lists: | pgsql-hackers |
Hi!
PFA v6 of the patch.
> On 12 July 2018, at 21:07, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> On 12 July 2018, at 20:40, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> Actually, now that I think about it more, I'm not happy with leaving orphaned pages like that behind. Let's WAL-log the removal of the downlink, and marking the leaf pages as deleted, in one WAL record, to avoid that.
> OK, will do this. But this will complicate WAL replay seriously, and I do not know a proper way to test that (BTW there is GiST amcheck in progress, but I decided to leave it for a while).
Done. Now the WAL record for a deleted page also removes the downlink from the internal page. I had to use IndexPageTupleDelete() instead of IndexPageTupleMultiDelete(), but I do not think it will have any impact on performance.
>> But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal operation, if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the tree, so after it has released the lock on the parent, but before it has locked the child, vacuum might have deleted the page. In the latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but I think that's wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the page while you're not holding the lock.
> Looks like a valid concern, I'll move that code again.
Done.
>>> I do not know how to test this reliably. Internal pages are locked before leaves and locks are coupled. No concurrent backend can see downlinks to pages being deleted, unless a crash happens.
>> Are you sure? At a quick glance, I don't think the locks are coupled.
I haven't yet tested WAL replay well, will do this soon.
Best regards, Andrey Borodin.
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-13 14:10:40 |
Message-ID: | 50d0105a-bd99-788e-37ee-51f59d81fbec@iki.fi |
Lists: | pgsql-hackers |
On 13/07/18 16:41, Andrey Borodin wrote:
>> On 12 July 2018, at 21:07, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>> On 12 July 2018, at 20:40, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> Actually, now that I think about it more, I'm not happy with leaving orphaned
>>> pages like that behind. Let's WAL-log the removal of the downlink, and
>>> marking the leaf pages as deleted, in one WAL record, to avoid that.
>>
>> OK, will do this. But this will complicate WAL replay seriously, and I do not
>> know a proper way to test that (BTW there is GiST amcheck in progress, but I
>> decided to leave it for a while).
> Done. Now WAL record for deleted page also removes downlink from internal page.
> I had to use IndexPageTupleDelete() instead of IndexPageTupleMultiDelete(), but
> I do not think it will have any impact on performance.
Yeah, I think that's fine; this isn't that performance-critical.
>>> But the situation in gistdoinsert(), where you encounter a deleted leaf page,
>>> could happen during normal operation, if vacuum runs concurrently with an
>>> insert. Insertion locks only one page at a time, as it descends the tree, so
>>> after it has released the lock on the parent, but before it has locked the
>>> child, vacuum might have deleted the page. In the latest patch, you're
>>> checking for that just before swapping the shared lock for an exclusive one,
>>> but I think that's wrong; you need to check for that after swapping the lock,
>>> because otherwise vacuum might delete the page while you're not holding the lock.
>> Looks like a valid concern, I'll move that code again.
> Done.
Ok, the comment now says:
> + /*
> + * Leaf pages can be left deleted but still referenced in case of
> + * crash during VACUUM's gistbulkdelete()
> + */
But that's not accurate, right? You should never see deleted pages after
a crash, because the parent is updated in the same WAL record as the
child page, right?
I'm still a bit scared about using pd_prune_xid to store the XID that
prevents recycling the page too early. Can we use some field in
GISTPageOpaqueData for that, similar to how the B-tree stores it in
BTPageOpaqueData?
- Heikki
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-13 14:25:47 |
Message-ID: | f36887ce-b7fe-4957-3df6-f79507ed0917@iki.fi |
Lists: | pgsql-hackers |
Looking at the second patch, to scan the GiST index in physical order,
that seems totally unsafe, if there are any concurrent page splits. In
the logical scan, pushStackIfSplited() deals with that, by comparing the
page's NSN with the parent's LSN. But I don't see anything like that in
the physical scan code.
I think we can do something similar in the physical scan: remember the
current LSN position at the beginning of the vacuum, and compare with
that. The B-tree code uses the "cycle ID" for similar purposes.
Do we still need the separate gistvacuumcleanup() pass, if we scan the
index in physical order in the bulkdelete pass already?
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-13 18:28:48 |
Message-ID: | 5F278E38-FD10-45EB-88D0-5B1D7108F984@yandex-team.ru |
Lists: | pgsql-hackers |
> On 13 July 2018, at 18:10, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
>>>> But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal operation, if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the tree, so after it has released the lock on the parent, but before it has locked the child, vacuum might have deleted the page. In the latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but I think that's wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the page while you're not holding the lock.
>>> Looks like a valid concern, I'll move that code again.
>> Done.
>
> Ok, the comment now says:
>
>> + /*
>> + * Leaf pages can be left deleted but still referenced in case of
>> + * crash during VACUUM's gistbulkdelete()
>> + */
>
> But that's not accurate, right? You should never see deleted pages after a crash, because the parent is updated in the same WAL record as the child page, right?
Fixed the comment.
>
> I'm still a bit scared about using pd_prune_xid to store the XID that prevents recycling the page too early. Can we use some field in GISTPageOpaqueData for that, similar to how the B-tree stores it in BTPageOpaqueData?
There is no room in the opaque data, but, technically, the whole page is just a tombstone until it is reused, so we can pick an arbitrary place. PFA v7, where the xid after delete is stored in the opaque data, but we could use other places, like the line pointer array, or opaque-1.
> On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> Looking at the second patch, to scan the GiST index in physical order, that seems totally unsafe, if there are any concurrent page splits. In the logical scan, pushStackIfSplited() deals with that, by comparing the page's NSN with the parent's LSN. But I don't see anything like that in the physical scan code.
A leaf page can be pointed to by an internal page and by a rightlink simultaneously. The purpose of the NSN is to visit such a page exactly once in a scan, by following only one of the two links. This is achieved naturally if we read everything from the beginning to the end. (That is how I understand it; I can be wrong.)
> I think we can do something similar in the physical scan: remember the current LSN position at the beginning of the vacuum, and compare with that. The B-tree code uses the "cycle ID" for similar purposes.
>
> Do we still need the separate gistvacuumcleanup() pass, if we scan the index in physical order in the bulkdelete pass already?
We do not need to gather stats there, but we are doing RecordFreeIndexPage() and IndexFreeSpaceMapVacuum(). Is it correct to move those into the first scan?
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Physical-GiST-scan-during-VACUUM-v7.patch | application/octet-stream | 13.0 KB |
0001-Delete-pages-during-GiST-VACUUM-v7.patch | application/octet-stream | 18.9 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-13 20:28:41 |
Message-ID: | 72ececf1-845d-f6d2-76b8-27d94615f2a8@iki.fi |
Lists: | pgsql-hackers |
On 13/07/18 21:28, Andrey Borodin wrote:
>> On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>
>> Looking at the second patch, to scan the GiST index in physical
>> order, that seems totally unsafe, if there are any concurrent page
>> splits. In the logical scan, pushStackIfSplited() deals with that,
>> by comparing the page's NSN with the parent's LSN. But I don't see
>> anything like that in the physical scan code.
>
> Leaf page can be pointed by internal page and rightlink
> simultaneously. The purpose of NSN is to visit this page exactly once
> by following only one of the two links in a scan. This is achieved
> naturally if we read everything from the beginning to the end. (That
> is how I understand it; I can be wrong.)
The scenario where this fails goes like this:
1. Vacuum scans physical pages 1-10
2. A concurrent insertion splits page 15. The new left half stays on
page 15, but the new right half goes to page 5
3. Vacuum scans pages 11-20
Now, if there were any dead tuples on the right half of the split, moved
to page 5, the vacuum would miss them.
The way this is handled in B-tree is that when a page is split, the page
is stamped with current "vacuum cycle id". When the vacuum scan
encounters a page with the current cycle id, whose right-link points to
a lower-numbered page, it immediately follows the right link, and
re-scans it. I.e. in the above example, if it was a B-tree, in step 3
when vacuum scans page 15, it would see that it was concurrently split.
It would immediately vacuum page 5 again, before continuing the scan in
physical order.
We'll need to do something similar in GiST.
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-14 07:26:26 |
Message-ID: | B15F859E-D858-4DC4-800C-69B2CB3094BF@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 14 July 2018, at 0:28, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 13/07/18 21:28, Andrey Borodin wrote:
>>> On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>>> wrote:
>>> Looking at the second patch, to scan the GiST index in physical
>>> order, that seems totally unsafe, if there are any concurrent page
>>> splits. In the logical scan, pushStackIfSplited() deals with that,
>>> by comparing the page's NSN with the parent's LSN. But I don't see
>>> anything like that in the physical scan code.
>> A leaf page can be pointed to by an internal page and a rightlink
>> simultaneously. The purpose of NSN is to visit this page exactly once,
>> by following only one of the two links in a scan. This is achieved
>> naturally if we read everything from the beginning to the end. (That
>> is how I understand it; I can be wrong.)
>
> The scenario where this fails goes like this:
>
> 1. Vacuum scans physical pages 1-10
> 2. A concurrent insertion splits page 15. The new left half stays on page 15, but the new right half goes to page 5
> 3. Vacuum scans pages 11-20
>
> Now, if there were any dead tuples on the right half of the split, moved to page 5, the vacuum would miss them.
>
> The way this is handled in B-tree is that when a page is split, the page is stamped with current "vacuum cycle id". When the vacuum scan encounters a page with the current cycle id, whose right-link points to a lower-numbered page, it immediately follows the right link, and re-scans it. I.e. in the above example, if it was a B-tree, in step 3 when vacuum scans page 15, it would see that it was concurrently split. It would immediately vacuum page 5 again, before continuing the scan in physical order.
>
> We'll need to do something similar in GiST.
OK, I will do this.
This is a tradeoff between a complex concurrency feature and the possibility of a few dead tuples being left behind after VACUUM. I want to understand: is there something dangerous about these dead tuples?
There is one more serious race condition: the result of the first scan is just a hint about where to look for downlinks to empty pages. If an internal page splits between the scan and the cleanup, the offsets of the downlinks will change; the cleanup will lock the pages, see non-empty pages, and will not delete them (there are no dead tuples there, though, just empty leaves that were not deleted).
Best regards, Andrey Borodin.
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-14 10:39:56 |
Message-ID: | 638aa9cf-4e28-a1a4-d2f3-b6697d7736cc@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 14/07/18 10:26, Andrey Borodin wrote:
> This is a tradeoff between a complex concurrency feature and the
> possibility of a few dead tuples being left behind after VACUUM. I
> want to understand: is there something dangerous about these dead tuples?
Yeah, that's bad. When a new heap tuple is inserted in the same
location, the old index tuple will point to the new, unrelated, tuple,
and you will get incorrect query results.
> There is one more serious race condition: the result of the first
> scan is just a hint about where to look for downlinks to empty pages.
> If an internal page splits between the scan and the cleanup, the
> offsets of the downlinks will change; the cleanup will lock the pages,
> see non-empty pages, and will not delete them (there are no dead
> tuples there, though, just empty leaves that were not deleted).
That's fine. Leaving behind a few empty pages is harmless, the next
vacuum will pick them up.
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-14 17:43:26 |
Message-ID: | 3D7749F3-CFD0-479B-B792-DE133057D950@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 14 July 2018, at 14:39, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 14/07/18 10:26, Andrey Borodin wrote:
>> This is a tradeoff between a complex concurrency feature and the
>> possibility of a few dead tuples being left behind after VACUUM. I
>> want to understand: is there something dangerous about these dead tuples?
> Yeah, that's bad. When a new heap tuple is inserted in the same location, the old index tuple will point to the new, unrelated, tuple, and you will get incorrect query results.
PFA v8: at the beginning of the physical scan I grab GetInsertRecPtr(), and if a leaf page has a rightlink pointing to an earlier block in the file and an NSN higher than the LSN at the start of the scan, we go back to scan that page one more time. This is recursive.
I'm still not very comfortable with storing the deletion xid in the opaque data: we reuse the rightlink field under very specific conditions instead of using the totally unused pd_prune_xid. We have this remark in bufpage.h:
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
Or maybe we should use some other place on those empty 8kB pages.
Deleted pages do not have a rightlink now, but in B-trees the rightlink on deleted pages is actively used.
We also cannot reuse the NSN: the rightlink is useless without the NSN anyway. We cannot reuse the flags; the page is marked deleted in the flags, after all. The uint16 gist_page_id is just not big enough.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Physical-GiST-scan-during-VACUUM-v8.patch | application/octet-stream | 14.0 KB |
0001-Delete-pages-during-GiST-VACUUM-v8.patch | application/octet-stream | 18.8 KB |
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-16 14:58:58 |
Message-ID: | CA+TgmoZKrUDtWkerVNX+O-yTYoNj2bjUxR5GvmbaCK-TVnDqEw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I'm still a bit scared about using pd_prune_xid to store the XID that
> prevents recycling the page too early. Can we use some field in
> GISTPageOpaqueData for that, similar to how the B-tree stores it in
> BTPageOpaqueData?
What's your reason for being scared? It seems to me that the
alternatives being proposed (altering the size of the special space,
or sometimes repurposing a field for some other purpose) have their
own associated scariness.
If I had my druthers, I'd nuke pd_prune_xid from orbit -- it's a piece
of seemingly heap-specific data that is kept in the generic page
header rather than in the heap's special space. Other AMs like btree
or zheap may have different needs; one size does not fit all. That
said, since getting rid of pd_prune_xid seems impractical, maybe it
makes more sense to reuse it than to insist on leaving it idle and
consuming space elsewhere in the page.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-16 17:24:32 |
Message-ID: | 0E0DF52D-8E01-4A37-A9C5-3B9A7E6E53DD@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 16 July 2018, at 18:58, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> I'm still a bit scared about using pd_prune_xid to store the XID that
>> prevents recycling the page too early. Can we use some field in
>> GISTPageOpaqueData for that, similar to how the B-tree stores it in
>> BTPageOpaqueData?
>
> What's your reason for being scared? It seems to me that the
> alternatives being proposed (altering the size of the special space,
> or sometimes repurposing a field for some other purpose) have their
> own associated scariness.
Thanks, that's exactly what I've been thinking about regarding where to store this xid.
Here's v9 of the patch. It uses pd_prune_xid, but access is abstracted behind GistPageGetDeleteXid() / GistPageSetDeleteXid() so that we can easily change the way we store it.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Physical-GiST-scan-during-VACUUM-v9.patch | application/octet-stream | 14.0 KB |
0001-Delete-pages-during-GiST-VACUUM-v9.patch | application/octet-stream | 18.3 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-17 18:41:03 |
Message-ID: | 2DA06ECF-EB35-49B3-9F8D-4BB1800B755D@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
> On 16 July 2018, at 21:24, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
I was checking WAL replay of the new scheme to log page deletes and found a bug there (an incorrect value of the deleted downlink in the WAL record). Here's the fixed patch, v10.
Also, I've added support for WAL identification of the new record and made some improvements to comments and naming in the data structures.
Thanks!
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Physical-GiST-scan-during-VACUUM-v10.patch | application/octet-stream | 13.9 KB |
0001-Delete-pages-during-GiST-VACUUM-v10.patch | application/octet-stream | 18.7 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-18 12:02:05 |
Message-ID: | b015d1dc-67d7-0b10-c5ab-0ae3f20f34eb@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 17/07/18 21:41, Andrey Borodin wrote:
> I was checking WAL replay of the new scheme to log page deletes and
> found a bug there (an incorrect value of the deleted downlink in the
> WAL record). Here's the fixed patch, v10.
>
> Also, I've added support for WAL identification of the new record and
> made some improvements to comments and naming in the data structures.
Thanks!
> + /*
> + * If this page was splitted after start of the VACUUM we have to
> + * revisit rightlink, if it points to block we already scanned.
> + * This is recursive revisit, should not be deep, but we check
> + * the possibility of stack overflow anyway.
> + */
> + if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
> + (opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < blkno))
> + {
> + gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, opaque->rightlink, startNSN, graph);
> + }
In the corresponding B-tree code, we don't do actual recursion, but a
hand-optimized "tail recursion", to avoid stack overflow if there are
a lot of splits. I think we need to do something like that here, too. I
don't think it's safe to assume that we have enough stack space for the
recursion. You have a check_stack_depth() in the function, so you'll get
an ERROR, but it sucks if VACUUM errors out because of that.
It's not cool to use up to 'maintenance_work_mem' of memory for holding
the in-memory graph. VACUUM already uses up to that much memory to hold
the list of dead TIDs, so we would use up to 2x maintenance_work_mem in
total.
The only reason we still need the logical scan is to support page
deletion, when there is not enough memory available. Can we get rid of
that altogether? I think I'd prefer some other scheme to deal with that
situation. For example, we could make note, in memory, of the empty
pages during the physical scan, and perform a second physical scan of
the index to find their downlinks. Two full-index scans is not nice, but
it's actually not that bad if it's done in physical order. And you could
have some heuristics, e.g. only do it if more than 10% of the pages were
deletable.
Sorry to ask you to rewrite this again, but I think it would be better
to split this into two patches as follows:
1st patch: Scan the index in physical rather than logical order. No
attempt at deleting empty pages yet.
2nd patch: Add support for deleting empty pages.
I would be more comfortable reviewing and committing that first patch,
which just switches to doing physical-order scan, first.
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-18 18:27:48 |
Message-ID: | ADA3A0B7-A209-48E5-8942-BA913C3FD31D@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
> On 18 July 2018, at 16:02, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> In the corresponding B-tree code, we don't do actual recursion, but a hand-optimized "tail recursion", to avoid stack overflow if there are a lot of splits. I think we need to do something like that here, too. I don't think it's safe to assume that we have enough stack space for the recursion. You have a check_stack_depth() in the function, so you'll get an ERROR, but it sucks if VACUUM errors out because of that.
Ok, will do that.
>
>
> It's not cool to use up to 'maintenance_work_mem' of memory for holding the in-memory graph. VACUUM already uses up to that much memory to hold the list of dead TIDs, so we would use up to 2x maintenance_work_mem in total.
>
> The only reason we still need the logical scan is to support page deletion, when there is not enough memory available. Can we get rid of that altogether? I think I'd prefer some other scheme to deal with that situation. For example, we could make note, in memory, of the empty pages during the physical scan, and perform a second physical scan of the index to find their downlinks. Two full-index scans is not nice, but it's actually not that bad if it's done in physical order.
I think this is a good idea. I will implement this.
> And you could have some heuristics, e.g. only do it if more than 10% of the pages were deletable.
>
> Sorry to ask you to rewrite this again
Piece of cake :)
> , but I think it would be better to split this into two patches as follows:
>
> 1st patch: Scan the index in physical rather than logical order. No attempt at deleting empty pages yet.
>
> 2nd patch: Add support for deleting empty pages.
>
> I would be more comfortable reviewing and committing that first patch, which just switches to doing physical-order scan, first.
This seems like a very disproportionate division of complexity. The first patch (PFA) is very simple. All the work is done in one cycle, without memorizing anything. Actually, you do not even need to rescan rightlinks: there can be no splits to the left when no pages are deleted.
If you think this is the proper way to go - OK, I'll prepare a better version of the attached diff (omitting the tail recursion and adding more comments).
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
gist_phisical_vacuum_v1.diff | application/octet-stream | 7.6 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-18 21:12:15 |
Message-ID: | 66dffd3a-b9d5-65cf-15bf-615a83b44301@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 18/07/18 21:27, Andrey Borodin wrote:
> Hi!
>
>> On 18 July 2018, at 16:02, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>> wrote:
>>
>> , but I think it would be better to split this into two patches as
>> follows:
>>
>> 1st patch: Scan the index in physical rather than logical order. No
>> attempt at deleting empty pages yet.
>>
>> 2nd patch: Add support for deleting empty pages.
>>
>> I would be more comfortable reviewing and committing that first
>> patch, which just switches to doing physical-order scan, first.
>
> This seems like a very disproportionate division of complexity. The
> first patch (PFA) is very simple. All the work is done in one cycle,
> without memorizing anything. Actually, you do not even need to rescan
> rightlinks: there can be no splits to the left when no pages are
> deleted.
Heh, good point.
I googled around and bumped into an older patch to do this:
/message-id/1135121410099068%40web30j.yandex.ru.
Unfortunately, Костя never got around to update the patch, and it was
forgotten. But the idea seemed sound even back then.
As noted in that thread, there might be deleted pages in the index in
some rare circumstances, even though we don't currently recycle empty
pages: if the index was upgraded from a very old version (VACUUM FULL
used to recycle empty pages), or if you crash just when extending the
index and end up with newly-initialized but unused pages that way. So we
do need to handle the concurrent split scenario, even without empty page recycling.
> If you think this is the proper way to go - OK, I'll prepare a
> better version of the attached diff (omitting the tail recursion and
> adding more comments).
Yeah, please, I think this is the way to go.
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-19 10:52:34 |
Message-ID: | 5A3FD588-BC51-4FC3-B9C3-19453F824C68@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
> On 19 July 2018, at 1:12, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> Yeah, please, I think this is the way to go.
Here's v11 divided into proposed steps.
In the second step I still use palloc'd memory, but only to store two bitmaps: a bitmap of internal pages and a bitmap of empty leaves. The second physical scan only reads internal pages. I can omit that bitmap if I scan everything.
Also, I can replace the emptyLeafs bitmap with an array/list, but I do not really think it will be big.
Anyway, I propose focusing on the first step.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Delete-pages-during-GiST-VACUUM-v11.patch | application/octet-stream | 19.5 KB |
0001-Physical-GiST-scan-in-VACUUM-v11.patch | application/octet-stream | 8.4 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-19 11:20:51 |
Message-ID: | 77c7023c-18d4-8f04-08ed-a5c5633046ea@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 19/07/18 13:52, Andrey Borodin wrote:
> Hi!
>
>> On 19 July 2018, at 1:12, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>> wrote:
>>
>> Yeah, please, I think this is the way to go.
>
> Here's v11 divided into proposed steps.
Thanks, one quick question:
> /* We should not unlock buffer if we are going to jump left */
> if (needScan)
> {
> GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
> ptr->buffer = buffer;
> ptr->next = bufferStack;
> bufferStack = ptr;
> }
> else
> UnlockReleaseBuffer(buffer);
Why? I don't see any need to keep the page locked, when we "jump left".
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-19 11:42:06 |
Message-ID: | 927581532000526@myt5-31f0319c46b9.qloud-c.yandex.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
19.07.2018, 15:20, "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>:
> On 19/07/18 13:52, Andrey Borodin wrote:
>> Here's v11 divided into proposed steps.
>
> Thanks, one quick question:
>
>> /* We should not unlock buffer if we are going to jump left */
>> if (needScan)
>> {
>> GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
>> ptr->buffer = buffer;
>> ptr->next = bufferStack;
>> bufferStack = ptr;
>> }
>> else
>> UnlockReleaseBuffer(buffer);
>
> Why? I don't see any need to keep the page locked, when we "jump left".
Because it can split to the left again, given that we release lock.
Best regards, Andrey Borodin.
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-19 12:28:05 |
Message-ID: | 266d7c7e-7431-b6dc-e4af-7ec9f08b5e52@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 19/07/18 14:42, Andrey Borodin wrote:
>
> 19.07.2018, 15:20, "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>:
>>
>> On 19/07/18 13:52, Andrey Borodin wrote:
>>
>> Hi!
>>
>> On 19 July 2018, at 1:12, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>> wrote:
>>
>> Yeah, please, I think this is the way to go.
>>
>>
>> Here's v11 divided into proposed steps.
>>
>>
>> Thanks, one quick question:
>>
>> /* We should not unlock buffer if we are going to
>> jump left */
>> if (needScan)
>> {
>> GistBDItem *ptr = (GistBDItem *)
>> palloc(sizeof(GistBDItem));
>> ptr->buffer = buffer;
>> ptr->next = bufferStack;
>> bufferStack = ptr;
>> }
>> else
>> UnlockReleaseBuffer(buffer);
>>
>>
>> Why? I don't see any need to keep the page locked, when we "jump left".
>>
> Because it can split to the left again, given that we release lock.
Hmm. So, while we are scanning the right sibling, which was moved to a
lower-numbered block because of a concurrent split, the original page is
split again?
original page, before we recurse to deal with the right sibling. (The
corresponding B-tree code also releases the lock on the original page
when recursing)
I did some refactoring, to bring this closer to the B-tree code, for the
sake of consistency. See attached patch. This also eliminates the 2nd
pass by gistvacuumcleanup(), in case we did that in the bulkdelete-phase
already.
There was one crucial thing missing: in the outer loop, we must ensure
that we scan all pages, even those that were added after the vacuum
started. There's a comment explaining that in btvacuumscan(). This
version fixes that.
I haven't done any testing on this. Do you have any test scripts you
could share? I think we need some repeatable tests for the concurrent
split cases. Even if it involves gdb or some other hacks that we can't
include in the regression test suite, we need something now, while we're
hacking on this.
One subtle point, that I think is OK, but gave me a pause, and probably
deserves comment somewhere: A concurrent root split can turn a leaf page
into one internal (root) page, and two new leaf pages. The new root page
is placed in the same block as the old page, while both new leaf pages
go to freshly allocated blocks. If that happens while vacuum is running,
might we miss the new leaf pages? As the code stands, we don't do the
"follow-right" dance on internal pages, so we would not recurse into the
new leaf pages. At first, I thought that's a problem, but I think we can
get away with it. The only scenario where a root split happens on a leaf
page, is when the index has exactly one page, a single leaf page. Any
subsequent root splits will split an internal page rather than a leaf
page, and we're not bothered by those. In the case that a root split
happens on a single-page index, we're OK, because we will always scan
that page either before, or after the split. If we scan the single page
before the split, we see all the leaf tuples on that page. If we scan
the single page after the split, it means that we start the scan after
the split, and we will see both leaf pages as we continue the scan.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0001-Physical-GiST-scan-in-VACUUM-v12.patch | text/x-patch | 15.5 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-19 18:26:05 |
Message-ID: | 479336FE-B713-4DFB-9661-9BFCCA70F4EF@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 19 July 2018, at 16:28, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Hmm. So, while we are scanning the right sibling, which was moved to lower-numbered block because of a concurrent split, the original page is split again? That's OK, we've already scanned all the tuples on the original page, before we recurse to deal with the right sibling. (The corresponding B-tree code also releases the lock on the original page when recursing)
Seems right.
>
> I did some refactoring, to bring this closer to the B-tree code, for the sake of consistency. See attached patch. This also eliminates the 2nd pass by gistvacuumcleanup(), in case we did that in the bulkdelete-phase already.
Thanks!
>
> There was one crucial thing missing: in the outer loop, we must ensure that we scan all pages, even those that were added after the vacuum started.
Correct. There is quite neat logic behind the order of acquiring npages, comparing, and vacuuming pages. The notes in the FIXMEs look correct, except for the function names.
> There's a comment explaining that in btvacuumscan(). This version fixes that.
>
> I haven't done any testing on this. Do you have any test scripts you could share?
I use just simple tests that set up replication and do random inserts and vacuums. Not rocket science, just a mutated script:
for i in $(seq 1 12); do
size=$((100 * 2**$i))
./psql postgres -c "create table x as select cube(random()) c from generate_series(1,$size) y; create index on x using gist(c);"
./psql postgres -c "delete from x;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "drop table x;"
./psql postgres -c "create table x as select cube(random()) c from generate_series(1,$size) y; create index on x using gist(c);"
./psql postgres -c "delete from x where (c~>1)>0.1;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "insert into x select cube(random()) c from generate_series(1,$size) y;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "delete from x where (c~>1)>0.1;"
./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
./psql postgres -c "VACUUM FULL x;"
./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
./psql postgres -c "drop table x;"
done
> I think we need some repeatable tests for the concurrent split cases.
It is hard to trigger left splits until we delete pages. I'll try to hack gistNewBuffer() to cause something similar.
> Even if it involves gdb or some other hacks that we can't include in the regression test suite, we need something now, while we're hacking on this.
>
> One subtle point, that I think is OK, but gave me a pause, and probably deserves comment somewhere: A concurrent root split can turn a leaf page into one internal (root) page, and two new leaf pages. The new root page is placed in the same block as the old page, while both new leaf pages go to freshly allocated blocks. If that happens while vacuum is running, might we miss the new leaf pages? As the code stands, we don't do the "follow-right" dance on internal pages, so we would not recurse into the new leaf pages. At first, I thought that's a problem, but I think we can get away with it. The only scenario where a root split happens on a leaf page, is when the index has exactly one page, a single leaf page. Any subsequent root splits will split an internal page rather than a leaf page, and we're not bothered by those. In the case that a root split happens on a single-page index, we're OK, because we will always scan that page either before, or after the split. If we scan the single page before the split, we see all the leaf tuples on that page. If we scan the single page after the split, it means that we start the scan after the split, and we will see both leaf pages as we continue the scan.
Yes, only page 0 may change its type, and page 0 cannot split to the left.
I'm working on triggering a left split during vacuum. Will get back when done. Thanks!
Best regards, Andrey Borodin.
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-21 12:11:34 |
Message-ID: | 51675201-3F1C-4796-ADB4-28D0B35E9CF0@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
> On 19 Jul 2018, at 23:26, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
> I'm working on triggering left split during vacuum. Will get back when done. Thanks!
Here's a patch, including some messy hacks to trigger NSN and FollowRight jumps during VACUUM.
To trigger FollowRight, GiST sometimes forgets to clear the follow-right marker, simulating a crash of an insert. This fills the logs with "fixing incomplete split" messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
To trigger an NSN jump, GiST allocates an empty page after every real allocation.
gistvacuumcleanup() was constantly generating left jumps because 0 was used instead of the real start NSN, so I moved NSN acquisition to gistvacuumscan(). Also fixed some comments.
gistvacuumcleanup() will have the same effect as gistbulkdelete(); is that OK?
To reproduce left jumps, run ./rescantest.sh
The script contains variables for my local paths.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0001-Physical-GiST-scan-in-VACUUM-v13.patch | application/octet-stream | 17.8 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-23 18:04:57 |
Message-ID: | 651572F0-65C3-4285-85FC-50E81B1199F1@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
> On 21 Jul 2018, at 17:11, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>
Just in case, here's the second part of the patch series, with the actual page deletion.
I was considering further decreasing the memory footprint by using Bloom filters instead of bitmaps, but that would create significantly more work for the CPU to compute hashes.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Delete-pages-during-GiST-VACUUM-v13.patch | application/octet-stream | 22.2 KB |
0001-Physical-GiST-scan-in-VACUUM-v13.patch | application/octet-stream | 17.8 KB |
From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-29 09:04:35 |
Message-ID: | CAEepm=1vznVTZf01f4KNVJEa4m4tCgtBuumgmY8q=g+S=Gdt5A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>> On 21 Jul 2018, at 17:11, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>
>
> Just in case, here's second part of patch series with actual page deletion.
Hi Andrey,
Cfbot says:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146
That's because you declared a new variable after some other
statements. You can't do that in old school C89.
https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951
That segfaulted here:
#0 0x00000000004ab620 in setinternal (state=<optimized out>,
state=<optimized out>, blkno=368) at gistvacuum.c:44
44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);
internalPagesMap was NULL, or pointed to memory that was too small and
happened to be near an unmapped region (unlikely?), or was some other
corrupted address?
--
Thomas Munro
http://www.enterprisedb.com
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-29 11:47:29 |
Message-ID: | 2AF4B7F1-ABCF-4602-8AF8-06EA45191BFC@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
Thank you!
> On 29 Jul 2018, at 14:04, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>>> On 21 Jul 2018, at 17:11, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>>> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>
>>
>> Just in case, here's second part of patch series with actual page deletion.
>
> Hi Andrey,
>
> Cfbot says:
>
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146
>
> That's because you declared a new variable after some other
> statements. You can't do that in old school C89.
Yep, I mismerged the patch steps and created a misplaced declaration.
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951
>
> That segfaulted here:
>
> #0 0x00000000004ab620 in setinternal (state=<optimized out>,
> state=<optimized out>, blkno=368) at gistvacuum.c:44
> 44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);
>
> internalPagesMap was NULL, or pointed to memory that was too small and
> happened to be near an unmapped region (unlikely?), or was some other
> corrupted address?
Yes, there was a conditionally uninitialized variable, mapNumPages. I do not know why it didn't trigger a failure last time, but now it reproduced reliably on my machine.
Fixed both problems. PFA v14.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Delete-pages-during-GiST-VACUUM-v14.patch | application/octet-stream | 23.8 KB |
0001-Physical-GiST-scan-in-VACUUM-v14.patch | application/octet-stream | 16.6 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-30 15:39:54 |
Message-ID: | 844f3373-486e-a34d-d5a1-6a048e19355c@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 29/07/18 14:47, Andrey Borodin wrote:
> Fixed both problems. PFA v14.
Thanks, took a really quick look at this.
The text being added to README is outdated for these latest changes.
> In second step I still use paloc's memory, but only to store two
> bitmaps: bitmap of internal pages and bitmap of empty leafs. Second
> physical scan only reads internal pages. I can omit that bitmap, if
> I'll scan everything. Also, I can replace emptyLeafs bitmap with
> array\list, but I do not really think it will be big.
On a typical GiST index, what's the ratio of leaf vs. internal pages?
Perhaps an array would indeed be better. If you have a really large
index, the bitmaps can take a fair amount of memory, on top of the
memory used for tracking the dead TIDs. I.e. that memory will be in
addition to maintenance_work_mem. That's not nice, but I think it's OK
in practice, and not worth spending too much effort to eliminate. For a
1 TB index with default 8k block size, the two bitmaps will take 32 MB
of memory in total. If you're dealing with a database of that size, you
ought to have some memory to spare. But if an array would use less
memory, that'd be better.
If you go with bitmaps, please use the existing Bitmapset instead of
rolling your own. Saves some code, and it provides more optimized
routines for iterating through all the set bits, too
(bms_next_member()). Another possibility would be to use Tidbitmap, in
the "lossy" mode, i.e. add the pages with tbm_add_page(). That might
save some memory, compared to Bitmapset, if the bitmap is very sparse.
Not sure how it compares with a plain array.
A straightforward little optimization would be to skip scanning the
internal pages, when the first scan didn't find any empty pages. And
stop the scan of the internal pages as soon as all the empty pages have
been recycled.
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-07-31 20:06:45 |
Message-ID: | 4A1FF988-A69D-4607-8A6E-B3EC0FD6C2CF@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi! Thanks for looking into the patch!
> On 30 Jul 2018, at 18:39, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 29/07/18 14:47, Andrey Borodin wrote:
>> Fixed both problems. PFA v14.
>
> Thanks, took a really quick look at this.
>
> The text being added to README is outdated for these latest changes.
Fixed.
>
>> In second step I still use paloc's memory, but only to store two
>> bitmaps: bitmap of internal pages and bitmap of empty leafs. Second
>> physical scan only reads internal pages. I can omit that bitmap, if
>> I'll scan everything. Also, I can replace emptyLeafs bitmap with
>> array\list, but I do not really think it will be big.
>
> On a typical GiST index, what's the ratio of leaf vs. internal pages? Perhaps an array would indeed be better.
A typical GiST has around 200 tuples per internal page. I've switched to a List since it's more efficient than a bitmap.
> If you have a really large index, the bitmaps can take a fair amount of memory, on top of the memory used for tracking the dead TIDs. I.e. that memory will be in addition to maintenance_work_mem. That's not nice, but I think it's OK in practice, and not worth spending too much effort to eliminate. For a 1 TB index with default 8k block size, the two bitmaps will take 32 MB of memory in total. If you're dealing with a database of that size, you ought to have some memory to spare. But if an array would use less memory, that'd be better.
>
> If you go with bitmaps, please use the existing Bitmapset instead of rolling your own. Saves some code, and it provides more optimized routines for iterating through all the set bits, too (bms_next_member()). Another possibility would be to use Tidbitmap, in the "lossy" mode, i.e. add the pages with tbm_add_page(). That might save some memory, compared to Bitmapset, if the bitmap is very sparse. Not sure how it compares with a plain array.
Yeah, I've stopped reinventing that bicycle. But I have to note that the default growth strategy of Bitmapset is not good: we will be repallocing byte by byte.
>
> A straightforward little optimization would be to skip scanning the internal pages, when the first scan didn't find any empty pages. And stop the scan of the internal pages as soon as all the empty pages have been recycled.
Done.
PFA v15.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Delete-pages-during-GiST-VACUUM-v15.patch | application/octet-stream | 20.7 KB |
0001-Physical-GiST-scan-in-VACUUM-v15.patch | application/octet-stream | 16.6 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-08-05 11:18:59 |
Message-ID: | 43f15ab5-dd95-97d4-658b-e9954ca986a6@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 31/07/18 23:06, Andrey Borodin wrote:
>> On a typical GiST index, what's the ratio of leaf vs. internal
>> pages? Perhaps an array would indeed be better.
>
> Typical GiST has around 200 tuples per internal page. I've switched
> to List since it's more efficient than bitmap.
Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
bytes per internal page in total, while a bitmap consumes one bit per
page, leaf or internal. If I'm doing
my math right, assuming the ratio of leaf pages vs internal pages is
1:200, a List actually consumes more memory than a bitmap; 256 bits per
internal page, vs. 200 bits. An array, with 4 bytes per block number,
would be the winner here.
> But I have to note that default growth strategy of Bitmap is not
> good: we will be repallocing byte by byte.
True, that repallocing seems bad. You could force it to be pre-allocated
by setting the last bit. Or add a function to explicitly enlarge the bitmap.
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-08-05 16:45:36 |
Message-ID: | DABC2BB9-154A-47E2-B1D9-B58A4023E6D7@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
> On 5 Aug 2018, at 16:18, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 31/07/18 23:06, Andrey Borodin wrote:
>>> On a typical GiST index, what's the ratio of leaf vs. internal
>>> pages? Perhaps an array would indeed be better.
> >
>> Typical GiST has around 200 tuples per internal page. I've switched
>> to List since it's more efficient than bitmap.
>
> Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
> bytes per internal page in total, while a bitmap consumes one bit per page, leaf or internal. If I'm doing
> my math right, assuming the ratio of leaf pages vs internal pages is
> 1:200, a List actually consumes more memory than a bitmap; 256 bits per
> internal page, vs. 200 bits. An array, with 4 bytes per block number,
> would be the winner here.
We do not know the size of this array beforehand. I can implement a normal ArrayList, though (with a repallocing array), or a linked list of chunks. Or anything else from the data-structures zoo.
Or just stick with a bitmap (my preferred way).
>
>> But I have to note that the default growth strategy of Bitmapset is not good: we will be repallocing byte by byte.
>
> True, that repallocing seems bad. You could force it to be pre-allocated
> by setting the last bit. Or add a function to explicitly enlarge the bitmap.
OK, I'll think of a proper resize function (ensure capacity, to be precise).
Best regards, Andrey Borodin.
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-08-06 18:12:00 |
Message-ID: | E8FDC7F0-0516-4A84-B754-4CC92801E5F5@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
PFA v16.
> On 5 Aug 2018, at 21:45, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>> On 5 Aug 2018, at 16:18, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>
>> Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
>> bytes per internal page in total, while a bitmap consumes one bit per page, leaf or internal. If I'm doing
>> my math right, assuming the ratio of leaf pages vs internal pages is
>> 1:200, a List actually consumes more memory than a bitmap; 256 bits per
>> internal page, vs. 200 bits. An array, with 4 bytes per block number,
>> would be the winner here.
> We do not know the size of this array beforehand. I can implement a normal ArrayList, though (with a repallocing array), or a linked list of chunks. Or anything else from the data-structures zoo.
> Or just stick with a bitmap (my preferred way).
Done.
>>
>>> But I have to note that the default growth strategy of Bitmapset is not good: we will be repallocing byte by byte.
>>
>> True, that repallocing seems bad. You could force it to be pre-allocated
>> by setting the last bit. Or add a function to explicitly enlarge the bitmap.
> OK, I'll think of a proper resize function (ensure capacity, to be precise).
Done. Added a function, bms_make_empty(int size).
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Delete-pages-during-GiST-VACUUM-v16.patch | application/octet-stream | 22.9 KB |
0001-Physical-GiST-scan-in-VACUUM-v16.patch | application/octet-stream | 16.6 KB |
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-10-02 01:14:04 |
Message-ID: | 20181002011404.GU11712@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Aug 06, 2018 at 11:12:00PM +0500, Andrey Borodin wrote:
> Done. Added a function, bms_make_empty(int size).
Andrey, your latest patch does not apply. I am moving this to the next
CF, waiting for your input.
--
Michael
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2018-10-28 17:32:33 |
Message-ID: | C1D00194-9A4D-4DE1-AAFE-6DA478A15207@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi everyone!
> On 2 Oct 2018, at 6:14, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> Andrey, your latest patch does not apply. I am moving this to the next
> CF, waiting for your input.
I'm doing preps for CF.
Here's rebased version.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Delete-pages-during-GiST-VACUUM-v17.patch | application/octet-stream | 21.9 KB |
0001-Physical-GiST-scan-in-VACUUM-v17.patch | application/octet-stream | 15.3 KB |
unknown_filename | text/plain | 3 bytes |
From: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com> |
---|---|
To: | x4mmm(at)yandex-team(dot)ru, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, chapaev28(at)ya(dot)ru |
Subject: | Re: GiST VACUUM |
Date: | 2018-11-30 12:21:13 |
Message-ID: | CA+q6zcWLUbvE7X=u43mowx551ot_oVeweKU_j62fDBuATLxX0A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On Sun, Oct 28, 2018 at 6:32 PM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
> Hi everyone!
>
> > On 2 Oct 2018, at 6:14, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > Andrey, your latest patch does not apply. I am moving this to the next
> > CF, waiting for your input.
>
> I'm doing preps for CF.
> Here's rebased version.
Looks like this patch has been waiting for input since August. Could any of the
reviewers (Heikki?) please take a look at the latest version? In the meantime
I'm moving it to the next CF.
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-01-02 15:35:15 |
Message-ID: | d7e89e1b-efd4-1464-f03d-4ba4ea9e1835@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 28/10/2018 19:32, Andrey Borodin wrote:
> Hi everyone!
>
>> On 2 Oct 2018, at 6:14, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> Andrey, your latest patch does not apply. I am moving this to the next
>> CF, waiting for your input.
>
> I'm doing preps for CF.
> Here's rebased version.
Thanks, I had another look at these.
In patch #1, to do the vacuum scan in physical order:
* The starting NSN was not acquired correctly for unlogged and temp
relations. They don't use WAL, so their NSN values are based on the
'unloggedLSN' counter, rather than the current WAL insert pointer. So we must
use gistGetFakeLSN() rather than GetInsertRecPtr() for them. Fixed that.
* Adjusted comments a little bit, mostly by copy-pasting the
better-worded comments from the corresponding nbtree code, and ran pgindent.
I think this is ready to be committed, except that I didn't do any
testing. We discussed the need for testing earlier. Did you write some
test scripts for this, or how have you been testing?
Patch #2:
* Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned.
So this will fail with an index larger than 2^31 blocks. That's perhaps
academical, I don't think anyone will try to create a 16 TB GiST index
any time soon. But it feels wrong to introduce an arbitrary limitation
like that.
* I was surprised that the bms_make_empty() function doesn't set the
'nwords' field to match the allocated size. Perhaps that was
intentional, so that you don't need to scan the empty region at the end,
when you scan through all matching bits? Still, seems surprising, and
needs a comment at least.
* We're now scanning all internal pages in the 2nd phase. Not only those
internal pages that contained downlinks to empty leaf pages. That's
probably OK, but the comments need updating on that.
* In general, comments need to be fixed and added in several places. For
example, there's no clear explanation on what the "delete XID", stored
in pd_prune_xid, means. (I know what it is, because I'm familiar with
the same mechanism in B-tree, but it's not clear from the code itself.)
These can be fixed, they're not show-stoppers, but patch #2 isn't quite
ready yet.
- Heikki
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-01-02 15:36:33 |
Message-ID: | 9f11fd6c-aa0f-42e9-4ae2-b5cc8c148658@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 02/01/2019 17:35, Heikki Linnakangas wrote:
> On 28/10/2018 19:32, Andrey Borodin wrote:
>> Hi everyone!
>>
>>> On 2 Oct 2018, at 6:14, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>>> Andrey, your latest patch does not apply. I am moving this to the next
>>> CF, waiting for your input.
>>
>> I'm doing preps for CF.
>> Here's rebased version.
>
> Thanks, I had another look at these.
Here are new patch versions, with the fixes I mentioned. Forgot to
attach these earlier.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0001-Physical-GiST-scan-in-VACUUM-v18-heikki.patch | text/x-patch | 17.0 KB |
0002-Delete-pages-during-GiST-VACUUM-v18-heikki.patch | text/x-patch | 18.9 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-01-03 18:47:13 |
Message-ID: | EB87A69B-EE5E-4259-9EEB-DA9DC1F7E265@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Cool, thanks!
> On 2 Jan 2019, at 20:35, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> In patch #1, to do the vacuum scan in physical order:
> ...
> I think this is ready to be committed, except that I didn't do any testing. We discussed the need for testing earlier. Did you write some test scripts for this, or how have you been testing?
Please see the test I used to check left jumps for v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch
Attachment | Content-Type | Size |
---|---|---|
0002-Test-left-jumps-v18.patch | application/octet-stream | 3.2 KB |
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch | application/octet-stream | 17.0 KB |
unknown_filename | text/plain | 2.7 KB |
0002-Delete-pages-during-GiST-VACUUM-v19.patch | application/octet-stream | 19.1 KB |
0001-Physical-GiST-scan-in-VACUUM-v19.patch | application/octet-stream | 17.0 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-01-04 13:26:18 |
Message-ID: | A51F64E3-850D-4249-814E-54967103A859@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 3 Jan 2019, at 23:47, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
>> * Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. So this will fail with an index larger than 2^31 blocks. That's perhaps academical, I don't think anyone will try to create a 16 TB GiST index any time soon. But it feels wrong to introduce an arbitrary limitation like that.
> Looks like Bitmapset is unsuitable for collecting block numbers, due to its interface. Let's just create a custom container for this?
> Or there is one other option: for each block number, throw away the sign bit and consider a page potentially internal and potentially an empty leaf if (blkno & 0x7FFFFFFF) is in the Bitmapset.
> And then we will have to create at least one 17 TB GiST to check that it actually works.
Heikki, what do you think: is implementing our own radix tree a viable solution for this?
I've written a working implementation with a 4-level statically typed tree. If we follow this route, there should probably be tests for this data structure.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
radix_tree.diff | application/octet-stream | 6.9 KB |
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-02-04 02:13:26 |
Message-ID: | 20190204021326.GS1881@paquier.xyz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Jan 04, 2019 at 06:26:18PM +0500, Andrey Borodin wrote:
> Heikki, what do you think: is implementing our own radix tree a
> viable solution for this?
> I've written a working implementation with a 4-level statically typed
> tree. If we follow this route, there should probably be tests for
> this data structure.
(Note that the latest patch does not apply.)
Heikki, are you planning to answer the questions and/or review the
patch? I have moved it to next CF for now.
--
Michael
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-04 10:27:04 |
Message-ID: | ad2e8696-ba76-bf62-ebf7-bdef1dfeff6f@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 04/01/2019 21:26, Andrey Borodin wrote:
> On 3 Jan 2019, at 23:47, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
> wrote:
>>
>>> * Bitmapset stores 32 bit signed integers, but BlockNumber is
>>> unsigned. So this will fail with an index larger than 2^31
>>> blocks. That's perhaps academical, I don't think anyone will try
>>> to create a 16 TB GiST index any time soon. But it feels wrong to
>>> introduce an arbitrary limitation like that.
>>
>> Looks like Bitmapset is unsuitable for collecting block numbers, due
>> to its interface. Let's just create a custom container for this? Or
>> there is one other option: for each block number, throw away the sign
>> bit and consider a page potentially internal and potentially an empty
>> leaf if (blkno & 0x7FFFFFFF) is in the Bitmapset. And then we will
>> have to create at least one 17 TB GiST to check that it actually works.
>
> Heikki, what do you think: is implementing our own radix tree a
> viable solution for this? I've written a working implementation with
> a 4-level statically typed tree. If we follow this route, there
> should probably be tests for this data structure.
Yeah, seems reasonable.
- Heikki
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-04 13:04:37 |
Message-ID: | 96ec7ebd-42b9-4df5-18a4-42181c8a5a41@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 04/01/2019 02:47, Andrey Borodin wrote:
>> On 2 Jan 2019, at 20:35, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>
>> In patch #1, to do the vacuum scan in physical order:
>> ...
>> I think this is ready to be committed, except that I didn't do any testing. We discussed the need for testing earlier. Did you write some test scripts for this, or how have you been testing?
> Please see test I used to check left jumps for v18:
> 0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
> 0002-Test-left-jumps-v18.patch
>
> To trigger FollowRight, GiST sometimes forgets to clear the follow-right marker, simulating a crash of an insert. This fills the logs with "fixing incomplete split" messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
> To trigger an NSN jump, GiST allocates an empty page after every real allocation.
>
> In log output I see
> 2019-01-03 22:27:30.028 +05 [54596] WARNING: RESCAN TRIGGERED BY NSN
> WARNING: RESCAN TRIGGERED BY NSN
> 2019-01-03 22:27:30.104 +05 [54596] WARNING: RESCAN TRIGGERED BY FollowRight
> This means that code paths were really executed (for v18).
Thanks! As I noted at
/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi,
the test patch left the index corrupt. I fixed it so that it leaves
behind incompletely split pages, without the corruption; see attached.
It's similar to yours, but let me recap what it does:
* Hacks gistbuild() to create 100 empty pages immediately after the root
page. They are leaked, so they won't be reused until a VACUUM sees
them and puts them in the FSM
* Hacks gistinserttuples() to leave the split incomplete with 50%
probability
* In vacuum, print a line to the log whenever it needs to "jump left"
I used this patch, with the attached test script that's similar to
yours, but it also tries to verify that the index returns correct
results. It prints a result set like this:
sum
---------
-364450
364450
(2 rows)
If the two numbers add up to 0, the index seems valid. And you should
see "RESCAN" lines in the log, to show that jumping left happened.
Because the behavior is random and racy, you may need to run the script
many times to see both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED
BY FollowRight" cases. The "FollowRight" case in particular happens less
frequently than the NSN case; you might need to run the script more than
10 times to see it.
I also tried your amcheck tool with this. It did not report any errors.
Attached is also the latest version of the patch itself. It is the same as
your latest patch v19, except for some tiny comment kibitzing. I'll mark
this as Ready for Committer in the commitfest app, and will try to
commit it in the next couple of days.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
gist-vacuum-test.sh | application/x-shellscript | 1.7 KB |
Test-left-jumps-2.patch | text/x-patch | 2.6 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-04 18:26:27 |
Message-ID: | 1972C522-9019-4B9D-A6DF-3A5786688DB8@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
Thanks for fixing gist amcheck! The idea of using these two patches together seems so obvious now, but it never actually occurred to me before.
> 4 марта 2019 г., в 18:04, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
> Thanks! As I noted at /message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi, the test patch left the index corrupt. I fixed it so that it leaves behind incompletely split pages, without the corruption, see attached. It's similar to yours, but let me recap what it does:
>
> * Hacks gistbuild() to create 100 empty pages immediately after the root page. They are leaked, so they won't be reused until a VACUUM sees them and puts them in the FSM
>
> * Hacks gistinserttuples() to leave the split incomplete with 50% probability
>
> * In vacuum, print a line to the log whenever it needs to "jump left"
>
> I used this patch, with the attached test script that's similar to yours, but it also tries to verify that the index returns correct results. It prints a result set like this:
>
> sum
> ---------
> -364450
> 364450
> (2 rows)
>
> If the two numbers add up to 0, the index seems valid. And you should see "RESCAN" lines in the log, to show that jumping left happened. Because the behavior is random and racy, you may need to run the script many times to see both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED BY FollowRight" cases. The "FollowRight" case in particular happens less frequently than the NSN case; you might need to run the script more than 10 times to see it.
Great! I've repeated your tests on my machine; I observe similar frequencies for the causes of rescan left jumps.
> I also tried your amcheck tool with this. It did not report any errors.
>
> Attached is also the latest version of the patch itself. It is the same as your latest patch v19, except for some tiny comment kibitzing. I'll mark this as Ready for Committer in the commitfest app, and will try to commit it in the next couple of days.
That's cool! I'll work on the 2nd step of this patchset to make the blockset data structure prettier and less hacky.
Best regards, Andrey Borodin.
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-05 13:21:09 |
Message-ID: | f8151e66-61ae-4065-fc67-d3fec59f5d5b@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 05/03/2019 02:26, Andrey Borodin wrote:
>> I also tried your amcheck tool with this. It did not report any
>> errors.
>>
>> Attached is also latest version of the patch itself. It is the
>> same as your latest patch v19, except for some tiny comment
>> kibitzing. I'll mark this as Ready for Committer in the commitfest
>> app, and will try to commit it in the next couple of days.
>
> That's cool! I'll work on the 2nd step of this patchset to make
> the blockset data structure prettier and less hacky.
Committed the first patch. Thanks for the patch!
I did some last-minute copy-editing on the comments, and fixed one
little thing that we missed earlier: the check for "invalid tuples" that
might be left over in pg_upgraded pre-9.1 indexes was lost at some
point. I added that check back. (It would be nice if GiST indexes had a
metadata page with a version number, so we could avoid that work in the
99% of cases where that check is not needed, but that's a different story.)
I'll change the status of this patch to "Waiting on Author", to reflect
the state of the 2nd patch, since you're working on the radix tree
blockset stuff.
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-10 16:40:17 |
Message-ID: | FD68CBB1-716F-4344-9E3F-C27C14B8678B@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> 5 марта 2019 г., в 18:21, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
> On 05/03/2019 02:26, Andrey Borodin wrote:
>>
>> That's cool! I'll work on 2nd step of these patchset to make
>> blockset data structure prettier and less hacky.
>
> Committed the first patch. Thanks for the patch!
That's cool! Thanks!
> I'll change the status of this patch to "Waiting on Author", to reflect the state of the 2nd patch, since you're working on the radix tree blockset stuff.
Here's a new version of the patch for actual page deletion.
Changes:
1. Fixed a possible concurrency issue:
We were locking the child page while holding a lock on the internal page. Notes in the GiST README recommend locking the child before the parent.
So now we unlock the internal page before locking children for deletion, lock it again, check that everything is still in its place, and only then delete.
One thing still bothers me. Assume we have an internal page with 2 deletable leaves. We lock these leaves in the order of the items on the internal page. Is it possible that the 2nd page has a follow-right link to the 1st, and someone will lock the 2nd page, try to lock the 1st, and deadlock with VACUUM?
2. Added a radix-tree-based block set to lib, with tests.
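For illustration, a block set of this kind can be sketched as a lazily allocated two-level bitmap. This is a hypothetical simplification, not the patch's actual radix tree, and all names below are invented; it only shows the interface such a structure offers (mark a block, test a block) with memory proportional to the populated regions of the block space:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified block set: top level is an array of chunk
 * pointers, each chunk a bitmap covering 2^16 block numbers.  Chunks
 * are allocated only when a block in their range is first set. */
#define CHUNK_BITS   16
#define CHUNK_WORDS  ((1 << CHUNK_BITS) / 64)

typedef struct BlockSet
{
    uint64_t  **chunks;     /* lazily allocated bitmap chunks */
    uint32_t    nchunks;
} BlockSet;

static void
blockset_set(BlockSet *bs, uint32_t blkno)
{
    uint32_t chunk = blkno >> CHUNK_BITS;
    uint32_t off = blkno & ((1 << CHUNK_BITS) - 1);

    if (chunk >= bs->nchunks)
    {
        /* grow the top-level array, zero-filling new slots */
        bs->chunks = realloc(bs->chunks, (chunk + 1) * sizeof(uint64_t *));
        memset(bs->chunks + bs->nchunks, 0,
               (chunk + 1 - bs->nchunks) * sizeof(uint64_t *));
        bs->nchunks = chunk + 1;
    }
    if (bs->chunks[chunk] == NULL)
        bs->chunks[chunk] = calloc(CHUNK_WORDS, sizeof(uint64_t));
    bs->chunks[chunk][off / 64] |= UINT64_C(1) << (off % 64);
}

static int
blockset_get(const BlockSet *bs, uint32_t blkno)
{
    uint32_t chunk = blkno >> CHUNK_BITS;
    uint32_t off = blkno & ((1 << CHUNK_BITS) - 1);

    if (chunk >= bs->nchunks || bs->chunks[chunk] == NULL)
        return 0;
    return (bs->chunks[chunk][off / 64] >> (off % 64)) & 1;
}
```

The point of the two-level (or radix) layout is that VACUUM can remember "empty leaf" and "internal page" block numbers for a huge index while paying only for the chunks it actually touches.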
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0002-Delete-pages-during-GiST-VACUUM.patch | application/octet-stream | 20.2 KB |
0001-Add-block-set-data-structure.patch | application/octet-stream | 16.1 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-11 15:03:54 |
Message-ID: | 7e3f29b6-de18-6bd5-ab06-db2eaa180173@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 10/03/2019 18:40, Andrey Borodin wrote:
> Here's new version of the patch for actual page deletion.
> Changes:
>
> 1. Fixed possible concurrency issue:
>
> We were locking the child page while holding a lock on the internal
> page. Notes in the GiST README recommend locking the child before the
> parent. So now we unlock the internal page before locking children
> for deletion, lock it again, check that everything is still in its
> place, and only then delete.
Good catch. The implementation is a bit broken, though. This segfaults:
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
insert into gist_point_tbl (id, p)
select g, point((1+g) % 100, (1+g) % 100) from generate_series(1,
10000) g;
insert into gist_point_tbl (id, p)
select -g, point(1000, 1000) from generate_series(1, 10000) g;
delete from gist_point_tbl where id >= 0;
vacuum verbose gist_point_tbl;
BTW, we don't seem to have test coverage for this in the regression
tests. The tests that delete some rows, where you updated the comment,
don't actually seem to produce any empty pages that could be deleted.
> One thing still bothers me. Assume we have an internal page
> with 2 deletable leaves. We lock these leaves in the order of the
> items on the internal page. Is it possible that the 2nd page has a
> follow-right link to the 1st, and someone will lock the 2nd page,
> try to lock the 1st, and deadlock with VACUUM?
Hmm. If the follow-right flag is set on a page, it means that its right
sibling doesn't have a downlink in the parent yet. Nevertheless, I think
I'd sleep better if we acquired the locks in left-to-right order, to be
safe.
An easy cop-out would be to use LWLockConditionalAcquire, and bail out
if we can't get the lock. But if it's not too complicated, it'd be nice
to acquire the locks in the correct order to begin with.
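The cop-out can be illustrated with a toy lock standing in for LWLockConditionalAcquire (all names below are invented for the sketch): the second lock is only tried, never waited on, so a backend acquiring locks out of order can never block while holding a lock, and therefore cannot take part in a deadlock cycle.

```c
/* Toy illustration of the "conditional acquire" cop-out: a lock is
 * just a flag, and trylock fails instead of waiting.  If VACUUM takes
 * its page pair this way, it never blocks while holding a lock out of
 * order, so it cannot deadlock; it just retries the page later. */
typedef struct ToyLock { int held; } ToyLock;

static int
toy_trylock(ToyLock *l)
{
    if (l->held)
        return 0;
    l->held = 1;
    return 1;
}

static void
toy_unlock(ToyLock *l)
{
    l->held = 0;
}

/* Acquire the first lock, then conditionally acquire the second;
 * bail out, releasing everything, if the second lock is busy. */
static int
lock_pair_conditionally(ToyLock *first, ToyLock *second)
{
    if (!toy_trylock(first))
        return 0;
    if (!toy_trylock(second))
    {
        toy_unlock(first);      /* give up, caller will retry */
        return 0;
    }
    return 1;
}
```

The cost of this approach is that a busy lock postpones the deletion of that page to a later VACUUM, which is why acquiring the locks in the correct order to begin with is the nicer fix.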
I did a round of cleanup and moving things around, before I bumped into
the above issue. I'm including them here as separate patches, for easier
review, but they should of course be squashed into yours. For
convenience, I'm including your patches here as well, so that all the
patches you need are in the same place, but they are identical to what
you posted.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0001-Add-block-set-data-structure.patch | text/x-patch | 16.1 KB |
0002-Delete-pages-during-GiST-VACUUM.patch | text/x-patch | 20.2 KB |
0003-Minor-cleanup.patch | text/x-patch | 8.7 KB |
0004-Move-the-page-deletion-logic-to-separate-function.patch | text/x-patch | 18.8 KB |
0005-Remove-numEmptyPages-.-it-s-not-really-needed.patch | text/x-patch | 2.5 KB |
0006-Misc-cleanup.patch | text/x-patch | 3.4 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-15 18:25:27 |
Message-ID: | B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi!
> 11 марта 2019 г., в 20:03, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
> On 10/03/2019 18:40, Andrey Borodin wrote:
>> Here's new version of the patch for actual page deletion.
>> Changes:
>> 1. Fixed possible concurrency issue:
>> We were locking child page while holding lock on internal page. Notes
>> in GiST README recommend locking child before parent. Thus now we
>> unlock internal page before locking children for deletion, and lock
>> it again, check that everything is still on it's place and delete
>> only then.
>
> Good catch. The implementation is a bit broken, though. This segfaults:
Sorry for the noise. A few lines of old code leaked into the patch between testing and mailing.
> BTW, we don't seem to have test coverage for this in the regression tests. The tests that delete some rows, where you updated the comment, doesn't actually seem to produce any empty pages that could be deleted.
I've updated the test to delete more rows. But it did not trigger the previous bug anyway; we had to delete everything for that case.
>
>> One thing still bothers me. Let's assume that we have internal page
>> with 2 deletable leaves. We lock these leaves in order of items on
>> internal page. Is it possible that 2nd page have follow-right link on
>> 1st and someone will lock 2nd page, try to lock 1st and deadlock with
>> VACUUM?
>
> Hmm. If the follow-right flag is set on a page, it means that its right sibling doesn't have a downlink in the parent yet. Nevertheless, I think I'd sleep better, if we acquired the locks in left-to-right order, to be safe.
Actually, I did not find lock coupling in the GiST code. But I decided to lock just two pages at once (leaf, then parent, for every pair). PFA v22 with this concurrency logic.
>
> An easy cop-out would be to use LWLockConditionalAcquire, and bail out if we can't get the lock. But if it's not too complicated, it'd be nice to acquire the locks in the correct order to begin with.
>
> I did a round of cleanup and moving things around, before I bumped into the above issue. I'm including them here as separate patches, for easier review, but they should of course be squashed into yours. For convenience, I'm including your patches here as well, so that all the patches you need are in the same place, but they are identical to what you posted.
I've rebased all these patches so that VACUUM should work at every commit. Let's just squash them for the next iteration; it was quite a messy rebase.
BTW, you deleted numEmptyPages. That makes the code cleaner, but the variable could stop gistvacuum_recycle_pages() early, once everything is already recycled.
Thanks!
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
0001-Add-block-set-data-structure-v22.patch | application/octet-stream | 16.1 KB |
0002-Delete-pages-during-GiST-VACUUM-v22.patch | application/octet-stream | 19.8 KB |
0003-Minor-cleanup-v22.patch | application/octet-stream | 8.2 KB |
0004-Move-the-page-deletion-logic-to-separate-function-v2.patch | application/octet-stream | 15.3 KB |
0005-Remove-numEmptyPages-.-it-s-not-really-needed-v22.patch | application/octet-stream | 2.5 KB |
0006-Misc-cleanup-v22.patch | application/octet-stream | 1.6 KB |
From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-15 19:20:06 |
Message-ID: | CAMkU=1wOqJgjXMfO_R_rUjZ6dvba3x-xeCPBLYQZ1pTV1utLow@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, Mar 5, 2019 at 8:21 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 05/03/2019 02:26, Andrey Borodin wrote:
> >> I also tried your amcheck tool with this. It did not report any
> >> errors.
> >>
> >> Attached is also latest version of the patch itself. It is the
> >> same as your latest patch v19, except for some tiny comment
> >> kibitzing. I'll mark this as Ready for Committer in the commitfest
> >> app, and will try to commit it in the next couple of days.
> >
> > That's cool! I'll work on 2nd step of these patchset to make
> > blockset data structure prettier and less hacky.
>
> Committed the first patch. Thanks for the patch!
>
Thank you. This is a transformational change; it will allow GiST indexes
larger than RAM to be used in some cases where they were simply not
feasible to use before. On an HDD, it resulted in a 50-fold improvement in
vacuum time, and the machine went from unusably unresponsive to merely
sluggish during the vacuum. On an SSD (albeit a very cheap laptop one,
exposed from a Windows host to Ubuntu over VirtualBox) it is still a
30-fold improvement, from a far faster baseline. Even on an AWS instance
with a "GP2" SSD volume, which normally shows little benefit from
sequential reads, I get a 3-fold speedup.
I also ran this through a lot of crash-recovery testing using simulated
torn-page writes using my traditional testing harness with high concurrency
(AWS c4.4xlarge and a1.4xlarge using 32 concurrent update processes) and
did not encounter any problems. I tested both with btree_gist on a scalar
int, and on tsvector with each tsvector having 101 tokens.
I did notice that the space freed up in the index by vacuum doesn't seem to
get re-used very efficiently, but that is an ancestral problem independent
of this change.
Cheers,
Jeff
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-20 18:30:10 |
Message-ID: | 90464cde-5184-5f65-9f8f-3d5eafdb3b74@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 15/03/2019 20:25, Andrey Borodin wrote:
>> 11 марта 2019 г., в 20:03, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>> написал(а):
>>
>> On 10/03/2019 18:40, Andrey Borodin wrote:
>>> One thing still bothers me. Let's assume that we have internal
>>> page with 2 deletable leaves. We lock these leaves in order of
>>> items on internal page. Is it possible that 2nd page have
>>> follow-right link on 1st and someone will lock 2nd page, try to
>>> lock 1st and deadlock with VACUUM?
>>
>> Hmm. If the follow-right flag is set on a page, it means that its
>> right sibling doesn't have a downlink in the parent yet.
>> Nevertheless, I think I'd sleep better, if we acquired the locks in
>> left-to-right order, to be safe.
>
> Actually, I did not find lock coupling in the GiST code. But I decided
> to lock just two pages at once (leaf, then parent, for every pair).
> PFA v22 with this concurrency logic.
Good. I just noticed that the README actually does say explicitly that
the child must be locked before the parent.
I rebased this over the new IntegerSet implementation, from the other
thread, and did another round of refactoring, cleanups, etc. Attached is
a new version of this patch. I'm also including the IntegerSet patch
here, for convenience, but it's the same patch I posted at [1].
It's in pretty good shape, but one remaining issue needs to be fixed:
During Hot Standby, the B-tree code writes a WAL record, when a deleted
page is recycled, to prevent the deletion from being replayed too early
in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I
think we need to do something similar in GiST.
I'll try fixing that tomorrow, unless you beat me to it. Making the
changes is pretty straightforward, but it's a bit cumbersome to test.
[1]
/message-id/1035d8e6-cfd1-0c27-8902-40d8d45eb6e8@iki.fi
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patch | text/x-patch | 56.3 KB |
0002-Delete-empty-pages-during-GiST-VACUUM.patch | text/x-patch | 29.0 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-21 16:06:41 |
Message-ID: | 36666F0F-F00A-4B44-97F9-F65223CA19CA@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> 21 марта 2019 г., в 2:30, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
> one remaining issue that needs to be fixed:
>
> During Hot Standby, the B-tree code writes a WAL record, when a deleted
> page is recycled, to prevent the deletion from being replayed too early
> in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I
> think we need to do something similar in GiST.
>
> I'll try fixing that tomorrow, unless you beat me to it. Making the
> changes is pretty straightforward, but it's a bit cumbersome to test.
I've tried to deal with it and got stuck... I think we should make the B-tree WAL record for this shared: remove the BlockNumber and other unused stuff, leaving just the xid and db oid.
And reuse this record for B-tree, GiST and GIN (yeah, GIN is not checking for that conflict).
Though, I'm not sure it is important for GIN. The scariest thing that can happen is that it returns the same tid twice. But it is doing a bitmap scan; you cannot set the same bit twice...
Eventually, hash, spgist and others will have the same thing too.
Best regards, Andrey Borodin.
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-21 17:04:52 |
Message-ID: | 738ecf7d-abc1-ed64-6af7-09eb6d9f3d2f@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 21/03/2019 18:06, Andrey Borodin wrote:
>> 21 марта 2019 г., в 2:30, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>> написал(а): one remaining issue that needs to be fixed:
>>
>> During Hot Standby, the B-tree code writes a WAL record, when a
>> deleted page is recycled, to prevent the deletion from being
>> replayed too early in the hot standby. See _bt_getbuf() and
>> btree_xlog_reuse_page(). I think we need to do something similar in
>> GiST.
>>
>> I'll try fixing that tomorrow, unless you beat me to it. Making
>> the changes is pretty straightforward, but it's a bit cumbersome to
>> test.
>
> I've tried to deal with it and got stuck...
So, I came up with the attached. I basically copy-pasted the page-reuse
WAL-logging stuff from nbtree.
When I started testing this, I quickly noticed that empty pages were not
being deleted nearly as much as I expected. I tracked it to this check
in gistdeletepage:
> + if (GistFollowRight(leafPage)
> + || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
> + {
> + /* Don't mess with a concurrent page split. */
> + return false;
> + }
That NSN test was bogus. It prevented the leaf page from being reused
if the parent page was *ever* split after the leaf page was created. I
don't see any reason to check the NSN here. The NSN is normally used to
detect whether a (leaf) page has been concurrently split when you
descend the tree. We don't need to care about that here; as long as the
FOLLOW_RIGHT flag is not set, the page has a downlink, and if we can
find the downlink and the page is empty, we can delete it.
After removing that bogus NSN check, page reuse became much more
effective. I've been testing this by running this test script repeatedly:
----------
/*
create sequence gist_test_seq;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
*/
insert into gist_point_tbl (id, p)
select nextval('gist_test_seq'), point(nextval('gist_test_seq'),
1000 + g) from generate_series(1, 10000) g;
delete from gist_point_tbl where id < currval('gist_test_seq') - 20000;
vacuum gist_point_tbl;
select pg_table_size('gist_point_tbl'), pg_indexes_size('gist_point_tbl');
----------
It inserts a bunch of rows, deletes a bunch of older rows, and vacuums.
The interesting thing here is that the key values keep "moving", so that
new tuples are added to different places than where old ones are
removed. That's the case where page reuse is needed.
Before this patch, the index bloated very quickly. With the patch, it
still bloats, because we still don't delete internal pages. But it's a
small fraction of the bloat you got before.
Attached is the latest patch version, to be applied on top of the
IntegerSet patch.
> I think we should make B-tree WAL record for this shared, remove
> BlockNumber and other unused stuff, leaving just xid and db oid. And
> reuse this record for B-tree, GiST and GIN (yeah, it is not checking
> for that conflict).
Good point. For now, I didn't try to generalize this, but perhaps we
should.
> Though, I'm not sure it is important for GIN. Scariest thing that can
> happen: it will return same tid twice. But it is doing bitmap scan,
> you cannot return same bit twice...
Hmm. Could it return a completely unrelated tuple? We don't always
recheck the original index quals in a bitmap index scan, IIRC. Also, a
search might get confused if it's descending down a posting tree, and
lands on a different kind of a page, altogether.
Alexander, you added the mechanism to GIN recently, to prevent pages
from being reused too early (commit 52ac6cd2d0). Do we need something
like B-tree's REUSE_PAGE records in GIN, too, to prevent the same bug
from happening in hot standby?
PS. for Gist, we could almost use the LSN / NSN mechanism to detect the
case that a deleted page is reused: Add a new field to the GiST page
header, to store a new "deleteNSN" field. When a page is deleted, the
deleted page's deleteNSN is set to the LSN of the deletion record. When
the page is reused, the deleteNSN field is kept unchanged. When you
follow a downlink during search, if you see that the page's deleteNSN >
parent's LSN, you know that it was concurrently deleted and recycled,
and should be ignored. That would allow reusing deleted pages
immediately. Unfortunately that would require adding a new field to the
gist page header/footer, which requires upgrade work :-(. Maybe one day,
we'll bite the bullet. Something to keep in mind, if we have to change
the page format anyway, for some reason.
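The deleteNSN idea sketched above can be modeled with toy pages (all field and function names are invented for illustration; this is not the GiST page header): a deleted page records the LSN of its deletion record, and a reader following a downlink treats the child as concurrently deleted and recycled if the child's deleteNSN is newer than the parent page it descended from.

```c
#include <stdint.h>

/* Toy model of the hypothetical "deleteNSN" scheme: each page carries
 * the LSN of the WAL record that last touched it, plus the LSN of the
 * record that deleted it (0 if it was never deleted). */
typedef struct ToyPage
{
    uint64_t lsn;        /* LSN of last WAL record for this page */
    uint64_t deleteNSN;  /* LSN of the deletion record, 0 if none */
} ToyPage;

static uint64_t wal_lsn = 1;    /* toy WAL insert position */

/* Deleting a page stamps it with the deletion record's LSN; the stamp
 * is deliberately kept when the page is later recycled. */
static void
delete_page(ToyPage *page)
{
    page->deleteNSN = ++wal_lsn;
    page->lsn = page->deleteNSN;
}

/* When following a downlink: if the child was deleted *after* the
 * parent page we read the downlink from was last written, the downlink
 * is stale and the child must be ignored. */
static int
downlink_is_stale(const ToyPage *parent, const ToyPage *child)
{
    return child->deleteNSN > parent->lsn;
}
```

Under this scheme a deleted page could be handed out for reuse immediately, since any stale downlink is detected at descent time; the blocker the mail mentions is that storing the extra field changes the page format.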
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0002-Delete-empty-pages-during-GiST-VACUUM.patch | text/x-patch | 35.5 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-22 08:00:05 |
Message-ID: | 5742CFFD-3A65-4980-9091-D74BE7C6D9E3@yandex-team.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> 22 марта 2019 г., в 1:04, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
> ...
> When I started testing this, I quickly noticed that empty pages were not being deleted nearly as much as I expected. I tracked it to this check in gistdeletepage:
>
>> + if (GistFollowRight(leafPage)
>> + || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
>> + {
>> + /* Don't mess with a concurrent page split. */
>> + return false;
>> + }
>
> That NSN test was bogus. It prevented the leaf page from being reused, if the parent page was *ever* split after the leaf page was created. I don't see any reason to check the NSN here.
That's true. This check made sense only when the parent page was not locked (and it seems the comparison should be the opposite). When both pages are locked, the test makes no sense at all. The check was incorrectly "fixed" by me when transitioning from single-scan delete to two-scan delete during summer 2018. Just wondering how hard it is to simply delete a page...
>> Though, I'm not sure it is important for GIN. Scariest thing that can
>> happen: it will return same tid twice. But it is doing bitmap scan,
>> you cannot return same bit twice...
>
> Hmm. Could it return a completely unrelated tuple?
No, I do not think so; it will do comparisons on posting tree tuples.
> We don't always recheck the original index quals in a bitmap index scan, IIRC. Also, a search might get confused if it's descending down a posting tree, and lands on a different kind of a page, altogether.
Yes, the search will return an error, the user will reissue the query, and everything will be almost fine.
> PS. for Gist, we could almost use the LSN / NSN mechanism to detect the case that a deleted page is reused: Add a new field to the GiST page header, to store a new "deleteNSN" field. When a page is deleted, the deleted page's deleteNSN is set to the LSN of the deletion record. When the page is reused, the deleteNSN field is kept unchanged. When you follow a downlink during search, if you see that the page's deleteNSN > parent's LSN, you know that it was concurrently deleted and recycled, and should be ignored. That would allow reusing deleted pages immediately. Unfortunately that would require adding a new field to the gist page header/footer, which requires upgrade work :-(. Maybe one day, we'll bite the bullet. Something to keep in mind, if we have to change the page format anyway, for some reason.
Yeah, the same day we will get rid of invalid tuples. I can make a patch for v13. Actually, I have a lot of patches that I want in GiST in v13. Or v14.
Best regards, Andrey Borodin.
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-22 10:28:53 |
Message-ID: | 8663c730-0895-cf80-acc2-e7149f6455e6@iki.fi |
Lists: | pgsql-hackers |
On 22/03/2019 10:00, Andrey Borodin wrote:
>> 22 марта 2019 г., в 1:04, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>> написал(а):
>>
>> PS. for Gist, we could almost use the LSN / NSN mechanism to detect
>> the case that a deleted page is reused: Add a new field to the GiST
>> page header, to store a new "deleteNSN" field. When a page is
>> deleted, the deleted page's deleteNSN is set to the LSN of the
>> deletion record. When the page is reused, the deleteNSN field is
>> kept unchanged. When you follow a downlink during search, if you
>> see that the page's deleteNSN > parent's LSN, you know that it was
>> concurrently deleted and recycled, and should be ignored. That
>> would allow reusing deleted pages immediately. Unfortunately that
>> would require adding a new field to the gist page header/footer,
>> which requires upgrade work :-(. Maybe one day, we'll bite the
>> bullet. Something to keep in mind, if we have to change the page
>> format anyway, for some reason.
>
> Yeah, the same day we will get rid of invalid tuples. I can make a
> patch for v13. Actually, I have a lot of patches that I want in GiST
> in v13. Or v14.
Cool! Here's my wishlist:
* That deleteNSN thing
* Add a metapage to blk #0.
* Add a "level"-field to page header.
* Currently, a search needs to scan all items on a page. If the keys are
small, that can be pretty slow. Divide each page further into e.g. 4
sub-pages, with a "bounding box" key for each sub-page, to speed up search.
- Heikki
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-22 11:37:36 |
Message-ID: | 4bf0968d-22a2-c7a7-9fd9-bd608da2ad2f@iki.fi |
Lists: | pgsql-hackers |
On 21/03/2019 19:04, Heikki Linnakangas wrote:
> Attached is the latest patch version, to be applied on top of the
> IntegerSet patch.
And committed. Thanks Andrey!
- Heikki
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-22 12:03:59 |
Message-ID: | 025ee75c-3960-a8d0-2278-af44ae6feb94@iki.fi |
Lists: | pgsql-hackers |
On 22/03/2019 13:37, Heikki Linnakangas wrote:
> On 21/03/2019 19:04, Heikki Linnakangas wrote:
>> Attached is the latest patch version, to be applied on top of the
>> IntegerSet patch.
>
> And committed. Thanks Andrey!
This caused the buildfarm to go pink... I was able to reproduce it, by
running the regression tests in one terminal, and repeatedly running
"VACUUM;" in another terminal. Strange that it seems to happen all the
time on the buildfarm, but never happened to me locally.
Anyway, I'm investigating.
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-22 12:48:15 |
Message-ID: | 349EC5CE-8D46-41EB-A440-A16403EE3184@yandex-team.ru |
Lists: | pgsql-hackers |
> 22 марта 2019 г., в 19:37, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
> On 21/03/2019 19:04, Heikki Linnakangas wrote:
>> Attached is the latest patch version, to be applied on top of the
>> IntegerSet patch.
>
> And committed. Thanks Andrey!
>
> - Heikki
Cool! Thank you very much! At the beginning I could not imagine how many caveats there are.
> 22 марта 2019 г., в 18:28, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
> * Currently, a search needs to scan all items on a page. If the keys are small, that can be pretty slow. Divide each page further into e.g. 4 sub-pages, with a "bounding box" key for each sub-page, to speed up search.
BTW, I already have an intra-page indexing patch. But now it obviously needs a rebase :) Along with the gist amcheck patch.
Thanks!
Best regards, Andrey Borodin.
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-24 16:50:41 |
Message-ID: | 835A15A5-F1B4-4446-A711-BF48357EB602@yandex-team.ru |
Lists: | pgsql-hackers |
> 22 марта 2019 г., в 17:03, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
I was working on a new version of the gist check in amcheck and understood one more thing:
/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
return PageIsNew(page) ||
(GistPageIsDeleted(page) &&
TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}
Here RecentGlobalXmin can wrap around, and the page will become unrecyclable for half of the XID cycle. Can we prevent this by resetting the page's delete XID to InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)
Best regards, Andrey Borodin.
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-03-25 13:20:39 |
Message-ID: | 5f7ed675-d1fc-66ef-f316-645080ff9625@iki.fi |
Lists: | pgsql-hackers |
On 24/03/2019 18:50, Andrey Borodin wrote:
> I was working on new version of gist check in amcheck and understand one more thing:
>
> /* Can this page be recycled yet? */
> bool
> gistPageRecyclable(Page page)
> {
> return PageIsNew(page) ||
> (GistPageIsDeleted(page) &&
> TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
> }
>
> Here RecentGlobalXmin can wraparound and page will become unrecyclable for half of xid cycle. Can we prevent it by resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
> (Seems like same applies to GIN...)
True, and B-tree has the same issue. I thought I saw a comment somewhere
in the B-tree code about that earlier, but now I can't find it. I
must've imagined it.
We could reset it, but that would require dirtying the page. That would
be just extra I/O overhead, if the page gets reused before XID
wraparound. We could avoid that if we stored the full XID+epoch, not
just XID. I think we should do that in GiST, at least, where this is
new. In the B-tree, it would require some extra code to deal with
backwards-compatibility, but maybe it would be worth it even there.
- Heikki
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-04-04 15:15:21 |
Message-ID: | 68d00017-ae5c-b14f-fc3a-c9e38e3ce792@iki.fi |
Lists: | pgsql-hackers |
On 25/03/2019 15:20, Heikki Linnakangas wrote:
> On 24/03/2019 18:50, Andrey Borodin wrote:
>> I was working on new version of gist check in amcheck and understand one more thing:
>>
>> /* Can this page be recycled yet? */
>> bool
>> gistPageRecyclable(Page page)
>> {
>> return PageIsNew(page) ||
>> (GistPageIsDeleted(page) &&
>> TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
>> }
>>
>> Here RecentGlobalXmin can wraparound and page will become unrecyclable for half of xid cycle. Can we prevent it by resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
>> (Seems like same applies to GIN...)
>
> True, and B-tree has the same issue. I thought I saw a comment somewhere
> in the B-tree code about that earlier, but now I can't find it. I
> must've imagined it.
>
> We could reset it, but that would require dirtying the page. That would
> be just extra I/O overhead, if the page gets reused before XID
> wraparound. We could avoid that if we stored the full XID+epoch, not
> just XID. I think we should do that in GiST, at least, where this is
> new. In the B-tree, it would require some extra code to deal with
> backwards-compatibility, but maybe it would be worth it even there.
I suggest that we do the attached. It fixes this for GiST. The patch
expands the "deletion XID" to 64 bits, and changes where it's
stored. Instead of storing it in pd_prune_xid, it's stored in the page
contents. Luckily, a deleted page has no real content.
I think we should fix this in a similar manner in B-tree, too, but that
can be done separately. For B-tree, we need to worry about
backwards-compatibility, but that seems simple enough; we just need to
continue to understand old deleted pages, where the deletion XID is
stored in the page opaque field.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0001-Refactor-checks-for-deleted-GiST-pages.patch | text/x-patch | 4.5 KB |
0002-Use-full-64-bit-XID-for-checking-if-a-deleted-GiST-p.patch | text/x-patch | 9.7 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-04-05 05:39:19 |
Message-ID: | 4E1402A8-C05E-4087-B419-03FF5DDA4FB2@yandex-team.ru |
Lists: | pgsql-hackers |
Hi!
> 4 апр. 2019 г., в 20:15, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
> On 25/03/2019 15:20, Heikki Linnakangas wrote:
>> On 24/03/2019 18:50, Andrey Borodin wrote:
>>> I was working on new version of gist check in amcheck and understand one more thing:
>>>
>>> /* Can this page be recycled yet? */
>>> bool
>>> gistPageRecyclable(Page page)
>>> {
>>> return PageIsNew(page) ||
>>> (GistPageIsDeleted(page) &&
>>> TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
>>> }
>>>
>>> Here RecentGlobalXmin can wraparound and page will become unrecyclable for half of xid cycle. Can we prevent it by resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
>>> (Seems like same applies to GIN...)
>> True, and B-tree has the same issue. I thought I saw a comment somewhere
>> in the B-tree code about that earlier, but now I can't find it. I
>> must've imagined it.
>> We could reset it, but that would require dirtying the page. That would
>> be just extra I/O overhead, if the page gets reused before XID
>> wraparound. We could avoid that if we stored the full XID+epoch, not
>> just XID. I think we should do that in GiST, at least, where this is
>> new. In the B-tree, it would require some extra code to deal with
>> backwards-compatibility, but maybe it would be worth it even there.
>
> I suggest that we do the attached. It fixes this for GiST. The patch changes expands the "deletion XID" to 64-bits, and changes where it's stored. Instead of storing it pd_prune_xid, it's stored in the page contents. Luckily, a deleted page has no real content.
So, we store the full xid right after the page header?
+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+ Assert(PageIsEmpty(page));
+ ((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId);
+
+ *((FullTransactionId *) PageGetContents(page)) = deletexid;
+}
Usually we leave one ItemId (located at an invalid offset number) untouched. I do not know whether it is done for a reason or not...
Also, I did not understand this optimization:
+ /*
+ * We can skip this if the page was deleted so long ago, that no scan can possibly
+ * still see it, even in a standby. One measure might be anything older than the
+ * table's frozen-xid, but we don't have that at hand here. But anything older than
+ * 2 billion, from the next XID, is surely old enough, because you would hit XID
+ * wraparound at that point.
+ */
+ nextxid = ReadNextFullTransactionId();
+ diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
+ if (diff < 0x7fffffff)
+ return;
A standby can lag months behind the primary and, theoretically, close the gap in one sudden WAL leap... Also, I think that comparison sign should be >, not <.
Best regards, Andrey Borodin.
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-24 08:26:04 |
Message-ID: | 20190624082604.GG1637@paquier.xyz |
Lists: | pgsql-hackers |
Heikki,
On Thu, Apr 04, 2019 at 06:15:21PM +0300, Heikki Linnakangas wrote:
> I think we should fix this in a similar manner in B-tree, too, but that can
> be done separately. For B-tree, we need to worry about
> backwards-compatibility, but that seems simple enough; we just need to
> continue to understand old deleted pages, where the deletion XID is stored
> in the page opaque field.
This is an open item present already for a couple of weeks. Are you
planning to tackle that soon?
--
Michael
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-25 08:10:11 |
Message-ID: | 4732efe3-9d96-58cb-a5ce-3f45b9f16ae8@iki.fi |
Lists: | pgsql-hackers |
(Thanks for the reminder on this, Michael!)
On 05/04/2019 08:39, Andrey Borodin wrote:
>> 4 апр. 2019 г., в 20:15, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>> I suggest that we do the attached. It fixes this for GiST. The
>> patch changes expands the "deletion XID" to 64-bits, and changes
>> where it's stored. Instead of storing it pd_prune_xid, it's stored
>> in the page contents. Luckily, a deleted page has no real content.
>
> So, we store full xid right after page header?
Yep.
> +static inline void
> +GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
> +{
> + Assert(PageIsEmpty(page));
> + ((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId);
> +
> + *((FullTransactionId *) PageGetContents(page)) = deletexid;
> +}
>
> Usually we leave one ItemId (located at invalid offset number)
> untouched. I do not know is it done for a reason or not....
No. Take a look at the PageGetItemId() macro; it subtracts one from the
offset number. But in any case, that's not really relevant here, because
this patch stores the transaction id directly as the page content. There
are no itemids at all on a deleted page.
> Also, I did not understand this optimization:
> + /*
> + * We can skip this if the page was deleted so long ago, that no scan can possibly
> + * still see it, even in a standby. One measure might be anything older than the
> + * table's frozen-xid, but we don't have that at hand here. But anything older than
> + * 2 billion, from the next XID, is surely old enough, because you would hit XID
> + * wraparound at that point.
> + */
> + nextxid = ReadNextFullTransactionId();
> + diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
> + if (diff < 0x7fffffff)
> + return;
>
> Standby can be lagging months from primary, and, theoretically, close
> the gap in one sudden WAL leap...
It would still process the WAL one WAL record at a time, even if it's
lagging months behind. It can't just jump over 2 billion XIDs.
> Also, I think, that comparison sign should be >, not <.
Ah, good catch! And it shows that this needs more testing..
- Heikki
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-25 09:38:43 |
Message-ID: | 6D4AAC25-CB7F-42D1-9757-D9FE3F418743@yandex-team.ru |
Lists: | pgsql-hackers |
Hi!
Thanks for clarification, now I understand these patches better.
> 25 июня 2019 г., в 13:10, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
>> Also, I did not understand this optimization:
>> + /*
>> + * We can skip this if the page was deleted so long ago, that no scan can possibly
>> + * still see it, even in a standby. One measure might be anything older than the
>> + * table's frozen-xid, but we don't have that at hand here. But anything older than
>> + * 2 billion, from the next XID, is surely old enough, because you would hit XID
>> + * wraparound at that point.
>> + */
>> + nextxid = ReadNextFullTransactionId();
>> + diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
>> + if (diff < 0x7fffffff)
>> + return;
>> Standby can be lagging months from primary, and, theoretically, close
>> the gap in one sudden WAL leap...
> It would still process the WAL one WAL record at a time, even if it's lagging months behind. It can't just jump over 2 billion XIDs.
I feel a little uncomfortable with the number 0x7fffffff right in the code.
Thanks!
Best regards, Andrey Borodin.
From: | Michael Paquier <michael(at)paquier(dot)xyz> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-26 02:33:09 |
Message-ID: | 20190626023309.GG1714@paquier.xyz |
Lists: | pgsql-hackers |
On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
> I feel a little uncomfortable with number 0x7fffffff right in code.
PG_INT32_MAX...
--
Michael
From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-26 03:07:49 |
Message-ID: | CA+hUKG+q8sYnhStNvpwnozTS-BJKj0Ade5wmwJe-wZTqspmmMg@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
> > I feel a little uncomfortable with number 0x7fffffff right in code.
>
> PG_INT32_MAX...
MaxTransactionId / 2?
--
Thomas Munro
https://enterprisedb.com
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-27 11:38:47 |
Message-ID: | f17e4589-a7f7-605e-4243-142e26260711@iki.fi |
Lists: | pgsql-hackers |
On 26/06/2019 06:07, Thomas Munro wrote:
> On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
>>> I feel a little uncomfortable with number 0x7fffffff right in code.
>>
>> PG_INT32_MAX...
>
> MaxTransactionId / 2?
Yeah, that makes sense.
Here's a new version of the patches. Changes:
* I changed the reuse-logging so that we always write a reuse WAL
record, even if the deleteXid is very old. I tried to avoid that with
the check for MaxTransactionId / 2 or 0x7fffffff, but it had some
problems. In the previous patch version, it wasn't just an optimization.
Without the check, we would write 32-bit XIDs to the log that had
already wrapped around, and that caused the standby to unnecessarily
wait for or kill backends. But checking for MaxTransactionId / 2 isn't
quite enough: there was a small chance that the next XID counter
advanced just after checking for that, so that we still logged an XID
that had just wrapped around. A more robust way to deal with this is to
log a full 64-bit XID, and check for wraparound at redo in the standby.
And if we do that, trying to optimize this in the master doesn't seem
that important anymore. So in this patch version, we always log the
64-bit XID, and check against MaxTransactionId / 2 when replaying the WAL
record instead.
* I moved the logic to extend a 32-bit XID to 64-bits to a new helper
function in varsup.c.
* Instead of storing just a naked FullTransactionId in the "page
contents" of a deleted page, I created a new struct for that. The effect
is the same, but I think the struct clarifies what's happening, and it's
a good location to attach a comment explaining it.
* Fixed the mixup between < and >
I haven't done any testing on this yet. Andrey, would you happen to have
an environment ready to test this?
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0001-Refactor-checks-for-deleted-GiST-pages.patch | text/x-patch | 4.5 KB |
0002-Use-full-64-bit-XID-for-checking-if-a-deleted-GiST-p.patch | text/x-patch | 12.2 KB |
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-27 13:12:29 |
Message-ID: | 78F49EB5-1F5D-4C93-882C-88BBB1C2631F@yandex-team.ru |
Lists: | pgsql-hackers |
> 27 июня 2019 г., в 16:38, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
> I haven't done any testing on this yet. Andrey, would you happen to have an environment ready to test this?
Thanks!
I will do some testing this evening (UTC+5). But I'm not sure I can reliably test wraparound of xids...
I will try to break the code with the usual setup that we used to stress vacuum with deletes and inserts.
Best regards, Andrey Borodin.
From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-27 17:15:40 |
Message-ID: | CDCAACB6-04F6-45C9-85E9-389B913F92E9@yandex-team.ru |
Lists: | pgsql-hackers |
> 27 июня 2019 г., в 16:38, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> написал(а):
>
> I haven't done any testing on this yet. Andrey, would you happen to have an environment ready to test this?
The patches do not deadlock, and they do delete pages on the "rescan test" - the setup that we used to detect "left jumps" during development of physical vacuum. check-world is happy on my machine.
I really like that now there is GISTDeletedPageContents and we do not just cast *(FullTransactionId *) PageGetContents(page).
But I have a stupid question again, about this code:
nextFullXid = ReadNextFullTransactionId();
diff = U64FromFullTransactionId(nextFullXid) -
U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
TransactionId latestRemovedXid;
// sleep(100500 hours); latestRemovedXid becomes xid from future
latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
xlrec->node);
}
Do we have a race condition here? Can latestRemovedXid wrap around and become an xid from the future?
I understand that it is purely hypothetical, but still, latestRemovedXid is already from the ancient past.
Best regards, Andrey Borodin.
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-27 21:19:15 |
Message-ID: | 55249fb5-ea9e-0485-ab64-983b3369b131@iki.fi |
Lists: | pgsql-hackers |
On 27/06/2019 20:15, Andrey Borodin wrote:
> But I have stupid question again, about this code:
>
> https://github.com/x4m/postgres_g/commit/096d5586537d29ff316ca8ce074bbe1b325879ee#diff-754126824470cb8e68fd5e32af6d3bcaR417
>
> nextFullXid = ReadNextFullTransactionId();
> diff = U64FromFullTransactionId(nextFullXid) -
> U64FromFullTransactionId(latestRemovedFullXid);
> if (diff < MaxTransactionId / 2)
> {
> TransactionId latestRemovedXid;
>
> // sleep(100500 hours); latestRemovedXid becomes xid from future
>
> latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
> ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
> xlrec->node);
> }
>
> Do we have a race condition here? Can latestRemovedXid overlap and start to be xid from future?
> I understand that it is purely hypothetical, but still latestRemovedXid is from ancient past already.
Good question. No, that can't happen, because this code is in the WAL
redo function. In a standby, the next XID counter only moves forward
when a WAL record is replayed that advances it, and all WAL records are
replayed serially, so that can't happen when we're in the middle of
replaying this record. A comment on that would be good, though.
When I originally had the check like above in the code that created the
WAL record, it had exactly that problem, because in the master the next
XID counter can advance concurrently.
- Heikki
From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-06-27 22:02:03 |
Message-ID: | CA+hUKGJPuKR7i7UvmXRXhjhdW=3v1-nSO3aFn4XDLdkBJru15g@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
> function in varsup.c.
I'm a bit uneasy about making this into a generally available function
for reuse. I think we should instead come up with a very small number
of sources of fxids that are known to be free of races because of some
well-explained interlocking.
For example, I believe it is safe to convert an xid obtained from a
WAL record during recovery, because (for now) recovery is a single
thread of execution and the next fxid can't advance underneath you.
So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
to propose in another thread (though I've forgotten who wrote it,
maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
would have written) is a safe blessed source of fxids. The rationale
for writing this function instead of an earlier code that looked much
like your proposed helper function, during EDB-internal review of
zheap stuff, was this: if we provide a general purpose xid->fxid
facility, it's virtually guaranteed that someone will eventually use
it to shoot footwards, by acquiring an xid from somewhere, and then
some arbitrary time later extending it to a fxid when no interlocking
guarantees that the later conversion has the correct epoch.
I'd like to figure out how to maintain full versions of the
procarray.c-managed xid horizons, without widening the shared memory
representation. I was planning to think harder about that for v13.
Obviously that doesn't help you now. So I'm wondering if you should
just open-code this for now, and put in a comment about why it's safe
and a note that there'll hopefully be a fxid horizon available in a
later release?
[1] https://github.com/EnterpriseDB/zheap/commit/1203c2fa49f5f872f42ea4a02ccba2b191c45f32
--
Thomas Munro
https://enterprisedb.com
From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-07-04 01:11:10 |
Message-ID: | CAH2-Wz=s25OKJOYB69ggY-2PWwzXYeR69x_MkxsDms288TUGrQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Apr 4, 2019 at 8:15 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I think we should fix this in a similar manner in B-tree, too, but that
> can be done separately. For B-tree, we need to worry about
> backwards-compatibility, but that seems simple enough; we just need to
> continue to understand old deleted pages, where the deletion XID is
> stored in the page opaque field.
What Postgres versions will the B-Tree fix end up targeting? Sounds
like you plan to backpatch all the way?
--
Peter Geoghegan
From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-07-16 08:12:03 |
Message-ID: | CAA4eK1LD-T0+QZR7qXNsiqJFqvAGWZ3ujnC8YCPekjG6Vq+ZGw@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Jun 28, 2019 at 3:32 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> > * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
> > function in varsup.c.
>
> I'm a bit uneasy about making this into a generally available function
> for reuse. I think we should instead come up with a very small number
> of sources of fxids that are known to be free of races because of some
> well explained interlocking.
>
I have two more cases in the undo patch series where the same function
is needed and is safe to use. The first place is twophase.c, for
rolling back prepared transactions, where we know we don't support
aborted xacts older than 2 billion transactions, so we can rely on
such a function. We also need it in undodiscard.c to compute the
value of oldestFullXidHavingUnappliedUndo; see the usage of
GetEpochForXid in the undo-processing patches. Now, we might find a
way to avoid using it in one of these places by doing some more work,
but I'm not sure we can avoid it in all three places (one proposed by
this patch and two by the undo patchset).
> For example, I believe it is safe to convert an xid obtained from a
> WAL record during recovery, because (for now) recovery is a single
> thread of execution and the next fxid can't advance underneath you.
> So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
> to propose in another thread (though I've forgotten who wrote it,
> maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
> would have written) is a safe blessed source of fxids. The rationale
> for writing this function instead of earlier code that looked much
> like your proposed helper function, during EDB-internal review of
> zheap stuff, was this: if we provide a general purpose xid->fxid
> facility, it's virtually guaranteed that someone will eventually use
> it to shoot footwards, by acquiring an xid from somewhere, and then
> some arbitrary time later extending it to a fxid when no interlocking
> guarantees that the later conversion has the correct epoch.
>
> I'd like to figure out how to maintain full versions of the
> procarray.c-managed xid horizons, without widening the shared memory
> representation. I was planning to think harder about that for 13.
> Obviously that doesn't help you now. So I'm wondering if you should
> just open-code this for now, and put in a comment about why it's safe
> and a note that there'll hopefully be a fxid horizon available in a
> later release?
>
Do you suggest open-coding it in all three places for now? I am not
against open-coding the logic for now, but I'm not sure we can
eliminate the need for it: even if we do what you are suggesting with
the procarray.c-managed xid horizons, I think we still need to do
something about clog.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-07-22 13:09:48 |
Message-ID: | 2a45714a-6973-0b5d-e78b-b56618fce17b@iki.fi |
Lists: | pgsql-hackers |
On 28/06/2019 01:02, Thomas Munro wrote:
> On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
>> function in varsup.c.
>
> I'm a bit uneasy about making this into a generally available function
> for reuse. I think we should instead come up with a very small number
> of sources of fxids that are known to be free of races because of some
> well explained interlocking.
>
> For example, I believe it is safe to convert an xid obtained from a
> WAL record during recovery, because (for now) recovery is a single
> thread of execution and the next fxid can't advance underneath you.
> So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
> to propose in another thread (though I've forgotten who wrote it,
> maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
> would have written) is a safe blessed source of fxids. The rationale
> for writing this function instead of earlier code that looked much
> like your proposed helper function, during EDB-internal review of
> zheap stuff, was this: if we provide a general purpose xid->fxid
> facility, it's virtually guaranteed that someone will eventually use
> it to shoot footwards, by acquiring an xid from somewhere, and then
> some arbitrary time later extending it to a fxid when no interlocking
> guarantees that the later conversion has the correct epoch.
Fair point.
> I'd like to figure out how to maintain full versions of the
> procarray.c-managed xid horizons, without widening the shared memory
> representation. I was planning to think harder about that for 13.
> Obviously that doesn't help you now. So I'm wondering if you should
> just open-code this for now, and put in a comment about why it's safe
> and a note that there'll hopefully be a fxid horizon available in a
> later release?
I came up with the attached. Instead of having a general-purpose
"widen" function, it adds GetFullRecentGlobalXmin(), which returns a
64-bit version of RecentGlobalXmin. That's enough for this GiST vacuum
patch.
In addition to that change, I modified the GistPageSetDeleted(),
GistPageSetDeleteXid() and GistPageGetDeleteXid() inline functions a
bit. I merged GistPageSetDeleted() and GistPageSetDeleteXid() into a
single function, to make sure that the delete-XID is always set when a
page is marked as deleted. And I modified GistPageGetDeleteXid() to
return NormalTransactionId (or a FullTransactionId version of it, to
be precise) for GiST pages that were deleted with older PostgreSQL v12
beta versions. The previous patch tripped an assertion in that case,
which is not nice for people binary-upgrading from earlier beta
versions.
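The shape of those changes can be sketched as follows (the struct and function names here are illustrative stand-ins, not the real GiST page layout, which keeps this state in the page's special area): the merged setter records the delete-XID at the same moment the page is flagged deleted, so the two can never diverge, and the getter falls back to NormalTransactionId for pages deleted by an older v12 beta that never stored an XID.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t value; } FullTransactionId;

#define NormalTransactionId 3ULL    /* oldest possible normal xid, epoch 0 */

/* Illustrative stand-in for a deleted GiST page. */
typedef struct SketchPage
{
    bool deleted;
    bool has_delete_xid;            /* false on pages from older v12 betas */
    FullTransactionId delete_xid;
} SketchPage;

/*
 * One function both marks the page deleted and records the XID, so a
 * caller can't do one without the other.
 */
static void
sketch_page_set_deleted(SketchPage *page, FullTransactionId xid)
{
    page->deleted = true;
    page->has_delete_xid = true;
    page->delete_xid = xid;
}

static FullTransactionId
sketch_page_get_delete_xid(const SketchPage *page)
{
    /*
     * Page deleted by an older beta without a stored XID: report the
     * oldest normal xid, so the page is treated as long since deleted
     * (and hence recyclable) instead of tripping an assertion.
     */
    if (!page->has_delete_xid)
        return (FullTransactionId) { NormalTransactionId };
    return page->delete_xid;
}
```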
I did some testing of this with XID wraparound, and it seems to work. I
used the attached bash script for the testing, with a helper contrib
module to consume XIDs faster. It's not very well commented and
probably not too useful for anyone, but I'm attaching it here mainly
as a note to my future self, in case we need to revisit this.
Unless something comes up, I'll commit this tomorrow.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0001-Refactor-checks-for-deleted-GiST-pages.patch | text/x-patch | 4.5 KB |
0002-Use-full-64-bit-XID-for-checking-if-a-deleted-GiST-p.patch | text/x-patch | 13.3 KB |
gist-vacuum-wraparound-test.sh | application/x-shellscript | 1.1 KB |
xidtest.tar.gz | application/gzip | 1.4 KB |
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-07-24 17:30:09 |
Message-ID: | a63c3a13-f985-f3c2-3227-9f8ca201fa93@iki.fi |
Lists: | pgsql-hackers |
On 22/07/2019 16:09, Heikki Linnakangas wrote:
> Unless something comes up, I'll commit this tomorrow.
Pushed this now, to master and REL_12_STABLE.
Now, B-tree indexes still have the same problem, in all versions. Any
volunteers to write a similar fix for B-trees?
- Heikki
From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-07-24 18:02:13 |
Message-ID: | CAH2-WznOjh1nYz_grTYGR91ALA3qMd+nWN+0F=N3n8NecLvi6w@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Pushed this now, to master and REL_12_STABLE.
>
> Now, B-tree indexes still have the same problem, in all versions. Any
> volunteers to write a similar fix for B-trees?
I was hoping that you'd work on it. :-)
Any reason to think that it's much different to what you've done with GiST?
--
Peter Geoghegan
From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-07-24 18:32:58 |
Message-ID: | 135dfb36-e1d5-fd93-83b0-43c0b6ecf290@iki.fi |
Lists: | pgsql-hackers |
On 24/07/2019 21:02, Peter Geoghegan wrote:
> On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> Pushed this now, to master and REL_12_STABLE.
>>
>> Now, B-tree indexes still have the same problem, in all versions. Any
>> volunteers to write a similar fix for B-trees?
>
> I was hoping that you'd work on it. :-)
That's probably how it's going to go, but hey, doesn't hurt to ask :-).
> Any reason to think that it's much different to what you've done with GiST?
No, it should be very similar.
- Heikki
From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Костя Кузнецов <chapaev28(at)ya(dot)ru> |
Subject: | Re: GiST VACUUM |
Date: | 2019-07-24 18:41:13 |
Message-ID: | CAH2-WzkE8tKc4pH5SXD+fa0ZC18hJrYOu0CYqhgE1TM2jEYFEQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jul 24, 2019 at 11:33 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> That's probably how it's going to go, but hey, doesn't hurt to ask :-).
I think that it would be fine to be conservative with nbtree, and only
target the master branch. The problem is annoying, certainly, but it's
not likely to make a huge difference for most real-world workloads.
OTOH, perhaps the risk is so low that we might as well target the
back branches.
How do you feel about it?
--
Peter Geoghegan