From: | GUO Rui <ruig2(at)uci(dot)edu> |
---|---|
To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
Cc: | pgsql-hackers(at)postgresql(dot)org, a(dot)korotkov(at)postgrespro(dot)ru |
Subject: | Re: Google Summer of Code: question about GiST API advancement project |
Date: | 2019-04-05 09:07:10 |
Message-ID: | CAEuz5PRfkd9S+MNsfzkbLtH-rtN6O4PseJcm7Zit-U4fVMZR7g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | Postg젠 토토SQL : |
I drafted my proposal about the above topic at
https://docs.google.com/document/d/1X7Lw-c0rLYuSjwLNfw6qXpN5Cf1_0u2gXtgEgLkNezA/edit?usp=sharing
. Looking forward to your feedback.
On Wed, Apr 3, 2019 at 10:55 PM GUO Rui <ruig2(at)uci(dot)edu> wrote:
> Dear Andrey Borodin,
>
> I discussed the above topic with the professors at my school, and got the
> following points:
>
> 1. The case that the volume of an MBB is 0 should be very rare, and if the
> data is skewed (e.g. only a few nodes have non-NULL value on a dimension)
> then the data can be pre-proceeded and normalized before it goes to the
> database, thus the storage and query can be much faster;
> 2. The performance of the R-tree family may depend on the specific data
> set and even the order of the data insertions, so one algorithm may be
> better on one dataset and slower on another, thus the benchmark should
> include different datasets;
>
> I totally agree that by adopting the RR*-tree algorithm we can improve the
> performance of PostgreSQL. For my proposal, I'll:
> 1. Document the benchmarks I found available online (e.g.
> https://github.com/ambling/rtree-benchmark) and then state how we'd like
> to generate data ourselves (e.g. data with a Gaussian distribution, or the
> same dataset but different insertion order...) to test with for a wilder
> coverage;
> 2. Create tools to generate a report on current PostgreSQL performance
> with the benchmark;
> 3. Plan to improve the R-tree and GiST part of PostgreSQL. For the
> discussion in the email thread
> /message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail(dot)gmail(dot)com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg(at)mail(dot)gmail(dot)com
> </message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail(dot)gmail(dot)com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg(at)mail(dot)gmail(dot)com>
> , I prefer to do a *scale-based* trick rather than using bits in a float
> or creating a new struct;
> 4. Generate a performance report on PostgreSQL with the above R-tree patch;
> The following would be marked as *optional*:
> 5. Optimize GiST with New APIs (e.g. non-penalty-based choose subtree
> function, also discussed in the above email thread);
> 6. For skewed data, try to warn the user, and then suggest methods to cook
> the data (e.g. the normalization algorithms in ML); pre-proceeding the data
> should not be the duty of the database;
> 7. Other advanced features of RR*-tree and GiST bulk loading;
>
> Any comments or feedback on the above ideas? I'll work on a draft proposal
> ASAP.
>
> Many thanks,
> Rui Guo
>
>
>
>
> On Sun, Mar 31, 2019 at 10:53 AM Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
> wrote:
>
>> Hi!
>>
>> > 31 марта 2019 г., в 14:58, GUO Rui <ruig2(at)uci(dot)edu> написал(а):
>> >
>> > I'm Rui Guo, a PhD student focusing on database at the University of
>> California, Irvine. I'm interested in the "GiST API advancement" project
>> for the Google Summer of Code 2019 which is listed at
>> https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29
>> .
>> >
>> > I'm still reading about RR*-tree, GiST and the PostgreSQL source code
>> to have a better idea on my proposal. Meanwhile, I have a very basic and
>> simple question:
>> >
>> > Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are
>> heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the
>> least), is it possible to apply machine learning algorithm to improve it?
>> The only related reference I got is to use deep learning in database join
>> operation (https://arxiv.org/abs/1808.03196) Is it not suitable to use
>> machine learning here or someone already did?
>>
>> If you are interested in ML and DBs you should definitely look into [0].
>> You do not have to base your proposal on mentor ideas, you can use your
>> own. Implementing learned indexes - seems reasonable.
>>
>> RR*-tree algorithms are heuristic in some specific parts, but in general
>> they are designed to optimize very clear metrics. Generally, ML algorithms
>> tend to compose much bigger pile of heuristics and solve less
>> mathematically clear tasks than splitting subtrees or choosing subtree for
>> insertion.
>> R*-tree algorithms are heuristic only to be faster.
>>
>> Best regards, Andrey Borodin.
>>
>> [0] https://arxiv.org/pdf/1712.01208.pdf
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Floris Van Nee | 2019-04-05 09:13:29 | Re: speeding up planning with partitions |
Previous Message | Thomas Munro | 2019-04-05 09:05:02 | Re: Using condition variables to wait for checkpoints |