From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Spilling hashed SetOps and aggregates to disk |
Date: | 2018-06-13 14:08:47 |
Message-ID: | CA+TgmoY6MWPym2=19dS4f2kenwN6irSxThm0MyG8RCkOHxvdqQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | Postg토토SQL : Postg토토SQL |
On Mon, Jun 11, 2018 at 1:50 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> I think average group size is the wrong way to look at it, because
> averages are only useful for a normal distribution. The two most
> interesting cases are:
>
> 1. Fairly uniform group size (e.g. normal dist)
> 2. Skew values, power law distribution, etc., where some groups are
> very large and most are tiny by comparison. I am calling the very large
> groups "skewed groups".
I wasn't using the term "average" in a mathematical sense. I suppose
I really meant "typical". I agree that thinking about cases where the
group size is nonuniform is a good idea, but I don't think I agree
that all uniform distributions are created equal. Uniform
distributions with 1 row per group are very different than uniform
distributions with 1000 rows per group.
> If we get the skewed groups into the hash table, and avoid OOM, I think
> that is a success. My patch did that, except it didn't account for two
> cases:
>
> (a) clustered input, where skewed groups may not appear until the
> hash table is already full; and
> (b) variable-sized transition values that grow with the group size
I think that many of the algorithms under consideration could be
implemented without worrying about variable-sized transition values,
and then the approach could be extended later. However, whether or
not a given algorithm can handle clustered input seems like a fairly
basic property of the algorithm. I don't think we should set the bar
too high because no algorithm is going to be perfect in every case; at
the same time, clustered input is pretty common in real-world
scenarios, and an algorithm that handles such cases better has a
significant leg up over one that can't, all other things being equal.
I'm not sure I remember precisely what your proposal was any more.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Simon Riggs | 2018-06-13 14:25:10 | Re: hot_standby_feedback vs excludeVacuum and snapshots |
Previous Message | Ashutosh Bapat | 2018-06-13 13:57:29 | Re: Partitioning with temp tables is broken |