Re: improve Chinese locale performance

Lists:	pgsql-hackers

From:	Craig Ringer <craig(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Greg Stark <stark(at)mit(dot)edu>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Quan Zongliang <quanzongliang(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: improve Chinese locale performance
Date:	2013-07-23 13:42:16
Message-ID:	xicwfqna5cb1csbw3se45b0r.1374586861704@email.android.com

(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc:	Greg Stark <stark(at)mit(dot)edu>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Quan Zongliang <quanzongliang(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: improve Chinese locale performance
Date:	2013-07-23 14:34:21
Message-ID:	CA+Tgmob8UxfNDc1gyX=7tPLtcaDcYzHLhSrDAkGkNq8-0YaJfA@mail.gmail.com

On Tue, Jul 23, 2013 at 9:42 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> (Replying on phone, please forgive bad quoting)
>
> Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

Yes.

> I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.

I pretty much lost interest in ICU upon reading that they use UTF-16
as their internal format.

http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

What that would mean for us is that instead of copying the input
strings into a temporary buffer and passing the buffer to strcoll(),
we'd need to convert them to ICU's representation (which means writing
twice as many bytes as the length of the input string in cases where
the input string is mostly single-byte characters) and then call ICU's
strcoll() equivalent. I agree that it might be worth testing, but I
can't work up much optimism. It seems to me that something that
operates directly on the server encoding could run a whole lot faster.
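
As a rough illustration of the two costs being compared above, here is a minimal sketch in C. It assumes length-counted input strings that are not NUL-terminated; the function names cmp_via_strcoll() and utf16_bytes_needed() are hypothetical and are not taken from the PostgreSQL source or from ICU. The first function shows the copy-then-strcoll() pattern, the second only counts how many bytes a well-formed UTF-8 string would occupy after conversion to UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Compare two length-counted (not NUL-terminated) strings with strcoll().
 * strcoll() needs NUL-terminated input, so each string is first copied
 * into a temporary buffer. Returns <0, 0 or >0 like strcoll(). If a
 * buffer cannot be allocated, falls back to a plain bytewise comparison
 * (an assumption made only to keep this sketch simple). */
int cmp_via_strcoll(const char *a, size_t alen, const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);

    char *abuf = malloc(alen + 1);
    char *bbuf = malloc(blen + 1);
    int result;

    if (abuf == NULL || bbuf == NULL) {
        size_t n = alen < blen ? alen : blen;
        result = memcmp(a, b, n);
        if (result == 0)
            result = (alen > blen) - (alen < blen);
    } else {
        memcpy(abuf, a, alen);
        abuf[alen] = '\0';
        memcpy(bbuf, b, blen);
        bbuf[blen] = '\0';
        result = strcoll(abuf, bbuf);
    }

    free(abuf);
    free(bbuf);
    return result;
}

/* Number of bytes a well-formed UTF-8 string would need as UTF-16.
 * Continuation bytes (10xxxxxx) are skipped; a 4-byte sequence maps
 * to a surrogate pair (two 16-bit units), everything else to one unit. */
size_t utf16_bytes_needed(const unsigned char *s, size_t len)
{
    size_t units = 0;

    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) == 0x80)
            continue;                     /* continuation byte */
        units += (s[i] >= 0xF0) ? 2 : 1;  /* 4-byte lead byte needs a pair */
    }
    return units * 2;
}
```

For ASCII input, utf16_bytes_needed() returns twice the input length, which is the doubling described above. A typical CJK character takes 3 bytes in UTF-8 but 2 bytes in UTF-16, so for mostly-Chinese text the write volume would be smaller rather than larger; the conversion pass itself is still extra work.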

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Quan Zongliang <quanzongliang(at)gmail(dot)com>
To:	Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: improve Chinese locale performance
Date:	2013-07-24 01:19:25
Message-ID:	51EF2B9D.9090003@gmail.com

> I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.
>

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Quan Zongliang <quanzongliang(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: improve Chinese locale performance
Date:	2013-07-28 09:39:40
Message-ID:	20130728093940.GA5652@svana.org

On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
> I pretty much lost interest in ICU upon reading that they use UTF-16
> as their internal format.
>
> http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

The utf-8 support has been steadily improving:

For example, icu::Collator::compareUTF8() compares two utf-8 strings
incrementally, without converting all of the two strings to UTF-16 if
there is an early base letter difference.

http://userguide.icu-project.org/strings/utf-8

For all other encodings you should be able to use an iterator. As to
performance I have no idea.

The main issue with strxfrm() is its lame API. If it supported
returning prefixes you'd be set, but as it is you need >10MB of memory
just to transform a 10MB string, even if only the first few characters
would be enough to sort...
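
To show what that API limitation looks like in practice, here is a minimal sketch in C using only standard strxfrm(). The helper name make_sort_key() is hypothetical. The point is that the caller can ask how large the full key will be, but has no way to ask for just a prefix of it: the buffer has to hold the whole transformed string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build a strxfrm() sort key for the NUL-terminated string s.
 * Returns a malloc'd, NUL-terminated key (caller frees it) and stores
 * the key length, excluding the terminator, in *keylen.
 * Returns NULL if the allocation fails. */
char *make_sort_key(const char *s, size_t *keylen)
{
    assert(s != NULL && keylen != NULL);

    /* With n == 0 the destination may be NULL; the return value is the
     * length of the complete transformed string. */
    size_t needed = strxfrm(NULL, s, 0);

    char *key = malloc(needed + 1);
    if (key == NULL)
        return NULL;

    /* Second pass: actually transform the whole string. */
    strxfrm(key, s, needed + 1);
    *keylen = needed;
    return key;
}
```

Two keys produced this way can be compared with strcmp() and give the same ordering as strcoll() on the original strings. How much larger the key is than the input depends on the platform and locale.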

Mvg,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Quan Zongliang <quanzongliang(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: improve Chinese locale performance
Date:	2013-08-01 16:09:45
Message-ID:	CA+TgmoaaGa3HyYmMKFgc4m2Cps8Vv7L8534-89D_AaA5YS1CqA@mail.gmail.com

On Sun, Jul 28, 2013 at 5:39 AM, Martijn van Oosterhout
<kleptog(at)svana(dot)org> wrote:
> On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:
>> I pretty much lost interest in ICU upon reading that they use UTF-16
>> as their internal format.
>>
>> http://userguide.icu-project.org/strings#TOC-Strings-in-ICU
>
> The utf-8 support has been steadily improving:
>
> For example, icu::Collator::compareUTF8() compares two utf-8 strings
> incrementally, without converting all of the two strings to UTF-16 if
> there is an early base letter difference.
>
> http://userguide.icu-project.org/strings/utf-8
>
> For all other encodings you should be able to use an iterator. As to
> performance I have no idea.
>
> The main issue with strxfrm() is its lame API. If it supported
> returning prefixes you'd be set, but as it is you need >10MB of memory
> just to transform a 10MB string, even if only the first few characters
> would be enough to sort...

Yep, definitely. And by ">10MB" you mean ">90MB", at least on my Mac,
which is really outrageous.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Quan Zongliang <quanzongliang(at)gmail(dot)com>
To:	Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: improve Chinese locale performance
Date:	2013-09-05 03:02:47
Message-ID:	5227F457.8040108@gmail.com

On 07/23/2013 09:42 PM, Craig Ringer wrote:
> (Replying on phone, please forgive bad quoting)
>
> Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?
>
> I'd be interested in seeing the test data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.
>
I have a new idea.
Add a comparison-method column to pg_collation.
Each collation would have its own comparison function, or NULL.
When varstr_cmp() is called, if the specified collation
has a comparison function, call it instead of strcoll().
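
A minimal sketch of that dispatch in C, outside of PostgreSQL. Everything here is hypothetical: collation_entry stands in for a pg_collation row, collation_compare() stands in for the relevant part of varstr_cmp(), and reverse_bytewise_cmp() is a toy comparator that exists only to make the dispatch visible.

```c
#include <stddef.h>
#include <string.h>

/* Signature of a per-collation comparison function (hypothetical). */
typedef int (*collation_cmp_fn)(const char *a, const char *b);

/* Stand-in for a catalog row: a collation name plus an optional
 * comparison function. cmp == NULL means "no custom function". */
typedef struct {
    const char *collname;
    collation_cmp_fn cmp;
} collation_entry;

/* If the collation carries its own comparison function, call it;
 * otherwise fall back to the OS-provided strcoll(). */
int collation_compare(const collation_entry *coll,
                      const char *a, const char *b)
{
    if (coll != NULL && coll->cmp != NULL)
        return coll->cmp(a, b);
    return strcoll(a, b);
}

/* Toy comparator: reversed byte order, so its effect is easy to tell
 * apart from the strcoll() fallback. */
int reverse_bytewise_cmp(const char *a, const char *b)
{
    return strcmp(b, a);
}
```

An entry such as { "toy", reverse_bytewise_cmp } takes the custom path, while an entry with a NULL function pointer takes the strcoll() path.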

How about this?

Regards.

Quan Zongliang

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Quan Zongliang <quanzongliang(at)gmail(dot)com>
Cc:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: improve Chinese locale performance
Date:	2013-09-05 17:02:47
Message-ID:	CA+TgmoZ+2p-ObskiuNNW15E4fabEvphBmP7aysA8pWeMTbAB0Q@mail.gmail.com

On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang <quanzongliang(at)gmail(dot)com> wrote:
> I think of a new idea.
> Add a compare method column to pg_collation.
> Every collation has its own compare function or null.
> When function varstr_cmp is called, if specified collation
> has compare function, call it instead of strcoll().

I think we're going to need to have two kinds of collations:
OS-derived collations (which get all of their smarts from the OS), and
PG-internal collations (which use PG-aware code for everything).
Which I suspect is a bit more involved than what you're imagining, but
mixing and matching doesn't seem likely to end well.

However, what you're proposing might serve as a useful demonstration
of how much performance there is to be gained here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Quan Zongliang <quanzongliang(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: improve Chinese locale performance
Date:	2013-09-09 09:22:47
Message-ID:	522D9367.3070001@gmail.com

On 09/06/2013 01:02 AM, Robert Haas wrote:
> On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang <quanzongliang(at)gmail(dot)com> wrote:
>> I think of a new idea.
>> Add a compare method column to pg_collation.
>> Every collation has its own compare function or null.
>> When function varstr_cmp is called, if specified collation
>> has compare function, call it instead of strcoll().
>
> I think we're going to need to have two kinds of collations:
> OS-derived collations (which get all of their smarts from the OS), and
> PG-internal collations (which use PG-aware code for everything).
> Which I suspect is a bit more involved than what you're imagining, but
> mixing and matching doesn't seem likely to end well.
>
> However, what you're proposing might serve as a useful demonstration
> of how much performance there is to be gained here.
>
Understood.

I am just trying to speed up text comparison, not redesign locale support.

Do you have a plan to do this?

Thank you.

Quan Zongliang

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Quan Zongliang <quanzongliang(at)gmail(dot)com>
Cc:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: improve Chinese locale performance
Date:	2013-09-09 16:09:05
Message-ID:	CA+TgmoaL-st4deKhf-1zRRuDb6whPLjHrbOg05QU5NUE5yYPmA@mail.gmail.com

On Mon, Sep 9, 2013 at 5:22 AM, Quan Zongliang <quanzongliang(at)gmail(dot)com> wrote:
> Understood.
>
> I just try to speed up text compare, not redesign locale.
>
> Do you have a plan to do this?

Not any time soon, anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company