Quick Links

Re: Assert failure with ICU support

Lists:	pgsql-bugs

From:	Richard Guo <guofenglinux(at)gmail(dot)com>
To:	PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Assert failure with ICU support
Date:	2023-04-19 10:30:10
Message-ID:	CAMbWs49Q6UoKGeT8pBkMtJGJd+16CBFZaaWUk9Du+2ERE5g_YA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	Postg토토 사이트 순위SQL : Postg토토 사이트 순위SQL 메일 링리스트 : 2023-04-19 이후 PGSQL-BUGS.

I happened to run into an assert failure by chance with ICU support.
Here is the query:

SELECT '1' SIMILAR TO '\൧';

The failure happens in lexescape(),

default:
assert(iscalpha(c));
FAILW(REG_EESCAPE); /* unknown alphabetic escape */
break;

Without ICU support, the same query just gives an error.

# SELECT '1' SIMILAR TO '\൧';
ERROR: invalid regular expression: invalid escape \ sequence

FWIW, I googled a bit and '൧' seems to be number 1 in Malayalam.

Thanks
Richard

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Richard Guo <guofenglinux(at)gmail(dot)com>
Cc:	PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: Assert failure with ICU support
Date:	2023-04-19 15:42:20
Message-ID:	2983095.1681918940@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

Richard Guo <guofenglinux(at)gmail(dot)com> writes:
> I happened to run into an assert failure by chance with ICU support.
> Here is the query:

> SELECT '1' SIMILAR TO '\൧';

> The failure happens in lexescape(),

> default:
> assert(iscalpha(c));
> FAILW(REG_EESCAPE); /* unknown alphabetic escape */
> break;

> Without ICU support, the same query just gives an error.

Interesting.

> FWIW, I googled a bit and '൧' seems to be number 1 in Malayalam.

The code in lexescape() is assuming that if "c" passes
iscalnum(), then either it's '0'-'9' or it passes iscalpha().
This is clearly wrong in Unicode-land, which has non-ASCII digits.
I imagine you can find libc locales where this fails, not only ICU.

I think the question here is what we want to do with such cases:
throw a regex syntax error, or just return the character as-is?
The fine manual says that if the character after '\' is
alphanumeric, it's an escape, and otherwise the character is
quoted literally. But how shall we interpret "alphanumeric"?

I'm kind of inclined to the idea that anything that's not ASCII
should be considered to be literally quoted by '\', rather than
being an erroneous regex escape. Maybe I'm too English-centric.
But I don't like the idea that what is a valid regex should vary
depending on locale.

regards, tom lane

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Richard Guo <guofenglinux(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: Assert failure with ICU support
Date:	2023-04-19 20:31:17
Message-ID:	ece303a3068026c6d6f10cab2a7f3f0836f4adb7.camel@j-davis.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

On Wed, 2023-04-19 at 18:30 +0800, Richard Guo wrote:
> I happened to run into an assert failure by chance with ICU support.
> Here is the query:
>
> SELECT '1' SIMILAR TO '\൧';
>
> The failure happens in lexescape(),
>
> default:
> assert(iscalpha(c));
> FAILW(REG_EESCAPE); /* unknown alphabetic escape */
> break;
>
> Without ICU support, the same query just gives an error.
>
> # SELECT '1' SIMILAR TO '\൧';
> ERROR: invalid regular expression: invalid escape \ sequence
>
> FWIW, I googled a bit and '൧' seems to be number 1 in Malayalam.

Thank you for the report and analysis! The problem exists all the way
back if you do:

SELECT '1' COLLATE "en-US-x-icu" SIMILAR TO '\൧';

The root cause (which you allude to) is that the code makes the
assumption that digits only include 0-9, but u_isdigit('൧') == true,
which violates that assumption.

For Linux[1] specifically, it seems that the assumption should hold for
iswdigit(). But looking here[2], it seems that the behavior of
iswdigit() depends on the locale and I don't think it's correct to make
that assumption.

I did some experimentation on ICU and I found (pseudocode -- the real
code needs to create a UChar32 from an encoded string first):

char name: MALAYALAM DIGIT ONE
u_isalnum('൧'): true
u_isalpha('൧'): false
u_isdigit('൧'): true
u_charType('൧') == U_DECIMAL_DIGIT_NUMBER: true
u_hasBinaryProperty('൧', UCHAR_POSIX_XDIGIT): true
u_hasBinaryProperty('൧', UCHAR_POSIX_ALNUM): true

The docs[3] for ICU say:

"There are also functions that provide easy migration from C/POSIX
functions like isblank(). Their use is generally discouraged because
the C/POSIX standards do not define their semantics beyond the ASCII
range, which means that different implementations exhibit very
different behavior. Instead, Unicode properties should be used
directly."

We should probably just check that it's plain ASCII.

Unfortunately I would not be surprised if there are more bugs similar
to this one.

Regards,
Jeff Davis

[1] https://man7.org/linux/man-pages/man3/iswdigit.3.html
[2]
https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswdigit.html
[3]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Richard Guo <guofenglinux(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: Assert failure with ICU support
Date:	2023-04-19 20:45:48
Message-ID:	3017542.1681937148@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> We should probably just check that it's plain ASCII.

That was about the same conclusion I came to. More or less

c = *v->now++;
- if (!iscalnum(c))
+ if (!isascii(c) || !iscalnum(c))
RETV(PLAIN, c);

although I'm not sure if it's a good idea to apply isascii()
to something that's wider than char. It might be advisable,
if ugly, to write

+ if (c >= 0x100 || !iscalnum(c))

I'm also inclined to remove that "assert(iscalpha(c))" in
the switch. That's not checking sanity of our own code,
just consistency of the <wctype.h> functions; and as this
episode shows it's more trouble than it's worth.

regards, tom lane

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Richard Guo <guofenglinux(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: Assert failure with ICU support
Date:	2023-04-20 19:47:21
Message-ID:	c480b495be592d600504b99e70745ed8f43c877b.camel@j-davis.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

On Wed, 2023-04-19 at 16:45 -0400, Tom Lane wrote:
> + if (c >= 0x100 || !iscalnum(c))

I'm curious why you say >= 0x100 rather than >= 0x80?

What's the purpose of the error? Is it to catch mistakes, or is it to
reserve room for adding new escape sequences in the future?

Regards,
Jeff Davis

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Richard Guo <guofenglinux(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: Assert failure with ICU support
Date:	2023-04-20 19:53:01
Message-ID:	3170490.1682020381@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	Postg토토 핫SQL : Postg토토 핫SQL 메일 링리스트 : 2023-04-20 이후 PGSQL-BUGS 19:53

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Wed, 2023-04-19 at 16:45 -0400, Tom Lane wrote:
>> + if (c >= 0x100 || !iscalnum(c))

> I'm curious why you say >= 0x100 rather than >= 0x80?

Right, should be 0x80, my thinko.

> What's the purpose of the error? Is it to catch mistakes, or is it to
> reserve room for adding new escape sequences in the future?

As I read it, it's meant to leave room for defining more escapes.
If we allowed \x for any non-currently-defined "x" to just be "x",
then there would be a compatibility problem if we wanted to make
it mean something else. But I think it's sufficient to reserve
the ASCII letters for that purpose.

regards, tom lane

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Richard Guo <guofenglinux(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: Assert failure with ICU support
Date:	2023-04-20 23:30:07
Message-ID:	fe9c75bafe422b9c672a037c8fa781cbcfef488d.camel@j-davis.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

On Thu, 2023-04-20 at 15:53 -0400, Tom Lane wrote:
> As I read it, it's meant to leave room for defining more escapes.
> If we allowed \x for any non-currently-defined "x" to just be "x",
> then there would be a compatibility problem if we wanted to make
> it mean something else. But I think it's sufficient to reserve
> the ASCII letters for that purpose.

Sounds good, patch attached.

Regards,
Jeff Davis

Attachment	Content-Type	Size
0001-Avoid-character-classification-in-regex-escape-parsi.patch	text/x-patch	1.6 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Richard Guo <guofenglinux(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: Assert failure with ICU support
Date:	2023-04-20 23:31:18
Message-ID:	3235932.1682033478@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	Postg토토 꽁 머니SQL : Postg토토 꽁 머니SQL 메일 링리스트 : 2023-04-20 이후 PGSQL-BUGS 23:31

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> Sounds good, patch attached.

WFM.

regards, tom lane