Re: fulltext parser strange behave

From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: fulltext parser strange behave
Date: 2007-11-06 22:13:13
Message-ID: 162867790711061413lf58dd58sd2047dca98dcf804@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

Hello

I am writing a tsearch2 wrapper and testing its functionality. I found
something a little strange in the default parser: it can't parse tags
that contain numbers:

test=# select * from parse('<h1>zluty kun se napil <b>zlute</b> vody</h2>');
 tokid | token
-------+-------
    12 | <
     3 | h1
    12 | >
     1 | zluty
    12 |
     1 | kun
    12 |
     1 | se
    12 |
     1 | napil
    12 |
    13 | <b>
     1 | zlute
    13 | </b>
    12 |
     1 | vody
    12 | <       <=====
    19 | /h2
    12 | >       <=====
(19 rows)

Is this correct?

Regards
Pavel Stehule


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Subject: Re: fulltext parser strange behave
Date: 2007-11-07 23:11:00
Message-ID: 11425.1194477060@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

"Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com> writes:
> I am writing a tsearch2 wrapper and testing its functionality. I found
> something a little strange in the default parser: it can't parse tags
> that contain numbers:

Well, the state machine definitely thinks that tag names should contain
only ASCII letters (with possibly a leading or trailing '/'). Given the
HTML examples I suppose we should allow non-first digits too. Is there
anything else that should be considered a tag? What about dash and
underscore for instance?

regards, tom lane
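
A minimal sketch of the rule described above, written as regular
expressions rather than as the real state machine. The names OLD_TAG,
NEW_TAG and is_tag are illustrative only, and attributes inside tags are
deliberately ignored:

```python
import re

# Approximation of the rule as described: ASCII letters only, with an
# optional leading or trailing '/'. This is not the actual parser code.
OLD_TAG = re.compile(r"</?[A-Za-z]+/?>")

# The relaxation floated in the message: digits allowed after the first
# character of the name.
NEW_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*/?>")


def is_tag(pattern, text):
    """Return True when the whole string is one tag under `pattern`."""
    return pattern.fullmatch(text) is not None


for sample in ["<b>", "</b>", "<br/>", "<h1>", "</h2>"]:
    print(f"{sample!r:10} old={is_tag(OLD_TAG, sample)} new={is_tag(NEW_TAG, sample)}")
```

Under these assumptions the letters-only pattern rejects <h1> and </h2>,
which is consistent with those tags being split into separate tokens in
the reported output.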


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: fulltext parser strange behave
Date: 2007-11-07 23:38:33
Message-ID: 47324C79.3090400@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com> writes:
>
>> I am writing a tsearch2 wrapper and testing its functionality. I found
>> something a little strange in the default parser: it can't parse tags
>> that contain numbers:
>>
>
> Well, the state machine definitely thinks that tag names should contain
> only ASCII letters (with possibly a leading or trailing '/'). Given the
> HTML examples I suppose we should allow non-first digits too. Is there
> anything else that should be considered a tag? What about dash and
> underscore for instance?
>
>
>

The docs say we specifically accept HTML tags. Are we really just
accepting anything that is a string of ASCII letters as the tag name?
Then we should adjust the docs. <foo> and <foo1234> are not HTML tags.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: fulltext parser strange behave
Date: 2007-11-08 01:11:37
Message-ID: 12819.1194484297@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> Well, the state machine definitely thinks that tag names should contain
>> only ASCII letters (with possibly a leading or trailing '/'). Given the
>> HTML examples I suppose we should allow non-first digits too. Is there
>> anything else that should be considered a tag? What about dash and
>> underscore for instance?

> The docs say we specifically accept HTML tags. Are we really just
> accepting anything that is a string of ASCII letters as the tag name?
> Then we should adjust the docs. <foo> and <foo1234> are not HTML tags.

I don't think I want to try to maintain a list of exactly which
identifiers are considered valid tag names ... and if I did, I wouldn't
put it into the parser. It would be a dictionary's job to tell valid
from invalid tag names, no?

regards, tom lane
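
The division of labour suggested here can be sketched roughly as
follows. Everything in it is hypothetical: TAG_SHAPE stands in for the
parser, KNOWN_TAGS is a deliberately tiny example list, and neither
reflects how PostgreSQL dictionaries are actually implemented:

```python
import re

# Hypothetical "parser" step: recognise anything tag-shaped without
# judging whether the name is a real HTML element.
TAG_SHAPE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)/?>")

# Hypothetical "dictionary" step: an illustrative, incomplete name list.
KNOWN_TAGS = {"b", "h1", "h2", "p"}


def tag_name(token):
    """Parser side: return the lowercased tag name, or None if not tag-shaped."""
    match = TAG_SHAPE.fullmatch(token)
    return match.group(1).lower() if match else None


def accepted_by_dictionary(token):
    """Dictionary side: keep only tag-shaped tokens whose name is known."""
    name = tag_name(token)
    return name is not None and name in KNOWN_TAGS


print(tag_name("<foo1234>"), accepted_by_dictionary("<foo1234>"))
print(tag_name("</h2>"), accepted_by_dictionary("</h2>"))
```

The point of the split is that the list of valid names can change
without touching the tokenizing rule.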


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: fulltext parser strange behave
Date: 2007-11-08 02:01:57
Message-ID: 47326E15.8000702@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> Well, the state machine definitely thinks that tag names should contain
>>> only ASCII letters (with possibly a leading or trailing '/'). Given the
>>> HTML examples I suppose we should allow non-first digits too. Is there
>>> anything else that should be considered a tag? What about dash and
>>> underscore for instance?
>>>
>
>
>> The docs say we specifically accept HTML tags. Are we really just
>> accepting anything that is a string of ASCII letters as the tag name?
>> Then we should adjust the docs. <foo> and <foo1234> are not HTML tags.
>>
>
> I don't think I want to try to maintain a list of exactly which
> identifiers are considered valid tag names ... and if I did, I wouldn't
> put it into the parser. It would be a dictionary's job to tell valid
> from invalid tag names, no?
>
>
>

I don't have a quarrel with that. But then we should be more clear about
what we are recognizing. We could describe the thing as an HTML-like
tag, possibly. I think the same probably goes for entities too.

cheers

andrew


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fulltext parser strange behave
Date: 2007-11-08 05:01:29
Message-ID: Pine.LNX.4.64.0711080758320.31840@sn.sai.msu.ru
Lists: pgsql-hackers pgsql-patches

On Wed, 7 Nov 2007, Tom Lane wrote:

> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> Tom Lane wrote:
>>> Well, the state machine definitely thinks that tag names should contain
>>> only ASCII letters (with possibly a leading or trailing '/'). Given the
>>> HTML examples I suppose we should allow non-first digits too. Is there
>>> anything else that should be considered a tag? What about dash and
>>> underscore for instance?
>
>> The docs say we specifically accept HTML tags. Are we really just
>> accepting anything that is a string of ASCII letters as the tag name?
>> Then we should adjust the docs. <foo> and <foo1234> are not HTML tags.
>
> I don't think I want to try to maintain a list of exactly which
> identifiers are considered valid tag names ... and if I did, I wouldn't
> put it into the parser. It would be a dictionary's job to tell valid
> from invalid tag names, no?

It would be nice for the dictionary to know the parser state, but I think
that is too much knowledge for a dictionary, and the only possibility is to
let <foo1234> pass through to the dictionary. Currently we get three
separate tokens.

>
> regards, tom lane
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: fulltext parser strange behave
Date: 2007-11-08 20:11:44
Message-ID: 47336D80.5000401@dunslane.net
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
>
>
> Tom Lane wrote:
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>
>>> Tom Lane wrote:
>>>
>>>> Well, the state machine definitely thinks that tag names should
>>>> contain
>>>> only ASCII letters (with possibly a leading or trailing '/').
>>>> Given the
>>>> HTML examples I suppose we should allow non-first digits too. Is
>>>> there
>>>> anything else that should be considered a tag? What about dash and
>>>> underscore for instance?
>>>>
>>
>>
>>> The docs say we specifically accept HTML tags. Are we really just
>>> accepting anything that is a string of ASCII letters as the tag
>>> name? Then we should adjust the docs. <foo> and <foo1234> are not
>>> HTML tags.
>>>
>>
>> I don't think I want to try to maintain a list of exactly which
>> identifiers are considered valid tag names ... and if I did, I wouldn't
>> put it into the parser. It would be a dictionary's job to tell valid
>> from invalid tag names, no?
>>
>>
>>
>
> I don't have a quarrel with that. But then we should be more clear
> about what we are recognizing. We could describe the thing as an
> HTML-like tag, possibly. I think the same probably goes for entities too.
>
>
I've just been looking at the state machine in wparser_def.c. I think
the processing for entities is also a few bob short in the pound. It
recognises decimal numeric character references, but not hexadecimal
numeric character references. That's fairly silly since the HTML spec
specifically says the latter are "particularly useful". The rules for
named entities are also deficient w.r.t. digits, just like the case of
tags that Tom noticed. This isn't academic: HTML features a number of
named entities with digits in the name (sup2, frac14 for example).
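
A rough illustration of the two gaps, using regular expressions as a
stand-in for the state machine. The pattern names are made up for this
sketch, and the letters-only and decimal-only patterns are assumptions
about the effect of the current rules, not a transcription of them:

```python
import re

# Decimal-only numeric character references, e.g. &#233;
DECIMAL_REF = re.compile(r"&#[0-9]+;")

# Decimal or hexadecimal references, e.g. &#233; or &#xE9;
NUMERIC_REF = re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")

# Named entities: letters only, versus names with non-first digits.
NAMED_LETTERS_ONLY = re.compile(r"&[A-Za-z]+;")
NAMED_WITH_DIGITS = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")

for text in ["&#233;", "&#xE9;", "&amp;", "&sup2;", "&frac14;"]:
    print(
        text,
        bool(DECIMAL_REF.fullmatch(text)),
        bool(NUMERIC_REF.fullmatch(text)),
        bool(NAMED_LETTERS_ONLY.fullmatch(text)),
        bool(NAMED_WITH_DIGITS.fullmatch(text)),
    )
```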

In XML at least, legal names are defined by the following rules from the
spec:

[4]  NameStartChar ::= ":" | [A-Z] | "_" | [a-z] |
                       [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
                       [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
                       [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
                       [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar      ::= NameStartChar | "-" | "." | [0-9] |
                       #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]  Name          ::= NameStartChar (NameChar)*

Restricting this to ASCII, we get:

[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9]
[5] Name ::= NameStartChar (NameChar)*

or this regex for Name:

[A-Za-z:_][A-Za-z0-9:_.-]*

I suggest we use that or something very close to it as the rule for
names in these patterns.
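
The ASCII-only rule can be tried out directly. This is only a quick
check of the regex itself; NAME and is_name are names chosen for the
sketch, and the surrounding '<', '/', '&' and ';' are assumed to be
handled outside the name:

```python
import re

# The ASCII-restricted Name rule quoted above, as a Python pattern.
NAME = re.compile(r"[A-Za-z:_][A-Za-z0-9:_.-]*")


def is_name(candidate):
    """Return True when the whole string is a legal ASCII Name."""
    return NAME.fullmatch(candidate) is not None


for candidate in ["h1", "h2", "sup2", "frac14", "xsl:template", "my-tag", "1h", "-x"]:
    print(candidate, is_name(candidate))
```

Digits, dashes and dots are accepted anywhere except the first
position, so h1, sup2 and frac14 pass while 1h and -x do not.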

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: fulltext parser strange behave
Date: 2007-11-09 18:53:53
Message-ID: 13471.1194634433@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> I've just been looking at the state machine in wparser_def.c. I think
> the processing for entities is also a few bob short in the pound. It
> recognises decimal numeric character references, but not hexadecimal
> numeric character references. That's fairly silly since the HTML spec
> specifically says the latter are "particularly useful". The rules for
> named entities are also deficient w.r.t. digits, just like the case of
> tags that Tom noticed. This isn't academic: HTML features a number of
> named entities with digits in the name (sup2, frac14 for example).

> In XML at least, legal names are defined by the following rules from the
> spec:
> ...
> [A-Za-z:_][A-Za-z0-9:_.-]*

> I suggest we use that or something very close to it as the rule for
> names in these patterns.

No objections here. Who wants to patch wparser_def?

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: fulltext parser strange behave
Date: 2007-11-13 19:42:18
Message-ID: 4739FE1A.3090508@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> I've just been looking at the state machine in wparser_def.c. I think
>> the processing for entities is also a few bob short in the pound. It
>> recognises decimal numeric character references, but not hexadecimal
>> numeric character references. That's fairly silly since the HTML spec
>> specifically says the latter are "particularly useful". The rules for
>> named entities are also deficient w.r.t. digits, just like the case of
>> tags that Tom noticed. This isn't academic: HTML features a number of
>> named entities with digits in the name (sup2, frac14 for example).
>>
>
>
>> In XML at least, legal names are defined by the following rules from the
>> spec:
>> ...
>> [A-Za-z:_][A-Za-z0-9:_.-]*
>>
>
>
>> I suggest we use that or something very close to it as the rule for
>> names in these patterns.
>>
>
> No objections here. Who wants to patch wparser_def?
>
>
>

I can get to it some time in the next week; I'm rather snowed under right now.

BTW, I'm also suspicious of the clause that allows <?xml ... it appears
that it will allow <?xfoo and <?XFOO also, which seems quite odd,
especially the latter.

cheers

andrew
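
To make the concern concrete, here is a sketch under a stated
assumption: that the clause behaves like "<?" followed by x or X and
then any run of letters. That reading is inferred from the examples
given, not from the code, and STRICT_PI is just one possible tightening:

```python
import re

# Assumed effect of the current clause: "<?" + x or X + any letters.
LOOSE_PI = re.compile(r"<\?[xX][A-Za-z]*")

# One stricter alternative: only the literal lowercase target "xml",
# not followed by further name characters.
STRICT_PI = re.compile(r"<\?xml(?![A-Za-z0-9:_.-])")

for text in ['<?xml version="1.0"?>', "<?xfoo", "<?XFOO", "<?xmlfoo"]:
    print(text, bool(LOOSE_PI.match(text)), bool(STRICT_PI.match(text)))
```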


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-19 13:31:43
Message-ID: 4741903F.50006@dunslane.net
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
>
>
> Tom Lane wrote:
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>
>>> I've just been looking at the state machine in wparser_def.c. I
>>> think the processing for entities is also a few bob short in the
>>> pound. It recognises decimal numeric character references, but not
>>> hexadecimal numeric character references. That's fairly silly since
>>> the HTML spec specifically says the latter are "particularly
>>> useful". The rules for named entities are also deficient w.r.t.
>>> digits, just like the case of tags that Tom noticed. This isn't
>>> academic: HTML features a number of named entities with digits in
>>> the name (sup2, frac14 for example).
>>>
>>
>>
>>> In XML at least, legal names are defined by the following rules from
>>> the spec:
>>> ...
>>> [A-Za-z:_][A-Za-z0-9:_.-]*
>>>
>>
>>
>>> I suggest we use that or something very close to it as the rule for
>>> names in these patterns.
>>>
>>
>> No objections here. Who wants to patch wparser_def?
>>
>>
>>
>
>
> I can get to it some time in the next week; I'm rather snowed under
> right now.
>
> BTW, I'm also suspicious of the clause that allows <?xml ... it
> appears that it will allow <?xfoo and <?XFOO also, which seems quite
> odd, especially the latter.
>

Here's a patch that fixes the patterns for numeric entities, tag names,
and removes the upper case 'X' case in the special case for an XML
prolog. There are still some oddities, but I decided against making
heroic efforts to fix them. It's probably less important if the patterns
are slightly too liberal (e.g. accepting <a href="qwe<qwe>"> ) than if
they don't recognize what they are alleged to recognize.

cheers

andrew

Attachment    Content-Type    Size
tsfix.patch   text/x-patch    8.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-19 15:58:34
Message-ID: 391.1195487914@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Here's a patch that fixes the patterns for numeric entities, tag names,
> and removes the upper case 'X' case in the special case for an XML
> prolog. There are still some oddities, but I decided against making
> heroic efforts to fix them. It's probably less important if the patterns
> are slightly too liberal (e.g. accepting <a href="qwe<qwe>"> ) than if
> they don't recognize what they are alleged to recognize.

I don't approve of the changes to the exposed token type names, but
the state machine changes seem sane first-glance.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-19 16:18:41
Message-ID: 4741B761.7080700@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> I don't approve of the changes to the exposed token type names, but
> the state machine changes seem sane first-glance.
>
>
>

Well, I think it's just plain wrong to describe as HTML tags and
entities things that just aren't. In any case, what I changed was not
the name (or alias, to be more precise), but the exposed description.
The aliases (tag, entity) would remain the same.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-19 16:39:10
Message-ID: 1618.1195490350@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> I don't approve of the changes to the exposed token type names, but
>> the state machine changes seem sane first-glance.

> Well, I think it's just plain wrong to describe as HTML tags and
> entities things that just aren't.

Maybe, but "HTML-type" is an unhelpful description. Isn't there a more
general markup standard that subsumes both HTML and XML? (I seem to
recall that SGML might be that, but not sure.)

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-19 16:49:35
Message-ID: 4741BE9F.6040403@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> I don't approve of the changes to the exposed token type names, but
>>> the state machine changes seem sane first-glance.
>>>
>
>
>> Well, I think it's just plain wrong to describe as HTML tags and
>> entities things that just aren't.
>>
>
> Maybe, but "HTML-type" is an unhelpful description. Isn't there a more
> general markup standard that subsumes both HTML and XML? (I seem to
> recall that SGML might be that, but not sure.)
>
>
>

Most people haven't heard of SGML. I'd settle for "XML tag" or maybe
"XML/HTML tag".

Any other bids?

cheers

andrew


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-19 17:00:52
Message-ID: 20071119090052.3157088a@commandprompt.com
Lists: pgsql-hackers pgsql-patches

On Mon, 19 Nov 2007 11:49:35 -0500
Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> >
>
> Most people haven't heard of SGML. I'd settle for "XML tag" or maybe
> "XML/HTML tag".
>
> Any other bids?

XML tag is fine, imo.

Sincerely,

Joshua D. Drake

>
> cheers
>
> andrew
>
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997 http://www.commandprompt.com/
UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-patches(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-19 17:50:53
Message-ID: 200711191850.53598.peter_e@gmx.net
Lists: pgsql-hackers pgsql-patches

Am Montag, 19. November 2007 schrieb Tom Lane:
> Maybe, but "HTML-type" is an unhelpful description. Isn't there a more
> general markup standard that subsumes both HTML and XML? (I seem to
> recall that SGML might be that, but not sure.)

I think "XML tag" would actually cover anything that would be valid as an HTML
tag. (This holds for tags even though XML documents as a whole are not a
superset of HTML documents.)

SGML might be too broad. It would require us to recognize "</>" and "<>" and
perhaps a few other odd things.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-patches(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-19 18:37:48
Message-ID: 3218.1195497468@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Am Montag, 19. November 2007 schrieb Tom Lane:
>> Maybe, but "HTML-type" is an unhelpful description. Isn't there a more
>> general markup standard that subsumes both HTML and XML? (I seem to
>> recall that SGML might be that, but not sure.)

> I think "XML tag" would actually cover anything that would be valid as an HTML
> tag.

+1 for "XML tag", then.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-patches(at)postgresql(dot)org, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: [HACKERS] fulltext parser strange behave
Date: 2007-11-20 02:31:06
Message-ID: 474246EA.2090706@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
>
>> Am Montag, 19. November 2007 schrieb Tom Lane:
>>
>>> Maybe, but "HTML-type" is an unhelpful description. Isn't there a more
>>> general markup standard that subsumes both HTML and XML? (I seem to
>>> recall that SGML might be that, but not sure.)
>>>
>
>
>> I think "XML tag" would actually cover anything that would be valid as an HTML
>> tag.
>>
>
> +1 for "XML tag", then.
>
>
>

Changed to XML tag and XML entity. Code names adjusted accordingly.
Committed.

cheers

andrew