Lists: | pgsql-bugs |
---|
From: | PG Bug reporting form <noreply(at)postgresql(dot)org> |
---|---|
To: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
Cc: | nikolajs(dot)arhipovs(at)gmail(dot)com |
Subject: | BUG #15488: Unexpected behaviour of to_tsverctor and ts_query |
Date: | 2018-11-06 12:51:58 |
Message-ID: | 15488-a738bcc48f1af087@postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
The following bug has been logged on the website:
Bug reference: 15488
Logged by: Nikolajs Arhipovs
Email address: nikolajs(dot)arhipovs(at)gmail(dot)com
PostgreSQL version: 9.6.8
Operating system: Linux, kernel 4.18.16-arch1-1-ARCH
Description:
I am trying to do prefix search of emails using full text search. Emails
sometimes contain dot symbols.
Now, select to_tsquery('simple', 'thing.a:*'); returns a single term as
expected but
select to_tsquery('simple', 'the.thing.a:*') suddenly tokenizes my search
expression into 'the.thing':* & 'a':*. At the same time select
to_tsquery('simple', 'the.thing.aa:*'); tokenizes as expected into a single
term.
Also select to_tsquery('simple', 'the(dot)thing(at)gmail(dot)com'); returns a single
term as expected, but select to_tsquery('simple', 'the(dot)thing(at)gma:*'); splits
on @ char and returns two tokens.
This behavior is not covered in the official documentation.
From: | Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> |
---|---|
To: | nikolajs(dot)arhipovs(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query |
Date: | 2018-11-07 11:20:55 |
Message-ID: | dc4c5f92-0eeb-2d5e-32af-653b80816c31@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
Hello,
On 06.11.2018 15:51, PG Bug reporting form wrote:
> I am trying to do prefix search of emails using full text search. Emails
> sometimes contain dot symbols.
> Now, select to_tsquery('simple', 'thing.a:*'); returns a single term as
> expected but
> select to_tsquery('simple', 'the.thing.a:*') suddenly tokenizes my search
> expression into 'the.thing':* & 'a':*. At the same time select
> to_tsquery('simple', 'the.thing.aa:*'); tokenizes as expected into a single
> term.
> Also select to_tsquery('simple', 'the(dot)thing(at)gmail(dot)com'); returns a single
> term as expected, but select to_tsquery('simple', 'the(dot)thing(at)gma:*'); splits
> on @ char and returns two tokens.
> This behavior is not covered in the official documentation.
Yeah, it seems strange, but full text search parser thinks that each
label of a host should have at least two characters.
In first example it thinks that it is a file name:
=# select * from ts_debug('simple', 'thing.a:*');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------------+---------+--------------+------------+-----------
file | File or path name | thing.a | {simple} | simple |
{thing.a}
blank | Space symbols | :* | {} | (null) | (null)
Next, it thinks that only first two label are only part of host name:
=# select * from ts_debug('simple', 'the.thing.a:*');
alias | description | token | dictionaries | dictionary |
lexemes
-----------+-----------------+-----------+--------------+------------+-------------
host | Host | the.thing | {simple} | simple |
{the.thing}
blank | Space symbols | . | {} | (null) |
(null)
asciiword | Word, all ASCII | a | {simple} | simple | {a}
blank | Space symbols | :* | {} | (null) |
(null)
And next is host name:
=# select * from ts_debug('simple', 'the.thing.aa:*');
alias | description | token | dictionaries | dictionary |
lexemes
-------+---------------+--------------+--------------+------------+----------------
host | Host | the.thing.aa | {simple} | simple |
{the.thing.aa}
blank | Space symbols | :* | {} | (null) | (null)
It seems that the RFC doesn't have such restriction, if I'm not
mistaken. See:
https://tools.ietf.org/html/rfc1034#section-3.1
And same for emails:
=# select * from ts_debug('simple', 'the(dot)thing(at)gmail(dot)com');
alias | description | token | dictionaries |
dictionary | lexemes
-------+---------------+---------------------+--------------+------------+-----------------------
email | Email address | the(dot)thing(at)gmail(dot)com | {simple} | simple
| {the(dot)thing(at)gmail(dot)com}
=# select * from ts_debug('simple', 'the(dot)thing(at)gmail(dot)c:*');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------------+-----------+--------------+------------+-------------
host | Host | the.thing | {simple} | simple |
{the.thing}
blank | Space symbols | @ | {} | (null) | (null)
file | File or path name | gmail.c | {simple} | simple |
{gmail.c}
blank | Space symbols | :* | {} | (null) | (null)
=# select * from ts_debug('simple', 'the(dot)thing(at)gma:*');
alias | description | token | dictionaries | dictionary |
lexemes
-----------+-----------------+-----------+--------------+------------+-------------
host | Host | the.thing | {simple} | simple |
{the.thing}
blank | Space symbols | @ | {} | (null) |
(null)
asciiword | Word, all ASCII | gma | {simple} | simple |
{gma}
blank | Space symbols | :* | {} | (null) |
(null)
It seems that it can be easily fixed, but I'm not sure that it won't
break something.
If you have such host names only in prefix queries (not in documents
itself) I may suggest the following workaround:
=# select to_tsvector('simple', 'the(dot)thing(at)gmail(dot)com') @@
'the(dot)thing(at)gma:*'::tsquery;
?column?
----------
t
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company