TSearch2 / Get all unique lexems

Lists:	pgsql-general

From:	Hannes Dorbath <light(at)theendofthetunnel(dot)de>
To:	pgsql-general(at)postgresql(dot)org
Subject:	TSearch2 / Get all unique lexems
Date:	2005-12-07 13:57:38
Message-ID:	dn6pm0$206b$1@news.hub.org

Is there a way to get all unique lexemes from a table with a tsvector
column? The stat() function does this (and more), but I cannot use it.

Thanks

--
Regards,
Hannes Dorbath

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Hannes Dorbath <light(at)theendofthetunnel(dot)de>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: TSearch2 / Get all unique lexems
Date:	2005-12-07 15:13:07
Message-ID:	Pine.GSO.4.63.0512071812290.13553@ra.sai.msu.su

On Wed, 7 Dec 2005, Hannes Dorbath wrote:

> Is there a way to get all unique lexemes from a table with a tsvector column?
> The stat() function does this (and more), but I cannot use it.

Hmm, you could dump the tsvector column and use awk+sort+uniq.
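
The same dump-and-deduplicate idea can be sketched in Python instead of awk+sort+uniq. This is only a minimal sketch under stated assumptions: the column has been dumped one tsvector per line in the text form 'lexeme':positions, embedded single quotes are doubled, and backslash escapes are not handled. The name unique_lexemes is hypothetical and not part of Tsearch2.

```python
import re

# Matches one single-quoted lexeme; a doubled quote ('') inside the
# quotes is treated as an escaped quote. (Assumed dump format.)
LEXEME_RE = re.compile(r"'((?:[^']|'')*)'")


def unique_lexemes(lines):
    """Return the sorted set of lexemes found in dumped tsvector lines."""
    seen = set()
    for line in lines:
        for raw in LEXEME_RE.findall(line):
            # Undo the quote doubling; positions and weights are ignored.
            seen.add(raw.replace("''", "'"))
    return sorted(seen)


if __name__ == "__main__":
    dump = ["'cat':3 'fat':2,11", "'cat':1 'rat':12A"]
    for lexeme in unique_lexemes(dump):
        print(lexeme)
```

Because the dump is read outside the database, no text conversion happens on the server side; any broken multibyte sequences still have to be dealt with when the file is decoded.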

>
> Thanks
>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

From:	Hannes Dorbath <light(at)theendofthetunnel(dot)de>
To:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject:	Re: TSearch2 / Get all unique lexems
Date:	2005-12-08 08:50:28
Message-ID:	4397F3D4.8080004@theendofthetunnel.de

On 07.12.2005 16:13, Oleg Bartunov wrote:
> hmm, you could dump tsvector column and use awk+sort+uniq

Thanks. I was hoping for something possible inside a pl/pgsql proc. I'm
trying to integrate pg_trgm with Tsearch2. I'm still on my UTF-8
database. Yes, I know, there is _NO_ UTF-8 support of any kind in
Tsearch2 yet, but I got it working to a degree that is OK for my
application (created my own stemmer variant, ispell dict, affix file,
etc.). The last missing bit is to get a source of words for pg_trgm. I
cannot use the stat() function, because it breaks as soon as it sees a
UTF-8 char.

I thought of using lexize(), casting the text array to rows somehow, writing
it to a temp table, and using SELECT DISTINCT, but I haven't had any success yet.

--
Regards,
Hannes Dorbath

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Hannes Dorbath <light(at)theendofthetunnel(dot)de>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-general(at)postgresql(dot)org
Subject:	Re: TSearch2 / Get all unique lexems
Date:	2005-12-08 11:00:55
Message-ID:	43981267.6090802@sigaev.ru

> Thanks. I hoped for something possible inside a pl/pgsql proc. I'm
> trying to integrate pg_trgm with Tsearch2. I'm still on my utf-8
> database. Yes I know, there is _NO_ utf-8 support of any kind in
> Tsearch2 yet, but I got it working to a degree that is OK for my
> application (Created my own stemmer variant, ispell dict, affix file
> etc). The last missing bit is to get a source for pg_trgm. I cannot use
> the stat() function, because it breaks as soon as it sees a UTF-8 char.

I suppose a word parser that is not UTF-compatible can produce illegal lexemes
(containing only part of a multibyte char) and store them in the tsvector.
Tsvector does not validate lexemes in any way (for example, with a
pg_verifymbstr() call), but stat() builds a text field, and Postgres then
checks it and finds the incomplete multibyte chars. The ways I see (other than
waiting for the UTF support in tsearch2 that we are developing now):

1 Modify the stat() function to check the text field and, if the check fails,
remove the lexeme from the output.

2 Take the word parser from CVS HEAD (ts_locale.[ch], wparser_def.c,
wordparser/parser.[ch]). to_tsvector will work fine; to_tsquery will work
correctly only with quoted strings (for example, 'foo' & 'bar' is good;
foo & bar is bad). But casting 'asasas'::tsvector and dump/reload will not
work correctly.
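
The filtering idea in option 1 can be illustrated outside the database. The following is only a sketch in Python, not the stat() code itself: it assumes the lexemes are available as raw byte strings and uses a strict UTF-8 decode as a stand-in for the kind of check pg_verifymbstr() performs. The name valid_utf8_lexemes is hypothetical.

```python
def valid_utf8_lexemes(raw_lexemes):
    """Keep only the byte strings that decode as complete, valid UTF-8."""
    kept = []
    for raw in raw_lexemes:
        try:
            kept.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            # Truncated or otherwise invalid multibyte sequence:
            # drop the lexeme instead of failing the whole run.
            continue
    return kept


if __name__ == "__main__":
    # The second item ends in the middle of a two-byte character.
    samples = [b"foo", b"na\xc3", b"na\xc3\xafve"]
    print(valid_utf8_lexemes(samples))
```

The trade-off is the same as in option 1: broken lexemes silently disappear from the result, so the output is a subset of what is actually stored in the tsvector column.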

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Hannes Dorbath <light(at)theendofthetunnel(dot)de>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: TSearch2 / Get all unique lexems
Date:	2005-12-08 11:04:03
Message-ID:	Pine.GSO.4.63.0512081355160.13553@ra.sai.msu.su

On Thu, 8 Dec 2005, Hannes Dorbath wrote:

> On 07.12.2005 16:13, Oleg Bartunov wrote:
>> hmm, you could dump tsvector column and use awk+sort+uniq
>
> Thanks. I hoped for something possible inside a pl/pgsql proc. I'm trying to
> integrate pg_trgm with Tsearch2. I'm still on my utf-8 database. Yes I know,
> there is _NO_ utf-8 support of any kind in Tsearch2 yet, but I got it working
> to a degree that is OK for my application (Created my own stemmer variant,
> ispell dict, affix file etc). The last missing bit is to get a source for
> pg_trgm. I cannot use the stat() function, because it breaks as soon as it
> sees a UTF-8 char.

Unless there is some way to ignore errors in the UTF-8 conversion to text,
this is a dead end. The stat() function uses the text representation.

You have to wait for the new release with full UTF-8 support or go the 'lazy'
way, i.e. use any tools to get a list of unique words and create a pg_trgm index.
There are several questions:
* Do you actually need to be synchronized with the tsvector?
* Do you need to recognize all words? I suppose not. In real life you should
have a dictionary of the words you certainly need to recognize.