Re: Re: [SQL] FTI, paged, ranked searching and efficiency.

Lists: pgsql-phppgsql-sql
From: Paul <paul(at)operamail(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: FTI, paged, ranked searching and efficiency.
Date: 2000-11-15 01:23:04
Message-ID: 3A36DC72@operamail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-php pgsql-sql

Hello,

This is going to be a bit long, I hope some of you will take the
trouble to read it :)

I am building a search engine for a section of a (PHP based) website.
I wish the user to be able to a number of words in the search, and the
search results to be ranked by the number of times words occur (both
different words and the same word occuring multiple times are good).

My (simplified) table structure is this:
======
Table "entry_fti"
Attribute | Type | Modifier
-----------+-------------+----------
string | varchar(25) |
id | oid |
Index: entry_fti_string_idx

Table "entry"
Attribute | Type | Modifier
-----------------------+-------+----------------------------------------------
-
entry_id |integer| not null default
nextval('entry_id_seq'::text)
entry_name |text |
entry_description_html|text |
entry_image_id |integer| not null default 0
entry_tn_image_id |integer| not null default 0
entry_live |boolean| not null default 't'
Index: entry_pkey

Table "image"
Attribute | Type | Modifier
------------+-------------+------------------------------------------------
image_id | integer | not null default nextval('image_id_seq'::text)
image_name | varchar(32) |
height | integer | not null
width | integer | not null
Indices: image_pkey
======

And my (simplified) query looks like this:
======
SELECT COUNT(entry_fti.id) AS rating,
entry.entry_name AS name,
entry.entry_id AS id,
entry.entry_description_html AS description_html,
image.image_name AS thumb1_name,
image.height AS thumb1_height,
image.width AS thumb1_width
FROM entry, entry_fti, image
WHERE entry_fti.id=entry.oid
AND entry.entrytn_image_id=image.image_id
AND entry.entry_live = 't'::bool
AND (
entry_fti.string ~'^word1'
OR
entry_fti.string ~'^word2'
OR
.
.
OR
entry_fti.string ~'^wordn'
)
GROUP BY entry.entry_id,
entry.entry_name,
entry.entry_description_html,
image.image_name,
image.height,
image.width
ORDER BY rating DESC
======

Now this all works, which is good. My problem now is that I want to
limit the number of results shown on a page to 20 and show the number
of pages of extra results, much like you'd see on any search engine site.
Naturally I immediatly thought of the LIMIT and OFFSET clauses, but then:
a) I'd need to do an extra query, to find out the total number of results
to show the number of pages on the webpage.
b) I have no idea how to write that query. It'd be a COUNT of 'rating'
in the above, which would be a COUNT(COUNT(entry_fti.id)) which
would probably require some hideous (and not legal?) double GROUP
BY construct. Ouch.

So basically, LIMIT/OFFSET looks like a no go. This leaves me with just
doing the above query, and using PHP to jump to a particular row in the
results depending on what page you are on and pg_numrows() to
calculate the number of pages.

Would that be particularly inefficient?
Should I be looking harder for a LIMIT/OFFSET based solution?

Perhaps I'd be better off splitting it into two queries, one to just
get the entry_id list in order, then another query to pull out the
rest of the information for the 20 of those entry_ids that are on the results
page I wish to show?
That would stop Postgres from gathering so much information that I am just
going to throw away anyway without looking at.

Any ideas? Have I missed something obvious that will help me? Or better yet,
can someone who has done this sort of thing before tell me whether I am on the
right track?

Paul


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: sqllist <pgsql-sql(at)postgresql(dot)org>, phpp <pgsql-php(at)postgresql(dot)org>
Subject: Re: [SQL] FTI, paged, ranked searching and efficiency.
Date: 2000-11-15 17:03:22
Message-ID: 3A12C1DA.D296628C@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-php pgsql-sql

Paul,

I'm afraid that much of that was over my head. In fact, I'm keeping it
as an example in case I ever need to do something similar. Forward your
info and I'll include credit in the source :-)

In general terms, I've always depended on the PHP to select a "page" of
results, using the logic that the number of results on a page is a
matter for the web application to handle (a display issue) rather than
something to be handled on the back-end (a data issue).

However, you point about not pulling out tons of data the user will
never examine (i.e. 2nd and succeeding pages) is well-taken. Although,
without pulling the entire data set, you can't display to the user how
many results there are, total. If it's a strong possibility that the
users are really only ever going to want the top 20-40 rated results,
then splitting it as you suggest ... first, counting all the matches and
then dragging in the rest of the data for the top X records ... makes a
lot of sense.

Unfortunately, your only real option I can see for DB server-side row
grabbing is: Create the query(ies) as a temporary table or view using a
function. Then use Limit and Offset to grab one chunk of data at a
time. This is, of course, a serious problem for mutli-user performance
since eash user would need their own temp table or view.

From what I can tell, search engines (google, for example) grab the
whole recordset and use the web script to parse it out 25 records at a
time.

Hopefully, someone on this list will have done that before and can
provide less theoretical advice.

-Josh Berkus

P.S. I've also posted this to the pgsql-php list. I;ve quoted the full
text of your question below my .sig for that reason.

--
______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology josh(at)agliodbs(dot)com
and data management solutions (415) 436-9166
for law firms, small businesses fax 436-0137
and non-profit organizations. pager 338-4078
San Francisco

Paul wrote:
>
> Hello,
>
> This is going to be a bit long, I hope some of you will take the
> trouble to read it :)
>
> I am building a search engine for a section of a (PHP based) website.
> I wish the user to be able to a number of words in the search, and the
> search results to be ranked by the number of times words occur (both
> different words and the same word occuring multiple times are good).
>
> My (simplified) table structure is this:
> ======
> Table "entry_fti"
> Attribute | Type | Modifier
> -----------+-------------+----------
> string | varchar(25) |
> id | oid |
> Index: entry_fti_string_idx
>
> Table "entry"
> Attribute | Type | Modifier
> -----------------------+-------+----------------------------------------------
> -
> entry_id |integer| not null default
> nextval('entry_id_seq'::text)
> entry_name |text |
> entry_description_html|text |
> entry_image_id |integer| not null default 0
> entry_tn_image_id |integer| not null default 0
> entry_live |boolean| not null default 't'
> Index: entry_pkey
>
> Table "image"
> Attribute | Type | Modifier
> ------------+-------------+------------------------------------------------
> image_id | integer | not null default nextval('image_id_seq'::text)
> image_name | varchar(32) |
> height | integer | not null
> width | integer | not null
> Indices: image_pkey
> ======
>
> And my (simplified) query looks like this:
> ======
> SELECT COUNT(entry_fti.id) AS rating,
> entry.entry_name AS name,
> entry.entry_id AS id,
> entry.entry_description_html AS description_html,
> image.image_name AS thumb1_name,
> image.height AS thumb1_height,
> image.width AS thumb1_width
> FROM entry, entry_fti, image
> WHERE entry_fti.id=entry.oid
> AND entry.entrytn_image_id=image.image_id
> AND entry.entry_live = 't'::bool
> AND (
> entry_fti.string ~'^word1'
> OR
> entry_fti.string ~'^word2'
> OR
> .
> .
> OR
> entry_fti.string ~'^wordn'
> )
> GROUP BY entry.entry_id,
> entry.entry_name,
> entry.entry_description_html,
> image.image_name,
> image.height,
> image.width
> ORDER BY rating DESC
> ======
>
> Now this all works, which is good. My problem now is that I want to
> limit the number of results shown on a page to 20 and show the number
> of pages of extra results, much like you'd see on any search engine site.
> Naturally I immediatly thought of the LIMIT and OFFSET clauses, but then:
> a) I'd need to do an extra query, to find out the total number of results
> to show the number of pages on the webpage.
> b) I have no idea how to write that query. It'd be a COUNT of 'rating'
> in the above, which would be a COUNT(COUNT(entry_fti.id)) which
> would probably require some hideous (and not legal?) double GROUP
> BY construct. Ouch.
>
> So basically, LIMIT/OFFSET looks like a no go. This leaves me with just
> doing the above query, and using PHP to jump to a particular row in the
> results depending on what page you are on and pg_numrows() to
> calculate the number of pages.
>
> Would that be particularly inefficient?
> Should I be looking harder for a LIMIT/OFFSET based solution?
>
> Perhaps I'd be better off splitting it into two queries, one to just
> get the entry_id list in order, then another query to pull out the
> rest of the information for the 20 of those entry_ids that are on the results
> page I wish to show?
> That would stop Postgres from gathering so much information that I am just
> going to throw away anyway without looking at.
>
> Any ideas? Have I missed something obvious that will help me? Or better yet,
> can someone who has done this sort of thing before tell me whether I am on the
> right track?
>
> Paul


From: Stephen van Egmond <svanegmond(at)bang(dot)dhs(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: sqllist <pgsql-sql(at)postgresql(dot)org>, phpp <pgsql-php(at)postgresql(dot)org>
Subject: Re: Re: [SQL] FTI, paged, ranked searching and efficiency.
Date: 2000-11-15 17:25:16
Message-ID: 20001115122516.B25685@bang.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-php pgsql-sql

Josh Berkus (josh(at)agliodbs(dot)com) wrote:
> Unfortunately, your only real option I can see for DB server-side row
> grabbing is: Create the query(ies) as a temporary table or view using a
> function. Then use Limit and Offset to grab one chunk of data at a
> time. This is, of course, a serious problem for mutli-user performance
> since eash user would need their own temp table or view.

How come nobody's ever thought of cursors?

DECLARE foo CURSOR FOR SELECT stuff FROM stuff WHERE foo ORDER BY
something;

Hop forward N rows?
MOVE FORWARD $n IN foo

Want M rows?
FETCH FORWARD $m IN foo

/Steve


From: Stephen van Egmond <svanegmond(at)bang(dot)dhs(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: sqllist <pgsql-sql(at)postgresql(dot)org>, phpp <pgsql-php(at)postgresql(dot)org>
Subject: Re: Re: [SQL] FTI, paged, ranked searching and efficiency.
Date: 2000-11-15 17:28:48
Message-ID: 20001115122848.C25685@bang.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-php pgsql-sql

My example wasn't complete. You need to wrap the cursor calls in BEGIN
and END. Read the DECLARE docs for the minutiae.

/Steve


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: sqllist <pgsql-sql(at)postgresql(dot)org>, phpp <pgsql-php(at)postgresql(dot)org>
Subject: Re: Re: [SQL] FTI, paged, ranked searching and efficiency.
Date: 2000-11-15 17:31:38
Message-ID: 3A12C87A.390C6346@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-php pgsql-sql

Stephen,

> How come nobody's ever thought of cursors?
>
> DECLARE foo CURSOR FOR SELECT stuff FROM stuff WHERE foo ORDER BY
> something;
>
> Hop forward N rows?
> MOVE FORWARD $n IN foo
>
> Want M rows?
> FETCH FORWARD $m IN foo

I'm intrigued by this. How would I retrieve cursor rows into a web
application? If we could output a cursor to a functon result (we
can't), it would be easy, but I'm not sure otherwise.

-Josh
--
______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology josh(at)agliodbs(dot)com
and data management solutions (415) 436-9166
for law firms, small businesses fax 436-0137
and non-profit organizations. pager 338-4078
San Francisco


From: Stephen van Egmond <svanegmond(at)bang(dot)dhs(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: sqllist <pgsql-sql(at)postgresql(dot)org>, phpp <pgsql-php(at)postgresql(dot)org>
Subject: Re: Re: [SQL] FTI, paged, ranked searching and efficiency.
Date: 2000-11-15 17:51:16
Message-ID: 20001115125116.A25885@bang.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-php pgsql-sql

Josh Berkus (josh(at)agliodbs(dot)com) wrote:

> Stephen,
>
> > How come nobody's ever thought of cursors?
> >
> > DECLARE foo CURSOR FOR SELECT stuff FROM stuff WHERE foo ORDER BY
> > something;
> >
> > Hop forward N rows?
> > MOVE FORWARD $n IN foo
> >
> > Want M rows?
> > FETCH FORWARD $m IN foo
>
> I'm intrigued by this. How would I retrieve cursor rows into a web
> application? If we could output a cursor to a functon result (we
> can't), it would be easy, but I'm not sure otherwise.

The result of the FETCH is something you can use pg_fetch_* functions
on as usual.