Lists: | pgsql-www |
---|
From: | Dave Page <dpage(at)pgadmin(dot)org> |
---|---|
To: | PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Fixing Google Search on the docs (redux) |
Date: | 2020-11-18 16:20:05 |
Message-ID: | CA+OCxoyVwmmZkWUJCez2hCqa89iGv=vq58NF1yQkTg9gtpkn=g@mail.gmail.com |
I was looking at our analytics data, and saw that the vast majority of
inbound traffic to the docs hits the 9.1 version. We've known this has
been an issue for years and have tried various remedies, clearly none of
which are working.
Should we try an experiment for a couple of months, in which we simply
block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt? It's a
much more drastic option, but at least it might force Google into indexing
the latest doc version with the highest priority.
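As a rough illustration of which paths that pattern would cover, here is a minimal Python sketch. It assumes the dot in the pattern is meant literally (so it is escaped here) and that the match is anchored at the start of the path; the function name and sample paths are made up for the example. Note that robots.txt rules, as documented by Google, support only the `*` and `$` wildcards rather than full regular expressions, so in practice the pattern would probably have to be expanded into explicit Disallow lines.

```python
import re

# Pattern from the proposal, anchored at the start of the path.
# Assumption: the dot is meant literally, so it is escaped here.
VERSIONED_DOCS = re.compile(r"^/docs/(\d+|\d\.\d)/")


def is_versioned_docs_path(path: str) -> bool:
    """Return True for /docs/<version>/ paths such as /docs/9.1/ or /docs/12/."""
    return VERSIONED_DOCS.match(path) is not None


if __name__ == "__main__":
    # Hypothetical sample paths, for illustration only.
    for sample in (
        "/docs/9.1/sql-select.html",
        "/docs/12/sql-select.html",
        "/docs/current/sql-select.html",
    ):
        print(sample, is_versioned_docs_path(sample))
```

Under these assumptions, the numbered branches match and /docs/current/ does not, which is the split the experiment is after.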
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
From: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org> |
---|---|
To: | Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-18 16:44:01 |
Message-ID: | 9388577f-5fda-97b4-d38e-02851547d5f5@postgresql.org |
On 11/18/20 11:20 AM, Dave Page wrote:
> I was looking at our analytic data, and saw that the vast majority of
> inbound traffic to the docs, hits the 9.1 version. We've known this has
> been an issue for years and have tried various remedies, clearly none of
> which are working.
>
> Should we try an experiment for a couple of months, in which we simply
> block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt?
> It's a much more drastic option, but at least it might force Google into
> indexing the latest doc version with the highest priority.
If we're going down this road, I would suggest borrowing a concept from
the Django Project documentation, which has a similar issue to us. In
their codebase, they use a <link> tag with rel="canonical" to point to the
latest version of the docs for each page[1].
So for example, given 3.1 is their latest release, you will find
something similar to this:
<link rel="canonical"
href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">
From a quick test of searching various Django concepts, it seems that
the 3.1 pages tend to turn up first.
Our equivalent would be "current".
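A minimal sketch of what that could look like for our URL layout, in Python. The site root, the helper name and the sample path are assumptions for illustration only; this is not how the site's page generation actually works.

```python
import re
from html import escape

# Assumed site root, for illustration only.
BASE_URL = "https://www.postgresql.org"

# Matches the leading /docs/<anything>/ segment of a docs path.
VERSION_SEGMENT = re.compile(r"^/docs/[^/]+/")


def canonical_link(path: str) -> str:
    """Build a <link rel="canonical"> tag pointing at the /docs/current/ copy of a page."""
    current_path = VERSION_SEGMENT.sub("/docs/current/", path, count=1)
    href = escape(BASE_URL + current_path, quote=True)
    return f'<link rel="canonical" href="{href}">'


if __name__ == "__main__":
    # Hypothetical page path.
    print(canonical_link("/docs/9.1/sql-createtrigger.html"))
```

Every versioned copy of a page would then carry the same tag, pointing at the "current" URL.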
Jonathan
[1]
https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
Attachment | Content-Type | Size |
---|---|---|
OpenPGP_0xF1049C729F1C6527.asc | application/pgp-keys | 12.4 KB |
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org> |
Cc: | Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-18 17:28:49 |
Message-ID: | CABUevExZ_mfbZK=9XmBaxk5osNXNzewfUV=obC-vzTSMa9Xo2Q@mail.gmail.com |
On Wed, Nov 18, 2020 at 5:44 PM Jonathan S. Katz <jkatz(at)postgresql(dot)org> wrote:
>
> On 11/18/20 11:20 AM, Dave Page wrote:
> > I was looking at our analytic data, and saw that the vast majority of
> > inbound traffic to the docs, hits the 9.1 version. We've known this has
> > been an issue for years and have tried various remedies, clearly none of
> > which are working.
> >
> > Should we try an experiment for a couple of months, in which we simply
> > block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt?
> > It's a much more drastic option, but at least it might force Google into
> > indexing the latest doc version with the highest priority.
>
> If we're going down this road, I would suggest borrowing a concept from
> the Django Project documentation which has a similar issue to us. In
> their codebase, use a <link> tag with rel="canonical" to point to the
> latest version of docs on their page[1].
>
> So for example, given 3.1 is their latest release, you will find
> something similar to this:
>
> <link rel="canonical"
> href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">
>
> From a quick test of searching various Django concepts, it seems that
> the 3.1 pages tend to turn up first.
>
> Our equivalent would be "current".
>
> Jonathan
>
> [1]
> https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
We've discussed this many times before, and I think so far they've all
bogged down at "google suck" :) The problem is that they don't even
consider the case like we have where the pages *aren't* identical, but
yet related.
The problem it usually comes down to is that if we do that, then you
will no longer be able to, say, search for something in the old docs *at
all*. A good example right now might be that recovery.conf stuff goes
away, even if you explicitly search for "postgresql recovery.conf 11".
And I'd guess the majority of people are actually looking for things
in versions that are NOT the latest (though an even bigger majority of
people will be looking for things in versions that are not 9.1).
FWIW, I find the django example absolutely terrible -- in fact, it's a
great example of how the canonical URL handling sucks. There is AFAICT
no way to actually search for information about old versions. You have
to search for it in the new version and then hope that the same info
happens to be on the same page in an earlier version, and then
manually browse your way back to that version (also through very
annoying js popover stuff, but that's a different thing)
I don't know of any way to actually tell google to prioritise the new
versions. You used to be able to do this using the sitemap.xml stuff,
which is why we do that, but at some point they just stopped caring
about those, even in the cases where we're *lowering* our own
priority, under the argument of not letting us increase our priority.
It's not that what we have now for this is especially great. It might
be that going down that route is still the least bad. But we have to
make that decision while knowing this means that *nobody* will be able
to search for things in our older documentation even if they
explicitly ask for it. At all. Their only chance is to search for
something else that might hit our docs, then in that click over to the
correct version they actually asked for, and then search *again* using
our site-search and hope that it shows up there. I'm willing to bet
very few users will figure that part out...
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Christophe Pettus <xof(at)thebuild(dot)com> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-18 17:33:41 |
Message-ID: | 1ECDD1C4-7945-406E-B172-97D76F5E15B4@thebuild.com |
> On Nov 18, 2020, at 09:28, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> Their only chance is to search for
> something else that might hit our docs, then in that click over to the
> correct version they actually asked for, and then search *again* using
> our site-search and hope that it shows up there. I'm willing to bet
> very few users will figure that part out...
I'm not sure that is a worse situation than searching for something and having the first page be 9.1 hits.
--
-- Christophe Pettus
xof(at)thebuild(dot)com
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Christophe Pettus <xof(at)thebuild(dot)com> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-18 17:45:47 |
Message-ID: | 992905.1605721547@sss.pgh.pa.us |
Christophe Pettus <xof(at)thebuild(dot)com> writes:
>> On Nov 18, 2020, at 09:28, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>> Their only chance is to search for
>> something else that might hit our docs, then in that click over to the
>> correct version they actually asked for, and then search *again* using
>> our site-search and hope that it shows up there. I'm willing to bet
>> very few users will figure that part out...
> I'm not sure that is a worse situation than searching for something and having the first page be 9.1 hits.
Maybe, rather than trying to force google to index "current", we should
force them to index current minus one or two releases, so that what they
index is in the middle of the range of supported releases. That would
represent a decent compromise between "info too old" and "info too new".
Another idea is to block, via robots.txt, any out-of-support branches.
We won't know which of the supported branches they then prioritize,
but at least it won't be 9.1.
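To make the second idea concrete, here is a small Python sketch that emits robots.txt rules for a list of out-of-support branches. The function name is hypothetical, and the version list in the example is illustrative only, not an authoritative support matrix.

```python
def robots_txt(unsupported_versions):
    """Return robots.txt text that blocks the docs of each listed branch for all crawlers."""
    lines = ["User-agent: *"]
    for version in unsupported_versions:
        lines.append(f"Disallow: /docs/{version}/")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative list only; the real set of unsupported branches would
    # have to come from the site's own version metadata.
    print(robots_txt(["9.1", "9.2", "9.3"]), end="")
```

Each Disallow line is a plain path prefix, so no wildcard or regular-expression support is needed from the crawler.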
regards, tom lane
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Christophe Pettus <xof(at)thebuild(dot)com> |
Cc: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-18 18:03:19 |
Message-ID: | CABUevExAi0Xvi8W_3QL6scYAd_YMmNasyZ+hK0O6coa=Le2NkQ@mail.gmail.com |
On Wed, Nov 18, 2020 at 6:33 PM Christophe Pettus <xof(at)thebuild(dot)com> wrote:
>
>
>
> > On Nov 18, 2020, at 09:28, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> > Their only chance is to search for
> > something else that might hit our docs, then in that click over to the
> > correct version they actually asked for, and then search *again* using
> > our site-search and hope that it shows up there. I'm willing to bet
> > very few users will figure that part out...
>
> I'm not sure that is a worse situation than searching for something and having the first page be 9.1 hits.
Today you can append "12" to your search and get the results for v12
most of the time.
So today the default is really bad, but the exact right thing is possible.
With the change, the default would be less bad (but not necessarily
exactly right), and the exact right thing would be impossible.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Christophe Pettus <xof(at)thebuild(dot)com> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-18 18:08:39 |
Message-ID: | A8314490-EFB3-4580-A9ED-A24CC626FABF@thebuild.com |
> On Nov 18, 2020, at 10:03, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> So today the default is really bad, but the exact right thing is possible.
> With the change, the default would be less bad (but not necessarily
> exactly right), and the exact right thing would be impossible.
We're kind of speculating that "oh, right, I have to slap a version number on when I search by Google" is less frustrating or more common than just "click on the wrong version result and then navigate to the right version result."
I also think it's a benefit to prioritize the most recent version on external search hits.
I haven't seen significant complaints within the Django community about the way they handle it, and that's with the Django documentation being *much* less well-organized than the PostgreSQL documentation (and thus more reliant on external search engines to find the right thing).
--
-- Christophe Pettus
xof(at)thebuild(dot)com
From: | Dave Page <dpage(at)pgadmin(dot)org> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-19 09:39:58 |
Message-ID: | CA+OCxoy9=wJxWtEkH6j0B6pg+H35TJxx+MZoLiZm9Edd9PsNeg@mail.gmail.com |
On Wed, Nov 18, 2020 at 5:29 PM Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> On Wed, Nov 18, 2020 at 5:44 PM Jonathan S. Katz <jkatz(at)postgresql(dot)org> wrote:
> >
> > On 11/18/20 11:20 AM, Dave Page wrote:
> > > I was looking at our analytic data, and saw that the vast majority of
> > > inbound traffic to the docs, hits the 9.1 version. We've known this has
> > > been an issue for years and have tried various remedies, clearly none of
> > > which are working.
> > >
> > > Should we try an experiment for a couple of months, in which we simply
> > > block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt?
> > > It's a much more drastic option, but at least it might force Google into
> > > indexing the latest doc version with the highest priority.
> >
> > If we're going down this road, I would suggest borrowing a concept from
> > the Django Project documentation which has a similar issue to us. In
> > their codebase, use a <link> tag with rel="canonical" to point to the
> > latest version of docs on their page[1].
> >
> > So for example, given 3.1 is their latest release, you will find
> > something similar to this:
> >
> > <link rel="canonical"
> > href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">
> >
> > From a quick test of searching various Django concepts, it seems that
> > the 3.1 pages tend to turn up first.
> >
> > Our equivalent would be "current".
> >
> > Jonathan
> >
> > [1]
> > https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
>
> We've discussed this many times before, and I think so far they've all
> bogged down at "google suck" :) The problem is that they don't even
> consider the case like we have where the pages *aren't* identical, but
> yet related.
>
Sure, but we need to do something, regardless of whether Google suck in
this case. The current situation is ridiculous; I don't remember the last
time I searched on something and didn't have to click an alternate version
link if I chose a result from our docs.
>
> The problem it usually comes down to is that if we do that, then you
> will no longer be able to say search for something in the old docs *at
> all*. A good example right now might be that recovery.conf stuff goes
> away. Even if you explicitly search for "postgresql recovery.conf 11".
> And I'd guess the majority of people are actually looking for things
> in versions that are NOT the latest (though an even bigger majority of
> people will be looking for things in versions that are not 9.1).
>
The irony is that that example would be far less of an issue if we hadn't
removed all the release notes for older versions (see
https://www.enterprisedb.com/edb-docs/s?q=recovery.conf&c=&p=19&v=272 as an
example). The older release notes would give users a hint as to where to
look.
> FWIW, I find the django example absolutely terrible -- in fact, it's a
> great example of how the canonical URL handling sucks. There is AFAICT
> no way to actually search for information about old versions. You have
> to search for it in the new version and then hope that the same info
> happens to be on the same page in an earlier version, and then
> manually browse your way back to that version (also through very
> annoying js popover stuff, but that's a different thing)
>
That is true, however the *vast* majority of cases will be present in older
versions.
>
> I don't know of any way to actually tell google to prioritise the new
> versions. You used to be able to do this using the sitemap.xml stuff,
> which is why we do that, but at some point they just stopped caring
> about those, even in the cases where we're *lowering* our own
> priority, under the argument of not letting us increase our priority.
>
> It's not that what we have now for this is especially great. It might
> be that going down that route is still the least bad. But we have to
> make that decision while knowing this means that *nobody* will be able
> to search for things in our older documentation even if they
> explicitly ask for it. At all.
On public search engines. They will still be able to using our own site
search.
> Their only chance is to search for
> something else that might hit our docs, then in that click over to the
> correct version they actually asked for, and then search *again* using
> our site-search and hope that it shows up there. I'm willing to bet
> very few users will figure that part out...
>
The issue for me is that the current situation sucks for the vast majority
of users, as evidenced by our analytics. If we blocked indexing of all but
the current version of the docs, it would suck in the same way only for
those that specifically want to look at an older version, and those that
search for one of the very few things that have been removed from the
latest version. In short, I think the current situation is worse.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
From: | Dave Page <dpage(at)pgadmin(dot)org> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Christophe Pettus <xof(at)thebuild(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-19 09:42:08 |
Message-ID: | CA+OCxoyWW-cKxwHOy32Ar4M4crE=yHXyoc49vWwV05toAtPCCA@mail.gmail.com |
On Wed, Nov 18, 2020 at 5:45 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Christophe Pettus <xof(at)thebuild(dot)com> writes:
> >> On Nov 18, 2020, at 09:28, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> >> Their only chance is to search for
> >> something else that might hit our docs, then in that click over to the
> >> correct version they actually asked for, and then search *again* using
> >> our site-search and hope that it shows up there. I'm willing to bet
> >> very few users will figure that part out...
>
> > I'm not sure that is a worse situation than searching for something and having the first page be 9.1 hits.
>
> Maybe, rather than trying to force google to index "current", we should
> force them to index current minus one or two releases, so that what they
> index is in the middle of the range of supported releases. That would
> represent a decent compromise between "info too old" and "info too new".
>
That'll stop people searching about the new features in the latest, which I
think is likely a common pattern.
>
> Another idea is to block, via robots.txt, any out-of-support branches.
> We won't know which of the supported branches they then prioritize,
> but at least it won't be 9.1.
>
That I could get on board with.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Dave Page <dpage(at)pgadmin(dot)org> |
Cc: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-19 09:57:57 |
Message-ID: | CABUevEzmy02nUWdisHNCoM9-19vWE1xATinWfFgW6n-iFd3qUQ@mail.gmail.com |
On Thu, Nov 19, 2020 at 10:40 AM Dave Page <dpage(at)pgadmin(dot)org> wrote:
>
>
>
> On Wed, Nov 18, 2020 at 5:29 PM Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>>
>> On Wed, Nov 18, 2020 at 5:44 PM Jonathan S. Katz <jkatz(at)postgresql(dot)org> wrote:
>> >
>> > On 11/18/20 11:20 AM, Dave Page wrote:
>> > > I was looking at our analytic data, and saw that the vast majority of
>> > > inbound traffic to the docs, hits the 9.1 version. We've known this has
>> > > been an issue for years and have tried various remedies, clearly none of
>> > > which are working.
>> > >
>> > > Should we try an experiment for a couple of months, in which we simply
>> > > block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt?
>> > > It's a much more drastic option, but at least it might force Google into
>> > > indexing the latest doc version with the highest priority.
>> >
>> > If we're going down this road, I would suggest borrowing a concept from
>> > the Django Project documentation which has a similar issue to us. In
>> > their codebase, use a <link> tag with rel="canonical" to point to the
>> > latest version of docs on their page[1].
>> >
>> > So for example, given 3.1 is their latest release, you will find
>> > something similar to this:
>> >
>> > <link rel="canonical"
>> > href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">
>> >
>> > From a quick test of searching various Django concepts, it seems that
>> > the 3.1 pages tend to turn up first.
>> >
>> > Our equivalent would be "current".
>> >
>> > Jonathan
>> >
>> > [1]
>> > https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
>>
>> We've discussed this many times before, and I think so far they've all
>> bogged down at "google suck" :) The problem is that they don't even
>> consider the case like we have where the pages *aren't* identical, but
>> yet related.
>
>
> Sure, but we need to do something, regardless of whether Google suck in this case. The current situation is ridiculous; I don't remember the last time I searched on something and didn't have to click an alternate version link if I chose a result from our docs.
>
>>
>>
>> The problem it usually comes down to is that if we do that, then you
>> will no longer be able to say search for something in the old docs *at
>> all*. A good example right now might be that recovery.conf stuff goes
>> away. Even if you explicitly search for "postgresql recovery.conf 11".
>> And I'd guess the majority of people are actually looking for things
>> in versions that are NOT the latest (though an even bigger majority of
>> people will be looking for things in versions that are not 9.1).
>
>
> The irony is that that example would be far less of an issue if we hadn't removed all the release notes for older versions (see https://www.enterprisedb.com/edb-docs/s?q=recovery.conf&c=&p=19&v=272 as an example). The older release notes would give users a hint as to where to look.
The release notes themselves are still available under, for example,
/docs/release/12.0/, so we should be able to keep *that* searchable
still. So for this particular case it would at least tell people that
"yeah, you're right, it used to be called recovery.conf" when they're
searching for documentation about 11 and earlier... They still won't
get to the actual documentation for it though -- but neither does your
example from edb :)
>> FWIW, I find the django example absolutely terrible -- in fact, it's a
>> great example of how the canonical URL handling sucks. There is AFAICT
>> no way to actually search for information about old versions. You have
>> to search for it in the new version and then hope that the same info
>> happens to be on the same page in an earlier version, and then
>> manually browse your way back to that version (also through very
>> annoying js popover stuff, but that's a different thing)
>
>
> That is true, however the *vast* majority of cases will be present in older versions.
Yes, but one could also argue that specifically the things that people
search for might be less consistently present across versions.
>> I don't know of any way to actually tell google to prioritise the new
>> versions. You used to be able to do this using the sitemap.xml stuff,
>> which is why we do that, but at some point they just stopped caring
>> about those, even in the cases where we're *lowering* our own
>> priority, under the argument of not letting us increase our priority.
>>
>> It's not that what we have now for this is especially great. It might
>> be that going down that route is still the least bad. But we have to
>> make that decision while knowing this means that *nobody* will be able
>> to search for things in our older documentation even if they
>> explicitly ask for it. At all.
>
>
> On public search engines. They will still be able to using our own site search.
Yes, of course.
>> Their only chance is to search for
>> something else that might hit our docs, then in that click over to the
>> correct version they actually asked for, and then search *again* using
>> our site-search and hope that it shows up there. I'm willing to bet
>> very few users will figure that part out...
>
>
> The issue for me is that the current situation sucks for the vast majority of users, as evidenced by our analytics. If we blocked indexing of all but the current version of the docs, it would suck in the same way only for those that specifically want to look at an older version, and those that search for one of the very few things that have been removed from the latest version. In short, I think the current situation is worse.
Or we need a somewhat in between level. Like, right now I bet most
people would actually want version 11 or 12, not 13. So do we need to
define a "most likely wants to search for this" version as well, which
would then trail the actual latest-release version, and point the
search engines to that?
That said, I also agree with the suggestion to start by at least
blocking those that are unsupported. However, we should monitor the
results carefully so that it doesn't end up with google just zapping
*everything* -- we need them to realize the newer versions are there.
Doing the canonical-URL-setup that Jonathan suggested would make
google update it, the question is what happens if they just "go away".
Do we *lose* all the existing "google power" of those links? If so,
it might be a very costly experiment...
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Dave Page <dpage(at)pgadmin(dot)org> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-19 10:34:27 |
Message-ID: | CA+OCxoxet+zWmWB5b2sLjvUHwNngPnyu8STQ4frYoMtY--MNMA@mail.gmail.com |
On Thu, Nov 19, 2020 at 9:58 AM Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>
> > The issue for me is that the current situation sucks for the vast
> > majority of users, as evidenced by our analytics. If we blocked indexing of
> > all but the current version of the docs, it would suck in the same way only
> > for those that specifically want to look at an older version, and those
> > that search for one of the very few things that have been removed from the
> > latest version. In short, I think the current situation is worse.
>
> Or we need a somewhat in between level. Like, right now I bet most
> people would actually want version 11 or 12, not 13. So do we need to
> define a "most likely wants to search for this" version as well, which
> would then trail the actual latest-release version, and point the
> search engines to that?
>
Perhaps an interesting datapoint is this:
====
If you have a single page accessible by multiple URLs, or different pages
with similar content (for example, a page with both a mobile and a desktop
version), Google sees these as duplicate versions of the same page. Google
will choose one URL as the canonical version and crawl that, and all other
URLs will be considered duplicate URLs and crawled less often.
If you don't explicitly tell Google which URL is canonical, Google will
make the choice for you, or might consider them both of equal weight, which
might lead to unwanted behavior, as explained below in Why should I choose
a canonical URL?
====
(from:
https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls?hl=en
)
I think this is interesting because it makes the point that non-canonical
URLs will still be indexed, just less often. I wonder if we can do
something like the following, but still retain the ability to do a search
like "postgresql 12 create trigger":
- Remove (by default) all doc URLs from the sitemap that aren't under
/current/ (note that evidence indicates Google will still index pages not
in the sitemap if it finds them, even if a sitemap is present).
- Include a canonical URL in all doc pages that points to the /current/
version.
- Where a page has been removed entirely, mark the most recent version of
it as the canonical one instead of the /current/ version.
If the Google docs are correct, it'll still index the older versions (and
presumably use them in results if it needs to, e.g. because the user
included a version number), but it'll prefer the canonical one.
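The canonical-selection rule in the list above can be sketched roughly as follows, in Python. The data layout (a mapping from version to the set of page names in that version), the function name and the sample page names are all assumptions made for the example, not the site's actual schema.

```python
def canonical_version(page, pages_by_version, newest_first):
    """Pick the docs version whose copy of `page` should be marked canonical.

    Returns "current" when the page exists in the newest release, otherwise
    the most recent version that still has it, or None if no version does.
    """
    for version in newest_first:
        if page in pages_by_version.get(version, set()):
            return "current" if version == newest_first[0] else version
    return None


if __name__ == "__main__":
    # Hypothetical data: one page survives into the newest release,
    # the other was removed after version 11.
    newest_first = ["13", "12", "11"]
    pages_by_version = {
        "13": {"sql-createtrigger.html"},
        "12": {"sql-createtrigger.html"},
        "11": {"sql-createtrigger.html", "recovery-config.html"},
    }
    for page in ("sql-createtrigger.html", "recovery-config.html"):
        print(page, canonical_version(page, pages_by_version, newest_first))
```

In this sketch a removed page keeps a canonical URL in the last version that shipped it, so it would remain a candidate for search results.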
> That said, I also agree with the suggestion to start by at least
> blocking those that are unsupported. However, we should monitor the
> results carefully so that doesn't end up with google just zapping
> *everything* -- we need them to realize the newer versions are there.
> Doing the canonical-URL-setup that Jonathan suggested would make
> google update it, the question is what happens if they just "go away".
> Do we *lose* all the existing "google power" of those links? If so,
> it might be a very costly experiment...
>
I think there's a risk here whatever we do. I'm not sure that's a good
enough reason to do nothing though.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
From: | Greg Stark <stark(at)mit(dot)edu> |
---|---|
To: | Dave Page <dpage(at)pgadmin(dot)org> |
Cc: | Magnus Hagander <magnus(at)hagander(dot)net>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-19 14:19:06 |
Message-ID: | CAM-w4HOheMMcDOJUCZn32YwKEux_VJYjPKjVXufLWnGkrWon_g@mail.gmail.com |
> all other URLs will be considered duplicate URLs and crawled less often
What Google crawls and what Google considers a valid search result to
serve users are two independent questions. Google may well crawl the
non-canonical results but never serve them. The crawl would still, for
example, add weight to pages linked from it. It's always really hard
to tell when reading Google docs whether they're talking about crawl
behaviour or search results behaviour.
> - Where a page has been removed entirely, mark the most recent version of it as the canonical one instead of the /current/ version.
This seems like a significant advance on previous ideas. If we have
enough meta data available to do this that would be a big win. I think
it's rare that we remove information from a page but keep the same
page. Generally things like recovery.conf would mean removing whole
pages replacing them with new pages that document new functionality.
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Greg Stark <stark(at)mit(dot)edu> |
Cc: | Dave Page <dpage(at)pgadmin(dot)org>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-19 14:22:44 |
Message-ID: | CABUevEwG8-+EY945Ar_5cX7Me1th3juFczHYArg2_CPoDv3QRQ@mail.gmail.com |
Lists: | pgsql-www |
On Thu, Nov 19, 2020 at 3:19 PM Greg Stark <stark(at)mit(dot)edu> wrote:
>
> > - Where a page has been removed entirely, mark the most recent version of it as the canonical one instead of the /current/ version.
>
> This seems like a significant advance on previous ideas. If we have
> enough metadata available to do this that would be a big win. I think
> it's rare that we remove information from a page but keep the same
> page. Generally things like recovery.conf would mean removing whole
> pages and replacing them with new pages that document new functionality.
It's actually the other way around. We very seldom remove pages, but
more often change the information that's on them.
But yes, we definitely have the metadata to do that. It'll take some
SQL magic in the page generation I think, but luckily we know one or
two people who can write such things :)
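
A minimal sketch of the canonical rule discussed above, written in Python rather than the SQL mentioned here. The page names, the version lists, and the `PAGE_VERSIONS` and `canonical_url` names are all hypothetical and do not reflect the actual pgweb schema; the only thing taken from the thread is the rule itself: point at /current/ when the page still exists there, otherwise at the newest version that still has the page.

```python
from typing import Optional

# Hypothetical data: which doc versions contain each page.
# This is NOT the real pgweb schema, only an illustration.
PAGE_VERSIONS = {
    "some-page.html": ["9.1", "9.6", "11", "12", "13"],
    "removed-page.html": ["9.1", "9.6", "10", "11"],
}
CURRENT_VERSION = "13"  # assumed to be the latest release


def version_key(version: str) -> tuple:
    """Sort key that ranks '10' above '9.6' by comparing numeric parts."""
    return tuple(int(part) for part in version.split("."))


def canonical_url(page: str) -> Optional[str]:
    """Return the canonical URL for a docs page, or None if unknown."""
    versions = PAGE_VERSIONS.get(page)
    if not versions:
        return None
    if CURRENT_VERSION in versions:
        # The page still exists in the latest docs: point at /current/.
        return f"/docs/current/{page}"
    # The page was removed: point at the newest version that still has it.
    newest = max(versions, key=version_key)
    return f"/docs/{newest}/{page}"
```

With this rule a removed page keeps one indexable copy in its last version, while every version of a surviving page still points at /current/. It does nothing for the other case raised in the thread, where the page survives but its content changed.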
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Greg Stark <stark(at)mit(dot)edu>, Dave Page <dpage(at)pgadmin(dot)org>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-19 17:05:30 |
Message-ID: | CAH2-WznRpNBqfTG3GVw_m2H0iLpV5jezRRJgdE09g7ne+eNu=w@mail.gmail.com |
Lists: | pgsql-www |
On Thu, Nov 19, 2020 at 6:23 AM Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> It's actually the other way around. We very seldom remove pages, but
> more often change the information that's on them.
I agree.
--
Peter Geoghegan
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-19 19:50:33 |
Message-ID: | 20201119195033.wuuhzza7tm2jtzuu@alap3.anarazel.de |
Lists: | pgsql-www |
Hi,
On 2020-11-18 18:28:49 +0100, Magnus Hagander wrote:
> We've discussed this many times before, and I think so far they've all
> bogged down at "google suck" :) The problem is that they don't even
> consider the case like we have where the pages *aren't* identical, but
> yet related.
Is any search engine better at this? I don't think so?
> The problem it usually comes down to is that if we do that, then you
> will no longer be able to say search for something in the old docs *at
> all*.
I think that'd still be better than the current situation. But I hope we
can do better:
> A good example right now might be that recovery.conf stuff goes
> away. Even if you explicitly search for "postgresql recovery.conf 11".
> And I'd guess the majority of people are actually looking for things
> in versions that are NOT the latest (though an even bigger majority of
> people will be looking for things in versions that are not 9.1).
E.g. not applying canonical when there's no newer version.
> I don't know of any way to actually tell google to prioritise the new
> versions. You used to be able to do this using the sitemap.xml stuff,
> which is why we do that, but at some point they just stopped caring
> about those, even in the cases where we're *lowering* our own
> priority, under the argument of not letting us increase our priority.
Have we evaluated not using canonical, but not including old versions in
the sitemap?
Greetings,
Andres Freund
From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-21 14:57:28 |
Message-ID: | CABUevExxMMkQ78fHi9wkjcs1tTduUEE8ZrWxiRpdp0Tk1D0dcw@mail.gmail.com |
Lists: | pgsql-www |
On Thu, Nov 19, 2020 at 8:50 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2020-11-18 18:28:49 +0100, Magnus Hagander wrote:
> > We've discussed this many times before, and I think so far they've all
> > bogged down at "google suck" :) The problem is that they don't even
> > consider the case like we have where the pages *aren't* identical, but
> > yet related.
>
> Is any search engine better at this? I don't think so?
I doubt it, most tend to copy Google. And in either case it doesn't
matter that much -- the *vast* majority of our inbound search traffic
is google vs the other searches. By such a margin that there's not even a
point in considering the others.
> > The problem it usually comes down to is that if we do that, then you
> > will no longer be able to say search for something in the old docs *at
> > all*.
>
> I think that'd still be better than the current situation. But I hope we
> can do better:
>
> > A good example right now might be that recovery.conf stuff goes
> > away. Even if you explicitly search for "postgresql recovery.conf 11".
> > And I'd guess the majority of people are actually looking for things
> > in versions that are NOT the latest (though an even bigger majority of
> > people will be looking for things in versions that are not 9.1).
>
> E.g. not applying canonical when there's no newer version.
That we can definitely do. So for recovery.conf it would still work,
but anything that goes on a page where the page still exists, I don't
see how we could separate that out and not do a canonical for that...
> > I don't know of any way to actually tell google to prioritise the new
> > versions. You used to be able to do this using the sitemap.xml stuff,
> > which is why we do that, but at some point they just stopped caring
> > about those, even in the cases where we're *lowering* our own
> > priority, under the argument of not letting us increase our priority.
>
> Have we evaluated not using canonical, but not including old versions in
> the sitemap?
AIUI from my reading, Google mostly ignores sitemaps these days. The
only thing it's used for is seeding *new* URLs into the search engine,
not removing old and not having any effect on priority. Probably
because it was abused too much.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, Dave Page <dpage(at)pgadmin(dot)org>, PostgreSQL WWW <pgsql-www(at)postgresql(dot)org> |
Subject: | Re: Fixing Google Search on the docs (redux) |
Date: | 2020-11-21 19:45:34 |
Message-ID: | 20201121194534.bthxe2nrc7pvjelo@alap3.anarazel.de |
Lists: | pgsql-www |
Hi,
On 2020-11-21 15:57:28 +0100, Magnus Hagander wrote:
> On Thu, Nov 19, 2020 at 8:50 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2020-11-18 18:28:49 +0100, Magnus Hagander wrote:
> > > We've discussed this many times before, and I think so far they've all
> > > bogged down at "google suck" :) The problem is that they don't even
> > > consider the case like we have where the pages *aren't* identical, but
> > > yet related.
> >
> > Is any search engine better at this? I don't think so?
>
> I doubt it, most tend to copy Google. And in either case it doesn't
> matter that much -- the *vast* majority of our inbound search traffic
> is google vs the other searches. By such a margin that there's not even a
> point in considering the others.
I was more wondering whether it's "search engines suck" or "google
sucks" - obviously g search is dominant...
> > > The problem it usually comes down to is that if we do that, then you
> > > will no longer be able to say search for something in the old docs *at
> > > all*.
> >
> > I think that'd still be better than the current situation. But I hope we
> > can do better:
> >
> > > A good example right now might be that recovery.conf stuff goes
> > > away. Even if you explicitly search for "postgresql recovery.conf 11".
> > > And I'd guess the majority of people are actually looking for things
> > > in versions that are NOT the latest (though an even bigger majority of
> > > people will be looking for things in versions that are not 9.1).
> >
> > E.g. not applying canonical when there's no newer version.
>
> That we can definitely do. So for recovery.conf it would still work,
> but anything that goes on a page where the page still exists, I don't
> see how we could separate that out and not do a canonical for that...
Compute a similarity metric ;). No, I'm not serious.
I wonder if it's worth adding some more metadata to our pages for
google's benefit. Perhaps it'd be *slightly* less annoying to navigate
to the right version of the docs if we added breadcrumb annotations
https://developers.google.com/search/docs/data-types/breadcrumb#json-ld_1
I can imagine - but have nothing but intuition to back that up - that we
also make google's job harder by having a very recent timestamp for each
version of the docs. Perhaps we ought to add datePublished /
dateModified annotations, and freeze datePublished to the release?
And probably also not update dateModified when the page didn't change,
but I think you were discussing that elsewhere.
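
A rough sketch of what such annotations could look like, generated here in Python. The schema.org types and properties (`BreadcrumbList`, `ListItem`, `WebPage`, `datePublished`, `dateModified`) are real vocabulary, but the `docs_jsonld` helper, its parameters, and the sample values are assumptions made for illustration; nothing here reflects how pgweb actually renders pages, and whether Google would act on the dates is untested.

```python
import json
from datetime import date


def docs_jsonld(crumbs, release_date, last_changed):
    """Build JSON-LD for one docs page (hypothetical helper).

    crumbs: list of (name, url) pairs from the site root down to the page.
    release_date: date the doc version was released (frozen datePublished).
    last_changed: date the page content last actually changed.
    """
    breadcrumb = {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": i, "name": name, "item": url}
            for i, (name, url) in enumerate(crumbs, start=1)
        ],
    }
    page = {
        "@context": "https://schema.org",
        "@type": "WebPage",
        "datePublished": release_date.isoformat(),
        "dateModified": last_changed.isoformat(),
    }
    # The result would be embedded in a <script type="application/ld+json"> tag.
    return json.dumps([breadcrumb, page], indent=2)


# Illustrative values only; the dates and page name are placeholders.
example = docs_jsonld(
    crumbs=[
        ("Documentation", "https://www.postgresql.org/docs/"),
        ("PostgreSQL 13", "https://www.postgresql.org/docs/13/"),
        ("Some Page", "https://www.postgresql.org/docs/13/some-page.html"),
    ],
    release_date=date(2020, 9, 24),
    last_changed=date(2020, 11, 12),
)
```

Keeping `datePublished` fixed at the release and only moving `dateModified` when the rendered content differs would need a stored per-page change date, which is the same metadata question raised earlier in the thread.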
Greetings,
Andres Freund