Re: Clustering, parallelised operating system, super-computing

Lists: pgsql-general
From: Brian Modra <epailty(at)googlemail(dot)com>
To: PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Clustering, parallelised operating system, super-computing
Date: 2010-05-13 08:36:20
Message-ID: AANLkTikXDfdeYOQj6BGZjsnLXzAyerWaTvDZfua81gTr@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi,
I've been told that PostgreSQL and other similar databases don't work
well on a parallelised operating system because they make good use of
shared memory which does not cross the boundary between nodes in a
cluster.

So I am wondering if any work is being done to make it possible to
have a single database schema that spans a number of hosts?

For example, a table on one host/node that has a reference to a table
on another host/node with deletes cascading back.
e.g.

on host A:

create table person (
username text not null unique primary key,
...
);

on host B:

create table race_entry (
person text
references person(username)
match full
on delete cascade
on update cascade,
...
);

--
Brian Modra Land line: +27 23 5411 462
Mobile: +27 79 69 77 082
5 Jan Louw Str, Prince Albert, 6930
Postal: P.O. Box 2, Prince Albert 6930
South Africa
http://www.zwartberg.com/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Brian Modra <epailty(at)googlemail(dot)com>
Cc: PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Clustering, parallelised operating system, super-computing
Date: 2010-05-14 00:21:30
Message-ID: 201005140021.o4E0LUl14844@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Brian Modra wrote:
> Hi,
> I've been told that PostgreSQL and other similar databases don't work
> well on a parallelised operating system because they make good use of
> shared memory which does not cross the boundary between nodes in a
> cluster.
>
> So I am wondering if any work is being done to make it possible to
> have a single database schema that spans a number of hosts?
>
> For example, a table on one host/node that has a reference to a table
> on another host/node with deletes cascading back.
> e.g.

Not currently. There are some prototypes in development, but those
usually have the same database on all the machines and they share the
load.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com


From: Brian Modra <brian(at)zwartberg(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Clustering, parallelised operating system, super-computing
Date: 2010-05-14 06:51:08
Message-ID: AANLkTimV_Wh1DKofePDbUZErIb4li1qY7F9Qdi3LgxA_@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 14/05/2010, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Brian Modra wrote:
>> Hi,
>> I've been told that PostgreSQL and other similar databases don't work
>> well on a parallelised operating system because they make good use of
>> shared memory which does not cross the boundary between nodes in a
>> cluster.
>>
>> So I am wondering if any work is being done to make it possible to
>> have a single database schema that spans a number of hosts?
>>
>> For example, a table on one host/node that has a reference to a table
>> on another host/node with deletes cascading back.
>> e.g.
>
> Not currently. There are some prototypes in development, but those
> usually have the same database on all the machines and they share the
> load.

I'm trying to solve the problem of firstly distributing the volume of
data, and secondarily the load.

So far, I'm putting some bulky data onto different hosts, where there
is no need to ever do a join. I put a "reference" table onto a host
with the data that needs to be joined, then I can select the actual
data from the other host by unique IDs after the join has been
performed locally.

To create a reference with "on delete cascade" across hosts, I create
a trigger (after) delete, and in the plpgsql I call dblink to do the
remote delete.

Similarly, I can do joins in plpgsql with the help of dblink.
But, doing joins across hosts certainly does defeat the purpose of
"distributing the load".

I think that the schema design must be done carefully when distributing data.
So it really will be difficult to get this "supercomputer database" right.

Maybe the best way to solve this is not to do automatic distribution
of the data, but rather to provide tools for implementing distributed
references and joins.

I'm thinking of working on this as part of "The Karoo Project" Open
Source Project I'm working on, and would appreciate
comments/support/criticism.
Thanks

--
Brian Modra Land line: +27 23 5411 462
Mobile: +27 79 69 77 082
5 Jan Louw Str, Prince Albert, 6930
Postal: P.O. Box 2, Prince Albert 6930
South Africa
http://www.zwartberg.com/


From: Benjamin Smith <lists(at)benjamindsmith(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Clustering, parallelised operating system, super-computing
Date: 2010-08-18 23:39:50
Message-ID: 201008181639.50779.lists@benjamindsmith.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thursday, May 13, 2010 11:51:08 pm Brian Modra wrote:
> Maybe the best way to solve this is not to do automatic distribution
> of the data, but rather to provide tools for implementing distributed
> references and joins.

Here's my vote! I'd *LOVE* it if I could do a simple cross-database join
(without the ugliness of dblink), it would be just awesome. Two beers for
cross-database foreign keys...

We already do use dblink extensively with a wrapper. for various reporting
functions. Since the cross-database queries are in the minority, it does
function as load balancing, even if the cross-joined queries aren't so
balanced.

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.