From: DrYSG <ygutfreund(at)draper(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Slow duplicate deletes
Date: 2012-03-05 15:17:11
Message-ID: 1330960631243-5537818.post@n5.nabble.com
Lists: pgsql-novice
I have a large table (20M records) of mostly short text fields. There are
duplicates that I am trying to remove. I have a bigserial index (idx) that is
unique, but there are duplicates in another field.
I have an 8 core, 12GB memory computer with RAID disks.
This request has been running for 70 hours (is it safe to kill it?).
How can I make this run faster? This is a one time processing task, but it
is taking a long time.
DELETE FROM portal.metadata
WHERE idx NOT IN
(
    SELECT MIN(idx)
    FROM portal.metadata
    GROUP BY "data_object.unique_id"
);
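A quick way to gauge how much work that DELETE actually has to do is to count the surplus rows first (a minimal sketch, using the column name from the table definition below):

SELECT count(*) - count(DISTINCT "data_object.unique_id") AS duplicate_rows
FROM portal.metadata;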
CREATE TABLE metadata
(
    "data_object.unique_id" CHARACTER(64) NOT NULL,
    size_bytes BIGINT,
    object_date TIMESTAMP(6) WITHOUT TIME ZONE,
    object_type CHARACTER VARYING(25),
    classification CHARACTER VARYING(7),
    object_source CHARACTER VARYING(50),
    object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
    clevel INTEGER,
    fsctlh CHARACTER VARYING(50),
    oname CHARACTER VARYING(40),
    description CHARACTER VARYING(80),
    horizontal_datum CHARACTER VARYING(20),
    do_location CHARACTER VARYING(200),
    elevation_ft INTEGER,
    location USER-DEFINED,
    idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass) NOT NULL,
    bbox CHARACTER VARYING(160),
    CONSTRAINT MetaDataKey PRIMARY KEY (idx)
);
From: DrYSG <ygutfreund(at)draper(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-05 15:20:47
Message-ID: 1330960847608-5537832.post@n5.nabble.com
Lists: pgsql-novice
Actually, I have 24GB, but my question remains: how can I speed this up? And can
I kill the current SQL command (running in pgAdmin)?
From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: DrYSG <ygutfreund(at)draper(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-05 19:52:38
Message-ID: CAHyXU0ztqbzy3=cQiJDW+96mHzUz7WWKhqk00okvtp2jvQYC9A@mail.gmail.com
Lists: pgsql-novice
On Mon, Mar 5, 2012 at 9:17 AM, DrYSG <ygutfreund(at)draper(dot)com> wrote:
> I have a large table (20M records) of mostly short text fields. There are
> duplicates that I am trying to remove. I have a bigserial index (idx) that is
> unique, but there are duplicates in another field.
>
> I have an 8 core, 12GB memory computer with RAID disks.
>
> This request has been running for 70 hours (is it safe to kill it?).
yes...generally speaking, it's safe to kill just about any query in
postgres any time.
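For what it's worth, a minimal sketch of doing that from another session (column names assume PostgreSQL 9.2 or later; older releases call them procpid and current_query, and pgAdmin's cancel button does essentially the same thing):

-- find the backend running the long DELETE
SELECT pid, state, query
FROM pg_stat_activity
WHERE query LIKE 'DELETE FROM portal.metadata%';

-- ask it to cancel; the statement is rolled back cleanly
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE query LIKE 'DELETE FROM portal.metadata%';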
> How can I make this run faster? This is a one time processing task, but it
> is taking a long time.
>
> DELETE FROM portal.metadata
> WHERE idx NOT IN
> (
> SELECT MIN(idx)
> FROM portal.metadata
> GROUP BY "data_object.unique_id"
> );
compare the plan for that query (EXPLAIN) vs this one:
/* delete the records from m1 if there is another record with a lower
   idx for the same unique_id */
DELETE FROM portal.metadata m1
WHERE EXISTS
(
    SELECT 1 FROM portal.metadata m2
    WHERE m1."data_object.unique_id" = m2."data_object.unique_id"
    AND m2.idx < m1.idx
);
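To compare actual runtimes rather than just the estimated plans, one option is EXPLAIN ANALYZE run inside a transaction that is rolled back (a sketch; it really executes the delete and then undoes it, so it takes the full runtime):

BEGIN;
EXPLAIN ANALYZE
DELETE FROM portal.metadata m1
WHERE EXISTS
(
    SELECT 1 FROM portal.metadata m2
    WHERE m1."data_object.unique_id" = m2."data_object.unique_id"
    AND m2.idx < m1.idx
);
ROLLBACK;  -- nothing is actually deleted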
also, if you don't already have one, consider making an index on at
least unique_id, or possibly (unique_id, idx).
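A minimal sketch of that index, assuming the quoted column name from the posted table definition (the index name itself is arbitrary):

CREATE INDEX metadata_unique_id_idx
ON portal.metadata ("data_object.unique_id", idx);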
back up your database before running this query :-).
merlin
From: DrYSG <ygutfreund(at)draper(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-05 20:43:47
Message-ID: 1330980227869-5538858.post@n5.nabble.com
Lists: pgsql-novice
One point I might not have made clear. The reason I want to remove duplicates
is that the column "data_object.unique_id" became non-unique (someone added
duplicate rows). So I added the bigserial column (idx) to uniquely identify the
rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of the rows
that became duplicated.
I am going to try out some of your excellent suggestions. I will report back
on how they are working.
One idea that was given to me was the following (what do you think Merlin?)
CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
Or something of this ilk should be faster because it only needs to do a
sort on data_object.unique_id and then an insert. After you have
verified the results you can do:
BEGIN;
ALTER TABLE portal.metadata RENAME TO metadata_old;
ALTER TABLE portal.new_metadata RENAME TO metadata;
COMMIT;
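One caveat with this approach: CREATE TABLE ... AS copies only the data, not the primary key, the idx default, or any other indexes, so those have to be re-created on the new table (shown here before the rename). A sketch, using the names from the posted schema:

ALTER TABLE portal.new_metadata
ADD CONSTRAINT metadatakey PRIMARY KEY (idx);

ALTER TABLE portal.new_metadata
ALTER COLUMN idx SET DEFAULT nextval('portal.metadata_idx_seq'::regclass);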
From: Michael Wood <esiotrot(at)gmail(dot)com>
To: DrYSG <ygutfreund(at)draper(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-06 08:14:30
Message-ID: CAP6d-HWTDgsbf2-Rr+umdm4ZboEFYf2RMLNeLebZ-OhueaCa=Q@mail.gmail.com
Lists: pgsql-novice
On 5 March 2012 22:43, DrYSG <ygutfreund(at)draper(dot)com> wrote:
> One point I might not have made clear. The reason I want to remove duplicates
> is that the column "data_object.unique_id" became non-unique (someone added
> duplicate rows). So I added the bigserial column (idx) to uniquely identify the
> rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of the rows
> that became duplicated.
>
> I am going to try out some of your excellent suggestions. I will report back
> on how they are working.
>
> One idea that was given to me was the following (what do you think Merlin?)
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
>
> Or something of this ilk should be faster because it only needs to do a
> sort on data_object.unique_id and then an insert. After you have
> verified the results you can do:
>
> BEGIN;
> ALTER TABLE portal.metadata RENAME TO metadata_old;
> ALTER TABLE portal.new_metadata RENAME TO metadata;
> COMMIT;
This sounds like a good way to go, but if you have foreign keys
pointing at portal.metadata I think you will need to drop and recreate
them again after the rename.
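A sketch of what that might look like, with entirely hypothetical child-table, column, and constraint names (the real ones would come from the existing schema):

-- hypothetical child table and constraint
ALTER TABLE portal.some_child_table
DROP CONSTRAINT some_child_metadata_fk;

-- ...swap the tables as above, then re-point the FK at the new portal.metadata...
ALTER TABLE portal.some_child_table
ADD CONSTRAINT some_child_metadata_fk
FOREIGN KEY (metadata_idx) REFERENCES portal.metadata (idx);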
--
Michael Wood <esiotrot(at)gmail(dot)com>
From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: DrYSG <ygutfreund(at)draper(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-06 22:04:06
Message-ID: CAHyXU0wD32XGpubZWVOmZE2PL0x2tpYiwSiCXpZf=6RoEz5Fcw@mail.gmail.com
Lists: pgsql-novice
On Mon, Mar 5, 2012 at 2:43 PM, DrYSG <ygutfreund(at)draper(dot)com> wrote:
> One point I might not have made clear. The reason I want to remove duplicates
> is that the column "data_object.unique_id" became non-unique (someone added
> duplicate rows). So I added the bigserial column (idx) to uniquely identify the
> rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of the rows
> that became duplicated.
>
> I am going to try out some of your excellent suggestions. I will report back
> on how they are working.
>
> One idea that was given to me was the following (what do you think Merlin?)
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
sure that will work, but as Michael noted it's not always practical to
do that. Also, if a fairly small percentage of the records have to be
deleted, an in-place delete may end up being faster anyways. Modern
postgres is pretty smart at optimizing 'where exists' and you should
get a decent plan.
merlin
From: DrYSG <ygutfreund(at)draper(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-07 15:09:38
Message-ID: 1331132978011-5544386.post@n5.nabble.com
Lists: pgsql-novice
The following query took only 16 hours 2 minutes:

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
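Before swapping the table names, a quick check that the new table really is duplicate-free (a sketch, using the posted column name):

SELECT "data_object.unique_id", count(*) AS copies
FROM portal.new_metadata
GROUP BY "data_object.unique_id"
HAVING count(*) > 1;  -- should return zero rows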
Thank you.