From: DrYSG <ygutfreund(at)draper(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Slow duplicate deletes
Date: 2012-03-05 15:17:11
Message-ID: 1330960631243-5537818.post@n5.nabble.com
Lists: pgsql-novice
I have a large table (20M records) of mostly short text fields. There are
duplicates that I am trying to remove. I have a bigserial index (idx) that is
unique, but there are duplicates in another field.
I have an 8 core, 12GB memory computer with RAID disks.
This request has been running for 70 hours (is it safe to kill it?).
How can I make this run faster? This is a one time processing task, but it
is taking a long time.
DELETE FROM portal.metadata
WHERE idx NOT IN
(
    SELECT MIN(idx)
    FROM portal.metadata
    GROUP BY "data_object.unique_id"
);
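A quick way to gauge how much work that DELETE actually has to do is to count the surplus rows first (a minimal sketch, using the column name from the table definition below):

SELECT count(*) - count(DISTINCT "data_object.unique_id") AS duplicate_rows
FROM portal.metadata;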
CREATE TABLE metadata
(
    "data_object.unique_id" CHARACTER(64) NOT NULL,
    size_bytes BIGINT,
    object_date TIMESTAMP(6) WITHOUT TIME ZONE,
    object_type CHARACTER VARYING(25),
    classification CHARACTER VARYING(7),
    object_source CHARACTER VARYING(50),
    object_managed_date TIMESTAMP(6) WITHOUT TIME ZONE,
    clevel INTEGER,
    fsctlh CHARACTER VARYING(50),
    oname CHARACTER VARYING(40),
    description CHARACTER VARYING(80),
    horizontal_datum CHARACTER VARYING(20),
    do_location CHARACTER VARYING(200),
    elevation_ft INTEGER,
    location USER-DEFINED,
    idx BIGINT DEFAULT nextval('portal.metadata_idx_seq'::regclass) NOT NULL,
    bbox CHARACTER VARYING(160),
    CONSTRAINT MetaDataKey PRIMARY KEY (idx)
);
From: DrYSG <ygutfreund(at)draper(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-05 15:20:47
Message-ID: 1330960847608-5537832.post@n5.nabble.com
Lists: pgsql-novice
Actually, I have 24GB, but my question remains: how can I speed this up? And can
I kill the current SQL command (running in pgAdmin)?
From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: DrYSG <ygutfreund(at)draper(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-05 19:52:38
Message-ID: CAHyXU0ztqbzy3=cQiJDW+96mHzUz7WWKhqk00okvtp2jvQYC9A@mail.gmail.com
Lists: pgsql-novice
On Mon, Mar 5, 2012 at 9:17 AM, DrYSG <ygutfreund(at)draper(dot)com> wrote:
> I have a large table (20M records) of mostly short text fields. There are
> duplicates that I am trying to remove. I have a bigserial index (idx) that is
> unique, but there are duplicates in another field.
>
> I have an 8 core, 12GB memory computer with RAID disks.
>
> This request has been running for 70 hours (is it safe to kill it?).
yes...generally speaking, it's safe to kill just about any query in
postgres any time.
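For what it's worth, a minimal sketch of doing that from another session (column names assume PostgreSQL 9.2 or later; older releases call them procpid and current_query, and pgAdmin's cancel button does essentially the same thing):

-- find the backend running the long DELETE
SELECT pid, state, query
FROM pg_stat_activity
WHERE query LIKE 'DELETE FROM portal.metadata%';

-- ask it to cancel; the statement is rolled back cleanly
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE query LIKE 'DELETE FROM portal.metadata%';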
> How can I make this run faster? This is a one time processing task, but it
> is taking a long time.
>
> DELETE FROM portal.metadata
> WHERE idx NOT IN
> (
> SELECT MIN(idx)
> FROM portal.metadata
> GROUP BY "data_object.unique_id"
> );
compare the plan for that query (EXPLAIN) vs this one:
/* delete the records from m1 if there is another record with a lower
   idx for the same unique_id */
DELETE FROM portal.metadata m1
WHERE EXISTS
(
    SELECT 1 FROM portal.metadata m2
    WHERE m1."data_object.unique_id" = m2."data_object.unique_id"
    AND m2.idx < m1.idx
);
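To compare actual runtimes rather than just the estimated plans, one option is EXPLAIN ANALYZE run inside a transaction that is rolled back (a sketch; it really executes the delete and then undoes it, so it takes the full runtime):

BEGIN;
EXPLAIN ANALYZE
DELETE FROM portal.metadata m1
WHERE EXISTS
(
    SELECT 1 FROM portal.metadata m2
    WHERE m1."data_object.unique_id" = m2."data_object.unique_id"
    AND m2.idx < m1.idx
);
ROLLBACK;  -- nothing is actually deleted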
also, if you don't already have one, consider making an index on at
least unique_id, or possibly (unique_id, idx).
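A minimal sketch of that index, assuming the quoted column name from the posted table definition (the index name itself is arbitrary):

CREATE INDEX metadata_unique_id_idx
ON portal.metadata ("data_object.unique_id", idx);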
back up your database before running this query :-).
merlin
From: DrYSG <ygutfreund(at)draper(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-05 20:43:47
Message-ID: 1330980227869-5538858.post@n5.nabble.com
Lists: pgsql-novice
One point I might not have made clear. The reason I want to remove duplicates
is that the column "data_object.unique_id" became non-unique (someone added
duplicate rows). So I added the bigserial column (idx) to uniquely identify the
rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of the rows
that became duplicated.
I am going to try out some of your excellent suggestions. I will report back
on how they are working.
One idea that was given to me was the following (what do you think Merlin?)
CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
Or something of this ilk should be faster because it only needs to do a
sort on data_object.unique_id and then an insert. After you have
verified the results you can do:
BEGIN;
ALTER TABLE portal.metadata RENAME TO metadata_old;
ALTER TABLE portal.new_metadata RENAME TO metadata;
COMMIT;
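One caveat with this approach: CREATE TABLE ... AS copies only the data, not the primary key, the idx default, or any other indexes, so those have to be re-created on the new table (shown here before the rename). A sketch, using the names from the posted schema:

ALTER TABLE portal.new_metadata
ADD CONSTRAINT metadatakey PRIMARY KEY (idx);

ALTER TABLE portal.new_metadata
ALTER COLUMN idx SET DEFAULT nextval('portal.metadata_idx_seq'::regclass);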
From: Michael Wood <esiotrot(at)gmail(dot)com>
To: DrYSG <ygutfreund(at)draper(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-06 08:14:30
Message-ID: CAP6d-HWTDgsbf2-Rr+umdm4ZboEFYf2RMLNeLebZ-OhueaCa=Q@mail.gmail.com
Lists: pgsql-novice
On 5 March 2012 22:43, DrYSG <ygutfreund(at)draper(dot)com> wrote:
> One point I might not have made clear. The reason I want to remove duplicates
> is that the column "data_object.unique_id" became non-unique (someone added
> duplicate rows). So I added the bigserial column (idx) to uniquely identify the
> rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of the rows
> that became duplicated.
>
> I am going to try out some of your excellent suggestions. I will report back
> on how they are working.
>
> One idea that was given to me was the following (what do you think Merlin?)
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
>
> Or something of this ilk should be faster because it only needs to do a
> sort on data_object.unique_id and then an insert. After you have
> verified the results you can do:
>
> BEGIN;
> ALTER TABLE portal.metadata RENAME TO metadata_old;
> ALTER TABLE portal.new_metadata RENAME TO metadata;
> COMMIT;
This sounds like a good way to go, but if you have foreign keys
pointing at portal.metadata I think you will need to drop and recreate
them again after the rename.
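A sketch of what that might look like, with entirely hypothetical child-table, column, and constraint names (the real ones would come from the existing schema):

-- hypothetical child table and constraint
ALTER TABLE portal.some_child_table
DROP CONSTRAINT some_child_metadata_fk;

-- ...swap the tables as above, then re-point the FK at the new portal.metadata...
ALTER TABLE portal.some_child_table
ADD CONSTRAINT some_child_metadata_fk
FOREIGN KEY (metadata_idx) REFERENCES portal.metadata (idx);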
--
Michael Wood <esiotrot(at)gmail(dot)com>
From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: DrYSG <ygutfreund(at)draper(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-06 22:04:06
Message-ID: CAHyXU0wD32XGpubZWVOmZE2PL0x2tpYiwSiCXpZf=6RoEz5Fcw@mail.gmail.com
Lists: pgsql-novice
On Mon, Mar 5, 2012 at 2:43 PM, DrYSG <ygutfreund(at)draper(dot)com> wrote:
> One point I might not have made clear. The reason I want to remove duplicates
> is that the column "data_object.unique_id" became non-unique (someone added
> duplicate rows). So I added the bigserial column (idx) to uniquely identify the
> rows, and I was using SELECT MIN(idx) with GROUP BY to pick just one of the rows
> that became duplicated.
>
> I am going to try out some of your excellent suggestions. I will report back
> on how they are working.
>
> One idea that was given to me was the following (what do you think Merlin?)
>
> CREATE TABLE portal.new_metadata AS
> SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
sure that will work, but as Michael noted it's not always practical to
do that. Also, if a fairly small percentage of the records have to be
deleted, an in-place delete may end up being faster anyways. Modern
postgres is pretty smart at optimizing 'where exists' and you should
get a decent plan.
merlin
From: DrYSG <ygutfreund(at)draper(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Slow duplicate deletes
Date: 2012-03-07 15:09:38
Message-ID: 1331132978011-5544386.post@n5.nabble.com
Lists: pgsql-novice
The following query took only 16 hours 2 minutes:

CREATE TABLE portal.new_metadata AS
SELECT DISTINCT ON ("data_object.unique_id") * FROM portal.metadata;
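Before swapping the table names, a quick check that the new table really is duplicate-free (a sketch, using the posted column name):

SELECT "data_object.unique_id", count(*) AS copies
FROM portal.new_metadata
GROUP BY "data_object.unique_id"
HAVING count(*) > 1;  -- should return zero rows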
Thank you.