Lists: pgsql-hackers

From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Different length lines in COPY CSV
Date: 2005-12-12 04:17:44
Message-ID: 439CF9E8.3060104@familyhealth.com.au

Hi,
Is there any way to force COPY to accept that there will be lines of
different length in a data file?
I have a rather large file I'm trying to import. It's in CSV format,
however, they leave off trailing empty columns on most lines.
Any way to do this? Should it be supported by CSV mode?
Chris

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Different length lines in COPY CSV
Date: 2005-12-12 04:55:49
Message-ID: 200512120455.jBC4tnh03411@candle.pha.pa.us

Christopher Kings-Lynne wrote:
> Hi,
>
> Is there any way to force COPY to accept that there will be lines of
> different length in a data file?
>
> I have a rather large file I'm trying to import. It's in CSV format,
> however, they leave off trailing empty columns on most lines.
>
> Any way to do this? Should it be supported by CSV mode?
No, there is not, partly so we can complain that the file is corrupt, and
partly because we are unsure what to make of the missing values: NULL?
I suppose we could have a TRAILINGNULL flag to COPY but because few ask
for this feature, it hardly seems worth it.
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Different length lines in COPY CSV
Date: 2005-12-12 05:42:25
Message-ID: 22169.1134366145@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Christopher Kings-Lynne wrote:
>> Is there any way to force COPY to accept that there will be lines of
>> different length in a data file?
> I suppose we could have a TRAILINGNULL flag to COPY but because few ask
> for this feature, it hardly seems worth it.
There is no chance that we'll ever be able to cope with every insane
file format that some benighted program claims is CSV. The harder we
try, the more we will lose the ability to detect data errors at all;
not to mention the likely negative consequences for the readability
and performance of the COPY code. I think "fix it with a perl script"
is a very reasonable answer for cases like this one.
regards, tom lane
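
As an illustration of the "fix it with a perl script" route, here is a
minimal sketch of a preprocessing filter that pads short rows out to a
fixed column count before COPY ever sees the file. The Text::CSV_XS
module and the twelve-column target table are assumptions; adjust both
for the real data:

#!/usr/bin/perl
# Minimal sketch: pad ragged CSV rows to a fixed column count so COPY
# sees the same number of fields on every line (column count is assumed).
use strict;
use warnings;
use Text::CSV_XS;

my $ncols = 12;                                 # columns the target table expects (assumption)
my $csv   = Text::CSV_XS->new({ binary => 1 });

while (my $line = <STDIN>) {
    $line =~ s/\r?\n\z//;                       # strip CR/LF
    $csv->parse($line) or die "unparsable CSV at input line $.\n";
    my @fields = $csv->fields;
    push @fields, '' while @fields < $ncols;    # unquoted empty fields load as NULL in CSV mode
    $csv->combine(@fields) or die "could not rebuild input line $.\n";
    print $csv->string, "\n";
}

Running something like "perl pad_csv.pl < ragged.csv > fixed.csv" would
then produce a file that COPY ... WITH CSV accepts, though rows with
quoted embedded newlines would need $csv->getline on a file handle
rather than this line-at-a-time parse.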
From: | "Andrew Dunstan" <andrew(at)dunslane(dot)net> |
---|---|
To: | <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | <pgman(at)candle(dot)pha(dot)pa(dot)us>, <chriskl(at)familyhealth(dot)com(dot)au>, <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Different length lines in COPY CSV |
Date: | 2005-12-12 09:32:22 |
Message-ID: | 3044.24.211.165.134.1134379942.squirrel@www.dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Tom Lane said:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> Christopher Kings-Lynne wrote:
>>> Is there any way to force COPY to accept that there will be lines of
>>> different length in a data file?
>
>> I suppose we could have a TRAILINGNULL flag to COPY but because few
>> ask for this feature, it hardly seems worth it.
>
> There is no chance that we'll ever be able to cope with every insane
> file format that some benighted program claims is CSV. The harder we
> try, the more we will lose the ability to detect data errors at all;
> not to mention the likely negative consequences for the readability and
> performance of the COPY code. I think "fix it with a perl script" is a
> very reasonable answer for cases like this one.
>
I agree.
The COPY code is probably on the edge of maintainability now.
Our CSV routines accept a wide variety of import formats, but a fixed
number of columns is required. Maybe we need a pgfoundry project with some
general perl CSV munging utilities - this issue comes up often enough.
cheers
andrew

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: pgman(at)candle(dot)pha(dot)pa(dot)us, chriskl(at)familyhealth(dot)com(dot)au, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Different length lines in COPY CSV
Date: 2005-12-12 15:00:16
Message-ID: 1693.1134399616@sss.pgh.pa.us

"Andrew Dunstan" <andrew(at)dunslane(dot)net> writes:
> The COPY code is probably on the edge of maintainability now.
> Our CSV routines accept a wide variety of import formats, but a fixed
> number of columns is required. Maybe we need a pgfoundry project with some
> general perl CSV munging utilities - this issue comes up often enough.
What's been suggested in the past is some sort of standalone
file-format-conversion utility, which could deal with this sort of stuff
without having to also deal with all the backend-internal considerations
that COPY must handle. So (at least in theory) it'd be simpler and more
maintainable. That still seems like a good idea to me --- in fact,
given my druthers I would rather have seen CSV support done in such an
external program.
regards, tom lane

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgman(at)candle(dot)pha(dot)pa(dot)us, chriskl(at)familyhealth(dot)com(dot)au, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Different length lines in COPY CSV
Date: 2005-12-12 15:55:55
Message-ID: 439D9D8B.8030100@dunslane.net

Tom Lane wrote:
> "Andrew Dunstan" <andrew(at)dunslane(dot)net> writes:
>> The COPY code is probably on the edge of maintainability now.
>> Our CSV routines accept a wide variety of import formats, but a fixed
>> number of columns is required. Maybe we need a pgfoundry project with some
>> general perl CSV munging utilities - this issue comes up often enough.
>
> What's been suggested in the past is some sort of standalone
> file-format-conversion utility, which could deal with this sort of stuff
> without having to also deal with all the backend-internal considerations
> that COPY must handle. So (at least in theory) it'd be simpler and more
> maintainable. That still seems like a good idea to me --- in fact,
> given my druthers I would rather have seen CSV support done in such an
> external program.

We debated the reasons at the time, and I am not convinced we were wrong
- huge bulk loads are a lot simpler if you don't have to call some
external program to munge the data first.
From time to time people thank me for things I have contributed to
PostgreSQL. The two that get the most thanks by far are CSV support and
dollar quoting.
Anyway, that's history now. Where would you want this file conversion
utility? bin? contrib? pgfoundry?
cheers
andrew

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgman(at)candle(dot)pha(dot)pa(dot)us, chriskl(at)familyhealth(dot)com(dot)au, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Different length lines in COPY CSV
Date: 2005-12-12 16:10:41
Message-ID: 2548.1134403841@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> What's been suggested in the past is some sort of standalone
>> file-format-conversion utility,
> Anyway, that's history now. Where would you want this file conversion
> utility? bin? contrib? pgfoundry?
I'd say pgfoundry for starters --- there's no reason to tie it down
to server release cycles. Maybe when the thing is fairly mature and
doesn't need frequent releases, we could think about whether it ought
to be brought into the core distro.
However, it likely won't ever be a candidate to become part of core
unless it's written in C, and offhand I would judge C to not be the
best choice of implementation language for such a thing. This is surely
going to be mostly a string-pushing type of problem, so something like
perl might be a better bet.
regards, tom lane

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgman(at)candle(dot)pha(dot)pa(dot)us, chriskl(at)familyhealth(dot)com(dot)au, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Different length lines in COPY CSV
Date: 2005-12-12 20:08:37
Message-ID: 439DD8C5.6050502@dunslane.net

Tom Lane wrote:
>> Where would you want this file conversion
>> utility? bin? contrib? pgfoundry?
>
> I'd say pgfoundry for starters --- there's no reason to tie it down
> to server release cycles. Maybe when the thing is fairly mature and
> doesn't need frequent releases, we could think about whether it ought
> to be brought into the core distro.
>
> However, it likely won't ever be a candidate to become part of core
> unless it's written in C, and offhand I would judge C to not be the
> best choice of implementation language for such a thing. This is surely
> going to be mostly a string-pushing type of problem, so something like
> perl might be a better bet.

You are probably right. The biggest wrinkle will be dealing with various
encodings, I suspect. That at least is one thing that doing CSV within
the backend bought us fairly painlessly. Perl's Text::CSV_XS module for
example simply handles this by declaring that only [\x09\x20-\x7f] are
valid in its non-binary mode, and in either mode appears to be MBCS
unaware. We should try to do better than that.
cheers
andrew

From: Tino Wildenhain <tino(at)wildenhain(dot)de>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgman(at)candle(dot)pha(dot)pa(dot)us, chriskl(at)familyhealth(dot)com(dot)au, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Different length lines in COPY CSV
Date: 2005-12-12 20:30:12
Message-ID: 1134419412.4975.86.camel@Andrea.peacock.de

On Monday, 12.12.2005, at 15:08 -0500, Andrew Dunstan wrote:
>
> Tom Lane wrote:
...
>
> You are probably right. The biggest wrinkle will be dealing with various
> encodings, I suspect. That at least is one thing that doing CSV within
> the backend bought us fairly painlessly. Perl's Text::CSV_XS module for
> example simply handles this by declaring that only [\x09\x20-\x7f] are
> valid in its non-binary mode, and in either mode appears to be MBCS
> unaware. We should try to do better than that.
Are there any test data files available in a repository?
I think I could give it a shot.
If not, maybe we could set up something like that.
Regards
Tino

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tino Wildenhain <tino(at)wildenhain(dot)de>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgman(at)candle(dot)pha(dot)pa(dot)us, chriskl(at)familyhealth(dot)com(dot)au, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Different length lines in COPY CSV
Date: 2005-12-12 20:35:45
Message-ID: 20051212203535.GD30160@svana.org

On Mon, Dec 12, 2005 at 09:30:12PM +0100, Tino Wildenhain wrote:
> On Monday, 12.12.2005, at 15:08 -0500, Andrew Dunstan wrote:
> > You are probably right. The biggest wrinkle will be dealing with various
> > encodings, I suspect. That at least is one thing that doing CSV within
> > the backend bought us fairly painlessly. Perl's Text::CSV_XS module for
> > example simply handles this by declaring that only [\x09\x20-\x7f] are
> > valid in its non-binary mode, and in either mode appears to be MBCS
> > unaware. We should try to do better than that.
>
> Are there any test data files available in a repository?
> I think I could give it a shot.
>
> If not, maybe we could set up something like that.
Note that recent versions of Perl allow you to specify the file encoding
when you open the file and will convert things to UTF-8 as appropriate.
So in theory it should be fairly simple to make a script that could
handle various encodings. The hardest part is always determining which
encoding a file is in, in the first place...
Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
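
A minimal sketch of the encoding handling Martijn describes, using
Perl's PerlIO :encoding() layers. The source encoding and file names
here are assumptions; whatever the real input encoding is would go in
the input layer:

#!/usr/bin/perl
# Minimal sketch: re-encode a CSV file to UTF-8 with PerlIO :encoding() layers.
# The source encoding and file names are assumptions.
use strict;
use warnings;

my ($in_name, $out_name) = ('data-latin1.csv', 'data-utf8.csv');

open my $in,  '<:encoding(iso-8859-1)', $in_name  or die "cannot read $in_name: $!";
open my $out, '>:encoding(UTF-8)',      $out_name or die "cannot write $out_name: $!";

while (my $line = <$in>) {
    print {$out} $line;                 # decoded on read, re-encoded to UTF-8 on write
}

close $in;
close $out or die "error writing $out_name: $!";

The converted file could then be loaded with client_encoding set to
UTF8, which leaves only the problem noted above: figuring out what the
source encoding was to begin with.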

From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Different length lines in COPY CSV
Date: 2005-12-13 02:05:18
Message-ID: 439E2C5E.8060003@familyhealth.com.au

> Anyway, that's history now. Where would you want this file conversion
> utility? bin? contrib? pgfoundry?
How about a full SQL*Loader clone? :D