From: Atul Kumar <akumar14871(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: postgres large database backup
Date: 2022-11-30 15:40:23
Message-ID: CA+ONtZ7CdiJ2X8x_8sPHR+tz-vv7mR1uoDzLia3fzWVZA9k0JQ@mail.gmail.com
Hi,
I have a 10TB database on PostgreSQL 11, running on CentOS 7 "on premises",
and I need to schedule backups of this database in a faster way.
The scheduled backups will be used for PITR purposes.
So please let me know how I can take a quicker backup of my 10TB database.
Is there any tool to take backups, and subsequently incremental backups,
quickly, and to restore them quickly for PITR when required?
What should be the exact approach for scheduling such backups so that they
can be restored quickly?
Regards.
From: Michael Loftis <mloftis(at)wgops(dot)com>
To: Atul Kumar <akumar14871(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: postgres large database backup
Date: 2022-11-30 16:24:55
Message-ID: CAHDg04tXCrWPpB13RrXYxFWoC3neyjKNWBQC8-Ju_f+uNPy9yw@mail.gmail.com
On Wed, Nov 30, 2022 at 8:40 AM Atul Kumar <akumar14871(at)gmail(dot)com> wrote:
>
> Hi,
>
> I have a 10TB database running on postgres 11 version running on centos 7 "on premises", I need to schedule the backup of this database in a faster way.
>
> The scheduled backup will be used for PITR purposes.
>
> So please let me know how I should do it in a quicker backup for my 10TB database ? Is there any tool to take backups and subsequently incremental backups in a faster way and restore it for PITR in a faster way when required.
>
> What should be the exact approach for scheduling such backups so that it can be restored in a faster way ?
Faster than *what*?
If speed is the primary criterion, filesystem snapshots would be the
fastest to restore: use pg_start_backup() to put the DB cluster into backup
mode, take the snapshot, then run pg_stop_backup(), and capture the WALs
generated alongside your FS snapshot, keeping it all on the same machine or
on shared storage. To restore, bring back the old snapshot plus the
captured WALs with the DB shut down/stopped; startup is then a normal
"crash recovery", or you can pick a PITR target/LSN within the short
pg_start_backup() ... pg_stop_backup() window. If you're properly archiving
WALs beyond JUST the full backup, you can PITR to any point after the full
backup snapshot, but the more transactions/WAL recovery has to process to
reach the desired point, the longer it takes.
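(A rough sketch of that low-level approach, assuming the data directory
sits on an LVM logical volume; the volume group, LV names and snapshot size
are made up, and pg_start_backup()/pg_stop_backup() are the PostgreSQL 11
names for these functions:)

    # Put the cluster into backup mode with a fast checkpoint
    # (exclusive mode: deprecated but still available on PG 11)
    psql -U postgres -c "SELECT pg_start_backup('fs_snapshot', true);"

    # Atomic snapshot of the volume holding the data directory
    lvcreate --snapshot --name pgdata_snap --size 100G /dev/vg_data/lv_pgdata

    # Leave backup mode
    psql -U postgres -c "SELECT pg_stop_backup();"

    # Keep the snapshot together with every WAL segment archived from
    # pg_start_backup() onward; that pair is what you restore from.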
pgbackrest can back up a PG cluster in multiple ways (including taking a
base backup while actively streaming WALs, or acting as the WAL archiver),
and a restore on the same machine as the backup repository would basically
be limited by I/O (well, unless you've got all NVMe, in which case CPU,
bus, or memory bandwidth constraints become the limiting factor).
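(For illustration, a minimal pgBackRest setup along those lines; the stanza
name "main", the paths and the worker count are assumptions, not values
from this thread:)

    # /etc/pgbackrest/pgbackrest.conf
    [global]
    repo1-path=/backup/pgbackrest       # backup repository location
    repo1-retention-full=2              # keep two full backups
    process-max=8                       # parallel compression/transfer workers

    [main]
    pg1-path=/var/lib/pgsql/11/data     # PGDATA of the cluster being backed up

    # postgresql.conf: let pgBackRest act as the WAL archiver
    #   archive_mode = on
    #   archive_command = 'pgbackrest --stanza=main archive-push %p'

    # One-time initialisation, then a full backup
    pgbackrest --stanza=main stanza-create
    pgbackrest --stanza=main --type=full backup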
Basically, no matter how you back up, 10TB takes a long time to copy, and
except in the "local FS snapshot" method I outlined above, your limiting
factor is going to be how fast you can move the data back to where you
need it.
For critical DBs of this nature I've actually done almost exactly the
method I just outlined, only the backup/snapshot process happens on a
replica. *NORMAL* failure recovery in that replicated cluster is by
failover, but for an actual backup restore due to disaster, or a need to
go back in time (which is...extremely rare...), there's some manual
intervention to bring up a snapshot and play back WALs to the point in
time we want the DB cluster at.
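(One way to take that snapshot on a replica, sketched under the assumption
that the standby's data directory lives on its own ZFS dataset; the dataset
name and paths are hypothetical, and the standby is stopped briefly so the
copy is trivially consistent:)

    # On the streaming replica: stop it so the data directory is quiescent
    pg_ctl -D /var/lib/pgsql/11/data stop -m fast

    # Near-instant, atomic snapshot of the dataset holding PGDATA
    zfs snapshot tank/pgdata@backup_$(date +%Y%m%d)

    # Bring the replica back up; it resumes streaming from the primary
    pg_ctl -D /var/lib/pgsql/11/data start

    # That snapshot plus the WAL archived from the primary is the restore point.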
>
>
>
> Regards.
--
"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler
From: Christophe Pettus <xof(at)thebuild(dot)com>
To: Atul Kumar <akumar14871(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: postgres large database backup
Date: 2022-11-30 17:31:37
Message-ID: 75EDB99D-C34A-426E-AD6A-A904E2BE96A5@thebuild.com
> On Nov 30, 2022, at 07:40, Atul Kumar <akumar14871(at)gmail(dot)com> wrote:
> So please let me know how I should do it in a quicker backup for my 10TB database ? Is there any tool to take backups and subsequently incremental backups in a faster way and restore it for PITR in a faster way when required.
For an initial backup, you'll need to copy 10TB to another medium; there's no getting around that. pgBackRest supports incremental and differential backups, which are especially useful if there are tables that are largely for archival purposes and are not frequently modified.
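(A sketch of how that might look day to day; the stanza name and the
schedule are illustrative assumptions:)

    # Weekly full, daily differential, hourly incremental (for example)
    pgbackrest --stanza=main --type=full backup
    pgbackrest --stanza=main --type=diff backup
    pgbackrest --stanza=main --type=incr backup

    # Point-in-time restore: stop PostgreSQL, then restore and recover to a target time
    pgbackrest --stanza=main --delta --type=time \
        "--target=2022-11-30 12:00:00" restore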
From: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
To: Atul Kumar <akumar14871(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: postgres large database backup
Date: 2022-11-30 17:41:02
Message-ID: CAM+6J96m9rM+jXw8n_-_ZyiUDXOQ555XH0UOJEt6TLFYH2zLDA@mail.gmail.com
On Wed, Nov 30, 2022, 9:10 PM Atul Kumar <akumar14871(at)gmail(dot)com> wrote:
> Hi,
>
> I have a 10TB database running on postgres 11 version running on centos 7
> "on premises", I need to schedule the backup of this database in a faster
> way.
>
> The scheduled backup will be used for PITR purposes.
>
> So please let me know how I should do it in a quicker backup for my 10TB
> database ? Is there any tool to take backups and subsequently incremental
> backups in a faster way and restore it for PITR in a faster way when
> required.
>
> What should be the exact approach for scheduling such backups so that it
> can be restored in a faster way ?
>
We had a bare-metal server with 96 CPUs, 385 GB RAM, NVMe storage and a 10G
network. We used pgBackRest for full backups; it supports PITR and
differential backups.
A 28 TB database took 2.5 hours to back up to remote storage, and restoring
from that remote storage took 3.5 hours when restored immediately (much of
the remaining time is WAL replay to catch up).
pg_basebackup took 9 hours.
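(For comparison, a typical pg_basebackup invocation; the host and user
names are placeholders. On PostgreSQL 11, pg_basebackup copies everything
over a single connection, which is part of why it tends to be slower than a
parallel tool like pgBackRest:)

    # Tar-format, gzip-compressed base backup with WAL streamed alongside
    pg_basebackup -h primary.example.com -U replicator \
        -D /backup/base_$(date +%Y%m%d) -Ft -z -Xs -P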
From: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-11-30 19:01:08
Message-ID: 9d50f118-a3dd-565a-41bd-d3056b9b90d6@gmail.com
On 11/30/22 10:40, Atul Kumar wrote:
> Hi,
>
> I have a 10TB database running on postgres 11 version running on
> centos 7 "on premises", I need to schedule the backup of this database
> in a faster way.
>
> The scheduled backup will be used for PITR purposes.
>
> So please let me know how I should do it in a quicker backup for my
> 10TB database ? Is there any tool to take backups and subsequently
> incremental backups in a faster way and restore it for PITR in a
> faster way when required.
>
> What should be the exact approach for scheduling such backups so that
> it can be restored in a faster way ?
>
Well, that depends on your configuration details. Is your DB located on a
SAN device or JBOD? If it's a SAN, then it can do snapshots. All enterprise
backup utilities like NetBackup, TSM, Avamar or Commvault support
snapshots. They can all do full and incremental backups, and they can also
do PITR. Your database is an enterprise-sized database and needs an
enterprise-level tool. If you want a freebie, pgbackrest is the best thing
no money can buy. Pgbackrest supports parallelism, and full and incremental
backups.
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
From: Ron <ronljohnsonjr(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-11-30 19:06:53
Message-ID: bc763ac9-e11b-051c-cfb7-8de60666c08b@gmail.com
On 11/30/22 11:41, Vijaykumar Jain wrote:
>
> On Wed, Nov 30, 2022, 9:10 PM Atul Kumar <akumar14871(at)gmail(dot)com> wrote:
>
> Hi,
>
> I have a 10TB database running on postgres 11 version running on
> centos 7 "on premises", I need to schedule the backup of this database
> in a faster way.
>
> The scheduled backup will be used for PITR purposes.
>
> So please let me know how I should do it in a quicker backup for my
> 10TB database ? Is there any tool to take backups and subsequently
> incremental backups in a faster way and restore it for PITR in a
> faster way when required.
>
> What should be the exact approach for scheduling such backups so that
> it can be restored in a faster way ?
>
>
>
> We had a 96 cpu, 385gb ram, nvme storage and 10g network baremetal server.
> We used pgbackrest for full backup.
> It supports pitr and differential backup.
>
> 28tb db took 2.5 hours for backup on remote storage, and restore from the
> remote storage took 3.5 hours when immediately restored (a lot of time is
> later due to wal replay to catch up)
Another vote for pgbackrest. It supports AES encryption and multiple forms
of compression.
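(The relevant repository options, shown with illustrative values; the
passphrase and compression choice are assumptions:)

    # /etc/pgbackrest/pgbackrest.conf (additions)
    [global]
    repo1-cipher-type=aes-256-cbc       # encrypt the backup repository
    repo1-cipher-pass=CHANGE_ME         # passphrase; restrict read access to the backup user
    compress-type=zst                   # gz, bz2 and lz4 are also supported
    compress-level=3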
--
Angular momentum makes the world go 'round.
From: Hannes Erven <hannes(at)erven(dot)at>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-11-30 23:19:38
Message-ID: 592a2c79-9728-e450-ad60-a0df7dd0f214@erven.at
Am 30.11.22 um 20:01 schrieb Mladen Gogala:
> On 11/30/22 10:40, Atul Kumar wrote:
>> Hi,
>>
>> I have a 10TB database running on postgres 11 version running on
>> centos 7 "on premises", I need to schedule the backup of this
>> database in a faster way.
>>
>> The scheduled backup will be used for PITR purposes.
>>
> Well, that depends on your configuration details. Is your DB located
> on a SAN device or JBOD?
>
You could also use a filesystem that can do atomic snapshots - like ZFS.
However, I'm wondering why none of the previous respondents mentioned it.
Sure, ZFS may have its own performance implications... but my experience is
that it makes "safety snapshots" and "on-demand throw-away testing
environments" very convenient to use.
Best regards,
-hannes
From: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-12-01 01:03:33
Message-ID: 303273b8-6116-23b2-f26d-866e38ef1c0a@gmail.com
On 11/30/22 18:19, Hannes Erven wrote:
> You could also use a filesystem that can do atomic snapshots - like ZFS.
Uh, oh. Not so sure about that. Here is a page from the world of the big
O: https://blog.docbert.org/oracle-on-zfs/
However, something similar can be said about ZFS: ZFS snapshots will slow
down I/O considerably. I would definitely prefer snapshots done in hardware
rather than in software. My favorite file systems, depending on the type of
disk, are F2FS and XFS.
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
From: Michael Loftis <mloftis(at)wgops(dot)com>
To: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-12-01 01:41:39
Message-ID: CAHDg04u0jMySKG+dN_+zpGa=_fAT-rZ2XpQPR_rOVKS7FOHMXQ@mail.gmail.com
On Wed, Nov 30, 2022 at 18:03 Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com> wrote:
> On 11/30/22 18:19, Hannes Erven wrote:
>
> You could also use a filesystem that can do atomic snapshots - like ZFS.
>
> Uh, oh. Not so sure about that. Here is a page from the world of the big
> O: https://blog.docbert.org/oracle-on-zfs/
>
> However, similar can be said about ZFS. ZFS snapshots will slow down the
> I/O considerably. I would definitely prefer snapshots done in hardware and
> not in software. My favorite file systems, depending on the type of disk,
> are F2FS and XFS.
>
ZFS snapshots don’t typically have much if any performance impact versus
not having a snapshot (and already being on ZFS) because it’s already doing
COW style semantics.
Getting Postgres write performance on ZFS is tricky because it's super
important to match up the underlying I/O sizes with the device/ZFS ashift,
the ZFS recordsize, and the DB's page/WAL page sizes; not getting this
right causes performance issues even without any snapshots, because again,
COW. If you're constantly breaking a record block or sector there's going
to be a big impact, and it won't be any worse (in my own testing) whether
you have snapshots or not. Snapshots on ZFS don't cause any crazy write
amplification by themselves (I'm not sure they cause any extra writes at
all, I'd have to do some sleuthing).
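(An illustrative dataset layout along those lines; the pool and dataset
names are made up, and the property values are common starting points
rather than recommendations from this thread:)

    # Match recordsize to PostgreSQL's 8 kB heap pages; WAL is mostly sequential,
    # so a larger recordsize there is a common choice
    zfs create -o recordsize=8k -o compression=lz4 -o atime=off tank/pgdata
    zfs create -o recordsize=128k -o compression=lz4 -o atime=off tank/pgwal

    # ashift is fixed per vdev at pool creation time, e.g. for 4 kB-sector drives:
    #   zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

    zfs get recordsize,compression,atime tank/pgdata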
ZFS will yes be slower than a raw disk (but that’s not an option for Pg
anyway), and may or may not be faster than a different filesystem on a HW
RAID volume or storage array volume. It absolutely takes more
care/clue/tuning to get Pg write performance on ZFS, and ZFS does duplicate
some of Pg’s resiliency so there is duplicate work going on.
I'd say that 2016 article is really meaningless, as ZFS, Oracle, and
Postgres have all evolved dramatically in six years -- even more so since
there's nothing remotely like ASM for Postgres.
>
> --
> Mladen Gogala
> Database Consultant
> Tel: (347) 321-1217https://dbwhisperer.wordpress.com
>
> --
"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler
From: Ron <ronljohnsonjr(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-12-01 01:51:31
Message-ID: 31e934ad-5e3e-22d4-f8b1-0fdee463eb42@gmail.com
On 11/30/22 19:41, Michael Loftis wrote:
>
>
> On Wed, Nov 30, 2022 at 18:03 Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com> wrote:
>
> On 11/30/22 18:19, Hannes Erven wrote:
>> You could also use a filesystem that can do atomic snapshots - like ZFS.
>
> Uh, oh. Not so sure about that. Here is a page from the world of the
> big O: https://blog.docbert.org/oracle-on-zfs/
>
> However, similar can be said about ZFS. ZFS snapshots will slow down
> the I/O considerably. I would definitely prefer snapshots done in
> hardware and not in software. My favorite file systems, depending on
> the type of disk, are F2FS and XFS.
>
>
> ZFS snapshots don’t typically have much if any performance impact versus
> not having a snapshot (and already being on ZFS) because it’s already
> doing COW style semantics.
>
> Postgres write performance using ZFS is difficult because it’s super
> important to match up the underlying I/O sizes to the device/ZFS ashift,
> the ZFS recordsize, and the DB’s page/wal page sizes though, but not
> getting this right also cause performance issues without any snapshots,
> because again COW. If you’re constantly breaking a record block or sector
> there’s going to be a big impact. It won’t be any worse (in my own
> testing) regardless of if you have snapshots or not. Snapshots on ZFS
> don’t cause any crazy write amplification by themselves (I’m not sure
> they cause any extra writes at all, I’d have to do some sleuthing)
>
> ZFS will yes be slower than a raw disk (but that’s not an option for Pg
> anyway), and may or may not be faster than a different filesystem on a HW
> RAID volume or storage array volume. It absolutely takes more
> care/clue/tuning to get Pg write performance on ZFS, and ZFS does
> duplicate some of Pg’s resiliency so there is duplicate work going on.
I wonder what percentage of /Big Databases/ (like the OP's and Vijaykumar's) are
still on physical servers, as opposed to VMs connected to SANs. Even many
physical servers are connected to SANs. (That is, of course, in the dreaded
Enterprise environment.)
--
Angular momentum makes the world go 'round.
From: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
To: Michael Loftis <mloftis(at)wgops(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-12-01 13:40:52
Message-ID: d45f9147-f694-713b-ac83-00b47a42f241@gmail.com
On 11/30/22 20:41, Michael Loftis wrote:
>
> ZFS snapshots don’t typically have much if any performance impact
> versus not having a snapshot (and already being on ZFS) because it’s
> already doing COW style semantics.
>
Hi Michael,
I am not sure that such a statement holds water. When a snapshot is taken,
the number of necessary I/O requests goes up dramatically. For every block
the snapshot points to, it is necessary to read the block, write it to a
spare location, and only then overwrite it, if you want to write to a block
the snapshot points to. That gives 3 I/O requests for every block written.
NetApp tries to optimize this by using 64MB blocks, but ZFS on Linux cannot
do that; they have to use standard CoW because they don't have the benefit
of their own hardware and OS. And standard CoW triples the number of I/O
requests for every write to the blocks pointed to by the snapshot, for
every snapshot. CoW is a very expensive animal, with horns.
Regards
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
From: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-12-01 13:47:39
Message-ID: d901c987-76e5-873e-2e5d-b358535b11b2@gmail.com
On 11/30/22 20:51, Ron wrote:
> ZFS will yes be slower than a raw disk (but that’s not an option for
> Pg anyway), and may or may not be faster than a different filesystem
> on a HW RAID volume or storage array volume. It absolutely takes more
> care/clue/tuning to get Pg write performance on ZFS, and ZFS does
> duplicate some of Pg’s resiliency so there is duplicate work going on.
Ron, if this raw-device reference is about ASM: Oracle has a file system on
top of ASM, called ACFS, and I have been able to run PostgreSQL on top of
ACFS. The reason to do that is redundancy. ASM/ACFS with PostgreSQL behaves
similarly to Veritas Cluster: when one Postgres cluster goes down, the
other one is started. And you don't have to pay for it, unless you start
using storage snapshots; that ACFS feature requires a commercial license.
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
From: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
To: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
Cc: Michael Loftis <mloftis(at)wgops(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: postgres large database backup
Date: 2022-12-01 14:18:38
Message-ID: CAM+6J94-1Nr6jvRKja=WWFhG2tEd90yft1fcxB6GBUWiuYLV8A@mail.gmail.com
On Thu, Dec 1, 2022, 7:11 PM Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com> wrote:
> On 11/30/22 20:41, Michael Loftis wrote:
>
>
> ZFS snapshots don’t typically have much if any performance impact versus
> not having a snapshot (and already being on ZFS) because it’s already doing
> COW style semantics.
>
> Hi Michael,
>
> I am not sure that such statement holds water. When a snapshot is taken,
> the amount of necessary I/O requests goes up dramatically. For every block
> that snapshot points to, it is necessary to read the block, write it to the
> spare location and then overwrite it, if you want to write to a block
> pointed by snapshot. That gives 3 I/O requests for every block written.
> NetApp is trying to optimize it by using 64MB blocks, but ZFS on Linux
> cannot do that, they have to use standard CoW because they don't have the
> benefit of their own hardware and OS. And the standard CoW is tripling the
> number of I/O requests for every write to the blocks pointed to by the
> snapshot, for every snapshot. CoW is a very expensive animal, with horns.
>
I am not an expert in this area, but we have ZFS for specific instances
which hold timeseries/event-log data, where we also need compression.
One day there was a need to snapshot a 35 TB ZFS pool and send it across
the network to a replica, because both disks in the mirror degraded around
the same time. I do not recall the ZFS snapshot taking anything resource
intensive, and it was quick; I'll ask around for the actual time though.
We have more than 500 of these nodes with ZFS (each having 20 disks in
mirrors, 8 TB each) for event logs with compression, and ZFS works just
fine. This is a special setup where the data is assumed to be cold storage,
hence the compression, so it was designed for heavy writes and occasional
read queries, only for debugging.
>
From: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
To: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
Cc: Michael Loftis <mloftis(at)wgops(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: postgres large database backup
Date: 2022-12-01 14:40:21
Message-ID: CAM+6J94dnY-fG-JbvJmM08iwXNeQ1ukkFi0W+UgXQsR_T4wg1Q@mail.gmail.com
> I do not recall zfs snapshots took anything resource intensive, and it
> was quick.ill ask around for actual time.
>
OK, just a small note: our ingestion pattern is write anywhere, read
globally, so we did stop ingestion while the snapshot was taken, as we
could afford to do it that way. Maybe the story is different when a
snapshot is taken on a live system that generates a lot of delta.
>
From: Michael Loftis <mloftis(at)wgops(dot)com>
To: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-12-01 16:21:08
Message-ID: CAHDg04sGAfTjsDO1Gt6T4Eq5K5X_Emtma+4iZUBMCBF5bWQJWA@mail.gmail.com
On Thu, Dec 1, 2022 at 06:40 Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com> wrote:
> On 11/30/22 20:41, Michael Loftis wrote:
>
>
> ZFS snapshots don’t typically have much if any performance impact versus
> not having a snapshot (and already being on ZFS) because it’s already doing
> COW style semantics.
>
> Hi Michael,
>
> I am not sure that such statement holds water. When a snapshot is taken,
> the amount of necessary I/O requests goes up dramatically. For every block
> that snapshot points to, it is necessary to read the block, write it to the
> spare location and then overwrite it, if you want to write to a block
> pointed by snapshot. That gives 3 I/O requests for every block written.
> NetApp is trying to optimize it by using 64MB blocks, but ZFS on Linux
> cannot do that, they have to use standard CoW because they don't have the
> benefit of their own hardware and OS. And the standard CoW is tripling the
> number of I/O requests for every write to the blocks pointed to by the
> snapshot, for every snapshot. CoW is a very expensive animal, with horns.
>
Nope, ZFS does not behave that way. Yup AFAIK all other snapshotting
filesystems or volume managers do. One major architectural decision of ZFS
is the atomicity of writes. Data at rest stays at rest. Thus it does NOT
overwrite live data. Snapshots do not change the write path/behavior in
ZFS. In ZFS writes are atomic, you’re always writing new data to free
space, and accounting for where the current record/volume block within a
file or volume actually lives on disk. If a filesystem, volume manager, or
RAID system overwrites data in place and hits an issue partway through that
write, and that data is also live data, you can't be atomic: you've now
destroyed data (the RAID write hole is one example of this). That’s why
adding a snapshot isn’t an additional cost for ZFS. For
better or worse you're paying that snapshot cost already because it already
does not overwrite live data. If there's no snapshot once the write is
committed and the refcount is zero for the old blocks, and it's safe (TXG
committed), those old blocks go back to the free pool to be potentially
used again. There's a bunch of optimization to that and how it actually
happens, but at the end of the day, your writes do not overwrite your data
in ZFS, writes of data get directed at free space, and eventually the
on-disk structures get an atomic update that happens to say it now lives
here. In the time between that all happening the ZIL (which may live on
its own special devices called SLOG -- this is why you often see the terms
ZIL/journal/SLOG/log vdev used interchangeably) is the durable bit, but
that's never normally read, it's only read back during recovery. This is
also where the ZFS filesystem property of recordsize or volblocksize
(independently configurable on every filesystem/volume within a pool) is
important for performance. If you clobber a whole record ZFS isn't going
to read anything extra when it gets around to committing, it knows the
whole record changed and can safely write a whole new record (every 5s it
goes about this TXG commit, so two 64k writes are still slower with a 128k
recordsize, but still shouldn't pull in that 128k record). There's other
optimizations there, but at the end of the day as long as the chosen
recordsize/volblocksize that matches up to your writes, and your writes are
aligned to that within your file or volume, you'll not see an extra read of
the data as part of it's normal flow of committing data. Snapshots don't
change that.
Because of those architectural decisions, CoW behavior is part of ZFS'
existing performance penalty, so when you look at that older Oracle ASM vs
ZFS article, remember that that extra...what was it 0.5ms?... is accounting
for most, probably all of the penalties for a snapshot too if you want (or
need) it. It's fundamental to how ZFS works and provides data
durability+atomicity. This is why ZFS calls its snapshots essentially
free: you're already paying the performance cost for them. What would
ASM do if it had a snapshot to manage? Or a few dozen on the same data?
Obviously during the first writes to those snapshotted areas you'd see it.
Ongoing performance penalties with those snapshots? Maybe ASM has an
optimization that saves that benchmark a bunch of time if there is no
snapshot. But once one exists it takes a different write path and adds a
performance penalty? If a snapshot was taken in the middle of the
benchmark? Yeah there's going to be some extra IOPS when you take the
snapshot to say "a snapshot now exists" for ZFS, but that doesn't
dramatically change its underlying write path after that point.
That atomicity and data durability also means that even if you lose the
SLOG devices (which hold the ZIL/journal, if you don't have SLOG/log vdev
then it's in-pool) you do not lose all the data. Only stuff that somehow
remained uncommitted after the ZIL failed. Say you had some sort of hard
fault/crash, the SLOG/ZIL devices were destroyed, you can still opt to
mount the ZFS pool, and filesystems/volumes, without that ZIL, which could
(well, would) still suck, but would be better than just losing everything.
If the ZIL fails while the system is live, ZFS is going to do its best to
get everything committed ASAP as soon as it knows something is wrong, and
keep it that way. So on a SLOG/ZIL failure your performance WILL suffer
(and boy it's UGLY, but at least it's not dead and destroyed). And because
of the atomicity property, even if there are further failures during that
window of time while it scrambles to commit, ZFS does not wreck the
filesystem. If the devices are still available it'll still provide
whatever data it can back to you.
So there's a very different view of what's important with ZFS: it's not
that performance doesn't matter, it's that your data matters more.
Performance is NOT ignored, but to get that atomicity and durability you
ARE paying some performance cost. Is that worth it for YOUR database or
files? Only you as the admin can decide that. No, ZFS is NOT a great choice
for every database or dataset! For some workloads that penalty is not going
to be acceptable.
So writes in ZFS go to the journal (ZIL) first, barring config property
tweaks. Once a journal entry is durable it's considered written, but
uncommitted. If we crash at that point, journal recovery brings the pool
back to the last written journal entry, so ZFS is never lying to the
application or OS.
Data is always written back atomically: new or changed data goes to a free
block, never on top of the existing block, so if something bombs in the
middle of THAT, you're fine. When a whole block (recordsize or
volblocksize) of sectors is written at once, or within a window of time,
there's no read of the original data once the writes coalesce and the
commit completes. There is, of course, the free-space and reference-count
metadata. But if that hasn't changed it isn't rewritten, and it's needed
anyway to find where the data lives, so it's already present. Best case,
the first write to a block does incur an extra write in the area where that
metadata is kept (which may itself be coalesced with other writes towards
it), but it's not as if every write pays that overhead to maintain the old
block, especially if your write replaces an entire recordsize-sized block.
After that, until a snapshot is added or removed, the references to that
old block don't change again. When there are no more references to a given
block, it goes back into the free pool.
Having a snapshot doesn’t add more work. It’s a byproduct of the atomic
write behavior that ZFS implements (always write to free blocks, always
write a sector). You're just asking ZFS to not free those blocks.
Regards
>
> --
> Mladen Gogala
> Database Consultant
> Tel: (347) 321-1217https://dbwhisperer.wordpress.com
>
>
From: Michael Loftis <mloftis(at)wgops(dot)com>
To: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-12-01 16:35:31
Message-ID: CAHDg04vE914BtRx9UD-O9WcZgGoT6giZ3aq9tqyjCCebf_AbyA@mail.gmail.com
On Thu, Dec 1, 2022 at 9:21 AM Michael Loftis <mloftis(at)wgops(dot)com> wrote:
>
>
>
> On Thu, Dec 1, 2022 at 06:40 Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com> wrote:
>>
>> On 11/30/22 20:41, Michael Loftis wrote:
>>
>>
>> ZFS snapshots don’t typically have much if any performance impact versus not having a snapshot (and already being on ZFS) because it’s already doing COW style semantics.
>>
>> Hi Michael,
>>
>> I am not sure that such statement holds water. When a snapshot is taken, the amount of necessary I/O requests goes up dramatically. For every block that snapshot points to, it is necessary to read the block, write it to the spare location and then overwrite it, if you want to write to a block pointed by snapshot. That gives 3 I/O requests for every block written. NetApp is trying to optimize it by using 64MB blocks, but ZFS on Linux cannot do that, they have to use standard CoW because they don't have the benefit of their own hardware and OS. And the standard CoW is tripling the number of I/O requests for every write to the blocks pointed to by the snapshot, for every snapshot. CoW is a very expensive animal, with horns.
And if you want to know more, Ars Technica wrote a good ZFS 101 article --
the write semantics I described in overview are on page three:
https://arstechnica.com/information-technology/2020/05/zfs-101-understanding-zfs-storage-and-performance/3/
--
"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler
From: hvjunk <hvjunk(at)gmail(dot)com>
To: Hannes Erven <hannes(at)erven(dot)at>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: postgres large database backup
Date: 2022-12-06 04:44:15
Message-ID: 52B3F216-2F91-4442-B6CA-7EE6FE960351@gmail.com
> On 01 Dec 2022, at 01:19, Hannes Erven <hannes(at)erven(dot)at> wrote:
>
> You could also use a filesystem that can do atomic snapshots - like ZFS.
>
> However, I'm wondering why none of the previous respondents mentioned it?
The reason: 'cause most people are "stuck" on public clouds that do not provide that as a "solution" in their cookie-cutter (wrongly called "managed") offerings...
> Sure, ZFS may have its own performance implications... but my experience is that it makes "safety snapshots" and "on-demand throw-away testing environments" very convenient to use.
I use that myself. Also, as a VM in a Proxmox environment, I've found (and restored from) Proxmox Backup Server to be a good single-snapshot backup solution.
My use cases don't need PITR yet.
From: Michael Loftis <mloftis(at)wgops(dot)com>
To: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
Cc: Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: postgres large database backup
Date: 2022-12-06 15:03:34
Message-ID: CAHDg04smet6qG0tJvzaP5S98qX+fueXPN+FCd57p56h_HWdqgg@mail.gmail.com
On Thu, Dec 1, 2022 at 7:40 AM Vijaykumar Jain
<vijaykumarjain(dot)github(at)gmail(dot)com> wrote:
>
>
>> I do not recall zfs snapshots took anything resource intensive, and it was quick.ill ask around for actual time.
>
>
> Ok just a small note, out ingestion pattern is write anywhere, read globally. So we did stop ingestion while snapshot was taken as we could afford it that way. Maybe the story is different when snapshot is taken on live systems which generate a lot of delta.
A snapshot in ZFS would, at worst, copy the entire allocation tree and
adjust ref counters -- i.e. metadata, no data copy. I don't know if it even
works that hard to create a snapshot now; it might just make a marker. All
I know is they've always been fast/cheap.
A differential zfs send|recv based off two snapshots is also pretty damn
fast because it knows what's shared and only sends what changed.
There have definitely been major changes in how snapshots are created over
the years to make them even quicker (ISTR it's the "bookmarks" feature?).
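(A sketch of that snapshot-plus-differential-send workflow; the pool,
dataset and host names are hypothetical:)

    # Baseline: snapshot the dataset and send all of it to a backup host
    zfs snapshot tank/pgdata@base
    zfs send tank/pgdata@base | ssh backuphost zfs receive -F backup/pgdata

    # Later: snapshot again and send only the blocks changed since @base
    zfs snapshot tank/pgdata@daily1
    zfs send -i tank/pgdata@base tank/pgdata@daily1 | \
        ssh backuphost zfs receive backup/pgdata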
This is just a small pool on my local/home NAS (TrueNAS Scale) with around
40T of data... Note the -r: it's not creating one snapshot but, uhm,
*checks* 64 (-r also creates a snapshot of every volume/filesystem
underneath).
root@...:~ # time zfs snapshot -r tank@TESTSNAP0
0.000u 0.028s 0:00.32 6.2% 144+280k 0+0io 0pf+0w
root@...:~ #
I have no idea how many files are in there. My personal home directory and
dev tree are in one of those, and I've got at least half a dozen versions
of the Linux kernel, the FreeBSD kernel, and other source trees, plus quite
a few other Very Bushy(tm) source trees, so it's quite a fair number of
files.
So yeah, 28 msec, 64 snapshots... they're REALLY cheap to create, and since
you pay the performance cost already, they're not very expensive to
maintain. And the ongoing performance cost isn't awful, unlike in more
traditional snapshot systems. I will say that is a kind of optimal case,
because I have a very fast NVMe SLOG/ZIL and the box is otherwise
effectively idle. Destroying the freshly created snapshot takes about the
same time... so does destroying 6-month-old snapshots, though I don't have
a bonkers amount of changed data in my pool.
--
"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler