Re: BDR problem

Lists:	pgsql-general

From:	Charles Lynch <charleslynchpostgresql(at)gmail(dot)com>
To:	pgsql-general(at)postgresql(dot)org
Subject:	BDR problem
Date:	2015-09-11 21:21:41
Message-ID:	CAEoYqXBH1yLBH=Fzux4TC6SKjEqcDnRBYAvaznmGy7gE0C9SCQ@mail.gmail.com

For about a month now, we've been preparing to use a BDR cluster in a
production, multi-region setup on AWS. Our initial testing produced some
absolutely fantastic results, with replication delays of less than 150 ms
between Singapore, Ireland, and North Virginia, and this is with SSL
encryption.

We have just recently run into a problem. I created a test cluster only
within NV, and after about a week of working without any problems we got an
error: Unexpected EOF on SSL connection. I had seen something like this
before, but on initial cluster join, and chalked it up to me doing something
wrong. This was after a week of working without issue. I wasn't sure what
to do next. Restarting the database started producing errors like this:

LOG: starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
FATAL: mismatch in worker state, got 3, expected 1
LOG: starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
FATAL: mismatch in worker state, got 3, expected 1
FATAL: mismatch in worker state, got 3, expected 1
LOG: starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
LOG: worker process: bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1, (PID 20300) exited with exit code 1
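
To get a quick picture of how often the worker fails and with which state
values, the log can be summarized with something like the sketch below. The
function name summarize_mismatches is made up for illustration, and the
pattern only assumes the "FATAL: mismatch in worker state" wording shown
above; any log_line_prefix in front of it is ignored.

```shell
# Count "mismatch in worker state" failures per (got, expected) pair.
# Reads PostgreSQL log text on stdin; prints "<count> got=N expected=M".
summarize_mismatches() {
    sed -n 's/.*FATAL: *mismatch in worker state, got \([0-9][0-9]*\), expected \([0-9][0-9]*\).*/got=\1 expected=\2/p' |
        sort | uniq -c | sed 's/^ *//'
}

# Example with the lines from the log excerpt above.
summarize_mismatches <<'EOF'
LOG: starting background worker process "bdr
FATAL: mismatch in worker state, got 3, expected 1
FATAL: mismatch in worker state, got 3, expected 1
FATAL: mismatch in worker state, got 3, expected 1
EOF
# prints: 3 got=3 expected=1
```

On a real server the input would be the PostgreSQL log file rather than a
here-document.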

This would repeat. So I removed this node from the cluster using the proper
BDR commands and tried re-joining, but that just resulted in the value in
the error changing from 3 to 0 and the same errors repeating. I have BDR
completely automated and orchestrated using Chef, so I simply fired up a new
cluster and started over.

My problem is I don't know what caused this and, more importantly, I'm not
sure how to fix it / prevent it and I can't launch this into production
without figuring this out.

One other thing: I've seen a lot of conflicting information on how to set up
BDR on Ubuntu (using PPAs, which package to install, and where to get the
source). I'm now wondering whether I'm running an outdated version and
whether this issue has already been fixed. Here are my build steps; if
anyone has any comments on how to set up BDR better, please let me know.

I grab postgres 9.4.4 from here:
https://github.com/2ndQuadrant/bdr/archive/bdr-pg/REL9_4_4-1.tar.gz
and compile it with "./configure --prefix=/opt/psql --with-openssl && make
-j4 -s install"

then I compile and install the btree_gist module

then I get the BDR plugin from here:
https://github.com/2ndQuadrant/bdr/archive/bdr-plugin/0.9.2.tar.gz
and compile it with "./configure && make -j4 -s all && make install"

then init the db and set everything with config, ssl certs, and cluster
creation and joining.
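
Put together, the build steps above look roughly like the script below. It
is a sketch, not my exact Chef recipe. The two URLs, the /opt/psql prefix and
the configure/make lines are the ones quoted above. Everything else is an
assumption for illustration: the scratch directory, the use of curl and
"tar --strip-components", building btree_gist from contrib/ in the same
source tree, and putting /opt/psql/bin first in PATH so the plugin's
configure finds the matching pg_config. By default it only prints the
commands.

```shell
#!/bin/sh
# Sketch of the source build described above (dry-run unless DRY_RUN=0).
PREFIX=/opt/psql
PG_URL=https://github.com/2ndQuadrant/bdr/archive/bdr-pg/REL9_4_4-1.tar.gz
BDR_URL=https://github.com/2ndQuadrant/bdr/archive/bdr-plugin/0.9.2.tar.gz
WORK=${WORK:-/tmp/bdr-build}   # hypothetical scratch directory

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_bdr() {
    run mkdir -p "$WORK/pg" "$WORK/plugin" &&
    # 1. Patched PostgreSQL 9.4.4
    run sh -c "curl -fsSL '$PG_URL' | tar -xz --strip-components=1 -C '$WORK/pg'" &&
    run sh -c "cd '$WORK/pg' && ./configure --prefix='$PREFIX' --with-openssl && make -j4 -s install" &&
    # 2. btree_gist (assumed to be built from contrib/ in the same tree)
    run sh -c "cd '$WORK/pg/contrib/btree_gist' && make -s install" &&
    # 3. BDR plugin 0.9.2 (PATH override is an assumption, see above)
    run sh -c "curl -fsSL '$BDR_URL' | tar -xz --strip-components=1 -C '$WORK/plugin'" &&
    run sh -c "cd '$WORK/plugin' && PATH='$PREFIX/bin':\$PATH ./configure && make -j4 -s all && make install"
}

build_bdr
```

Running it with DRY_RUN=0 set in the environment performs the build instead
of printing it.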

Any help on this would be really appreciated.

Thanks guys

Charles

From:	Giovanni Maruzzelli <gmaruzz(at)gmail(dot)com>
To:	Charles Lynch <charleslynchpostgresql(at)gmail(dot)com>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: BDR problem
Date:	2015-09-14 06:51:57
Message-ID:	CALXCt0pU4KeRmCNZSjjSwmgVfNtrB+F_MCSf3Go1TiPyyX4c7A@mail.gmail.com

http://bdr-project.org/docs/next/index.html

--
Sincerely,

Giovanni Maruzzelli
Cell : +39-347-2665618

From:	Craig Ringer <craig(at)2ndquadrant(dot)com>
To:	Charles Lynch <charleslynchpostgresql(at)gmail(dot)com>
Cc:	"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: BDR problem
Date:	2015-09-14 09:37:12
Message-ID:	CAMsr+YG3Gn5_GqF0C16QhhD47GbSojtAVMD__bJvHt3wN+nCkA@mail.gmail.com

On 12 September 2015 at 05:21, Charles Lynch
<charleslynchpostgresql(at)gmail(dot)com> wrote:

> We have just recently run into a problem. I created a test cluster only
> within NV, and after about a week of working without any problems we got an
> error: Unexpected EOF on SSL connection. I had seen something like this
> before, but on initial cluster join, and chalked it up to me doing something
> wrong.

That's generally network level, though it could also occur if a worker
exits unexpectedly.

> This was after a week of working without issue. I wasn't sure what to
> do next. Restarting the database started producing errors like this:
>
> LOG: starting background worker process "bdr
> (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
> FATAL: mismatch in worker state, got 3, expected 1

That's ... very odd. It's violating a sanity check that shouldn't
really ever be triggered.

How exactly did you restart the database? Can you send more info on
your configuration via direct mail to me?

> This would repeat. So I removed this node from the cluster using the proper
> bdr commands and tried re-joining

You can't just re-join a removed node. Once it's removed it's removed
for ever. You have to drop the database (or re-initdb), create a new
blank database, and join it as a new node.

The reason for this is that when you remove the node the replication
slots on other nodes get dropped, so there's no record of what catchup
work needs to be done. It's not really possible to resync the node
with the rest after that. That's the point of node removal, to free
the resources from those slots when a node is retired, otherwise you'd
just switch it off.
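
As a rough sketch of that sequence, the helper below prints the SQL to run
in the freshly created, empty database on the node being added back. The
helper name rejoin_sql, the node name and both DSNs are placeholders, not
values from your cluster, and this is an outline under those assumptions
rather than a tested procedure. The functions bdr.bdr_group_join and
bdr.bdr_node_join_wait_for_ready are the ones documented for BDR 0.9.

```shell
# Print the SQL for joining a brand-new node. Run the output with psql
# against the new, empty database after dropping and recreating it.
rejoin_sql() {
    node_name=$1   # name for the new node (placeholder)
    local_dsn=$2   # DSN the other nodes use to reach this node
    peer_dsn=$3    # DSN of any existing cluster member
    cat <<EOF
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE EXTENSION IF NOT EXISTS bdr;
SELECT bdr.bdr_group_join(
    local_node_name   := '$node_name',
    node_external_dsn := '$local_dsn',
    join_using_dsn    := '$peer_dsn'
);
SELECT bdr.bdr_node_join_wait_for_ready();
EOF
}

# Example with made-up hosts; pipe the output to psql to apply it.
rejoin_sql node3b 'host=10.0.0.3 dbname=appdb' 'host=10.0.0.1 dbname=appdb'
```

On the surviving nodes, "SELECT slot_name, active FROM pg_replication_slots;"
shows which slots still exist, which is a quick way to confirm that the
removed node's slots are gone.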

> My problem is I don't know what caused this and, more importantly, I'm not
> sure how to fix it / prevent it and I can't launch this into production
> without figuring this out.

The "mismatch in worker state" error is very likely a bug. The
trick will be figuring out how you triggered it.

Did you retain the malfunctioning cluster, or have you deleted it?

> One other thing: I've seen a lot of conflicting information on how to set up
> BDR on Ubuntu (using PPAs, which package to install, and where to get the
> source). I'm now wondering whether I'm running an outdated version and
> whether this issue has already been fixed. Here are my build steps; if
> anyone has any comments on how to set up BDR better, please let me know.

You should use the apt repository referenced at:
http://bdr-project.org/docs/stable/installation-packages.html#INSTALLATION-PACKAGES-DEBIAN

Support is focused mainly on RHEL/CentOS/Fedora, but Debian/Ubuntu
packages are also produced. We're a little behind at the moment and
haven't got 0.9.2 packages out. I'll be pushing 0.9.3 soon and will
produce 0.9.3 packages for Debian/Ubuntu as well as for
Fedora/RHEL/CentOS.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Martín Marqués <martin(at)2ndquadrant(dot)com>
To:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Charles Lynch <charleslynchpostgresql(at)gmail(dot)com>
Cc:	"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: BDR problem
Date:	2015-09-14 20:32:09
Message-ID:	55F72EC9.3040900@2ndquadrant.com

On 14/09/15 at 06:37, Craig Ringer wrote:
>
> Support is focused mainly on RHEL/CentOS/Fedora, but Debian/Ubuntu
> packages are also produced. We're a little behind at the moment and
> haven't got 0.9.2 packages out. I'll be pushing 0.9.3 soon and will
> produce 0.9.3 packages for Debian/Ubuntu as well as for
> Fedora/RHEL/CentOS.

We (well, actually mostly you ;)) have pushed 0.9.2 bdr packages in rpm
and deb format.

$ rpm -qa | grep bdr94-bdr
postgresql-bdr94-bdr-debuginfo-0.9.2-1_2ndQuadrant.el7.centos.x86_64
postgresql-bdr94-bdr-0.9.2-1_2ndQuadrant.el7.centos.x86_64

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Florin Andrei <florin(at)andrei(dot)myip(dot)org>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Re: BDR problem
Date:	2015-09-16 22:34:54
Message-ID:	1021d271bde954c9de7de9c5aee10afb@andrei.myip.org

On 2015-09-14 13:32, Martín Marqués wrote:
>
> We (well, actually mostly you ;)) have pushed 0.9.2 bdr packages in rpm
> and deb format.
>
> $ rpm -qa | grep bdr94-bdr
> postgresql-bdr94-bdr-debuginfo-0.9.2-1_2ndQuadrant.el7.centos.x86_64
> postgresql-bdr94-bdr-0.9.2-1_2ndQuadrant.el7.centos.x86_64

Yup, I'm using .deb packages from
http://packages.2ndquadrant.com/bdr/apt/ on Ubuntu 14.04:

# dpkg -l | grep postgresql-bdr | awk '{print $2"\t"$3}'
postgresql-bdr-9.4 9.4.4-1trusty
postgresql-bdr-9.4-bdr-plugin 0.9.2-1trusty
postgresql-bdr-client-9.4 9.4.4-1trusty
postgresql-bdr-contrib-9.4 9.4.4-1trusty
postgresql-bdr-server-dev-9.4 9.4.4-1trusty

It's very useful to have these packages available, helps a lot with
testing, kudos to everyone involved.

--
Florin Andrei
http://florin.myip.org/