Resolution: Cannot Reproduce
Affects Version/s: None
Fix Version/s: None
We are seeing an intermittent problem where our Marathon instances commit suicide when they lose their connection to ZooKeeper. We increased the zk initLimit and multiplier in zoo.cfg and the problem now happens much less frequently, but it still happens. This cluster is running in Azure. We aren't sure if this is an underlying problem in Marathon similar to this: https://github.com/mesosphere/marathon/issues/4266
or if this is a network connectivity issue within Azure. This is DC/OS 1.8.4.
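For reference, here is a minimal sketch of how the tick-based settings in a zoo.cfg translate into milliseconds. It assumes the standard ZooKeeper semantics: initLimit and syncLimit are counted in ticks of tickTime, and the session timeout bounds default to 2 and 20 ticks when not set. The values in EXAMPLE_CFG are hypothetical, not taken from this cluster, and the "multiplier" mentioned above is not modeled because it is not specified which setting that was. Note that under these semantics initLimit governs followers connecting to the leader, while client sessions such as Marathon's are bounded by the session timeout limits.

```python
def zk_timings(cfg_text):
    """Derive millisecond timeouts from zoo.cfg-style key=value text."""
    settings = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()

    tick = int(settings["tickTime"])
    return {
        # Time followers get to connect and sync to the leader.
        "init_timeout_ms": tick * int(settings["initLimit"]),
        # How far a follower may lag behind the leader.
        "sync_timeout_ms": tick * int(settings["syncLimit"]),
        # Bounds the server applies to client session timeouts
        # (assumed defaults: 2 and 20 ticks).
        "min_session_timeout_ms": int(settings.get("minSessionTimeout", 2 * tick)),
        "max_session_timeout_ms": int(settings.get("maxSessionTimeout", 20 * tick)),
    }


# Hypothetical values for illustration only.
EXAMPLE_CFG = """
tickTime=2000
initLimit=10
syncLimit=5
"""

if __name__ == "__main__":
    for name, value in zk_timings(EXAMPLE_CFG).items():
        print(f"{name}: {value}")
```

Running this against the actual zoo.cfg contents from each master would show whether the three nodes agree on these values.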
First of all, we see this within the wiki for Exhibitor located here: https://github.com/soabase/exhibitor/wiki
Important Facts About Exhibitor
Exhibitor manages your zoo.cfg and myid files. Do not edit these manually as Exhibitor will overwrite them.
Should we assume the same holds for DC/OS, i.e. that we should not edit zoo.cfg by hand? We have already done so.
Here are some logs when we see this happen.
Dec 02 17:34:50 dcos-master3 java: [2016-12-02 17:34:50,858] INFO Client session timed out, have not heard from server in 6669ms for sessionid 0x1587cf5535b000f, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:ForkJoinPool-2-worker-13-SendThread(zk-1.zk:2181))
Dec 02 17:34:49 dcos-master2 java: [2016-12-02 17:34:49,934] ERROR ZooKeeper access failed - Committing suicide to avoid invalidating ZooKeeper state (mesosphere.marathon.core.election.impl.CuratorElectionService:qtp817460653-42)
Dec 02 17:34:51 dcos-master3 mesos-master: I1202 17:34:51.468739 52757 group.cpp:460] Lost connection to ZooKeeper, attempting to reconnect ...
All 3 Exhibitor instances shut down ZooKeeper at that time:
Dec 02 17:34:55 dcos-master3 java: Shutting down
Dec 02 17:34:55 dcos-master1 java: Shutting down
Dec 02 17:34:56 dcos-master2 java: shutting down
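The excerpts above come from three different masters and are not pasted in time order, so a small sketch like the following can help put them on one timeline. It assumes every line starts with the syslog-style "Mon DD HH:MM:SS host" prefix seen above and that all lines are from 2016 (the year is taken from the bracketed timestamps, since the prefix omits it). The message text in the sample list is abbreviated for readability.

```python
from datetime import datetime

# Assumption: the syslog prefix has no year, so one is supplied explicitly.
LOG_YEAR = 2016


def syslog_time(line):
    """Parse the leading 'Dec 02 17:34:50' prefix of a log line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{LOG_YEAR} {stamp}", "%Y %b %d %H:%M:%S")


def in_time_order(lines):
    """Return lines sorted by syslog time; ties keep their input order."""
    return sorted(lines, key=syslog_time)


# Abbreviated copies of the lines quoted in this report.
lines = [
    "Dec 02 17:34:50 dcos-master3 java: Client session timed out",
    "Dec 02 17:34:49 dcos-master2 java: ZooKeeper access failed",
    "Dec 02 17:34:51 dcos-master3 mesos-master: Lost connection to ZooKeeper",
    "Dec 02 17:34:55 dcos-master3 java: Shutting down",
    "Dec 02 17:34:55 dcos-master1 java: Shutting down",
    "Dec 02 17:34:56 dcos-master2 java: shutting down",
]

if __name__ == "__main__":
    for entry in in_time_order(lines):
        print(entry)
```

The prefix only has one-second resolution and the hosts' clocks may not be perfectly in sync, so the resulting order is approximate.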
We do have a log bundle, but we're not sure whether it is appropriate to post it publicly here. I'd be more than happy to provide it privately to anyone at Mesosphere who may be interested.