I'm having an issue where, after a loss of connectivity to the ZooKeeper servers, the Marathon cluster stays inoperable until I intervene manually.
Here are some details about the issue:
- Restart a ZooKeeper instance
- Marathon's leader is automatically killed (and restarted) and another instance takes over as leader
- I still have access to Marathon's UI, but any task operation stays in Waiting/Deploying forever
- If I restart the two remaining Marathon services, everything is back to normal and I can deploy and scale my apps again
I get that the leader might be killed if ZK connectivity is lost, but I thought the cluster would survive without my having to restart all Marathon instances.
I am running ZooKeeper 3.4.5, Mesos 1.0.0 and Marathon 1.1.2.
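
To help narrow this down after the ZooKeeper restart, here is a minimal sketch of a check that could be run against the cluster. It asks each ZooKeeper server whether it is healthy (the `ruok` four-letter command) and asks each Marathon instance who it believes the leader is (`GET /v2/leader`). The host names and ports are hypothetical placeholders, and this only reports state; it is not a fix and makes no assumption about the root cause.

```python
import json
import socket
import urllib.request

# Hypothetical addresses; replace with the real Marathon and ZooKeeper hosts.
MARATHON_NODES = ["marathon1:8080", "marathon2:8080", "marathon3:8080"]
ZK_NODES = [("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]


def zk_is_ok(host, port, timeout=3.0):
    """Send ZooKeeper's 'ruok' four-letter command; a healthy server replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        # Connection refused, timeout, or DNS failure all count as "not ok".
        return False


def marathon_leader(node, timeout=3.0):
    """Ask one Marathon instance which host:port it considers the leader."""
    try:
        url = f"http://{node}/v2/leader"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("leader")
    except (OSError, ValueError):
        # Unreachable instance, HTTP error (e.g. no leader known), or bad JSON.
        return None


def leaders_agree(views):
    """True only if every instance answered and all report the same leader."""
    leaders = set(views.values())
    return len(leaders) == 1 and None not in leaders
```

One way to use it: call `zk_is_ok(host, port)` for each entry in `ZK_NODES`, build `views = {node: marathon_leader(node) for node in MARATHON_NODES}`, and pass that to `leaders_agree(views)`. If the instances disagree or some return no leader while deployments are stuck, that output would be useful to include alongside the Marathon logs.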