I'm having trouble setting up failover for Docker containers: even when a Docker host goes down, its containers are not rebalanced onto another host.
I'm trying a simple topology with 3 masters (only one active) and a quorum of 1. My master arguments are:
/usr/sbin/mesos-master --zk=zk://foo.bar.0.1:2181,foo.bar.0.2:2181,foo.bar.0.3:2181/mesos --ip=foo.bar.0.1 --cluster=mesos_cluster --log_dir=/var/log/mesos --work_dir=/var/mesos --advertise_ip=foo.bar.0.1 --hostname=foobar --quorum=1
My slaves are all configured following the same pattern:
/usr/sbin/mesos-slave --ip=foo.bar.0.1 --master=zk://foo.bar.0.1:2181,foo.bar.0.2:2181,foo.bar.0.3:2181/mesos --containerizers=mesos,docker --log_dir=/var/log/mesos --work_dir=/var/mesos --docker_config=/root/.dockercfg --hostname=foobar --resources=file:///etc/mesos-resources.txt
I'm rolling out 10 instances of the attached JSON, changing the mock-producer-x value and the PARTITION value so that each instance gets a different number.
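To make the deployment step concrete, here is a rough sketch of how the 10 definitions could be generated. The attached JSON is not reproduced here, so the template below is only an assumption: the image name, resource values and the exact layout of the PARTITION variable are hypothetical placeholders, and only the mock-producer-x and PARTITION names come from my setup.

```python
import copy
import json

# Hypothetical stand-in for the attached JSON; every value except the
# "mock-producer-x" name and the PARTITION variable is an assumption.
TEMPLATE = {
    "id": "mock-producer-x",
    "instances": 1,
    "cpus": 0.1,
    "mem": 128,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/mock-producer", "network": "BRIDGE"},
    },
    "env": {"PARTITION": "x"},
}


def build_apps(template, count):
    """Return `count` copies of the template, each with its own number."""
    apps = []
    for n in range(count):
        # Deep copy so the nested "env" dict is not shared between apps.
        app = copy.deepcopy(template)
        app["id"] = "mock-producer-%d" % n
        app["env"]["PARTITION"] = str(n)
        apps.append(app)
    return apps


apps = build_apps(TEMPLATE, 10)

if __name__ == "__main__":
    # Print one JSON document per line, ready to be submitted to Marathon.
    for app in apps:
        print(json.dumps(app))
```

Each printed document would then be submitted to Marathon as a separate application, so every producer has its own id and partition number.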
In this context, my application is properly deployed and runs smoothly on my 3 nodes.
My problem occurs when one of my nodes goes down (here, I'm doing a reboot or ifconfig eth1 down). The containers assigned to it are no longer shown as "running" in Marathon, BUT if I restart them manually, they are properly restarted elsewhere. When the missing node rejoins the cluster, its containers are still up (if it was just a simple network failure). My issue is that they are never killed or rebalanced in any way.
[EDIT]: I also tried the --recover=cleanup option, which made all my agents fail.