Affects Version/s: Marathon 1.4.2, Marathon 1.4.5
Sprint: Marathon Sprint 1.10-10
Zendesk Ticket IDs: 8329
I'm seeing significant problems with agent removal and unreachable strategy handling in the latest stable Marathon and Mesos versions, 1.4.3 and 1.1.1 respectively.
I started by making sure all of our app definitions include the new unreachable strategy settings: https://mesosphere.github.io/marathon/docs/configure-task-handling.html#unreachablestrategy-configuration-options
I ran a quick script to PATCH all applications via the API with the following aggressive unreachable strategy:
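The settings block itself is missing from this ticket. As a rough sketch of what such a script could look like, the following builds one PATCH request per app that sets only `unreachableStrategy`. The Marathon URL and the two second values are placeholders chosen for illustration, not the values actually used, and the helper names are hypothetical.

```python
import json
import urllib.request

# Hypothetical values for illustration only; not the settings from this ticket.
MARATHON_URL = "http://marathon.example.com:8080"
INACTIVE_AFTER_SECONDS = 60
EXPUNGE_AFTER_SECONDS = 120


def unreachable_strategy_body(inactive_after, expunge_after):
    """Return a partial app definition that only sets unreachableStrategy."""
    return {
        "unreachableStrategy": {
            "inactiveAfterSeconds": inactive_after,
            "expungeAfterSeconds": expunge_after,
        }
    }


def build_patch_request(base_url, app_id, body):
    """Build (but do not send) a PATCH request for /v2/apps/<app_id>."""
    url = base_url.rstrip("/") + "/v2/apps/" + app_id.strip("/")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


def patch_all_apps(base_url, body):
    """List every app via GET /v2/apps and PATCH each one with `body`."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/v2/apps") as resp:
        apps = json.load(resp)["apps"]
    for app in apps:
        request = build_patch_request(base_url, app["id"], body)
        with urllib.request.urlopen(request) as resp:
            print(app["id"], resp.status)
```

Calling `patch_all_apps(MARATHON_URL, unreachable_strategy_body(INACTIVE_AFTER_SECONDS, EXPUNGE_AFTER_SECONDS))` would apply the strategy to every app. Nothing is sent until that call is made.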
All of our health checks are MESOS_HTTP; none are HTTP-based checks run from Marathon itself anymore.
Steps to reproduce:
- Leave Marathon's `reconciliation_interval` command-line setting unset (it defaults to 10 minutes)
- Have a cluster with at least two slaves/agents, so one can be shut off or terminated
- Launch a few dummy apps in Marathon into the cluster with the above unreachable strategy settings
- Terminate/disconnect a slave (`kill -USR1 <mesos_slave_pid>; sleep 10; systemctl stop mesos-slave`)
- Notice that Marathon indicates "0 of 1 running" or similar while the app continues to show "Healthy" for the health check, even though it isn't reachable at all. The app isn't restarted until Marathon's next task reconciliation at the earliest, which can be up to 10 minutes away.
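One way to watch for the mismatch in the last step without the UI is to compare the counters Marathon returns for the app. This is only a sketch: it assumes the UI's "Healthy" corresponds to a non-zero `tasksHealthy` counter in the app object returned under `"app"` by `GET /v2/apps/<id>`, and the function names and the sample dict are hypothetical.

```python
def describe_app(app):
    """Summarize an app object's instance and task counters in one line."""
    instances = app.get("instances", 0)
    running = app.get("tasksRunning", 0)
    healthy = app.get("tasksHealthy", 0)
    return "{} of {} running, {} healthy".format(running, instances, healthy)


def looks_stale(app):
    """True when more tasks are reported healthy than are reported running."""
    return app.get("tasksHealthy", 0) > app.get("tasksRunning", 0)


# Hypothetical sample mirroring the symptom above, not captured output.
sample = {"id": "/dummy", "instances": 1, "tasksRunning": 0, "tasksHealthy": 1}
print(describe_app(sample), "| stale:", looks_stale(sample))
```

Polling this every few seconds after stopping the agent would show how long the stale state lasts relative to the reconciliation interval.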
It also seems like only the "expungeAfterSeconds" setting is being honored; I'm not sure "inactiveAfterSeconds" is actually taking effect. That is only a hunch, though, and I haven't been able to prove it yet.
A not-so-great workaround is to set Marathon's reconciliation interval to a very low value, such as 60 seconds, which seems to bring things back into sync sooner. You can also force Marathon to reconcile by simply restarting it, after which it seems to honor the unreachable settings more quickly.
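For reference when applying the workaround: to my understanding the `reconciliation_interval` flag is given in milliseconds, so 60 seconds becomes 60000. A small sketch of the conversion follows; the helper name is hypothetical, and the unit is an assumption based on the Marathon command-line documentation.

```python
def reconciliation_flag(seconds):
    """Format the Marathon flag for a reconciliation interval given in seconds."""
    # Assumption: the flag value is expressed in milliseconds.
    return "--reconciliation_interval {}".format(int(seconds * 1000))


print(reconciliation_flag(60))   # --reconciliation_interval 60000
print(reconciliation_flag(600))  # --reconciliation_interval 600000 (10 minutes)
```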
Let me know if I can provide any other useful info!