This is a parent issue to aggregate the handful of sub-issues related to resident tasks.
(A check indicates the fix is merged to master. Please see https://github.com/mesosphere/marathon/issues/5206 for the status of the backport to 1.4.)
MGI-5141 - Marathon fails to release reserved resources for deleted apps
MGI-5154 - "Kill and wipe" does not actually kill task
MGI-5162 - UnreachableStrategy configuration has no effect; should not be used by resident tasks
MGI-5164 - localVolumes Instance property is lost with Marathon restart
MGI-5206 - Killing a lost resident task results in expunge
- [ ] MGI-5283 - Workaround required for unreachable resident tasks
– original –
I've recorded a video to show the problem:
In effect, Mesos tells Marathon during a reconciliation that a task was lost. This can happen for a variety of reasons; in the demonstrated case the task is lost because the mesos-slave ID is forcibly changed and a new ID comes up on the same mesos-slave IP address. Marathon then responds by reserving a new set of resources and a new persistent volume, and launching a new task.
The expected behavior is that Marathon reuses the reserved resources. It currently can't, because it thinks a task is still running there (status.state == Unknown, judging from the protobuf hexdump in ZooKeeper). If it can't use the reserved resources because it thinks something might be running, then it should not create additional persistent volumes. When push comes to shove, if it can't satisfy both the 0% over capacity and the 0% under capacity thresholds, it should heed the 0% over capacity limit.
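To make the expected behavior concrete, here is a rough sketch of the decision I would expect. This is not Marathon's actual code: `Reservation`, `Action`, `decide`, and the state labels are hypothetical names invented for illustration, and the only thing taken from this report is the rule that the over-capacity limit wins when a reservation's task state is uncertain.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    REUSE_RESERVATION = "reuse_reservation"  # relaunch on existing reserved resources/volume
    WAIT = "wait"                            # do nothing until the task state is resolved
    LAUNCH_NEW = "launch_new"                # reserve new resources and a new volume


@dataclass
class Reservation:
    # Hypothetical labels: "Running", "Unknown" (lost/unreachable), or "Terminal".
    task_state: str


def decide(target_instances: int,
           reservations: List[Reservation],
           max_over_capacity: float = 0.0) -> Action:
    """Pick what to do when a resident app is missing a running task."""
    # A reservation is only safe to reuse when its old task is known to be gone.
    if any(r.task_state == "Terminal" for r in reservations):
        return Action.REUSE_RESERVATION

    # Otherwise, a new reservation is allowed only if it stays within the
    # over-capacity limit. With a 0% limit, an "Unknown" task still counts.
    allowed = int(target_instances * (1 + max_over_capacity))
    if len(reservations) + 1 > allowed:
        return Action.WAIT

    return Action.LAUNCH_NEW
```

With one target instance and one reservation whose task is in the `Unknown` state, `decide(1, [Reservation("Unknown")])` returns `Action.WAIT` rather than creating a second persistent volume, which is the behavior requested above.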