In Mesos commit b8562b8fc578f8d62f4a934d315516161d9e1720, the following change was introduced:
This change landed in Mesos 1.2.x, and breaks Marathon's ability to recover persistent volume tasks (at least until the Mesos master forgets about the former agent).
The scenario goes like this:
- Marathon launches a persistent volume task with ID task-1, on agent agent-1
- Machine hosting agent agent-1 is rebooted
- Machine previously hosting agent agent-1 is booted, and agent reconnects with a new ID agent-2
- Mesos has no idea what state agent-1 is in, and assumes it is lost
- Persistent resources are offered to Marathon, and Marathon re-launches a task with ID task-1 on agent-2
- Mesos rejects this because agent-1 has an unreachable task ID of task-1
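The sequence above can be replayed against a toy model. This is only a sketch under stated assumptions: `ToyMaster` and its methods are hypothetical names invented for illustration, not Mesos code. The only rule it encodes is the one described in this ticket, that a launch is rejected while an unreachable agent still holds the same task ID.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the master's bookkeeping; not real Mesos code."""
    # agent ID -> task IDs last known to run on that agent
    tasks_by_agent: dict = field(default_factory=dict)
    # agents the master lost contact with but has not yet forgotten
    unreachable: set = field(default_factory=set)

    def register(self, agent_id):
        self.tasks_by_agent.setdefault(agent_id, set())

    def mark_unreachable(self, agent_id):
        self.unreachable.add(agent_id)

    def forget(self, agent_id):
        self.unreachable.discard(agent_id)
        self.tasks_by_agent.pop(agent_id, None)

    def launch(self, agent_id, task_id):
        """Return True if accepted, False if rejected as a duplicate task ID."""
        for other in self.unreachable:
            if task_id in self.tasks_by_agent.get(other, set()):
                return False
        self.tasks_by_agent[agent_id].add(task_id)
        return True


def replay_scenario():
    master = ToyMaster()
    master.register("agent-1")
    first = master.launch("agent-1", "task-1")
    master.mark_unreachable("agent-1")  # host reboots
    master.register("agent-2")          # same host, new agent ID
    relaunch = master.launch("agent-2", "task-1")
    master.forget("agent-1")            # master eventually forgets agent-1
    after_forget = master.launch("agent-2", "task-1")
    return first, relaunch, after_forget
```

In this model the re-launch on agent-2 fails only while agent-1 is still tracked as unreachable, which matches the "at least until the Mesos master forgets about the former agent" caveat.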
Note that either of the following would solve the problem:
- Mesos reusing the agent ID when the host machine is rebooted
- Marathon not reusing the same task ID when it replaces a resident task
MESOS-6223 states that agent IDs should be re-used between host reboots, and Mesos commit cd6495e677ec74fd3f40b0dbf3b9654475308575 addresses it. ٩(ˊᗜˋ*)و However, this change will not land until 1.4.x.
Ideally, we should stop re-using task IDs, especially now that we have done work to decouple the instance ID from the task ID.
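One way this could look, as a minimal sketch only: the instance ID stays stable across re-launches while each launch gets a fresh task ID. The `ResidentInstance` class and the `<instance ID>.<incarnation>` format are hypothetical and chosen for illustration; they are not Marathon's actual task ID scheme.

```python
from dataclasses import dataclass


@dataclass
class ResidentInstance:
    """Hypothetical model: stable instance ID, fresh task ID per launch."""
    instance_id: str
    incarnation: int = 0  # number of launches so far (assumed counter)

    def next_task_id(self):
        # Bump the counter so a re-launch never repeats an earlier task ID.
        self.incarnation += 1
        return f"{self.instance_id}.{self.incarnation}"
```

With a scheme along these lines, a re-launch after a reboot would present a task ID that the master has never seen on the unreachable agent, so the duplicate check would not apply.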
– UPDATE –
I have confirmed that Mesos 1.2.2 will evict the former agent ID if an agent with a new ID joins from the same IP address. This issue therefore appears to affect only users who have dynamically assigned IPs.