MARATHON-7654

Marathon cannot replace persistent tasks for rebooted hosts until the Mesos Master forgets about the old agent

    Details

      Description

      In Mesos commit b8562b8fc578f8d62f4a934d315516161d9e1720, the following change was introduced:

      Author: Neil Conway <neil.conway@gmail.com>
      Date:   Mon Jan 23 17:04:22 2017 -0800
      
          Prevented task launches that reuse unreachable task IDs.
      
          The master keeps an in-memory cache of task IDs that have recently been
          marked unreachable. The master now consults this cache to reject task
          launch attempts that reuse one of these recently unreachable task IDs
          (such tasks are not terminal and may resume running in the future). This
          check is not complete (we won't detect all cases in which unreachable
          task IDs are reused), but preventing this from happening in the common
          case seems worth doing. See MESOS-6785 for details.
      
          Review: https://reviews.apache.org/r/54793/
      

      This change landed in Mesos 1.2.x, and breaks Marathon's ability to recover persistent volume tasks (at least until the Mesos master forgets about the former agent).

      The scenario goes like this:

      • Marathon launches a persistent volume task with ID task-1, on agent agent-1
      • Machine hosting agent agent-1 is rebooted
      • The machine comes back up, and its agent reconnects with a new ID, agent-2
      • Mesos has no idea what state agent-1 is in and assumes it is lost
      • Persistent resources are offered to Marathon, and Marathon re-launches a task with ID task-1 on agent-2
      • Mesos rejects this because agent-1 has an unreachable task ID of task-1
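
The scenario above can be replayed with a toy model. This is only a sketch of the behavior described in the commit message, not Mesos code: the class `ToyMaster`, its methods, and the replacement ID `task-1.2` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy stand-in for the master's bookkeeping (hypothetical, not Mesos code)."""
    # task ID -> ID of the agent on which the task became unreachable
    unreachable_tasks: dict = field(default_factory=dict)
    # agent ID -> set of task IDs currently running on that agent
    running: dict = field(default_factory=dict)

    def register_agent(self, agent_id):
        self.running.setdefault(agent_id, set())

    def mark_unreachable(self, agent_id):
        # Tasks on the agent are not terminal; they may resume later,
        # so their IDs are remembered rather than discarded.
        for task_id in self.running.pop(agent_id, set()):
            self.unreachable_tasks[task_id] = agent_id

    def launch(self, task_id, agent_id):
        if task_id in self.unreachable_tasks:
            return False  # rejected: reuses a recently unreachable task ID
        self.running[agent_id].add(task_id)
        return True


master = ToyMaster()
master.register_agent("agent-1")
first_launch = master.launch("task-1", "agent-1")

# The host reboots and the agent comes back under a new ID.
master.mark_unreachable("agent-1")
master.register_agent("agent-2")

relaunch_same_id = master.launch("task-1", "agent-2")   # reuses the old ID
relaunch_new_id = master.launch("task-1.2", "agent-2")  # uses a fresh ID
print(first_launch, relaunch_same_id, relaunch_new_id)
```

In this model the relaunch fails only because the task ID is reused; the same persistent resources accept a task with a different ID.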

      Either of the following would solve the problem:

      • Mesos reuses agent IDs when the host machine is rebooted
      • Marathon replaces a resident task without reusing the old task ID

      MESOS-6223 states that agent IDs should be re-used between host reboots, and Mesos commit cd6495e677ec74fd3f40b0dbf3b9654475308575 addresses it. ٩(ˊᗜˋ*)و However, this change will not land until 1.4.x.

      Ideally, we should stop reusing task IDs, especially now that we have done work to decouple the instance ID from the task ID.
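
One way to avoid reusing task IDs is to keep the instance ID stable and derive a new task ID for every launch. The sketch below assumes a made-up ID format (`<instance-id>.<incarnation>`) and a hypothetical `TaskIdGenerator` class; it is not Marathon's actual ID scheme.

```python
import uuid


def new_instance_id(app_id):
    # Hypothetical format: stays the same for the lifetime of the resident instance.
    return f"{app_id}.instance-{uuid.uuid4()}"


class TaskIdGenerator:
    """Issues a distinct task ID for every launch of an instance (illustrative only)."""

    def __init__(self):
        # instance ID -> number of launches issued so far
        self._incarnations = {}

    def next_task_id(self, instance_id):
        n = self._incarnations.get(instance_id, 0) + 1
        self._incarnations[instance_id] = n
        return f"{instance_id}.{n}"


generator = TaskIdGenerator()
instance_id = new_instance_id("my-app")

original_task_id = generator.next_task_id(instance_id)
replacement_task_id = generator.next_task_id(instance_id)
print(original_task_id)
print(replacement_task_id)
```

The replacement task still belongs to the same instance, and can therefore keep the same persistent volume, but Mesos never sees the earlier task ID again.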

      – UPDATE –

      I have confirmed that Mesos 1.2.2 will evict the former agent ID if a new agent ID joins with the same IP address. This issue appears to affect only users with dynamically assigned IPs.
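
The difference between static and dynamic IPs can be illustrated with another toy model. It assumes, based only on the observation above, that evicting the former agent also drops its unreachable task IDs; `ToyRegistry`, its methods, and the IP addresses are hypothetical and do not reflect Mesos internals.

```python
class ToyRegistry:
    """Toy model of agent registration keyed by IP (hypothetical, not Mesos code)."""

    def __init__(self):
        self.agent_by_ip = {}  # IP -> agent ID last registered from that address
        self.unreachable = {}  # agent ID -> set of unreachable task IDs

    def register(self, agent_id, ip):
        former = self.agent_by_ip.get(ip)
        if former is not None and former != agent_id:
            # Assumption: evicting the former agent also forgets its task IDs.
            self.unreachable.pop(former, None)
        self.agent_by_ip[ip] = agent_id

    def mark_unreachable(self, agent_id, task_ids):
        self.unreachable[agent_id] = set(task_ids)

    def can_launch(self, task_id):
        return all(task_id not in ids for ids in self.unreachable.values())


def relaunch_allowed(ip_after_reboot):
    registry = ToyRegistry()
    registry.register("agent-1", "10.0.0.5")
    registry.mark_unreachable("agent-1", ["task-1"])
    registry.register("agent-2", ip_after_reboot)
    return registry.can_launch("task-1")


static_ip_result = relaunch_allowed("10.0.0.5")   # same address after reboot
dynamic_ip_result = relaunch_allowed("10.0.0.9")  # new address after reboot
print(static_ip_result, dynamic_ip_result)
```

Under that assumption, a host that keeps its address displaces agent-1 immediately, while a host that gets a new address leaves agent-1 and its task IDs in place until the master forgets the agent.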

    People

      Assignee: Tim Harper (tharper)
      Reporter: Tim Harper (tharper)
      Team: Orchestration Team
      Watchers: Cathy Daw, Tim Harper, Vinod Kone