Marathon / MARATHON-7926

Task IDs ignored if Mesos Task ID does not match expected Marathon Task ID

      Description

      Summary

      Recently, to cope with the Mesos requirement that task IDs for non-terminal tasks be unique, Marathon started appending a launch count to the task ID of each run of a resident task. The format is "{instanceId}.{launchCount}", and launchCount is incremented with each launch of the resident task.
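
      To make the ID format concrete, here is a minimal Python sketch of parsing and incrementing such an ID. It is not Marathon's implementation; the names ResidentTaskId, parse_resident_task_id and next_launch are hypothetical, and the only thing taken from this ticket is the "{instanceId}.{launchCount}" format and the example ID.

```python
import re
from typing import NamedTuple


class ResidentTaskId(NamedTuple):
    """Hypothetical model of a resident task ID: '{instanceId}.{launchCount}'."""
    instance_id: str
    launch_count: int

    def __str__(self) -> str:
        return f"{self.instance_id}.{self.launch_count}"

    def next_launch(self) -> "ResidentTaskId":
        # Each relaunch of the resident task increments the launch count.
        return self._replace(launch_count=self.launch_count + 1)


# The instance ID itself contains dots, so only the last dot-separated
# component is treated as the launch count.
_TASK_ID_PATTERN = re.compile(r"(.+)\.([0-9]+)")


def parse_resident_task_id(task_id: str) -> ResidentTaskId:
    match = _TASK_ID_PATTERN.fullmatch(task_id)
    if match is None:
        raise ValueError(f"no launch count suffix in task ID: {task_id!r}")
    return ResidentTaskId(match.group(1), int(match.group(2)))


task_id = parse_resident_task_id("app.04c610fb-bf45-11e7-ba0e-0abe3aa64bb5.6")
relaunched = task_id.next_launch()
```

      With the ID from this ticket, task_id.launch_count is 6 and str(relaunched) ends in ".7", which is the pair of IDs involved in the mismatch described in this ticket.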

      In one cluster we observed that Mesos had a running task with ID app.04c610fb-bf45-11e7-ba0e-0abe3aa64bb5.7, while Marathon's state recorded that instance app.04c610fb-bf45-11e7-ba0e-0abe3aa64bb5 had task ID app.04c610fb-bf45-11e7-ba0e-0abe3aa64bb5.6. The logs did not go back far enough to uncover the origin, but a likely explanation is that the resident task was relaunched before the newly assigned task ID was persisted.

      If this happens, Marathon ignores all status updates for the mismatched task, effectively leaving it orphaned. Were Marathon to restart the resident task in question, the launch would never succeed, because the resources for the resident task would not be resolved, leading to a stuck deployment. Furthermore, the orphaned task is not monitored by health checks.

      A work-around is to manually find and kill the task, allowing Marathon to re-launch the task and get back in sync with the IDs.

      If a Mesos task ID does not match Marathon's task ID for the instance to which it belongs, then Marathon should either kill the task or update its own state so that it takes ownership of the task. Stated differently, if Marathon thinks it is at app.04c610fb-bf45-11e7-ba0e-0abe3aa64bb5.6 and it receives an update for app.04c610fb-bf45-11e7-ba0e-0abe3aa64bb5.7, Marathon could accept the task if it is newer (or, potentially, kill the task if it is older than expected, which seems unlikely given the explanation for how this issue cropped up).
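
      The proposed behavior can be sketched as a small decision function. This is only an illustration in Python, assuming the "{instanceId}.{launchCount}" format; the names Action, split_task_id and reconcile are hypothetical and do not correspond to Marathon's code. It models the "accept if newer, kill if older" option; the alternative of always killing a mismatched task would return KILL for both mismatch branches.

```python
from enum import Enum
from typing import Tuple


class Action(Enum):
    PROCESS = "process the status update as usual"
    ADOPT = "adopt the newer task ID into Marathon's state"
    KILL = "kill the unexpected task"
    NOT_OURS = "task belongs to a different instance"


def split_task_id(task_id: str) -> Tuple[str, int]:
    # Split on the last dot: the instance ID itself contains dots.
    # int() raises ValueError if the suffix is not a launch count.
    instance_id, _, launch_count = task_id.rpartition(".")
    return instance_id, int(launch_count)


def reconcile(expected_task_id: str, reported_task_id: str) -> Action:
    """Decide what to do with a status update, given the task ID Marathon expects."""
    expected_instance, expected_count = split_task_id(expected_task_id)
    reported_instance, reported_count = split_task_id(reported_task_id)

    if reported_instance != expected_instance:
        return Action.NOT_OURS
    if reported_count == expected_count:
        return Action.PROCESS
    if reported_count > expected_count:
        # Mesos runs a newer launch than the one Marathon persisted.
        return Action.ADOPT
    # Mesos reports an older launch than Marathon expects.
    return Action.KILL


decision = reconcile(
    "app.04c610fb-bf45-11e7-ba0e-0abe3aa64bb5.6",
    "app.04c610fb-bf45-11e7-ba0e-0abe3aa64bb5.7",
)
```

      For the IDs observed in the cluster, decision is Action.ADOPT, since the reported launch count (7) is newer than the expected one (6).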

      Acceptance

      Given I have launched a persistent app with one task running
      And I back up the Marathon state
      And then I restart the persistent task
      And then I restore Marathon's state
      When Mesos sends a task running update for a task ID whose launchCount is newer than the one Marathon expects
      Then Marathon kills the task, and eventually replaces it

       

      Given I have launched a persistent app with one task running
      And I back up the Marathon state
      And then I restart the persistent task
      And then I restore Marathon's state, but I manually kill the task before restarting Marathon
      Then Marathon restarts the instance

       

      (hypothetical situation... needs coverage?)

      Given I have launched a persistent app with one task running
      And then I restart the persistent task
      And then I stop Marathon, and I edit the state in ZooKeeper (using the storage tool) to increase the launchCount for the given task ID
      And then I restart Marathon
      When Mesos sends a task running update for a task ID whose launchCount is older than the one Marathon expects
      Then Marathon kills the task

      People

      • Assignee: Nikita Melkozerov (nikitamelkozerov)
      • Reporter: Tim Harper (tharper)
      • Team: Orchestration Team