  1. Marathon
  2. MARATHON-8372

Race condition because of reconciliation when launching a task

    Details

    • Type: Task
    • Status: Resolved
    • Priority: High
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: DC/OS 1.12.0
    • Component/s: Scheduling
    • Labels:
      None
    • Sprint:
      Marathon 2018-28, Marathon 2018-29, Marathon 2018-30
    • Story Points:
      3
    • Build artifact:
      Marathon-v1.7.174

      Description

      Consider the following scenario:

      1. Marathon receives an offer.
      2. Marathon decides to launch an instance on that offer.
      3. Marathon persists the new instance that will be launched.
      4. Reconciliation kicks in (it is a timer-scheduled event, entirely unrelated to launching a task), fetches all instances known at that time, and asks Mesos to reconcile them.
      5. Marathon accepts the offer by responding to Mesos.

      The problem lies between steps 4 and 5.
      In step 4 we also ask Mesos to reconcile the instance that is being launched. The master does not know about that instance yet, but it does know the agent, so it replies with TASK_GONE.

      When we receive TASK_GONE, we immediately expunge the instance and launch a new one even though technically there was nothing wrong with the first instance. This causes 'lastTaskFailure' to be non-zero during the deployment.
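
      The ordering problem can be reproduced with a small toy model. This is only a sketch: ToyMaster, ToyMarathon and run are hypothetical names invented for illustration, not Marathon or Mesos APIs. The model assumes only the behaviour described above: an unknown task on a known agent is answered with TASK_GONE, and TASK_GONE leads to the instance being expunged.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the Mesos master (not the real Mesos API)."""
    known_agents: set = field(default_factory=set)
    known_tasks: set = field(default_factory=set)

    def reconcile(self, task_id, agent_id):
        if task_id in self.known_tasks:
            return "TASK_RUNNING"
        if agent_id in self.known_agents:
            # Behaviour described in the ticket: the task is unknown,
            # the agent is known, so the answer is TASK_GONE.
            return "TASK_GONE"
        # Outside the scenario in the ticket; the toy model gives no answer.
        return None

    def accept_offer(self, task_id):
        # Only now does the master learn about the task.
        self.known_tasks.add(task_id)


@dataclass
class ToyMarathon:
    """Hypothetical stand-in for Marathon's instance store and scheduler."""
    master: ToyMaster
    instances: dict = field(default_factory=dict)  # task_id -> agent_id
    expunged: list = field(default_factory=list)

    def persist_instance(self, task_id, agent_id):  # step 3
        self.instances[task_id] = agent_id

    def reconcile_all(self):  # step 4
        for task_id, agent_id in list(self.instances.items()):
            if self.master.reconcile(task_id, agent_id) == "TASK_GONE":
                # The instance is thrown away although nothing was wrong.
                del self.instances[task_id]
                self.expunged.append(task_id)

    def accept_offer(self, task_id):  # step 5
        self.master.accept_offer(task_id)


def run(reconcile_before_accept):
    """Return the list of expunged task ids for the chosen ordering."""
    master = ToyMaster(known_agents={"agent-1"})
    marathon = ToyMarathon(master)
    marathon.persist_instance("task-1", "agent-1")
    if reconcile_before_accept:
        marathon.reconcile_all()
        marathon.accept_offer("task-1")
    else:
        marathon.accept_offer("task-1")
        marathon.reconcile_all()
    return marathon.expunged
```

      In this model, run(True) expunges task-1 (the race described above), while run(False) expunges nothing.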

      Acceptance criteria

      The scenario described above does not end with the instance being thrown away.

      Implementation notes

      There are multiple ways to deal with this:

      1. Having an event loop at the heart of Marathon would make it easier to serialize these steps correctly and to send messages to Mesos in the right order.
      2. We might introduce a new instance state for instances whose offer acceptance has not yet been sent to Mesos; after we accept the offer, we would change that state to, e.g., Accepted.
        1. With that, we would not reconcile tasks in states prior to Accepted.
        2. This introduces one extra ZK write to change the state of the instance after accepting the offer.
      3. We could reconcile only active instances (e.g. Starting/Staging/Running/Killing) and apply something related to MARATHON-8235 for instances that are provisioned or unreachable.
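
      Option 3 could look roughly like the filter below. This is a minimal sketch, not the actual fix: the Condition enum, the ACTIVE set and instances_to_reconcile are hypothetical names, and the state names are taken from the list above. Only instances in an active state are handed to reconciliation, so an instance that is persisted but not yet accepted is skipped.

```python
from enum import Enum
from typing import Dict, List


class Condition(Enum):
    """Hypothetical instance states, named after those in the ticket."""
    PROVISIONED = "Provisioned"  # persisted, offer not yet accepted
    STAGING = "Staging"
    STARTING = "Starting"
    RUNNING = "Running"
    KILLING = "Killing"
    UNREACHABLE = "Unreachable"


# States the ticket lists as "active" examples.
ACTIVE = frozenset({
    Condition.STAGING,
    Condition.STARTING,
    Condition.RUNNING,
    Condition.KILLING,
})


def instances_to_reconcile(instances: Dict[str, Condition]) -> List[str]:
    """Return the ids of instances that should be sent for reconciliation."""
    return sorted(
        instance_id
        for instance_id, condition in instances.items()
        if condition in ACTIVE
    )
```

      For example, given instances a (Provisioned), b (Running), c (Unreachable) and d (Killing), only b and d would be reconciled. Provisioned and unreachable instances would need separate handling, as the option itself notes.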

              People

              • Assignee:
                alenavarkockova Alena Varkockova
              • Reporter:
                alenavarkockova Alena Varkockova
              • Team:
                Orchestration Team
              • Watchers:
                Alena Varkockova, Matthias Eichstedt, Mergebot
