  1. Marathon
  2. MARATHON-8372

Race condition because of reconciliation when launching a task


    • Type: Task
    • Status: Resolved
    • Priority: High
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: DC/OS 1.12.0
    • Component/s: Scheduling
    • Sprint: Marathon 2018-28, Marathon 2018-29, Marathon 2018-30


      Consider the following scenario:

      1. Marathon receives an offer.
      2. Marathon decides to launch an instance on that offer.
      3. Marathon persists the new instance that will be launched.
      4. Reconciliation kicks in (it is a timer-scheduled event, entirely unrelated to launching a task): it fetches all instances known at that time and asks Mesos to reconcile them.
      5. Marathon accepts the offer by responding to Mesos.

      The problem lies between steps 4 and 5.
      In step 4 we also ask Mesos to reconcile the instance that is being launched. The master does not know about that instance yet, but it does know the agent, so it replies with TASK_GONE.

      When we receive TASK_GONE, we immediately expunge the instance and launch a new one even though technically there was nothing wrong with the first instance. This causes 'lastTaskFailure' to be non-zero during the deployment.
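
      The ordering problem can be reproduced with a small toy model. This is only a sketch: ToyMaster, run_scenario, and the instance and agent ids are hypothetical names invented for illustration, not Marathon or Mesos APIs, and replying TASK_RUNNING for a launched task is a simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the Mesos master (not a real Mesos API)."""
    known_agents: Set[str] = field(default_factory=set)
    launched_tasks: Set[str] = field(default_factory=set)

    def reconcile(self, task_id: str, agent_id: str) -> str:
        if task_id in self.launched_tasks:
            return "TASK_RUNNING"  # simplification for this sketch
        if agent_id in self.known_agents:
            # Unknown task on a known agent: the case described in the ticket.
            return "TASK_GONE"
        raise ValueError("unknown agent: not modelled in this sketch")


def run_scenario(reconcile_before_accept: bool) -> Dict[str, str]:
    """Return the toy instance store (instance id -> agent id) afterwards."""
    master = ToyMaster(known_agents={"agent-1"})
    instances: Dict[str, str] = {}

    def reconcile_all() -> None:
        # Step 4: ask the master about every instance known right now.
        for task_id, agent_id in list(instances.items()):
            if master.reconcile(task_id, agent_id) == "TASK_GONE":
                del instances[task_id]  # instance is expunged immediately

    # Steps 1-3: offer received, launch decided, new instance persisted.
    instances["instance-1"] = "agent-1"

    if reconcile_before_accept:
        reconcile_all()  # reconciliation fires before the accept is sent

    # Step 5: the accept reaches the master, which now knows the task.
    master.launched_tasks.add("instance-1")

    if not reconcile_before_accept:
        reconcile_all()  # same reconciliation, but after the accept

    return instances
```

      In this model, run_scenario(True) returns an empty store because the freshly persisted instance is expunged, while run_scenario(False) keeps instance-1. The only difference between the two runs is whether reconciliation happens before or after step 5.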

      Acceptance criteria

      The scenario described above no longer results in the instance being thrown away.

      Implementation notes

      There are multiple ways to deal with this:

      1. Having an event loop at the heart of Marathon would make it easier to sequence these steps correctly and to send messages to Mesos in the right order.
      2. We might introduce a new instance state for instances whose offer has not yet been accepted with Mesos; after we accept the offer, we would change that state to e.g. Accepted.
        1. With that, we would not reconcile tasks in states prior to Accepted.
        2. This introduces one extra ZK write to change the state of the instance after accepting the offer.
      3. We could reconcile only active instances (e.g. Starting/Staging/Running/Killing), and apply something related to MARATHON-8235 to instances that are provisioned or unreachable.
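
      Option 3 amounts to filtering the instance list by state before asking Mesos to reconcile. A minimal sketch follows, assuming a simplified state enum: Condition, ACTIVE_CONDITIONS, and instances_to_reconcile are hypothetical names for illustration and are not taken from the Marathon code base.

```python
from enum import Enum
from typing import Dict, List


class Condition(Enum):
    """Simplified instance states; only the ones named in the ticket."""
    PROVISIONED = "Provisioned"
    STARTING = "Starting"
    STAGING = "Staging"
    RUNNING = "Running"
    KILLING = "Killing"
    UNREACHABLE = "Unreachable"


# States in which Mesos is expected to already know about the task.
ACTIVE_CONDITIONS = frozenset({
    Condition.STARTING,
    Condition.STAGING,
    Condition.RUNNING,
    Condition.KILLING,
})


def instances_to_reconcile(instances: Dict[str, Condition]) -> List[str]:
    """Return the ids of instances that are safe to reconcile, sorted by id."""
    return sorted(
        instance_id
        for instance_id, condition in instances.items()
        if condition in ACTIVE_CONDITIONS
    )
```

      With this filter, an instance that has been persisted but whose offer has not yet been accepted (Provisioned here) is never sent to Mesos for reconciliation, so it cannot come back as TASK_GONE. Option 2 would work the same way, with the hypothetical pre-Accepted state excluded from the active set, at the cost of the extra ZK write noted above.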





              • Assignee:
                alenavarkockova Alena Varkockova
                Orchestration Team
                Alena Varkockova, Matthias Eichstedt, Mergebot
              • Watchers:
                3


                • Created: