  1. Marathon
  2. MARATHON-8285

More gracefully recover from scenarios in which unreserveAndDestroy partially succeeds




      We have seen our first-ever occurrence of Mesos executing the volume deletion operation but dropping the unreserve operation in a destroy-and-unreserve set of offer operations. In such a case, Marathon will see the rogue reservation in a later offer and respond to it with yet another destroy-and-unreserve set of offer operations. Since the offer only has a reservation, and not a volume, Mesos will fail on the first (destroy) operation and not attempt the second (unreserve) operation. As such, the cycle will repeat indefinitely, and the resources involved will be unavailable until they are manually unreserved.
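
The cycle described above can be illustrated with a toy model. This is not Mesos or Marathon code: `Agent`, `apply_operations`, and the string operation names are hypothetical, and the model assumes only what this ticket states, namely that the operations in a set are applied in order and that processing stops at the first failure.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Toy stand-in for the reserved resources on one agent (hypothetical)."""
    reserved: bool = True
    has_volume: bool = True


def apply_operations(agent, operations):
    """Apply operations in order and stop at the first one that fails.

    Returns the list of operations that were actually applied.
    """
    applied = []
    for op in operations:
        if op == "DESTROY":
            if not agent.has_volume:
                break  # nothing to destroy: fail and skip the remaining operations
            agent.has_volume = False
        elif op == "UNRESERVE":
            if not agent.reserved or agent.has_volume:
                break  # cannot unreserve a missing or still-occupied reservation
            agent.reserved = False
        else:
            raise ValueError(f"unknown operation: {op}")
        applied.append(op)
    return applied


# State after the partial success: volume destroyed, reservation still present.
agent = Agent(reserved=True, has_volume=False)

# Marathon answers every offer with the same pair of operations.
attempts = [apply_operations(agent, ["DESTROY", "UNRESERVE"]) for _ in range(3)]
still_reserved = agent.reserved
```

In this model every attempt applies nothing, so the reservation never goes away no matter how many offers arrive.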

      Currently, to get Marathon out of this state and free up the associated reservation, the operator must manually unreserve the resources. Marathon should be able to detect a spurious reservation and issue the appropriate command to free it, rather than unconditionally sending destroy-and-unreserve.
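
A minimal sketch of the proposed decision, assuming Marathon can tell from the offer whether the reservation still carries a volume. `OfferedReservation` and `cleanup_operations` are hypothetical names used for illustration only, not Marathon's actual types.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OfferedReservation:
    """Hypothetical summary of a reservation seen in an offer."""
    reservation_id: str
    has_volume: bool


def cleanup_operations(reservation: OfferedReservation) -> List[str]:
    """Pick the operations needed to free a reservation Marathon no longer wants."""
    if reservation.has_volume:
        # Normal case: the volume must be destroyed before unreserving.
        return ["DESTROY", "UNRESERVE"]
    # Volume-less (spurious) reservation: a destroy would fail, so only unreserve.
    return ["UNRESERVE"]
```

With this shape, a reservation whose volume was already destroyed gets only an unreserve, which avoids the failing destroy that blocks the second operation.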

      Additionally, since this failure scenario is possible, we should also ensure that the logic that detects whether a reservation is successful inspects the reservation to make sure that the volume has been created (currently, we don't have any logic which would make a reservation without a volume, but that is unlikely to remain true in the future). If the reservation lacks a volume, the logic should treat the reservation as failed so that the reservation logic tries again (and expunges the faulty reservation using the logic described in the previous paragraph).
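
The success check could look roughly like the sketch below. `ReservationStatus` and `reservation_status` are hypothetical names, and treating "not yet reserved" as pending is an assumption of this sketch; the ticket only specifies that a reservation without a volume counts as failed.

```python
from enum import Enum


class ReservationStatus(Enum):
    PENDING = "pending"      # assumption: nothing reserved yet, keep waiting
    SUCCEEDED = "succeeded"  # reservation exists and its volume was created
    FAILED = "failed"        # reservation exists but has no volume


def reservation_status(reserved: bool, has_volume: bool) -> ReservationStatus:
    """Classify a reservation attempt from what is observed in an offer."""
    if not reserved:
        return ReservationStatus.PENDING
    if has_volume:
        return ReservationStatus.SUCCEEDED
    # Reserved without a volume: retry the reservation and expunge this one.
    return ReservationStatus.FAILED
```

A real implementation would probably also need a timeout or similar signal to distinguish a volume that is still being created from one that never will be; that is outside what this sketch models.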

      Acceptance Criteria

      Given a fresh Marathon instance
      When I launch a resident task on agent 1
      And I make agent 1 unreachable (stop the agent)
      And I expunge the resident task in Marathon
      And I stop Marathon
      And I bring back agent 1
      And I manually run a destroy operation on the volume in the resident task's reservation on agent 1
      And I bring back Marathon
      Then Marathon should detect the spurious reservation in its volume-less state, and issue a successful unreserve (but not destroy) operation
      And the reservation should be gone from agent 1

      (The following behavior will likely have to be verified in a unit test as creating these conditions could be very difficult)

      Given a fresh Marathon instance
      When I launch a resident task on agent 1
      And Mesos performs the reservation operation, but does not perform the volume creation operation
      Then Marathon eventually realizes the reservation is unsuccessful, and creates a new reservation and volume
      And Marathon eventually unreserves the spurious reservation from the first attempt





              • Assignee:
                chrisgaun Chris Gaun
                tharper Tim Harper
                Orchestration Team
                Alena Varkockova, Ian Fraser, Ivan Chernetsky, John Dohoney (Inactive), Josh Baverstock (Inactive), Tim Harper
              • Watchers:
                6


                • Created: