  1. Marathon
  2. MARATHON-8042

Improve constraint enforcement when over scaling in the event of unreachable-inactive tasks

    Details

      Description

      Overview of problem

      Given an application with a GROUP_BY:2 constraint, and two instances:

      Instance 1: A
      Instance 2: B
      

      In the event that instance 2 becomes unreachable and is marked inactive, Marathon will spin up a new instance, instance 3. However, the placement decision is made irrespective of the fact that instance 2 is unreachable, so instance 2 still counts toward the constraint. This means that the following scenario is possible:

      Instance 1: A
      Instance 2 (unreachable-inactive): B
      Instance 3 (new instance, spun up in response to instance 2 being inactive): A
      

      At this point, the instances do meet the specified constraint. However, a problem arises when instance 2 is expunged:

      Instance 1: A
      Instance 3: A
      

      We have two instances running, as is specified by the app definition, but the instances violate the specified constraints.
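
      The violation can be illustrated with a small model. This is only a sketch: satisfies_group_by is a hypothetical helper, and it assumes a simplified reading of GROUP_BY:n (instances spread as evenly as possible over n values), not Marathon's actual constraint code.

```python
from collections import Counter


def satisfies_group_by(placements, expected_groups):
    """Simplified GROUP_BY:n check (an assumption, not Marathon's code):
    per-value instance counts may differ by at most one across the
    expected number of values."""
    counts = Counter(placements.values())
    # Values that hold no instance count as zero, up to the expected group count.
    missing = max(0, expected_groups - len(counts))
    per_group = list(counts.values()) + [0] * missing
    if not per_group:
        return True
    return max(per_group) - min(per_group) <= 1


# Placement while instance 2 is unreachable-inactive but not yet expunged.
before_expunge = {"instance-1": "A", "instance-2": "B", "instance-3": "A"}
# Placement once instance 2 has been expunged.
after_expunge = {k: v for k, v in before_expunge.items() if k != "instance-2"}

print(satisfies_group_by(before_expunge, 2))  # True: A=2, B=1
print(satisfies_group_by(after_expunge, 2))   # False: A=2, B=0
```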

      Proposed solution

      We could potentially make the logic more intelligent by having placement constraints evaluate under the assumption that unreachable-inactive tasks will eventually be gone.

      Therefore, in the above situation, the offer matcher would have the following instances as an input to the placement constraints:

      Instance 1: A
      

      Therefore, the next logical placement would be:

      Instance 1: A
      Instance 3: B
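
      A minimal sketch of this idea, under stated assumptions: Instance, instances_for_constraints, and pick_group_by_value are hypothetical names, and the GROUP_BY choice is simplified to "pick the least-populated value", with ties broken alphabetically. It is not Marathon's offer matcher.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    value: str  # value of the constrained field, e.g. "A" or "B"
    unreachable_inactive: bool = False


def instances_for_constraints(instances):
    """Drop unreachable-inactive instances, assuming they will eventually be expunged."""
    return [i for i in instances if not i.unreachable_inactive]


def pick_group_by_value(instances, known_values):
    """Choose the least-populated value (simplified GROUP_BY; ties break alphabetically)."""
    counts = Counter(i.value for i in instances)
    return min(sorted(known_values), key=lambda v: counts[v])


instances = [
    Instance("instance-1", "A"),
    Instance("instance-2", "B", unreachable_inactive=True),
]

# Current behaviour: instance 2 still counts, so A and B look equally loaded.
naive = pick_group_by_value(instances, {"A", "B"})
# Proposed behaviour: instance 2 is ignored, so B is the least-populated value.
proposed = pick_group_by_value(instances_for_constraints(instances), {"A", "B"})

print(naive, proposed)  # A B
```

      In this model the only change is which instances are handed to the constraint evaluation; the constraint logic itself stays the same.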
      

      However, there are consequences to such an approach. Presume that we have a UNIQUE constraint, or perhaps a MAX_PER:1 constraint, and we overscale in such a way that we will violate the constraint if the inactive instance does happen to come back:

      Instance 1: A
      Instance 2 (unreachable-inactive): B
      Instance 3: B
      

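      Now, let's also assume that the kill strategy is specified to delete the oldest instance first, and that instance 1 is the oldest. If instance 2 becomes reachable again, Marathon kills instance 1 to return to the target instance count, leaving two instances that violate the constraint: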

      Instance 2: B
      Instance 3: B
      

      We will need to modify the kill strategy for constraints that are evaluated relative to the values of other instances (such as UNIQUE, MAX_PER, and GROUP_BY): Marathon should prefer to kill instances that violate constraints first, and only then apply the specified kill strategy.
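
      One way such a kill selection could look is sketched below. The names (Instance, select_kills, max_per_value) are hypothetical, the constraint is modelled as a simple per-value maximum (UNIQUE or MAX_PER:1), and choosing the oldest among the violators is an assumption of this sketch rather than something the ticket specifies.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    value: str       # value of the constrained field
    started_at: int  # smaller means older


def select_kills(instances, target_count, max_per_value=1):
    """Pick instances to kill until target_count remain:
    constraint violators first, then plain oldest-first."""
    survivors = sorted(instances, key=lambda i: i.started_at)
    kills = []
    while len(survivors) > target_count:
        counts = Counter(i.value for i in survivors)
        # An instance is a violator if its value holds more instances than allowed.
        violators = [i for i in survivors if counts[i.value] > max_per_value]
        # Prefer the oldest violator; otherwise fall back to the oldest instance.
        victim = violators[0] if violators else survivors[0]
        survivors.remove(victim)
        kills.append(victim)
    return kills, survivors


# Instance 2 has come back after instance 3 was started on the same value.
running = [
    Instance("instance-1", "A", started_at=1),
    Instance("instance-2", "B", started_at=2),
    Instance("instance-3", "B", started_at=3),
]
kills, survivors = select_kills(running, target_count=2)

print([i.name for i in kills])      # ['instance-2']
print([i.name for i in survivors])  # ['instance-1', 'instance-3']
```

      With plain oldest-first selection, instance 1 would be killed and both survivors would sit on B; here the survivors end up on A and B.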

      Acceptance Criteria

      Given a Mesos cluster with the following nodes and agent attributes:

      agent-1 {color:RED}
      agent-2 {color:BLUE}
      agent-3 {color:BLUE}
      

      and a Marathon app definition with a kill policy of kill-oldest, a placement constraint of color:UNIQUE, an unreachableStrategy of 5:300, and a target instance count of 2

      and I manually kill-and-scale the instance placed on a color:BLUE node (making the instance on node color:RED oldest)

      and I scale the app back up to two instances

      When I kill the Mesos agent on the node with attribute color:BLUE (on which one of the instances is running)

      Then Marathon should scale up another instance on the other agent with color:BLUE

      When I restart the Mesos agent previously killed, and the instance running on it becomes reachable again

      Then Marathon should prefer to kill that instance, rather than the oldest, and the unique constraint should be satisfied

      (Similar scenario for MAX_PER:1, and GROUP_BY:2)
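
      For reproducing the scenario, the app definition might look roughly like the following, built here as a Python dict and printed as JSON. The app id and cmd are hypothetical placeholders, and reading "5:300" as inactiveAfterSeconds 5 and expungeAfterSeconds 300 is an assumption.

```python
import json

app_definition = {
    "id": "/constraint-test",   # hypothetical app id
    "cmd": "sleep 100000",      # hypothetical placeholder command
    "instances": 2,
    "constraints": [["color", "UNIQUE"]],
    "killSelection": "OLDEST_FIRST",
    "unreachableStrategy": {
        "inactiveAfterSeconds": 5,    # assumed reading of "5:300"
        "expungeAfterSeconds": 300,
    },
}

# Serialize to the JSON body that would be submitted to Marathon.
app_json = json.dumps(app_definition, indent=2)
print(app_json)
```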

              People

              • Assignee:
                Unassigned
                Reporter:
                tharper Tim Harper
                Team:
                Orchestration Team
                Watchers:
                Matthias Eichstedt, Tim Harper