Marathon / MARATHON-2536

Trigger explicit reconciliation for unreachable instances

      Description

      Symptom
      An unreachable instance is considered inactive once it has passed the inactiveAfterSeconds timeout, and Marathon tries to launch a replacement from that point on. Similarly, an instance is expunged once it has passed the expungeAfterSeconds timeout. Marathon triggers both actions itself rather than in response to status updates: the ExpungeOverdueLostTasksActor checks for instances that match these conditions and issues the corresponding actions, potentially based on stale state.
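
To make the symptom concrete, here is a minimal sketch of a timer-driven check of the kind described above. It is not the ExpungeOverdueLostTasksActor code; the StoredInstance type, the overdue_actions function, and the action names are hypothetical, and the only inputs are the values Marathon last stored.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StoredInstance:
    """Hypothetical snapshot of what was last recorded for an instance."""
    instance_id: str
    condition: str            # last stored condition, e.g. "UNREACHABLE"
    unreachable_since: float  # time (seconds) of the last UNREACHABLE update
    inactive_after_seconds: float
    expunge_after_seconds: float


def overdue_actions(instances: List[StoredInstance], now: float) -> List[Tuple[str, str]]:
    """Decide actions from stored state and the clock only.

    Nothing here asks Mesos whether the instance is still unreachable,
    so the decision can rest on stale state.
    """
    actions = []
    for inst in instances:
        if inst.condition != "UNREACHABLE":
            continue
        elapsed = now - inst.unreachable_since
        if elapsed >= inst.expunge_after_seconds:
            actions.append((inst.instance_id, "expunge"))
        elif elapsed >= inst.inactive_after_seconds:
            actions.append((inst.instance_id, "mark_inactive"))
    return actions
```

If the instance became reachable again but the corresponding status update was missed or delayed, a check like this would still mark it inactive or expunge it.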

      Impact
      All Marathon versions.

      Background
      When Mesos agents are disconnected from the masters, their tasks are eventually reported as LOST/UNREACHABLE. In past versions, Mesos sent TASK_LOST in various other cases as well. Also, when an agent was disconnected longer than a certain timeout, the Mesos master would instruct the agent to kill all tasks in case the agent ever re-registered. However, in edge cases, agents were allowed to keep those tasks; e.g. when the master failed over as well and lost knowledge about whether or not the agent was connected before.

      Marathon's initial response to TASK_LOST was to treat those tasks as terminal and expunge all metadata about them. As a result, Marathon killed these tasks if they ever turned up RUNNING again, in order to converge its state with Mesos. This, however, led to outages in which a large number of tasks were reported LOST, later turned up RUNNING, and were subsequently killed.

      Mesos 1.1.0 introduced experimental support for partition-aware frameworks. Opting in to this capability allows a framework to decide what to do with unreachable tasks that come back running, meaning the master will not instruct agents to kill those tasks. To let the user configure how Marathon should handle unreachable tasks on a per-app basis, an unreachableStrategy was introduced. The inactiveAfterSeconds value configured in the unreachableStrategy of an Application or Pod is meant to transition an affected instance to an inactive state, thus triggering the launch of a replacement. Similarly, expungeAfterSeconds is meant to completely expunge an instance after it has been unreachable for longer than the specified number of seconds, which is essentially a cleanup action for instances that are no longer required or wanted.
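
As an illustration, an app definition carrying an unreachableStrategy might look roughly like the fragment built below. The app id and both timeout values are made up for the example and are not recommended defaults.

```python
import json

# Hypothetical app definition fragment. Only the unreachableStrategy key and
# its two fields come from the description above; the id and the numbers are
# illustrative assumptions.
app_definition = {
    "id": "/example-app",
    "unreachableStrategy": {
        # Consider the instance inactive (and launch a replacement) after this.
        "inactiveAfterSeconds": 300,
        # Expunge the instance entirely after this.
        "expungeAfterSeconds": 600,
    },
}

print(json.dumps(app_definition, indent=2))
```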

      More background about unreachable task handling in Marathon can be read here.

      Acceptance Criteria

      1. Marathon will only consider instances as inactive, or expunge them, in response to a Mesos status update.
      2. If both inactiveAfterSeconds and expungeAfterSeconds are 0, Marathon will immediately expunge an instance when it’s reported as UNREACHABLE.
      3. If inactiveAfterSeconds is non-zero, and an instance is reported as UNREACHABLE:
        1. Marathon will initiate an explicit reconciliation for tasks associated with this instance shortly after the inactiveAfterSeconds timeout.
        2. Marathon will only consider an instance as inactive in response to a subsequent status update that confirms the unreachable status.
      4. If expungeAfterSeconds is non-zero, and an instance is reported as UNREACHABLE and considered as inactive:
        1. Marathon will initiate an explicit reconciliation for tasks associated with this instance shortly after the expungeAfterSeconds timeout.
        2. Marathon will only expunge an instance in response to a subsequent status update that confirms the unreachable status.
      5. Reconciliation can happen in batches; it does not have to occur at the exact moment a timeout elapses.
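
The criteria above can be sketched as a small state machine in which timers only request reconciliation and status updates are the only thing that changes state. This is a sketch under stated assumptions, not Marathon's implementation: the State enum, the Instance type, all function names, and the request_reconciliation callback (a stand-in for whatever issues the explicit reconciliation call to Mesos) are hypothetical. It also assumes elapsed time is measured from the first UNREACHABLE update Marathon received, and that any other task state means the instance is reachable again.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class State(Enum):
    RUNNING = "running"
    UNREACHABLE = "unreachable"
    INACTIVE = "unreachable_inactive"
    EXPUNGED = "expunged"


@dataclass
class Instance:
    instance_id: str
    inactive_after_seconds: float
    expunge_after_seconds: float
    state: State = State.RUNNING
    unreachable_since: Optional[float] = None


def on_status_update(inst: Instance, task_state: str, now: float) -> None:
    """The only place where instance state changes (criterion 1)."""
    if inst.state is State.EXPUNGED:
        return  # simplification: an expunged instance is ignored
    if task_state != "TASK_UNREACHABLE":
        # Simplification: any other update means the instance is reachable.
        inst.state = State.RUNNING
        inst.unreachable_since = None
        return
    if inst.unreachable_since is None:
        inst.unreachable_since = now
    elapsed = now - inst.unreachable_since
    if inst.inactive_after_seconds == 0 and inst.expunge_after_seconds == 0:
        inst.state = State.EXPUNGED      # criterion 2
    elif inst.state is State.INACTIVE and elapsed >= inst.expunge_after_seconds:
        inst.state = State.EXPUNGED      # criterion 4.2
    elif elapsed >= inst.inactive_after_seconds:
        inst.state = State.INACTIVE      # criterion 3.2
    else:
        inst.state = State.UNREACHABLE


def due_for_reconciliation(inst: Instance, now: float) -> bool:
    """True once the timeout relevant to the current state has elapsed."""
    if inst.unreachable_since is None:
        return False
    elapsed = now - inst.unreachable_since
    if inst.state is State.UNREACHABLE:
        return elapsed >= inst.inactive_after_seconds   # criterion 3.1
    if inst.state is State.INACTIVE:
        return elapsed >= inst.expunge_after_seconds    # criterion 4.1
    return False


def reconcile_batch(
    instances: List[Instance],
    now: float,
    request_reconciliation: Callable[[List[str]], None],
) -> List[str]:
    """Periodic tick: ask for reconciliation in one batch (criterion 5).

    It never changes instance state itself.
    """
    due = [i.instance_id for i in instances if due_for_reconciliation(i, now)]
    if due:
        request_reconciliation(due)
    return due
```

In this sketch an instance that has been unreachable longer than both timeouts still needs two confirming updates, one to become inactive and one to be expunged, which matches the ordering implied by criteria 3 and 4.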

              People

              Assignee: Unassigned
              Reporter: Karsten Jeschkies (kjeschkies)
              Team: Orchestration Team
              Watchers: daltonmatos, Jason Gilanfarr (Inactive)