Details

    • Type: Task
    • Status: Accepted
    • Priority: High
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: metronome
    • Labels:

      Description

      Triage outcome

      Metronome uses Marathon AppDefinitions launched via the Marathon scheduler underneath. In Marathon, maxLaunchDelaySeconds is documented as follows:

      Configures exponential backoff behavior when launching potentially sick apps. This prevents sandboxes associated with consecutively failing tasks from filling up the hard disk on Mesos agents. The backoff period is multiplied by the factor for each consecutive failure until it reaches maxLaunchDelaySeconds. This applies also to tasks that are killed due to failing too many health checks.

      This property was recently changed to default to five minutes instead of one hour.

      The Metronome JobRunSpec also exposes a property named maxLaunchDelay, but according to the documentation it has different semantics:

      The number of seconds until the job needs to be running. If the deadline is reached without successfully running the job, the job is aborted.

      However, Metronome has always used maxLaunchDelay to populate the corresponding property in the resulting Marathon AppDefinition (see here on master). The Metronome documentation is therefore wrong and misleading. As you can see here, Metronome sets both the initial backoff and the backoff factor to 0:

      backoffStrategy = BackoffStrategy(
        backoff = 0.seconds,
        factor = 0.0,
        maxLaunchDelay = FiniteDuration(jobSpec.run.maxLaunchDelay.toMillis, TimeUnit.MILLISECONDS))

      As a result, the maxLaunchDelay has no effect at all, since the initial backoff of 0 seconds will be multiplied by a factor of 0 upon each failure, resulting in a backoff of 0.
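      The arithmetic above can be checked with a small model. This is a simplified sketch of the documented backoff behavior, not Marathon's actual implementation: the BackoffModel class, its delayAfter method, and the one-hour and five-minute caps are assumptions made for illustration only.

```scala
import scala.concurrent.duration._

// Simplified model of the documented behavior; NOT Marathon's actual code.
// BackoffModel and delayAfter are hypothetical names used for illustration.
final case class BackoffModel(
    backoff: FiniteDuration,
    factor: Double,
    maxLaunchDelay: FiniteDuration) {

  // Delay applied after `failures` consecutive failures (failures >= 1):
  // the initial backoff is multiplied by the factor once per further
  // failure and capped at maxLaunchDelay.
  def delayAfter(failures: Int): FiniteDuration = {
    require(failures >= 1, "failures must be positive")
    val uncappedMillis = backoff.toMillis * math.pow(factor, failures - 1)
    val cappedMillis = math.min(uncappedMillis, maxLaunchDelay.toMillis.toDouble)
    cappedMillis.toLong.millis
  }
}

object BackoffDemo {
  // The values Metronome passes (backoff 0, factor 0); the cap is arbitrary.
  val metronome: BackoffModel = BackoffModel(0.seconds, 0.0, 1.hour)

  // Non-zero values for comparison; chosen for illustration only.
  val nonZero: BackoffModel = BackoffModel(1.second, 2.0, 5.minutes)

  def main(args: Array[String]): Unit =
    (1 to 5).foreach { n =>
      println(s"failures=$n metronome=${metronome.delayAfter(n)} nonZero=${nonZero.delayAfter(n)}")
    }
}
```

      With a backoff of 0 the delay stays at zero regardless of the factor or the cap, so the value of maxLaunchDelay never comes into play. With non-zero values the delay grows until it reaches the cap.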

      The somewhat related startingDeadlineSeconds in Metronome is only used to decide whether a new task should be started after one has failed. It is not used to abort a JobRun when a task keeps running after, say, an hour, or when it is stuck in staging and never terminates.

      Proposal

      We have several options to fix the perceived problem:

      1. Provide a means to configure a complete backoffStrategy in Metronome, so that a backoff can be applied. For long-running services that potentially depend on other services, applying a backoff makes sense. For jobs, whose ultimate goal is to finish and not be restarted, this might not be considered valuable.
      2. Deprecate the maxLaunchDelay property, as it has no effect, and remove it in a future version.
      3. Add a property in conjunction with startingDeadlineSeconds that can be used to abort a JobRun if, for example, a task is stuck in staging. This would have to be scoped, and could be handled by something similar to the OverdueTasksActor in Marathon.
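
      As a rough illustration of option 3, the check that such a property would enable might look like the sketch below. Everything here is hypothetical: the RunState type, the OverdueCheck object, and the launchDeadline parameter are invented for this example and do not exist in Metronome or Marathon.

```scala
import scala.concurrent.duration._

// Hypothetical, simplified run states; Metronome's real model differs.
sealed trait RunState
case object Staging extends RunState
case object Running extends RunState

object OverdueCheck {
  // Returns true when a run has not reached Running within the deadline.
  // `launchDeadline` stands in for the proposed new property; it is an
  // assumption, not an existing Metronome field.
  def shouldAbort(
      state: RunState,
      launchedAtMillis: Long,
      nowMillis: Long,
      launchDeadline: FiniteDuration): Boolean =
    state match {
      case Staging => nowMillis - launchedAtMillis >= launchDeadline.toMillis
      case Running => false
    }
}
```

      A periodic actor, in the spirit of the OverdueTasksActor mentioned above, could evaluate such a check for every active JobRun and abort those for which it returns true.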

      Original text

      According to our UI, this is what `maxLaunchDelay` is supposed to do: "The number of seconds until the job needs to be running. If the deadline is reached without successfully running the job, the job is aborted."

      Based on my analysis and on the situation on soak this afternoon, this is not true. We have seen jobs that stayed in the "staging" state for many hours.

      The only thing we do with the property is map it to the Marathon runspec here: https://github.com/dcos/metronome/blob/master/jobs/src/main/scala/dcos/metronome/utils/glue/MarathonImplicits.scala#L124

      But that is a different property: the maximum number of seconds for which an app can be backed off, which has absolutely no effect on Metronome. We do enforce a starting deadline, but not for jobs that enter staging and never reach running: https://github.com/dcos/metronome/blob/master/jobs/src/main/scala/dcos/metronome/jobrun/impl/JobRunExecutorActor.scala#L40
      If we go back in the Metronome history, it looks like this never worked: https://github.com/dcos/metronome/blame/3000b5c8d5612983db6d9abdd4458f577436a1dc/jobs/src/main/scala/dcos/metronome/utils/glue/MarathonImplicits.scala#L103

      Acceptance criteria

      When a job is started and does not reach the running state before `maxLaunchDelay` elapses, it should be aborted and a new job should be started.
      This fix should be backported.

            People

            • Assignee:
              matthias.eichstedt Matthias Eichstedt
              Reporter:
              alenavarkockova Alena Varkockova
              Team:
              Orchestration Team
              Watchers:
              Alena Varkockova
