Uploaded image for project: 'Marathon'
  1. Marathon
  2. MARATHON-7696

Backoff settings are not respected

    Details

    • Story Points:
      3

      Description

      {
          "id": "/failing-sleep",
          "cmd": "sleep 5; exit 1",
          "cpus": 0.1,
          "mem": 10.0,
          "instances": 1,
          "backoffFactor": 5,
          "backoffSeconds": 1,
          "maxLaunchDelaySeconds": 30
      }
      

      After deploying this app, it goes one like this:

      [2017-08-13 22:48:09,475] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [0 milliseconds] to [1 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-14)
      [2017-08-13 22:48:16,053] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [-580 milliseconds] to [5 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-21)
      [2017-08-13 22:48:26,675] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [-622 milliseconds] to [25 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-21)
      [2017-08-13 22:48:57,106] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [-431 milliseconds] to [30 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-14)
      [2017-08-13 22:49:32,608] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [0 milliseconds] to [1 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-21)
      [2017-08-13 22:49:39,179] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [0 milliseconds] to [1 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-8)
      [2017-08-13 22:49:45,761] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [-582 milliseconds] to [5 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-15)
      [2017-08-13 22:49:56,296] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [-535 milliseconds] to [25 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-21)
      [2017-08-13 22:50:26,758] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [-462 milliseconds] to [30 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-19)
      

      I think in any interpretation, the delay should not go from 30s back to 1s here. If backoffFactor is set to its default default value, which is 1.15, the things get even weirder

      [2017-08-13 23:09:03,987] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [0 milliseconds] to [1 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-14)
      [2017-08-13 23:09:10,511] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [0 milliseconds] to [1 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-14)
      [2017-08-13 23:09:17,046] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [-535 milliseconds] to [1 seconds 150 milliseconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-12)
      [2017-08-13 23:09:23,732] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [0 milliseconds] to [1 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-15)
      [2017-08-13 23:09:30,252] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [0 milliseconds] to [999 milliseconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-12)
      [2017-08-13 23:09:36,666] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [-414 milliseconds] to [1 seconds 150 milliseconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-20)
      [2017-08-13 23:09:43,252] INFO  Increasing delay. Task launch delay for [/failing-sleep] changed from [0 milliseconds] to [1 seconds]. (mesosphere.marathon.core.launchqueue.impl.RateLimiter$:marathon-akka.actor.default-dispatcher-20)
      

      The delay is effectively always just a second. And it is the same with this application too:

      {
          "id": "/failing-sleep",
          "cmd": "sleep 5; exit 1",
          "cpus": 0.1,
          "mem": 10.0,
          "instances": 1,
          "container": {
              "type": "DOCKER",
              "docker": {
                  "image": "not-existing"
              }
          },
          "backoffFactor": 1.15,
          "backoffSeconds": 1,
          "maxLaunchDelaySeconds": 30
      }
      

      The expected behavior is

      this seems to indicate that with a backoffSeconds of 1, a backoffFactor of 5, and a maxLaunchDelaySeconds of 30 seconds, that the behavior for a continually failed task would be: Initial -> 1 second delay -> 2nd -> 5 second delay -> 3rd -> 25 second delay -> 4th -> 30 second delay -> 5th and then for all other tries, it would be a delay of 30 seconds.
      however observed behavior is that is after a few 30 second pauses, it restarts the exponential backoff.

        Attachments

          Activity

            People

            • Assignee:
              ivanchernetsky Ivan Chernetsky
              Reporter:
              ivanchernetsky Ivan Chernetsky
              Team:
              Orchestration Team
              Watchers:
              benhemp, Ivan Chernetsky, Marco Monaco
            • Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: