  1. Marathon
  2. MARATHON-7286

Unreachable strategy not working properly

    Details

    • Sprint:
      Marathon Sprint 1.10-10
    • Story Points:
      2
    • Zendesk Ticket IDs:
      8329

      Description

      I'm seeing significant problems with agent removal and unreachable strategy handling in the latest stable Marathon and Mesos versions (1.4.3 and 1.1.1, respectively).

      I started by ensuring that all of our app definitions included the new unreachableStrategy options: https://mesosphere.github.io/marathon/docs/configure-task-handling.html#unreachablestrategy-configuration-options

      I ran a quick script to PATCH all applications via the API with the following aggressive unreachable strategy:

      {
        "unreachableStrategy": {
          "inactiveAfterSeconds": 60,
          "expungeAfterSeconds": 90
        }
      }
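
      The script itself is not attached; a minimal sketch of what such a script might look like follows. The base URL is a placeholder and the helper names are hypothetical. It assumes that GET /v2/apps returns a JSON object with an "apps" list and that PATCH /v2/apps/<appId> accepts a partial app definition, as the report implies.

```python
import json
import urllib.request

# Hypothetical Marathon endpoint; replace with the real host and port.
MARATHON_URL = "http://marathon.example.com:8080"

# The aggressive strategy quoted in this report.
UNREACHABLE_STRATEGY = {
    "unreachableStrategy": {
        "inactiveAfterSeconds": 60,
        "expungeAfterSeconds": 90,
    }
}


def list_app_ids(base_url):
    """Return the ids of all apps (assumes the response has an "apps" list)."""
    with urllib.request.urlopen(f"{base_url}/v2/apps") as resp:
        body = json.load(resp)
    return [app["id"] for app in body["apps"]]


def build_patch_request(base_url, app_id):
    """Build (but do not send) the PATCH request for one app."""
    # Marathon app ids start with "/", so strip it to avoid a double slash.
    url = f"{base_url}/v2/apps/{app_id.lstrip('/')}"
    data = json.dumps(UNREACHABLE_STRATEGY).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


def patch_all_apps(base_url):
    """Apply the unreachable strategy to every app and print each HTTP status."""
    for app_id in list_app_ids(base_url):
        with urllib.request.urlopen(build_patch_request(base_url, app_id)) as resp:
            print(app_id, resp.status)
```

      Calling patch_all_apps(MARATHON_URL) would apply the strategy to every app; each PATCH triggers a deployment, so on a large cluster it may be worth pausing between requests.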
      

      All of our health checks are MESOS_HTTP; none are the HTTP checks that Marathon runs itself.

      Steps to reproduce:

      • Leave the Marathon `reconciliation_interval` command line setting unset (it defaults to 10 minutes)
      • Have a cluster with at least two slaves/agents, so that one can be shut off or terminated
      • Launch a few dummy apps in Marathon with the unreachable strategy settings above
      • Terminate or disconnect a slave (kill -USR1 <mesos_slave_pid>; sleep 10; systemctl stop mesos-slave)
      • Notice that although Marathon reports "0 of 1 running" or similar, the app continues to show "Healthy" for its health check even though it isn't reachable at all. It won't be restarted until the next time Marathon runs task reconciliation, which can be up to 10 minutes later.
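
      To make the gap concrete, here is a small sketch of the arithmetic. It assumes the documented meaning of the two settings (a replacement is launched once an instance has been unreachable for inactiveAfterSeconds, and the instance is expunged after expungeAfterSeconds) and assumes the worst case, where the agent is lost right after a reconciliation run. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnreachableStrategy:
    inactive_after_seconds: int
    expunge_after_seconds: int


def expected_timeline(strategy, unreachable_at=0):
    """Seconds at which each action should happen per the documented semantics."""
    return {
        "replacement_launched_at": unreachable_at + strategy.inactive_after_seconds,
        "expunged_at": unreachable_at + strategy.expunge_after_seconds,
    }


def worst_case_extra_delay(strategy, reconciliation_interval_seconds):
    """Extra wait if a replacement only starts at the next reconciliation.

    Assumes the agent became unreachable immediately after a reconciliation run.
    """
    return max(0, reconciliation_interval_seconds - strategy.inactive_after_seconds)


strategy = UnreachableStrategy(inactive_after_seconds=60, expunge_after_seconds=90)
print(expected_timeline(strategy))            # expected: replacement at 60 s, expunge at 90 s
print(worst_case_extra_delay(strategy, 600))  # 540 s later than expected with the 10 minute default
```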

      It also seems that only the "expungeAfterSeconds" setting is being honored; I'm not sure "inactiveAfterSeconds" is working at all. That is only a hunch, though, and I haven't been able to prove it yet.

      A not-so-great workaround is to set Marathon's reconciliation interval to a very low value, such as 60 seconds, which seems to force things back into sync more quickly. You can also force Marathon to reconcile by simply restarting it, after which it seems to honor the unreachable settings more quickly.

      Let me know if I can provide any other useful info!

              People

              • Assignee:
                Johannes Unterstein (junterstein)
              • Reporter:
                toofishes
              • Team:
                Orchestration Team
              • Watchers:
                brugidou, daltonmatos, Johannes Unterstein, kamsz, Karsten Jeschkies, Marco Monaco, Matthias Eichstedt, Ondrej Smola, sethp-nr, Tim Harper, Tobias Mueller, toofishes
