  1. Marathon
  2. MARATHON-2492

Marathon can kill lots of services at once if its host gets overloaded

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Medium
    • Resolution: Won't Do
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Mesos Integration
    • Labels:

      Description

      If the host running Marathon is overloaded or its NIC is saturated, health checks for many different tasks may fail at once.

      Marathon kills unhealthy tasks before starting replacements and waiting for them to come up. As a result, a minute or so of high load on the Marathon master can kill every instance of your apps, depending on your health check parameters.
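
To make "depending on your health check parameters" concrete, here is a rough sketch of how long an overload would need to last before the first kill. It assumes a simplified model in which a check runs every `intervalSeconds`, a failing check takes the full `timeoutSeconds` to fail, and a task is killed after `maxConsecutiveFailures` failures in a row. The function name `kill_window` and the sample values are hypothetical, and the model ignores `gracePeriodSeconds` and the details of Marathon's actual scheduling.

```python
def kill_window(interval_seconds, timeout_seconds, max_consecutive_failures):
    """Estimate (earliest, latest) seconds of overload before a task is killed.

    Simplified model, not Marathon's implementation:
    - earliest: the overload begins just as a check is sent, so the first
      failure is already in flight and only n - 1 more intervals are needed.
    - latest: the overload begins just after a check was sent and passed,
      so a full n intervals elapse before the n-th failing check is sent.
    In both cases the final check still has to time out.
    """
    n = max_consecutive_failures
    if n < 1:
        raise ValueError("this model only covers max_consecutive_failures >= 1")
    earliest = (n - 1) * interval_seconds + timeout_seconds
    latest = n * interval_seconds + timeout_seconds
    return earliest, latest


# Hypothetical parameters: check every 10 s, 5 s timeout, kill after 3 failures.
print(kill_window(10, 5, 3))  # (25, 35)
```

Under these assumed values, roughly half a minute of overload is enough to trigger kills for every task that shares the same parameters, which matches the "minute or so" described above.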

      Some changes that might solve this issue:
      1. When tasks are unhealthy, start new tasks (and wait for their health checks to pass) before killing the old ones. This is my preferred solution, as it would also help in other situations (MGI-3318).
      2. Globally rate-limit task kills. This would limit the damage from a short-lived master overload.
      3. Add a circuit breaker that suspends health check kills when enough health checks fail within a short period of time. This would limit the damage from a longer period of overload.
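
The three proposals above can be sketched as follows. This is an illustration only and not Marathon code: `replace_unhealthy`, `KillGate`, and all of their parameters are hypothetical names, the callbacks stand in for whatever the scheduler would really do, and timestamps are plain seconds supplied by the caller (for example from `time.monotonic()`).

```python
from collections import deque


def replace_unhealthy(old_task, start_task, wait_until_healthy, kill_task):
    """Proposal 1: start a replacement and kill the old task only once the
    replacement is healthy. Returns whichever task is left running."""
    new_task = start_task()
    if wait_until_healthy(new_task):
        kill_task(old_task)
        return new_task
    # Assumed policy: if the replacement never becomes healthy, discard it
    # and keep the old task rather than losing both.
    kill_task(new_task)
    return old_task


class KillGate:
    """Proposals 2 and 3: decide whether a health check kill may proceed."""

    def __init__(self, max_kills_per_window, breaker_threshold, window_seconds):
        self.max_kills_per_window = max_kills_per_window  # global rate limit
        self.breaker_threshold = breaker_threshold        # circuit breaker trip point
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent health check failures
        self._kills = deque()     # timestamps of recent kills

    def _expire(self, events, now):
        # Drop timestamps that have fallen out of the sliding window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()

    def record_failure(self, now):
        self._failures.append(now)

    def allow_kill(self, now):
        self._expire(self._failures, now)
        self._expire(self._kills, now)
        # Breaker open: many failures at once suggest the checker (the
        # Marathon host) is the problem rather than the tasks.
        if len(self._failures) >= self.breaker_threshold:
            return False
        # Rate limit: cap the number of kills per window regardless of cause.
        if len(self._kills) >= self.max_kills_per_window:
            return False
        self._kills.append(now)
        return True


gate = KillGate(max_kills_per_window=2, breaker_threshold=5, window_seconds=60)
gate.record_failure(0)
print(gate.allow_kill(0))  # True
```

The rate limit bounds the damage of a short overload, while the breaker stops kills entirely for as long as failures keep arriving in bulk, which corresponds to the short-lived and longer-lived cases in the list.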

      cc Kyle Anderson @nhandler @rob-johnson

            People

            • Assignee:
              Unassigned
            • Reporter:
              evankrall EvanKrall
            • Team:
              Orchestration Team
            • Watchers:
              EvanKrall, Harpreet Gulati, Jason Gilanfarr (Inactive), Matthias Eichstedt

              Dates

              • Created:
                Updated:
                Resolved: