If the box running Marathon gets overloaded or its NIC gets saturated, health checks for many different tasks may fail at once.
Because Marathon kills unhealthy tasks before starting new ones and waiting for them to come up, a minute or so of high load on your Marathon master can cause all instances of your apps to die, depending on your health check parameters.
Some changes that might solve this issue:
1. When tasks are unhealthy, start new tasks (and wait for their health checks to pass) before killing old ones. (This is my preferred solution, as it would be helpful in other situations.)
2. Globally rate-limiting task kills. This would limit the damage of a short-lived master overload.
3. A circuit breaker that suspends health check kills when enough health checks fail in a short period of time. This would limit the damage of a longer period of overload.
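
To make options 2 and 3 more concrete, here is a rough sketch of how the two could combine. It is written in Python purely for illustration and is not Marathon code: `KillGovernor`, its parameters, and the shared sliding window are all hypothetical names and choices, and a real implementation might well use separate windows for the rate limit and the breaker.

```python
from collections import deque


class KillGovernor:
    """Hypothetical helper: decides whether a health-check-triggered kill may proceed."""

    def __init__(self, max_kills_per_window, breaker_threshold, window_seconds):
        self.max_kills_per_window = max_kills_per_window  # option 2: global kill rate limit
        self.breaker_threshold = breaker_threshold        # option 3: failures that trip the breaker
        self.window_seconds = window_seconds              # sliding window shared by both (assumption)
        self._failures = deque()  # timestamps of recent health check failures
        self._kills = deque()     # timestamps of recent kills

    def _expire(self, events, now):
        # Drop events that have aged out of the sliding window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()

    def record_failure(self, now):
        # Call this for every failed health check, across all apps.
        self._failures.append(now)

    def breaker_open(self, now):
        # Many failures at once suggests the master is the problem, not the tasks.
        self._expire(self._failures, now)
        return len(self._failures) >= self.breaker_threshold

    def allow_kill(self, now):
        # Refuse kills while the breaker is open or the kill budget is spent.
        if self.breaker_open(now):
            return False
        self._expire(self._kills, now)
        if len(self._kills) >= self.max_kills_per_window:
            return False
        self._kills.append(now)
        return True
```

The caller would pass in the current time (for example `time.monotonic()`) and skip the kill whenever `allow_kill` returns `False`, leaving the task running until a later health check round. Taking the clock as an argument keeps the sketch easy to test.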
cc Kyle Anderson @nhandler @rob-johnson