When a cluster is under load or hits certain edge cases, Marathon sometimes kills tasks that have taken too long to become RUNNING.
Affects all Marathon versions.
Marathon considers a task overdue when it is not reported as STAGING within one timeout, or does not turn RUNNING within another. Under normal circumstances, Marathon would eventually receive a status update for these tasks. Edge cases include:
- the status updates are delayed because Mesos is overloaded or agents are temporarily disconnected (but still within the timeout so they are not considered unreachable)
- Mesos never received the offer operation to launch the task, and would only send an UNKNOWN status update when asked explicitly
- the task is running in Docker, and the Docker daemon hangs, preventing a status update from being sent.
In short: things are slow to respond, possibly overloaded, or there are intrinsic problems (like an unresponsive Docker daemon). Handling these timeouts by expunging tasks masks the underlying issues and potentially puts more load on an already overloaded system. The "rescheduled" task could land on the same agent, which may already be heavily burdened with staging other tasks, and this could cause cascading effects. Task status updates are guaranteed: Marathon will receive them, and if it does not, it should reconcile. In other words, instead of expunging the associated instances, Marathon should try to reconcile task statuses.
Marathon should not deliberately delete metadata if the associated Mesos tasks are not reported as terminal. Otherwise, it maneuvers itself into a situation where the only way to reach consensus with Mesos is to kill a task that is reported RUNNING.
Currently there are two task-launch-related timeouts that, when hit, lead to Marathon expunging task information:
- task_launch_confirm_timeout with a default of 300s (5m). If Marathon does not receive any status update for a task within this timeout, the task is considered terminal and expunged from state.
- task_launch_timeout with a default of 300s (5m). If Marathon does not receive a TASK_RUNNING status update within this timeout for a task that has previously been reported as TASK_STAGING or TASK_STARTING, the task is considered terminal and expunged from state.
Both of these timeouts are handled within the OverdueTasksActor. As a result of expunging the given instances, Marathon will schedule new ones, and eventually kill the initially launched tasks in case they are reported non-terminal.
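The two checks described above can be sketched roughly as follows. This is an illustrative model only, not the OverdueTasksActor code: the `Task` record, the `overdue_reason` helper, and the choice to measure both timeouts from the launch time are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Defaults as stated in the text; constant names mirror the two timeouts.
TASK_LAUNCH_CONFIRM_TIMEOUT_S = 300
TASK_LAUNCH_TIMEOUT_S = 300


@dataclass
class Task:
    """Hypothetical task record; not Marathon's actual data model."""
    task_id: str
    launched_at: float            # seconds; when the launch was issued
    status: Optional[str] = None  # None means no status update received yet


def overdue_reason(task: Task, now: float) -> Optional[str]:
    """Return the name of the exceeded timeout, or None if the task is not overdue.

    Assumption: both timeouts are measured from launched_at.
    """
    elapsed = now - task.launched_at
    if task.status is None:
        # No status update at all: the confirmation timeout applies.
        if elapsed > TASK_LAUNCH_CONFIRM_TIMEOUT_S:
            return "task_launch_confirm_timeout"
        return None
    if task.status in ("TASK_STAGING", "TASK_STARTING"):
        # Confirmed but not yet running: the launch timeout applies.
        if elapsed > TASK_LAUNCH_TIMEOUT_S:
            return "task_launch_timeout"
    return None
```

In the current behavior, any task for which such a check reports a timeout is expunged; the proposal below replaces that step with reconciliation.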
Please check with someone from the Mesos team whether it is OK to issue an explicit reconciliation for each single task, or whether Marathon should reconcile these in batches. The former would simplify the implementation in Marathon; the latter would lead to less traffic.
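To make the trade-off concrete, here is a minimal sketch of grouping overdue tasks into reconciliation requests. The `reconcile_batches` helper and the `(task_id, agent_id)` pair type are hypothetical names for illustration; a batch size of 1 corresponds to the per-task variant.

```python
from typing import Iterable, Iterator, List, Tuple

TaskRef = Tuple[str, str]  # (task_id, agent_id); hypothetical shape


def reconcile_batches(overdue: Iterable[TaskRef], batch_size: int) -> Iterator[List[TaskRef]]:
    """Yield lists of task references, one list per reconciliation request."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[TaskRef] = []
    for ref in overdue:
        batch.append(ref)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch
```

With three overdue tasks, `batch_size=1` produces three requests while `batch_size=2` produces two, which is the traffic difference in question.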
- The following criteria should be met by the OverdueTasksActor, or by an implementation that replaces this actor.
- Reconciling tasks instead of expunging them can be considered a behavioral change: before, Marathon would schedule a replacement as soon as a task was expunged; with the new behavior, Marathon will first try to reconcile, and only eventually expunge the instance (or consider it Inactive). Therefore, this new behavior should potentially be guarded by a feature flag – unless reconciliation happens earlier than the configured timeouts, and the timeouts keep being the point of the expunge operation.
- When Marathon detects an instance for which one of the tasks reached the confirmation timeout, it will log a warning and trigger an explicit reconciliation for the given task and agentId.
- When Marathon detects an instance for which one of the tasks reached the launch timeout, it will log a warning and trigger an explicit reconciliation for the given task and agentId.
- If one of the above reconciliation requests does not result in a status update within a timeout (TBD), Marathon will log a warning and trigger another reconciliation request. The state needed to track which tasks have had reconciliation attempts may be held in memory and lost upon a failover.
- If a configurable number of reconciliation requests did not result in a status update, Marathon will log a warning and expunge the instance. (After MARATHON-8167: set the goal to Decommissioned and have the instance considered Inactive and eventually expunged upon a terminal status update.)
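The retry criteria above could be tracked along these lines. This is a sketch under stated assumptions, not a proposed implementation: `ReconciliationTracker`, `max_attempts`, and `retry_interval_s` are hypothetical names standing in for the configurable number of requests and the TBD timeout, and the state is deliberately kept in memory only.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ReconciliationTracker:
    """In-memory bookkeeping of reconciliation attempts; lost on failover by design."""
    max_attempts: int        # hypothetical: configurable number of requests
    retry_interval_s: float  # hypothetical: stands in for the TBD timeout
    _attempts: Dict[str, int] = field(default_factory=dict)
    _last_request: Dict[str, float] = field(default_factory=dict)

    def on_overdue(self, task_id: str, now: float) -> str:
        """Decide what to do for an overdue task: 'reconcile', 'wait' or 'expunge'."""
        attempts = self._attempts.get(task_id, 0)
        if attempts > 0 and now - self._last_request[task_id] < self.retry_interval_s:
            return "wait"  # the previous request is still within its timeout
        if attempts >= self.max_attempts:
            self.forget(task_id)
            return "expunge"  # caller logs a warning and expunges the instance
        self._attempts[task_id] = attempts + 1
        self._last_request[task_id] = now
        return "reconcile"  # caller logs a warning and requests reconciliation

    def on_status_update(self, task_id: str) -> None:
        """Any status update for the task ends the retry cycle."""
        self.forget(task_id)

    def forget(self, task_id: str) -> None:
        self._attempts.pop(task_id, None)
        self._last_request.pop(task_id, None)
```

Because the tracker is rebuilt empty after a failover, a new leader would simply start the retry cycle again for any task that is still overdue.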