Marathon / MARATHON-7764

Marathon does not apply TASK_UNREACHABLE until full reconciliation

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: Marathon 1.4.7
    • Fix Version/s: None
    • Component/s: Scheduling
    • Labels:

      Description

      It looks like Marathon is ignoring the TASK_UNREACHABLE event sent from Mesos when it detects that the agent is gone.

      In our case this happens when the host comes back online again and the agent reconnects to Mesos. If the machine is stopped (and never restarts) Mesos will send the TASK_UNREACHABLE after ~75 seconds (which is expected based on default Mesos config for agent_ping_timeout and max_agent_ping_timeouts).

      Marathon will not launch a replacement task until it performs an explicit reconciliation (every 10 minutes by default). At that point the task is labeled "UnreachableInactive" and a new task is launched.

      We're using a Marathon health check, which is triggered ~14 seconds after the host goes away. Marathon then tries to kill the task (without success, since the mesos-agent for that host isn't running). As far as I understand, Marathon will not launch a replacement task until it has successfully killed the unhealthy task, right?

      I would expect that Marathon could "move on" when it receives the TASK_UNREACHABLE message from Mesos. But instead the TASK_UNREACHABLE event is ignored and it's not until a full reconciliation is done that the task state in Marathon is updated.

      Is this a known issue? Or do we need to tweak configuration parameters for Marathon to handle TASK_UNREACHABLE correctly (or rather, to handle an agent host restart gracefully)?

      See below for more details. I've also attached the log-files from Marathon and Mesos.

      logs.tar.gz

      Details

      Running Mesos 1.3.1 and Marathon 1.4.7

      Mesos configuration

      Default values

      • agent_ping_timeout: 15sec
      • agent_reregister_timeout: 10min
      • max_agent_ping_timeouts: 5
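
The ~75-second delay before TASK_UNREACHABLE mentioned in the description follows from these defaults. A minimal sketch of the arithmetic, assuming Mesos marks the agent unreachable once max_agent_ping_timeouts consecutive pings have each timed out after agent_ping_timeout (the variable names simply mirror the flags listed above):

```python
from datetime import timedelta

# Mesos defaults as listed in this report.
agent_ping_timeout = timedelta(seconds=15)
max_agent_ping_timeouts = 5

# Assumption: the agent is marked unreachable after this many
# consecutive ping timeouts.
unreachable_after = agent_ping_timeout * max_agent_ping_timeouts

print(unreachable_after.total_seconds())  # 75.0
```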

      Marathon configuration

      Default values

      • reconciliation_initial_delay: 15sec
      • reconciliation_interval: 10min
      • scale_apps_initial_delay: 15sec
      • scale_apps_interval: 5min
      • task_lost_expunge_initial_delay: 5min
      • task_lost_expunge_interval: 30sec
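
If replacements only happen at reconciliation, the wait depends on where the failure falls within the 10-minute reconciliation_interval. The sketch below is a rough model, not Marathon code: it assumes reconciliations repeat at a fixed interval from one known run (09:44:44, taken from the first timeline in this report) and that all times fall on the same day. The function name `next_reconciliation` is hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"
RECONCILIATION_INTERVAL = timedelta(minutes=10)  # Marathon default listed above


def next_reconciliation(event: str, known_run: str) -> str:
    """Return the first scheduled reconciliation at or after `event`.

    Assumes a fixed interval starting from `known_run`, same day.
    """
    event_t = datetime.strptime(event, FMT)
    run_t = datetime.strptime(known_run, FMT)
    while run_t < event_t:
        run_t += RECONCILIATION_INTERVAL
    return run_t.strftime(FMT)


# TASK_UNREACHABLE timestamps from the "delete" and "restart2" timelines.
print(next_reconciliation("10:34:31", "09:44:44"))  # 10:34:44
print(next_reconciliation("10:39:16", "09:44:44"))  # 10:44:44
```

Both results match the reconciliation times recorded in the timelines, which is consistent with the replacement being driven by the periodic reconciliation rather than by the TASK_UNREACHABLE update itself.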

      Service config

       
      { 
          "backoffFactor": 1.15, 
          "backoffSeconds": 1, 
          "cmd": "python -m SimpleHTTPServer $PORT_HTTP", 
          "constraints": [], 
          "cpus": 1, 
          "dependencies": [], 
          "disk": 0, 
          "env": {}, 
          "gpus": 0, 
          "healthChecks": [ 
              { 
                  "gracePeriodSeconds": 300, 
                  "ignoreHttp1xx": false, 
                  "intervalSeconds": 5, 
                  "maxConsecutiveFailures": 5, 
                  "path": "/", 
                  "portIndex": 0, 
                  "protocol": "HTTP", 
                  "timeoutSeconds": 5 
              } 
          ], 
          "id": "/http", 
          "instances": 1, 
          "maxLaunchDelaySeconds": 3600, 
          "mem": 128, 
          "portDefinitions": [ 
              { 
                  "labels": {}, 
                  "name": "http", 
                  "port": 10000, 
                  "protocol": "tcp" 
              } 
          ], 
          "ports": [ 
              10000 
          ], 
          "unreachableStrategy": { 
              "expungeAfterSeconds": 60, 
              "inactiveAfterSeconds": 10 
          }, 
          "upgradeStrategy": { 
              "maximumOverCapacity": 1, 
              "minimumHealthCapacity": 1 
          } 
      } 
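
In all three timelines in this report, the kill request follows the first "Service unhealthy" entry by 25 seconds. That matches intervalSeconds × maxConsecutiveFailures from the health check above. A rough sketch of that estimate, assuming one failure is counted per check interval (an approximation for illustration, not Marathon's actual implementation; `seconds_until_kill` is a hypothetical helper):

```python
def seconds_until_kill(interval_seconds: int, max_consecutive_failures: int) -> int:
    """Estimate the time from the first failed check to the kill request.

    Assumption: one failure is counted per check interval.
    """
    return interval_seconds * max_consecutive_failures


# Values copied from the healthChecks entry in the service config.
health_check = {"intervalSeconds": 5, "maxConsecutiveFailures": 5}

estimate = seconds_until_kill(
    health_check["intervalSeconds"],
    health_check["maxConsecutiveFailures"],
)
print(estimate)  # 25
```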
      

      Restarting a host (restart)

      Restarting a host. For some reason Marathon launches 2 tasks after reconciliation. It's not until the next `scale_apps_interval` run (5 minutes later) that this is detected and the extra task is killed.

      Elapsed (s)  Time      Mesos                   Marathon
      0            09:38:40  Agent disconnected
      14           09:38:54                          Service unhealthy
      39           09:39:19  Receive kill (fails)    <- Request Kill
      79           09:39:59  Connect new agent
                             Send TASK_LOST ->       Receive TASK_LOST
      364          09:44:44                          Reconciliation
                             Send TASK_UNKNOWN ->
                                                     <- Start task
                             Send TASK_RUNNING ->
                                                     <- Start task
                             Send TASK_RUNNING ->
      664          09:49:44                          Scale down 2 -> 1
                                                     <- Request KILL
                             Processing KILL

      Removing a host (delete)

      Shutting down a host (not rebooting it). No replacement task until full reconciliation.

      Elapsed (s)  Time      Mesos                           Marathon
      0            10:33:03  Agent disconnected
      14           10:33:17                                  Service Unhealthy
      39           10:33:42                                  FailedHealthChecks
                             Receive kill (fails)            <- Send Kill
      88           10:34:31  Mark agent as unreachable
                             Send TASK_UNREACHABLE ->        Received TASK_UNREACHABLE
      101          10:34:44                                  <- Reconciliation
                                                             Set state UnreachableInactive
                                                             <- Start Task
                             Add task, Send TASK_RUNNING ->  Receive TASK_RUNNING
      176          10:35:59                                  Expunge old task

      Restarting a host (restart2)

      Restarting a host. No replacement task until full reconciliation.

      Elapsed (s)  Time      Mesos                           Marathon
      0            10:37:51  Agent disconnected
      12           10:38:03                                  Service Unhealthy
      37           10:38:28  Receive kill (fails)            <- Request Kill
      79           10:39:10  Connect new agent
      85           10:39:16  Mark as unreachable
                             Send TASK_UNREACHABLE ->        Receive TASK_UNREACHABLE
      413          10:44:44                                  <- Reconciliation
                                                             Set state UnreachableInactive
                                                             <- Start task
                             Add task, Send TASK_RUNNING ->
      428          10:44:59                                  Expunging old task
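
To make the gap in this last run concrete, the sketch below compares when the task would have gone inactive and been expunged under the configured unreachableStrategy with what was observed. It assumes both timers start when Marathon receives TASK_UNREACHABLE, which is the reporter's expectation rather than confirmed Marathon behaviour; the `shift` helper is hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def shift(timestamp: str, seconds: int) -> str:
    """Return `timestamp` moved forward by `seconds` (same-day times only)."""
    moved = datetime.strptime(timestamp, FMT) + timedelta(seconds=seconds)
    return moved.strftime(FMT)


# Values from the service config and the restart2 timeline.
unreachable_strategy = {"inactiveAfterSeconds": 10, "expungeAfterSeconds": 60}
task_unreachable_at = "10:39:16"

# Assumption: both timers are measured from the TASK_UNREACHABLE update.
expected_inactive = shift(task_unreachable_at, unreachable_strategy["inactiveAfterSeconds"])
expected_expunge = shift(task_unreachable_at, unreachable_strategy["expungeAfterSeconds"])

print("expected inactive:", expected_inactive, "| observed: 10:44:44")
print("expected expunge: ", expected_expunge, "| observed: 10:44:59")
```

Under that assumption the replacement would have been due at 10:39:26, more than five minutes before the reconciliation at 10:44:44 actually triggered it.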

      This issue has been created automatically from Marathon GitHub Issue 5506.

              People

              • Assignee: Matthias Eichstedt (matthias.eichstedt)
              • Reporter: andersenleo
              • Team: Orchestration Team
              • Watchers: andersenleo, Matthias Eichstedt