  1. Marathon
  2. MARATHON-7726

Expose path of a failed health check in TaskFailure

    Details

      Description

      Requirement
      Expose information about the health check that last failed leading to the task being killed.

      Acceptance criteria

      • When a task is killed due to a failed Marathon health check, Marathon exposes information about the path (and optionally port) that was used to perform the health check.
      • If there are multiple Marathon health checks defined, Marathon exposes information about the health check that failed last, leading to the task being killed.
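The second criterion can be illustrated with a small client-side sketch. It is not Marathon's implementation: the function names `parse_ts` and `last_failed_check` are hypothetical, and the sketch assumes that the entries of a task's `healthCheckResults` appear in the same order as the app's `healthChecks`, which this ticket does not confirm.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    # Timestamps in the API output look like "2017-08-23T10:46:23.151Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def last_failed_check(health_checks: list, results: list) -> Optional[dict]:
    """Return the definition of the check with the most recent lastFailure.

    ASSUMPTION (not confirmed by the ticket): results[i] belongs to
    health_checks[i]. Returns None if no check has recorded a failure.
    """
    failed = [
        (parse_ts(result["lastFailure"]), check)
        for check, result in zip(health_checks, results)
        if result.get("lastFailure")
    ]
    if not failed:
        return None
    # Pick the pair with the latest failure timestamp, return its definition.
    return max(failed, key=lambda pair: pair[0])[1]
```

Given the app's `healthChecks` list and one task's `healthCheckResults` list, the function returns the definition (including `path` and `portIndex`) of the check that failed most recently. This only helps while the task is still listed; once the task has been killed, its results are gone, which is why the information needs to be part of the task failure itself.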

      Hint
      The Unhealthy result should be populated with a cause that hints at the health check that failed.
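As a rough sketch of what such a cause could look like, the helper below builds a failure message that names the protocol, path, and port of the failed check. The function name `describe_failed_check` and the message layout are assumptions made for illustration; they are not Marathon's actual code or output format. The field names (`protocol`, `path`, `portIndex`) are taken from the app definition in this ticket.

```python
def describe_failed_check(check: dict, cause: str) -> str:
    """Build a failure message that identifies the failed health check.

    `check` is one entry of the app's "healthChecks" list; `cause` is the
    underlying error text. The output format is hypothetical.
    """
    parts = [f"protocol={check.get('protocol', 'unknown')}"]
    if check.get("path") is not None:
        parts.append(f"path={check['path']}")
    # The port is optional in the acceptance criteria; include it if present.
    if "portIndex" in check:
        parts.append(f"portIndex={check['portIndex']}")
    elif "port" in check:
        parts.append(f"port={check['port']}")
    details = ", ".join(parts)
    return f"Task was killed since health check failed ({details}). Reason: {cause}"


# Example with one of the checks from this ticket. Which check really failed
# is unknown, so the choice of "/health/checkKafka" here is arbitrary.
check = {"protocol": "HTTP", "path": "/health/checkKafka", "portIndex": 0}
print(describe_failed_check(check, "AskTimeoutException: Ask timed out"))
```

With a message of this shape in `lastTaskFailure`, the failing endpoint could be read directly from the API response without correlating it against the app definition.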


      Original text
      Marathon: 1.4.5
      Mesos: 1.3.0

      I have a problem with a production application that is periodically killed due to a failed health check, but the Marathon UI and Marathon API don't include enough information to tell which health check failed.

      The problem is that the 'lastTaskFailure' attribute doesn't contain enough information to tell which of the health checks failed. The value of that property is:

      "lastTaskFailure": {
        "appId": "/myapps/prod-app",
        "host": "xxxxxxx",
        "message": "Task was killed since health check failed. Reason: AskTimeoutException: Ask timed out on [Actor[akka://marathon/user/IO-HTTP#1571149239]] after [15000 ms]. Sender[null] sent message of type \"spray.http.HttpRequest\".",
        "state": "TASK_KILLED",
        "taskId": "myapps_prod-app.8f046197-8703-11e7-947d-525400362d69",
        "timestamp": "2017-08-23T10:46:23.151Z",
        "version": "2017-08-22T06:31:41.154Z",
        "slaveId": "00925b1a-c3b9-4d24-baf3-5300d99bb4b9-S6"
      }
      

      And here is a copy of the app (with some details hidden) as fetched from the /v2/apps/<appId> REST endpoint:

      {
        "app": {
          "id": "/myapps/prod-app",
          "cmd": null,
          "args": null,
          "user": null,
          "instances": 2,
          "cpus": 0.2,
          "mem": 1024,
          "disk": 0,
          "gpus": 0,
          "executor": "",
          "constraints": [
            [
              "hostname",
              "UNIQUE"
            ]
          ],
          "uris": [],
          "fetch": [],
          "storeUrls": [],
          "backoffSeconds": 1,
          "backoffFactor": 2,
          "maxLaunchDelaySeconds": 3600,
          "container": {
            "type": "DOCKER",
            "volumes": [],
            "docker": {
              "image": "xxxxx",
              "network": "BRIDGE",
              "portMappings": [
                {
                  "containerPort": 0,
                  "hostPort": 0,
                  "servicePort": 10045,
                  "protocol": "tcp",
                  "labels": {}
                },
                {
                  "containerPort": 0,
                  "hostPort": 0,
                  "servicePort": 10046,
                  "protocol": "tcp,udp",
                  "labels": {}
                },
                {
                  "containerPort": 0,
                  "hostPort": 0,
                  "servicePort": 10070,
                  "protocol": "tcp,udp",
                  "labels": {}
                }
              ],
              "privileged": false,
              "parameters": [],
              "forcePullImage": false
            }
          },
          "healthChecks": [
            {
              "gracePeriodSeconds": 120,
              "intervalSeconds": 20,
              "timeoutSeconds": 15,
              "maxConsecutiveFailures": 3,
              "portIndex": 0,
              "path": "/health/ping",
              "protocol": "HTTP",
              "ignoreHttp1xx": false
            },
            {
              "gracePeriodSeconds": 120,
              "intervalSeconds": 20,
              "timeoutSeconds": 15,
              "maxConsecutiveFailures": 3,
              "portIndex": 0,
              "path": "/health/checkDatabase",
              "protocol": "HTTP",
              "ignoreHttp1xx": false
            },
            {
              "gracePeriodSeconds": 120,
              "intervalSeconds": 20,
              "timeoutSeconds": 15,
              "maxConsecutiveFailures": 1,
              "portIndex": 0,
              "path": "/health/checkKafka",
              "protocol": "HTTP",
              "ignoreHttp1xx": false
            },
            {
              "gracePeriodSeconds": 120,
              "intervalSeconds": 20,
              "timeoutSeconds": 15,
              "maxConsecutiveFailures": 3,
              "portIndex": 0,
              "path": "/health/checkConsul",
              "protocol": "HTTP",
              "ignoreHttp1xx": false
            }
          ],
          "readinessChecks": [],
          "dependencies": [],
          "upgradeStrategy": {
            "minimumHealthCapacity": 1,
            "maximumOverCapacity": 1
          },
          "labels": {},
          "ipAddress": null,
          "version": "2017-08-22T06:31:41.154Z",
          "residency": null,
          "secrets": {},
          "taskKillGracePeriodSeconds": null,
          "unreachableStrategy": {
            "inactiveAfterSeconds": 300,
            "expungeAfterSeconds": 600
          },
          "killSelection": "YOUNGEST_FIRST",
          "ports": [
            10045,
            10046,
            10070
          ],
          "portDefinitions": [
            {
              "port": 10045,
              "protocol": "tcp",
              "labels": {}
            },
            {
              "port": 10046,
              "protocol": "tcp",
              "labels": {}
            },
            {
              "port": 10070,
              "protocol": "tcp",
              "labels": {}
            }
          ],
          "requirePorts": false,
          "versionInfo": {
            "lastScalingAt": "2017-08-22T06:31:41.154Z",
            "lastConfigChangeAt": "2017-08-22T06:31:41.154Z"
          },
          "tasksStaged": 0,
          "tasksRunning": 2,
          "tasksHealthy": 2,
          "tasksUnhealthy": 0,
          "deployments": [],
          "tasks": [
            {
              "ipAddresses": [
                {
                  "ipAddress": "xxxxx",
                  "protocol": "IPv4"
                }
              ],
              "stagedAt": "2017-08-22T06:31:41.218Z",
              "state": "TASK_RUNNING",
              "ports": [
                9856,
                9857,
                9858
              ],
              "startedAt": "2017-08-22T06:31:51.584Z",
              "version": "2017-08-22T06:31:41.154Z",
              "id": "myapps_prod-app.8f0488a8-8703-11e7-947d-525400362d69",
              "appId": "/myapps/prod-app",
              "slaveId": "00925b1a-c3b9-4d24-baf3-5300d99bb4b9-S14",
              "host": "xxxxxxx",
              "healthCheckResults": [
                {
                  "alive": true,
                  "consecutiveFailures": 0,
                  "firstSuccess": "2017-08-22T06:32:46.988Z",
                  "lastFailure": null,
                  "lastSuccess": "2017-08-23T22:42:31.905Z",
                  "lastFailureCause": null,
                  "instanceId": "myapps_prod-app.marathon-8f0488a8-8703-11e7-947d-525400362d69"
                },
                {
                  "alive": true,
                  "consecutiveFailures": 0,
                  "firstSuccess": "2017-08-22T06:32:47.519Z",
                  "lastFailure": null,
                  "lastSuccess": "2017-08-23T22:42:31.994Z",
                  "lastFailureCause": null,
                  "instanceId": "myapps_prod-app.marathon-8f0488a8-8703-11e7-947d-525400362d69"
                },
                {
                  "alive": true,
                  "consecutiveFailures": 0,
                  "firstSuccess": "2017-08-22T06:32:46.987Z",
                  "lastFailure": null,
                  "lastSuccess": "2017-08-23T22:42:31.816Z",
                  "lastFailureCause": null,
                  "instanceId": "myapps_prod-app.marathon-8f0488a8-8703-11e7-947d-525400362d69"
                },
                {
                  "alive": true,
                  "consecutiveFailures": 0,
                  "firstSuccess": "2017-08-22T06:32:46.988Z",
                  "lastFailure": null,
                  "lastSuccess": "2017-08-23T22:42:31.816Z",
                  "lastFailureCause": null,
                  "instanceId": "myapps_prod-app.marathon-8f0488a8-8703-11e7-947d-525400362d69"
                }
              ]
            },
            {
              "ipAddresses": [
                {
                  "ipAddress": "xxxxx",
                  "protocol": "IPv4"
                }
              ],
              "stagedAt": "2017-08-23T10:46:33.606Z",
              "state": "TASK_RUNNING",
              "ports": [
                9369,
                9370,
                9371
              ],
              "startedAt": "2017-08-23T10:46:40.164Z",
              "version": "2017-08-22T06:31:41.154Z",
              "id": "myapps_prod-app.54678358-87f0-11e7-947d-525400362d69",
              "appId": "/myapps/prod-app",
              "slaveId": "00925b1a-c3b9-4d24-baf3-5300d99bb4b9-S3",
              "host": "xxxxxxx",
              "healthCheckResults": [
                {
                  "alive": true,
                  "consecutiveFailures": 0,
                  "firstSuccess": "2017-08-23T10:47:48.312Z",
                  "lastFailure": null,
                  "lastSuccess": "2017-08-23T22:42:31.905Z",
                  "lastFailureCause": null,
                  "instanceId": "myapps_prod-app.marathon-54678358-87f0-11e7-947d-525400362d69"
                },
                {
                  "alive": true,
                  "consecutiveFailures": 0,
                  "firstSuccess": "2017-08-23T10:47:48.364Z",
                  "lastFailure": null,
                  "lastSuccess": "2017-08-23T22:42:31.979Z",
                  "lastFailureCause": null,
                  "instanceId": "myapps_prod-app.marathon-54678358-87f0-11e7-947d-525400362d69"
                },
                {
                  "alive": true,
                  "consecutiveFailures": 0,
                  "firstSuccess": "2017-08-23T10:47:48.272Z",
                  "lastFailure": null,
                  "lastSuccess": "2017-08-23T22:42:31.816Z",
                  "lastFailureCause": null,
                  "instanceId": "myapps_prod-app.marathon-54678358-87f0-11e7-947d-525400362d69"
                },
                {
                  "alive": true,
                  "consecutiveFailures": 0,
                  "firstSuccess": "2017-08-23T10:47:48.312Z",
                  "lastFailure": null,
                  "lastSuccess": "2017-08-23T22:42:31.816Z",
                  "lastFailureCause": null,
                  "instanceId": "myapps_prod-app.marathon-54678358-87f0-11e7-947d-525400362d69"
                }
              ]
            }
          ],
          "lastTaskFailure": {
            "appId": "/myapps/prod-app",
            "host": "xxxxxxx",
            "message": "Task was killed since health check failed. Reason: AskTimeoutException: Ask timed out on [Actor[akka://marathon/user/IO-HTTP#1571149239]] after [15000 ms]. Sender[null] sent message of type \"spray.http.HttpRequest\".",
            "state": "TASK_KILLED",
            "taskId": "myapps_prod-app.8f046197-8703-11e7-947d-525400362d69",
            "timestamp": "2017-08-23T10:46:23.151Z",
            "version": "2017-08-22T06:31:41.154Z",
            "slaveId": "00925b1a-c3b9-4d24-baf3-5300d99bb4b9-S6"
          }
        }
      }
      

      This issue has been created automatically from Marathon GitHub Issue 5490.

              People

              • Assignee:
                Unassigned
                Reporter:
                marathon-bot Marathon Bot
                Team:
                Orchestration Team
                Watchers:
                ddevos78, imechemi, Karsten Jeschkies, Ken Sipe, Marco Monaco, Matthias Eichstedt
