  DC/OS / DCOS_OSS-2277

DCOS Diagnostics Checks Too Aggressive Timeout

    Details

    • Story Points:
      3

      Description

      Summary

      The `mesos-metrics` agent check in `dcos-checks` has too aggressive a timeout: when the system or the Mesos agent serving the metrics endpoint is very busy, the check overruns the timeout and is killed.

      Walkthrough

      The screenshot below shows two of my agent nodes reporting as unhealthy.

      [Screenshot attachment: image-2018-03-20-10-33-30-118.png]

      Checking the service health shows that the `DCOS Poststart Checks` are failing.

      Checking `systemctl status dcos-checks-poststart` on the affected nodes shows the following output:

      ● dcos-checks-poststart.service - DC/OS Poststart Checks: Run node-poststart checks
         Loaded: loaded (/opt/mesosphere/packages/dcos-checks--8e85c6860115e74c3f75551ee96d2e85e0b5cf25/dcos.target.wants/dcos-checks-poststart.service; enabled; vendor preset: disabled)
         Active: failed (Result: exit-code) since Tue 2018-03-20 14:35:03 UTC; 53s ago
        Process: 18558 ExecStart=/opt/mesosphere/bin/dcos-diagnostics check node-poststart (code=exited, status=3)
       Main PID: 18558 (code=exited, status=3)

      Mar 20 14:35:02 sp204-happy-b.dev.lgscout.com systemd[1]: Started DC/OS Poststart Checks: Run node-poststart checks.
      Mar 20 14:35:02 sp204-happy-b.dev.lgscout.com adminrouter.sh[5992]: 2018/03/20 14:35:02 [info] 6026#0: *16915 client 10.66.18.52 closed keepalive connection
      Mar 20 14:35:03 sp204-happy-b.dev.lgscout.com systemd[1]: dcos-checks-poststart.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
      Mar 20 14:35:03 sp204-happy-b.dev.lgscout.com systemd[1]: dcos-checks-poststart.service: Failed with result 'exit-code'.

       

      Manually running the poststart checks shows:

       

      /opt/mesosphere/bin/dcos-diagnostics --verbose check node-poststart
      {
        "error": "One or more requested checks failed to execute.",
        "checks": [
          "mesos_agent_registered_with_masters: signal: killed"
        ]
      }
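      The "signal: killed" message is consistent with the check process being killed once it overruns its timeout. A rough way to reproduce the same behaviour by hand (a sketch using GNU coreutils `timeout`, not the exact mechanism dcos-diagnostics uses internally) is to wrap the check in a 1s deadline:

      timeout --signal=KILL 1s /opt/mesosphere/bin/dcos-checks --role agent mesos-metrics --verbose
      echo $?    # 137 (128 + SIGKILL) whenever the check takes longer than 1s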

       

      Manually running the `mesos_agent_registered_with_masters` check shows successful execution:

       

      wbrown@sp204-happy-b ~ $ /opt/mesosphere/bin/dcos-checks --role agent mesos-metrics --verbose
      DEBU[0000] GET http://10.66.18.52:5051/metrics/snapshot
      wbrown@sp204-happy-b ~ $ echo $?
      0

       

      Timing the check shows that it takes over 1s to execute:

      wbrown@sp204-happy-b ~ $ time /opt/mesosphere/bin/dcos-checks --role agent mesos-metrics --verbose
      DEBU[0000] GET http://10.66.18.52:5051/metrics/snapshot

      real 0m1.233s
      user 0m0.004s
      sys 0m0.012s

       

      Hitting the endpoint manually with curl and timing it shows similar results:

      wbrown@sp204-happy-b ~ $ time curl http://10.66.18.52:5051/metrics/snapshot
      {"slave\/gpus_total":0.0,"slave\/tasks_killed":3.0,"gc\/path_removals_succeeded":0.0,"slave\/executors_terminating":0.0,"slave\/tasks_failed":1.0,"slave\/disk_revocable_total":0.0,"slave\/tasks_killing":0.0,"slave\/tasks_starting":0.0,"containerizer\/fetcher\/task_fetches_succeeded":55.0,"slave\/gpus_revocable_total":0.0,"system\/load_15min":12.3,"containerizer\/mesos\/provisioner\/bind\/remove_rootfs_errors":0.0,"system\/mem_total_bytes":270460579840.0,"slave\/cpus_used":29.5,"system\/cpus_total":32.0,"slave\/frameworks_active":2.0,"containerizer\/mesos\/filesystem\/containers_new_rootfs":0.0,"slave\/mem_used":124608.0,"system\/load_1min":5.4,"gc\/path_removals_failed":0.0,"containerizer\/fetcher\/cache_size_used_bytes":0.0,"slave\/uptime_secs":1420.700610048,"slave\/disk_percent":0.608669558997598,"slave\/disk_revocable_percent":0.0,"slave\/tasks_gone":0.0,"containerizer\/fetcher\/task_fetches_failed":0.0,"slave\/executors_registering":0.0,"slave\/invalid_status_updates":0.0,"slave\/cpus_percent":0.921875,"slave\/executors_preempted":0.0,"slave\/cpus_total":32.0,"slave\/tasks_finished":23.0,"slave\/gpus_used":0.0,"system\/load_5min":10.97,"slave\/valid_status_updates":0.0,"slave\/disk_used":2014216.0,"slave\/executor_directory_max_allowed_age_secs":138462.431292152,"slave\/container_launch_errors":0.0,"slave\/mem_revocable_used":0.0,"slave\/invalid_framework_messages":0.0,"slave\/tasks_running":10.0,"slave\/executors_running":10.0,"slave\/gpus_revocable_used":0.0,"slave\/executors_terminated":27.0,"slave\/valid_framework_messages":0.0,"slave\/tasks_staging":0.0,"slave\/registered":1.0,"slave\/gpus_percent":0.0,"slave\/tasks_lost":0.0,"slave\/disk_total":3309211.0,"slave\/mem_percent":0.485031548381321,"slave\/mem_revocable_total":0.0,"slave\/disk_revocable_used":0.0,"slave\/gpus_revocable_percent":0.0,"slave\/cpus_revocable_total":0.0,"gc\/path_removals_pending":108.0,"system\/mem_free_bytes":54394863616.0,"slave\/mem_revocable_percent":0.0,"slave\/cpus_revocable_percent":0.0,"containerizer\/fetcher\/cache_size_total_bytes":2147483648.0,"slave\/recovery_errors":0.0,"slave\/cpus_revocable_used":0.0,"slave\/mem_total":256907.0,"containerizer\/mesos\/container_destroy_errors":0.0,"containerizer\/mesos\/provisioner\/remove_container_errors":0.0}
      real 0m1.329s
      user 0m0.004s
      sys 0m0.001s
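      A single run can be unlucky, so it is worth sampling the endpoint latency repeatedly to see how often it crosses the 1s mark (a sketch; 10.66.18.52:5051 is the agent address from this report, substitute your own):

      for i in $(seq 1 20); do
        curl -o /dev/null -s -w '%{time_total}\n' http://10.66.18.52:5051/metrics/snapshot
      done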

      Looking at the DC/OS check configuration shows a default timeout of 1s for this check:

      {{ "mesos_agent_registered_with_masters": {}}
      {{ "description": "The Mesos agent has registered with the masters",}}
      {{ "cmd": [}}
      {{ "/opt/mesosphere/bin/dcos-checks",}}
      {{ "--role",}}
      {{ "agent",}}
      {{ "mesos-metrics"}}
      {{ ],}}
      {{ "timeout": "1s",}}
      {{ "roles": [}}
      {{ "agent"}}
      {{ ]}}
      {{ },}}
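      The output above does not name the file that carries this definition. One way to locate it on a node (a sketch, assuming the rendered check configuration lives somewhere under /opt/mesosphere/etc, which can differ between DC/OS versions) is to grep for the check name:

      grep -rl mesos_agent_registered_with_masters /opt/mesosphere/etc 2>/dev/null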

      Workaround

      Changing the timeout from 1s to 5s and rerunning the dcos-diagnostics utility shows that this corrects the issue:

      wbrown@sp204-happy-b ~ $ /opt/mesosphere/bin/dcos-diagnostics --verbose check node-poststart
      {
        "status": 0,
        "checks": {
          "clock_sync": {
            "output": "Clock is synced\n",
            "status": 0
          },
          "components_agent": {
            "output": "",
            "status": 0
          },
          "curl": {
            "output": "",
            "status": 0
          },
          "ip_detect_script": {
            "output": "10.66.18.52 is a valid IPV4 address\n",
            "status": 0
          },
          "journald_dir_permissions": {
            "output": "directory /var/log/journal has the group owner `systemd-journal` and group permissons r-x\n",
            "status": 0
          },
          "mesos_agent_registered_with_masters": {
            "output": "",
            "status": 0
          },
          "tar": {
            "output": "",
            "status": 0
          },
          "unzip": {
            "output": "",
            "status": 0
          },
          "xz": {
            "output": "",
            "status": 0
          }
        }
      }
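      For reference, the timeout bump above was applied by editing the check configuration by hand. A scripted equivalent might look like the following sketch; the file path and the node_checks.checks JSON layout are assumptions and should be verified against whatever file the grep above turns up on your own nodes:

      CONFIG=/opt/mesosphere/etc/dcos-check-config.json   # assumed path; confirm on your cluster
      sudo cp "$CONFIG" "$CONFIG.bak"
      jq '.node_checks.checks.mesos_agent_registered_with_masters.timeout = "5s"' "$CONFIG.bak" | sudo tee "$CONFIG" > /dev/null
      /opt/mesosphere/bin/dcos-diagnostics --verbose check node-poststart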

       

      Possible Root Cause

      The root cause may be in part that the datadog agent is also polling the metrics endpoint here, and in part that the master is busy servicing these requests.
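      One way to test that hypothesis on an affected node (a sketch) is to look at which clients are holding connections to the agent's metrics port, and at the overall load; the metrics snapshot above already shows system/load_15min at 12.3 on a 32-core box:

      sudo ss -tnp state established '( sport = :5051 )'   # look for the datadog agent among the clients
      uptime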


              People

              • Assignee:
                danielbaker Daniel Baker
              • Reporter:
                wbrown-lg wbrown-lg
              • Team:
                Cluster Ops Team
              • Watchers:
                Asit Trivedi, Branden Rolston, Craig Neth, Greg Mann, Gustav Paul, Harpreet Gulati, Pablo Caceres, wbrown-lg, Yishan Lin (Inactive)
