  1. DC/OS
  2. DCOS_OSS-4250

Avoid failure when overlay and related endpoints are not available

    Details

    • Story Points:
      1

      Description

      SC 13187

      A recent change to the diagnostic bundle collection process added code to collect the overlay state:

      https://github.com/dcos/dcos/pull/3006

      This is a very useful addition, but the collection runs even when the customer has disabled the overlay, which leads to 404 errors and a failed job when collecting a bundle:

       

      10.184.123.139
      command_exec_timeout_sec: 120
      diagnostics_bundle_dir: /var/lib/dcos/dcos-diagnostics/diag-bundles
      diagnostics_job_get_since_url_timeout_min: 8
      diagnostics_job_timeout_min: 720
      diagnostics_partition_disk_usage_percent: 7.214537866199991
      errors: ['unable to fetch https://10.184.123.139:5050/overlay-master/state. Return code 404', 'unable to fetch https://10.184.122.171:5050/overlay-master/state. Return code 404', 'unable to fetch https://10.184.121.169:5050/overlay-master/state. Return code 404', 'unable to fetch https://10.184.124.104:5051/overlay-agent/overlay. Return code 404', 'unable to fetch https://10.184.126.135:5051/overlay-agent/overlay. Return code 404', 'unable to fetch https://10.184.125.147:5051/overlay-agent/overlay. Return code 404', 'unable to fetch https://10.184.126.41:5051/overlay-agent/overlay. Return code 404', 'unable to fetch https://10.184.125.108:5051/overlay-agent/overlay. Return code 404', 'unable to fetch https://10.184.124.16:5051/overlay-agent/overlay. Return code 404']
      is_running: False
      job_duration: 1m3.218690968s
      job_ended: 2018-09-27 12:31:32.550006858 +0000 UTC
      job_progress_percentage: 100
      job_started: 2018-09-27 12:30:29.331316685 +0000 UTC
      journald_logs_since_hours: 24h
      last_bundle_dir: /var/lib/dcos/dcos-diagnostics/diag-bundles/bundle-2018-09-27-1538051429.zip
      status: Diagnostics job failed

       

      At the bare minimum, dcos-diagnostics should not fail when that endpoint is not available.
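
To illustrate the requested behavior, here is a minimal sketch in Python (not the actual dcos-diagnostics code) of treating a 404 from the overlay endpoints as "skipped" rather than as an error. The function names, the `OPTIONAL_ENDPOINT_PATHS` tuple, and the success status string are all hypothetical and exist only for this illustration. The two endpoint paths and the error message format are taken from the output above.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse

# Hypothetical allow-list: endpoints that may legitimately be absent,
# e.g. when the overlay is disabled. Paths are taken from the errors above.
OPTIONAL_ENDPOINT_PATHS = ("/overlay-master/state", "/overlay-agent/overlay")


def is_optional(url: str) -> bool:
    """Return True if the URL path is one of the optional endpoints."""
    return urlparse(url).path in OPTIONAL_ENDPOINT_PATHS


def collect_errors(
    responses: Iterable[Tuple[str, int]]
) -> Tuple[List[str], List[str]]:
    """Split (url, status_code) results into real errors and skipped URLs.

    A 404 on an optional endpoint is recorded as skipped; any other
    non-200 status is reported as an error.
    """
    errors: List[str] = []
    skipped: List[str] = []
    for url, status in responses:
        if status == 200:
            continue
        if status == 404 and is_optional(url):
            skipped.append(url)
            continue
        errors.append(f"unable to fetch {url}. Return code {status}")
    return errors, skipped


def job_status(errors: List[str]) -> str:
    """Fail the job only when real errors remain.

    The success string is an assumption made for this sketch.
    """
    return "Diagnostics job failed" if errors else "Diagnostics job finished"
```

With this approach, a cluster with the overlay disabled would still produce a bundle, and the skipped URLs could be logged for reference instead of failing the job.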

            People

            • Assignee:
              tomaszjaniszewski Tomasz Janiszewski
              Reporter:
              mbernadin Miguel Bernadin
              Team:
              Cluster Ops Team
              Watchers:
              Kevin Klues (Inactive), Matthias Eichstedt, Mergebot, Miguel Bernadin, Tomasz Janiszewski
