  1. DC/OS
  2. DCOS_OSS-394

Provide better Master HA Status, reporting, and Drill down capabilities



      While upgrading my DC/OS cluster from 1.7 to 1.8, I found it hard to know the status of the masters. For example, I'd reboot a non-leader master and wait for it to come up. Before moving to the next master, I'd check the logs and hope everything was good, but I wasn't altogether sure.

      Then I thought: how would I know if a master was down, or if services on a master were having issues? I can go to component health and see when individual services are unhealthy. But I think an "HA Status" light (green, yellow, red) visible on all pages (perhaps next to the DC/OS version in the lower left) would be better.

      Green: There is more than one healthy master (including the leader). In a single-master setup (e.g. a lab) this would never show, since there is never more than one master. If there are three masters and one is rebooting but the other two are healthy, this would still show green.

      Yellow: There is exactly one healthy master. A failure of this master could be detrimental to the cluster.

      Red: There are no active masters in a healthy state (i.e. some component has failed on the active master).
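
      The three levels above can be sketched as a small classification function. This is only an illustration of the proposed rule, not existing DC/OS code: the names `HAStatus` and `ha_status` are hypothetical, and it assumes the caller already knows how many masters are currently healthy.

```python
from enum import Enum


class HAStatus(Enum):
    """Hypothetical indicator levels, named after the proposal."""
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def ha_status(healthy_masters: int) -> HAStatus:
    """Map a count of healthy masters to an indicator level.

    Assumption: a master counts as healthy only if every component
    on it is healthy; how that is determined is out of scope here.
    """
    if healthy_masters > 1:
        return HAStatus.GREEN   # more than one healthy master
    if healthy_masters == 1:
        return HAStatus.YELLOW  # a single failure would take the cluster down
    return HAStatus.RED         # no healthy master at all
```

      With three masters and one rebooting, `ha_status(2)` still yields green; a single-master lab can never get past yellow.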

      This visual indicator should be clickable (think "drill down"), leading to a page like the System -> Components page, but instead of listing all the services, it lists all the masters and their state (healthy, unhealthy). From there, a click on a given master lists the components on that master.
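
      One possible shape for the data behind such a drill-down page is sketched below. Everything here is an assumption for illustration: the function name `master_states`, the master names, and the component names are made up, and a master is treated as healthy only when all of its components are.

```python
from typing import Dict


def master_states(components_by_master: Dict[str, Dict[str, bool]]) -> Dict[str, str]:
    """Collapse per-component health into one state per master.

    Input maps master name -> {component name -> is_healthy}.
    A master with no reported components is counted as healthy,
    because all() of an empty collection is True.
    """
    return {
        master: "healthy" if all(components.values()) else "unhealthy"
        for master, components in components_by_master.items()
    }


# Illustrative data only; names are placeholders.
cluster = {
    "master-1": {"component-a": True, "component-b": True},
    "master-2": {"component-a": True, "component-b": False},
}
states = master_states(cluster)
```

      The first level of the drill-down would render `states`; clicking a master would render its entry in `cluster`.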

      Basically, it lets a user easily see that something is amiss and get the information needed to correct it.

      The initial drill-down page would show a status message such as "All X nodes are healthy" (if green), "One or more nodes have unhealthy components" or "There is only one Master in this cluster" (if yellow), or "Danger: No operating Master nodes have all components healthy" (or something similar, if red).
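
      A minimal sketch of how such a message might be chosen follows. The function name `ha_message` is hypothetical, and the rule for picking between the messages is an assumption: the request does not pin down exactly when each one applies, so this version decides purely from the total and healthy master counts.

```python
def ha_message(total_masters: int, healthy_masters: int) -> str:
    """Pick a drill-down status message from master counts.

    The selection order below is an assumption, not part of the request.
    """
    if healthy_masters == 0:
        return "Danger: No operating Master nodes have all components healthy"
    if total_masters == 1:
        return "There is only one Master in this cluster"
    if healthy_masters == total_masters:
        return f"All {total_masters} nodes are healthy"
    return "One or more nodes have unhealthy components"
```

      Note that with three masters and two healthy, this sketch reports unhealthy components even though the indicator itself would still be green.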

      As an operator, one of the most important questions I have is "Is my HA status good?" Adding a simple indicator on all pages, with drill-down capabilities, makes this information always available.

      In addition, it allows for adding "events" for system HA status changes, which can be very important.
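
      As a rough illustration of what a status-change event could look like, the sketch below turns a series of sampled status values into a list of transitions. The function name `status_change_events` and the use of plain strings for statuses are assumptions; how and how often the status is sampled is left open.

```python
from typing import Iterable, List, Tuple


def status_change_events(samples: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (old, new) pairs for every change between consecutive samples."""
    events = []
    previous = None
    for status in samples:
        # Only emit an event when the status actually differs from the last one.
        if previous is not None and status != previous:
            events.append((previous, status))
        previous = status
    return events
```

      Each pair could then be logged or sent as an alert, so an operator learns of a drop from green to yellow without having to watch the page.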






            • Assignee:
              amr Amr Abdelrazik (Inactive)
              mandoskippy John Omernik
              Frontend Team
              Amr Abdelrazik (Inactive), Julian Gieseke
            • Watchers:
              2


              • Created: