Marathon / MARATHON-7554

Marathon Shows Healthy, but API only returns 500 errors. Dashboard dead

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Unknown
    • Labels:

      Description

      We are running Marathon within DC/OS v1.8.4 on ACS, and it has failed twice in the same way after running for a long time.

      The symptom is that the UI fails to display the "Services" tab. When you hit the Marathon API manually, the non-leader masters proxy the request to the Marathon leader, which returns a 500 error.
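
As a rough illustration of the manual check described above, the sketch below fetches one Marathon endpoint and reports the HTTP status it gets back. It is only a sketch under assumptions: the function name fetch_status and the injectable opener parameter are hypothetical, and the URL you pass (for example a master's address followed by /v2/info) depends on how your cluster exposes Marathon.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen, timeout=5.0):
    """Return the HTTP status code for url, or None if no HTTP response arrived.

    `opener` is a hypothetical injection point so the function can be
    exercised without a live cluster; it defaults to urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still
        # exactly what we want to report (e.g. 500 in this bug).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: no HTTP response at all.
        return None
```

Calling fetch_status against each master in turn would show whether every node returns 500 or only the path through the leader does.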

      The logs on the masters that are not the Marathon leader look normal and show only that a leader election happened earlier. The Marathon logs on the leader (i.e. journalctl -ru dcos-marathon.service) are empty.

      The most concerning part of this is that the Marathon health checks are reporting healthy when the service is actually down.

      To recover, I had to manually restart the Marathon leader via systemctl.

      At the very least, the Marathon health checks should be updated so that a 500 error from the API endpoints is not treated as healthy and instead forces a restart.
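
To make the requested behaviour concrete, here is a minimal sketch of a check that treats a 500 from the API as unhealthy and restarts the service after repeated failures. This is not how the DC/OS health checks are implemented. The function names, the "2xx only" rule and the threshold of three consecutive failures are all assumptions chosen for illustration; only the unit name dcos-marathon.service comes from this report.

```python
import subprocess

# Unit name as it appears in the journalctl command in this report.
UNIT = "dcos-marathon.service"


def is_healthy(status):
    """Assumed rule: only a 2xx response counts as healthy.

    A 5xx code, or None (no HTTP response at all), is unhealthy.
    """
    return status is not None and 200 <= status < 300


def should_restart(recent_statuses, threshold=3):
    """True when the last `threshold` probes were all unhealthy.

    `threshold` is a hypothetical tuning knob that avoids restarting
    on a single transient error.
    """
    if len(recent_statuses) < threshold:
        return False
    return not any(is_healthy(s) for s in recent_statuses[-threshold:])


def restart_unit(unit=UNIT, run=subprocess.run):
    """Restart the systemd unit; `run` is injectable so this can be tested."""
    return run(["systemctl", "restart", unit], check=True)
```

A periodic job could append each probe's status code to a list and call restart_unit() once should_restart() returns True, which automates the manual systemctl restart described above.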

            People

            • Assignee: Unassigned
            • Reporter: Jeffrey Zampieron (jzampieron@zproject.net)
            • Team: Orchestration Team
            • Watchers (4): Jeffrey Zampieron, Karsten Jeschkies, Ken Sipe, Matthias Eichstedt

              Dates

              • Created:
                Updated:
                Resolved: