We are running Marathon within DC/OS v1.8.4 on ACS, and it has failed twice in the same way, each time after running for a long period.
The symptom is that the UI fails to display the "Services" tab. When you hit the Marathon API manually, the non-leader masters proxy the request to the Marathon leader, which returns a 500 error.
The logs for the non-leader Marathon masters look normal and only indicate that a leader election happened previously. The logs on the leader (i.e. journalctl -ru dcos-marathon.service) are empty.
The most concerning part of this is that the Marathon health checks are reporting healthy when the service is actually down.
To recover, I had to manually restart the Marathon leader via systemctl.
At the very least, the Marathon health checks should be updated so that a 500 error from the API endpoints is not treated as healthy and instead forces a restart.
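
To illustrate the behavior I would expect, here is a rough Python sketch. It is not the actual DC/OS health check implementation; the function names probe and is_healthy are made up for this example, and the URL in the usage note is only an assumption about where Marathon listens on a master. The idea is that only a 2xx response from the API counts as healthy, while a 500 (or no response at all) counts as unhealthy.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived.

    Hypothetical helper for illustration only.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises 4xx/5xx responses as HTTPError; the code is still available.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar failures.
        return None


def is_healthy(status_code):
    """Treat only 2xx responses as healthy; a 500 or a missing response is unhealthy."""
    return status_code is not None and 200 <= status_code < 300
```

For example, is_healthy(probe("http://localhost:8080/v2/info")) would return False while the leader answers with a 500 (the host, port, and path here are assumptions, adjust them for your cluster). A check along these lines could then be used to decide whether to restart dcos-marathon.service.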