I'm running DCOS 1.9 on Vagrant 1.9.4 (DCOS-Vagrant), and the issue is reproducible. I worked through it with Karl Isenberg on Slack. The cluster has a single master and 2 nodes. Everything comes up on all of the nodes. On the master, everything is running except dcos-adminrouter. Tracing from there, I found that marathon.mesos was failing ping:
Karl and I then looked at the Marathon service logs; I've attached the dcos-marathon journal from that point. As the log shows, the service is unable to resolve any of the ZooKeeper hostnames, and each failure produces a stack trace. The real problem is that the service stays in active status and never notices that anything is wrong. I suggest adding an additional pre-start (ExecStartPre) command to the service:
This would have prevented the service from getting stuck. An additional reload.service and/or reload.timer unit, like the ones dcos-adminrouter has, might also make recovery easier.
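To illustrate the kind of pre-start check I have in mind, here is a minimal sketch in Python. It is not the exact command proposed above and is not taken from DCOS itself; the function names, the usage string, and the idea of passing the ZooKeeper hostnames as arguments are assumptions made for the example. It reports which of the given hostnames fail to resolve and returns a non-zero status if any do.

```python
import socket
import sys


def unresolved_hosts(hosts, resolve=socket.getaddrinfo):
    """Return the hostnames from `hosts` that cannot be resolved.

    `resolve` is injectable so the check can be exercised without real DNS.
    """
    failed = []
    for host in hosts:
        try:
            # Port is irrelevant here; we only care whether the name resolves.
            resolve(host, None)
        except socket.gaierror:
            failed.append(host)
    return failed


def main(argv):
    """Hypothetical entry point: argv[1:] are the hostnames to check."""
    hosts = argv[1:]
    if not hosts:
        print("usage: check_dns.py HOST [HOST ...]", file=sys.stderr)
        return 2
    failed = unresolved_hosts(hosts)
    for host in failed:
        print("cannot resolve %s" % host, file=sys.stderr)
    # Non-zero status signals the caller (e.g. systemd) that the check failed.
    return 1 if failed else 0
```

If this were wrapped in a script that exits with `main(sys.argv)` and run as a pre-start command, a resolution failure would make the service start fail visibly instead of leaving the unit in active status while it keeps throwing stack traces.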
I've also included the spartan and exhibitor service logs to add some context to what was occurring at the time.