Marathon / MARATHON-7390

Marathon Unable to Connect to Zookeeper - No Retries

    Details

    • Story Points:
      2
    • Build artifact:
      Marathon-v1.7.174

      Description

      I'm running DCOS 1.9 on Vagrant 1.9.4 (DCOS-Vagrant) with a single master and 2 nodes. The issue is reproducible, and I debugged it with Karl Isenberg on Slack. Everything comes up on the nodes. On the master, everything is running except dcos-adminrouter. Tracing from there, I found that its pre-start ping of marathon.mesos was failing:

      Process: 7792 ExecStartPre=/bin/ping -c1 marathon.mesos (code=exited, status=2)
      [vagrant@m1 ~]$ ping -c 1 marathon.mesos
      ping: marathon.mesos: Name or service not known

      So Karl and I looked at the Marathon service logs. I've attached the journal from dcos-marathon at that point. As the log shows, the service cannot resolve any of the ZooKeeper hostnames, and each failed attempt produces a stack dump. The problem is that the service stays in active status and never detects the failure. I suggest adding an additional ExecStartPre command to the service:

      https://github.com/dcos/dcos/blob/master/packages/marathon/build

      ExecStartPre=/bin/ping -c 1 zk1.zk

       

      This would have prevented the service from getting stuck. An additional reload.service and/or reload.timer, like the ones dcos-adminrouter has, might also make recovery easier.
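
A single ping fails immediately if DNS is not up yet, so a pre-start check that retries may be more forgiving. The following is only a rough sketch of that idea, not what the DC/OS package actually ships: the function name `wait_for_host`, the attempt count, and the delay are all hypothetical, and it assumes `getent` is available on the master.

```shell
#!/bin/sh
# Hypothetical pre-start helper: retry name resolution before giving up.
# Usage: wait_for_host HOST [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as HOST resolves, 1 if every attempt fails.
wait_for_host() {
    host=$1
    attempts=${2:-5}   # assumed default, tune as needed
    delay=${3:-2}      # assumed default, seconds between attempts
    i=1
    while [ "$i" -le "$attempts" ]; do
        # getent goes through the system resolver, as ping does.
        if getent hosts "$host" >/dev/null 2>&1; then
            return 0
        fi
        echo "attempt $i/$attempts: cannot resolve $host" >&2
        i=$((i + 1))
        # Sleep only if another attempt is still to come.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example call (commented out so the file can be sourced safely):
# wait_for_host zk1.zk 10 3 || exit 1
```

Wrapped in a small script and referenced from an ExecStartPre line, a non-zero exit would make the unit fail to start instead of leaving it active but unable to reach ZooKeeper. Whether systemd then retries depends on the unit's restart settings, which are not shown in this ticket.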

      I've also included the spartan and exhibitor service logs to add some context to what was occurring at the time.

              People

              • Assignee:
                kjeschkies Karsten Jeschkies
                Reporter:
                mastermindg mastermindg
                Team:
                Orchestration Team
                Watchers:
                Chun-Hung Hsiao, Ioannis Charalampidis, Jan-Philip Gehrcke, Karsten Jeschkies, Ken Sipe, Marco Monaco, mastermindg, Matthias Eichstedt, Mergebot

                Dates

                • Created:
                  Updated:
                  Resolved: