  1. Marathon
  2. MARATHON-4236

Marathon leaders committing suicide

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Medium
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      We are seeing an intermittent problem where our Marathon instances commit suicide when they lose their connection to ZooKeeper. We increased the ZooKeeper initLimit and multiplier in zoo.cfg, and the problem now happens much less frequently, but it still happens. This cluster runs in Azure on DC/OS 1.8.4. We aren't sure whether this is an underlying problem in Marathon similar to https://github.com/mesosphere/marathon/issues/4266 or a network connectivity issue within Azure.
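For context on what these settings control: ZooKeeper counts initLimit (and the related syncLimit) in ticks, so the real time window is the limit multiplied by tickTime, and by default the server clamps client session timeouts to between 2 and 20 ticks. The sketch below only illustrates that arithmetic. The function name and the numbers passed to it are hypothetical placeholders, not the values from this cluster's zoo.cfg.

```python
def zk_timing(tick_time_ms, init_limit, sync_limit):
    """Translate tick-based zoo.cfg settings into milliseconds.

    All arguments are illustrative; substitute the values from your own zoo.cfg.
    """
    return {
        # Time a follower has to connect to and sync with the leader at startup.
        "init_window_ms": init_limit * tick_time_ms,
        # How far a follower may fall behind the leader before it is dropped.
        "sync_window_ms": sync_limit * tick_time_ms,
        # Default bounds the server applies to a client's requested session timeout.
        "min_session_timeout_ms": 2 * tick_time_ms,
        "max_session_timeout_ms": 20 * tick_time_ms,
    }


if __name__ == "__main__":
    # Placeholder values chosen only to show the calculation.
    for name, value in zk_timing(tick_time_ms=2000, init_limit=10, sync_limit=5).items():
        print(f"{name}: {value}")
```

Note that initLimit governs quorum members talking to each other, while a client such as Marathon is governed by its session timeout, so raising initLimit alone may not change how quickly a client gives up.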

      First, the Exhibitor wiki at https://github.com/soabase/exhibitor/wiki says the following:

      Important Facts About Exhibitor

      Exhibitor manages your zoo.cfg and myid files. Do not edit these manually as Exhibitor will overwrite them.

      Am I to assume the same holds for DC/OS, and that we should not edit zoo.cfg by hand? We have in fact edited it.

      Here are some logs from a time when this happened.
      Marathon
      `Dec 02 17:34:50 dcos-master3 java[51265]: [2016-12-02 17:34:50,858] INFO Client session timed out, have not heard from server in 6669ms for sessionid 0x1587cf5535b000f, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn:ForkJoinPool-2-worker-13-SendThread(zk-1.zk:2181))
      ...
      Dec 02 17:34:49 dcos-master2 java[38457]: [2016-12-02 17:34:49,934] ERROR ZooKeeper access failed - Committing suicide to avoid invalidating ZooKeeper state (mesosphere.marathon.core.election.impl.CuratorElectionService:qtp817460653-42)
      `
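A note on the 6669ms figure in the Marathon log above: the ZooKeeper Java client gives up on a server once it has heard nothing for two thirds of the negotiated session timeout. The sketch below checks which session timeout would be consistent with the logged idle time. The 10000 ms candidate is an assumption for illustration, since the logs do not state the session timeout this Marathon instance was actually using.

```python
def client_read_timeout_ms(session_timeout_ms):
    """Idle time after which the ZooKeeper Java client closes the socket.

    Mirrors the client's integer arithmetic: two thirds of the session timeout.
    """
    return session_timeout_ms * 2 // 3


def is_consistent(idle_ms, session_timeout_ms):
    """True if an observed idle time is at or past the client's read timeout."""
    return idle_ms >= client_read_timeout_ms(session_timeout_ms)


if __name__ == "__main__":
    observed_idle_ms = 6669          # from the Marathon log line above
    assumed_session_timeout = 10000  # assumption, not taken from the logs
    print(client_read_timeout_ms(assumed_session_timeout))  # 6666
    print(is_consistent(observed_idle_ms, assumed_session_timeout))
```

If that assumption holds, a stall of under seven seconds between Marathon and ZooKeeper is enough to trigger the disconnect, which a brief network interruption could plausibly cause.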
      Mesos Master
      `Dec 02 17:34:51 dcos-master3 mesos-master[52747]: I1202 17:34:51.468739 52757 group.cpp:460] Lost connection to ZooKeeper, attempting to reconnect ...
      `
      All three Exhibitors shut down ZooKeeper at that time:
      `
      Dec 02 17:34:55 dcos-master3 java[48715]: Shutting down
      ...
      Dec 02 17:34:55 dcos-master1 java[1511]: Shutting down
      ...
      Dec 02 17:34:56 dcos-master2 java[62453]: shutting down
      `
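Because the excerpts above come from three hosts and are not listed in time order, it can help to merge them into a single timeline. The following is a minimal sketch that parses the syslog prefix seen in these excerpts and sorts the events. The messages are shortened copies of the lines quoted above, and the helper names are hypothetical.

```python
import re
from datetime import datetime

# Syslog prefix as it appears in the excerpts: "Dec 02 17:34:50 host proc[pid]: message"
PREFIX = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) (\S+) ([^\[\s]+)\[(\d+)\]: (.*)$")

# Shortened copies of the log lines quoted in this report.
EXCERPTS = [
    "Dec 02 17:34:50 dcos-master3 java[51265]: Client session timed out, have not heard from server in 6669ms",
    "Dec 02 17:34:49 dcos-master2 java[38457]: ZooKeeper access failed - Committing suicide",
    "Dec 02 17:34:51 dcos-master3 mesos-master[52747]: Lost connection to ZooKeeper, attempting to reconnect ...",
    "Dec 02 17:34:55 dcos-master3 java[48715]: Shutting down",
    "Dec 02 17:34:55 dcos-master1 java[1511]: Shutting down",
    "Dec 02 17:34:56 dcos-master2 java[62453]: shutting down",
]


def parse(line, year=2016):
    """Return (timestamp, host, process, message), or None if the line does not match."""
    match = PREFIX.match(line.strip())
    if match is None:
        return None
    stamp, host, proc, _pid, message = match.groups()
    # Syslog omits the year, so it is supplied separately.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, proc, message


def timeline(lines):
    """Parse all matching lines and order them by timestamp (ties keep input order)."""
    events = [event for event in map(parse, lines) if event is not None]
    return sorted(events, key=lambda event: event[0])


if __name__ == "__main__":
    for when, host, proc, message in timeline(EXCERPTS):
        print(when.time(), host, proc, message)
```

Host clocks may differ by a second or more, so the ordering of events that are close together should be read with some caution.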

      We do have a log bundle, but we are not sure that posting it publicly here is appropriate. I'd be more than happy to provide it privately to anyone at Mesosphere who is interested.

            People

            • Assignee:
              Unassigned
              Reporter:
              GitHub_teskew-rms Tate Eskew (Inactive)
              Team:
              Orchestration Team
              Watchers:
              Jason Gilanfarr (Inactive)
