Details

    • Type: Bug
    • Status: Resolved
    • Priority: High
    • Resolution: Won't Do
    • Affects Version/s: Marathon 1.3.7
    • Fix Version/s: None
    • Component/s: Leader Election
    • Labels:

      Description

      We lost the Marathon servers in our production environment. Fortunately, applications kept running, but manual intervention was required to return Marathon to a functional state (we killed the zombie master).

      5-node HA (call them M0, M1, M2, M3, M4).

      Mesos 1.0.1.

      Marathon 1.3.7.

      ZK 3.4.8

      ZK was experiencing severe lag (which we are working on), and ZooKeeper connections were timing out. I can't tell from the logs what caused what. The Marathon leader was "asked to stop the driver" at 18:53:48. Mesos master M3 lost its leadership at 18:53:53 (new Mesos master: M0); there doesn't appear to be clock skew, so this appears to have happened afterwards. Marathon held a leader election, and M4 won (it had also been the leader before 18:53). It did not successfully re-register with Mesos, so Mesos reported no frameworks running until we killed the M4 Marathon leader. Mesos master M0 lost leadership at 18:54:18, and M3 became the new Mesos master. In summary: M3 was Mesos master, lost leadership for ~30 seconds, and regained it. M4 was Marathon leader, lost leadership at ~18:53:48, regained it, but never successfully re-registered with Mesos.

      Selected log entries:

       

      Sep 22 18:53:48 master4.prod.tfc.thermofisher.net marathon[26735]: I0922 18:53:48.266819  6173 sched.cpp:330] New master detected at master@172.29.32.4:5050

      Sep 22 18:53:48 master4.prod.tfc.thermofisher.net marathon[26734]: [2017-09-22 18:53:48,430] INFO Driver future completed with result=Success(()). (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$8afd4609:ForkJoinPool-2-worker-9)

       

      Sep 22 18:54:18 master4.prod.tfc.thermofisher.net marathon[26735]: I0922 18:54:18.011649  6171 detector.cpp:152] Detected a new leader: None

       

      Sep 22 18:54:27 master4.prod.tfc.thermofisher.net marathon[26735]: I0922 18:54:27.502760 6171 detector.cpp:152] Detected a new leader: (id='23')
      Sep 22 18:54:27 master4.prod.tfc.thermofisher.net marathon[26735]: I0922 18:54:27.502833 6171 group.cpp:706] Trying to get '/mesos/json.info_0000000023' in ZooKeeper
      Sep 22 18:54:27 master4.prod.tfc.thermofisher.net marathon[26735]: I0922 18:54:27.504118 6171 zookeeper.cpp:259] A new leading master (UPID=master@172.29.32.196:5050) is detected
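      To help line up events like the ones quoted above, here is a minimal sketch that parses the glog-style entries and measures how long the detector reported no leader. The regular expression is an assumption based only on the prefix format visible in these lines, the year 2017 is taken from the Marathon log line at 18:53:48, and the names `parse` and `leaderless_gap` are hypothetical helpers, not part of Marathon or Mesos.

```python
import re
from datetime import datetime

# Two of the log lines quoted in this report, copied verbatim.
LOG_LINES = [
    "Sep 22 18:54:18 master4.prod.tfc.thermofisher.net marathon[26735]: "
    "I0922 18:54:18.011649  6171 detector.cpp:152] Detected a new leader: None",
    "Sep 22 18:54:27 master4.prod.tfc.thermofisher.net marathon[26735]: "
    "I0922 18:54:27.502760 6171 detector.cpp:152] Detected a new leader: (id='23')",
]

# Assumed prefix layout: severity letter, MMDD, time with microseconds,
# thread id, source file and line, then the message.
GLOG = re.compile(
    r"(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)


def parse(line, year=2017):
    """Return (timestamp, source, message), or None if the line is not glog-style."""
    m = GLOG.search(line)
    if m is None:
        return None
    ts = datetime.strptime(f"{year}{m['mmdd']} {m['time']}", "%Y%m%d %H:%M:%S.%f")
    return ts, m["src"], m["msg"]


def leaderless_gap(lines):
    """Seconds between 'Detected a new leader: None' and the next detected leader."""
    lost_at = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue  # skip lines in other formats, e.g. Marathon's own log output
        ts, _src, msg = parsed
        if not msg.startswith("Detected a new leader:"):
            continue
        if msg.endswith("None"):
            lost_at = ts
        elif lost_at is not None:
            return (ts - lost_at).total_seconds()
    return None


if __name__ == "__main__":
    print(leaderless_gap(LOG_LINES))
```

      Run against the two detector lines above, this reports a gap of roughly 9.5 seconds between losing the leader and detecting the new one on master4. Feeding it the full logs from all five masters would give a per-host timeline to compare.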

       

            People

            • Assignee: Tim Harper (tharper)
            • Reporter: Jamie Briant (jamiebriant)
            • Team: Orchestration Team
            • Watchers: Ivan Chernetsky, Jamie Briant, Matthias Eichstedt, Tim Harper
