MARATHON-7567

Marathon leader randomly fails when discovering new Mesos master

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Won't Do
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Leader Election
    • Labels: None

      Description

      We're currently running a test setup consisting of three Mesos master nodes with three Marathon instances on top. We found that in some cases, when the current Mesos master goes down and loses leadership, the current Marathon leader also exits after detecting the new Mesos master. According to the logs, this is not an abrupt crash but an orderly shutdown: the scheduler driver is stopped, Marathon abdicates leadership, and the process terminates.
      At first, our theory was that this behavior only occurs when the leading Mesos master and the leading Marathon instance run on the same physical machine at that time. However, we could not verify this assumption. Moreover, it doesn't happen on every Mesos leader election; sometimes the failover works without any difficulty.

      Our setup looks as follows:

      • Three VMs, each running a Mesos master, agent and Marathon node
      • Mesos version: 1.1.1
      • Marathon version: 1.4.2  

      It may also be a point of interest that our setup runs entirely in Docker containers using the host network instead of bridged mode.

      Here's the part of the logs that shows what's going on right before the Marathon leader exits:

       

      marathon | I0628 12:14:36.033452 83 detector.cpp:152] Detected a new leader: (id='13')
      marathon | I0628 12:14:36.033769 85 group.cpp:697] Trying to get '/mesos/json.info_0000000013' in ZooKeeper
      marathon | I0628 12:14:36.037595 85 zookeeper.cpp:259] A new leading master (UPID=master@9.152.171.131:5050) is detected
      marathon | [2017-06-28 12:14:36,040] WARN Disconnected (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$e7b980c9:Thread-82)
      marathon | I0628 12:14:36.042668 85 sched.cpp:1995] Asked to stop the driver
      marathon | I0628 12:14:36.042914 85 sched.cpp:330] New master detected at master@9.152.171.131:5050
      marathon | I0628 12:14:36.043254 85 sched.cpp:341] No credentials provided. Attempting to register without authentication
      marathon | I0628 12:14:36.043442 85 sched.cpp:1187] Stopping framework 990b8a6d-3f6f-454c-a062-604eaec87746-0000
      marathon | [2017-06-28 12:14:36,046] INFO Driver future completed with result=Success(()). (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dc2305a5:ForkJoinPool-2-worker-3)
      marathon | [2017-06-28 12:14:36,047] INFO Abdicating leadership while leading (reoffer=true) (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-2-worker-3)
      marathon | [2017-06-28 12:14:36,051] INFO Leader defeated. New leader: - (mesosphere.marathon.core.election.impl.CuratorElectionService:pool-4-thread-1)
      marathon | [2017-06-28 12:14:36,052] INFO Call postDriverRuns callbacks on (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dc2305a5:ForkJoinPool-2-worker-3)
      marathon | [2017-06-28 12:14:36,052] INFO Finished postDriverRuns callbacks (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dc2305a5:ForkJoinPool-2-worker-3)
      marathon | [2017-06-28 12:14:36,056] INFO Deleting existing tombstone for old twitter commons leader election (mesosphere.marathon.core.election.impl.CuratorElectionService:pool-4-thread-1)
      marathon | [2017-06-28 12:14:36,062] INFO Lost leadership (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dc2305a5:pool-4-thread-1)
      marathon | [2017-06-28 12:14:36,067] ERROR Terminating after loss of leadership
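
To put a number on how quickly this happens, here is a minimal sketch that parses the two timestamp formats in the excerpt above and measures the time between the first and last line. It is not part of Marathon or Mesos; `parse_timestamp` and the two regex names are hypothetical helpers, and the year for the libmesos (glog-style) lines is an assumption taken from the Marathon lines, since glog prefixes carry only month and day.

```python
import re
from datetime import datetime

# glog-style prefix on the libmesos lines, e.g. "I0628 12:14:36.033452"
GLOG = re.compile(r"[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})")
# Bracketed prefix on Marathon's own lines, e.g. "[2017-06-28 12:14:36,040]"
JVM = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\]")


def parse_timestamp(line, year=2017):
    """Return the timestamp of a log line, or None if no known prefix is found.

    `year` is an assumption: glog lines do not include one.
    """
    m = JVM.search(line)
    if m:
        return datetime.strptime(f"{m.group(1)}.{m.group(2)}", "%Y-%m-%d %H:%M:%S.%f")
    m = GLOG.search(line)
    if m:
        return datetime.strptime(f"{year}{m.group(1)} {m.group(2)}", "%Y%m%d %H:%M:%S.%f")
    return None


first = "marathon | I0628 12:14:36.033452 83 detector.cpp:152] Detected a new leader: (id='13')"
last = "marathon | [2017-06-28 12:14:36,067] ERROR Terminating after loss of leadership"

elapsed = parse_timestamp(last) - parse_timestamp(first)
print(elapsed)  # 0:00:00.033548
```

For this excerpt, the whole sequence from detecting the new Mesos leader to Marathon terminating takes roughly 34 ms.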

            People

            • Assignee: Karsten Jeschkies (kjeschkies)
            • Reporter: PaddySmalls (paddysmalls)
            • Team: Orchestration Team
            • Watchers: daltonmatos, PaddySmalls, Tim Harper
