• Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Leader Election
    • Labels:


      At Yelp we still have one large Marathon cluster running 1.1.1. Last week we saw some very strange behaviour after a leadership election triggered by a restart of our ZooKeeper cluster.

      I've been reading the logs and I'll post some snippets below that back up what happened.

      • Both Mesos and Marathon lose connectivity to ZooKeeper:
        W0804 07:15:49.536195 48531 group.cpp:494] Timed out waiting to connect to ZooKeeper. Forcing ZooKeeper session (sessionId=65b0199ccf2be6a) expiration
        2017-08-04 07:15:39,606] WARN ZooKeeper connection has been dropped. Abdicate Leadership: WatchedEvent state:Disconnected type:None path:null (
      • Both pick a new leader:
        I0804 07:15:56.043835 42424 master.cpp:2017] Elected as the leading master!
        [2017-08-04 07:16:31,139] INFO Elect leadership (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dcfbfa0a:main-EventThread)
      • The new Marathon leader loads all the tasks and starts up relatively normally:
        [2017-08-04 07:16:31,126] INFO TaskTrackerActor is starting. Task loading initiated. (
        [2017-08-04 07:16:31,192] INFO About to load 5298 tasks (mesosphere.marathon.core.task.tracker.impl.TaskLoaderImpl:ForkJoinPool-2-worker-2011)
      • The new leader has some trouble authenticating with Mesos and commits suicide:
        [2017-08-04 07:17:05,697] WARN Error: Framework at scheduler-e60e5959-0ba5-442d-9e51-bd4e850f0de9@ is not authenticated
        [2017-08-04 07:17:05,697] ERROR Committing suicide! (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$a504fd7e:Thread-17457577)
      • A second new leader starts up, but for some reason it never seems to start the TaskTracker:
        [2017-08-04 07:17:05,766] INFO Call onElected leadership callbacks on EntityStoreCache(MarathonStore(app:)), EntityStoreCache(MarathonStore(group:)), EntityStoreCache(MarathonStore(deployment:)), EntityStoreCache(MarathonStore(framework:)), EntityStoreCache(MarathonStore(taskFailure:)), EntityStoreCache(MarathonStore(events:)) (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dcfbfa0a:main-EventThread)
        [2017-08-04 07:17:10,908] INFO Elect leadership (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dcfbfa0a:main-EventThread)
        [2017-08-04 07:17:10,908] INFO Running driver (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dcfbfa0a:main-EventThread)
        [2017-08-04 07:17:10,912] INFO Starting scheduler actor (
        [2017-08-04 07:17:10,895] INFO Finished onElected leadership callbacks (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dcfbfa0a:main-EventThread)
        I0804 07:17:10.913952 124413 sched.cpp:396] Authenticating with master master@hostname:5050
        [2017-08-04 07:17:10,917] INFO Recovering deployment:
        [2017-08-04 07:17:10,921] INFO Scheduler actor ready (
        I0804 07:17:10.922133 124413 authenticatee.cpp:299] Authentication success
        [2017-08-04 07:17:10,924] INFO Deployment of / successful (
        [2017-08-04 07:17:10,934] INFO Reset offerLeadership backoff (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$dcfbfa0a:main-EventThread)
      • This turns out to be very bad, because with no tasks loaded Marathon now continues with its usual startup actions. Firstly, it decides to launch everything again:
        [2017-08-04 07:17:12,078] INFO addAllFor [/my-very-important-app] version [2017-08-04T14:14:25.153Z] ($$EnhancerByGuice$$ccbd67b8:ForkJoinPool-2-worker-879)
        INFO reconcile [/my-very-important-app] with latest version [2017-08-04T14:14:25.153Z] ($$EnhancerByGuice$$ccbd67b8:ForkJoinPool-2-worker-879)
        [2017-08-04 07:17:26,026] INFO Need to scale /my-very-important-app from 0 up to 300 instances (mesosphere.marathon.SchedulerActions:ForkJoinPool-2-worker-491)
        [2017-08-04 07:17:26,026] INFO Queueing 300 new tasks for /my-very-important-app (0 queued or running) (mesosphere.marathon.SchedulerActions:ForkJoinPool-2-worker-491)
      • Even worse, status updates now flood in for the existing tasks; Marathon doesn't recognise them, so it decides to kill them:
        [2017-08-04 07:17:32,188] WARN Kill unknown task [my-very-important-app.72bc5453-790d-11e7-9fb9-02421e812e79] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$50a6e7a6:ForkJoinPool-2-worker-3077)
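      To make the failure mode above concrete, here is a hypothetical sketch (my own names, not Marathon's actual code) of the two decisions the logs show. Both go wrong for the same reason: the TaskTracker never loaded, so the set of known tasks is empty.

```python
# Illustrative sketch only -- these function names are mine, not Marathon's.

def tasks_to_queue(target_instances, known_tasks):
    # Scale decision: queue (target - known) new instances.
    return max(0, target_instances - len(known_tasks))

def should_kill(task_id, known_tasks):
    # A status update for a task the tracker doesn't know triggers a kill.
    return task_id not in known_tasks

# With the TaskTracker never having loaded, the known-task set is empty:
known = set()
print(tasks_to_queue(300, known))   # 300 -> "Need to scale ... from 0 up to 300"
print(should_kill("my-very-important-app.72bc5453", known))   # True -> "Kill unknown task"
```

      In other words, an empty tracker simultaneously makes every app look under-scaled and every running task look unknown, which matches both log lines above.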

      This is obviously pretty bad.

      Our plan is to upgrade to 1.4.6 when it is released. In the meantime it would be great to confirm what happened here. I've been poking around the leadership election code, but I can't see why the TaskTracker wouldn't start up; I'm really no expert, though. I have a hunch you may have already fixed this in a later release, but given how bad the symptoms are it would be good to confirm that this bug no longer exists.




            • Assignee: Ivan Chernetsky
            • Reporter: mattmb
            • Team: Orchestration Team
            • Watchers: 5 (Ivan Chernetsky, Marcus Alvarez (Inactive), Matthias Eichstedt, mattmb, Rob Johnson)

