Marathon / MARATHON-7847

Marathon leader re-election killing running tasks during reconciliation

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: High
    • Resolution: Cannot Reproduce
    • Affects Version/s: Marathon 1.1.7
    • Fix Version/s: None
    • Component/s: Leader Election
    • Labels:

      Description

      We have a 3-node Marathon cluster in which the leader hung. Below are the kinds of exceptions we got in the logs.

       

      [2017-10-18 04:09:11,078] INFO 10.14.23.221 - - [17/Oct/2017:22:39:05 +0000] "GET //mesos-master2.prod-foo-dcos.foo.net:8080/v2/apps?embed=apps.tasks HTTP/1.1" 200 4184888 "-" "python-requests/2.18.4" 564 (mesosphere.chaos.http.ChaosRequestLog$$EnhancerByGuice$$e0425440:qtp314309716-2721706)
      [2017-10-18 04:09:11,106] WARN mapping exception to status code (mesosphere.marathon.api.MarathonExceptionMapper$$EnhancerByGuice$$66682cd2:qtp314309716-2750766)
      java.lang.RuntimeException: while assembling rich tasks for app [/orc/prod-foo/training-api]
      at mesosphere.marathon.core.appinfo.impl.AppInfoBaseData$AppData$$anonfun$enrichedTasksFuture$1.applyOrElse(AppInfoBaseData.scala:153) ~[marathon:1.1.7]
      at mesosphere.marathon.core.appinfo.impl.AppInfoBaseData$AppData$$anonfun$enrichedTasksFuture$1.applyOrElse(AppInfoBaseData.scala:152) ~[marathon:1.1.7]
      at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) ~[marathon:1.1.7]
      at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216) ~[marathon:1.1.7]
      at scala.util.Try$.apply(Try.scala:192) ~[marathon:1.1.7]
      at scala.util.Failure.recover(Try.scala:216) ~[marathon:1.1.7]
      at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) ~[marathon:1.1.7]
      at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) ~[marathon:1.1.7]
      at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) ~[marathon:1.1.7]
      at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~[marathon:1.1.7]
      Caused by: java.lang.IllegalStateException: not currently active (Actor[akka://marathon/user/taskTracker#-2000223842])
      at mesosphere.marathon.core.leadership.impl.WhenLeaderActor$$anonfun$1.applyOrElse(WhenLeaderActor.scala:38) ~[marathon:1.1.7]
      at akka.actor.Actor$class.aroundReceive(Actor.scala:467) ~[marathon:1.1.7]
      at mesosphere.marathon.core.leadership.impl.WhenLeaderActor.aroundReceive(WhenLeaderActor.scala:20) ~[marathon:1.1.7]
      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) ~[marathon:1.1.7]
      at akka.actor.ActorCell.invoke(ActorCell.scala:487) ~[marathon:1.1.7]
      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) ~[marathon:1.1.7]
      at akka.dispatch.Mailbox.run(Mailbox.scala:220) ~[marathon:1.1.7]
      at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[marathon:1.1.7]
      ... 2 common frames omitted
      [2017-10-18 04:09:11,106] WARN mapping exception to status code (mesosphere.marathon.api.MarathonExceptionMapper$$EnhancerByGuice$$66682cd2:qtp314309716-2801473)
      java.lang.RuntimeException: while assembling rich tasks for app [/orc/prod-foo/strchev3]
      at mesosphere.marathon.core.appinfo.impl.AppInfoBaseData$AppData$$anonfun$enrichedTasksFuture$1.applyOrElse(AppInfoBaseData.scala:153) ~[marathon:1.1.7]
      at mesosphere.marathon.core.appinfo.impl.AppInfoBaseData$AppData$$anonfun$enrichedTasksFuture$1.applyOrElse(AppInfoBaseData.scala:152) ~[marathon:1.1.7]
      at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) ~[marathon:1.1.7]
      at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216) ~[marathon:1.1.7]
      at scala.util.Try$.apply(Try.scala:192) ~[marathon:1.1.7]
      at scala.util.Failure.recover(Try.scala:216) ~[marathon:1.1.7]
      at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) ~[marathon:1.1.7]
      at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) ~[marathon:1.1.7]
      at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) ~[marathon:1.1.7]
      at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~[marathon:1.1.7]
      Caused by: java.lang.IllegalStateException: not currently active (Actor[akka://marathon/user/taskTracker#-2000223842])
      at mesosphere.marathon.core.leadership.impl.WhenLeaderActor$$anonfun$1.applyOrElse(WhenLeaderActor.scala:38) ~[marathon:1.1.7]
      at akka.actor.Actor$class.aroundReceive(Actor.scala:467) ~[marathon:1.1.7]
      at mesosphere.marathon.core.leadership.impl.WhenLeaderActor.aroundReceive(WhenLeaderActor.scala:20) ~[marathon:1.1.7]
      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) ~[marathon:1.1.7]
      at akka.actor.ActorCell.invoke(ActorCell.scala:487) ~[marathon:1.1.7]
      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) ~[marathon:1.1.7]
      at akka.dispatch.Mailbox.run(Mailbox.scala:220) ~[marathon:1.1.7]
      at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[marathon:1.1.7]
      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[marathon:1.1.7]
      ... 2 common frames omitted
      [2017-10-18 04:09:11,106] INFO 10.14.23.221 - - [17/Oct/2017:22:39:10 +0000] "GET //mesos-master2.prod-foo-dcos.foo.net:8080/v2/apps?embed=apps.tasks HTTP/1.1" 500 74 "-" "python-requests/2.18.4" 44 (mesosphere.chaos.http.ChaosRequestLog$$EnhancerByGuice$$e0425440:qtp314309716-2801473)
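      The "not currently active" failure comes from Marathon's WhenLeaderActor wrapper, which rejects messages to leadership-scoped actors (here the task tracker) while the instance does not consider itself leader. A simplified sketch of that pattern, using illustrative names rather than the actual Marathon 1.1.7 source:

```scala
import akka.actor.{Actor, ActorRef, Props, Status}

// Simplified sketch of an "only answer while leader" wrapper.
// All names here are illustrative, not the actual Marathon source.
class WhenLeaderSketch(childProps: Props) extends Actor {
  private var child: Option[ActorRef] = None

  def receive: Receive = suspended

  // While suspended (not leader), every request is answered with a failure,
  // which surfaces to API callers as "not currently active (...)".
  private def suspended: Receive = {
    case "start" =>
      child = Some(context.actorOf(childProps))
      context.become(active)
    case _ =>
      sender() ! Status.Failure(new IllegalStateException(s"not currently active ($self)"))
  }

  // While active (leader), requests are forwarded to the wrapped actor.
  private def active: Receive = {
    case "stop" =>
      child.foreach(context.stop)
      child = None
      context.become(suspended)
    case msg =>
      child.foreach(_ forward msg)
  }
}
```

      If the node had already given up leadership internally but no other node took over, every /v2/apps?embed=apps.tasks call would keep hitting this failure, which matches the 500 responses above.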

       

      During this time, no new leader was elected, and the logs above were followed by the "Are we leader?" messages:

       

      [2017-10-18 04:09:14,449] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(mesos-master2.prod-foo-bar.foo.net:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp314309716-2801472)
      [2017-10-18 04:09:14,449] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(mesos-master2.prod-foo-bar.foo.net:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp314309716-2776820)
      [2017-10-18 04:09:14,460] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(mesos-master2.prod-foo-bar.foo.net:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp314309716-2776594)
      [2017-10-18 04:09:14,460] INFO Do not proxy to myself. Waiting for consistent leadership state. Are we leader?: false, leader: Some(mesos-master2.prod-foo-bar.foo.net:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp314309716-2776814)
      [2017-10-18 04:09:14,463] INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(mesos-master2.prod-foo-bar.foo.net:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp314309716-2776814)
      [2017-10-18 04:09:14,463] INFO Do not proxy to myself. Waiting for consistent leadership state. Are we leader?: false, leader: Some(mesos-master2.prod-foo-bar.foo.net:8080) (mesosphere.marathon.api.LeaderProxyFilter$:qtp314309716-2801466) 
      

      The other 2 nodes were still proxying requests to the unresponsive master.
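      For context, the LeaderProxyFilter on a non-leader node roughly decides as follows: handle the request locally if this node is the leader, proxy it to whoever the election state names as leader otherwise, and log "Waiting for consistent leadership state" while the two views disagree. A rough sketch of that decision, with illustrative names (not the Marathon source):

```scala
// Rough sketch of the leader-proxy decision on a non-leader node.
// All names here are illustrative, not the actual Marathon source.
object LeaderProxySketch {
  final case class ElectionView(iAmLeader: Boolean, myHostPort: String, currentLeader: Option[String])

  sealed trait Routing
  case object HandleLocally                extends Routing
  final case class ProxyTo(leader: String) extends Routing
  case object WaitForConsistentState       extends Routing

  def route(view: ElectionView): Routing =
    if (view.iAmLeader) HandleLocally
    else view.currentLeader match {
      // Non-leaders keep proxying here even while the leader is hung,
      // because the election state still points at it.
      case Some(leader) if leader != view.myHostPort => ProxyTo(leader)
      // "Do not proxy to myself. Waiting for consistent leadership state."
      case _ => WaitForConsistentState
    }

  def main(args: Array[String]): Unit =
    println(route(ElectionView(iAmLeader = false, "mesos-master1:8080", Some("mesos-master2:8080"))))
}
```

      This would explain why the other two nodes kept sending traffic to the hung node: as long as the election state still named it leader, there was nowhere else to route.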

       

      At this point we restarted the problematic node, which triggered a leader election.

      What we noticed is that the new leader started killing some of the running tasks. The logs show:

       

      [2017-10-18 04:27:34,863] WARN Kill unknown task [orc_prod-foo-java.4dbf3cd1-b247-11e7-89e0-0242a7c90d38] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$f6faf9ae:ForkJoinPool-2-worker-1319)
      [2017-10-18 04:27:34,867] WARN Kill unknown task [orc_prod-foo-java.3a71388d-b247-11e7-89e0-0242a7c90d38] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$f6faf9ae:ForkJoinPool-2-worker-1723)
      [2017-10-18 04:27:34,873] WARN Kill unknown task [orc_prod-foo-java.2343620c-b247-11e7-89e0-0242a7c90d38] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$f6faf9ae:ForkJoinPool-2-worker-91)
      [2017-10-18 04:27:34,887] WARN Kill unknown task [orc_prod-foo-java.e6338436-b246-11e7-89e0-0242a7c90d38] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$f6faf9ae:ForkJoinPool-2-worker-277)
      [2017-10-18 04:27:34,905] WARN Kill unknown task [orc_prod-foo-java.6ad4dfed-b246-11e7-89e0-0242a7c90d38] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$f6faf9ae:ForkJoinPool-2-worker-1543)
      [2017-10-18 04:27:34,950] WARN Kill unknown task [orc_prod-foo-java.5251165c-b246-11e7-89e0-0242a7c90d38] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$f6faf9ae:ForkJoinPool-2-worker-1589)
      [2017-10-18 04:27:34,950] WARN Kill unknown task [orc_prod-foo-java.e0d4a86e-b245-11e7-89e0-0242a7c90d38] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$f6faf9ae:ForkJoinPool-2-worker-1425)
      [2017-10-18 04:27:34,956] WARN Kill unknown task [orc_prod-foo-java-share.841ca77d-b244-11e7-89e0-0242a7c90d38] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$f6faf9ae:ForkJoinPool-2-worker-91)
      [2017-10-18 04:27:35,048] WARN Kill unknown task [orc_prod-foo-java-share.a345cbf0-b244-11e7-89e0-0242a7c90d38] (mesosphere.marathon.core.task.update.impl.TaskStatusUpdateProcessorImpl$$EnhancerByGuice$$f6faf9ae:ForkJoinPool-2-worker-1765)
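      The "Kill unknown task" warnings come from the status-update path: when a Mesos status update arrives for a task ID that the (new) leader's task tracker does not know about, Marathon asks Mesos to kill that task. If the new leader takes over with an incomplete view of the running tasks, healthy tasks can be killed this way. A simplified sketch of that logic, with illustrative names (not the actual TaskStatusUpdateProcessorImpl source):

```scala
import scala.concurrent.{ExecutionContext, Future}

// Simplified sketch of the "Kill unknown task" path during status-update
// processing. All names here are illustrative, not the Marathon source.
final case class TaskId(value: String)
final case class TaskStatus(taskId: TaskId, state: String)

trait TaskTracker { def knownTask(id: TaskId): Future[Option[String]] }
trait Driver      { def killTask(id: TaskId): Unit }

class StatusUpdateSketch(tracker: TaskTracker, driver: Driver, log: String => Unit)
                        (implicit ec: ExecutionContext) {

  def processUpdate(status: TaskStatus): Future[Unit] =
    tracker.knownTask(status.taskId).map {
      case Some(_) =>
        // Normal path: update the stored state of a known task (omitted here).
        ()
      case None =>
        // The leader's task tracker does not know this task, so it is treated
        // as orphaned and killed -- the "Kill unknown task [...]" warning.
        log(s"Kill unknown task [${status.taskId.value}]")
        driver.killTask(status.taskId)
    }
}
```

      In other words, the kills above are consistent with the new leader taking over without knowledge of tasks that were still running on the agents.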

      Looking at previous issues, we found a similar one reported at https://jira.mesosphere.com/browse/MARATHON-2721

      The issue is "closed", but we are not sure in which version it was fixed. We are using:

      marathon : 1.1.7-1.0.611.ubuntu1404

      mesos : 0.28.2-2.0.27.ubuntu1404
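      For reference, the current leader can be checked, and if necessary asked to abdicate to force a re-election without restarting the process, through Marathon's /v2/leader endpoint (GET returns the leader, DELETE makes it step down). A minimal sketch, assuming the default REST API on port 8080:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object LeaderCheck {
  // Queries GET /v2/leader and, if "--abdicate" is passed, sends DELETE /v2/leader
  // to make the current leader step down and trigger a new election.
  def main(args: Array[String]): Unit = {
    val base = args.headOption.getOrElse("http://mesos-master2.prod-foo-dcos.foo.net:8080")

    val get = new URL(s"$base/v2/leader").openConnection().asInstanceOf[HttpURLConnection]
    println(Source.fromInputStream(get.getInputStream).mkString) // e.g. {"leader":"host:8080"}
    get.disconnect()

    if (args.contains("--abdicate")) {
      val del = new URL(s"$base/v2/leader").openConnection().asInstanceOf[HttpURLConnection]
      del.setRequestMethod("DELETE")
      println(s"DELETE /v2/leader -> ${del.getResponseCode}")
      del.disconnect()
    }
  }
}
```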

       

       


        People

        • Assignee: Unassigned
        • Reporter: Sathee Sathee
        • Team: Orchestration Team
        • Watchers: krishnaghatti, Matthias Eichstedt, Sathee Sathee
