Marathon / MARATHON-7656

Marathon is sensitive to ZK restarts

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Cannot Reproduce
    • Affects Version/s: Marathon 1.3.12, Marathon 1.4.5
    • Fix Version/s: None
    • Component/s: Leader Election
    • Labels: None

      Description

      When Marathon loses its connection to ZK (e.g., ZK is restarted), it can end up in an
      unrecoverable state that prevents it from starting again. Marathon should not leave ZK in an
      invalid state that prevents it from starting. When Marathon discovers the invalid state, it
      fails to take leadership and ends up in a boot loop.

      This could be extremely hard to fix because Marathon creates relations in a non-relational store (ZK) without transactions.
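
      One common way to limit the damage when related entries are written without transactions is to order the writes so that an interruption leaves an orphan rather than a dangling reference. The sketch below is only an illustration of that idea, not a fix taken from this ticket: an in-memory map stands in for ZK, and every name in it (OrderedPlanStore, add, remove, isIndexed, isConsistent) is hypothetical and not part of Marathon.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: in-memory collections stand in for ZK nodes.
// None of these names come from Marathon's code base.
final class OrderedPlanStore {
    // Plan bodies keyed by plan id (the "referenced" entries).
    private final Map<String, byte[]> plans = new HashMap<>();
    // Ids that other state points at (the "references").
    private final Set<String> index = new HashSet<>();

    /** Write the plan body first, then publish its id. */
    void add(String id, byte[] body) {
        plans.put(id, body); // step 1: referenced entry exists
        index.add(id);       // step 2: reference becomes visible
    }

    /** Unpublish the id first, then delete the plan body. */
    void remove(String id) {
        index.remove(id);    // step 1: nothing points at the plan any more
        plans.remove(id);    // step 2: safe to drop the body
    }

    /** True when the id is currently published in the index. */
    boolean isIndexed(String id) {
        return index.contains(id);
    }

    /** True when every published id resolves to a stored plan body. */
    boolean isConsistent() {
        return plans.keySet().containsAll(index);
    }
}
```

      With this ordering, stopping between the two steps of add or remove leaves at most an unreferenced plan body, which a later cleanup could delete, and never an id that points at nothing.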

      ERROR Failed to take over leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:pool-1-thread-1)
      mesosphere.marathon.StoreCommandFailedException: Could not fetch DeploymentPlan with key: eedb8d92-eff7-4d35-9202-4b5ac62683d3
          at mesosphere.marathon.state.MarathonStore$$anonfun$fetch$1.applyOrElse(MarathonStore.scala:46)
          at mesosphere.marathon.state.MarathonStore$$anonfun$fetch$1.applyOrElse(MarathonStore.scala:41)
          at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
          at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
          at scala.util.Try$.apply(Try.scala:192)
          at scala.util.Failure.recover(Try.scala:216)
          at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
          at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
          at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
          at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
          at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
          at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
          at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
          at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      Caused by: java.lang.NullPointerException: null
          at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:180)
          at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:185)
          at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
          at mesosphere.marathon.Protos$ZKStoreEntry.parseFrom(Protos.java:34973)
          at mesosphere.util.state.zk.ZKData$.apply(ZKStore.scala:151)
          at mesosphere.util.state.zk.ZKStore$$anonfun$load$2.apply(ZKStore.scala:39)
          at mesosphere.util.state.zk.ZKStore$$anonfun$load$2.apply(ZKStore.scala:39)
          at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
          at scala.util.Try$.apply(Try.scala:192)
          at scala.util.Success.map(Try.scala:237)
          at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
          at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
          ... 6 common frames omitted
      ERROR Terminating after loss of leadership (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$28359304:pool-1-thread-1)
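
      The trace above shows the NullPointerException being thrown inside the protobuf parser while a store entry is loaded, which would be consistent with a node whose payload is null. The following is a minimal sketch, under that assumption, of a loader that reports such an entry as missing instead of throwing. It is not Marathon's implementation; SafeEntryLoader, Parser and load are hypothetical names, and Parser merely stands in for a real deserializer.

```java
import java.util.Optional;

// Hypothetical sketch; these names do not exist in Marathon.
final class SafeEntryLoader {

    /** Stand-in for a real deserializer, e.g. a generated parseFrom method. */
    interface Parser<T> {
        T parse(byte[] bytes) throws Exception;
    }

    /**
     * Returns Optional.empty() when the node holds no payload (null or zero
     * bytes) or when the payload cannot be parsed, so the caller can decide
     * whether to skip, repair, or report the entry instead of crashing.
     */
    static <T> Optional<T> load(byte[] raw, Parser<T> parser) {
        if (raw == null || raw.length == 0) {
            return Optional.empty(); // nothing stored under this key
        }
        try {
            return Optional.ofNullable(parser.parse(raw));
        } catch (Exception e) {
            return Optional.empty(); // corrupt payload is treated as missing
        }
    }
}
```

      Whether a missing DeploymentPlan should be skipped or should still abort the leadership takeover is a design decision this sketch does not settle; it only avoids failing on the null payload itself.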
      

              People

              • Assignee: Aleksey Dukhovniy (adukhovniy)
              • Reporter: janisz
              • Team: Orchestration Team
              • Watchers (6): Aleksey Dukhovniy, daltonmatos, Ivan Chernetsky, janisz, Matthias Eichstedt, Rob Johnson
