MARATHON-7757: Marathon 1.4.6 killing tasks after leader election

    Details

    • Sprint:
      Marathon Sprint 1.11-4, Marathon Sprint 1.11-5, Marathon Sprint 1.11-6, Marathon Sprint 1.11-7
    • Story Points:
      21

      Description

      We've just experienced more issues with Marathon killing lots of tasks after a leader election, this time on 1.4.6.

      Here's what I think is happening and some log snippets to back it up:

      The new leader starts up but fails to recover deployments and load the root group:

      [2017-09-11 06:30:28,343] INFO  All actors active
      [2017-09-11 06:30:28,394] INFO  Recovering all deployments on leader election (mesosphere.marathon.upgrade.DeploymentManager:marathon-akka.actor.default-dispatcher-19)
      [2017-09-11 06:30:28,485] WARN  Failed to load root group with version=2017-09-07T23:08:38.243Z (mesosphere.marathon.storage.repository.StoredGroupRepositoryImpl:ForkJoinPool-2-worker-75)
      [2017-09-11 06:30:28,491] ERROR Failed to recover deployments from repository: java.lang.IllegalStateException: Missing target or original (mesosphere.marathon.upgrade.DeploymentManager:marathon-akka.actor.default-dispatcher-19)
      [2017-09-11 06:30:29,604] ERROR While retrieving 61d2cd11-5c1a-46dc-a3f5-2bfa7437587f, either original (None) or target (None) were no longer available (mesosphere.marathon.storage.repository.StoredPlan:ForkJoinPool-2-worker-87)
      [2017-09-11 06:30:29,618] WARN  Failed to load root group with version=2017-09-07T23:08:59.307Z (mesosphere.marathon.storage.repository.StoredGroupRepositoryImpl:ForkJoinPool-2-worker-127)
      [2017-09-11 06:30:30,615] INFO  Task loading complete. (mesosphere.marathon.core.task.tracker.impl.InstanceTrackerActor:marathon-akka.actor.default-dispatcher-22)
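
      To illustrate what I think those errors mean (this is a rough sketch, not Marathon's actual code): a stored deployment plan seems to keep only references to the versions of its original and target root groups, so if loading either group version fails, the plan itself can't be reconstructed and recovery gives up. Something like:

      // Rough sketch only, assuming the plan stores group *versions* and has to
      // resolve them at recovery time; the names here are made up for illustration.
      import scala.concurrent.{ExecutionContext, Future}

      final case class StoredPlanSketch(planId: String, originalVersion: String, targetVersion: String) {
        def resolve(loadGroup: String => Future[Option[String]])
                   (implicit ec: ExecutionContext): Future[String] =
          for {
            original <- loadGroup(originalVersion)   // e.g. the load times out -> None
            target   <- loadGroup(targetVersion)
          } yield {
            (original, target) match {
              case (Some(o), Some(t)) => s"plan $planId: $o -> $t"
              // Mirrors "either original (None) or target (None) were no longer
              // available" and "Missing target or original" in the log above.
              case _ => throw new IllegalStateException("Missing target or original")
            }
          }
      }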
      

      Something then tries to load the apps and fails for a large number of them, like so:

      [2017-09-11 06:30:59,636] ERROR Failed to load /my-important-app:2017-09-06T20:33:05.604Z for group / (2017-09-10T14:26:48.276Z) (mesosphere.marathon.storage.repository.StoredGroup:ForkJoinPool-2-worker-127)
      mesosphere.marathon.util.TimeoutException: Timed out after 30 seconds
         at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
         at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      

      I think the timeout here is the zk_timeout, but I can't tell from the traceback exactly what timed out. I don't yet think ZooKeeper is overwhelmed, although I will be doing some checking on that front.
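
      For what it's worth, the akka.actor.Scheduler frame in that trace suggests the exception is raised by a scheduler tick racing the real work, which would explain why the trace doesn't show what was actually slow. A minimal sketch of that pattern (my own illustration, not Marathon's code):

      import akka.actor.ActorSystem
      import scala.concurrent.{Future, Promise, TimeoutException}
      import scala.concurrent.duration._

      object TimeoutSketch extends App {
        implicit val system: ActorSystem = ActorSystem("sketch")
        import system.dispatcher

        // Race a future against a scheduler tick: if the tick wins, we get a
        // TimeoutException whose stack points at the scheduler, not at the work.
        def timeout[T](after: FiniteDuration)(f: => Future[T]): Future[T] = {
          val timedOut = Promise[T]()
          system.scheduler.scheduleOnce(after) {
            timedOut.tryFailure(new TimeoutException(s"Timed out after ${after.toSeconds} seconds"))
          }
          Future.firstCompletedOf(Seq(f, timedOut.future))
        }

        // Stand-in for a slow storage read, e.g. loading a large root group.
        val slowLoad = Future { Thread.sleep(35000); "root group" }

        timeout(30.seconds)(slowLoad).failed.foreach { e =>
          println(e) // java.util.concurrent.TimeoutException: Timed out after 30 seconds
          system.terminate()
        }
      }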

      A short time later:

      [ERROR] [09/11/2017 06:36:52.229] [marathon-akka.actor.default-dispatcher-72] [akka://marathon/user/taskKillServiceActor/1/$f] InstanceKillProgressActor1718499249 was stopped before all instances are killed. Outstanding: instance [deploydash.main.git4e0fdba2.config0f1c3724.marathon-665c533c-956d-11e7-bf27-02422956c871]
      

      So it fails to kill the task? There's no indication of why it decided to kill it in the first place...
      Then a better clue:

      [2017-09-11 06:37:35,267] INFO  Killing instance [my-important-app.marathon-665c533c-956d-11e7-bf27-02422956c871] (mesosphere.marathon.SchedulerActions:ForkJoinPool-2-worker-91)
      [2017-09-11 06:37:35,267] INFO  Killing instance [my-important-app.marathon-80b9b090-9529-11e7-92ba-0242a7f30894] (mesosphere.marathon.SchedulerActions:ForkJoinPool-2-worker-91)
      [2017-09-11 06:37:35,268] INFO  Killing 1 tasks for reason: Orphaned (ids: instance [my-important-app.marathon-133a0177-956a-11e7-bf27-02422956c871] ...) (mesosphere.marathon.core.task.termination.impl.KillServiceDelegate$:ForkJoinPool-2-worker-91)
      [2017-09-11 06:37:35,267] INFO  Killing 1 tasks for reason: Orphaned (ids: instance [my-important-app.marathon-80b9b090-9529-11e7-92ba-0242a7f30894] ...) (mesosphere.marathon.core.task.termination.impl.KillServiceDelegate$:ForkJoinPool-2-worker-91)
      

      It seems Marathon thinks the tasks are orphans and so is killing them?

      So, my hunch is that we've hit some sort of scaling point in our largest cluster where the performance of loading apps suffers during startup. Because Marathon hasn't succeeded in loading the applications, it then starts treating their tasks as orphans and killing them? This has killed literally thousands of tasks in our production clusters. I could be completely wrong, though; this is just my first reading of the logs. We haven't seen this issue elsewhere, so I'm now going to try to reproduce it. In the meantime, if anyone has any suggestions on what Marathon is doing, I'd be very grateful. I'm happy to share full logs privately if that is helpful.
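
      To make the hunch concrete (again just a sketch of my hypothesis, not Marathon's actual orphan-detection logic): if the root group only partially loads, any running instance whose app id is missing from the loaded group would look orphaned, even though the app definition is still in ZooKeeper. Roughly:

      object OrphanHypothesis extends App {
        final case class InstanceSketch(instanceId: String, appId: String)

        // Instances whose app id is absent from the (partially) loaded root group.
        def orphaned(loadedAppIds: Set[String], running: Seq[InstanceSketch]): Seq[InstanceSketch] =
          running.filterNot(i => loadedAppIds.contains(i.appId))

        // /my-important-app timed out while loading, so its instances get flagged.
        val loaded  = Set("/some-other-app")
        val running = Seq(InstanceSketch("my-important-app.marathon-665c533c", "/my-important-app"))
        println(orphaned(loaded, running).map(_.instanceId)) // List(my-important-app.marathon-665c533c)
      }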

              People

              • Assignee:
                Johannes Unterstein (junterstein)
              • Reporter:
                mattmb
              • Team:
                Orchestration Team
              • Watchers:
                Harpreet Gulati, Johannes Unterstein, Mao Geng, Marco Monaco, Matthias Eichstedt, mattmb, ndigati, noproblem, paulhamby, PhilKershaw, Rob Johnson, samart, Stanislav Mushkat, Tim Harper
