  1. Marathon
  2. MARATHON-3462

Marathon lost FrameworkId and tasks after crashing

    Details

    • Type: Task
    • Status: Resolved
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Environment:
      Marathon 0.10.1 (--ha mode with 3 instances)
      Mesos 0.23 (3 instances running on the same hosts that Marathon is)
      A lot of event subscribers (we are using Qubit Bamboo for service discovery)
      About 120 apps and 400 tasks

      The first error was in instance 13 (current leader):

       log
      [ERROR] [09/07/2015 01:31:35.640] [marathon-akka.actor.default-dispatcher-52] [ActorSystem(marathon)] Uncaught error from thread [marathon-akka.actor.default-dispatcher-52] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled
      java.lang.StackOverflowError
              at scala.collection.MapLike$FilteredKeys.iterator(MapLike.scala:232)
              at scala.collection.MapLike$FilteredKeys.iterator(MapLike.scala:232)
              at scala.collection.MapLike$FilteredKeys.iterator(MapLike.scala:232)
              at scala.collection.MapLike$FilteredKeys.iterator(MapLike.scala:232)
              at scala.collection.MapLike$FilteredKeys.iterator(MapLike.scala:232)
              at scala.collection.MapLike$FilteredKeys.iterator(MapLike.scala:232)
              at scala.collection.MapLike$FilteredKeys.iterator(MapLike.scala:232)
              ...
              at scala.collection.MapLike$FilteredKeys.iterator(MapLike.scala:232)
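
The repeated `MapLike$FilteredKeys.iterator` frames are consistent with many lazy `filterKeys` views stacked on one map. In Scala 2.x, `filterKeys` returns a wrapper rather than a copy, so each view's iterator first calls the iterator of the view beneath it. Whether Marathon builds such a chain here is an assumption; the trace alone does not show the call site. The sketch below is a Python model of that pattern, not Marathon or Scala code, and `FilteredKeysView`, `stack_lazy_filters`, and `apply_strict_filters` are hypothetical names.

```python
import sys


class FilteredKeysView:
    """Lazy view over a mapping, keeping only keys that satisfy a predicate.

    Loosely models the wrapper behaviour of Scala 2.x `filterKeys`: nothing is
    copied, and every request for items is forwarded to the underlying mapping.
    """

    def __init__(self, underlying, predicate):
        self.underlying = underlying
        self.predicate = predicate

    def items(self):
        # Asking for our items first asks the underlying mapping for its items,
        # so N stacked views need N nested calls before iteration can start.
        return filter(lambda kv: self.predicate(kv[0]), self.underlying.items())


def stack_lazy_filters(mapping, depth):
    """Wrap `mapping` in `depth` lazy views, one per hypothetical update."""
    view = mapping
    for i in range(depth):
        view = FilteredKeysView(view, lambda key, i=i: key != f"task-{i}")
    return view


def apply_strict_filters(mapping, depth):
    """Apply the same filters, but materialize each step into a plain dict."""
    current = dict(mapping)
    for i in range(depth):
        current = {k: v for k, v in current.items() if k != f"task-{i}"}
    return current


if __name__ == "__main__":
    tasks = {"task-0": "staged", "task-1": "running", "keep": "running"}
    depth = sys.getrecursionlimit() + 100

    try:
        list(stack_lazy_filters(tasks, depth).items())
    except RecursionError:
        print("lazy views: recursion limit exceeded")

    print("strict copies:", apply_strict_filters(tasks, depth))
```

Materializing each filtered result, as `apply_strict_filters` does, keeps the call depth constant no matter how many filters have been applied.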
      

      This happened at the same time as Mesos reconciliation.
      More logs from the same instance:

       log
      [2015-09-07 01:31:35,661] INFO Shutting down services (mesosphere.marathon.Main$:42)
      
      [2015-09-07 01:31:35,708] INFO Shutdown triggered (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:167)
      [2015-09-07 01:31:35,708] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:222)
      
      [2015-09-07 01:31:35,804] INFO Driver future completed. Executing optional abdication command. (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:193)
      [2015-09-07 01:31:35,811] INFO Cancelling timer (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:173)
      [2015-09-07 01:31:35,813] INFO Defeated (Leader Interface) (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:233)
      [2015-09-07 01:31:35,813] INFO Defeat leadership (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:276)
      [2015-09-07 01:31:35,813] INFO Now standing by. Closing existing handles and rejecting new. (mesosphere.marathon.event.http.HttpEventStreamActor:96)
      [INFO] [09/07/2015 01:31:35.813] [marathon-akka.actor.default-dispatcher-62] [akka://marathon/user/MarathonScheduler/$a] Suspending scheduler actor
      
      [2015-09-07 01:31:35,816] INFO Do not proxy to myself. Waiting for consistent leadership state. Are we leader?: false, leader: Some(10.13.12.13:8080) (mesosphere.marathon.api.LeaderProxyFilter$:130)
      
      [2015-09-07 01:31:35,819] INFO Leadership info is consistent again! (mesosphere.marathon.api.LeaderProxyFilter$:99)
      

      Here is the log of the newly elected leader:

       log
      [2015-09-07 01:31:35,813] INFO Candidate /marathon3/leader/member_0000000174 is now leader of group: [member_0000000176, member_0000000174] (com.twitter.common.zookeeper.CandidateImpl:152)
      [2015-09-07 01:31:35,846] INFO Elected (Leader Interface) (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:242)
      [2015-09-07 01:31:35,926] INFO Elect leadership (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:290)
      [2015-09-07 01:31:35,927] INFO Running driver (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:185)
      [2015-09-07 01:31:35,926] INFO Migration successfully applied for version Version(0, 10, 0) (mesosphere.marathon.state.Migration:70)
      
      [INFO] [09/07/2015 01:31:35.949] [marathon-akka.actor.default-dispatcher-64] [akka://marathon/user/MarathonScheduler/$a] Starting scheduler actor
      
      [2015-09-07 01:31:35,949] INFO Became active. Accepting event streaming requests. (mesosphere.marathon.event.http.HttpEventStreamActor:101)
      
      [INFO] [09/07/2015 01:31:35.977] [marathon-akka.actor.default-dispatcher-64] [akka://marathon/user/MarathonScheduler/$a] Scheduler actor ready
      
      [2015-09-07 01:31:36,078] INFO Reset offerLeadership backoff (mesosphere.marathon.MarathonSchedulerService$$EnhancerByGuice$$2c1b9a:320)
      [2015-09-07 01:31:36,097] INFO Registered as 20150902-231517-235670794-5050-13589-0000 to master '20150902-231517-235670794-5050-13589' (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$a7f5cf0a:55)
      
      [INFO] [09/07/2015 01:31:36.106] [marathon-akka.actor.default-dispatcher-64] [akka://marathon/user/offerReviver] Received scheduler registration event
      [INFO] [09/07/2015 01:31:36.106] [marathon-akka.actor.default-dispatcher-71] [akka://marathon/user/$b] POSTing to all endpoints.
      [INFO] [09/07/2015 01:31:36.106] [marathon-akka.actor.default-dispatcher-64] [akka://marathon/user/offerReviver] => revive offers NOW, canceling any scheduled revives
      

      Afterwards I saw two frameworks running side by side, and I had to shut down all the old tasks myself because Marathon no longer recognized them.

      I suspect the event bus caused the crash, but I think the bigger problem is that Marathon did not recognize the old FrameworkId.

      Let me know if you need any more information.

            People

            • Assignee: Unassigned
            • Reporter: GitHub_gomes Diogo Gomes (Inactive)
            • Team: Orchestration Team
