  1. Marathon
  2. MARATHON-7698

Marathon is killed regularly after a few hours

    Details

    • Type: Task
    • Status: Resolved
    • Priority: High
    • Resolution: Cannot Reproduce
    • Affects Version/s: Marathon 1.4.3, Marathon 1.4.5
    • Fix Version/s: None
    • Component/s: Scheduling
    • Labels:

      Description

      Hi, 

      We have Marathon running inside a container with Mesos role services. Every few hours we get errors like this:

      \"description\":\"Unexpected task op 'LaunchTask' rejected for instance [APP_NAME.marathon-71c67fa8-8278-11e7-a4e9-02427315615e] with reason storage error: java.lang.RuntimeException: while asking for ForwardTaskOp(2017-08-16T11:45:57.463Z,instance [APP_NAME.marathon-71c67fa8-8278-11e7-a4e9-02427315615e],LaunchEphemeral(Instance(instance [APP_NAME.marathon-71c67fa8-8278-11e7-a4e9-02427315615e],AgentInfo(AGEN_IP,Some(d3248f66-9c43-4c8d-8b92-0f652ff375ea-S19),Vector(name: \\\"env\\\"\\ntype: TEXT\\ntext {\\n value: \\\"ENV_NAME\\\"\\n}\\n, name: \\\"group\\\"\\ntype: SCALAR\\nscalar {\\n value: 3.0\\n}\\n)),InstanceState(Created,2017-08-16T11:45:47.462Z,None,None),Map(task [APP_NAME.71c67fa8-8278-11e7-a4e9-02427315615e] -> LaunchedEphemeral(task [APP_NAME.71c67fa8-8278-11e7-a4e9-02427315615e],2017-08-02T16:12:24.902Z,Status(2017-08-16T11:45:47.462Z,None,None,Created,NetworkInfo(HOST_IP,Vector(62135),List())))),2017-08-02T16:12:24.902Z,UnreachableEnabled(300 seconds,600 seconds)))) on runSpec [SPACE_NAME] and instance [APP_NAME.marathon-71c67fa8-8278-11e7-a4e9-02427315615e]\",\"class\":\"mesosphere.marathon.core.launchqueue.impl.TaskLauncherActor\",\"thread\":\"marathon-akka.actor.default-dispatcher-18\",\"severity\":\"WARN\",\"sourceThread\":\"marathon-akka.actor.default-dispatcher-24\",\"akkaTimestamp\":\"11:45:57.478UTC\",\"akkaSource\":\"akka://marathon/user/launchQueue/1/APP_NAME\",\"sourceActorSystem\":\"marathon\",\"log_type\":\"application_log\"}"
      
      

      Then we get a few other errors like this:

      {\"@timestamp\":\"2017-08-16T11:45:57.478+00:00\",\"description\":\"Unexpected task op 'LaunchTask' rejected for instance [APP_NAME.marathon-71c6a6b9-8278-11e7-a4e9-02427315615e] with reason storage error: java.lang.RuntimeException: while asking for ForwardTaskOp(2017-08-16T11:45:57.463Z,instance [APP_NAME.marathon-71c6a6b9-8278-11e7-a4e9-02427315615e],LaunchEphemeral(Instance(instance [APP_NAME.marathon-71c6a6b9-8278-11e7-a4e9-02427315615e],AgentInfo(OTHER_AGEN_IP,Some(d3248f66-9c43-4c8d-8b92-0f652ff375ea-S20),Vector(name: \\\"env\\\"\\ntype: TEXT\\ntext {\\n value: \\\"dev112\\\"\\n}\\n, name: \\\"group\\\"\\ntype: SCALAR\\nscalar {\\n value: 3.0\\n}\\n)),InstanceState(Created,2017-08-16T11:45:47.462Z,None,None),Map(task [APP_NAME.71c6a6b9-8278-11e7-a4e9-02427315615e] -> LaunchedEphemeral(task [sAPP_NAME.71c6a6b9-8278-11e7-a4e9-02427315615e],2017-06-29T17:16:24.071Z,Status(2017-08-16T11:45:47.462Z,None,None,Created,NetworkInfo(OTHER_AGENT_IP,Vector(64219),List())))),2017-06-29T17:16:24.071Z,UnreachableEnabled(300 seconds,600 seconds)))) on runSpec [CATALOG_NAME] and instance [APP_NAME.marathon-71c6a6b9-8278-11e7-a4e9-02427315615e]\",\"class\":\"mesosphere.marathon.core.launchqueue.impl.TaskLauncherActor\",\"thread\":\"marathon-akka.actor.default-dispatcher-18\",\"severity\":\"WARN\",\"sourceThread\":\"marathon-akka.actor.default-dispatcher-18\",\"akkaTimestamp\":\"11:45:57.478UTC\",\"akkaSource\":\"akka://marathon/user/launchQueue/1/APP_NAME\",\"sourceActorSystem\":\"marathon\",\"log_type\":\"application_log\"}"
      
      

      Then Marathon tries to launch this task on a few other nodes, and next we get:

      "InstanceKillProgressActor764042887 was stopped before all instances are killed. Outstanding: instance [OtherAPP.marathon-346d2f14-818a-11e7-8c16-02427315615e]\",\"class\":\"mesosphere.marathon.core.task.termination.impl.InstanceKillProgressActor\",\"thread\":\"marathon-akka.actor.default-dispatcher-19\",\"severity\":\"ERROR\",\"sourceThread\":\"marathon-akka.actor.default-dispatcher-11\",\"akkaTimestamp\":\"11:46:00.112UTC\",\"akkaSource\":\"akka://marathon/user/taskKillServiceActor/1/$1r\",\"sourceActorSystem\":\"marathon\",\"log_type\":\"application_log\"}"
      
      

      This happens for a few apps; the container is then restarted and leadership switches to the second node. After a few hours we see the same situation on the second node.

      During these errors/warnings, and just before the crash, we see quite high load.
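For triage, the log excerpts above could be summarized by severity and emitting class to see which actor dominates just before a restart. The following is only a minimal sketch, not part of Marathon: the `summarize` helper and its regular expression are assumptions for illustration, and it assumes each log record sits on a single line with `severity` and `class` fields quoted as in the excerpts (with or without backslash-escaped quotes).

```python
import re
from collections import Counter

# Matches fields such as \"severity\":\"WARN\" or "severity":"WARN".
# The backslash before each quote is optional, since the excerpts above
# show the JSON with escaped quotes.
FIELD = re.compile(r'\\?"(severity|class)\\?":\\?"([^"\\]+)\\?"')


def summarize(lines):
    """Count log lines per (severity, short class name).

    Lines that lack either field are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = dict(FIELD.findall(line))
        if "severity" in fields and "class" in fields:
            # Keep only the last segment of the fully qualified class name.
            short_class = fields["class"].rsplit(".", 1)[-1]
            counts[(fields["severity"], short_class)] += 1
    return counts
```

Passing an open log file (or any iterable of lines) to `summarize` returns a `Counter`; calling `.most_common()` on it lists the most frequent severity/class pairs first, which may help correlate the `TaskLauncherActor` warnings with the later `InstanceKillProgressActor` errors.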

       


            People

            • Assignee: Tim Harper (tharper)
            • Reporter: kaarolch
            • Team: Orchestration Team
            • Watchers: kaarolch, Matthias Eichstedt
