  Marathon / MARATHON-8183

Marathon is not launching tasks on GPU agents

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Won't Do
    • Affects Version/s: None
    • Fix Version/s: DC/OS 1.12.0
    • Component/s: None
    • Labels:
    • Build artifact:
      Marathon-v1.7.174

      Description

      Expected behavior
      • I am able to run a task on a one-master / one-GPU-agent DC/OS cluster.
      • The task does not need a GPU resource added to its configuration.
      • Changing one task's configuration does not affect the deployment of another task.
      Actual behavior

      See video: https://cl.ly/0h2x0Z2d0M2n
      When running two tasks, neither of which has GPUs assigned, Marathon does not launch them. The log files show that Marathon declines the Mesos offers even though enough resources are available. After changing just one task's configuration by adding 1 GPU, Marathon suddenly accepts offers not only for the changed task but also for the task that was not touched at all.
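
      The pattern suggests that offers carrying GPU resources are filtered out for run specs that declare gpus: 0. Below is a minimal, hypothetical plain-Scala sketch of the kind of gate that seems to be involved; all names are made up for illustration, and the real logic lives in mesosphere.mesos.ResourceMatcher and is considerably more involved (see the "Runspec [...] doesn't require any GPU resources" log line in the deep dive below).

      // Hypothetical model (plain Scala, no Marathon/Mesos code) of the suspected gate:
      // an offer that carries GPU resources is only considered for a run spec that
      // either asks for GPUs itself or has some other reason (e.g. a persistent
      // volume) to land on the GPU agent.
      object GpuGateSketch extends App {
        case class Offer(agentId: String, cpus: Double, mem: Double, gpus: Double)
        case class RunSpec(id: String, cpus: Double, mem: Double, gpus: Double,
                           hasPersistentVolume: Boolean = false)

        def mayUseGpuAgent(spec: RunSpec, offer: Offer): Boolean =
          offer.gpus == 0 || spec.gpus > 0 || spec.hasPersistentVolume

        def matches(spec: RunSpec, offer: Offer): Boolean =
          mayUseGpuAgent(spec, offer) &&
            offer.cpus >= spec.cpus && offer.mem >= spec.mem && offer.gpus >= spec.gpus

        // Under this model a gpus = 0 app never matches the single GPU agent's
        // offer, while bumping it to gpus = 1 makes it eligible again.
        val gpuAgentOffer = Offer("S0", cpus = 4.0, mem = 14000, gpus = 1)
        val sleep1 = RunSpec("/sleep1", cpus = 0.1, mem = 32, gpus = 0)
        println(matches(sleep1, gpuAgentOffer))                // false
        println(matches(sleep1.copy(gpus = 1), gpuAgentOffer)) // true
      }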

      Deep dive

      Longer video: https://cl.ly/1A1F282Y3g3w
      Mesosphere DC/OS Version: 1.11.1
      Mesos Version: 1.5.0
      Marathon Version: 1.6.352

      Steps reproduced with log files, in order

      Task gets added, no GPU field present

      May 02 15:43:55 ip-10-0-1-35.us-west-2.compute.internal mesos-master[4507]: I0502 15:43:55.458536  4713 master.cpp:11553] Adding task sleep1.9efa5a64-4e1f-11e8-8681-429610c47168 with resources cpus(allocated: slave_public):0.1; mem(allocated: slave_public):32; gpus(allocated: slave_public):1 on agent 54d32950-8b68-4175-906d-42b967e1ec0e-S0 at slave(1)@10.0.5.250:5051 (10.0.5.250)
      

      Marathon issues a DECLINE call

      May 02 15:43:55 ip-10-0-1-35.us-west-2.compute.internal mesos-master[4507]: I0502 15:43:55.856812  4707 master.cpp:8870] Sending 1 offers to framework 54d32950-8b68-4175-906d-42b967e1ec0e-0001 (marathon) at scheduler-0101f6bc-a4b9-415b-9475-f278ccfcdbd8@10.0.1.35:15101
      May 02 15:43:55 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:43:55,857] INFO  No match for:54d32950-8b68-4175-906d-42b967e1ec0e-O9 from:10.0.5.250 reason:No offers wanted (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-5)
      May 02 15:43:55 ip-10-0-1-35.us-west-2.compute.internal mesos-master[4507]: I0502 15:43:55.858525  4721 master.cpp:5518] Processing DECLINE call for offers: [ 54d32950-8b68-4175-906d-42b967e1ec0e-O9 ] for framework 54d32950-8b68-4175-906d-42b967e1ec0e-0001 (marathon) at scheduler-0101f6bc-a4b9-415b-9475-f278ccfcdbd8@10.0.1.35:15101
      

       

      Marathon detects an upgrade for the sleep1 task

      May 02 15:44:37 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:37,547] INFO  /sleep1: upgrade detected for app (oldVersion FullVersionInfo(2018-05-02T15:43:49.780Z,2018-05-02T15:43:49.780Z,2018-05-02T15:43:49.780Z)) (mesosphere.marathon.upgrade.GroupVersioningUtil$:group-manager-module-thread-5)
      

      and suddenly Marathon issues an ACCEPT call for the offer

      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,545] INFO  Received new deployment plan 181d2edb-3c47-4e75-891f-c35aa0d02fbb for /sleep1, no conflicts detected (mesosphere.marathon.core.deployment.impl.DeploymentManagerActor:marathon-akka.actor.default-dispatcher-26)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal java[2974]: [myid:] INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x163217b5459001a type:setData cxid:0xa4 zxid:0xe2 txntype:-1 reqpath:n/a Error Path:/marathon/state/deployment/4/181d2edb-3c47-4e75-891f-c35aa0d02fbb Error:KeeperErrorCode = NoNode for /m
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal java[2974]: [myid:] INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x163217b5459001a type:create cxid:0xa5 zxid:0xe3 txntype:-1 reqpath:n/a Error Path:/marathon/state/deployment/4 Error:KeeperErrorCode = NoNode for /marathon/state/deployment/4
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,609] INFO  Stored new deployment plan 181d2edb-3c47-4e75-891f-c35aa0d02fbb for /sleep1 (mesosphere.marathon.core.deployment.impl.DeploymentManagerActor:marathon-akka.actor.default-dispatcher-23)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,609] INFO  Launching DeploymentActor for 181d2edb-3c47-4e75-891f-c35aa0d02fbb for /sleep1 (mesosphere.marathon.core.deployment.impl.DeploymentManagerActor:marathon-akka.actor.default-dispatcher-4)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,612] INFO  For minimumHealthCapacity 1.0 of /sleep1 leave 1 instances running, maximum capacity 2, killing 0 of 1 running instances immediately. (RunSpec version 2018-05-02T15:44:37.542Z) (mesosphere.marathon.core.deployment.impl.TaskReplaceActor$:marathon-akka.actor.default-dispatcher-2)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,613] INFO  reconcile: found 0 already started instances and 1 old instances (mesosphere.marathon.core.deployment.impl.TaskReplaceActor:marathon-akka.actor.default-dispatcher-2)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,613] INFO  Reconciling instances during app /sleep1 restart: queuing 1 new instances (mesosphere.marathon.core.deployment.impl.TaskReplaceActor:marathon-akka.actor.default-dispatcher-2)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,613] INFO  Resetting the backoff delay before restarting the runSpec (mesosphere.marathon.core.deployment.impl.TaskReplaceActor:marathon-akka.actor.default-dispatcher-2)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,613] INFO  getting new runSpec for '/sleep1', version 2018-05-02T15:44:37.542Z with 1 initial instances (mesosphere.marathon.core.launchqueue.impl.TaskLauncherActor:marathon-akka.actor.default-dispatcher-25)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,614] INFO  activating matcher ActorOfferMatcher(Actor[akka://marathon/user/launchQueue/1/2-sleep1#-885281249]). (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-25)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,614] INFO  Received offers WANTED notification (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-25)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,614] INFO  => revive offers NOW, canceling any scheduled revives (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-25)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,614] INFO  2 further revives still needed. Repeating reviveOffers according to --revive_offers_repetitions 3 (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-25)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,614] INFO  => Schedule next revive at 2018-05-02T15:44:45.614Z in 5000 milliseconds, adhering to --min_revive_offers_interval 5000 (ms) (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-25)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal mesos-master[4507]: I0502 15:44:40.614996  4713 master.cpp:5623] Processing REVIVE call for framework 54d32950-8b68-4175-906d-42b967e1ec0e-0001 (marathon) at scheduler-0101f6bc-a4b9-415b-9475-f278ccfcdbd8@10.0.1.35:15101
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal mesos-master[4507]: I0502 15:44:40.615237  4719 hierarchical.cpp:1339] Revived offers for roles { slave_public } of framework 54d32950-8b68-4175-906d-42b967e1ec0e-0001
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal mesos-master[4507]: I0502 15:44:40.615790  4713 master.cpp:8870] Sending 1 offers to framework 54d32950-8b68-4175-906d-42b967e1ec0e-0001 (marathon) at scheduler-0101f6bc-a4b9-415b-9475-f278ccfcdbd8@10.0.1.35:15101
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal update_engine[863]: I0502 15:44:40.624565   863 payload_processor.cc:282] Completed 968/1066 operations (90%)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal java[2974]: [myid:] INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x163217b5459001a type:setData cxid:0xb0 zxid:0xea txntype:-1 reqpath:n/a Error Path:/marathon/state/instance/a/sleep1.marathon-b9fdf106-4e1f-11e8-8681-429610c47168 Error:KeeperErrorCode =
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal nginx[5512]: ip-10-0-1-35.us-west-2.compute.internal nginx: 10.0.1.128 - - [02/May/2018:15:44:40 +0000] "PUT /service/marathon/v2/apps/sleep1?force=true HTTP/1.1" 200 103 "-" "python-requests/2.18.4"
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,616] INFO  Start processing offer 54d32950-8b68-4175-906d-42b967e1ec0e-O13. Current offer matcher count: 0 (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-23)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,617] INFO  Updated groups/apps/pods according to plan 181d2edb-3c47-4e75-891f-c35aa0d02fbb for /sleep1 (mesosphere.marathon.core.group.impl.GroupManagerImpl:group-manager-module-thread-3)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,617] INFO  Deployment 181d2edb-3c47-4e75-891f-c35aa0d02fbb:2018-05-02T15:44:37.542Z for /sleep1 acknowledged. Waiting to get processed (mesosphere.marathon.core.group.impl.GroupManagerImpl:group-manager-module-thread-3)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,617] INFO  Runspec [/sleep1] doesn't require any GPU resources but will be launched on an agent with GPU resources due to required persistent volume. (mesosphere.mesos.ResourceMatcher$:marathon-akka.actor.default-dispatcher-23)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,618] INFO  10.0.1.35 - - [02/May/2018:15:44:37 +0000] "PUT //fabianbaie-tf24c0-pub-mas-elb-1569797188.us-west-2.elb.amazonaws.com/v2/apps/sleep1?force=true HTTP/1.1" 200 92 "-" "python-requests/2.18.4" 307 (mesosphere.chaos.http.ChaosRequestLog:qtp944455655-62)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,619] INFO  No tasks left to launch. Stop receiving offers for /sleep1, 2018-05-02T15:44:37.542Z (mesosphere.marathon.core.launchqueue.impl.TaskLauncherActor:marathon-akka.actor.default-dispatcher-23)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal marathon[4937]: [2018-05-02 15:44:40,619] INFO  removing matcher ActorOfferMatcher(Actor[akka://marathon/user/launchQueue/1/2-sleep1#-885281249]) (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-2)
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal java[2974]: [myid:] INFO  [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x163217b5459001a type:create cxid:0xb1 zxid:0xeb txntype:-1 reqpath:n/a Error Path:/marathon/state/instance/a Error:KeeperErrorCode = NoNode for /marathon/state/instance/a
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal mesos-master[4507]: I0502 15:44:40.630812  4706 master.cpp:10795] Removing offer 54d32950-8b68-4175-906d-42b967e1ec0e-O13
      May 02 15:44:40 ip-10-0-1-35.us-west-2.compute.internal mesos-master[4507]: I0502 15:44:40.630939  4706 master.cpp:4293] Processing ACCEPT call for offers: [ 54d32950-8b68-4175-906d-42b967e1ec0e-O13 ] on agent 54d32950-8b68-4175-906d-42b967e1ec0e-S0 at slave(1)@10.0.5.250:5051 (10.0.5.250) for framework 54d32950-8b68-4175-906d-42b967e1ec0e-0001 (marathon) at scheduler-0101f6bc-a4b9-415b
      

       Entire master/agent logs are attached to this ticket.
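
       One detail visible in the logs above ("activating matcher ActorOfferMatcher", "Start processing offer") is that Marathon processes an incoming offer against every per-app matcher that is currently registered, so a single revived offer can serve more than one queued app. The toy model below (all names hypothetical, GPU filtering deliberately ignored) only illustrates that fan-out; the surprising part of this report is that /sleep2, still at gpus: 0, rides along.

       // Toy model of Marathon's offer fan-out: walk every queued app and hand it
       // whatever resources are still left in the offer. The real logic lives in
       // mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor;
       // this sketch ignores GPU filtering entirely.
       object OfferFanOutSketch extends App {
         case class Resources(cpus: Double, mem: Double, gpus: Double) {
           def covers(r: Resources): Boolean = cpus >= r.cpus && mem >= r.mem && gpus >= r.gpus
           def minus(r: Resources): Resources = Resources(cpus - r.cpus, mem - r.mem, gpus - r.gpus)
         }
         case class QueuedApp(id: String, wants: Resources)

         def matchOffer(offer: Resources, queue: List[QueuedApp]): (List[String], Resources) =
           queue.foldLeft((List.empty[String], offer)) { case ((launched, left), app) =>
             if (left.covers(app.wants)) (launched :+ app.id, left.minus(app.wants))
             else (launched, left)
           }

         // Once the GPU agent's offer is actually processed, it has room for both
         // sleep apps, so both can be launched off the same offer.
         val offer = Resources(cpus = 4.0, mem = 14000, gpus = 1)
         val queue = List(
           QueuedApp("/sleep1", Resources(0.1, 32, 1)),  // after the gpus=1 update
           QueuedApp("/sleep2", Resources(0.1, 32, 0)))
         println(matchOffer(offer, queue)._1)            // List(/sleep1, /sleep2)
       }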

      Steps to reproduce the behavior

      1) Create a one-master / one-GPU-agent cluster on AWS with Terraform:

      terraform apply -var-file desired_cluster_profile.tfvars.example

      desired_cluster_profile.tfvars.example

      dcos_version = "1.11.1"
      num_of_masters = "1"
      num_of_private_agents = "0"
      num_of_public_agents = "0"
      num_of_gpu_agents = "1"
      aws_profile = "273854932432_Mesosphere-PowerUser"
      

      2) Deploy your tasks

      dcos marathon app add sleep1.json
      dcos marathon app add sleep2.json
      

      sleep1.json

      {
        "id": "/sleep1",
        "backoffFactor": 1.15,
        "backoffSeconds": 1,
        "cmd": "curl ipinfo.io/ip; sleep 19999999",
        "container": {
          "type": "MESOS",
          "volumes": [],
          "docker": {
            "image": "alpine",
            "forcePullImage": false,
            "parameters": []
          }
        },
        "cpus": 0.1,
        "disk": 0,
        "instances": 1,
        "maxLaunchDelaySeconds": 3600,
        "mem": 32,
        "gpus": 0,
        "networks": [
          {
            "mode": "host"
          }
        ],
        "portDefinitions": [],
        "requirePorts": false,
        "upgradeStrategy": {
          "maximumOverCapacity": 1,
          "minimumHealthCapacity": 1
        },
        "killSelection": "YOUNGEST_FIRST",
        "unreachableStrategy": {
          "inactiveAfterSeconds": 0,
          "expungeAfterSeconds": 0
        },
        "healthChecks": [],
        "fetch": [],
        "constraints": []
      }
      

      sleep2.json

      {
        "id": "/sleep2",
        "backoffFactor": 1.15,
        "backoffSeconds": 1,
        "cmd": "curl ipinfo.io/ip; sleep 19999999",
        "container": {
          "type": "MESOS",
          "volumes": [],
          "docker": {
            "image": "alpine",
            "forcePullImage": false,
            "parameters": []
          }
        },
        "cpus": 0.1,
        "disk": 0,
        "instances": 1,
        "maxLaunchDelaySeconds": 3600,
        "mem": 32,
        "gpus": 0,
        "networks": [
          {
            "mode": "host"
          }
        ],
        "portDefinitions": [],
        "requirePorts": false,
        "upgradeStrategy": {
          "maximumOverCapacity": 1,
          "minimumHealthCapacity": 1
        },
        "killSelection": "YOUNGEST_FIRST",
        "unreachableStrategy": {
          "inactiveAfterSeconds": 0,
          "expungeAfterSeconds": 0
        },
        "healthChecks": [],
        "fetch": [],
        "constraints": []
      }
      

       
      3) Change the GPU configuration of one task

      # only sleep1's configuration is changed, yet this command fixes all apps
      dcos marathon app update sleep1 gpus=1 --force
      

      As you can see, suddenly not only sleep1 but also sleep2 is running.
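
      To double-check the final state without trusting the UI, something along these lines can ask the Marathon API how many tasks each app has running. This sketch assumes Marathon is reachable directly on port 8080 without authentication (on a real DC/OS cluster you would go through the admin router at /service/marathon with an ACS token instead); the master address is a placeholder.

      // Verification sketch: query /v2/apps/<id> and print the tasksRunning count.
      // The master address is a placeholder; direct, unauthenticated access to
      // Marathon on port 8080 is an assumption.
      import scala.io.Source

      object CheckSleepApps extends App {
        val marathon = "http://10.0.1.35:8080"
        val tasksRunning = """"tasksRunning"\s*:\s*(\d+)""".r
        for (app <- Seq("sleep1", "sleep2")) {
          val body = Source.fromURL(s"$marathon/v2/apps/$app").mkString
          val n = tasksRunning.findFirstMatchIn(body).map(_.group(1)).getOrElse("?")
          println(s"/$app -> tasksRunning = $n")
        }
      }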


            People

            • Assignee: Alena Varkockova (alenavarkockova)
            • Reporter: Fabian Baier (fabian)
            • Team: Orchestration Team
            • Watchers: Alena Varkockova, Fabian Baier, Harpreet Gulati
