  1. Marathon
  2. MARATHON-2360

Persistent volume apps stuck at suspended

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Storage Volumes
    • Labels:

      Description

      I'm having trouble running an app in Marathon using persistent local volumes. I followed the instructions: I started Marathon with a role and principal and created a simple app with a persistent volume, but the app just hangs at suspended. It seems that the slave has responded with a valid offer, but it can't actually start the app. The slave doesn't log anything about the task, even when I compile with the debug option and turn logging right up with GLOG_v=2.

      Marathon also seems to be constantly rolling the task ID as the task fails to start, but I can't see why anywhere.

      Oddly, when I run without a persistent volume but with a disk reservation, the app starts running.

      The debug logging on Marathon doesn't appear to show anything useful, though I could be missing something. Could anyone give me any pointers as to what the problem may be, or where to look for additional debug output? Many thanks in advance 😄

      Here's some info about my environment and debug info:

      Slave: Ubuntu 14.04 running Mesos 0.28 (prebuilt); also tested with Mesos 0.29 built from source
      Master: Mesos 0.28 running inside a Docker Ubuntu 14.04 image on CoreOS
      Marathon: 1.1.1 running inside a Docker Ubuntu 14.04 image on CoreOS

      App with persistent storage

      App info from v2/apps/test/tasks on Marathon

      {
        "app": {
          "id": "/test",
          "cmd": "while true; do sleep 10; done",
          "args": null,
          "user": null,
          "env": {},
          "instances": 1,
          "cpus": 1,
          "mem": 128,
          "disk": 0,
          "executor": "",
          "constraints": [
            [
              "role",
              "CLUSTER",
              "persistent"
            ]
          ],
          "uris": [],
          "fetch": [],
          "storeUrls": [],
          "ports": [
            10002
          ],
          "portDefinitions": [
            {
              "port": 10002,
              "protocol": "tcp",
              "labels": {}
            }
          ],
          "requirePorts": false,
          "backoffSeconds": 1,
          "backoffFactor": 1.15,
          "maxLaunchDelaySeconds": 3600,
          "container": {
            "type": "MESOS",
            "volumes": [
              {
                "containerPath": "test",
                "mode": "RW",
                "persistent": {
                  "size": 100
                }
              }
            ]
          },
          "healthChecks": [],
          "readinessChecks": [],
          "dependencies": [],
          "upgradeStrategy": {
            "minimumHealthCapacity": 0.5,
            "maximumOverCapacity": 0
          },
          "labels": {},
          "acceptedResourceRoles": null,
          "ipAddress": null,
          "version": "2016-05-19T11:31:54.861Z",
          "residency": {
            "relaunchEscalationTimeoutSeconds": 3600,
            "taskLostBehavior": "WAIT_FOREVER"
          },
          "versionInfo": {
            "lastScalingAt": "2016-05-19T11:31:54.861Z",
            "lastConfigChangeAt": "2016-05-18T16:46:59.684Z"
          },
          "tasksStaged": 0,
          "tasksRunning": 0,
          "tasksHealthy": 0,
          "tasksUnhealthy": 0,
          "deployments": [
            {
              "id": "4f3779e5-a805-4b95-9065-f3cf9c90c8fe"
            }
          ],
          "tasks": [
            {
              "id": "test.4b7d4303-1dc2-11e6-a179-a2bd870b1e9c",
              "slaveId": "9f7c6ed5-4bf5-475d-9311-05d21628604e-S17",
              "host": "ip-10-0-90-61.eu-west-1.compute.internal",
              "localVolumes": [
                {
                  "containerPath": "test",
                  "persistenceId": "test#testMGI-4b7d4302-1dc2-11e6-a179-a2bd870b1e9c"
                }
              ],
              "appId": "/test"
            }
          ]
        }
      }
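
The response above mixes the submitted settings with many server-filled defaults. As a minimal sketch, the fields that distinguish this app can be assembled into a smaller definition like the one below. The values are copied from the response above; the helper name `persistent_app_definition` is hypothetical, and how the definition is submitted to Marathon is assumed rather than shown.

```python
import json


def persistent_app_definition(app_id, cmd, volume_path, volume_size):
    """Build a trimmed app definition with one persistent local volume.

    Hypothetical helper: the field names and values mirror the API
    response in this report, nothing else is implied.
    """
    return {
        "id": app_id,
        "cmd": cmd,
        "instances": 1,
        "cpus": 1,
        "mem": 128,
        "constraints": [["role", "CLUSTER", "persistent"]],
        "container": {
            "type": "MESOS",
            "volumes": [
                {
                    "containerPath": volume_path,
                    "mode": "RW",
                    "persistent": {"size": volume_size},
                }
            ],
        },
        "residency": {
            "relaunchEscalationTimeoutSeconds": 3600,
            "taskLostBehavior": "WAIT_FOREVER",
        },
        "upgradeStrategy": {
            "minimumHealthCapacity": 0.5,
            "maximumOverCapacity": 0,
        },
    }


app = persistent_app_definition(
    "/test", "while true; do sleep 10; done", "test", 100
)
print(json.dumps(app, indent=2))
```

Keeping the definition this small makes it easier to change one field at a time (for example the volume size or the constraint) while narrowing down why the task never starts.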
      

      App info in Marathon: it seems the deployment is spinning.


      App without persistent storage

      App info from v2/apps/test2/tasks on Marathon

      {
        "app": {
          "id": "/test2",
          "cmd": "while true; do sleep 10; done",
          "args": null,
          "user": null,
          "env": {},
          "instances": 1,
          "cpus": 1,
          "mem": 128,
          "disk": 100,
          "executor": "",
          "constraints": [
            [
              "role",
              "CLUSTER",
              "persistent"
            ]
          ],
          "uris": [],
          "fetch": [],
          "storeUrls": [],
          "ports": [
            10002
          ],
          "portDefinitions": [
            {
              "port": 10002,
              "protocol": "tcp",
              "labels": {}
            }
          ],
          "requirePorts": false,
          "backoffSeconds": 1,
          "backoffFactor": 1.15,
          "maxLaunchDelaySeconds": 3600,
          "container": null,
          "healthChecks": [],
          "readinessChecks": [],
          "dependencies": [],
          "upgradeStrategy": {
            "minimumHealthCapacity": 0.5,
            "maximumOverCapacity": 0
          },
          "labels": {},
          "acceptedResourceRoles": null,
          "ipAddress": null,
          "version": "2016-05-19T13:44:01.831Z",
          "residency": null,
          "versionInfo": {
            "lastScalingAt": "2016-05-19T13:44:01.831Z",
            "lastConfigChangeAt": "2016-05-19T13:09:20.106Z"
          },
          "tasksStaged": 0,
          "tasksRunning": 1,
          "tasksHealthy": 0,
          "tasksUnhealthy": 0,
          "deployments": [],
          "tasks": [
            {
              "id": "test2.bee624f1-1dc7-11e6-b98e-568f3f9dead8",
              "slaveId": "9f7c6ed5-4bf5-475d-9311-05d21628604e-S18",
              "host": "ip-10-0-90-61.eu-west-1.compute.internal",
              "startedAt": "2016-05-19T13:44:02.190Z",
              "stagedAt": "2016-05-19T13:44:02.023Z",
              "ports": [
                31926
              ],
              "version": "2016-05-19T13:44:01.831Z",
              "ipAddresses": [
                {
                  "ipAddress": "10.0.90.61",
                  "protocol": "IPv4"
                }
              ],
              "appId": "/test2"
            }
          ],
          "lastTaskFailure": {
            "appId": "/test2",
            "host": "ip-10-0-90-61.eu-west-1.compute.internal",
            "message": "Slave ip-10-0-90-61.eu-west-1.compute.internal removed: health check timed out",
            "state": "TASK_LOST",
            "taskId": "test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c",
            "timestamp": "2016-05-19T13:15:24.155Z",
            "version": "2016-05-19T13:09:20.106Z",
            "slaveId": "9f7c6ed5-4bf5-475d-9311-05d21628604e-S17"
          }
        }
      }
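
Since one app starts and the other does not, it can help to see exactly which settings differ between them. The sketch below compares two app definitions key by key. `diff_top_level` is a hypothetical helper, and the two dictionaries are trimmed to a few fields taken from the responses in this report.

```python
def diff_top_level(left, right):
    """Return {key: (left_value, right_value)} for top-level keys that differ.

    A key missing on one side is treated the same as a null value.
    """
    diff = {}
    for key in sorted(set(left) | set(right)):
        if left.get(key) != right.get(key):
            diff[key] = (left.get(key), right.get(key))
    return diff


# Trimmed copies of the two apps from this report.
with_volume = {
    "id": "/test",
    "cpus": 1,
    "mem": 128,
    "disk": 0,
    "container": {
        "type": "MESOS",
        "volumes": [
            {"containerPath": "test", "mode": "RW", "persistent": {"size": 100}}
        ],
    },
    "residency": {
        "relaunchEscalationTimeoutSeconds": 3600,
        "taskLostBehavior": "WAIT_FOREVER",
    },
}
without_volume = {
    "id": "/test2",
    "cpus": 1,
    "mem": 128,
    "disk": 100,
    "container": None,
    "residency": None,
}

differences = diff_top_level(with_volume, without_volume)
for key, (a, b) in differences.items():
    print(f"{key}: {a!r} -> {b!r}")
```

For these two inputs only `container`, `disk`, `id` and `residency` differ, which matches the description: the working app reserves disk but has no persistent volume and no residency settings.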
      

      Slave log when running the app without a persistent volume:

      I0519 13:09:22.471876 12459 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
      I0519 13:09:22.471906 12459 status_update_manager.cpp:497] Creating StatusUpdate stream for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
      I0519 13:09:22.472262 12459 status_update_manager.cpp:824] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
      I0519 13:09:22.477686 12459 status_update_manager.cpp:374] Forwarding update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000 to the agent
      I0519 13:09:22.477830 12453 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.477814016+00:00
      I0519 13:09:22.477967 12453 slave.cpp:3638] Forwarding the update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000 to master@10.0.82.230:5050
      I0519 13:09:22.478185 12453 slave.cpp:3532] Status update manager successfully handled status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
      I0519 13:09:22.478229 12453 slave.cpp:3548] Sending acknowledgement for status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000 to executor(1)@10.0.90.61:34262
      I0519 13:09:22.488315 12460 pid.cpp:95] Attempting to parse 'master@10.0.82.230:5050' into a PID
      I0519 13:09:22.488370 12460 process.cpp:646] Parsed message name 'mesos.internal.StatusUpdateAcknowledgementMessage' for slave(1)@10.0.90.61:5051 from master@10.0.82.230:5050
      I0519 13:09:22.488452 12452 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.488441856+00:00
      I0519 13:09:22.488600 12458 process.cpp:2605] Resuming (14)@10.0.90.61:5051 at 2016-05-19 13:09:22.488590080+00:00
      I0519 13:09:22.488632 12458 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
      I0519 13:09:22.488726 12458 status_update_manager.cpp:824] Checkpointing ACK for status update TASK_RUNNING (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
      I0519 13:09:22.492985 12452 process.cpp:2605] Resuming slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.492974080+00:00
      I0519 13:09:22.493021 12452 slave.cpp:2629] Status update manager successfully handled status update acknowledgement (UUID: 36c1f0cb-2fcd-44b9-ab79-cef81c2094be) for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000
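
At GLOG_v=2 the slave log is noisy, so filtering it by task ID makes it easier to confirm whether a task is mentioned at all. The following is a minimal sketch that parses lines in the format shown above and keeps only those that mention a given task. The regular expression is an assumption based on the lines in this report, and `parse_glog_line` and `entries_for_task` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional

# Matches lines such as:
# I0519 13:09:22.471876 12459 status_update_manager.cpp:320] message...
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<source>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    severity: str
    time: str
    source: str
    line: int
    message: str


def parse_glog_line(raw: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the format."""
    m = GLOG_LINE.match(raw.strip())
    if m is None:
        return None
    return LogEntry(
        m["severity"], m["time"], m["source"], int(m["line"]), m["message"]
    )


def entries_for_task(lines: List[str], task_id: str) -> List[LogEntry]:
    """Keep only parsed entries whose message mentions the given task ID."""
    parsed = (parse_glog_line(raw) for raw in lines)
    return [e for e in parsed if e is not None and task_id in e.message]


# Two lines copied from the log above.
sample = [
    "I0519 13:09:22.471906 12459 status_update_manager.cpp:497] Creating "
    "StatusUpdate stream for task test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c "
    "of framework 1a6352a6-d690-41a2-967e-07342bba56d2-0000",
    "I0519 13:09:22.477830 12453 process.cpp:2605] Resuming "
    "slave(1)@10.0.90.61:5051 at 2016-05-19 13:09:22.477814016+00:00",
]

matches = entries_for_task(sample, "test2.e74fb439-1dc2-11e6-a179-a2bd870b1e9c")
for entry in matches:
    print(entry.time, entry.source, entry.line)
```

Running the same filter with the task ID of the persistent-volume app would show quickly whether the slave ever logged anything for it.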
      
      

            People

            • Assignee: Unassigned
            • Reporter: GitHub_janstenpickle Chris Jansen (Inactive)
            • Team: Orchestration Team