Uploaded image for project: 'Marathon'
  1. Marathon
  2. MARATHON-7751

Two instances of app with persistent volume after its restart

    Details

    • Component Version:
    • Sprint:
      Marathon Sprint 1.11-8, Marathon Sprint 1.11-9, Marathon Sprint 1.11-10, Marathon Sprint 1.11-11, Marathon Sprint 1.11-12, Marathon Sprint 1.11-13
    • Story Points:
      3

      Description

      The issue surfaced when executing the following test on Marathon on Marathon 1.4.7:

      def test_restart_container_with_persistent_volume():
          """A task with a persistent volume, which writes to a file in the persistent volume, is launched.
             The app is killed and restarted and we can still read from the persistent volume what was written to it.
          """
      
          app_def = apps.persistent_volume_app()
          app_id = app_def['id']
      
          client = marathon.create_client()
          client.add_app(app_def)
      
          shakedown.deployment_wait()
      
          tasks = client.get_tasks(app_id)
          assert len(tasks) == 1, "The number of tasks is {} after deployment, but 1 was expected".format(len(tasks))
      
          port = tasks[0]['ports'][0]
          host = tasks[0]['host']
          cmd = "curl {}:{}/data/foo".format(host, port)
          run, data = shakedown.run_command_on_master(cmd)
      
          assert run, "{} did not succeed".format(cmd)
          assert data == 'hello\n', "'{}' was not equal to hello\\n".format(data)
      
          client.restart_app(app_id)
          shakedown.deployment_wait()
      
          @retrying.retry(wait_fixed=1000, stop_max_attempt_number=30, retry_on_exception=common.ignore_exception)
          def check_task_recovery():
              tasks = client.get_tasks(app_id)
              assert len(tasks) == 1, "The number of tasks is {} after recovery, but 1 was expected".format(len(tasks))
      
          check_task_recovery()
      
          port = tasks[0]['ports'][0]
          host = tasks[0]['host']
          cmd = "curl {}:{}/data/foo".format(host, port)
          run, data = shakedown.run_command_on_master(cmd)
      
          assert run, "{} did not succeed".format(cmd)
          assert data == 'hello\nhello\n', "'{}' was not equal to hello\\nhello\\n".format(data)
      

      It failed with the following message:

      E   AssertionError: The number of tasks is 2 after deployment, but 1 was expected
      

      Here is what happened:

      1. The app with a persistent volume started.
      2. The app got restarted.
      3. But the old instance didn't get stopped.
      4. For some time there two instances, until a Marathon scale check kicked in and scaled down the app.

      The app ID in the attached logs is persistent-volume-app-e5c906d9a70b4fb5b0622019696bda21.

      Please note that this issue is probably not fixed on master too. It is flaky, and it fails one time out of a thousand runs or so.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                ivanchernetsky Ivan Chernetsky
                Reporter:
                ivanchernetsky Ivan Chernetsky
                Team:
                Orchestration Team
                Watchers:
                Ivan Chernetsky, José Armando García Sancio (Inactive), Karsten Jeschkies, Ken Sipe, Marco Monaco, Matthias Eichstedt, Tim Harper
              • Watchers:
                7 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: