  1. Marathon
  2. MARATHON-7458

Marathon appears to be either restoring state unprovoked, or standby is loading state too early

    Details

    • Sprint:
      Marathon Sprint 1.10-5, Marathon Sprint 1.10-6
    • Story Points:
      3

      Description

      We are experiencing a bug with the backup/restore capabilities in Marathon, specifically when using the new backup query parameter introduced in 1.5.0 (v2/leader?backup=file:///...) to trigger a backup. This is breaking the new dcos-backup feature being added for DC/OS 1.10.

      The steps to reproduce the error are:
      1) Start a single marathon app (a simple sleeper app will do)

      $ dcos marathon app add sleeper.json
      Created deployment dcf7b71b-d5ed-45b0-a9f8-02f148bb9125
      $ dcos marathon app list
      ID         MEM  CPUS  TASKS  HEALTH  DEPLOYMENT  WAITING  CONTAINER  CMD              
      /sleeper  128   1     1/1    N/A       ---      False       N/A     sleep 100000000
      

      2) Initiate a backup and wait for it to complete:

      $ curl 'https://104.197.9.37/marathon/v2/leader?backup=file:///var/lib/dcos/marathon/my-backup' -X DELETE
      {"message":"Leadership abdicated"}
      

      3) Remove the marathon app

      $ dcos marathon app remove /sleeper
      $ dcos marathon app list
      

      4) Abdicate leadership again (without creating a backup or initiating a restore)

      $ curl 'https://104.197.9.37/marathon/v2/leader' -X DELETE
      

      5) List the set of marathon apps

      $ dcos marathon app list
      ID         MEM  CPUS  TASKS  HEALTH  DEPLOYMENT  WAITING  CONTAINER  CMD              
      /sleeper  128   1     1/1    N/A       ---      False       N/A     sleep 100000000
      

      The task is back!!!

      For some reason, Marathon seems to be initiating a restore from the backup I created without my requesting it. I can also reproduce this in the opposite direction: I first create a backup with no apps running, start some apps, abdicate leadership, and my apps get killed.
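
The reproduction steps above can be collected into one script. This is only a sketch: the variable names, the `run` and `leader_url` helpers, the dry-run switch, and the fixed wait for a new leader are assumptions made for illustration and are not part of the report. It prints the commands by default; a real run needs `DRY_RUN=0` and may also need authentication or TLS options that the report does not show.

```shell
#!/usr/bin/env bash
# Reproduction sketch for the steps in this report.
# Prints the commands by default; set DRY_RUN=0 to run them against a cluster.
set -e

# Assumed, overridable settings (names are hypothetical, values from the report).
MARATHON_URL="${MARATHON_URL:-https://104.197.9.37/marathon}"
BACKUP_URI="${BACKUP_URI:-file:///var/lib/dcos/marathon/my-backup}"
APP_JSON="${APP_JSON:-sleeper.json}"
WAIT_SECONDS="${WAIT_SECONDS:-60}"   # assumed time for a new leader to take over
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Build the abdication URL, with a backup target if one is given.
leader_url() {
  if [ -n "${1:-}" ]; then
    printf '%s/v2/leader?backup=%s' "$MARATHON_URL" "$1"
  else
    printf '%s/v2/leader' "$MARATHON_URL"
  fi
}

reproduce() {
  run dcos marathon app add "$APP_JSON"             # 1) start the app
  run curl -X DELETE "$(leader_url "$BACKUP_URI")"  # 2) abdicate and back up
  run sleep "$WAIT_SECONDS"                         #    wait for a new leader
  run dcos marathon app remove /sleeper             # 3) remove the app
  run curl -X DELETE "$(leader_url)"                # 4) abdicate, no backup
  run sleep "$WAIT_SECONDS"                         #    wait for a new leader
  run dcos marathon app list                        # 5) list apps
}

reproduce
```

If the bug is present, the final `dcos marathon app list` shows /sleeper again even though it was removed in step 3; an empty list would be the expected result.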

              People

              • Assignee:
                junterstein Johannes Unterstein
                Reporter:
                klueska Kevin Klues (Inactive)
                Team:
                Orchestration Team
                Watchers:
                Adam Bordelon (Inactive), Aleksey Dukhovniy, Collin Van Dyck (Inactive), Johannes Unterstein, Kevin Klues (Inactive), Matthias Eichstedt, Richard Boyer, Tim Harper