DC/OS / DCOS_OSS-2002

Master Disk full -> ZK failure -> Loss of External persistent volume data.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: High
    • Resolution: Cannot Reproduce
    • Affects Version/s: DC/OS 1.8.4
    • Fix Version/s: None
    • Component/s: marathon
    • Labels: None
    • Environment: DC/OS v1.8.4 as provided by Azure Container Service.

      Description

      A disk-full condition on one of our master nodes caused a cascading failure and data loss:

      1. The Mesos master log files were never pruned and filled up the disk (14+ GB).
      2. ZooKeeper (ZK) failed partially: quorum was never lost, but it was not functioning either.
      3. Marathon rescheduled some tasks with local persistent volumes. For some reason the configured taskLostBehavior was not honored, and Mesos wiped the data from disk.
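
The first step in the cascade, an unpruned log directory filling the disk, is the kind of condition a simple watchdog can flag early. A minimal sketch, assuming Python 3 is available on the master; the 90% threshold and the helper names are illustrative assumptions, not DC/OS defaults or anything confirmed by this ticket:

```python
import shutil


def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_disk_nearly_full(path, threshold=0.90):
    """Return True if the filesystem holding path is at or above threshold.

    threshold is a hypothetical alerting level chosen for illustration.
    """
    return disk_usage_fraction(path) >= threshold
```

Called periodically against whichever directory holds the Mesos master logs on a given node (the location is not stated in this ticket), a check like this would raise an alert well before the disk fills. It does not prune anything itself.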

      The Marathon app definition contained:

      "residency": {
        "relaunchEscalationTimeoutSeconds": 3600,
        "taskLostBehavior": "WAIT_FOREVER"
      },
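
Since the configured value and the observed behavior diverged here, it can help during triage to confirm what a saved app definition actually specifies. A minimal sketch, assuming the definition is available as JSON; the `/example-app` id and the `task_lost_behavior` helper are hypothetical, and only the residency block is taken from this ticket:

```python
import json


def task_lost_behavior(app_definition):
    """Return residency.taskLostBehavior from an app definition, or None."""
    residency = app_definition.get("residency") or {}
    return residency.get("taskLostBehavior")


# Hypothetical app definition; only the residency block comes from the ticket.
app = json.loads("""
{
  "id": "/example-app",
  "residency": {
    "relaunchEscalationTimeoutSeconds": 3600,
    "taskLostBehavior": "WAIT_FOREVER"
  }
}
""")

print(task_lost_behavior(app))
```

This only verifies the configuration as written. It says nothing about whether Marathon or Mesos honored it at runtime, which is the open question in this report.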
      

      Vishnu suggested filing this ticket to confirm that this failure has been fixed in releases after DC/OS v1.8.4.

              People

              • Assignee: Unassigned
              • Reporter: Jeffrey Zampieron (jzampieron@zproject.net)
              • Team: Orchestration Team
              • Watchers: Jeffrey Zampieron, Marco Monaco
