  1. Marathon
  2. MARATHON-7763

DC/OS upgrade from 1.8.7 to 1.9.0 loses ZK state

    Details

      Description

      I upgraded my DC/OS cluster from 1.8.7 to 1.9.0, following the instructions at https://dcos.io/docs/1.9/administration/upgrading/

      My cluster runs CentOS 7 on SoftLayer VMs.

      On my bootstrap node I:

      • downloaded dcos_generate_config.sh
      • checked that my genconf/config.yml was in line with DC/OS 1.9
      • generated the installation files
      • started nginx

      On the master nodes, one at a time and starting with the followers, I:

      • downloaded the node upgrade script
      • ran the node upgrade script
      • monitored Exhibitor until the ZooKeeper ensemble was complete and shown as green again
      • checked that the return code of the node upgrade script was 0

      On the agent nodes, one at a time, I:

      • downloaded the node upgrade script
      • ran the node upgrade script
      • checked that the return code of the node upgrade script was 0
      • checked that the node rejoined the cluster, using the DC/OS and Mesos UIs
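
For reference, the "run the script, then check the return code" step above could be wrapped in a small helper like the one below. This is a minimal sketch, not the procedure from the upgrade documentation: `run_and_report` is a hypothetical helper name, and the script file name in the usage note is an assumption that does not appear in this ticket.

```shell
#!/bin/sh
# Sketch: run a command and report its exit code.
# run_and_report is a hypothetical helper, not part of DC/OS.

run_and_report() {
    # Run the given command with its arguments and capture the exit status
    # without aborting the calling shell if the command fails.
    local rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK: '$*' exited with 0"
    else
        echo "FAILED: '$*' exited with $rc" >&2
    fi
    # Pass the original exit status through to the caller.
    return "$rc"
}
```

A call might look like `run_and_report sudo bash dcos_node_upgrade.sh`, where the script name is assumed. The helper only surfaces the exit status; it does not verify that services on the node came back healthy.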

      The DC/OS UI shows the correct amount of allocated resources but does not show any services or tasks.

      In the Mesos UI I can now see two Marathon frameworks named "marathon", both running with the "dcos_marathon" principal but with different framework IDs. One is listed under "Active Frameworks" (ID ending with -0008) and one is listed under "Inactive Frameworks" (ID ending with -0000).

      The inactive framework still has all of the tasks running. No tasks were restarted during the upgrade. It seems that Marathon did not re-register under its existing framework ID but instead registered as a new framework.
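
The duplicate registration described above can also be checked outside the UI by filtering the Mesos master state JSON. The following is a rough sketch that assumes `jq` is installed and that the state document contains a `frameworks` array whose entries have `name`, `id`, and `active` fields; `list_frameworks` is a hypothetical helper name, and the URL in the usage note is an assumption about this cluster.

```shell
#!/bin/sh
# Sketch: list every framework with a given name from Mesos state JSON on stdin.
# Assumes jq is available and that each entry in .frameworks has
# "name", "id" and "active" fields (an assumption about the state format).

list_frameworks() {
    # $1 is the framework name to match, for example "marathon".
    # Prints one line per match: <framework id><TAB>active=<true|false>
    jq -r --arg name "$1" '
        .frameworks[]
        | select(.name == $name)
        | "\(.id)\tactive=\(.active)"'
}
```

For example, `curl -s http://<master>:5050/state | list_frameworks marathon`, with `<master>` replaced by a master address. The host, port, and path are assumptions, and a DC/OS cluster may require authentication. More than one output line for the same name would match the situation reported here.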

      I raised this issue on the DC/OS Slack, and @matthias asked me to open this ticket and attach the Mesos and Marathon logs from all master nodes.

      I have 3 masters in total. I created the log files by running the following on each master, with N replaced by the master number:

      journalctl -u dcos-marathon > master-N-marathon.log
      journalctl -u dcos-mesos-master > master-N-mesos-master.log

      To reduce the file size, I manually removed all lines before March 31, 2017 00:00:00 CDT, since I only performed the upgrade this morning. I am in the CEST timezone, but the servers run on CDT.
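
As an alternative to trimming the files by hand, `journalctl` accepts a `--since` option that limits output to entries at or after a given timestamp, interpreted in the server's local timezone. The sketch below reuses the two unit names and the file naming from the commands above; `collect_logs` is a hypothetical helper name and was not used to produce the attached files.

```shell
#!/bin/sh
# Sketch: collect Marathon and Mesos master logs from a start time onwards.
# collect_logs is a hypothetical helper; unit names are taken from the ticket.

collect_logs() {
    # $1: master number used in the output file names
    # $2: start timestamp in the server's local time, e.g. "2017-03-31 00:00:00"
    local n="$1" since="$2"
    journalctl -u dcos-marathon --since "$since" > "master-${n}-marathon.log"
    journalctl -u dcos-mesos-master --since "$since" > "master-${n}-mesos-master.log"
}
```

On the first master, for example: `collect_logs 1 "2017-03-31 00:00:00"`. The files are written to the current directory.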

        Attachments

        1. master-1-marathon.log (506 kB)
        2. master-1-mesos-master.log (3.13 MB)
        3. master-2-marathon.log (6.04 MB)
        4. master-2-mesos-master.log (5.13 MB)
        5. master-3-marathon.log (8.82 MB)
        6. master-3-mesos-master.log (86 kB)

              People

              • Assignee: Ivan Chernetsky (ivanchernetsky)
              • Reporter: nielspardon
              • Team: Orchestration Team
              • Watchers: Ivan Chernetsky, Karsten Jeschkies, Mao Geng, Matthias Eichstedt, Matthias Veit (Inactive), nielspardon, Tim Harper