  1. DC/OS
  2. DCOS_OSS-3642

DCOS 1.11.1 and Later Fails To Install

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: High
    • Resolution: Duplicate
    • Affects Version/s: DC/OS 1.11.1, DC/OS 1.11.2
    • Fix Version/s: DC/OS 1.11.3
    • Component/s: networking
    • Labels:

      Description

      With three sets of bootstrap files generated from the same `config.yml`, and served up by matchbox:

      drwxr-xr-x 5 dcosadmin dcosadmin 9 Jun 15 21:48 dcos-1.11.0
      drwxr-xr-x 5 dcosadmin dcosadmin 9 Jun 15 21:27 dcos-1.11.1
      drwxr-xr-x 5 dcosadmin dcosadmin 9 Jun 15 18:04 dcos-1.11.2

      Switching DC/OS install versions is done by moving a dcos symlink to the appropriate version.  The installation and bootstrap process is the same.
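A minimal sketch of the symlink switch described above, assuming the versioned directories and the `dcos` link live side by side in one served directory. The function name `switch_version` is made up for illustration and is not part of Matchbox or DC/OS. It creates the new link under a temporary name and renames it over the old one, so a client never sees a missing link mid-switch.

```python
import os


def switch_version(link_path: str, target: str) -> None:
    """Atomically repoint the symlink at link_path to target.

    target is interpreted relative to the directory that holds the link,
    e.g. "dcos-1.11.0" next to a "dcos" symlink.
    """
    link_dir = os.path.dirname(os.path.abspath(link_path))

    # Refuse to point the link at a version directory that does not exist.
    if not os.path.isdir(os.path.join(link_dir, target)):
        raise FileNotFoundError(f"no such version directory: {target}")

    # Build the new symlink under a temporary name in the same directory.
    tmp_link = os.path.join(
        link_dir, f".{os.path.basename(link_path)}.tmp.{os.getpid()}"
    )
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(target, tmp_link)

    # rename(2) replaces the old symlink in a single step.
    os.replace(tmp_link, link_path)
```

For example, `switch_version("/srv/assets/dcos", "dcos-1.11.0")` would roll back to the working release; the `/srv/assets` path is a placeholder, not the path used in this setup.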

      CoreOS and hardware:

      • CoreOS 1632.3.0 is installed on bare-metal Dell FC430s, which are my production hardware.
      • CoreOS 1745.6.0 is installed on bare-metal Dell R7425s because it is needed to support the Broadcom 10GbE adapter; this hardware is being evaluated for new agent and master nodes.

      The install and bootstrap process is exactly the same across these CoreOS and DC/OS versions, thanks to the use of Matchbox.

      DC/OS and CoreOS:

      • DC/OS 1.11.0 installs and works correctly on CoreOS 1632.3.0 (Stable) and CoreOS 1745.6.0 (Stable).
      • DC/OS 1.11.1 installs but fails to work correctly on CoreOS 1745.6.0 (Stable).
      • DC/OS 1.11.2 installs but fails to work correctly on CoreOS 1745.6.0 (Stable).

      Unfortunately, I am presently unable to evaluate DC/OS 1.11.1 on CoreOS 1632.3.0 because I have no spare Dell FC430s.  I do not believe the CoreOS version is the cause, as DC/OS 1.11.0 works on both versions of CoreOS under test.

      The Matchbox template file that might be relevant is here – https://gist.github.com/wbrown-lg/810c7802f59ab06858f00b20d02e387f

       

      The main difference with DC/OS 1.11.1 during installation is the following output:

      Jun 18 19:52:37 dcos201-jolly-a.dev.lgscout.com systemd[1]: Starting DC/OS Net: A distributed systems & network overlay orchestration engine...
      Jun 18 19:52:37 dcos201-jolly-a.dev.lgscout.com dcos-net-setup.py[4119]: iptables: Bad rule (does a matching rule exist in that chain?).
      Jun 18 19:52:37 dcos201-jolly-a.dev.lgscout.com dcos-net-setup.py[4140]: iptables: No chain/target/match by that name.
      Jun 18 19:52:38 dcos201-jolly-a.dev.lgscout.com dcos-net-setup.py[4225]: net.ipv6.conf.spartan.disable_ipv6 = 0
      Jun 18 19:52:38 dcos201-jolly-a.dev.lgscout.com bootstrap[4234]: [INFO] Clearing proxy environment variables
      Jun 18 19:52:38 dcos201-jolly-a.dev.lgscout.com bootstrap[4234]: [DEBUG] bootstrapping dcos-net
      Jun 18 19:52:38 dcos201-jolly-a.dev.lgscout.com systemd[1]: Started DC/OS Net: A distributed systems & network overlay orchestration engine.
      Jun 18 19:52:39 dcos201-jolly-a.dev.lgscout.com dcos-net-env[4246]: Exec: /opt/mesosphere/packages/dcos-net--f8971d40c76b0c54ccf0f1a3fe229bf5ad8c827c/dcos-net/erts-9.2/bin/erlexec -noshell -noinput +Bd -boot /opt/mesosphere/packages/dcos-net--f8971d40c76b0c54ccf0f1a3fe229bf5ad8c827c/dcos-net/releases/0.0.1/dcos-net -mode embedded -boot_var ERTS_LIB_DIR /opt/mesosphere/packages/dcos-net--f8971d40c76b0c54ccf0f1a3fe229bf5ad8c827c/dcos-net/lib -config /tmp/sys.config -args_file /tmp/vm.args -pa -- foreground
      Jun 18 19:52:39 dcos201-jolly-a.dev.lgscout.com dcos-net-env[4246]: Root: /opt/mesosphere/packages/dcos-net--f8971d40c76b0c54ccf0f1a3fe229bf5ad8c827c/dcos-net
      Jun 18 19:52:39 dcos201-jolly-a.dev.lgscout.com dcos-net-env[4246]: /opt/mesosphere/packages/dcos-net--f8971d40c76b0c54ccf0f1a3fe229bf5ad8c827c/dcos-net

       

      Specifically, it seems to be having issues with setting the firewall rules.  The agent services then can't start, as they can't resolve the ZooKeeper nodes, and dcos-net keeps getting killed by its watchdog:

      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [INFO] Clearing proxy environment variables
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [DEBUG] bootstrapping dcos-metrics-agent
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [INFO] Opening /var/lib/dcos for locking
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [INFO] Opening /var/lib/dcos
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [INFO] Opened /var/lib/dcos with fd 4
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [INFO] Taking exclusive lock on /var/lib/dcos
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [INFO] Locking fd 4
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [INFO] Locked fd 4
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [WARNING] Cannot resolve zk-1.zk: [Errno -2] Name or service not known
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [WARNING] Cannot resolve zk-2.zk: [Errno -2] Name or service not known
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [WARNING] Cannot resolve zk-3.zk: [Errno -2] Name or service not known
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [WARNING] Cannot resolve zk-4.zk: [Errno -2] Name or service not known
      Jun 18 19:55:06 dcos201-jolly-a.dev.lgscout.com bootstrap[5627]: [WARNING] Cannot resolve zk-5.zk: [Errno -2] Name or service not known
      Jun 18 19:55:07 dcos201-jolly-a.dev.lgscout.com dcos-net-watchdog.py[3976]: 2018-06-18 19:55:07,469 [ERROR] DNS Server Timeout
      Jun 18 19:55:07 dcos201-jolly-a.dev.lgscout.com dcos-net-watchdog.py[3976]: 2018-06-18 19:55:07,469 [INFO] Sending DNS query: 198.51.100.3
      Jun 18 19:55:11 dcos201-jolly-a.dev.lgscout.com systemd[1]: dcos-mesos-slave.service: Service hold-off time over, scheduling restart.
      Jun 18 19:55:11 dcos201-jolly-a.dev.lgscout.com systemd[1]: dcos-mesos-slave.service: Scheduled restart job, restart counter is at 20.
      Jun 18 19:55:11 dcos201-jolly-a.dev.lgscout.com systemd[1]: Stopped Mesos Agent: distributed systems kernel agent.
      Jun 18 19:55:11 dcos201-jolly-a.dev.lgscout.com systemd[1]: Starting Mesos Agent: distributed systems kernel agent...
      Jun 18 19:55:11 dcos201-jolly-a.dev.lgscout.com mesos-agent[5764]: ping: unknown host ready.spartan
      Jun 18 19:55:11 dcos201-jolly-a.dev.lgscout.com systemd[1]: dcos-mesos-slave.service: Control process exited, code=exited status=2
      Jun 18 19:55:11 dcos201-jolly-a.dev.lgscout.com systemd[1]: dcos-mesos-slave.service: Failed with result 'exit-code'.
      Jun 18 19:55:11 dcos201-jolly-a.dev.lgscout.com systemd[1]: Failed to start Mesos Agent: distributed systems kernel agent.
      Jun 18 19:55:12 dcos201-jolly-a.dev.lgscout.com dcos-net-watchdog.py[3976]: 2018-06-18 19:55:12,473 [ERROR] DNS Server Timeout
      Jun 18 19:55:12 dcos201-jolly-a.dev.lgscout.com dcos-net-watchdog.py[3976]: 2018-06-18 19:55:12,474 [INFO] Killing dcos-net.service
      Jun 18 19:55:12 dcos201-jolly-a.dev.lgscout.com dcos-net-watchdog.py[3976]: 2018-06-18 19:55:12,484 [INFO] dcos-net.service was successfully killed
      Jun 18 19:55:12 dcos201-jolly-a.dev.lgscout.com dcos-net-watchdog.py[3976]: 2018-06-18 19:55:12,484 [INFO] Sleeping for 60 seconds
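To compare a failing node against a working one, the symptoms in the logs above can be tallied from journal output. The following is only a rough sketch: the regular expressions are derived from the log lines quoted in this report, and the labels and the function name `count_symptoms` are illustrative, not part of DC/OS.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this report; labels are illustrative.
SYMPTOMS = {
    "iptables_error": re.compile(r"dcos-net-setup\.py\[\d+\]: iptables: "),
    "unresolved_name": re.compile(r"Cannot resolve \S+: |ping: unknown host "),
    "dns_timeout": re.compile(r"\[ERROR\] DNS Server Timeout"),
    "watchdog_kill": re.compile(r"\[INFO\] Killing dcos-net\.service"),
}


def count_symptoms(lines):
    """Count how many log lines match each known symptom pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in SYMPTOMS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Passing the lines of a saved journal dump from each node to `count_symptoms` gives a quick side-by-side view; a node running DC/OS 1.11.0 would be expected to show few or none of these matches, though that is an assumption rather than something measured here.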

      DC/OS 1.11.2 exhibits exactly the same behavior as above.

      After switching back to DC/OS 1.11.0, the installation works.
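A quick way to confirm the difference between releases is to check, on each node, whether the names that fail in the logs resolve at all. This is a hedged sketch: the name list is copied from the log excerpts in this report, while `check_names` and its `resolve` parameter are hypothetical helpers introduced here for illustration.

```python
import socket

# Names that fail to resolve in the logs quoted in this report.
NAMES = ["zk-1.zk", "zk-2.zk", "zk-3.zk", "zk-4.zk", "zk-5.zk", "ready.spartan"]


def check_names(names, resolve=socket.gethostbyname):
    """Return a dict mapping each name to its address, or None if unresolved."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror ("Name or service not known") is an OSError.
            results[name] = None
    return results
```

Calling `check_names(NAMES)` on a node and listing the entries whose value is `None` shows which names the local DNS setup cannot answer. The `resolve` argument exists only so the check can be exercised without a live resolver.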

       

              People

              • Assignee: Sergey Urbanovich (sergeyurbanovich)
              • Reporter: wbrown-lg
              • Team: Networking Team
              • Watchers: Deepak Goel, Sergey Urbanovich, wbrown-lg