  1. DC/OS
  2. DCOS_OSS-4205

IP routing table not rebuilt correctly after reboot.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: High
    • Resolution: Duplicate
    • Affects Version/s: DC/OS 1.11.6
    • Fix Version/s: None
    • Component/s: dcos-net
    • Labels:

      Description

      When the masters/agents are physically rebooted and come back up, the IP routing table is not rebuilt correctly on every node.

      Example (node-0004 has eight overlay routes via vtep1024; node-0003 has only three):

      node-0004 ~ ❯❯❯ route
      Kernel IP routing table
      Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
      default         gateway         0.0.0.0         UG    0      0        0 eth0
      9.0.0.0         44.128.0.1      255.255.255.0   UG    0      0        0 vtep1024
      9.0.1.0         44.128.0.2      255.255.255.0   UG    0      0        0 vtep1024
      9.0.2.0         44.128.0.3      255.255.255.0   UG    0      0        0 vtep1024
      9.0.3.0         44.128.0.4      255.255.255.0   UG    0      0        0 vtep1024
      9.0.4.0         44.128.0.5      255.255.255.0   UG    0      0        0 vtep1024
      9.0.5.0         44.128.0.6      255.255.255.0   UG    0      0        0 vtep1024
      9.0.6.0         44.128.0.7      255.255.255.0   UG    0      0        0 vtep1024
      9.0.7.0         44.128.0.8      255.255.255.0   UG    0      0        0 vtep1024
      9.0.9.128       0.0.0.0         255.255.255.128 U     0      0        0 d-dcos
      44.128.0.0      0.0.0.0         255.255.240.0   U     0      0        0 vtep1024
      169.254.169.254 172.xxxx.xxx.17 255.255.255.255 UGH   0      0        0 eth0
      172.17.0.0      0.0.0.0         255.255.192.0   U     0      0        0 eth0
      172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
      172.19.0.0      0.0.0.0         255.255.0.0     U     0      0        0 d-dcos6

      node-0003 ~ ❯❯❯ route
      Kernel IP routing table
      Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
      default         gateway         0.0.0.0         UG    0      0        0 eth0
      9.0.5.0         44.128.0.6      255.255.255.0   UG    0      0        0 vtep1024
      9.0.6.0         44.128.0.7      255.255.255.0   UG    0      0        0 vtep1024
      9.0.7.128       0.0.0.0         255.255.255.128 U     0      0        0 d-dcos
      9.0.9.0         44.128.0.10     255.255.255.0   UG    0      0        0 vtep1024
      44.128.0.0      0.0.0.0         255.255.240.0   U     0      0        0 vtep1024
      169.254.169.254 172.xxx.xxx.17  255.255.255.255 UGH   0      0        0 eth0
      172.17.0.0      0.0.0.0         255.255.192.0   U     0      0        0 eth0
      172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
      172.19.0.0      0.0.0.0         255.255.0.0     U     0      0        0 d-dcos6
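
Comparing the two tables by eye is error-prone, so here is a minimal sketch for diffing the overlay routes of two nodes. It assumes the output of route has been saved to one file per node; the function names and the file names in the usage note are hypothetical, and vtep1024 is taken from the tables above.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative and not part of DC/OS.

# Print the destination of every gateway route (Flags contains G) on the
# overlay interface. Reads `route` output on stdin.
# $1: interface name (defaults to vtep1024, as seen in this report).
overlay_routes() {
    awk -v iface="${1:-vtep1024}" '$NF == iface && $4 ~ /G/ { print $1 }' | sort
}

# Print overlay destinations present in the first saved table but absent
# from the second. $1 and $2 are files holding `route` output.
missing_overlay_routes() {
    awk -v iface="${3:-vtep1024}" '
        $NF == iface && $4 ~ /G/ {
            if (FILENAME == ARGV[1]) ref[$1] = 1
            else seen[$1] = 1
        }
        END { for (d in ref) if (!(d in seen)) print d }
    ' "$1" "$2" | sort
}
```

For example, missing_overlay_routes node-0004.txt node-0003.txt lists the subnets that node-0004 can reach over the overlay but node-0003 cannot. A node has no vtep1024 route to its own subnet (node-0003 holds 9.0.7.128 on d-dcos), so an entry for the local subnet in the output is probably not a real gap.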

      Checking journalctl -flu dcos-net showed the following error on the "broken" nodes:

      Sep 26 21:27:34 xxxxxx-0003.xxxxxx dcos-net-env[2641]: badmatch,{error,2415919103,[],[{dcos_overlay_configure,configure_overlay_entry,4,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,116}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{dcos_overlay_configure,maybe_configure,2,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,38}]}]}
      Sep 26 21:27:34 xxxxxx-0003.xxxxxx dcos-net-env[2641]: 21:27:34.138 [error] Supervisor dcos_overlay_sup had child dcos_overlay_lashup_kv_listener started with dcos_overlay_lashup_kv_listener:start_link() at <0.21383.0> exit with reason no match of right hand value {error,2415919103,[]} in dcos_overlay_configure:configure_overlay_entry/4 line 116 in context child_terminated
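
To tell quickly which nodes are affected, the journal can be scanned for this crash signature. The following is a rough sketch; the function name is hypothetical, and the match string is simply the function/arity that appears in the log lines above.

```shell
#!/bin/sh
# Sketch only: counts journal lines that mention the crash seen in this report.
# Reads log text on stdin and prints the number of matching lines (0 if none).
overlay_crash_count() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps that
    # from being treated as a failure by the caller.
    grep -c 'dcos_overlay_configure:configure_overlay_entry/4' || true
}
```

On a node this could be run as journalctl -u dcos-net -b --no-pager | overlay_crash_count, where a non-zero count suggests the node hit the same badmatch since its last boot.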

      The problem reproduces on every reboot. The only workaround we have found is to run the following commands on every node until all routes are built; sometimes they have to be run several times before the routes are correct.

      sudo systemctl stop dcos-net
      sudo rm -rf /var/lib/dcos/navstar/lashup/*
      sudo rm -rf /var/lib/dcos/navstar/mnesia/*
      sudo systemctl start dcos-net
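
Since the commands above sometimes need several passes, the workaround can be wrapped in a bounded retry loop. This is a sketch under stated assumptions, not a supported procedure: the function names, the expected route count, the retry limit and the settle time are all hypothetical, and the script still has to be run on each node.

```shell
#!/bin/sh
# Sketch only: wraps the manual workaround from this report in a retry loop.
# Nothing runs until retry_until_routes is called explicitly.

# The four workaround commands, exactly as listed in the report.
reset_dcos_net() {
    sudo systemctl stop dcos-net
    sudo rm -rf /var/lib/dcos/navstar/lashup/*
    sudo rm -rf /var/lib/dcos/navstar/mnesia/*
    sudo systemctl start dcos-net
}

# Count gateway routes on the overlay interface (vtep1024 in this report).
count_overlay_routes() {
    route -n | awk '$NF == "vtep1024" && $4 ~ /G/ { n++ } END { print n + 0 }'
}

# $1: expected number of overlay routes (assumption: known by the operator)
# $2: maximum number of resets to attempt
# $3: seconds to wait after each reset before re-checking
# Returns 0 once enough routes are present, 1 if the limit is reached.
retry_until_routes() {
    expected=$1
    max_tries=$2
    settle_secs=$3
    try=0
    while [ "$(count_overlay_routes)" -lt "$expected" ]; do
        if [ "$try" -ge "$max_tries" ]; then
            return 1
        fi
        try=$((try + 1))
        reset_dcos_net
        sleep "$settle_secs"
    done
    return 0
}
```

A call such as retry_until_routes 8 5 60 would reset dcos-net up to five times, waiting 60 seconds between checks, until eight overlay routes are present; all three values are placeholders to adjust per cluster.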

            People

            • Assignee: Deepak Goel (dgoel)
            • Reporter: John Smith (johnsmith)
            • Team: Networking Team
            • Watchers: Deepak Goel, John Smith