DCOS_OSS-4963

Phantom l4lb VIP backend after a node was rebooted using the maintenance feature

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Duplicate
    • Affects Version/s: DC/OS 1.12.3
    • Fix Version/s: None
    • Component/s: dcos-net, minuteman
    • Labels: None

      Description

      We recently upgraded from DC/OS 1.11.9 to 1.12.2.

       

      I recently rebooted a node (had to migrate from Azure unmanaged disks to managed). Before doing that I enabled maintenance mode & used the down-agent.sh script from the dcos-toolbox to drain the node. I did not start the conversion of the disks until I confirmed all Docker containers were down.
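      For reference, a minimal sketch of that last check (assuming plain docker ps was used; the exact invocation isn't recorded in this ticket):

      $ docker ps --format '{{.ID}}  {{.Names}}'
      # run on the agent being drained; empty output means nothing is left running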

       

      One of the containers running there was a NATS server with 3 l4lb addresses defined. After the maintenance our users started to complain about instability in our communications, and I noticed that IPVS still had the old backend enabled:

      [hrh@staging-we-private-004 ~]$ getent hosts stagingnats.marathon.l4lb.thisdcos.directory
      11.166.222.91   stagingnats.marathon.l4lb.thisdcos.directory
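      The IPVS state below was captured with ipvsadm; the exact invocation isn't shown here, but judging by the "--" separators it was presumably something along the lines of:

      $ sudo ipvsadm -Ln | grep -A 2 '11.166.222.91\|11.167.188.120'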

       

      TCP  11.166.222.91:4222 wlc
        -> 172.26.2.4:2062              Masq    1      0          0
        -> 172.26.2.5:2062              Masq    1      0          0
      TCP  11.166.222.91:6222 wlc
        -> 172.26.2.4:31986             Masq    1      0          0
        -> 172.26.2.5:31986             Masq    1      0          0
      TCP  11.166.222.91:8222 wlc
        -> 172.26.2.4:2063              Masq    1      0          0
        -> 172.26.2.5:2063              Masq    1      0          0
      TCP  11.167.188.120:4222 wlc
      --
      UDP  11.166.222.91:4222 wlc
        -> 172.26.2.4:2062              Masq    1      0          0
        -> 172.26.2.5:2062              Masq    1      0          0
      UDP  11.166.222.91:6222 wlc
        -> 172.26.2.4:31986             Masq    1      0          0
        -> 172.26.2.5:31986             Masq    1      0          0
      UDP  11.166.222.91:8222 wlc
        -> 172.26.2.4:2063              Masq    1      0          0
        -> 172.26.2.5:2063              Masq    1      0          0
      UDP  11.167.188.120:4222 wlc

      172.26.2.4 was the node that was restarted; the service currently runs on 172.26.2.5.

      Confirmation:

       

      [hrh@staging-we-private-004 ~]$ curl 172.26.2.4:2062
      curl: (7) Failed connect to 172.26.2.4:2062; Connection refused
      [hrh@staging-we-private-004 ~]$ curl 172.26.2.5:2062
      INFO {"server_id":"kFm4Hhq9umRWUnzUGQ0kRK","version":"1.0.2","go":"go1.7.6","host":"172.26.2.5","port":2062,"auth_required":false,"ssl_required":false,"tls_required":false,"tls_verify":false,"max_payload":100000000,"connect_urls":["172.26.2.24:31988"]}
      -ERR 'Unknown Protocol Operation'
      -ERR 'Parser Error'

       

       

      Querying the VIPs API:

       

      [hrh@staging-we-private-004 ~]$ curl -s http://localhost:62080/v1/vips|jq ".[]|select(.vip==\"stagingnats.marathon.l4lb.thisdcos.directory:4222\")"
      {
        "backend": [
          {
            "ip": "172.26.2.4",
            "port": 2062
          },
          {
            "ip": "172.26.2.5",
            "port": 2062
          }
        ],
        "protocol": "udp",
        "vip": "stagingnats.marathon.l4lb.thisdcos.directory:4222"
      }
      {
        "backend": [
          {
            "ip": "172.26.2.4",
            "port": 2062
          },
          {
            "ip": "172.26.2.5",
            "port": 2062
          }
        ],
        "protocol": "tcp",
        "vip": "stagingnats.marathon.l4lb.thisdcos.directory:4222"
      }
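      A convenience filter over the same endpoint to list every VIP that still references the rebooted agent (the jq expression is just illustrative, not from the original capture):

      $ curl -s http://localhost:62080/v1/vips | jq '.[] | select(any(.backend[]; .ip == "172.26.2.4")) | .vip'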

       

      I tried a few things:

      • Clearing out the Mnesia database & restarting dcos-net on all nodes (roughly as sketched below), to no avail; the entries remained.
      • Clearing out the IPVS table on all nodes and restarting dcos-net as well.
      • Removing the service (entries still remained in IPVS and the VIPs endpoint after the service was removed).
      • Forcing it to run on the 172.26.2.4 node and then moving it elsewhere; the 172.26.2.4 entry still remained.
      • Renaming the l4lb endpoint address.
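      For completeness, the first two attempts above were roughly along these lines (the Mnesia data directory is shown as a placeholder, not the actual path):

      $ sudo systemctl stop dcos-net
      $ sudo rm -rf "$MNESIA_DIR"   # MNESIA_DIR: placeholder for wherever dcos-net keeps its Mnesia/Lashup data on the node
      $ sudo ipvsadm -C             # flush the entire IPVS table
      $ sudo systemctl start dcos-net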

       

      I think that in 1.12 the L4LB functionality moved to the new Lashup-based mechanism? Is there any way for me to force these entries out? Currently we have the functionality back by switching to the .mesos entry for the service (since we use fixed ports); see the example below.
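      The .mesos workaround is simply resolving the task through Mesos-DNS instead of the VIP (the record name below is illustrative of the <task>.<framework>.mesos pattern; it only works for us because the service uses fixed host ports):

      $ getent hosts stagingnats.marathon.mesos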

      Anything that I can do to clear out this entry?

              People

              • Assignee: dgoel Deepak Goel
              • Reporter: halldorh halldorh
              • Team: Networking Team
              • Watchers: Chip Killmar, Deepak Goel, halldorh, mimmus