We recently upgraded from DC/OS 1.11.9 to 1.12.2.
I recently rebooted a node (I had to migrate it from Azure unmanaged disks to managed disks). Before doing that, I enabled maintenance mode and used the down-agent.sh script from the dcos-toolbox to drain the node. I did not start the disk conversion until I had confirmed that all Docker containers were down.
One of the containers on that node was running NATS with 3 l4lb addresses defined. After the maintenance, our users started to complain about instability in our communications, and I noticed that ipvs still had the old backend enabled:
[hrh@staging-we-private-004 ~]$ getent hosts stagingnats.marathon.l4lb.thisdcos.directory
172.26.2.4 is the node that was restarted; the service currently runs on 172.26.2.5.
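For anyone trying to check for the same symptom, here is a minimal sketch that scans the text output of `ipvsadm -L -n` for backends whose IP is not in a set of agents known to be running the service. The function name `find_stale_backends`, the sample output, and the VIP address 11.0.0.1 are all hypothetical and only for illustration; the two backend IPs are the ones from this report. It only detects stale entries, it does not remove them.

```python
import re

# Matches a virtual service line such as "TCP  11.0.0.1:4222 wlc".
SERVICE_RE = re.compile(r"^(TCP|UDP)\s+(\S+)\s+\S+")
# Matches a backend line such as "  -> 172.26.2.4:4222  Masq  1  0  0".
# The column header line ("-> RemoteAddress:Port ...") does not match
# because it has no numeric IPv4 address.
BACKEND_RE = re.compile(r"^\s*->\s+(\d+\.\d+\.\d+\.\d+):(\d+)\s")


def find_stale_backends(ipvs_output, live_agent_ips):
    """Return (vip, backend) pairs whose backend IP is not in live_agent_ips."""
    stale = []
    current_vip = None
    for line in ipvs_output.splitlines():
        service = SERVICE_RE.match(line)
        if service:
            current_vip = service.group(2)
            continue
        backend = BACKEND_RE.match(line)
        if backend and current_vip is not None:
            ip, port = backend.group(1), backend.group(2)
            if ip not in live_agent_ips:
                stale.append((current_vip, f"{ip}:{port}"))
    return stale


# Hypothetical sample in the layout printed by `ipvsadm -L -n`;
# the VIP 11.0.0.1 is made up, not taken from the real cluster.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  11.0.0.1:4222 wlc
  -> 172.26.2.4:4222              Masq    1      0          0
  -> 172.26.2.5:4222              Masq    1      0          0
"""

if __name__ == "__main__":
    # Only 172.26.2.5 currently runs the service, so 172.26.2.4 is flagged.
    for vip, backend in find_stale_backends(SAMPLE, {"172.26.2.5"}):
        print(f"stale backend {backend} behind VIP {vip}")
```

On a real node the input would be the captured output of `ipvsadm -L -n` (run as root) rather than the sample string.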
Querying the VIPs API:
I tried a few things:
- Clearing out the mnesia database and restarting dcos-net on all nodes. To no avail; the entries remain.
- Clearing out the ipvs table on all nodes and restarting dcos-net as well.
- Removing the service (the entries still remained in ipvs and in the VIPs endpoint after the service was removed).
- Forcing the service to run on the 172.26.2.4 node and then moving it elsewhere. The 172.26.2.4 entry still remained.
- Renaming the l4lb endpoint address.
I think that in 1.12 the L4LB functionality moved to the new lashup mechanism? Is there any way for me to force these entries out? For now we have restored functionality by switching to the .mesos entry for the service (which works because we use fixed ports).
Anything that I can do to clear out this entry?