Upgrading an existing 1.8.7 cluster to DCOS 1.9 broke VIP routing. All existing tasks using VIPs were assigned new virtual IP entries without the old ones being cleaned up; as a result, incoming requests were round-robined across zombie IPVS entries, leading to cluster-wide network failures.
1. Existing VIPs were mostly configured with the Docker containerizer in this format:
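For context, a 1.8-era Marathon VIP assignment with the Docker containerizer typically used a `VIP_0` label on a port mapping, roughly as follows (the app id, image, and ports below are placeholders, not this cluster's actual configuration):

```json
{
  "id": "/myapp",
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "myorg/myapp",
      "network": "BRIDGE",
      "portMappings": [
        {
          "containerPort": 5000,
          "labels": { "VIP_0": "/myapp:5000" }
        }
      ]
    }
  }
}
```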
2. Started upgrading to 1.9 from an existing 1.8.7 cluster of 1 master, 1 public slave, and 7 slaves in an on-prem environment. All nodes were running CentOS 7.3, kernel 4.10.5+, and all existing tasks were healthy.
3. Followed the upgrade guide: https://dcos.io/docs/1.9/administration/upgrading/ Upgraded nodes one by one, starting from the master -> public slave -> remaining slaves, waiting for each node to rejoin the cluster and show as Healthy in the DCOS web UI before proceeding to the next.
4. No tasks were killed during the upgrade process. Request failures started occurring shortly after the last node completed the upgrade: half of the requests going to existing VIPs failed because they were routed to zombie IPVS entries, as observed from:
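One way to see this class of problem on a node is to dump the kernel IPVS table with `ipvsadm -L -n` and look for multiple virtual services forwarding to the same task. The dump below is fabricated sample data for illustration, not output from the affected cluster:

```shell
# Fabricated `ipvsadm -L -n`-style dump: two virtual services (an old
# "zombie" VIP entry and the new one) both forward to the same backend task.
dump='TCP  11.136.231.106:5000 wlc
  -> 10.0.0.5:14000    Masq    1      0          0
TCP  11.219.12.44:5000 wlc
  -> 10.0.0.5:14000    Masq    1      0          0'

# Count how many virtual services route to that one real server; a healthy
# table would show exactly one.
dupes=$(printf '%s\n' "$dump" | grep -c -- '-> 10.0.0.5:14000')
echo "$dupes"
```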
5. Rebooting every node did not fix the issue.
6. A complete uninstallation and reinstallation of DCOS with:
also did not rectify the issue. The root cause appeared to be that the internal state of minuteman / navstar was already out of sync with the kernel IPVS table.
After getting support from @dgoel on the DCOS community Slack channel, it was concluded that the only way to fix this is to manually remove all zombie real-server entries from their virtual services using
This needs to be performed manually on every node in the cluster. The old API for obtaining VIP metadata from minuteman (http://localhost:61421/vips) no longer exists in 1.9, which made it impossible to automate this process by querying the source of truth.
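The manual cleanup step above can be sketched as follows; the addresses are placeholders (the real ones have to be read off each node's own `ipvsadm -L -n` output), and the `ipvsadm` invocation is only constructed here, not executed:

```shell
# Placeholder addresses -- substitute the values from `ipvsadm -L -n` on
# the node being cleaned. These are assumptions, not values from the report.
vip="11.136.231.106:5000"   # virtual service (VIP:port) holding the zombie
zombie="10.0.0.5:14000"     # stale real-server entry to remove

# ipvsadm -d deletes a real server (-r) from a TCP virtual service (-t).
cmd="ipvsadm -d -t ${vip} -r ${zombie}"
echo "$cmd"
```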