DC/OS / DCOS_OSS-871

[1.9 Upgrade] L4 Load Balancing / VIPs routing failures due to IP Virtual Server zombie entries

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Done
    • Affects Version/s: DC/OS 1.9.0
    • Fix Version/s: DC/OS 1.9.1
    • Component/s: minuteman

      Description

      Quick summary:

      Upgrading an existing 1.8.7 cluster to DC/OS 1.9 caused VIP routing failures. All existing tasks using VIPs were assigned new IPVS (IP Virtual Server) entries without the old ones being cleaned up. As a result, incoming requests were round-robined across both live and zombie IPVS entries, which led to cluster-wide network failures.

       

      Steps:

      1. Existing VIPs were mostly configured with the Docker containerizer, in this format:

      "docker": {
          "image": "xxx",
          "network": "BRIDGE",
          "portMappings": [{
              "containerPort": xxx,
              "hostPort": 0,
              "protocol": "tcp",
              "labels": {
                  "VIP_0": "/service_name_here:80"
              },
              "name": "default"
          }],
          "privileged": false,
          "forcePullImage": false
      }

      2. Started upgrading an existing 1.8.7 cluster to 1.9. The cluster had 1 master, 1 public slave and 7 slaves in an on-prem environment. All nodes were running CentOS 7.3, kernel 4.10.5+. All existing tasks were healthy.

      3. Followed the upgrade guide: https://dcos.io/docs/1.9/administration/upgrading/ Upgraded the nodes one by one, in the order master -> public slave -> remaining slaves. Waited for each node to rejoin the cluster and show as Healthy in the DC/OS web UI before proceeding to the next.

      4. No tasks were killed during the upgrade process. Request failures started occurring shortly after the last node completed the upgrade. Half of the requests going to existing VIPs failed because they were routed to zombie IPVS entries, as observed from:

       

      ipvsadm -l

       

      5. Rebooting every node did not fix it.

      6. A complete uninstallation and reinstallation of DCOS with:

      /opt/mesosphere/bin/pkgpanda uninstall
      rm -rf /etc/mesosphere /opt/mesosphere /var/lib/mesos /var/lib/dcos

      also did not rectify the issue. The root cause appeared to be that the internal state of minuteman / navstar was already out of sync with the kernel IPVS table.
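
The listing from step 4 can be turned into a per-VIP view of backends, which makes duplicated or stale entries easier to spot. The following is a minimal sketch, not part of the original report. It assumes the numeric form of the listing (`ipvsadm -L -n`) with IPv4 addresses only; the sample text and every address in it are made up for illustration, and `parse_ipvs_listing` is a hypothetical helper name.

```python
import re

# Made-up sample in the layout printed by `ipvsadm -L -n` (IPv4 only).
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  11.0.0.1:80 wlc
  -> 10.0.1.5:31000               Masq    1      0          0
  -> 10.0.1.9:31472               Masq    1      0          0
TCP  11.0.0.2:8080 wlc
  -> 10.0.2.3:20011               Masq    1      0          0
"""

# A virtual service line starts with the protocol, then address:port.
SERVICE_RE = re.compile(r"^(TCP|UDP)\s+(\S+:\d+)\s")
# A backend line starts with "->" followed by a numeric address:port.
# The column header line ("-> RemoteAddress:Port") does not match
# because the address must start with a digit.
BACKEND_RE = re.compile(r"^\s+->\s+(\d\S*:\d+)\s")


def parse_ipvs_listing(text):
    """Return {virtual_service: [backend, ...]} from a numeric IPVS listing."""
    services = {}
    current = None
    for line in text.splitlines():
        match = SERVICE_RE.match(line)
        if match:
            current = match.group(2)
            services[current] = []
            continue
        match = BACKEND_RE.match(line)
        if match and current is not None:
            services[current].append(match.group(1))
    return services


if __name__ == "__main__":
    for service, backends in parse_ipvs_listing(SAMPLE).items():
        print(service, "->", ", ".join(backends))
```

On a real node the text would come from running the listing command as root; a VIP that shows more backends than the service has running tasks is a candidate for zombie entries.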

       

      Workaround: 

      After getting support from @dgoel on the DC/OS community Slack channel, we concluded that the only way to fix this was to manually remove all zombie physical server entries from their virtual services using:

      ipvsadm -dt SERVICE_ENTRY_HERE -r ZOMBIE_PHYSICAL_ENTRY_HERE

      This needs to be performed manually on every node in the cluster. The old minuteman API for obtaining VIP metadata (http://localhost:61421/vips) no longer exists in 1.9, so there is no source of truth to query, which made it impossible to automate this process.
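
Once a list of live backends has been assembled by hand, the removal commands themselves can at least be generated rather than typed. The sketch below is an illustration under stated assumptions, not the fix shipped in 1.9.1: the `services` mapping and the `live_backends` set are hypothetical inputs that the operator must supply (for example from the IPVS listing and from the task list), all addresses are made up, and only TCP services (`-t`) are handled. It prints the commands as a dry run and does not execute anything.

```python
import shlex


def zombie_removal_commands(services, live_backends):
    """Build one `ipvsadm -d` command per backend that is not known to be live.

    services:      {virtual_service: [backend, ...]}, e.g. taken from the IPVS listing
    live_backends: set of "ip:port" strings the operator has verified as running
    """
    commands = []
    for service, backends in sorted(services.items()):
        for backend in backends:
            if backend not in live_backends:
                commands.append(
                    "ipvsadm -d -t {} -r {}".format(
                        shlex.quote(service), shlex.quote(backend)
                    )
                )
    return commands


# Hypothetical example data; replace with values gathered on the node.
EXAMPLE_SERVICES = {
    "11.0.0.1:80": ["10.0.1.5:31000", "10.0.1.9:31472"],
    "11.0.0.2:8080": ["10.0.2.3:20011"],
}
EXAMPLE_LIVE = {"10.0.1.5:31000", "10.0.2.3:20011"}

if __name__ == "__main__":
    # Dry run: review the output before running any of it as root.
    for command in zombie_removal_commands(EXAMPLE_SERVICES, EXAMPLE_LIVE):
        print(command)
```

Generating commands for review, instead of deleting entries directly, keeps a human check in the loop, which matters here because the set of live backends is itself assembled by hand.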

            People

            • Assignee:
              dgoel Deepak Goel
            • Reporter:
              nktpro Jacky Nguyen
            • Watchers:
              Avinash Sridharan (Inactive), Bekir Dogan, Deepak Goel, Jacky Nguyen, mimmus, setekhid
