DC/OS / DCOS_OSS-617

Backends are not removed from VIP table when killed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: DC/OS 1.8.8
    • Fix Version/s: DC/OS 1.9.1
    • Component/s: minuteman

      Description

      Environment: DC/OS 1.8.7 on Azure, installed using ACS Engine-modified ARM templates (Ubuntu 16.04, kernel 4.4.0-28-generic).

      After adding a Docker containeriser-based app that uses the default `dcos` overlay network and is assigned a VIP via a label in `portMappings` (see the attached my-group.json for the full app config; the affected app is /my-app/backend), VIP assignment works and traffic gets routed correctly to the overlay-network IPs of its tasks, using both `SERVICE_LABEL.l4lb.thisdcos.directory:$VIP_PORT` and `IP:$VIP_PORT`. The issue appears when tasks are removed (properly killed), whether by scaling up and then down, scaling down the initial 3 instances, or removing the app entirely: backends are never removed from the VIP, so requests directed to the no-longer-existing backends fail, although `total_failures` continues to be tracked for each backend.
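
      For reference, the VIP is assigned with a `VIP_0` label on the container's port mapping. A minimal sketch of the relevant stanza, in the Marathon 1.3-era format shipped with DC/OS 1.8 (the port 3000 and VIP name ctrl1 match the outputs below; everything else is illustrative, the authoritative definition is the attached my-group.json):

      "container": {
        "type": "DOCKER",
        "docker": {
          "network": "USER",
          "portMappings": [
            { "containerPort": 3000, "labels": { "VIP_0": "/ctrl1:3000" } }
          ]
        }
      },
      "ipAddress": { "networkName": "dcos" }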

      Steps to reproduce

      1. Before launching the app, check the VIPs on any cluster node (results are the same across the cluster; I've tested on different nodes and node types):

      curl -s http://localhost:61421/vips | python -m json.tool
      {
          "vips": {}
      }
      

      2. Launching the app (`dcos marathon group add my-group.json`) creates the VIP, and digging the DNS name gives (only the relevant sections shown):
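
      The name can be resolved from any node roughly like this (a sketch; the exact dig invocation isn't recorded here):

      dig +noall +question +answer ctrl1.marathon.l4lb.thisdcos.directory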

      ;; QUESTION SECTION:
      ;ctrl1.marathon.l4lb.thisdcos.directory.	IN A
      
      ;; ANSWER SECTION:
      ctrl1.marathon.l4lb.thisdcos.directory.	5 IN A	11.157.241.22
      

      The vips endpoint will show the backends. After a few requests:

      {
          "vips": {
              "11.157.241.22:3000": {
                  "9.0.5.133:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 3
                  },
                  "9.0.6.133:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 2
                  },
                  "9.0.7.134:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 1
                  }
              }
          }
      }
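
      To watch just the set of backends registered for the VIP (rather than the full stats), something like this works on any node (a sketch, assuming jq is available; otherwise pipe through `python -m json.tool` as above):

      curl -s http://localhost:61421/vips | jq '.vips["11.157.241.22:3000"] | keys'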
      

      The status of the group (all apps have their health checks in place) will be:

      $ dcos marathon app list
      ID                MEM  CPUS  TASKS  HEALTH  DEPLOYMENT  CONTAINER  CMD
      /my-app/frontend  100  0.1   3/3    3/3     ---         DOCKER     None
      /my-app/backend   100  0.1   3/3    3/3     ---         DOCKER     None
      

      3. Scaling up, and all continues well...
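
      One way to scale the app up, as a sketch (the exact method used isn't recorded in this ticket):

      dcos marathon app update /my-app/backend instances=4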

      /my-app/backend          100   0.1    4/4    4/4       ---        DOCKER   None 
      
      {
          "vips": {
              "11.157.241.22:3000": {
                  "9.0.5.133:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 4
                  },
                  "9.0.5.135:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 2
                  },
                  "9.0.6.133:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 5
                  },
                  "9.0.7.134:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 2
                  }
              }
          }
      }
      

      4. Scale down, and the issue appears.

      Though all tasks appear healthy, querying our backend endpoint now intermittently fails with:

      user@dcos-master-17841738-0:~$ curl -I http://ctrl1.marathon.l4lb.thisdcos.directory:3000/api/up
      curl: (7) Failed to connect to ctrl1.marathon.l4lb.thisdcos.directory port 3000: No route to host
      

      (see failed_requests.txt for more)
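
      The failures are intermittent (a request fails whenever it is load-balanced to one of the stale backends), so a quick way to surface them is to hit the endpoint in a loop, roughly like this (a sketch):

      # requests that land on a stale backend fail (curl: (7) ... No route to host)
      for i in $(seq 1 20); do
        curl -sS -o /dev/null -w '%{http_code}\n' --max-time 2 \
          http://ctrl1.marathon.l4lb.thisdcos.directory:3000/api/up
      done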

      Expected

      The VIPs table should return to what we had prior to scaling up, i.e. the three backends shown in step 2.

      Current result

      The VIPs table looks like this:

      {
          "vips": {
              "11.157.241.22:3000": {
                  "9.0.5.133:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 3,
                      "total_sucesses": 4
                  },
                  "9.0.5.135:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 4
                  },
                  "9.0.6.133:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 8
                  },
                  "9.0.7.134:3000": {
                      "is_healthy": true,
                      "latency_last_60s": {},
                      "pending_connections": 0,
                      "total_failures": 0,
                      "total_sucesses": 6
                  }
              }
          }
      }
      

      Notice the `total_failures` on the first backend. What's more, failures become more common the more backends have ever been registered and since removed (for instance during a zero-downtime deployment), or when IP allocation later hands another app an overlay IP that is still registered as a backend, in which case requests are directed to a different task.
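
      A quick way to pick out the backends that are accumulating failures (a sketch, assuming jq):

      curl -s http://localhost:61421/vips | \
        jq '.vips["11.157.241.22:3000"] | with_entries(select(.value.total_failures > 0))'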

      Unfortunately, removing the group/app does not clear the VIP's backends either: the vips output above is unchanged after complete removal of the group.
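
      To confirm, the check after removing the group is simply (a sketch):

      dcos marathon group remove /my-app
      # wait for the deployment to settle, then on any node:
      curl -s http://localhost:61421/vips | python -m json.tool
      # the stale 11.157.241.22:3000 entry and its backends are still listed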

        Attachments

        1. failed_requests.txt (2 kB)
        2. mesos-slave.log (11 kB)
        3. minuteman_master_slave.log (2 kB)
        4. my-group.json (5 kB)
        5. nginx-a-marathon.json (1 kB)
        6. nginx-b.marathon.json (1 kB)


            People

            • Assignee: Deepak Goel (dgoel)
            • Reporter: Oscar Craviotto (ocraviotto, Inactive)
            • Team: Networking Team
            • Watchers: Albert Strasheim (Inactive), Bekir Dogan, Deepak Goel, fksimon, Marco Reni, Oscar Craviotto (Inactive), sfwn
