Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: minuteman
    • Labels:

      Description

      In one of our running DC/OS clusters (v1.9.0) we've had some trouble with the l4lb. One of our services has /waldo/web:3000 as its VIP_0 label. This hasn't caused any issues in the past, but recently we've found that about 50% of the requests to it fail with "connection refused".

      I've dug into how the l4lb works. I'm pretty limited in my ability to read Erlang, but my understanding is that either Navstar or Minuteman manipulates the Linux IPVS module to add entries in the 11.0.0.0/8 subnet that map to actual container ip:port combinations in the DC/OS cluster. Whichever service administers those entries somehow fetches updates from Mesos.

      In our case the IP address of the waldoweb.marathon.l4lb.thisdcos.directory domain works out to 11.42.69.221 (0B2A45DD in hex). I can inspect the state of IPVS by running "cat /proc/net/ip_vs" on CoreOS. Doing so with a bit of grepping shows this:

      TCP 0B2A45DD:0BB8 wlc
        -> 0A0001A0:3396 Masq 1 0 0
        -> 0A000099:2DD8 Masq 1 0 0

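      0A0001A0:3396 (10.0.1.160:13206) is a valid container address in the cluster.

      0A000099:2DD8 (10.0.0.153:11736), however, belongs to a container that failed about 4 days ago. With two equally weighted backends, that would presumably explain why roughly half of the requests are refused.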
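
      For anyone checking these entries by hand, here is a minimal Python sketch of the hex decoding used above. It assumes each ADDR:PORT field is plain big-endian hex, which is what the values in this report suggest; the function name decode_ipvs_endpoint is made up for illustration and is not part of any DC/OS tooling.

```python
import ipaddress
from typing import Tuple


def decode_ipvs_endpoint(endpoint: str) -> Tuple[str, int]:
    """Decode an 'ADDR:PORT' hex pair (as seen in /proc/net/ip_vs) to (ip, port).

    Assumption: both fields are big-endian hex, e.g. '0B2A45DD:0BB8'.
    """
    addr_hex, port_hex = endpoint.strip().split(":")
    ip = str(ipaddress.IPv4Address(int(addr_hex, 16)))
    port = int(port_hex, 16)
    return ip, port


# The three endpoints quoted in this report.
for entry in ("0B2A45DD:0BB8", "0A0001A0:3396", "0A000099:2DD8"):
    ip, port = decode_ipvs_endpoint(entry)
    print(f"{entry} -> {ip}:{port}")
```

      Run against the three endpoints quoted here, it gives 11.42.69.221:3000 for the VIP and 10.0.1.160:13206 and 10.0.0.153:11736 for the two backends.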

       

      So it seems that Minuteman/Navstar has gotten out of sync. I tried restarting the service and scaling it up and down, but nothing seems to clear out the old container's remote address. Any ideas how to fix it? Do you know how the system may have gotten into this state, so we can avoid it happening again?

            People

            • Assignee:
              dgoel Deepak Goel
              Reporter:
              whoward whoward
              Team:
              Networking Team
              Watchers:
              Deepak Goel, Marian Zange, whoward

              Dates

              • Created:
                Updated:
                Resolved: