  1. DC/OS
  2. DCOS_OSS-3595

dcos-net Fails to Recurse Upstream Resolvers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: High
    • Resolution: Done
    • Affects Version/s: DC/OS 1.11.1
    • Fix Version/s: None
    • Component/s: dcos-net, spartan
    • Labels:

      Description

      We have two DC/OS clusters, one in each of our AWS accounts. After upgrading to 1.11.0 in development and 1.11.1 in production, we noticed that our services were occasionally taking much longer (on the order of seconds) to respond.

      After digging into this for a day, one of our engineers found that DNS lookups to the upstream resolvers in the private subnets of the VPCs were failing to resolve every few seconds:

      [root@dcos-agent]# dig +retry=0 @198.51.100.1 in a <redacted>.us-east-1.<redacted>.b5l
      
      ; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.3 <<>> +retry=0 @198.51.100.1 in a <redacted>.us-east-1.<redacted>.b5l
      ; (1 server found)
      ;; global options: +cmd
      ;; connection timed out; no servers could be reached
       

      This happens for both the private zone in the VPC (request example above) as well as publicly resolvable domains:

      [root@dcos-agent centos]# dig +retry=0 @198.51.100.1 in a google.com
      
      ; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.3 <<>> +retry=0 @198.51.100.1 in a google.com
      ; (1 server found)
      ;; global options: +cmd
      ;; connection timed out; no servers could be reached
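
A minimal sketch of how the intermittent failure could be quantified, assuming Python 3 is available on the agent. It sends single UDP A-record queries with no retry (similar in spirit to `dig +retry=0`) and counts how many go unanswered. The helper names `build_query`, `probe_once`, `probe`, and `summarize` are hypothetical and not part of DC/OS, and the 1-second timeout and 30-probe count are arbitrary choices.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=A (1), QCLASS=IN (1)
    return header + qname + struct.pack("!HH", 1, 1)


def probe_once(server, name, timeout=1.0, port=53):
    """Send one query without retrying; return latency in seconds or None on failure.

    Any reply datagram counts as success; the response is not validated.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name), (server, port))
            sock.recvfrom(512)
        except OSError:  # includes timeouts and other socket errors
            return None
        return time.monotonic() - start


def summarize(latencies):
    """Count failed probes (None entries) and compute the failure rate."""
    total = len(latencies)
    failures = sum(1 for value in latencies if value is None)
    rate = failures / total if total else 0.0
    return {"total": total, "failures": failures, "failure_rate": rate}


def probe(server, name, count=30, interval=1.0):
    """Probe the server `count` times, pausing `interval` seconds between probes."""
    results = []
    for _ in range(count):
        results.append(probe_once(server, name))
        time.sleep(interval)
    return summarize(results)
```

Calling `probe("198.51.100.1", "google.com")` on an affected agent would return a dictionary with the number of probes, the number of failures, and the failure rate, which makes it easier to compare nodes or versions than reading individual dig runs.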
      

      I've verified our private resolvers are configured properly in dcos-dns.json:

      {
        "upstream_resolvers": ["<redacted>", "<redacted>", "<redacted>"],
        "udp_port": 53,
        "tcp_port": 53,
        "bind_ip_blacklist": ["198.51.100.4"],
        "forward_zones": {}
      }
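
A small sketch of the kind of sanity check that could be run against this file, assuming the resolver entries are expected to be plain IP addresses and that only the keys shown above matter. The function name `check_dns_config` is hypothetical, and the check only covers the shape of the configuration, not whether the resolvers are reachable.

```python
import ipaddress
import json


def check_dns_config(text):
    """Return a list of human-readable problems found in a dcos-dns.json document."""
    problems = []
    config = json.loads(text)

    # Assumption: at least one upstream resolver must be listed
    resolvers = config.get("upstream_resolvers", [])
    if not resolvers:
        problems.append("upstream_resolvers is empty or missing")

    # Assumption: each resolver entry is a literal IPv4 or IPv6 address
    for entry in resolvers:
        try:
            ipaddress.ip_address(entry)
        except ValueError:
            problems.append(f"upstream resolver is not an IP address: {entry!r}")

    # Both listener ports should be integers in the valid port range
    for key in ("udp_port", "tcp_port"):
        port = config.get(key)
        if not isinstance(port, int) or not 1 <= port <= 65535:
            problems.append(f"{key} is not a valid port: {port!r}")

    return problems
```

An empty list means the file passed these basic checks. Passing in the redacted example above would report each `<redacted>` placeholder as a non-IP entry, which is expected.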
      

      On my leading master and on the agent where I ran the dig queries, the systemd unit dcos-net.service has no logs in the journal. I'm running the default logging configuration for DC/OS and I'm not sure where those logs are sent by default. They would be handy for debugging this further.

            People

            • Assignee: Sergey Urbanovich (sergeyurbanovich)
            • Reporter: Jeff Malnick (jeffmalnick)
            • Team: Networking Team
            • Watchers: Deepak Goel, Jack Angers, Jeff Malnick, Judith Malnick (Inactive), Sergey Urbanovich