We have two DC/OS clusters, one in each of our AWS accounts. After upgrading to 1.11.0 in development and 1.11.1 in production, we noticed that our services were occasionally taking much longer (up to several seconds) to respond.
After digging into this for a day, one of our engineers found that DNS lookups to the upstream resolvers in the private subnets of the VPCs were failing to resolve every few seconds:
This happens both for the private zone in the VPC (request example above) and for publicly resolvable domains:
I've verified our private resolvers are configured properly in dcos-dns.json:
On my leading master and on the agent where I ran the dig queries, the systemd unit dcos-net.service has no logs in the journal. I'm running the default logging configuration for DC/OS, and I'm not sure where those logs are sent by default. They would be handy for debugging this further.
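To put numbers on how often the lookups described above stall, a small probe along these lines could be run on an affected agent. This is only a sketch, assuming Python 3 is available on the node. The function names `probe` and `summarize`, the one-second `slow_after` threshold, and the default attempt count are all illustrative choices, not anything taken from DC/OS. Note that `socket.gethostbyname` goes through whatever resolver the host is configured with, so it measures the full resolution path rather than the upstream resolvers directly.

```python
import socket
import time


def probe(hostname, attempts=20, interval=0.5, slow_after=1.0,
          resolve=socket.gethostbyname, clock=time.monotonic,
          sleep=time.sleep):
    """Resolve `hostname` repeatedly, timing each lookup.

    `resolve`, `clock`, and `sleep` are injectable so the loop can be
    exercised without touching the network.
    """
    results = []
    for attempt in range(attempts):
        start = clock()
        try:
            resolve(hostname)
            error = None
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            error = str(exc)
        elapsed = clock() - start
        results.append({
            "attempt": attempt,
            "seconds": elapsed,
            "error": error,
            # Flag lookups that took at least `slow_after` seconds.
            "slow": elapsed >= slow_after,
        })
        sleep(interval)
    return results


def summarize(results):
    """Count lookups that were slow or failed, and report the worst time."""
    bad = [r for r in results if r["slow"] or r["error"]]
    return {
        "total": len(results),
        "bad": len(bad),
        "worst_seconds": max((r["seconds"] for r in results), default=0.0),
    }
```

Calling `summarize(probe(name))` with a name from the private zone and then with a public domain would show whether both paths stall at a similar rate, which could be compared between a 1.11 node and a node on the previous version.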