DC/OS / DCOS_OSS-2109

Spartan uses all available CPU talking to itself

    Details

    • Sprint:
      Networking Team 1.11 Sprint 12
    • Story Points:
      1

      Description

      We have found several instances of spartan using all available CPU while cluster load was very low. On further investigation, we discovered a very high volume of localhost UDP traffic with both source and destination ports set to 62053.

      Wireshark showed this to be DNS traffic with a large number of duplicate transaction IDs:

      root@prd-ge-controller03:~# tshark -i lo -c 100 -f 'udp src port 62053 and dst port 62053' -d 'udp.port==62053,dns' -O dns 2> /dev/null | grep Transaction | sort | uniq -c | head -n 5
      5 Transaction ID: 0x07f7
      4 Transaction ID: 0x1b3d
      6 Transaction ID: 0x3172
      5 Transaction ID: 0x32f7
      4 Transaction ID: 0x3ebe
      

      When I dug into the spartan and erl-dns code, I found that erl-dns doesn't check the QR flag in the DNS message header and thus happily treats a response message as if it were a query. This means that when it replies to a query whose source port is the same UDP port it's listening on, the reply is delivered back to itself and handled as a new query. It then keeps replying to its own responses as fast as the network and CPU will allow.
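
      The missing check can be illustrated with a minimal sketch in Python (not the erl-dns code, which is Erlang). It assumes only the standard DNS header layout from RFC 1035, where QR is the top bit of the 16-bit flags field. The names `is_response`, `make_header`, and `bounce` are hypothetical, and `bounce` is a toy in-memory model of a responder whose replies are delivered back to itself, not a real UDP server.

```python
import struct

QR_MASK = 0x8000  # top bit of the 16-bit flags field: 0 = query, 1 = response


def is_response(packet: bytes) -> bool:
    """Return True if the DNS header marks this message as a response (QR=1)."""
    if len(packet) < 12:
        raise ValueError("a DNS header is 12 bytes; packet too short")
    _txid, flags = struct.unpack("!HH", packet[:4])
    return bool(flags & QR_MASK)


def make_header(txid: int, response: bool) -> bytes:
    """Build a bare 12-byte DNS header with all section counts set to zero."""
    flags = QR_MASK if response else 0
    return struct.pack("!HHHHHH", txid, flags, 0, 0, 0, 0)


def bounce(packet: bytes, check_qr: bool, max_hops: int = 1000) -> int:
    """Count the replies sent when every reply is delivered back to the sender.

    Toy model only: max_hops caps the loop so the unchecked case terminates.
    """
    hops = 0
    while hops < max_hops:
        if check_qr and is_response(packet):
            break  # drop responses instead of answering them
        (txid,) = struct.unpack("!H", packet[:2])
        packet = make_header(txid, response=True)  # the reply keeps the same ID
        hops += 1
    return hops


if __name__ == "__main__":
    query = make_header(0x07F7, response=False)
    print("replies without QR check:", bounce(query, check_qr=False))
    print("replies with QR check:   ", bounce(query, check_qr=True))
```

      In this model the unchecked responder keeps answering until the hop cap is reached, while the checked one sends a single reply and then discards it when it comes back. The unchanged transaction ID on every hop is consistent with the duplicate IDs seen in the capture.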

      I don't have any insight into how the query loops started in the first place, but the server I was investigating had 22 distinct transaction IDs bouncing at a rate of about 8k queries per second.

              People

              • Assignee:
                Sergey Urbanovich (sergeyurbanovich)
              • Reporter:
                Jeremy Thurgood (jerith)
              • Team:
                Networking Team
              • Watchers:
                Bekir Dogan, Deepak Goel, Jeremy Thurgood, Sergey Urbanovich
