DC/OS / DCOS_OSS-4467

Sometimes K8s pods on a public agent cannot connect to pods on private agents

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: High
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: networking
    • Labels: None
    • Sprint: Networking: RI-10 Sprint 38
    • Story Points: 5

      Description

      I’m triaging an issue where Kubernetes pods running on private agents are unable to communicate with Kubernetes pods running on a public agent. This is a flaky issue that, so far, has only occurred on DC/OS 1.12 Open.

      From a pod running on a private agent we can ping the m-dcos interface on the public agent:

      root@45d8408d-0f3c-4599-861a-0673d2ed2579:/mnt/mesos/sandbox# ping 9.0.0.1
      PING 9.0.0.1 (9.0.0.1) 56(84) bytes of data.
      64 bytes from 9.0.0.1: icmp_seq=1 ttl=63 time=0.336 ms
      64 bytes from 9.0.0.1: icmp_seq=2 ttl=63 time=0.382 ms
      ^C
      --- 9.0.0.1 ping statistics ---
      2 packets transmitted, 2 received, 0% packet loss, time 1003ms
      rtt min/avg/max/mdev = 0.336/0.359/0.382/0.023 ms
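
      To confirm that this working path actually rides the DC/OS overlay, one could resolve the route on the private agent host. This is a diagnostic sketch, not output captured from the affected cluster:

      # On the private agent host: traffic to the public agent's m-dcos
      # address should resolve through the overlay VTEP (vtep1024).
      ip route get 9.0.0.1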
      

      A ping from a pod on a private agent to a pod on a public agent fails:

      root@45d8408d-0f3c-4599-861a-0673d2ed2579:/mnt/mesos/sandbox# ping 192.168.5.1 
      PING 192.168.5.1 (192.168.5.1) 56(84) bytes of data.
      ^C
      --- 192.168.5.1 ping statistics ---
      7 packets transmitted, 0 received, 100% packet loss, time 6133ms
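
      To localize where the reply is lost, one could capture on both overlay layers of the private agent while the ping runs. A sketch, with interface names taken from the route tables below:

      # Does the IPIP-encapsulated echo reply ever reach the private agent?
      tcpdump -n -p -i tunl0 icmp
      # One layer down, on the DC/OS overlay VTEP device:
      tcpdump -n -p -i vtep1024 icmp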
      

      tcpdump on the public agent shows that it is sending the replies, but nothing reaches the other side:

      root@vm-lk6f:/mnt/mesos/sandbox# tcpdump -n -p -i any icmp 
      tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
      listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
      18:26:22.164787 IP 192.168.0.1 > 192.168.5.1: ICMP echo request, id 20881, seq 1, length 64
      18:26:22.164834 IP 192.168.5.1 > 192.168.0.1: ICMP echo reply, id 20881, seq 1, length 64
      18:26:23.178036 IP 192.168.0.1 > 192.168.5.1: ICMP echo request, id 20881, seq 2, length 64
      18:26:23.178085 IP 192.168.5.1 > 192.168.0.1: ICMP echo reply, id 20881, seq 2, length 64
      18:26:24.201986 IP 192.168.0.1 > 192.168.5.1: ICMP echo request, id 20881, seq 3, length 64
      18:26:24.202027 IP 192.168.5.1 > 192.168.0.1: ICMP echo reply, id 20881, seq 3, length 64
      18:26:25.226082 IP 192.168.0.1 > 192.168.5.1: ICMP echo request, id 20881, seq 4, length 64
      18:26:25.226129 IP 192.168.5.1 > 192.168.0.1: ICMP echo reply, id 20881, seq 4, length 64
      18:26:26.250005 IP 192.168.0.1 > 192.168.5.1: ICMP echo request, id 20881, seq 5, length 64
      18:26:26.250045 IP 192.168.5.1 > 192.168.0.1: ICMP echo reply, id 20881, seq 5, length 64
      18:26:27.274026 IP 192.168.0.1 > 192.168.5.1: ICMP echo request, id 20881, seq 6, length 64
      18:26:27.274070 IP 192.168.5.1 > 192.168.0.1: ICMP echo reply, id 20881, seq 6, length 64
      18:26:28.298088 IP 192.168.0.1 > 192.168.5.1: ICMP echo request, id 20881, seq 7, length 64
      18:26:28.298137 IP 192.168.5.1 > 192.168.0.1: ICMP echo reply, id 20881, seq 7, length 64
      ^C
      14 packets captured
      20 packets received by filter
      0 packets dropped by kernel
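
      Since the replies leave the pod's network namespace but never arrive, the next question is how the public agent routes them back. A sketch for walking the return path hop by hop, using the addresses from the capture above:

      # On the public agent: how is the echo reply routed back?
      ip route get 192.168.0.1   # expected: via 9.0.4.3 dev tunl0 (Calico IPIP)
      # ...and how is the IPIP outer destination reached?
      ip route get 9.0.4.3       # expected: via 44.128.0.5 dev vtep1024 (DC/OS overlay)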
      

      Routes in the pod running on the private agent:

      root@45d8408d-0f3c-4599-861a-0673d2ed2579:/mnt/mesos/sandbox# ip route
      default via 9.0.4.1 dev eth0 
      9.0.4.0/25 dev eth0 proto kernel scope link src 9.0.4.3 
      172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
      blackhole 192.168.0.0/24 proto bird 
      192.168.1.0/24 via 9.0.2.3 dev tunl0 proto bird onlink 
      192.168.2.0/24 via 9.0.3.3 dev tunl0 proto bird onlink 
      192.168.3.0/24 via 9.0.2.4 dev tunl0 proto bird onlink 
      192.168.4.0/24 via 9.0.4.5 dev tunl0 proto bird onlink 
      192.168.5.0/24 via 9.0.0.1 dev tunl0 proto bird onlink 
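
      So pod traffic to 192.168.5.0/24 is IPIP-encapsulated towards 9.0.0.1, the public agent's m-dcos address, which is in turn carried by the DC/OS overlay. A quick sanity check that the IPIP device is up on both agents (a sketch, not output from this cluster):

      # tunl0 carries Calico's IPIP traffic; it should be UP and in ipip mode.
      ip -d link show tunl0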
      

      Routes on the public agent:

      root@vm-lk6f:/mnt/mesos/sandbox# ip route
      default via 10.138.0.1 dev ens4v1 proto dhcp src 10.138.0.7 metric 1024 
      default via 10.138.0.1 dev ens4v1 proto dhcp metric 1024 
      9.0.0.0/25 dev m-dcos proto kernel scope link src 9.0.0.1 
      9.0.0.128/25 dev d-dcos proto kernel scope link src 9.0.0.129 linkdown 
      9.0.1.0/24 via 44.128.0.2 dev vtep1024 
      9.0.2.0/24 via 44.128.0.3 dev vtep1024 
      9.0.3.0/24 via 44.128.0.4 dev vtep1024 
      9.0.4.0/24 via 44.128.0.5 dev vtep1024 
      10.138.0.1 dev ens4v1 proto dhcp scope link src 10.138.0.7 metric 1024 
      10.138.0.1 dev ens4v1 proto dhcp metric 1024 
      44.128.0.0/20 dev vtep1024 proto kernel scope link src 44.128.0.1 
      172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
      172.18.0.0/16 dev d-dcos6 proto kernel scope link src 172.18.0.1 linkdown 
      192.168.0.0/24 via 9.0.4.3 dev tunl0 proto bird onlink 
      192.168.1.0/24 via 9.0.2.3 dev tunl0 proto bird onlink 
      192.168.2.0/24 via 9.0.3.3 dev tunl0 proto bird onlink 
      192.168.3.0/24 via 9.0.2.4 dev tunl0 proto bird onlink 
      192.168.4.0/24 via 9.0.4.5 dev tunl0 proto bird onlink 
      blackhole 192.168.5.0/24 proto bird 
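
      The return route itself looks correct: replies to 192.168.0.1 go via 9.0.4.3 over tunl0, and 9.0.4.3 should be reachable via 44.128.0.5 on vtep1024. For that last hop to work, though, the static ARP and FDB entries for 44.128.0.5 must have been programmed by dcos-net, and given the error below they may be missing. A sketch for verifying, not output captured from this cluster:

      # On the public agent: are the overlay L2 entries for the private
      # agent's VTEP (44.128.0.5) present?
      ip neigh show dev vtep1024
      bridge fdb show dev vtep1024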
      

      Initially I suspected iptables rules, but the following appears in the dcos-net logs on the public agent (10.138.0.7):

      Nov 07 17:21:11 vm-lk6f.c.massive-bliss-781.internal dcos-net-env[2709]: 2018-11-07T17:21:11.116867+00:00 error: Error in process <0.23191.166> on node 'navstar@10.138.0.7' with exit value:, {{badmatch,{error,ehostunreach,[]}},[{dcos_overlay_configure,configure_overlay_entry,4,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,104}]},{lists,foreach,2,[{file,"lists.erl"},{line,1338}]},{dcos_overlay_configure,maybe_configure,2,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,35}]}]}
      Nov 07 17:21:11 vm-lk6f.c.massive-bliss-781.internal dcos-net-env[2709]: 2018-11-07T17:21:11.117213+00:00 error: supervisor: {local,dcos_overlay_sup}, errorContext: child_terminated, reason: {{badmatch,{error,ehostunreach,[]}},[{dcos_overlay_configure,configure_overlay_entry,4,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,104}]},{lists,foreach,2,[{file,"lists.erl"},{line,1338}]},{dcos_overlay_configure,maybe_configure,2,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,35}]}]}, offender: [{pid,<0.23210.166>},{id,dcos_overlay_lashup_kv_listener},{mfargs,{dcos_overlay_lashup_kv_listener,start_link,[]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
      Nov 07 17:21:11 vm-lk6f.c.massive-bliss-781.internal dcos-net-env[2709]: 17:21:11.117 [error] Error in process <0.23191.166> on node 'navstar@10.138.0.7' with exit value:
      Nov 07 17:21:11 vm-lk6f.c.massive-bliss-781.internal dcos-net-env[2709]: {{badmatch,{error,ehostunreach,[]}},[{dcos_overlay_configure,configure_overlay_entry,4,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,104}]},{lists,foreach,2,[{file,"lists.erl"},{line,1338}]},{dcos_overlay_configure,maybe_configure,2,[{file,"/pkg/src/dcos-net/_build/prod/lib/dcos_overlay/src/dcos_overlay_configure.erl"},{line,35}]}]}
      Nov 07 17:21:11 vm-lk6f.c.massive-bliss-781.internal dcos-net-env[2709]: 17:21:11.117 [error] Supervisor dcos_overlay_sup had child dcos_overlay_lashup_kv_listener started with dcos_overlay_lashup_kv_listener:start_link() at <0.23210.166> exit with reason no match of right hand value {error,ehostunreach,[]} in dcos_overlay_configure:configure_overlay_entry/4 line 104 in context child_terminated
      Nov 07 17:21:16 vm-lk6f.c.massive-bliss-781.internal dcos-net-env[2709]: 17:21:16.120 [notice] Overlay configuration was gossiped, fd01:b::/64 => #{"fd01:a::1/64" => #{agent_ip => "10.138.0.7",mac => "70:B3:D5:80:00:01",subnet => "fd01:b::/80"},"fd01:a::2/64" => #{agent_ip => "10.138.0.3",mac => "70:B3:D5:80:00:02",subnet => "fd01:b:0:0:1::/80"},"fd01:a::3/64" => #{agent_ip => "10.138.0.4",mac => "70:B3:D5:80:00:03",subnet => "fd01:b:0:0:2::/80"},"fd01:a::4/64" => #{agent_ip => "10.138.0.6",mac => "70:B3:D5:80:00:04",subnet => "fd01:b:0:0:3::/80"},"fd01:a::5/64" => #{agent_ip => "10.138.0.5",mac => "70:B3:D5:80:00:05",subnet => "fd01:b:0:0:4::/80"}}
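
      The badmatch on {error,ehostunreach} in dcos_overlay_configure:configure_overlay_entry/4 suggests a netlink operation failed part-way through, so the overlay on this agent may be only partially configured, which would explain the one-way traffic. If that hypothesis holds, restarting dcos-net should reprogram the entries; a sketch:

      # On the public agent: re-run the overlay configuration and watch for
      # a clean pass (no further badmatch/ehostunreach errors).
      systemctl restart dcos-net
      journalctl -u dcos-net -f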
      

            People

            • Assignee: dgoel Deepak Goel
            • Reporter: simaoreis Simao Reis
            • Team: Networking Team
            • Watchers: Deepak Goel, Orlando Hohmeier, Paulo Pires, Razi Malik (Inactive), Sam Briesemeister, Shane Utt, Simao Reis, Steven Siahetiong
