Uploaded image for project: 'DC/OS'
  1. DC/OS
  2. DCOS_OSS-1065

Test 'marathon_app_tasks_survive_upgrade' fails due to health check

    Details

    • Type: Task
    • Status: Resolved
    • Priority: High
    • Resolution: Cannot Reproduce
    • Affects Version/s: DC/OS 1.10.0
    • Fix Version/s: None
    • Component/s: testing
    • Sprint:
      Networking Team 1.10 Sprint 5, Networking Team 1.10 Sprint 6, Networking Team 1.10 Sprint 8, Networking Team 1.10 Sprint 10, Networking Team 1.11 Sprint 1, Networking Team 1.11 Sprint 2

      Description

      During the upgrade test test_upgrade_vpc.marathon_app_tasks_survive_upgrade, the viplisten task has been observed to fail after an upgrade occurs. At first the task survives the upgrade, but shortly thereafter the agent is seen killing the task.

      The following shows the stderr from the task sandbox:

      4e67d253.073162c5-31b8-11e7-a7f4-823a749f2de1/runs/latest/stdout
      Received task health update, healthy: false
      Received kill for task upgrade-viplisten-ab829317de29421b8890fb844e67d253.073162c5-31b8-11e7-a7f4-823a749f2de1 with grace period of 3secs
      Sending SIGTERM to process tree at pid 15
      Sent SIGTERM to the following process trees:
      [
      -+- 15 sh -c /usr/bin/nc -l -p $PORT0
       \--- 16 /usr/bin/nc -l -p 9557
      ]
      Scheduling escalation to SIGKILL in 3secs from now
      Command terminated with signal Terminated (pid: 15)
      [centos@ip-10-10-0-244 ~]$ cat /var/lib/mesos/slave/slaves/076a54fb-81b9-49fa-83e3-0df68ad80be3-S0/frameworks/076a54fb-81b9-49fa-83e3-0df68ad80be3-0000/executors/upgrade-viplisten-ab829317de29421b8890fb844e67d253.073162c5-31b8-11e7-a7f4-823a749f2de1/runs/latest/stderr
      I0505 17:26:56.556574     5 exec.cpp:162] Version: 1.2.1
      I0505 17:26:56.559469     8 exec.cpp:237] Executor registered on agent 076a54fb-81b9-49fa-83e3-0df68ad80be3-S0
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      nslookup: can't resolve '(null)': Name does not resolve
      
      Name:      viplisten.marathon.l4lb.thisdcos.directory
      Address 1: 11.245.227.232
      16
      I0505 17:40:04.237536    10 exec.cpp:488] Agent exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with agent 076a54fb-81b9-49fa-83e3-0df68ad80be3-S0
      nslookup: can't resolve '(null)': Name does not resolve
      
      nslookup: can't resolve 'viplisten.marathon.l4lb.thisdcos.directory': Name does not resolve
      W0505 17:40:12.879420    13 health_checker.cpp:236] Health check failed 1 times consecutively: COMMAND health check failed: Command returned exited with status 1
      E0505 17:40:12.880730    14 process.cpp:2426] Failed to shutdown socket with fd 7: Transport endpoint is not connected
      I0505 17:40:31.530529     7 exec.cpp:283] Received reconnect request from agent 076a54fb-81b9-49fa-83e3-0df68ad80be3-S0
      I0505 17:40:31.532205     7 exec.cpp:260] Executor re-registered on agent 076a54fb-81b9-49fa-83e3-0df68ad80be3-S0
      nslookup: can't resolve '(null)': Name does not resolve
      
      nslookup: can't resolve 'viplisten.marathon.l4lb.thisdcos.directory': Name does not resolve
      W0505 17:41:12.898277    11 health_checker.cpp:236] Health check failed 2 times consecutively: COMMAND health check failed: Command returned exited with status 1
      nslookup: can't resolve '(null)': Name does not resolve
      
      nslookup: can't resolve 'viplisten.marathon.l4lb.thisdcos.directory': Name does not resolve
      W0505 17:42:12.940117    12 health_checker.cpp:236] Health check failed 3 times consecutively: COMMAND health check failed: Command returned exited with status 1
      I0505 17:42:12.940325    12 health_checker.cpp:165] Health checking stopped
      punt!
      I0505 17:42:13.040613    10 health_checker.cpp:165] Health checking stopped
      

      It seems that the task's health check began failing after upgrade, leading to the task kill. Here is the test code which generates a Marathon app JSON for this task:

      def create_marathon_viplisten_app():
          return {
              "id": '/' + TEST_APP_NAME_FMT.format('viplisten-' + uuid.uuid4().hex),
              "cmd": '/usr/bin/nc -l -p $PORT0',
              "cpus": 0.1,
              "mem": 32,
              "instances": 1,
              "container": {
                  "type": "MESOS",
                  "docker": {
                    "image": "alpine:3.5"
                  }
              },
              'portDefinitions': [{
                  'labels': {
                      'VIP_0': '/viplisten:5000'
                  }
              }],
              "healthChecks": [{
                  "protocol": "COMMAND",
                  "command": {
                      "value": "/usr/bin/nslookup viplisten.marathon.l4lb.thisdcos.directory && pgrep -x /usr/bin/nc"
                  },
                  "gracePeriodSeconds": 300,
                  "intervalSeconds": 60,
                  "timeoutSeconds": 20,
                  "maxConsecutiveFailures": 3
              }]
          }
      

      If I log into the agent that was running this task and attempt the health check command manually, I get successful output from nslookup, but pgrep gives a nonzero exit code:

      [centos@ip-10-10-0-244 ~]$ /usr/bin/nslookup viplisten.marathon.l4lb.thisdcos.directory && pgrep -x /usr/bin/nc
      Server:		198.51.100.1
      Address:	198.51.100.1#53
      
      Name:	viplisten.marathon.l4lb.thisdcos.directory
      Address: 11.245.228.11
      
      [centos@ip-10-10-0-244 ~]$ echo $?
      1
      [centos@ip-10-10-0-244 ~]$ /usr/bin/nslookup viplisten.marathon.l4lb.thisdcos.directory
      Server:		198.51.100.1
      Address:	198.51.100.1#53
      
      Name:	viplisten.marathon.l4lb.thisdcos.directory
      Address: 11.245.228.11
      
      [centos@ip-10-10-0-244 ~]$ echo $?
      0
      

      It looks like Spartan was up and running on the agent when the health checks failed (see timestamps):

      May 05 17:20:01 ip-10-10-0-244.eu-central-1.compute.internal systemd[1]: Starting DNS Forwarder (Spartan): forwards DNS requests to multiple DNS servers...
      May 05 17:20:01 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[15127]: [INFO] Clearing proxy environment variables
      May 05 17:20:01 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[15127]: [INFO] Connecting to 10.10.0.145:2181
      May 05 17:20:02 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[15127]: [INFO] Zookeeper connection established, state: CONNECTED
      May 05 17:20:02 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[15127]: [DEBUG] bootstrapping dcos-spartan
      May 05 17:20:02 ip-10-10-0-244.eu-central-1.compute.internal systemd[1]: Started DNS Forwarder (Spartan): forwards DNS requests to multiple DNS servers.
      May 05 17:20:02 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[15177]: Exec: /opt/mesosphere/packages/spartan--9cc57a3d55452b905d90e3201f56913140914ecc/spartan/erts-8.2.2/bin/erlexec -noshell -n
      May 05 17:20:02 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[15177]: Root: /opt/mesosphere/packages/spartan--9cc57a3d55452b905d90e3201f56913140914ecc/spartan
      May 05 17:20:02 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[15177]: /opt/mesosphere/packages/spartan--9cc57a3d55452b905d90e3201f56913140914ecc/spartan
      May 05 17:20:02 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[15177]: Setup running ...
      May 05 17:20:02 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[15177]: Directories verified. Res = {[ok],[]}
      May 05 17:20:02 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[15177]: Setup finished processing hooks ...
      May 05 17:40:04 ip-10-10-0-244.eu-central-1.compute.internal systemd[1]: Stopping DNS Forwarder (Spartan): forwards DNS requests to multiple DNS servers...
      May 05 17:40:04 ip-10-10-0-244.eu-central-1.compute.internal systemd[1]: Stopped DNS Forwarder (Spartan): forwards DNS requests to multiple DNS servers.
      May 05 17:40:04 ip-10-10-0-244.eu-central-1.compute.internal systemd[1]: Starting DNS Forwarder (Spartan): forwards DNS requests to multiple DNS servers...
      May 05 17:40:04 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[4135]: [INFO] Clearing proxy environment variables
      May 05 17:40:04 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[4135]: [INFO] Connecting to 10.10.0.6:2181
      May 05 17:40:04 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[4135]: [INFO] Zookeeper connection established, state: CONNECTED
      May 05 17:40:04 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[4135]: [DEBUG] bootstrapping dcos-spartan
      May 05 17:40:05 ip-10-10-0-244.eu-central-1.compute.internal env[4267]: RTNETLINK answers: File exists
      May 05 17:40:05 ip-10-10-0-244.eu-central-1.compute.internal env[4273]: RTNETLINK answers: File exists
      May 05 17:40:05 ip-10-10-0-244.eu-central-1.compute.internal env[4276]: RTNETLINK answers: File exists
      May 05 17:40:05 ip-10-10-0-244.eu-central-1.compute.internal systemd[1]: Started DNS Forwarder (Spartan): forwards DNS requests to multiple DNS servers.
      May 05 17:40:05 ip-10-10-0-244.eu-central-1.compute.internal env[4278]: RTNETLINK answers: File exists
      May 05 17:40:05 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[4280]: Exec: /opt/mesosphere/packages/spartan--2f73e776bdf9cca69c338c66cebe0a33e226b3f8/spartan/erts-8.2.2/bin/erlexec -noshell -no
      May 05 17:40:05 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[4280]: Root: /opt/mesosphere/packages/spartan--2f73e776bdf9cca69c338c66cebe0a33e226b3f8/spartan
      May 05 17:40:05 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[4280]: /opt/mesosphere/packages/spartan--2f73e776bdf9cca69c338c66cebe0a33e226b3f8/spartan
      May 05 17:40:06 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[4280]: Setup running ...
      May 05 17:40:06 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[4280]: Directories verified. Res = {[ok],[]}
      May 05 17:40:06 ip-10-10-0-244.eu-central-1.compute.internal spartan-env[4280]: Setup finished processing hooks ...
      

      It looks like navstar was having trouble connecting to ZK after the upgrade:

      May 05 17:40:10 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[4616]: [INFO] Connecting to zk-1.zk:2181
      May 05 17:40:10 ip-10-10-0-244.eu-central-1.compute.internal bootstrap[4616]: [WARNING] Connection dropped: socket connection error: Name or service not known
      

      Find complete navstar logs from the agent attached, as well as Exhibitor logs from one of the masters.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                adukhovniy Aleksey Dukhovniy
                Reporter:
                greg Greg Mann
                Team:
                Orchestration Team
                Watchers:
                Aleksey Dukhovniy, Avinash Sridharan (Inactive), Bekir Dogan, Cathy Daw, Deepak Goel, Greg Mann, Jan-Philip Gehrcke, Marco Monaco, Matthias Eichstedt, Senthil Kumaran
              • Watchers:
                10 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: