Marathon / MARATHON-7036

Jenkins test suites appear to be leaking resources, contributing to flaky builds

    Details

      Description

      We have seen a growing body of evidence suggesting that Marathon sometimes does not fully clean up after itself when the test suite runs, and this leads to instability in our tests.

      One example can be found in the failure rate for our `ForwardToLeader` tests.


      [
        {
          "failure": "mesosphere.marathon.integration.ForwardToLeaderIntegrationTest:ForwardingToLeader should direct HTTPS ping",
          "jobs": [
            1087,
            1092
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.ForwardToLeaderIntegrationTest:ForwardingToLeader should forwarding 404",
          "jobs": [
            1118
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.ForwardToLeaderIntegrationTest:ForwardingToLeader should forwarding HTTPS ping with a ca signed cert",
          "jobs": [
            1089,
            1090,
            1098
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.ForwardToLeaderIntegrationTest:ForwardingToLeader should forwarding HTTPS ping with a self-signed cert",
          "jobs": [
            969,
            982,
            1071,
            1086,
            1087,
            1090,
            1091,
            1092,
            1093,
            1094,
            1095,
            1096,
            1097,
            1099
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.ForwardToLeaderIntegrationTest:ForwardingToLeader should forwarding ping",
          "jobs": [
            965,
            1089,
            1090,
            1098
          ]
        }
      ]
      

      The input for this report was builds 938 to 1175, which tells us that there is a small cluster of failures in the middle of the dataset. An informal scan of the HEAD of each failing build does not reveal any significant code change to which we can attribute the cluster of failures; it seems entirely unrelated to code changes. However, there IS a 4-hour idle period between build 1099 and the next successful build:

        {
          "id": 1094,
          "startTime": "2017-03-09T07:53:43"
        },
        {
          "id": 1095,
          "startTime": "2017-03-09T08:23:39"
        },
        {
          "id": 1096,
          "startTime": "2017-03-09T08:53:36"
        },
        {
          "id": 1097,
          "startTime": "2017-03-09T09:23:57"
        },
        {
          "id": 1098,
          "startTime": "2017-03-09T09:53:40"
        },
        {
          "id": 1099,
          "startTime": "2017-03-09T10:23:43"
        },
        {
          "id": 1100,
          "startTime": null
        },
        {
          "id": 1101,
          "startTime": null
        },
        {
          "id": 1102,
          "startTime": "2017-03-09T14:31:33"
        },
        {
          "id": 1103,
          "startTime": "2017-03-09T14:52:19"
        },
        {
          "id": 1104,
          "startTime": "2017-03-09T15:12:02"
        },
        {
          "id": 1105,
          "startTime": "2017-03-09T15:31:46"
        },
      

      We are not entirely sure why there was such a gap (the best hypothesis is that the Jenkins master had issues). The slave may have timed out from sitting idle during this period, or it may have been destroyed and recreated as part of some recovery action. These are only guesses at this point, but the best guess (and the data could support it) is that some event cleaned the accumulated junk out of the Jenkins slave's state.

      This will take some investigative work to solve. It would be helpful to collect some data, such as capturing the output of `ps aux` (and perhaps other diagnostics) after each build. If we can collect data proving there is a leak, we can either fix it, or add a pre-flight stage that detects leaked processes from a prior run and kills them. A sketch of what that collection step could look like is included below.
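
      As a rough sketch of what that data collection could look like, a small script along these lines could run as a post-build (or pre-flight) step: it snapshots `ps aux` to a log file and flags any processes matching patterns we would expect a leaked test run to leave behind. The patterns (and the `app_mock` name) are assumptions for illustration, not confirmed leak signatures:

        #!/usr/bin/env python
        # Hypothetical post-build diagnostic: snapshot `ps aux` and flag processes
        # that look like leftovers from a previous Marathon test run. The patterns
        # below are assumptions, not a verified list of what the suite leaves behind.

        import re
        import subprocess
        import sys
        from datetime import datetime

        # Command-line patterns we *suspect* a leaked test process would match.
        LEAK_PATTERNS = [
            r"mesos-(master|agent|slave)",
            r"app_mock",                        # assumed integration-test helper name
            r"java .*mesosphere\.marathon",
        ]


        def snapshot(path):
            """Save the full `ps aux` output so it can be compared across builds."""
            out = subprocess.check_output(["ps", "aux"]).decode("utf-8", "replace")
            with open(path, "w") as f:
                f.write(out)
            return out


        def suspected_leaks(ps_output):
            """Return (pid, command) pairs whose command matches a leak pattern."""
            leaks = []
            for line in ps_output.splitlines()[1:]:   # skip the header row
                cols = line.split(None, 10)           # COMMAND is the 11th column
                if len(cols) < 11:
                    continue
                pid, command = cols[1], cols[10]
                if any(re.search(p, command) for p in LEAK_PATTERNS):
                    leaks.append((pid, command))
            return leaks


        if __name__ == "__main__":
            stamp = datetime.utcnow().strftime("%Y%m%dT%H%M%S")
            output = snapshot("ps-aux-%s.log" % stamp)
            leaks = suspected_leaks(output)
            for pid, command in leaks:
                print("possible leaked process %s: %s" % (pid, command))
            # Report and fail the stage so the leak is visible; a pre-flight
            # variant could instead kill these PIDs before the next run starts.
            sys.exit(1 if leaks else 0)

      A pre-flight variant of the same script could send SIGKILL to the flagged PIDs before the next run starts, which is the "detect and kill" option described above.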

      Or, we could simply recycle our Jenkins slaves more often.


              People

              • Assignee: tharper Tim Harper
              • Reporter: tharper Tim Harper
              • Team: Orchestration Team
              • Watchers: Jason Gilanfarr (Inactive), Karsten Jeschkies, Marco Monaco, Tim Harper
