Details

    • Type: Bug
    • Status: Resolved
    • Priority: Low
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Build & CI
    • Labels:
      None
    • Sprint:
      Marathon Sprint 1.11-6, Marathon Sprint 1.11-7
    • Story Points:
      3

      Description

      Problem

      Recently, we switched our AWS nodes to an instance type with fewer compute resources, without reducing the concurrency factor or increasing our patienceConfig. The evidence indicates that tests are timing out that would have finished had they been given more time (or had they simply run faster).

      Proposed solution

      We recommend increasing our patienceConfig for these tests, or switching to instances with higher serial execution guarantees.
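
To illustrate why a fixed patience budget fails on a slower node, here is a minimal sketch in Python of a poll-until-timeout loop. It is not ScalaTest's API; the `eventually` helper, the `FakeClock` class, and the numbers (an operation that needs 8 seconds, timeouts of 5 and 10 seconds) are hypothetical and chosen only for illustration.

```python
import time


def eventually(check, timeout_s, interval_s, clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


class FakeClock:
    """Deterministic clock so the sketch runs instantly (hypothetical helper)."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def run_with_timeout(timeout_s, ready_at_s=8.0):
    """Simulate an operation that only completes after ready_at_s seconds."""
    clock = FakeClock()
    return eventually(
        check=lambda: clock.now >= ready_at_s,
        timeout_s=timeout_s,
        interval_s=1.0,
        clock=clock.time,
        sleep=clock.sleep,
    )


# Assumed numbers: the operation needs 8 s on the slower node.
print(run_with_timeout(timeout_s=5.0))   # budget too small: times out
print(run_with_timeout(timeout_s=10.0))  # larger budget: succeeds
```

The same operation passes or fails depending only on the budget, which is the behavior the proposal aims to address either by raising the budget or by making the node faster.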

      Observation of a Manifestation of the Problem

      We see a sudden spike of failures for 1.5.x in the following integration tests:

         1 mesosphere.marathon.integration.MesosAppIntegrationTest:MesosApp should delete pod instances
         1 mesosphere.marathon.integration.MesosAppIntegrationTest:MesosApp should deploy a simple pod with health checks
         1 mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should Config Change
         1 mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should persistent volume will be re-attached and keep state
         1 mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should resident task can be deployed and write to persistent volume
         3 mesosphere.marathon.core.health.impl.MarathonHealthCheckManagerTest:HealthCheckManager should reconcile loads the last known task health state
         5 mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should backoff delays are reset on configuration changes
        12 mesosphere.marathon.integration.KeepAppsRunningDuringAbdicationIntegrationTest:Abdicating a leader should keep all running apps alive
        16 mesosphere.marathon.integration.BackupRestoreIntegrationTest:Abdicating a leader should keep all running apps alive
        18 mesosphere.marathon.integration.DeleteAppAndBackupIntegrationTest:Abdicating a leader should keep all running apps alive
      Sample size: 296
      

      The failures happened consecutively, for a period, stopped, and then returned.

      This visualization shows jobs 1933-1231, or roughly Oct 27 to Nov 2:


      marathon-loop-1.5-failures-oct-27-to-nov-2.pdf

      Failures by job:

      [
        {
          "failure": "mesosphere.marathon.core.health.impl.MarathonHealthCheckManagerTest:HealthCheckManager should reconcile loads the last known task health state",
          "jobs": [
            1123,
            1133,
            1153
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should backoff delays are reset on configuration changes",
          "jobs": [
            935,
            1022,
            1045,
            1176,
            1225
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.BackupRestoreIntegrationTest:Abdicating a leader should keep all running apps alive",
          "jobs": [
            1048,
            1049,
            1050,
            1051,
            1052,
            1053,
            1054,
            1055,
            1056,
            1058,
            1059,
            1060,
            1061,
            1062,
            1156,
            1158
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.DeleteAppAndBackupIntegrationTest:Abdicating a leader should keep all running apps alive",
          "jobs": [
            1048,
            1049,
            1050,
            1051,
            1052,
            1053,
            1054,
            1055,
            1056,
            1057,
            1058,
            1059,
            1060,
            1061,
            1062,
            1156,
            1157,
            1158
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.KeepAppsRunningDuringAbdicationIntegrationTest:Abdicating a leader should keep all running apps alive",
          "jobs": [
            1049,
            1051,
            1052,
            1054,
            1055,
            1056,
            1058,
            1061,
            1062,
            1156,
            1157,
            1158
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.MesosAppIntegrationTest:MesosApp should delete pod instances",
          "jobs": [
            1006
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.MesosAppIntegrationTest:MesosApp should deploy a simple pod with health checks",
          "jobs": [
            1004
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should Config Change",
          "jobs": [
            1050
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should persistent volume will be re-attached and keep state",
          "jobs": [
            1079
          ]
        },
        {
          "failure": "mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should resident task can be deployed and write to persistent volume",
          "jobs": [
            941
          ]
        }
      ]
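
The per-test tally and the stop-and-return pattern can be recomputed from the per-job data. The following is a minimal sketch in Python, assuming the JSON above has been loaded into a list of dicts. Only the three abdication-related entries are copied here, with test names shortened to their class names, and the helper names `failures_per_test` and `consecutive_runs` are hypothetical.

```python
# Subset of the "Failures by job" data above (three of the ten entries).
failures_by_job = [
    {"failure": "BackupRestoreIntegrationTest",
     "jobs": [1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056,
              1058, 1059, 1060, 1061, 1062, 1156, 1158]},
    {"failure": "DeleteAppAndBackupIntegrationTest",
     "jobs": [1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057,
              1058, 1059, 1060, 1061, 1062, 1156, 1157, 1158]},
    {"failure": "KeepAppsRunningDuringAbdicationIntegrationTest",
     "jobs": [1049, 1051, 1052, 1054, 1055, 1056, 1058, 1061, 1062,
              1156, 1157, 1158]},
]


def failures_per_test(entries):
    """Return {test name: number of failing jobs}."""
    return {entry["failure"]: len(entry["jobs"]) for entry in entries}


def consecutive_runs(jobs):
    """Group job numbers into (first, last) runs of consecutive numbers."""
    runs = []
    for job in sorted(set(jobs)):
        if runs and job == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], job)  # extend the current run
        else:
            runs.append((job, job))        # start a new run
    return runs


print(failures_per_test(failures_by_job))
print(consecutive_runs(failures_by_job[1]["jobs"]))
# -> [(1048, 1062), (1156, 1158)]
```

For DeleteAppAndBackupIntegrationTest, the two runs make the pattern explicit: an unbroken streak, a gap, then a second shorter streak.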
      
      

      A look at node-id.tsv shows that every test run on `i-0fcc80f66bd529e4c` and `i-02f91003a9aba040f` had failures. There is a strong correlation between these nodes and the stream of failures.

      1040	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-09c5240b1e0c898c2)
      1041	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-01bd9d9c0dccb11c9)
      1042	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0d009de875da39f54)
      1043	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-04690b6b5c7855a61)
      1044	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-075614fdd33a289ec)
      1045	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0975c8d472dbc3fb6)
      1046	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-05f470c9d0bc65f2c)
      1047	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0a0067ade00583d99)
      1048	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1049	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1050	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1051	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1052	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1053	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1054	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1055	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1056	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1057	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1058	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1059	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1060	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)
      1061	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-02f91003a9aba040f)
      1062	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-02f91003a9aba040f)
      1063	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0f370cfed19fec0b5)
      1064	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-033305290fbc862ab)
      1065	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-033305290fbc862ab)
      1066	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0e7edec0ecce8a643)
      1067	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-03deacb91ea5f69ee)
      1068	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-03deacb91ea5f69ee)
      1069	[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0cd7f58606587ca9f)
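
The node correlation can be checked by joining the job-to-node rows with the set of failing jobs. The following is a minimal sketch in Python, assuming rows in the tab-separated layout shown above. Only a subset of the rows is copied here, the failing-job set is taken from the DeleteAppAndBackupIntegrationTest entry (jobs 1048 to 1062), and the helper names `parse_node_rows` and `failure_rate_by_node` are hypothetical.

```python
import re
from collections import defaultdict

NODE_ID = re.compile(r"\((i-[0-9a-f]+)\)")

# Subset of the node-id.tsv rows listed above.
rows = "\n".join([
    "1047\t[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0a0067ade00583d99)",
    "1048\t[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)",
    "1049\t[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)",
    "1060\t[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0fcc80f66bd529e4c)",
    "1061\t[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-02f91003a9aba040f)",
    "1062\t[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-02f91003a9aba040f)",
    "1063\t[Marathon] JenkinsMarathonCI-Debian8-2017-07-18 (i-0f370cfed19fec0b5)",
])

# Jobs 1048..1062 all appear in the DeleteAppAndBackupIntegrationTest entry.
failed_jobs = set(range(1048, 1063))


def parse_node_rows(text):
    """Map job number -> instance ID from 'job<TAB>label (i-...)' rows."""
    nodes = {}
    for line in text.strip().splitlines():
        job, label = line.strip().split("\t", 1)
        match = NODE_ID.search(label)
        if match:
            nodes[int(job)] = match.group(1)
    return nodes


def failure_rate_by_node(nodes, failed):
    """Return {instance ID: (failed runs, total runs)}."""
    totals = defaultdict(lambda: [0, 0])
    for job, node in nodes.items():
        totals[node][1] += 1
        if job in failed:
            totals[node][0] += 1
    return {node: (f, t) for node, (f, t) in totals.items()}


for node, (f, t) in failure_rate_by_node(parse_node_rows(rows), failed_jobs).items():
    print(f"{node}: {f}/{t} runs failed")
```

Running the same join over the full node-id.tsv and the full failure list would show whether any run on the two suspect instances passed cleanly.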
      

              People

              • Assignee:
                tharper Tim Harper
                Reporter:
                kjeschkies Karsten Jeschkies
                Team:
                Orchestration Team
                Watchers:
                Karsten Jeschkies, Marco Monaco, Matthias Eichstedt, Tim Harper