Uploaded image for project: 'DC/OS'
  1. DC/OS
  2. DCOS_OSS-4294

Flaky Test for Metronome on Looking at Job Run History

    Details

      Description

      Analysis details below... however the test in question was written with the assumption that job run history is immediately available when a job run finishes... which is NOT true.   Metronome asynchronously updates job run history and will eventually be correct.   Tests need to be written with this in mind.

       

      https://teamcity.mesosphere.io/viewLog.html?buildId=1249722&buildTypeId=DcOs_Open_Test_IntegrationTest_AzureArm&tab=artifacts#!bxqmlynw4w0

      This is a different failure; instead of a 502 we get a 404.

      [2018-10-09 23:14:57,473|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:14:57,474|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:14:58,475|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:14:58,483|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:14:58,483|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:14:59,485|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:14:59,492|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:14:59,493|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:15:00,494|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:15:00,503|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:15:00,503|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:15:01,504|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:15:01,512|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:15:01,512|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:15:02,514|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:15:02,534|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 404 85
      [2018-10-09 23:15:02,535|dcos_test_utils.jobs|INFO]: Job run 20181009231406X2KVv finished.
      [2018-10-09 23:15:02,535|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome. Arguments: \{'params': \{'embed': 'history'}}
      [2018-10-09 23:15:02,560|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome?embed=history HTTP/1.1" 200 516
      [2018-10-09 23:15:02,560|dcos_test_utils.dcos_api|INFO]: Job failed, run info: None
      

      Task in question did finish. Same time at which we got 404.

      bash-4.4$ mcli ps -a | egrep '^id|test_metron'
      id host state started
      test_metronome_20181009231406X2KVv.252b8d59-cc19-11e8-a67b-de3d68afe8ff 10.32.0.5 TASK_FINISHED 2018-10-09T23:15:02Z
      

      We succeeded.

      STARTING: 1539126901.625669
      RUNNING: 1539126902.326389
      FINISHED: 1539126902.32721

      However, the test queried the endpoint `/service/metronome/v1/jobs/test.metronome?embed=history` at `2018-10-09 23:15:02,560` (35 milliseconds after getting a 404 for the current job). Maybe there's a race?

      10:51 $ pbpaste | tr "'" '"' | sed 's/False/false/g' | sed 's/None/null/g' | jq .
      \{
       "id": "test.metronome",
       "description": "Test Metronome API regressions",
       "labels": \{},
       "run": \{
       "cpus": 1,
       "mem": 512,
       "disk": 0,
       "cmd": "ls",
       "user": "nobody",
       "env": \{},
       "placement": \{
       "constraints": []
       },
       "artifacts": [],
       "maxLaunchDelay": 3600,
       "docker": \{
       "image": "busybox:latest",
       "privileged": false,
       "parameters": [],
       "forcePullImage": false
       },
       "volumes": [],
       "restart": \{
       "policy": "ON_FAILURE"
       },
       "secrets": \{}
       },
       "history": \{
       "successCount": 0,
       "failureCount": 0,
       "lastSuccessAt": null,
       "lastFailureAt": null,
       "successfulFinishedRuns": [],
       "failedFinishedRuns": []
       }
      }
      

      Conclusion: Not a Metronome issue. The job history is updated asynchronously and there's no guarantee that if you get a 404 for a job run as an indicator of completion, that this means the job history is immediately available. Test needs to be updated to use eventually.

        Attachments

          Activity

            People

            • Assignee:
              alenavarkockova Alena Varkockova
              Reporter:
              ken Ken Sipe
              Team:
              Orchestration Team
              Watchers:
              Ken Sipe, Mergebot
            • Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: