Details

      Description

      Analysis details below... however the test in question was written with the assumption that job run history is immediately available when a job run finishes... which is NOT true.   Metronome asynchronously updates job run history and will eventually be correct.   Tests need to be written with this in mind.

       

      https://teamcity.mesosphere.io/viewLog.html?buildId=1249722&buildTypeId=DcOs_Open_Test_IntegrationTest_AzureArm&tab=artifacts#!bxqmlynw4w0

      This is a different failure; instead of a 502 we get a 404.

      [2018-10-09 23:14:57,473|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:14:57,474|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:14:58,475|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:14:58,483|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:14:58,483|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:14:59,485|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:14:59,492|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:14:59,493|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:15:00,494|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:15:00,503|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:15:00,503|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:15:01,504|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:15:01,512|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 200 145
      [2018-10-09 23:15:01,512|dcos_test_utils.jobs|INFO]: Waiting on job run 20181009231406X2KVv to finish.
      [2018-10-09 23:15:02,514|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv. Arguments: \{}
      [2018-10-09 23:15:02,534|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome/runs/20181009231406X2KVv HTTP/1.1" 404 85
      [2018-10-09 23:15:02,535|dcos_test_utils.jobs|INFO]: Job run 20181009231406X2KVv finished.
      [2018-10-09 23:15:02,535|dcos_test_utils.helpers|DEBUG]: Request method GET: http://172.16.0.5/service/metronome/v1/jobs/test.metronome. Arguments: \{'params': \{'embed': 'history'}}
      [2018-10-09 23:15:02,560|urllib3.connectionpool|DEBUG]: http://172.16.0.5:80 "GET /service/metronome/v1/jobs/test.metronome?embed=history HTTP/1.1" 200 516
      [2018-10-09 23:15:02,560|dcos_test_utils.dcos_api|INFO]: Job failed, run info: None
      

      Task in question did finish. Same time at which we got 404.

      bash-4.4$ mcli ps -a | egrep '^id|test_metron'
      id host state started
      test_metronome_20181009231406X2KVv.252b8d59-cc19-11e8-a67b-de3d68afe8ff 10.32.0.5 TASK_FINISHED 2018-10-09T23:15:02Z
      

      We succeeded.

      STARTING: 1539126901.625669
      RUNNING: 1539126902.326389
      FINISHED: 1539126902.32721

      However, the test queried the endpoint `/service/metronome/v1/jobs/test.metronome?embed=history` at `2018-10-09 23:15:02,560` (35 milliseconds after getting a 404 for the current job). Maybe there's a race?

      10:51 $ pbpaste | tr "'" '"' | sed 's/False/false/g' | sed 's/None/null/g' | jq .
      \{
       "id": "test.metronome",
       "description": "Test Metronome API regressions",
       "labels": \{},
       "run": \{
       "cpus": 1,
       "mem": 512,
       "disk": 0,
       "cmd": "ls",
       "user": "nobody",
       "env": \{},
       "placement": \{
       "constraints": []
       },
       "artifacts": [],
       "maxLaunchDelay": 3600,
       "docker": \{
       "image": "busybox:latest",
       "privileged": false,
       "parameters": [],
       "forcePullImage": false
       },
       "volumes": [],
       "restart": \{
       "policy": "ON_FAILURE"
       },
       "secrets": \{}
       },
       "history": \{
       "successCount": 0,
       "failureCount": 0,
       "lastSuccessAt": null,
       "lastFailureAt": null,
       "successfulFinishedRuns": [],
       "failedFinishedRuns": []
       }
      }
      

      Conclusion: Not a Metronome issue. The job history is updated asynchronously and there's no guarantee that if you get a 404 for a job run as an indicator of completion, that this means the job history is immediately available. Test needs to be updated to use eventually.

        Attachments

          Activity

            People

            • Assignee:
              alenavarkockova Alena Varkockova
              Reporter:
              ken Ken Sipe
              Team:
              Orchestration Team
              Watchers:
              Ken Sipe, Mergebot
            • Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: