Details

      Description

      Spun up a new cluster on 1.12.1 and noticed this issue. When jobs succeed or fail with the restart policy set to `NEVER`, it exits properly, but when set to `ON_FAILURE` the failing job just hangs. 

      Tested this with multiple versions and found that the issue appears to have been introduced in DC/OS 1.11.5, which bumps metronome to 0.4.3. Looking into the changelog for that version, I believe it's related to this since this is the error message I'm getting: https://github.com/dcos/metronome/pull/230

       I've boiled it down to this simple failing job:

      {
        "id": "testing",
        "description": "Testing",
        "labels": {},
        "run": {
          "cpus": 1,
          "mem": 1024,
          "disk": 100,
          "cmd": "ls agleflop",
          "env": {},
          "placement": {
            "constraints": []
          },
          "artifacts": [],
          "maxLaunchDelay": 3600,
          "volumes": [],
          "restart": {
            "policy": "ON_FAILURE"
          },
          "secrets": {}
        },
        "schedules": []
      }
      

       stderr from the job log in the sandbox:

      (AT BEGINNING OF FILE)
      Marked '/' as rslave
      Prepared mount '{"flags":14,"source":"proc","target":"/proc","type":"proc"}'
      I0103 23:58:42.239980     3 exec.cpp:162] Version: 1.7.1
      I0103 23:58:42.242617     6 exec.cpp:236] Executor registered on agent d1327d36-9378-4f71-90de-e99a44de8920-S41
      I0103 23:58:42.243762     4 executor.cpp:184] Received SUBSCRIBED event
      I0103 23:58:42.244035     4 executor.cpp:188] Subscribed executor on 10.1.3.249
      I0103 23:58:42.244158     4 executor.cpp:184] Received LAUNCH event
      I0103 23:58:42.244904     4 executor.cpp:687] Starting task testing_20190103235828Y4MZQ.7f4f322b-0fb3-11e9-b766-b290d55b0072
      I0103 23:58:42.248999     4 executor.cpp:502] Running '/opt/mesosphere/active/mesos/libexec/mesos/mesos-containerizer launch <POSSIBLY-SENSITIVE-DATA>'
      I0103 23:58:42.249780     4 executor.cpp:702] Forked command at 12
      ls: cannot access 'agleflop': No such file or directory
      I0103 23:58:42.339903     8 executor.cpp:1003] Command exited with status 2 (pid: 12)
      I0103 23:58:43.341948    11 process.cpp:926] Stopped the socket accept loop
      

      And these are the journalctl logs from the master metronome process:

      Jan 04 05:11:06 ip-10-2-4-123.ec2.internal metronome[3849]: [info] d.m.j.i.JobSpecPersistenceActor - Create JobSpec testing
      Jan 04 05:11:15 ip-10-2-4-123.ec2.internal metronome[3849]: [info] d.m.j.i.JobRunServiceActor - Trigger new JobRun for JobSpec: JobSpec(testing,Some(Testing),Map(),List(),JobRunSpec(1.0,1024.0,100.0,Some(ls agleflop),None,None,Map(),PlacementSpec(List()),List(),30 seconds,None,List(),RestartSpec(OnFailure,None),None,Map()))
      Jan 04 05:11:15 ip-10-2-4-123.ec2.internal metronome[3849]: [info] d.m.j.i.JobRunServiceActor - Start new JobRun: testing.201901040511158mqnd
      Jan 04 05:11:15 ip-10-2-4-123.ec2.internal metronome[3849]: [info] d.m.j.i.JobRunExecutorActor - Create JobRun testing.201901040511158mqnd - become creating
      Jan 04 05:11:15 ip-10-2-4-123.ec2.internal metronome[3849]: [info] d.m.j.i.JobRunExecutorActor - Execution of JobRun testing.201901040511158mqnd has been started - become starting
      Jan 04 05:11:15 ip-10-2-4-123.ec2.internal metronome[3849]: [info] d.m.j.i.JobRunExecutorActor - addTaskToLaunchQueue
      Jan 04 05:11:15 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.TaskLauncherActor - Started instanceLaunchActor for /testing/201901040511158mqnd version 1970-01-01T00:00:00.000Z with initial count 1
      Jan 04 05:11:15 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - activating matcher ActorOfferMatcher(Actor[akka://metronome/user/launchQueue/1/1-testing_201901040511158mqnd#2011550370]).
      Jan 04 05:11:18 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.OfferProcessorImpl - Processing offer: offerId 560623e9-ef2a-4a3c-83e8-155d56b411a8-O596, agentId 560623e9-ef2a-4a3c-83e8-155d56b411a8-S1
      Jan 04 05:11:18 ip-10-2-4-123.ec2.internal metronome[3849]: [warn] m.m.c.l.i.OfferProcessorImpl - Offer 560623e9-ef2a-4a3c-83e8-155d56b411a8-O596 has zero-valued resources: ports
      Jan 04 05:11:18 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - Start processing offer 560623e9-ef2a-4a3c-83e8-155d56b411a8-O596. Current offer matcher count: 0
      Jan 04 05:11:18 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.TaskLauncherActor - No tasks left to launch. Stop receiving offers for /testing/201901040511158mqnd, 1970-01-01T00:00:00.000Z
      Jan 04 05:11:18 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - removing matcher ActorOfferMatcher(Actor[akka://metronome/user/launchQueue/1/1-testing_201901040511158mqnd#2011550370])
      Jan 04 05:11:18 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - Finished processing 560623e9-ef2a-4a3c-83e8-155d56b411a8-O596 from 10.2.3.246. Matched 1 ops after 2 passes. First 10: ports(*) 1025->2180,2182->3887,3889->5049,5052->8079,8082->8180,8182->32000; disk(*) 76138.0; cpus(*) 3.0; mem(*) 14002.0
      Jan 04 05:11:18 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.OfferProcessorImpl - Processing LaunchEphemeral(Instance(instance [testing_201901040511158mqnd.marathon-2b3e2f45-0fdf-11e9-b5df-c2235483f5d9],AgentInfo(10.2.3.246,Some(560623e9-ef2a-4a3c-83e8-155d56b411a8-S1),None,None,Vector()),InstanceState(Created,2019-01-04T05:11:18.914Z,None,None,Running),Map(task [testing_201901040511158mqnd.2b3e2f45-0fdf-11e9-b5df-c2235483f5d9] -> Task(task [testing_201901040511158mqnd.2b3e2f45-0fdf-11e9-b5df-c2235483f5d9],1970-01-01T00:00:00.000Z,Status(2019-01-04T05:11:18.914Z,None,None,Created,NetworkInfo(10.2.3.246,List(),List())))),1970-01-01T00:00:00.000Z,UnreachableEnabled(0 seconds,0 seconds),None)) for instance [testing_201901040511158mqnd.marathon-2b3e2f45-0fdf-11e9-b5df-c2235483f5d9]
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.MarathonScheduler - Received status update for task testing_201901040511158mqnd.2b3e2f45-0fdf-11e9-b5df-c2235483f5d9: TASK_STARTING ()
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.t.u.i.TaskStatusUpdateProcessorImpl - Acknowledge status update for task testing_201901040511158mqnd.2b3e2f45-0fdf-11e9-b5df-c2235483f5d9: TASK_STARTING ()
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.MarathonScheduler - Received status update for task testing_201901040511158mqnd.2b3e2f45-0fdf-11e9-b5df-c2235483f5d9: TASK_RUNNING ()
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.t.u.i.TaskStatusUpdateProcessorImpl - Acknowledge status update for task testing_201901040511158mqnd.2b3e2f45-0fdf-11e9-b5df-c2235483f5d9: TASK_RUNNING ()
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.MarathonScheduler - Received status update for task testing_201901040511158mqnd.2b3e2f45-0fdf-11e9-b5df-c2235483f5d9: TASK_FAILED (Command exited with status 2)
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.i.u.InstanceUpdater$ - all tasks of instance [testing_201901040511158mqnd.marathon-2b3e2f45-0fdf-11e9-b5df-c2235483f5d9] are terminal, requesting to expunge
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.t.t.InstanceTracker$InstancesBySpec - Removed app [/testing/201901040511158mqnd] from tracker
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] d.m.j.i.JobRunExecutorActor - still in time, launching another task
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] d.m.j.i.JobRunExecutorActor - Job run testing.201901040511158mqnd already exists in LaunchQueue - not adding
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.t.u.i.TaskStatusUpdateProcessorImpl - Acknowledge status update for task testing_201901040511158mqnd.2b3e2f45-0fdf-11e9-b5df-c2235483f5d9: TASK_FAILED (Command exited with status 2)
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.OfferProcessorImpl - Processing offer: offerId 560623e9-ef2a-4a3c-83e8-155d56b411a8-O597, agentId 560623e9-ef2a-4a3c-83e8-155d56b411a8-S0
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [warn] m.m.c.l.i.OfferProcessorImpl - Offer 560623e9-ef2a-4a3c-83e8-155d56b411a8-O597 has zero-valued resources: ports
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.OfferProcessorImpl - Processing offer: offerId 560623e9-ef2a-4a3c-83e8-155d56b411a8-O598, agentId 560623e9-ef2a-4a3c-83e8-155d56b411a8-S1
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [warn] m.m.c.l.i.OfferProcessorImpl - Offer 560623e9-ef2a-4a3c-83e8-155d56b411a8-O598 has zero-valued resources: ports
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - No match for:560623e9-ef2a-4a3c-83e8-155d56b411a8-O597 from:10.2.0.151 reason:No offers wanted
      Jan 04 05:11:19 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - No match for:560623e9-ef2a-4a3c-83e8-155d56b411a8-O598 from:10.2.3.246 reason:No offers wanted
      Jan 04 05:11:30 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.OfferProcessorImpl - Processing offer: offerId 560623e9-ef2a-4a3c-83e8-155d56b411a8-O599, agentId 560623e9-ef2a-4a3c-83e8-155d56b411a8-S3
      Jan 04 05:11:30 ip-10-2-4-123.ec2.internal metronome[3849]: [warn] m.m.c.l.i.OfferProcessorImpl - Offer 560623e9-ef2a-4a3c-83e8-155d56b411a8-O599 has zero-valued resources: ports
      Jan 04 05:11:30 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - No match for:560623e9-ef2a-4a3c-83e8-155d56b411a8-O599 from:10.2.2.80 reason:No offers wanted
      Jan 04 05:13:20 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.OfferProcessorImpl - Processing offer: offerId 560623e9-ef2a-4a3c-83e8-155d56b411a8-O601, agentId 560623e9-ef2a-4a3c-83e8-155d56b411a8-S0
      Jan 04 05:13:20 ip-10-2-4-123.ec2.internal metronome[3849]: [warn] m.m.c.l.i.OfferProcessorImpl - Offer 560623e9-ef2a-4a3c-83e8-155d56b411a8-O601 has zero-valued resources: ports
      Jan 04 05:13:20 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - No match for:560623e9-ef2a-4a3c-83e8-155d56b411a8-O601 from:10.2.0.151 reason:No offers wanted
      Jan 04 05:13:21 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.OfferProcessorImpl - Processing offer: offerId 560623e9-ef2a-4a3c-83e8-155d56b411a8-O603, agentId 560623e9-ef2a-4a3c-83e8-155d56b411a8-S1
      Jan 04 05:13:21 ip-10-2-4-123.ec2.internal metronome[3849]: [warn] m.m.c.l.i.OfferProcessorImpl - Offer 560623e9-ef2a-4a3c-83e8-155d56b411a8-O603 has zero-valued resources: ports
      Jan 04 05:13:21 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - No match for:560623e9-ef2a-4a3c-83e8-155d56b411a8-O603 from:10.2.3.246 reason:No offers wanted
      Jan 04 05:13:31 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.l.i.OfferProcessorImpl - Processing offer: offerId 560623e9-ef2a-4a3c-83e8-155d56b411a8-O606, agentId 560623e9-ef2a-4a3c-83e8-155d56b411a8-S3
      Jan 04 05:13:31 ip-10-2-4-123.ec2.internal metronome[3849]: [warn] m.m.c.l.i.OfferProcessorImpl - Offer 560623e9-ef2a-4a3c-83e8-155d56b411a8-O606 has zero-valued resources: ports
      Jan 04 05:13:31 ip-10-2-4-123.ec2.internal metronome[3849]: [info] m.m.c.m.m.i.OfferMatcherManagerActor - No match for:560623e9-ef2a-4a3c-83e8-155d56b411a8-O606 from:10.2.2.80 reason:No offers wanted
      

      And you can see the message from the above referenced github PR. It seems that launch validation might be too aggressive in this case and prevents metronome from relaunching the task.

       Job run testing.201901040511158mqnd already exists in LaunchQueue - not adding

        Attachments

          Activity

            People

            • Assignee:
              ken Ken Sipe
              Reporter:
              kyrozetera kyrozetera
              Team:
              Orchestration Team
              Watchers:
              Anitha Muthu (Inactive), Ken Sipe, kyrozetera, Lisa Gunn (Inactive), Matthias Eichstedt, Mergebot
            • Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: