  1. DC/OS
  2. DCOS_OSS-4239

Metronome does not crash when an exception happens during startup

    Details

      Description

      This came up from a log analysis (DCOS-42236).

      After a master failover in the DC/OS tests, the current leader sometimes throws this exception:

      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: Exception in thread "main" java.lang.InterruptedException: Timed out while waiting for zookeeper connection
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.core.storage.store.impl.zk.RichCuratorFramework.blockUntilConnected(RichCuratorFramework.scala:153)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.CuratorZk.client$lzycompute(StorageConfig.scala:135)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.CuratorZk.client(StorageConfig.scala:117)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.CuratorZk.leafStore(StorageConfig.scala:151)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.PersistenceStorageConfig.lazyStore(StorageConfig.scala:58)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.PersistenceStorageConfig.lazyStore$(StorageConfig.scala:54)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.CuratorZk.lazyStore(StorageConfig.scala:95)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.PersistenceStorageConfig.store(StorageConfig.scala:67)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.PersistenceStorageConfig.store$(StorageConfig.scala:62)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.CuratorZk.store(StorageConfig.scala:95)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.StorageModule$.apply(StorageModule.scala:51)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at mesosphere.marathon.storage.StorageModule$.apply(StorageModule.scala:40)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.repository.SchedulerRepositoriesModule.storageModule$lzycompute(SchedulerRepositoriesModule.scala:55)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.repository.SchedulerRepositoriesModule.storageModule(SchedulerRepositoriesModule.scala:55)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.repository.SchedulerRepositoriesModule.frameworkIdRepository$lzycompute(SchedulerRepositoriesModule.scala:60)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.repository.SchedulerRepositoriesModule.frameworkIdRepository(SchedulerRepositoriesModule.scala:60)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.scheduler.SchedulerModule.<init>(SchedulerModule.scala:143)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobsModule.schedulerModule$lzycompute(JobsModule.scala:43)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobsModule.schedulerModule(JobsModule.scala:34)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobsModule.jobRunModule$lzycompute(JobsModule.scala:46)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobsModule.jobRunModule(JobsModule.scala:45)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobsModule.jobSpecModule$lzycompute(JobsModule.scala:57)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobsModule.jobSpecModule(JobsModule.scala:52)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobComponents.apiModule$lzycompute(JobApplicationLoader.scala:53)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobComponents.apiModule(JobApplicationLoader.scala:51)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobComponents.router(JobApplicationLoader.scala:85)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.api.BuiltInComponents.httpRequestHandler(Application.scala:321)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.api.BuiltInComponents.httpRequestHandler$(Application.scala:321)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.api.BuiltInComponentsFromContext.httpRequestHandler$lzycompute(ApplicationLoader.scala:122)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.api.BuiltInComponentsFromContext.httpRequestHandler(ApplicationLoader.scala:122)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.api.BuiltInComponents.application(Application.scala:324)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.api.BuiltInComponents.application$(Application.scala:323)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.api.BuiltInComponentsFromContext.application$lzycompute(ApplicationLoader.scala:122)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.api.BuiltInComponentsFromContext.application(ApplicationLoader.scala:122)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at dcos.metronome.JobApplicationLoader.load(JobApplicationLoader.scala:34)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.core.server.ProdServerStart$.start(ProdServerStart.scala:51)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.core.server.ProdServerStart$.main(ProdServerStart.scala:25)
      Oct 04 13:17:40 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at play.core.server.ProdServerStart.main(ProdServerStart.scala)
      Oct 04 13:17:41 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: [warn] o.a.z.ClientCnxn - Session 0x0 for server zk-4.zk/172.28.0.2:2181, unexpected error, closing socket connection and attempting reconnect
      Oct 04 13:17:41 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: java.net.ConnectException: Connection refused
      Oct 04 13:17:41 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      Oct 04 13:17:41 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
      Oct 04 13:17:41 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:356)
      Oct 04 13:17:41 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1192)
      Oct 04 13:17:43 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: [warn] o.a.z.ClientCnxn - Session 0x0 for server zk-4.zk/172.28.0.2:2181, unexpected error, closing socket connection and attempting reconnect
      Oct 04 13:17:43 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: java.net.ConnectException: Connection refused
      Oct 04 13:17:43 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      Oct 04 13:17:43 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
      Oct 04 13:17:43 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:356)
      Oct 04 13:17:43 dcos-e2e-c6039d3e-a3bb-4a0b-bc6f-519bf3c96e0b-master-0 metronome[15546]: at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1192)
      

      When Play boots, play.core.server.ProdServerStart.main calls JobApplicationLoader.load, but jobComponents.schedulerService.run() is then executed in a separate thread. When that call throws an exception, Metronome should shut down, but it does not.
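      The expected behaviour can be illustrated with a minimal Java sketch. It is not Metronome's code or its actual fix: the class name `CrashOnStartupFailure`, the helper `startOrCrash`, and the thread name are hypothetical, and the failing `Runnable` only stands in for jobComponents.schedulerService.run(). The idea is that a service started on its own thread gets an uncaught-exception handler that requests process exit, so a startup failure is not confined to the background thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

public class CrashOnStartupFailure {

    /**
     * Runs `service` on its own thread. If it throws, `exit` is called with a
     * non-zero status. `exit` is injected so the behaviour can be exercised
     * without terminating the JVM; a real service would pass System::exit.
     */
    public static Thread startOrCrash(Runnable service, IntConsumer exit) {
        Thread t = new Thread(service, "scheduler-service"); // hypothetical name
        t.setUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Fatal error in " + thread.getName() + ": " + error);
            exit.accept(1);
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a scheduler service that fails during startup.
        Runnable failingService = () -> {
            throw new IllegalStateException("Timed out while waiting for zookeeper connection");
        };

        // Record the requested exit status instead of exiting, for demonstration.
        AtomicInteger status = new AtomicInteger(0);
        Thread t = startOrCrash(failingService, status::set);
        t.join();
        System.out.println("exit status requested: " + status.get());
    }
}
```

      The handler runs on the failing thread before that thread terminates, so by the time `join()` returns the exit request has already been made. Whether to exit, halt, or first attempt a clean shutdown is a design choice this sketch leaves open.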

              People

              • Assignee: Alena Varkockova
              • Reporter: Alena Varkockova
              • Team: Orchestration Team
              • Watchers: Alena Varkockova, Jan-Philip Gehrcke, Matthias Eichstedt, Mergebot, Tim Harper