MARATHON-7479

Performance regression in serial deployments

    Details

    • Sprint:
      Marathon Sprint 1.10-6
    • Story Points:
      2
    • Zendesk Ticket IDs:
      8114

      Description

      In Marathon 1.1.x, an optimization made state diff calculations efficient (see the compare and hash methods for AppDefinition). It was removed in 1.3.x because the shortcuts taken in the hash and equals methods led to surprising behavior.

      Short term, we should find a way to re-introduce these optimizations in a safer way.
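      One way to read "safer" is a shortcut that can only ever skip work, never change the answer. The sketch below illustrates that idea in Python rather than Marathon's Scala; the AppDefinition class, its fields, and is_unchanged are hypothetical stand-ins, not Marathon's actual code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AppDefinition:
    """Hypothetical stand-in for an app definition; fields are illustrative only."""
    id: str
    cmd: str
    instances: int
    env: Tuple[Tuple[str, str], ...] = ()


def is_unchanged(old: AppDefinition, new: AppDefinition) -> bool:
    # Fast path: the very same object cannot differ from itself,
    # so this shortcut never produces a wrong answer.
    if old is new:
        return True
    # Slow path: full field-by-field comparison, never skipped.
    return old == new
```

      The fast path only helps when unchanged apps are carried over as the same object between the old and new state; whether that holds in Marathon is an assumption of this sketch.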

      Long term, we should consider ways to leverage existing knowledge about what was changed in order to prune the amount of work needed to compute the state diffs.
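      As a rough illustration of pruning, a diff can be restricted to the ids a request is known to have touched instead of comparing every app. This is a minimal Python sketch under stated assumptions: apps are modeled as plain dictionaries keyed by id, and changed_app_ids and touched_ids are hypothetical names, not part of Marathon.

```python
from typing import Any, Dict, Iterable, Set


def changed_app_ids(
    old_apps: Dict[str, Any],
    new_apps: Dict[str, Any],
    touched_ids: Iterable[str],
) -> Set[str]:
    """Return the ids among touched_ids whose definition actually differs.

    Assumes the caller's touched_ids is complete: any id left out of it
    is treated as unchanged without being compared.
    """
    changed = set()
    for app_id in touched_ids:
        # .get() returns None for a missing id, so additions and
        # removals also show up as differences.
        if old_apps.get(app_id) != new_apps.get(app_id):
            changed.add(app_id)
    return changed
```

      The cost then scales with the number of touched apps rather than with the total number of apps, at the price of trusting the caller's list of touched ids.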

      Acceptance criteria:

      [ ] - JMH Benchmark for deployments of apps
      [ ] - Profile to find low-hanging-fruit optimizations
      [ ] - Optimize deployment plan computation for apps, ideally by an order of magnitude
      [ ] - Do the above for pods as well

      A branch is in progress, here: https://github.com/mesosphere/marathon/tree/tharper/1.4-optimize-dependency-plan-computation

      Original description

      Frequent 503s from the leader on non-GET requests to /v2/tasks and /v2/apps, causing app pileups at Yelp

      Here is the stacktrace for SEO:

      [2017-06-07 17:28:35,199] ERROR Exception while processing request (mesosphere.marathon.api.MarathonExceptionMapper$$EnhancerByGuice$$9833dc0d:qtp1329043305-177960)
      java.util.concurrent.TimeoutException: Futures timed out after [30000 milliseconds]
      at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
      at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
      at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
      at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
      at scala.concurrent.Await$.result(package.scala:190)
      at mesosphere.marathon.api.RestResource$class.result(RestResource.scala:59)
      at mesosphere.marathon.api.v2.TasksResource.result(TasksResource.scala:31)
      at mesosphere.marathon.DebugModule$MetricsBehavior$$anonfun$invoke$1.apply(DebugConf.scala:82)
      at mesosphere.marathon.metrics.Metrics.timed(Metrics.scala:28)
      at mesosphere.marathon.DebugModule$MetricsBehavior.invoke(DebugConf.scala:82)
      at mesosphere.marathon.api.v2.TasksResource$$anonfun$killTasks$1.apply(TasksResource.scala:174)
      at mesosphere.marathon.api.v2.TasksResource$$anonfun$killTasks$1.apply(TasksResource.scala:130)
      at mesosphere.marathon.api.AuthResource$$anonfun$authenticated$1.apply(AuthResource.scala:27)
      at mesosphere.marathon.api.AuthResource$$anonfun$authenticated$1.apply(AuthResource.scala:25)
      at scala.Option.map(Option.scala:146)
      at mesosphere.marathon.api.AuthResource$class.authenticated(AuthResource.scala:25)
      at mesosphere.marathon.api.v2.TasksResource.authenticated(TasksResource.scala:31)
      at mesosphere.marathon.DebugModule$MetricsBehavior$$anonfun$invoke$1.apply(DebugConf.scala:82)
      at mesosphere.marathon.metrics.Metrics.timed(Metrics.scala:28)
      at mesosphere.marathon.DebugModule$MetricsBehavior.invoke(DebugConf.scala:82)
      at mesosphere.marathon.api.v2.TasksResource.killTasks(TasksResource.scala:130)
      at mesosphere.marathon.DebugModule$MetricsBehavior$$anonfun$invoke$1.apply(DebugConf.scala:82)
      at mesosphere.marathon.metrics.Metrics.timed(Metrics.scala:28)
      at mesosphere.marathon.DebugModule$MetricsBehavior.invoke(DebugConf.scala:82)
      at sun.reflect.GeneratedMethodAccessor190.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
      at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
      at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
      at com.codahale.metrics.jersey.InstrumentedResourceMethodDispatchProvider$TimedRequestDispatcher.dispatch(InstrumentedResourceMethodDispatchProvider.java:30)
      at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
      at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
      at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
      at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
      at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
      at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
      at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
      at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
      at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
      at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
      at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
      at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
      at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
      at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
      at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
      at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
      at mesosphere.marathon.api.CacheDisablingFilter.doFilter(CacheDisablingFilter.scala:18)
      at mesosphere.marathon.DebugModule$MetricsBehavior$$anonfun$invoke$1.apply(DebugConf.scala:82)
      at mesosphere.marathon.metrics.Metrics.timed(Metrics.scala:28)
      at mesosphere.marathon.DebugModule$MetricsBehavior.invoke(DebugConf.scala:82)
      at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
      at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
      at mesosphere.marathon.api.CORSFilter.doFilter(CORSFilter.scala:45)
      at mesosphere.marathon.DebugModule$MetricsBehavior$$anonfun$invoke$1.apply(DebugConf.scala:82)
      at mesosphere.marathon.metrics.Metrics.timed(Metrics.scala:28)
      at mesosphere.marathon.DebugModule$MetricsBehavior.invoke(DebugConf.scala:82)
      at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
      at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
      at mesosphere.marathon.api.LimitConcurrentRequestsFilter.pass(LimitConcurrentRequestsFilter.scala:30)
      at mesosphere.marathon.api.LimitConcurrentRequestsFilter$$anonfun$3$$anonfun$apply$2.apply(LimitConcurrentRequestsFilter.scala:15)
      at mesosphere.marathon.api.LimitConcurrentRequestsFilter$$anonfun$3$$anonfun$apply$2.apply(LimitConcurrentRequestsFilter.scala:15)
      at mesosphere.marathon.api.LimitConcurrentRequestsFilter.doFilter(LimitConcurrentRequestsFilter.scala:34)
      at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
      at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
      at mesosphere.marathon.api.LeaderProxyFilter.doFilter(LeaderProxyFilter.scala:98)
      at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
      at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
      at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
      at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
      at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
      at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
      at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:513)
      at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
      at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
      at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1090)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
      at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:119)
      at com.codahale.metrics.jetty9.InstrumentedHandler.handle(InstrumentedHandler.java:240)
      at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
      at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:375)
      at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:119)
      at org.eclipse.jetty.server.Server.handle(Server.java:517)
      at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
      at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:242)
      at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:261)
      at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
      at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:75)
      at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:213)
      at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:147)
      at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
      at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
      at java.lang.Thread.run(Thread.java:745)
      [2017-06-07 17:28:35,199] INFO 10.64.84.139 - admin [08/Jun/2017:00:28:05 +0000] "POST //10.64.84.139:5052/v2/tasks/delete?force=True&scale=True HTTP/1.1" 503 58 "-" "PaaSTA Tools 0.64.0 setup_marathon_job" 3000 (mesosphere.chaos.http.ChaosRequestLog$$EnhancerByGuice$$bd50692:qtp1329043305-177960)

      It has happened pretty frequently since we upgraded to Marathon 1.4.3 on our largest cluster (~800 apps).

      To summarize the endpoints we see this on:

      • /v2/apps - PUT,POST,DELETE
      • /v2/tasks - POST, though probably only because that is the only write method we use on that endpoint

      Luckily we can get these endpoints to go through "eventually", but it does impair our ability to make changes quickly.
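      A client-side workaround of this kind usually amounts to retrying on 503 with a growing delay. The following is a minimal Python sketch, not what PaaSTA actually does: call_with_retry, its parameters, and the send callable (which is assumed to perform the request and return the HTTP status code) are all hypothetical.

```python
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], int],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns something other than 503, up to `attempts` times.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so on
    between tries. Returns the last status code seen.
    """
    status = send()
    for attempt in range(1, attempts):
        if status != 503:
            break
        # Exponential backoff before the next try.
        sleep(base_delay * 2 ** (attempt - 1))
        status = send()
    return status
```

      Retrying only masks the slow deployment plan computation on the server, and each retry adds another request to an already busy leader.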

      We know we cannot (easily) roll back to 1.1.

      We have a hunch that we could tune this with the group_manager_request_timeout option, but we are guessing at this point and are looking for help tuning Marathon to support our workload in this post-1.1 world.

              People

              • Assignee:
                tharper Tim Harper
              • Reporter:
                solarkennedy Kyle Anderson
              • Team:
                Orchestration Team
              • Watchers:
                Bekir Dogan, Chmielewski, EvanKrall, Ivan Chernetsky, janisz, Ken Sipe, Kyle Anderson, Nathan Handler, sameaton, Tim Harper, Tobi Knaup
