  1. Marathon
  2. MARATHON-7167

Marathon unable to recover from ZK outage

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: Marathon 1.4.1, Marathon 1.4.7
    • Fix Version/s: None
    • Component/s: Leader Election
    • Labels: None

      Description

      Marathon version 1.4.1

      We had a ZK outage over the weekend. After resolving it, we hoped that Marathon would recover, but it failed with the following exceptions:

      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: [2017-03-26 23:34:53,680] INFO InstanceTrackerActor is starting. Task loading initiated. (mesosphere.marathon.core.task.tracker.impl.InstanceTrackerActor:marathon-akka.actor.default-dispatcher-45)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: [2017-03-26 23:34:53,770] ERROR while loading tasks (akka.actor.OneForOneStrategy:marathon-akka.actor.default-dispatcher-4)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at mesosphere.marathon.core.task.tracker.impl.InstanceTrackerActor$$anonfun$initializing$1.applyOrElse(InstanceTrackerActor.scala:111)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at mesosphere.marathon.core.task.tracker.impl.InstanceTrackerActor.aroundReceive(InstanceTrackerActor.scala:70)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: Caused by: play.api.libs.json.JsResultException: JsResultException(errors:List((/tasksMap/schematizer.canary.git6b5ddcc9.configd3c7cdbb.185c14d7-0979-11e7-aa4f-5efa5b7db366/reservation,List(ValidationError(List(error.path.missing),WrappedArray()))), (/tasksMap/schematizer.canary.git6b5ddcc9.configd3c7cdbb.185c14d7-0979-11e7-aa4f-5efa5b7db366/status/networkInfo,List(ValidationError(List(error.path.missing),WrappedArray())))))
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at play.api.libs.json.JsReadable$$anonfun$2.apply(JsReadable.scala:23)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at play.api.libs.json.JsError.fold(JsResult.scala:13)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at play.api.libs.json.JsObject.as(JsValue.scala:76)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at mesosphere.marathon.storage.store.ZkStoreSerialization$$anonfun$8.apply(ZkStoreSerialization.scala:80)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at akka.http.scaladsl.unmarshalling.Unmarshaller$$anon$1.apply(Unmarshaller.scala:55)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at mesosphere.marathon.core.storage.store.impl.BasePersistenceStore$stateMachine$macro$271$1.apply(BasePersistenceStore.scala:89)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at mesosphere.marathon.core.storage.store.impl.BasePersistenceStore$stateMachine$macro$271$1.apply(BasePersistenceStore.scala:85)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: [2017-03-26 23:34:53,781] INFO InstanceTrackerActor is starting. Task loading initiated. (mesosphere.marathon.core.task.tracker.impl.InstanceTrackerActor:marathon-akka.actor.default-dispatcher-40)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: ... 2 common frames omitted
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: [2017-03-26 23:34:53,883] ERROR while loading tasks (akka.actor.OneForOneStrategy:marathon-akka.actor.default-dispatcher-22)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at akka.actor.Actor$class.aroundReceive(Actor.scala:484)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at akka.actor.ActorCell.invoke(ActorCell.scala:495)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at play.api.libs.json.JsReadable$$anonfun$2.apply(JsReadable.scala:23)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at play.api.libs.json.JsResult$class.fold(JsResult.scala:73)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at play.api.libs.json.JsReadable$class.as(JsReadable.scala:21)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at play.api.libs.json.JsObject.as(JsValue.scala:76)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at akka.http.scaladsl.unmarshalling.Unmarshaller$$anonfun$strict$1$$anonfun$apply$13.apply(Unmarshaller.scala:62)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at akka.http.scaladsl.unmarshalling.Unmarshal.to(Unmarshal.scala:19)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at mesosphere.marathon.core.storage.store.impl.BasePersistenceStore$stateMachine$macro$271$1.apply(BasePersistenceStore.scala:89)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
      2017-03-27T06:34:53+00:00 paasta1-uswest1cdevc marathon[65816]: at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
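
      The JsResultException in the trace above names the JSON paths that could not be read: `reservation` and `status/networkInfo` under one `tasksMap` entry. When many such lines are logged, listing the affected task IDs and missing fields by script may help check whether the failures share a pattern. The following is a minimal sketch, not part of Marathon: the helper name `missing_paths` and the regular expression are assumptions based only on the message format shown in this log.

```python
import re

# One entry looks like: (/tasksMap/<task id>/<field>,List(ValidationError(List(<error key>),...
# This pattern is an assumption derived from the log line in this report.
_ENTRY = re.compile(r"\((/[^,]+),List\(ValidationError\(List\(([^)]*)\)")


def missing_paths(log_line):
    """Return (task_id, field_path, error_key) tuples found in a JsResultException message."""
    results = []
    for path, error in _ENTRY.findall(log_line):
        parts = path.split("/")  # ['', 'tasksMap', '<task id>', '<field>', ...]
        if len(parts) >= 4 and parts[1] == "tasksMap":
            results.append((parts[2], "/".join(parts[3:]), error))
        else:
            # Unknown layout: keep the whole path and leave the task id empty.
            results.append((None, path.lstrip("/"), error))
    return results


TASK = "schematizer.canary.git6b5ddcc9.configd3c7cdbb.185c14d7-0979-11e7-aa4f-5efa5b7db366"
SAMPLE = (
    "Caused by: play.api.libs.json.JsResultException: JsResultException(errors:List("
    "(/tasksMap/" + TASK + "/reservation,"
    "List(ValidationError(List(error.path.missing),WrappedArray()))), "
    "(/tasksMap/" + TASK + "/status/networkInfo,"
    "List(ValidationError(List(error.path.missing),WrappedArray())))))"
)

for task_id, field, error in missing_paths(SAMPLE):
    print(task_id, field, error)
```

      Run over the full log, this gives one row per unreadable field, which can then be grouped by task ID or by field name.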

      We tried rm'ing some of the nodes corresponding to the failed tasks in ZK:

      robj@paasta1-uswest1adevc:~ % zkcmd --cluster-type=infrastructure -l /marathon-norcal-devc/state/apps/b/schematizer.canary.git6b5ddcc9.configd3c7cdbb
      Using the infrastructure local cluster config
      Using cluster [('10.40.5.5', 22181), ('10.40.5.6', 22181), ('10.40.1.17', 22181)]
      Getting children stored at node /marathon-norcal-devc/state/apps/b/schematizer.canary.git6b5ddcc9.configd3c7cdbb
      ==> ['2017-03-15T12:15:34.705Z', '2017-03-16T22:30:30.142Z', '2017-03-15T16:13:21.747Z']
      robj@paasta1-uswest1adevc:~ % zkcmd --cluster-type=infrastructure -g /marathon-norcal-devc/state/apps/b/schematizer.canary.git6b5ddcc9.configd3c7cdbb/2017-03-15T16:13:21.747Z
      Using the infrastructure local cluster config
      Using cluster [('10.40.5.5', 22181), ('10.40.5.6', 22181), ('10.40.1.17', 22181)]
      Getting node at /marathon-norcal-devc/state/apps/b/schematizer.canary.git6b5ddcc9.configd3c7cdbb/2017-03-15T16:13:21.747Z
      ==> ('\n./schematizer.canary.git6b5ddcc9.configd3c7cdbb\x12\xdc\x01\n\x1f\n\x17file:///root/.dockercfg\x10\x00\x18\x00 \x00\x12\xb6\x01\n\x1d\n\x0ePAASTA_CLUSTER\x12\x0bnorcal-devc\n[\n\x13PAASTA_DOCKER_IMAGE\x12Dservices-schematizer:paasta-6b5ddcc9e957496e2922b5a3e3796479c5bb4902\n\x19\n\x0fPAASTA_INSTANCE\x12\x06canary\n\x1d\n\x0ePAASTA_SERVICE\x12\x0bschematizer0\x00\x18\x01"\x16\n\x04cpus\x10\x00\x1a\t\t\x00\x00\x00\x00\x00\x00\xf0?2\x01*"\x15\n\x03mem\x10\x00\x1a\t\t\x00\x00\x00\x00\x00\x00\x90@2\x01*"\x16\n\x04disk\x10\x00\x1a\t\t\x00\x00\x00\x00\x00\x00\x90@2\x01*"\x16\n\x04gpus\x10\x00\x1a\t\t\x00\x00\x00\x00\x00\x00\x00\x002\x01*:\x12\n\x0bsuperregion\x10\x03\x1a\x011:\x18\n\x06region\x10\x04\x1a\x0cuseast1-prod:\x11\n\x04pool\x10\x01\x1a\x07defaultB\x00Z\x182017-03-15T16:13:21.747Zb\x17\x08\x00\x10\x00\x18< \n(\n2\x07/status8\x1eH\x00h\x90Nq\x00\x00\x00\x00\x00\x00\x00@z\x12\t\x00\x00\x00\x00\x00\x00\xf0?\x11\x00\x00\x00\x00\x00\x00\xf0?\x90\x01\x00\x9a\x01\x8a\x07\x08\x01\x12 
\n\r/etc/boto_cfg\x12\r/etc/boto_cfg\x18\x02\x12"\n\x0e/nail/bulkdata\x12\x0e/nail/bulkdata\x18\x02\x12.\n\x14/nail/etc/datacenter\x12\x14/nail/etc/datacenter\x18\x02\x12,\n\x13/nail/etc/ecosystem\x12\x13/nail/etc/ecosystem\x18\x02\x12(\n\x11/nail/etc/habitat\x12\x11/nail/etc/habitat\x18\x02\x12:\n\x1a/nail/etc/kafka_discovery/\x12\x1a/nail/etc/kafka_discovery/\x18\x02\x12&\n\x10/nail/etc/region\x12\x10/nail/etc/region\x18\x02\x12.\n\x14/nail/etc/runtimeenv\x12\x14/nail/etc/runtimeenv\x18\x02\x12*\n\x12/nail/etc/services\x12\x12/nail/etc/services\x18\x02\x120\n\x15/nail/etc/superregion\x12\x15/nail/etc/superregion\x18\x02\x122\n\x16/nail/etc/topology_env\x12\x16/nail/etc/topology_env\x18\x02\x12B\n\x1e/nail/etc/zookeeper_discovery/\x12\x1e/nail/etc/zookeeper_discovery/\x18\x02\x12\x18\n\t/nail/srv\x12\t/nail/srv\x18\x02\x12:\n\x1a/var/run/synapse/services/\x12\x1a/var/run/synapse/services/\x18\x02\x128\n\x19/var/run/synapse/sockets/\x12\x19/var/run/synapse/sockets/\x18\x02\x1a\xbd\x01\ncdocker-paasta.yelpcorp.com:443/services-schematizer:paasta-6b5ddcc9e957496e2922b5a3e3796479c5bb4902\x10\x02\x1a\x0e\x08\x00\x10\xb8E\x1a\x03tcp\xa0\x06\x92P \x00*\x14\n\x0bmemory-swap\x12\x051024m*\x14\n\ncpu-period\x12\x06100000*\x14\n\tcpu-quota\x12\x0710000000\x00\xa8\x01\xe0\xa7\x12\xb8\x01\xd3\x87\xfb\x95\xad+\xc0\x01\xd3\x87\xfb\x95\xad+\xda\x01\x08\x08\x92P\x1a\x03tcp', Stat(aversion=0, ctime=1489594401803L, cversion=0, czxid=115975009675L, dataLength=1463, ephemeralOwner=0L, mtime=1489594401835L, mzxid=115975009681L, numChildren=0, pzxid=115975009675L, version=1))
      
      robj@paasta1-uswest1adevc:~ % zkcmd --cluster-type=infrastructure -d /marathon-norcal-devc/state/apps/b/schematizer.canary.git6b5ddcc9.configd3c7cdbb

      We were hoping that the problem was isolated to specific tasks, but although Marathon would 'progress' to another task, it would hit the same problem again. We saw no pattern among the failing tasks, so we assume this would have happened for every task. To resolve this, we had to remove all of Marathon's state from ZK, causing an outage for the cluster.

      Luckily, this happened in a dev cluster. We'd really like to avoid using the same big hammer in production. Can you help us debug the error message and advise us whether there is a 'safe' way of resolving it?

       

      Thanks.
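
      Since the workaround described above involved deleting nodes from ZK, one precaution is to copy a subtree's data before removing it, so it can be inspected or restored later. The following is a minimal sketch under stated assumptions, not a Marathon-supported procedure: `snapshot_subtree` is a hypothetical helper, and `client` is assumed to be any object with kazoo-style `get(path)` (returning a `(data, stat)` pair) and `get_children(path)` methods.

```python
def snapshot_subtree(client, path):
    """Recursively copy znode data under `path` into a dict keyed by full path.

    `client` is assumed to expose kazoo-style get() and get_children().
    """
    data, _stat = client.get(path)
    snapshot = {path: data}
    for child in client.get_children(path):
        child_path = path.rstrip("/") + "/" + child
        snapshot.update(snapshot_subtree(client, child_path))
    return snapshot
```

      The returned dict holds raw bytes per node. It does not capture ACLs or node versions, so it is a record for inspection rather than a complete backup.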

              People

              • Assignee: Ivan Chernetsky (ivanchernetsky)
              • Reporter: Rob Johnson (Rob-Johnson)
              • Team: Orchestration Team
              • Watchers: Ivan Chernetsky, Jason Gilanfarr (Inactive), Ken Sipe, Matthias Eichstedt, Nathan Handler, Rob Johnson
