We are seeing an issue in one of our environments where tasks seem to be disappearing from our Marathon instance (five tasks today alone).
We have seen several different errors occurring in our logs recently.
First, we have seen instances of our marathon leaders committing suicide and abdicating leadership to another master, sometimes more than once in a day:
Oct 4 17:59:14 marathon-n001 marathon: [2017-10-04 17:59:11,350] ERROR Committing suicide! (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$cbb97c3f:Thread-2752)
Oct 4 20:33:14 marathon-n003 marathon: [2017-10-04 20:33:11,364] ERROR Committing suicide! (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$cbb97c3f:Thread-2963)
Oct 4 23:35:43 marathon-n001 marathon: [2017-10-04 23:35:43,962] ERROR ZooKeeper access failed - Committing suicide to avoid invalidating ZooKeeper state (mesosphere.marathon.core.election.impl.CuratorElectionService:pool-4-thread-1-EventThread)
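To see whether the suicides line up with ZooKeeper connection trouble, we've been pulling the relevant lines out of syslog with a small helper. A sketch of what we run (the log path and the `correlate` name are just our setup; adjust the patterns to taste):

```shell
# Pull Marathon suicide events and Curator/ZooKeeper connection events out of
# one log file so they can be compared on their timestamps.
# Defaults to our syslog file; pass another path as the first argument.
correlate() {
  grep -hE 'Committing suicide|ZooKeeper access failed|ConnectionState' \
    "${1:-/var/log/messages}"
}
# Example: correlate /var/log/messages
```

So far the suicides we've checked this way do appear close to Curator connection-state events, which is what pushed us toward the timeout theory below.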
Second, we're seeing a recurring error when Marathon loads one of our groups from the repository:
Oct 4 20:33:22 marathon-n001 marathon: [2017-10-04 20:33:22,438] ERROR Failed to load /staging/application/assets-elastic-doc-micro:2017-09-26T13:57:01.698Z for group /staging/application (2017-09-26T15:49:27.606Z) (mesosphere.marathon.storage.repository.StoredGroup:ForkJoinPool-2-worker-3)
When a leader commits suicide, we have also seen errors retrieving stored deployment plans and failures to recover deployments:
Oct 4 23:33:33 marathon-n001 marathon: [2017-10-04 23:33:33,896] ERROR While retrieving 15cbf208-8e60-4f2a-b69d-fb2929d44997, either original (None) or target (None) were no longer available (mesosphere.marathon.storage.repository.StoredPlan:ForkJoinPool-2-worker-27)
Oct 4 23:33:33 marathon-n001 marathon: [2017-10-04 23:33:33,899] ERROR Failed to recover deployments from repository: java.lang.IllegalStateException: Missing target or original (mesosphere.marathon.upgrade.DeploymentManager:marathon-akka.actor.default-dispatcher-15)
Third, we're seeing ZooKeeper connection timeout errors:
Oct 4 23:35:31 marathon-n001 marathon: [2017-10-04 23:35:29,900] ERROR Connection timed out for connection string (IP1:2181,IP2:2181,IP3:2181) and timeout (15000) / elapsed (10782) (org.apache.curator.ConnectionState:main-EventThread)
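In the meantime we've been spot-checking whether the masters can reach the ensemble at all. A rough sketch of the probe we use (hypothetical helper; `IP1`/`IP2`/`IP3` stand in for our real server list, and `ruok` is a standard ZooKeeper four-letter-word command):

```shell
# Ask each ZooKeeper node "ruok"; a healthy node answers "imok".
# Anything else suggests a network problem between this master and ZooKeeper,
# or that four-letter-word commands are disabled on that node.
zk_probe() {
  for host in "$@"; do
    reply=$(echo ruok | nc -w 2 "$host" 2181 2>/dev/null)
    if [ "$reply" = "imok" ]; then
      echo "$host: ok"
    else
      echo "$host: no reply (network problem, or 4lw commands disabled)"
    fi
  done
}
# On a real master: zk_probe IP1 IP2 IP3
```

`echo stat | nc <host> 2181` gives latency min/avg/max and leader/follower mode on the same port, which is useful when the nodes answer but slowly.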
We have just increased /etc/marathon/conf/zk_timeout to 30000 milliseconds to see whether this has any impact, and we are monitoring the Marathon instances now.
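For reference, the change itself is just a flag file, assuming the stock package layout where Marathon's startup script turns each file under /etc/marathon/conf into a command-line flag. (The relative default directory below is only so the sketch runs outside a real master.)

```shell
# Each file under the conf dir becomes --<filename> <contents>,
# so this file yields --zk_timeout 30000 on the Marathon command line.
CONF_DIR="${MARATHON_CONF_DIR:-./marathon-conf}"  # /etc/marathon/conf on a real master
mkdir -p "$CONF_DIR"
echo 30000 > "$CONF_DIR/zk_timeout"
# Then restart Marathon on each master, e.g.: systemctl restart marathon
```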
We are not sure at this point why our Marathon masters are restarting. This Marathon instance currently runs roughly 700 applications.
About 45 days ago, we upgraded our marathon version from 1.3.6 to 1.4.7.
I've been looking around and found a similar issue in MARATHON-7401; however, it looks as though that issue was already resolved and a fix merged into releases/1.4.
Are there any particular things I could do to diagnose this issue?