Resolution: Cannot Reproduce
Affects Version/s: None
Fix Version/s: None
Component/s: Leader Election
At Yelp we still have one large Marathon cluster running 1.1.1. Last week we saw some very weird behaviour after a leadership election caused by a restart of our ZooKeeper cluster.
I've been reading the logs and I'll try to post some snippets below that back up what happened.
- Both Mesos and Marathon lose connectivity to ZooKeeper:
- Both pick a new leader:
- The new Marathon leader loads all the tasks and starts up relatively normally
- The new leader has some trouble authenticating with Mesos and commits suicide:
- A second new leader starts up but for some reason it never seems to start the TaskTracker
- This turns out to be very bad because, with no tasks loaded, Marathon now continues with some of its startup actions. Firstly it decides to launch everything again:
- Even worse, status updates now flood in for the existing tasks, and Marathon doesn't recognize them, so it decides to kill them
This is obviously pretty bad.
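To make the suspected failure mode concrete, here is a toy model of the last two steps. It is only a sketch: `ToyLeader`, `load_tasks`, `on_status_update` and the `guard` flag are hypothetical names invented for illustration, not Marathon's actual classes or code path. It shows how a leader with empty in-memory state treats every incoming status update as an unknown task, and how deferring updates until the state is loaded would avoid that.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLeader:
    """Hypothetical stand-in for a scheduler leader; not Marathon's real classes."""
    tasks_loaded: bool = False
    known_tasks: set = field(default_factory=set)
    killed: list = field(default_factory=list)
    deferred: list = field(default_factory=list)

    def load_tasks(self, persisted):
        # Populate in-memory state from the persisted store, then mark it ready.
        self.known_tasks = set(persisted)
        self.tasks_loaded = True

    def on_status_update(self, task_id, guard=True):
        # With the guard on, updates that arrive before the state is loaded
        # are held back instead of being treated as unknown tasks.
        if guard and not self.tasks_loaded:
            self.deferred.append(task_id)
            return "deferred"
        # A task the leader has no record of gets killed.
        if task_id not in self.known_tasks:
            self.killed.append(task_id)
            return "killed"
        return "accepted"
```

Without the guard, a leader that never loaded its tasks kills a perfectly healthy task on the first status update, which matches the symptom described above. Whether 1.1.1 lacks an equivalent check is exactly the part I can't confirm from the code.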
Our plan is to upgrade to 1.4.6 when it gets released. In the meantime it would be great to confirm what happened here. I've been poking around the leadership election code, but I can't see why the TaskTracker wouldn't start up, though I'm really no expert. I have a hunch you may have already fixed this in a later release, but given how bad the symptoms are, it would be good to confirm that this bug no longer exists.