A Mesos DNS instance may miss a Mesos leader change, because it appears to rely solely on the ZooKeeper (ZK) watch mechanism for detecting a state transition.
Our understanding is that the watch mechanism does not provide the delivery guarantees it appears to promise (quotes/references below).
To make Mesos DNS pick up a leader change more reliably, we could, and maybe should, make it periodically poll the relevant ZK node(s). This polling does not need to happen at high frequency; once every 5 or 10 seconds, for instance, could suffice. Relevant discussion from #dcos-networking: https://mesosphere.slack.com/archives/dcos-networking/p1486509188005541
Excerpt from there:
Some relevant quotes and references on the topic of delivery guarantees associated with the ZK watch mechanism:
Quoting Tyler N from a relevant discussion:
It is at-most-once delivery semantics. You should expect TCP connections to drop at any time. If you are only using watches, and your client reconnects after an issue, you have no idea of knowing what happened before you reconnected. This is one of those things that people writing distributed systems mess up frequently when using ZK because it works great in test environments, and will usually only cause really annoying bugs at higher scales when this rare event pops up, but it does pop up. At my last job we noticed a sharded redis proxy system occasionally would fail to get a watch that notified it about replica master changes, which caused really annoying data inconsistency issues, so we ended up completely removing watches in favor of polling. Twitter also moved to a poll-heavy service discovery system after a number of problems with zk watches and general scalability problems they ran into.
This is the only 'worrying' input that I am aware of, and it is rather convincing. Polling yields easy-to-understand guarantees.
However, looking at slightly more official resources, it seems that with really careful client design it might actually be possible to rely on watches, combined with proper session state transition handling, automatic watch re-registration, and an additional safety read after reconnect:
From the ZK docs:
When you disconnect from a server (for example, when the server fails), you will not get any watches until the connection is reestablished. For this reason session events are sent to all outstanding watch handlers. Use session events to go into a safe mode: you will not be receiving events while disconnected, so your process should act conservatively in that mode.
From a ZK book:
Say that a ZooKeeper client disconnects from a ZooKeeper server and connects to a different server in the ensemble. The client will send a list of outstanding watches. When reregistering the watch, the server will check to see if the watched znode has changed since the watch was registered. If the znode has changed, a watch event will be sent to the client; otherwise, the watch will be reregistered at the new server.
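The careful-client pattern described in these two quotes could be sketched roughly as follows (the event types and the `read` hook are hypothetical abstractions, not the real go-zookeeper API): on a session event signalling disconnection, enter a conservative "safe mode"; on reconnection, perform an explicit safety read of the leader znode rather than trusting the pre-disconnect state; on every watch event, read and re-arm, since ZK watches are one-shot:

```go
package main

import "fmt"

// EventKind is a simplified stand-in for ZK client session/watch events
// (hypothetical, not the actual go-zookeeper event types).
type EventKind int

const (
	Disconnected EventKind = iota // session event: connection lost
	Reconnected                   // session event: connection reestablished
	NodeChanged                   // watch fired on the leader znode
)

type Event struct{ Kind EventKind }

// LeaderTracker applies the safe-mode + safety-read pattern: while
// disconnected its view is marked stale; after every reconnect (and after
// every watch event) it re-reads the znode, which also re-arms the watch.
type LeaderTracker struct {
	read   func() string // explicit read of the leader znode; re-arms the watch
	Leader string
	Stale  bool // true while in safe mode (view may be outdated)
}

func (t *LeaderTracker) Handle(e Event) {
	switch e.Kind {
	case Disconnected:
		// No watch events can arrive while disconnected; act conservatively.
		t.Stale = true
	case Reconnected:
		// Safety read: a change may have been missed while disconnected,
		// so never trust the pre-disconnect state.
		t.Leader = t.read()
		t.Stale = false
	case NodeChanged:
		// Watches are one-shot: read the new value and re-register.
		t.Leader = t.read()
	}
}

func main() {
	// Simulated znode contents: the leader changes while we are disconnected.
	values := []string{"master-1", "master-2"}
	i := 0
	read := func() string { v := values[i]; if i < len(values)-1 { i++ }; return v }
	t := &LeaderTracker{read: read}
	t.Handle(Event{NodeChanged})  // initial watch fires
	t.Handle(Event{Disconnected}) // TCP drop: enter safe mode
	t.Handle(Event{Reconnected})  // safety read picks up the missed change
	fmt.Println(t.Leader, t.Stale) // master-2 false
}
```

The crucial step is the read on `Reconnected`: relying on the reconnect itself to replay a missed watch is exactly the at-most-once gap described in the quote above.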