Marathon / MARATHON-7564

Zombie Marathon if ZK DNS is unresolvable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: Marathon 1.4.7
    • Fix Version/s: None
    • Component/s: Persistence
    • Labels:

      Description

      Several related issues have been filed here and some work has been done, but there may still be an edge case where failed ZooKeeper (ZK) DNS resolution leads to a zombie Marathon: the process doesn't die, but it is no longer leading.

      We should confirm whether Curator fails to apply the retry policy while DNS resolution is failing. This is a problem because we depend on our retry policy to tell Marathon to commit suicide. We believe leader abdication now works because the leader crashes, but standby instances taking part in the Curator election could become zombies, and if the leader restarts, it could just as well become a new standby zombie.
       
      Once that behavior is confirmed, let's work with the Curator maintainers to get a fix in. If it is what I think it is, the fix should be simple: they may just have forgotten proper exception handling around the DNS resolution portion of the connection routine.
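
      To illustrate the behavior we would expect, here is a minimal sketch in Python. It is not Curator's or Marathon's actual code. It assumes that a DNS failure should count against the retry budget like any other connection failure, and that the process should terminate once the budget is spent. All names (`connect_with_retry`, `run_standby`, `RetriesExhausted`, `on_fatal`) are hypothetical.

```python
import socket
import time
from typing import Callable, Optional


class RetriesExhausted(Exception):
    """Hypothetical error: the retry budget ran out before a connection succeeded."""


def resolve_dns(host: str) -> str:
    """Resolve a host name; raises socket.gaierror if resolution fails."""
    return socket.getaddrinfo(host, None)[0][4][0]


def connect_with_retry(
    host: str,
    resolve: Callable[[str], str] = resolve_dns,
    max_retries: int = 3,
    base_sleep_s: float = 0.0,
) -> str:
    """Resolve `host`, counting every DNS failure against the retry budget."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):  # one initial attempt plus retries
        try:
            return resolve(host)
        except socket.gaierror as err:
            # The suspected gap: without this handler a DNS failure would
            # bypass the retry accounting entirely.
            last_error = err
            if attempt < max_retries:
                time.sleep(base_sleep_s * (2 ** attempt))  # exponential backoff
    raise RetriesExhausted(
        f"could not resolve {host!r} after {max_retries} retries"
    ) from last_error


def run_standby(
    host: str,
    resolve: Callable[[str], str],
    on_fatal: Callable[[], None],
    max_retries: int = 3,
) -> Optional[str]:
    """Return the resolved address, or call `on_fatal` instead of lingering as a zombie."""
    try:
        return connect_with_retry(host, resolve, max_retries)
    except RetriesExhausted:
        on_fatal()  # in a real process this would exit, e.g. sys.exit(1)
        return None
```

      Passing the resolver and the fatal handler as parameters keeps the sketch testable without real DNS: a resolver that always raises `socket.gaierror` should trigger `on_fatal` exactly once after the retries are used up.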

      Resolutions:

      https://phabricator.mesosphere.com/rMARATHONe4fa0a6bcdee20dac42856899156e5894b171588

      https://github.com/dcos/dcos/pull/1690

       


              People

              • Assignee: Aleksey Dukhovniy (adukhovniy)
              • Reporter: Ken Sipe (ken)
              • Team: Orchestration Team
              • Watchers: Adam Bordelon (Inactive), Aleksey Dukhovniy, Ken Sipe, Tim Harper
