Marathon / MARATHON-4093

HA is broken if marathon is not leading from the same host as the mesos-master

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Low
    • Resolution: Won't Do
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Leader Election
    • Labels:

      Description

      Versions:

      Mesos 1.0.0
      Marathon 1.3.0
      

      The setup is three servers, each running the zookeeper, mesos-master and marathon services (plus consul and dnsmasq for internal hostname resolution). Additionally, there are 6 slaves running the mesos-agent service (as well as consul and dnsmasq).

      Command line flags for mesos-master

        --cluster='dev' \
        --credentials='/var/lib/mesos/credentials' \
        --hostname='01.mesos-master.service.internal.asperafiles.com' \
        --ip='10.114.81.17' \
        --log_dir='/var/log/mesos' \
        --logging_level='INFO' \
        --port=5050 \
        --quorum=2 \
        --work_dir='/var/lib/mesos' \
        --zk='zk://01.zookeeper.service.internal.asperafiles.com:2181,02.zookeeper.service.internal.asperafiles.com:2181,03.zookeeper.service.internal.asperafiles.com:2181/mesos'
      

      Command line flags for marathon

      --hostname 01.marathon.service.internal.asperafiles.com \
      --master zk://01.zookeeper.service.internal.asperafiles.com:2181,02.zookeeper.service.internal.asperafiles.com:2181,03.zookeeper.service.internal.asperafiles.com:2181/mesos \
      --zk zk://01.zookeeper.service.internal.asperafiles.com:2181,02.zookeeper.service.internal.asperafiles.com:2181,03.zookeeper.service.internal.asperafiles.com:2181/marathon
      

      All hostnames are resolvable.
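
      For anyone reproducing this, resolution can be checked from each server with something like the sketch below. The helper names `resolve` and `check_hosts` are made up for illustration, and the list only contains hostnames that appear in full in the flags above.

```python
import socket

# Hostnames copied from the flags above; extend with the other servers as needed.
HOSTS = [
    "01.mesos-master.service.internal.asperafiles.com",
    "01.marathon.service.internal.asperafiles.com",
    "01.zookeeper.service.internal.asperafiles.com",
    "02.zookeeper.service.internal.asperafiles.com",
    "03.zookeeper.service.internal.asperafiles.com",
]


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or [] if it does not resolve."""
    try:
        infos = resolver(hostname, None)
    except socket.gaierror:
        return []
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


def check_hosts(hostnames, resolver=socket.getaddrinfo):
    """Map every hostname to the addresses it resolves to."""
    return {host: resolve(host, resolver) for host in hostnames}
```

      Calling `check_hosts(HOSTS)` on each of the three servers and on an agent should give the same non-empty address list per hostname; an empty list means that name does not resolve from that machine.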

      I'm starting with mesos-master and marathon both leading on server 01.mesos-master.service.internal, with no tasks running.



      I start a simple task and see that marathon is running it fine.

      I force a leader election. Marathon is now leading at 03.mesos-master.service.internal/03.marathon.service.internal
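
      A rough way to confirm who is leading after the election, without relying on the UIs, is to ask the HTTP APIs. This is only a sketch: it assumes Marathon listens on its default port 8080, uses Marathon's `GET /v2/leader` endpoint and the `leader` field of the Mesos master's `/state` endpoint, and the function names are hypothetical.

```python
import json
import urllib.request


def fetch_json(url, timeout=5):
    """GET a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def marathon_leader(base_url, fetch=fetch_json):
    """Leader as reported by one Marathon instance, in "host:port" form."""
    return fetch(base_url.rstrip("/") + "/v2/leader")["leader"]


def mesos_leader(base_url, fetch=fetch_json):
    """Leading master as reported in the "leader" field of /state."""
    return fetch(base_url.rstrip("/") + "/state")["leader"]


def leaders_agree(marathon_urls, fetch=fetch_json):
    """True if every Marathon instance names the same leader."""
    return len({marathon_leader(url, fetch) for url in marathon_urls}) == 1
```

      For example, `marathon_leader("http://01.marathon.service.internal.asperafiles.com:8080")` should name the same host from every Marathon instance; if the instances disagree, the election itself has not settled.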

      I then try to suspend the task in the marathon UI. I use the UI of the leading marathon instance.
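
      To rule out the UI itself, the same suspend can be issued against the REST API. As far as I understand, suspending an app amounts to scaling it to 0 instances, so the sketch below builds a `PUT /v2/apps/{appId}` request with `{"instances": 0}`. The app id `/sleep` in the usage note and the function name are placeholders, not the actual task from this report.

```python
import json
import urllib.request


def build_suspend_request(marathon_url, app_id):
    """Build (but do not send) a request that scales an app to 0 instances."""
    # Marathon app ids start with "/", so strip it before joining the path.
    url = marathon_url.rstrip("/") + "/v2/apps/" + app_id.strip("/")
    body = json.dumps({"instances": 0}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

      Passing the result of `build_suspend_request("http://03.marathon.service.internal:8080", "/sleep")` to `urllib.request.urlopen` sends the request to the leading instance.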


      The task is still running.

      Eventually marathon marks the task as suspended, but the mesos UI still shows the task as running.
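
      The mismatch can also be shown without screenshots by comparing what each side reports. The sketch below works on already-parsed JSON from Marathon's `GET /v2/tasks` and the Mesos master's `GET /tasks`; the function names are hypothetical, and the optional `framework_id` filter is there because Mesos lists tasks from all frameworks.

```python
def marathon_task_ids(marathon_tasks):
    """Task ids from a parsed Marathon /v2/tasks response."""
    return {task["id"] for task in marathon_tasks.get("tasks", [])}


def mesos_running_task_ids(mesos_tasks, framework_id=None):
    """Ids of TASK_RUNNING tasks from a parsed Mesos /tasks response."""
    return {
        task["id"]
        for task in mesos_tasks.get("tasks", [])
        if task.get("state") == "TASK_RUNNING"
        and (framework_id is None or task.get("framework_id") == framework_id)
    }


def orphaned_task_ids(marathon_tasks, mesos_tasks, framework_id=None):
    """Tasks Mesos still runs that Marathon no longer knows about."""
    running = mesos_running_task_ids(mesos_tasks, framework_id)
    return running - marathon_task_ids(marathon_tasks)
```

      In the situation described here, `orphaned_task_ids` would be expected to return the id of the task that marathon shows as suspended while mesos still runs it.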

      I've been experimenting with all sorts of different configurations. I've tried upgrading marathon and mesos. I've tried IPs instead of hostnames. I can't seem to resolve this issue. We've actually been running mesos and marathon for a year and a half, and it seems that this issue exists in all of our clusters. If this is a configuration issue, please help me figure out what I'm doing wrong. As it is, any errant marathon leader election keeps us from deploying or controlling our applications.

            People

            • Assignee:
              Unassigned
              Reporter:
              tomwganem Tom Ganem
              Team:
              Orchestration Team
              Watchers:
              Jason Gilanfarr (Inactive), kamaradclimber, zanes2016