  1. Marathon
  2. MARATHON-8066

Mesos cannot offer resource to marathon

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Cannot Reproduce
    • Affects Version/s: Marathon 1.5.4
    • Fix Version/s: None
    • Component/s: Leader Election
    • Labels:

      Description

      Issue Summary
      We have three EC2 nodes (m4.xlarge), each running Zookeeper, Mesos master, and Marathon containers. The clusters are launched in the order Zookeeper, Mesos, Marathon. After all three Marathon nodes form a cluster, Mesos treats Marathon as an "Inactive" framework. All the Marathon containers are running and well clustered, but the framework is shown as inactive in Mesos. At this point, Marathon has Mesos_master_ui_url=null and cannot get resources from Mesos to deploy applications.
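The inactive state described above can also be checked without the Mesos UI. The following is a minimal sketch, assuming the JSON returned by the Mesos master's `/frameworks` endpoint lists each registered framework with `id`, `name`, and `active` fields. The helper name and the sample payload are illustrative only and were not captured from the affected cluster.

```python
import json


def marathon_framework_status(frameworks_json, name="marathon"):
    """Return (framework_id, active) for every registered framework with this name."""
    doc = json.loads(frameworks_json)
    return [
        (fw.get("id"), bool(fw.get("active", False)))
        for fw in doc.get("frameworks", [])
        if fw.get("name") == name
    ]


# Illustrative payload (assumption): shaped like a /frameworks response,
# reusing the framework ID that appears in the Mesos master logs of this report.
sample = json.dumps({
    "frameworks": [
        {
            "id": "71583709-937e-4de2-9a15-85cfd728f6b7-0001",
            "name": "marathon",
            "active": False,
        }
    ]
})

print(marathon_framework_status(sample))
```

An entry with `active` set to false would correspond to the "Inactive" framework state reported here.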
       
      Details
      This issue happens only the first time the environment is provisioned. We can fix it manually by doing a rolling restart of all the Marathon containers. After the restart, all clusters (Zookeeper, Mesos and Marathon) perform normally and can recover from a single-node failure. We are reporting this issue because we want to avoid the manual steps and find the root cause.
       
      In the Marathon logs, Marathon requests task reconciliation with the Mesos master:

        
       [2018-02-05 22:05:37,966] INFO  initiate task reconciliation (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-5)
       [2018-02-05 22:05:37,966] INFO  Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:ForkJoinPool-3-worker-15)
       [2018-02-05 22:05:37,967] INFO  task reconciliation has finished (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-24)
       [2018-02-05 22:05:38,184] INFO  Add EventStream Handle as event listener: HttpEventSSEHandle(5ff4a57b-5ea7-4f66-a3e2-95a867431e89 on "ip.address.ip.1" on event types from Set()). Current nr of streams: 3 (mesosphere.marathon.core.event.impl.stream.HttpEventStreamActor:marathon-akka.actor.default-dispatcher-7)
       [2018-02-05 22:05:38,186] INFO  Received EOF from stream handle HttpEventSSEHandle(e8be1407-e19d-4b7b-a4e4-9f126fe0ad41 on "ip.address.ip.1" on event types from Set()). Ignore subsequent events. (mesosphere.marathon.core.event.impl.stream.HttpEventStreamHandleActor:marathon-akka.actor.default-dispatcher-17)
       [2018-02-05 22:05:38,370] INFO  "ip.address.ip.1" - - [05/Feb/2018:22:05:38 +0000] "GET //"ip.address.ip.2":8080/ui/#/apps HTTP/1.1" 401 243 "-" "ELB-HealthChecker/2.0"  (mesosphere.chaos.http.ChaosRequestLog:qtp2058879732-215)
       [2018-02-05 22:05:38,496] INFO  "ip.address.ip.3" - - [05/Feb/2018:22:05:38 +0000] "GET //"ip.address.ip.2":8080/ui/#/apps HTTP/1.1" 401 243 "-" "ELB-HealthChecker/2.0"  (mesosphere.chaos.http.ChaosRequestLog:qtp2058879732-59)
       [2018-02-05 22:05:38,848] INFO  No match for:7c908dd3-0062-486c-bf87-30903cbd130a-O3088 from:"Mesos-agent-DNS" reason:No offers wanted (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-17)
       [2018-02-05 22:05:42,853] INFO  No match for:7c908dd3-0062-486c-bf87-30903cbd130a-O3089 from:"Mesos-agent-DNS" reason:No offers wanted (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-17)
      

       
      In the Mesos master logs, the master keeps sending offers to Marathon, and Marathon declines them:

       I0205 21:33:50.016361    15 master.cpp:7673] Sending 1 offers to framework 71583709-937e-4de2-9a15-85cfd728f6b7-0001 (marathon) at scheduler-ef8c4572-0fc2-453a-9ab1-c72c54ea5530@"ip.address":44295
       I0205 21:33:50.017779    20 master.cpp:5120] Processing DECLINE call for offers: [ 7c908dd3-0062-486c-bf87-30903cbd130a-O2883 ] for framework 71583709-937e-4de2-9a15-85cfd728f6b7-0001 (marathon) at scheduler-ef8c4572-0fc2-453a-9ab1-c72c54ea5530@"ip.address":44295
       I0205 21:33:50.017910    20 master.cpp:9170] Removing offer 7c908dd3-0062-486c-bf87-30903cbd130a-O2883

       I0205 21:33:57.022912    18 master.cpp:7673] Sending 1 offers to framework 71583709-937e-4de2-9a15-85cfd728f6b7-0001 (marathon) at scheduler-ef8c4572-0fc2-453a-9ab1-c72c54ea5530@"ip.address":44295
       I0205 21:33:57.024308    20 master.cpp:5120] Processing DECLINE call for offers: [ 7c908dd3-0062-486c-bf87-30903cbd130a-O2884 ] for framework 71583709-937e-4de2-9a15-85cfd728f6b7-0001 (marathon) at scheduler-ef8c4572-0fc2-453a-9ab1-c72c54ea5530@"ip.address":44295
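To confirm that every offer is declined rather than only some of them, the master log can be tallied per framework. This is a rough sketch that assumes the log line wording shown in the excerpt above; the regular expressions and the `summarize_offers` helper are hypothetical and may need adjusting for other Mesos versions.

```python
import re

# Patterns derived from the master log lines quoted in this report.
SENT = re.compile(r"Sending (\d+) offers to framework (\S+)")
DECLINED = re.compile(
    r"Processing DECLINE call for offers: \[ (.*?) \] for framework (\S+)"
)


def summarize_offers(lines):
    """Count offers sent to, and declined by, each framework ID."""
    summary = {}
    for line in lines:
        sent = SENT.search(line)
        if sent:
            entry = summary.setdefault(sent.group(2), {"sent": 0, "declined": 0})
            entry["sent"] += int(sent.group(1))
            continue
        declined = DECLINED.search(line)
        if declined:
            # A DECLINE call may list several comma-separated offer IDs.
            offer_ids = [o for o in declined.group(1).split(",") if o.strip()]
            entry = summary.setdefault(declined.group(2), {"sent": 0, "declined": 0})
            entry["declined"] += len(offer_ids)
    return summary


log_lines = [
    'I0205 21:33:50.016361    15 master.cpp:7673] Sending 1 offers to framework 71583709-937e-4de2-9a15-85cfd728f6b7-0001 (marathon) at scheduler-ef8c4572-0fc2-453a-9ab1-c72c54ea5530@"ip.address":44295',
    'I0205 21:33:50.017779    20 master.cpp:5120] Processing DECLINE call for offers: [ 7c908dd3-0062-486c-bf87-30903cbd130a-O2883 ] for framework 71583709-937e-4de2-9a15-85cfd728f6b7-0001 (marathon) at scheduler-ef8c4572-0fc2-453a-9ab1-c72c54ea5530@"ip.address":44295',
    'I0205 21:33:57.022912    18 master.cpp:7673] Sending 1 offers to framework 71583709-937e-4de2-9a15-85cfd728f6b7-0001 (marathon) at scheduler-ef8c4572-0fc2-453a-9ab1-c72c54ea5530@"ip.address":44295',
    'I0205 21:33:57.024308    20 master.cpp:5120] Processing DECLINE call for offers: [ 7c908dd3-0062-486c-bf87-30903cbd130a-O2884 ] for framework 71583709-937e-4de2-9a15-85cfd728f6b7-0001 (marathon) at scheduler-ef8c4572-0fc2-453a-9ab1-c72c54ea5530@"ip.address":44295',
]

for framework_id, counts in summarize_offers(log_lines).items():
    print(framework_id, counts)
```

If the sent and declined counts stay equal over a long window, Marathon is receiving offers but never accepting one, which matches the "No offers wanted" entries on the Marathon side.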
      

      Manual steps to fix the issue
      Restart the current Marathon leader container so that Marathon holds a new leader election. If the new leader also shows as inactive in Mesos, restart the new leader container. Repeat until an active Marathon framework shows in Mesos.
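The manual procedure above could be automated along these lines. This is only a sketch: `is_active` and `restart_leader` are hypothetical callables that the operator would have to supply (for example, a check against Mesos and a container restart command), and the attempt limit and wait time are arbitrary assumptions rather than values from this report.

```python
import time


def restart_until_active(is_active, restart_leader, max_attempts=5,
                         wait_seconds=60.0, sleep=time.sleep):
    """Restart the Marathon leader until the framework is reported as active.

    Returns the number of restarts performed. Raises RuntimeError if the
    framework is still inactive after max_attempts restarts.
    """
    for restarts in range(max_attempts + 1):
        if is_active():
            return restarts
        if restarts == max_attempts:
            break
        restart_leader()      # triggers a new leader election
        sleep(wait_seconds)   # give the new leader time to register with Mesos
    raise RuntimeError(
        f"Marathon still inactive after {max_attempts} leader restarts"
    )


# Demo with stand-in callables (assumptions): inactive twice, then active.
states = iter([False, False, True])
restarted = []
count = restart_until_active(
    is_active=lambda: next(states),
    restart_leader=lambda: restarted.append("leader"),
    sleep=lambda _seconds: None,  # skip real waiting in the demo
)
print(count)
```

Bounding the number of restarts keeps the loop from cycling leaders forever if the underlying cause is not transient.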
       
      Other Issues
      During our experiments, we also saw the "Waiting for consistent leadership state" issue several times. It leads to a split-brain-like state in the three-node cluster: one Marathon node keeps waiting for a consistent leadership state while the other two have already formed a cluster. According to the release notes, this should have been fixed after Marathon 1.4.8, but we still see it with Marathon 1.4.11 and 1.5.4.
       
      Other Experiments

      1. We tried several combinations of Mesos and Marathon:
        Mesos: 1.4.0    Marathon:1.4.8
        Mesos: 1.4.0    Marathon:1.4.11
        Mesos: 1.4.1    Marathon:1.4.11
        Mesos: 1.4.1    Marathon:1.5.4
        All of these issues occur with every combination.
      2. We tested both multi-AZ and single-AZ deployments and found the issues in both.
      3. We added a sleep of 1 minute and of 5 minutes before deploying Marathon:
        With the 1-minute sleep, the issue still occurs.
        With the 5-minute sleep, the Marathon framework did not even show up in Mesos.

            People

            • Assignee:
              ivanchernetsky Ivan Chernetsky
              Reporter:
              杨帆 yfsuperman
              Team:
              Orchestration Team
              Watchers:
              Matthias Eichstedt, maxsom, yfsuperman
