Marathon / MARATHON-7509

Marathon doesn't rediscover reserved resources for resident tasks if the Mesos agent ID changes across reboots.

      Description

      On Marathon 1.4.2 we hit a problem: when the Mesos agent ID changes on a node, Marathon cannot rediscover the reserved resources on that node.

      Here's the /v2/info output for Marathon:

      {"name":"marathon","version":"1.4.2","buildref":"unknown","elected":true,"leader":"REDACTED_MARATHON_IP:8080","frameworkId":"REDACTED_ID","marathon_config":{"master":"zk://REDACTED_ZK:2181/REDACTED_PATH_MESOS","failover_timeout":604800,"framework_name":"REDACTED_FRAMEWORK","ha":true,"checkpoint":true,"local_port_min":10000,"local_port_max":20000,"executor":"//cmd","hostname":"REDACTED_MARATHON_IP","webui_url":null,"mesos_role":"REDACTED_ROLE","task_launch_timeout":300000,"task_reservation_timeout":20000,"reconciliation_initial_delay":15000,"reconciliation_interval":600000,"mesos_user":"REDACTED_USER","leader_proxy_connection_timeout_ms":5000,"leader_proxy_read_timeout_ms":10000,"features":["task_killing"],"mesos_leader_ui_url":"http://REDACTED_MASTER:5050/"},"zookeeper_config":{"zk":"zk://REDACTED_ZK:2181/REDACTED_ZK_PATH_MARATHON","zk_timeout":10000,"zk_session_timeout":10000,"zk_max_versions":50},"event_subscriber":null,"http_config":{"http_port":8080,"https_port":8443}}
      

      A few selected lines from the Mesos agent logs:

      Jun 10 04:29:29 ip-172-19-120-80 mesos-agent[2589]: I0610 04:29:29.708191  2961 slave.cpp:1134] Registered with master master@MASTER_0_REDACTED:5050; given agent ID SLAVE_ID_0_REDACTED-S892
      ....
      Jun 12 11:09:54 ip-172-19-120-80 mesos-agent[2589]: I0612 11:09:54.027519  2906 zookeeper.cpp:259] A new leading master (UPID=master@MASTER_1_REDACTED:5050) is detected
      ....
      Jun 12 13:31:22 ip-172-19-120-80 mesos-agent[2455]: I0612 13:31:22.593070  2738 state.cpp:95] Agent host rebooted
      Jun 12 13:31:22 ip-172-19-120-80 mesos-agent[2455]: I0612 13:31:22.602294  2765 zookeeper.cpp:259] A new leading master (UPID=master@MASTER_1_REDACTED:5050) is detected
      Jun 12 13:31:28 ip-172-19-120-80 mesos-agent[2455]: I0612 13:31:28.742300  2584 slave.cpp:1134] Registered with master master@MASTER_1_REDACTED:5050; given agent ID SLAVE_ID_1_REDACTED-S10
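
      To spot this condition on other nodes, the registration lines above can be scanned for a change in the assigned agent ID. The following is a minimal sketch, not part of Marathon or Mesos: the function names are hypothetical, and the regular expression assumes the "Registered with master ...; given agent ID ..." wording shown in the excerpt, which may differ between Mesos versions.

```python
import re

# Assumed log wording, taken from the agent log excerpt in this report.
REGISTERED = re.compile(r"Registered with master (\S+); given agent ID (\S+)")


def agent_id_history(log_lines):
    """Return [(master, agent_id), ...] in the order registrations appear."""
    history = []
    for line in log_lines:
        match = REGISTERED.search(line)
        if match:
            history.append((match.group(1), match.group(2)))
    return history


def agent_id_changes(log_lines):
    """Return (old_id, new_id) pairs for consecutive registrations whose ID differs."""
    ids = [agent_id for _, agent_id in agent_id_history(log_lines)]
    return [(old, new) for old, new in zip(ids, ids[1:]) if old != new]
```

      Feeding the agent's journal lines to `agent_id_changes` yields one pair per ID change; an empty list means the agent kept its ID across registrations.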
      

      The tasks do exist in Mesos, but Marathon doesn't seem to know about them. The link for each Marathon app instance has the correct app ID, but the slave ID in the link is the old one, not the new one.
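
      The mismatch described above can be checked mechanically by comparing the slave ID that Marathon has recorded for each task with the agent IDs currently registered in Mesos. The sketch below is illustrative only. It assumes the two payloads have already been fetched and parsed: a Marathon `/v2/tasks` response with `id`, `appId`, `host` and `slaveId` per task, and a Mesos master `/slaves` response with `id` and `hostname` per agent. Those field names are assumptions to verify against the versions in use, and `stale_tasks` is a hypothetical helper.

```python
def stale_tasks(marathon_tasks, mesos_slaves):
    """Return Marathon tasks whose recorded slaveId is not a registered agent.

    marathon_tasks: parsed JSON assumed to look like {"tasks": [...]}
    mesos_slaves:   parsed JSON assumed to look like {"slaves": [...]}
    """
    slaves = mesos_slaves.get("slaves", [])
    live_ids = {slave["id"] for slave in slaves}
    # Map hostname -> current agent ID so the old and new IDs can be shown together.
    id_by_host = {slave["hostname"]: slave["id"] for slave in slaves}

    stale = []
    for task in marathon_tasks.get("tasks", []):
        if task.get("slaveId") not in live_ids:
            stale.append({
                "task_id": task.get("id"),
                "app_id": task.get("appId"),
                "recorded_slave_id": task.get("slaveId"),
                "current_slave_id": id_by_host.get(task.get("host")),
            })
    return stale
```

      A non-empty result where `current_slave_id` is set but differs from `recorded_slave_id` matches the situation in this report: the host is still registered, but under a new agent ID.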

              People

              • Assignee:
                marco.monaco Marco Monaco
                Reporter:
                drcrallen drcrallen
                Team:
                Orchestration Team
                Watchers:
                drcrallen, Matthias Eichstedt