  1. Marathon
  2. MARATHON-2227

Unique host constraint is not enforced during rolling upgrade of app

      Description

      I have web servers that listen on a fixed port. I use Marathon to launch these servers, and set a "hostname:UNIQUE" constraint to ensure that no two of them run on the same host and collide on the port.

      I had been using Marathon's rolling upgrade feature to upgrade these servers, and it worked smoothly: servers of the newer version were started on nodes that had no existing servers.

      However, in recent Marathon versions, the unique-host constraint is no longer enforced during a rolling upgrade. New servers are started on the same hosts as old servers and fail with a port collision.

      The issue can be reproduced with the following app JSON:

      {
        "id": "/test-server",
        "cmd": "python -m SimpleHTTPServer 10101",
        "cpus": 0.1,
        "mem": 32,
        "disk": 0,
        "instances": 6,
        "constraints": [
          [
            "hostname",
            "UNIQUE"
          ]
        ],
        "portDefinitions": [],
        "upgradeStrategy": {
          "minimumHealthCapacity": 0.8,
          "maximumOverCapacity": 0
        }
      }
      

      To trigger an upgrade, change the app definition, e.g. by increasing the CPU share. Some of the new Marathon tasks are then likely to fail.
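
The same upgrade can be triggered through Marathon's REST API instead of the UI. The following is a minimal sketch, assuming Marathon is reachable at `http://marathon.example:8080` (a placeholder URL) and that `PUT /v2/apps/{appId}` accepts a partial app definition on this Marathon version. The helper name `build_cpu_bump_request` is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Placeholder base URL (an assumption); replace with your Marathon address.
MARATHON_URL = "http://marathon.example:8080"


def build_cpu_bump_request(app_id, new_cpus, base_url=MARATHON_URL):
    """Build a PUT /v2/apps/{app_id} request that changes only `cpus`.

    Changing a resource value creates a new app version, which is what
    starts the rolling upgrade described in this report.
    """
    body = json.dumps({"cpus": new_cpus}).encode("utf-8")
    url = base_url.rstrip("/") + "/v2/apps/" + app_id.strip("/")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_cpu_bump_request("/test-server", 0.2)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching a live cluster.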

      Marathon shows that multiple tasks are running on the same host.
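
The collision can also be checked programmatically rather than by reading the UI. A minimal sketch, assuming each task entry is a dict with `id` and `host` keys (the shape assumed here for entries returned by `GET /v2/apps/{appId}/tasks`); the function name `find_shared_hosts` and the sample data are hypothetical.

```python
from collections import defaultdict


def find_shared_hosts(tasks):
    """Return {host: [task ids]} for every host running more than one task.

    `tasks` is a list of dicts with "id" and "host" keys (assumed shape).
    """
    by_host = defaultdict(list)
    for task in tasks:
        by_host[task["host"]].append(task["id"])
    return {host: ids for host, ids in by_host.items() if len(ids) > 1}


if __name__ == "__main__":
    # Made-up sample data for illustration, not output from a real cluster.
    sample = [
        {"id": "test-server.a", "host": "node-1"},
        {"id": "test-server.b", "host": "node-2"},
        {"id": "test-server.c", "host": "node-1"},
    ]
    print(find_shared_hosts(sample))
```

With a "hostname:UNIQUE" constraint in effect, the result is expected to be empty; any entry indicates two tasks of the app sharing a host.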

      The Marathon app's DEBUG page shows "TASK_FAILED", and the Mesos task's stderr shows that the failure is caused by multiple servers running on the same host:

      I0119 10:56:06.241583 68112 exec.cpp:162] Version: 1.1.0
      I0119 10:56:06.243521 68122 exec.cpp:237] Executor registered on agent 367af7e4-ea29-470a-9b13-e7c5e9e9a768-S3
      Traceback (most recent call last):
        File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
          "__main__", fname, loader, pkg_name)
        File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
          exec code in run_globals
        File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 220, in <module>
          test()
        File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 216, in test
          BaseHTTPServer.test(HandlerClass, ServerClass)
        File "/usr/lib64/python2.7/BaseHTTPServer.py", line 595, in test
          httpd = ServerClass(server_address, HandlerClass)
        File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
          self.server_bind()
        File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
          SocketServer.TCPServer.server_bind(self)
        File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
          self.socket.bind(self.server_address)
        File "/usr/lib64/python2.7/socket.py", line 224, in meth
          return getattr(self._sock,name)(*args)
      socket.error: [Errno 98] Address already in use
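
The "Address already in use" failure above is what any second process sees when it binds a TCP port that is already taken on the same host. A minimal sketch that reproduces it locally without Marathon, written for Python 3 (where `socket.error` is an alias of `OSError`, unlike the Python 2.7 traceback above); the function name `second_bind_errno` is hypothetical.

```python
import errno
import socket


def second_bind_errno():
    """Bind two TCP sockets to the same address.

    Returns the errno raised by the second bind, or None if the second
    bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen(1)
        try:
            # Same address and port as the first socket.
            second.bind(first.getsockname())
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(code, code == errno.EADDRINUSE)
```

On Linux, `errno.EADDRINUSE` is 98, matching the `[Errno 98]` in the task's stderr.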
      

      I am running Marathon 1.3.6. After a brief search, I found that the issue may be related to the fix for
      https://github.com/mesosphere/marathon/issues/3624
      which removed constraint checking against tasks of an earlier version.

      I think the constraint should be maintained throughout the rolling upgrade if it is not changed.
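
The difference between the two behaviours can be illustrated with a small model. This is only a sketch of the idea under stated assumptions, not Marathon's actual implementation: the function `hosts_allowed_for_new_task`, its parameters, and the host and version names are all hypothetical.

```python
def hosts_allowed_for_new_task(all_hosts, running_tasks, new_version,
                               ignore_old_versions):
    """Return the sorted hosts where a task of `new_version` may start
    under a hostname:UNIQUE constraint.

    running_tasks: list of (host, version) tuples.
    ignore_old_versions=True models the suspected behaviour: only tasks
    of the new version count as occupying a host.
    ignore_old_versions=False models the requested behaviour: every
    running task of the app occupies its host, whatever its version.
    """
    occupied = {
        host
        for host, version in running_tasks
        if not ignore_old_versions or version == new_version
    }
    return sorted(set(all_hosts) - occupied)


if __name__ == "__main__":
    hosts = ["node-1", "node-2", "node-3"]
    running = [("node-1", "v1"), ("node-2", "v1")]  # old-version tasks
    print(hosts_allowed_for_new_task(hosts, running, "v2", True))
    print(hosts_allowed_for_new_task(hosts, running, "v2", False))
```

In this model, ignoring old-version tasks leaves every host eligible, including those where an old server still holds the port, while counting all tasks leaves only the free host.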

              People

                Assignee: Unassigned
                Reporter: GitHub_albertcsm albertcsm (Inactive)
                Team: Orchestration Team
                Watchers: marc-lebourdais, Matthias Eichstedt, Tim Harper