I have web servers that listen on a fixed port. I use Marathon to launch these servers, and set a "hostname:UNIQUE" constraint to ensure that no two of them run on the same host and collide on the port.
I had been using Marathon's rolling upgrade feature to upgrade these servers, and it worked very smoothly: servers of the newer version were started on nodes that had no existing servers.
However, in recent Marathon versions, I found that this unique-host constraint is no longer enforced during a rolling upgrade. New servers are started side by side with old servers on the same host and fail with a port collision.
This issue can be reproduced with the following app JSON:
"cmd": "python -m SimpleHTTPServer 10101",
Trigger an upgrade, e.g. by increasing the CPU share, and some of the Marathon tasks will likely fail.
Marathon shows that multiple tasks are running on the same host.
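For anyone who wants to script the trigger, here is a minimal Python sketch of building the upgrade request body. Only the `cmd` value comes from my app; the app id, `cpus`, `mem`, and `instances` values are placeholders I made up for illustration, and `bump_cpus` is just a hypothetical helper name.

```python
import copy
import json

# Hypothetical app definition: only "cmd" is taken from the report,
# every other value is an assumed placeholder.
APP = {
    "id": "/simple-http",  # assumed app id
    "cmd": "python -m SimpleHTTPServer 10101",
    "cpus": 0.1,           # assumed
    "mem": 32,             # assumed
    "instances": 2,        # assumed
    "constraints": [["hostname", "UNIQUE"]],
}


def bump_cpus(app, delta=0.1):
    """Return a copy of the app definition with a larger CPU share."""
    upgraded = copy.deepcopy(app)
    upgraded["cpus"] = round(app["cpus"] + delta, 3)
    return upgraded


if __name__ == "__main__":
    # The constraint is left untouched; only the CPU share changes.
    print(json.dumps(bump_cpus(APP), indent=2))
```

Sending the printed body as a PUT to `/v2/apps/<app id>` on the Marathon REST API should start a new deployment, which is the point at which the collision shows up for me.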
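The same thing can be checked without the UI. The sketch below assumes the task list has already been fetched from Marathon's `/v2/apps/<app id>/tasks` endpoint and parsed into a dict; the sample task ids and host names are invented, and `hosts_with_multiple_tasks` is a hypothetical helper, not part of Marathon.

```python
from collections import defaultdict


def hosts_with_multiple_tasks(tasks_response):
    """Map each host that runs more than one task to the ids of those tasks."""
    by_host = defaultdict(list)
    for task in tasks_response.get("tasks", []):
        by_host[task["host"]].append(task["id"])
    return {host: ids for host, ids in by_host.items() if len(ids) > 1}


if __name__ == "__main__":
    # Invented sample data shaped like a parsed task-list response.
    sample = {
        "tasks": [
            {"id": "simple-http.old-1", "host": "node1"},
            {"id": "simple-http.new-1", "host": "node1"},
            {"id": "simple-http.old-2", "host": "node2"},
        ]
    }
    print(hosts_with_multiple_tasks(sample))
```

An empty result would mean the unique-host constraint is holding; any entry is a host where an old and a new server were placed together.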
The Marathon app's DEBUG page shows "TASK_FAILED", and the Mesos task's stderr shows that the task failed because multiple servers were running on the same host:
I0119 10:56:06.241583 68112 exec.cpp:162] Version: 1.1.0
I0119 10:56:06.243521 68122 exec.cpp:237] Executor registered on agent 367af7e4-ea29-470a-9b13-e7c5e9e9a768-S3
Traceback (most recent call last):
File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 220, in <module>
File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 216, in test
File "/usr/lib64/python2.7/BaseHTTPServer.py", line 595, in test
httpd = ServerClass(server_address, HandlerClass)
File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
File "/usr/lib64/python2.7/socket.py", line 224, in meth
socket.error: [Errno 98] Address already in use
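The error itself is easy to reproduce outside Marathon. This is a minimal sketch in Python 3 (my servers run Python 2.7, so treat it as an illustration rather than the exact code path): two sockets on one host try to bind the same port, and the second bind is expected to fail with `EADDRINUSE`, which is errno 98 on Linux. `second_bind_errno` is a hypothetical helper name.

```python
import errno
import socket


def second_bind_errno(host="127.0.0.1"):
    """Bind two TCP sockets to the same port; return the errno of the second bind."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0 lets the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same host, same port
        except OSError as exc:
            return exc.errno
        return None  # the second bind unexpectedly succeeded
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(code, code == errno.EADDRINUSE)
```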
I am running Marathon 1.3.6. After a brief search, I found that the issue may be related to the fix in
which removes the constraint check against tasks of an earlier version.
I think the constraint should be maintained throughout the rolling upgrade as long as it has not changed.
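To make the expectation concrete, here is a rough Python sketch of the two behaviours as I understand them. This is not Marathon's code; `unique_host_allows`, the `(host, version)` tuples, and the version labels are all assumptions made up for illustration.

```python
def unique_host_allows(offer_host, tasks, only_version=None):
    """Decide whether a new task may be placed on offer_host.

    tasks is an iterable of (host, version) pairs for the app's running tasks.
    With only_version=None every task counts, whatever its version.
    With only_version set, tasks of other versions are ignored.
    """
    for host, version in tasks:
        if only_version is not None and version != only_version:
            continue  # this task is skipped by the version filter
        if host == offer_host:
            return False  # host already runs a task that counts
    return True


if __name__ == "__main__":
    # Two old-version servers are running when the upgrade starts.
    running = [("node1", "v1"), ("node2", "v1")]

    # What I expect: old tasks still block their hosts.
    print(unique_host_allows("node1", running))

    # What I seem to observe: only new-version tasks are considered.
    print(unique_host_allows("node1", running, only_version="v2"))
```

With the version filter, `node1` looks free to the new version even though an old server still holds the port there, which matches the failures above.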