Affects Version/s: Marathon 1.3.7
Fix Version/s: None
Environment: Mesos Master 1.0.1, Agent 1.0.0
All three of our Marathon processes went to 100% CPU, with the machines reporting a load average above 3 in uptime. The logs and the Nessus report from our opsec team show that the CPU usage began when a Nessus scan started. The scan lasted two minutes, but the 100% usage persisted for several hours after it finished, and only stopped once each Marathon process was restarted. All three masters were scanned.
About two hours later, we killed the first Marathon process and restarted it. It returned to an average of 5% CPU usage, while the other two processes remained at 100%. We then killed and restarted those two as well, and each stayed at around 5% after its restart. uptime then reported a load average of ~0.05.
We plan to create a debug cluster and run Nessus against it again sometime in the next few weeks. Please let me know if there are any flags you would like us to set for log capture.
While the CPU usage was high, top showed an average of 5 or 6 threads in each Marathon process competing for the CPU, at about 15% per thread.
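For the debug run, one way to find out which threads are spinning would be to pair the per-thread IDs from `top -H` with a JVM thread dump (for example from `jstack`). The sketch below is only an illustration of that pairing, not something we have run against the affected cluster. It assumes the HotSpot dump format used by Java 8, where each thread header carries the native thread ID as hexadecimal `nid=0x...`. The function name `hot_threads`, the sample dump, and the thread names in it are all made up for the example.

```python
import re

# Header line of a thread in a HotSpot (Java 8 style) thread dump, e.g.
#   "some-thread" #12 prio=5 os_prio=0 tid=0x... nid=0x1a2b runnable [...]
# Assumption: nid is printed in hexadecimal, as Java 8 does.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)".*?\bnid=0x(?P<nid>[0-9a-fA-F]+)')


def hot_threads(dump_text, hot_tids):
    """Map decimal thread IDs (as shown by `top -H`) to thread names in the dump.

    Returns {decimal_tid: thread_name} for every requested ID that appears
    in the dump; IDs that are not found are simply left out.
    """
    wanted = set(hot_tids)
    found = {}
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match is None:
            continue
        tid = int(match.group("nid"), 16)  # hex nid -> decimal TID
        if tid in wanted:
            found[tid] = match.group("name")
    return found


# Hypothetical dump excerpt; the thread names are invented for illustration.
SAMPLE_DUMP = '''\
"example-worker-1" #21 prio=5 os_prio=0 tid=0x00007f0001 nid=0x1a2b runnable [0x00007f1000]
   java.lang.Thread.State: RUNNABLE
"example-worker-2" #22 prio=5 os_prio=0 tid=0x00007f0002 nid=0x1a2c waiting on condition [0x00007f2000]
   java.lang.Thread.State: WAITING (parking)
'''

if __name__ == "__main__":
    # 6699 is 0x1a2b in decimal, i.e. the ID top -H would show for the first thread.
    print(hot_threads(SAMPLE_DUMP, [6699]))
```

Taking a few dumps several seconds apart while the CPU is pegged, and looking up the same hot IDs in each, should show whether the same threads stay busy and what their stacks look like.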