Marathon / MARATHON-8561

Marathon slave memory usage constantly grows

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Medium
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Memory consumption on the Marathon slave grows steadily until the process exits with error code 101 (ZookeeperConnectionLost). I see no such problem on the master node, where the process has been up and running for more than one week at the moment.

      Marathon version: 1.6.549
      Mesos version: 1.5.0
      Marathon slave server: 2 cores at 2.9 GHz, 3.8 GB RAM.
      Marathon start arguments:

      --draining_seconds 600 --enable_features maintenance_mode --http_port 8888 --master zk://zookeeper1.vpc.prod.internal:2181,zookeeper2.vpc.prod.internal:2181,zookeeper3.vpc.prod.internal:2181/mesos --plugin_conf /opt/marathon/plugins/conf.json --plugin_dir /opt/marathon/plugins --zk zk://zookeeper1.vpc.prod.internal:2181,zookeeper2.vpc.prod.internal:2181,zookeeper3.vpc.prod.internal:2181/marathon --zk_timeout 30000 --zk_session_timeout 30000 --zk_connection_timeout 30000
      

      Java options:

      -Dlogback.configurationFile=/opt/marathon/logback.xml
      -Xms2g
      -Xmx2g
      -XX:+PrintGCApplicationStoppedTime
      -XX:+UseConcMarkSweepGC
      -XX:+CMSParallelRemarkEnabled
      -XX:-UseParNewGC
      -XX:+UseCMSInitiatingOccupancyOnly
      -XX:CMSInitiatingOccupancyFraction=80
      -XX:MaxGCPauseMillis=200
      -XX:MaxTenuringThreshold=1
      -XX:SurvivorRatio=90
      -XX:TargetSurvivorRatio=9
      -XX:ParallelGCThreads=2
      -javaagent:/home/ubuntu/aspectjweaver-1.8.13.jar
      -Dcom.sun.management.jmxremote.port=6789
      -Dcom.sun.management.jmxremote.ssl=false
      -Dcom.sun.management.jmxremote.authenticate=false
      -Djava.rmi.server.hostname=10.5.22.184
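
Note that -Xmx2g caps only the Java heap, so resident memory above 2 GiB would point at growth outside the heap (metaspace, thread stacks, direct buffers and similar). A minimal sketch for checking this on Linux, assuming the Marathon PID is known; the function names are hypothetical and not part of Marathon:

```python
import re

def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def read_rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_rss_kib(handle.read())

def exceeds_heap_cap(rss_kib, max_heap_kib=2 * 1024 * 1024):
    """True when resident memory is larger than the -Xmx2g heap cap (assumed 2 GiB)."""
    return rss_kib > max_heap_kib
```

Calling read_rss_kib periodically (for example from cron) and logging the result would show whether the growth is in resident memory as a whole or stays within the heap.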
      

      Logs:

      [2019-02-07 03:12:12,986] WARN Client session timed out, have not heard from server in 22674ms for sessionid 0xc6681a5e4f277a4 (org.apache.zookeeper.ClientCnxn:JMX exporting thread-SendThread(ip-10-5-22-68.ec2.internal:2181))
      [2019-02-07 03:12:12,986] WARN Client session timed out, have not heard from server in 22675ms for sessionid 0xa6681a5e48f792a (org.apache.zookeeper.ClientCnxn:JMX exporting thread-SendThread(ip-10-5-29-79.ec2.internal:2181))
      [2019-02-07 03:12:12,986] INFO Client session timed out, have not heard from server in 22675ms for sessionid 0xa6681a5e48f792a, closing socket connection and attempting reconnect(org.apache.zookeeper.ClientCnxn:JMX exporting thread-SendThread(ip-10-5-29-79.ec2.internal:2181))
      [2019-02-07 03:12:13,087] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:JMX exporting thread-EventThread)
      [2019-02-07 03:12:13,087] INFO State change: SUSPENDED (org.apache.curator.framework.state.ConnectionStateManager:JMX exporting thread-EventThread)
      [2019-02-07 03:12:13,088] INFO backgroundOperationsLoop exiting (org.apache.curator.framework.imps.CuratorFrameworkImpl:Curator-Framework-0)
      [2019-02-07 03:12:13,090] INFO Session: 0xc6681a5e4f277a4 closed (org.apache.zookeeper.ZooKeeper:Curator-ConnectionStateManager-0)
      [2019-02-07 03:12:13,090] INFO EventThread shut down for session: 0xc6681a5e4f277a4 (org.apache.zookeeper.ClientCnxn:JMX exporting thread-EventThread)
      [2019-02-07 03:12:13,112] INFO Shutting down services (mesosphere.marathon.MarathonApp:shutdownHook3)
      
      Feb  7 03:12:14 i-09520e4e2e881480c systemd[1]: marathon.service: Main process exited, code=exited, status=101/n/a
      Feb  7 03:12:14 i-09520e4e2e881480c systemd[1]: marathon.service: Unit entered failed state.
      Feb  7 03:12:14 i-09520e4e2e881480c systemd[1]: marathon.service: Failed with result 'exit-code'.
      Feb  7 03:12:14 i-09520e4e2e881480c systemd[1]: marathon.service: Service hold-off time over, scheduling restart.
      Feb  7 03:12:14 i-09520e4e2e881480c systemd[1]: Stopped marathon.
      Feb  7 03:12:14 i-09520e4e2e881480c systemd[1]: Started marathon.
      

       
      More logs attached.
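
To go through the attached logs for similar events, the ZooKeeper session timeouts can be extracted with a small script. This is a sketch only: the regular expression is inferred from the log lines quoted above, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Pattern inferred from the "Client session timed out" lines in this report.
TIMEOUT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<millis>\d{3})\] "
    r"(?P<level>[A-Z]+) Client session timed out, "
    r"have not heard from server in (?P<silence>\d+)ms "
    r"for sessionid (?P<session>0x[0-9a-f]+)"
)

def parse_session_timeouts(lines):
    """Return one dict per ZooKeeper session-timeout log line."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a session-timeout line
        timestamp = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        timestamp = timestamp.replace(microsecond=int(match["millis"]) * 1000)
        events.append({
            "time": timestamp,
            "level": match["level"],
            "silence_ms": int(match["silence"]),
            "session": match["session"],
        })
    return events
```

Plotting silence_ms against time next to a memory graph may show whether the timeouts line up with long GC pauses.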

      Our cluster setup:
      2 Marathon nodes
      2 Mesos Master nodes
      21 Mesos slaves

      We are running ~100 apps and ~200 tasks.
      We are not using Marathon health checks.

      We use Linkerd as a service mesh; it runs on each Mesos slave and calls the Marathon API for discovery every 10 seconds.
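
For a rough sense of the API load this polling generates, the request rate can be estimated from the numbers above. The sketch assumes one request per Linkerd instance per poll and that all 21 instances hit the same Marathon node; both are assumptions, and the function name is hypothetical.

```python
def discovery_request_rate(linkerd_instances, poll_interval_s):
    """Return (requests per second, requests per day) for periodic polling."""
    per_second = linkerd_instances / poll_interval_s
    per_day = linkerd_instances * 86400 / poll_interval_s
    return per_second, per_day

# 21 Mesos slaves, each running Linkerd and polling every 10 seconds.
per_second, per_day = discovery_request_rate(21, 10)
print(f"{per_second:.1f} requests/s, {per_day:.0f} requests/day")
```

Under these assumptions the load is about 2.1 requests per second, or 181,440 requests per day.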

        Attachments

        1. marathon_1.7.212-0.1.20190401133925.ubuntu1604_all.deb
          73.58 MB
        2. marathon_1.7.212-0.1.20190401133936.ubuntu1804_all.deb
          73.58 MB
        3. marathon.logs.zip
          9.89 MB
        4. monitor.png
          148 kB
        5. monitor2.png
          114 kB
        6. sampler-heap.png
          104 kB
        7. smapler-th.png
          164 kB
        8. syslog.zip
          658 kB


              People

              • Assignee: Unassigned
              • Reporter: astryia
              • Team: Orchestration Team
              • Watchers: astryia, daltonmatos, Karsten Jeschkies, Matthias Eichstedt
