  1. Marathon
  2. MARATHON-7673

Metrics are broken in Marathon 1.5+

      Description

      We noticed today that the "latest-soak-cluster: root-marathon appears to be flapping" and "latest-soak-cluster: root-marathon leader appears to be flapping" monitors were muted and were not receiving any data. The monitor histories show that the last data point arrived on June 15.

      Looking into the logs of the soak-monitors systemd unit, we find the following:

      python3[1038]: [INFO 2017-07-25 23:23:17,934 api_client.py:134] 202 POST https://app.datadoghq.com/api/v1/series (803.8192ms)
      python3[1038]: [INFO 2017-07-25 23:23:47,990 main.py:194] Running 20 monitors: ['Dse1Monitor', 'MetronomeMonitor', 'ChronosMonitor', 'PosixAgentMonitor', 'MesosAgentMonitor', 'Hell
      python3[1038]: [INFO 2017-07-25 23:23:48,004 __init__.py:108] Monitor "EdgeLBMonitor" failed with JSONDecodeError: Expecting value: line 1 column 1 (char 0)
      python3[1038]: [INFO 2017-07-25 23:23:48,170 __init__.py:108] Monitor "MarathonMonitor" failed with KeyError: 'value'
      python3[1038]: [INFO 2017-07-25 23:23:48,242 main.py:231] Sending 352 metric(s) to Datadog
      

      Looking at the soak-cluster-monitor code, we find the following likely source of the MarathonMonitor error: https://github.com/mesosphere/soak-cluster-monitor/blob/9182d570aa73b8d28d577b770712881899844567/monitors/marathon.py#L32-L33

      Indeed, comparing the output of /marathon/metrics from a couple of different clusters, we see that the 1.9 soak cluster provides a "gauges" field with entries like this:

          "service.mesosphere.marathon.app.count": {
            "value": 34
          }
      

      while the latest soak cluster returns a "gauges" field with entries like this:

          "service.mesosphere.marathon.app.count": {
            "count": 925268,
            "min": 40,
            "max": 41,
            "p50": 41,
            "p75": 41,
            "p98": 41,
            "p99": 41,
            "p999": 41,
            "mean": 40.99508285522461,
            "tags": {},
            "unit": {
              "name": "unknown",
              "label": "unknown"
            }
          }
      

      It seems that Marathon recently switched to a new metrics library, which likely caused this change in output; see this Slack discussion.
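
      As a possible workaround on the monitor side, here is a minimal sketch of a gauge reader that accepts both output shapes shown above. The helper name gauge_value and the choice of "mean", then "max", as fallback fields are assumptions made for illustration; they are not taken from the soak-cluster-monitor code, and the right field may differ per metric.

```python
def gauge_value(gauge, fallback_keys=("mean", "max")):
    """Return a single number for a gauge entry in either output format.

    Hypothetical helper: the fallback order is an assumption, not
    something Marathon or soak-cluster-monitor prescribes.
    """
    # Old format: {"value": 34}
    if "value" in gauge:
        return gauge["value"]
    # New format: histogram-style fields such as "mean" and "max"
    for key in fallback_keys:
        if key in gauge:
            return gauge[key]
    raise KeyError("no usable field in gauge: %s" % sorted(gauge))


# Entries copied (abridged) from the two clusters compared above.
old_style = {"value": 34}
new_style = {"count": 925268, "min": 40, "max": 41, "mean": 40.99508285522461}

# Direct indexing, as the monitor appears to do, fails on the new format.
try:
    new_style["value"]
except KeyError as err:
    print("direct lookup failed with KeyError:", err)

print(gauge_value(old_style))
print(gauge_value(new_style))
```

      Raising KeyError when no known field is present keeps the failure visible in the monitor log instead of silently reporting a wrong number.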

              People

              • Assignee: Ioannis Charalampidis (icharalampidis)
              • Reporter: Greg Mann (greg)
              • Team: Orchestration Team
              • Watchers: Chris Lambert (Inactive), Chun-Hung Hsiao, Greg Mann, Ioannis Charalampidis, Ivan Chernetsky, Jie Yu, kamaradclimber, Ken Sipe, Matthias Eichstedt