  1. Marathon
  2. MARATHON-3571

As a Developer, I would like more information over task restarts

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Medium
    • Resolution: Won't Do
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: API
    • Labels:

      Description

      Pain point:

      Apps that fail repeatedly will restart silently, preventing the development team from understanding what the instability is, or that there even is one.

      Steps to reproduce:

      Run an app in Marathon that isn't persistent. Watch Marathon proceed to flap that app.

      Work-around:

      At this time, the only workaround is to watch the DCOS/Marathon UI or logs for restarts. This is quite laborious.

      Description

      An app definition is the JSON configuration for running your app. It is what you send to the apps endpoint of the REST API (https://mesosphere.github.io/marathon/docs/rest-api.html#apps), or what gets created when you use "New App" in the user interface. Tasks are the running instances of an app definition. If your app definition has "instances" set to three, Marathon will try to start three tasks that conform to this app definition.
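For illustration, a minimal app definition could be built like this. It is only a sketch: the "id" is the example app used in this ticket, while the "cmd", "cpus" and "mem" values are placeholders chosen for the example, not taken from any real deployment.

```python
import json

# Hypothetical minimal app definition. Only "id" and "instances" come from
# the example in this ticket; cmd, cpus and mem are placeholder values.
app_definition = {
    "id": "/internal-reporting-tool",
    "cmd": "python3 -m http.server 8080",
    "cpus": 0.1,
    "mem": 64,
    "instances": 3,
}

# Serialize to the JSON body that would be sent to the apps endpoint.
payload = json.dumps(app_definition, indent=2)
print(payload)
```

With "instances" set to 3, Marathon would try to keep three tasks of this app running.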
      So let's say that at 11:00am you create an app "/internal-reporting-tool" with "instances" set to three (I'll use a time resolution of minutes for demonstration only). Let's say that at 11:10am we have the following tasks running:
      task1) started at 11:00am, therefore it has a lifetime of 10min at 11:10am
      task2) started at 11:01am, therefore it has a lifetime of 9min at 11:10am
      task3) started at 11:05am, therefore it has a lifetime of 5min at 11:10am
      Bi) The last time of update would be 11:00am.
      Bii) The average lifetime of the tasks would be (10min+9min+5min)/3 = 8min.
      Biii) The median lifetime would be 9min (take the middle value: https://en.wikipedia.org/wiki/Median).
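The arithmetic above can be reproduced with a short sketch. The calendar date is arbitrary and the helper name task_lifetimes_minutes is made up for this example; only the start times and the 11:10am observation time come from the scenario above.

```python
from datetime import datetime
from statistics import mean, median


def task_lifetimes_minutes(started_at, now):
    """Return the lifetime of each task in minutes at time `now`."""
    return [(now - start).total_seconds() / 60 for start in started_at]


# Arbitrary date; only the times of day come from the example.
now = datetime(2016, 1, 1, 11, 10)
started_at = [
    datetime(2016, 1, 1, 11, 0),  # task1
    datetime(2016, 1, 1, 11, 1),  # task2
    datetime(2016, 1, 1, 11, 5),  # task3
]

lifetimes = task_lifetimes_minutes(started_at, now)
average_lifetime = mean(lifetimes)   # (10 + 9 + 5) / 3
median_lifetime = median(lifetimes)  # middle value of 5, 9, 10

print(lifetimes, average_lifetime, median_lifetime)
# prints: [10.0, 9.0, 5.0] 8.0 9.0
```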
      If you create an app definition with 100 instances and after 30min your tasks have an average lifetime of 3min, then either they took a long time to get started (that is a problem) or there were a lot of task failures, so the tasks had to be restarted (that would also be a problem).
      Thus, you can use these metrics as general indicators of problems.
      Alternatively, you can feed the events that Marathon emits about your applications (https://mesosphere.github.io/marathon/docs/rest-api.html#event-stream) into a metrics collector such as Graphite to keep a history of failures.
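A rough sketch of that idea follows. It assumes the event stream delivers server-sent events whose "data:" lines carry JSON with "eventType", "appId" and "taskStatus" fields, and that the collector accepts Graphite's plaintext format ("path value timestamp"). The set of statuses counted as failures, the metric prefix and the sample events are assumptions made for illustration. The sketch deliberately leaves out the network parts (reading the stream, sending to Graphite).

```python
import json
from collections import Counter

# Assumption: which task statuses count as a failure for this metric.
FAILURE_STATES = {"TASK_FAILED", "TASK_LOST"}


def count_failures(sse_lines):
    """Count failed task status updates per app from raw SSE lines."""
    failures = Counter()
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines, comments and blank lines
        event = json.loads(line[len("data:"):].strip())
        if event.get("eventType") != "status_update_event":
            continue
        if event.get("taskStatus") in FAILURE_STATES:
            failures[event.get("appId", "unknown")] += 1
    return failures


def to_graphite_lines(failures, timestamp, prefix="marathon.failures"):
    """Format counts as Graphite plaintext lines: 'path value timestamp'."""
    lines = []
    for app_id, count in sorted(failures.items()):
        metric = app_id.strip("/").replace("/", ".")
        lines.append(f"{prefix}.{metric} {count} {timestamp}")
    return lines


# Hand-written sample events, not captured from a real cluster.
sample = [
    "event: status_update_event",
    'data: {"eventType": "status_update_event", "appId": "/internal-reporting-tool", "taskStatus": "TASK_FAILED"}',
    'data: {"eventType": "status_update_event", "appId": "/internal-reporting-tool", "taskStatus": "TASK_RUNNING"}',
    'data: {"eventType": "status_update_event", "appId": "/internal-reporting-tool", "taskStatus": "TASK_FAILED"}',
]

failures = count_failures(sample)
graphite_lines = to_graphite_lines(failures, timestamp=1450000000)
print(graphite_lines)
```

A per-app failure counter like this would make repeated restarts visible on a dashboard or alert, rather than requiring someone to watch the UI or logs.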

            People

            • Assignee:
              Unassigned
              Reporter:
              GitHub_jgarcia-mesosphere John Garcia (Inactive)
              Team:
              Orchestration Team
              Watchers:
              Bekir Dogan, Chmielewski, Jason Gilanfarr (Inactive), Matthias Eichstedt
