Apps that fail repeatedly will restart silently, preventing the development team from understanding what the instability is, or that there even is one.
Run an app in Marathon that isn't persistent. Watch Marathon proceed to flap that app.
At this time, the only workaround is to watch the DCOS/Marathon UI or logs for restarts. This is quite laborious.
An app definition is basically the JSON configuration for running your apps. It is what you send to the apps endpoint of the REST API (https://mesosphere.github.io/marathon/docs/rest-api.html#apps), or what gets created when you use "New App" in the user interface. Tasks are the running instances of an app definition. If your app definition has "instances" set to three, Marathon will try to start three tasks that conform to this app definition.
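To make that concrete, here is a minimal sketch of what such an app definition could look like when built in Python. The "cmd", "cpus" and "mem" values are made up for illustration; only the id and the instance count come from the example in this answer.

```python
import json

# Hypothetical app definition: the cmd and resource values are illustrative only.
app_definition = {
    "id": "/internal-reporting-tool",
    "cmd": "python3 -m http.server 8080",
    "cpus": 0.1,
    "mem": 64,
    "instances": 3,  # Marathon will try to keep three tasks running
}

# This JSON document is what you would send to the apps endpoint.
payload = json.dumps(app_definition, indent=2)
print(payload)
```

Changing "instances" and sending the updated definition is how you would scale the app up or down.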
So let's say that at 11:00am you create an app "/internal-reporting-tool" for which "instances" is set to three (I'll use a time resolution of minutes for demonstration only). Let's say that at 11:10am we have the following tasks running:
task1) started at 11:00am, therefore it has a lifetime of 10min at 11:10am
task2) started at 11:01am, therefore it has a lifetime of 9min at 11:10am
task3) started at 11:05am, therefore it has a lifetime of 5min at 11:10am
Bi) The last time of update would be 11:00am, when the app definition was created.
Bii) The average lifetime of the tasks would be (10min+9min+5min)/3 = 8min.
Biii) The median lifetime would be 9min (take the middle value of the sorted lifetimes: https://en.wikipedia.org/wiki/Median).
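If it helps, the arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation, not how Marathon computes it internally; the calendar date is arbitrary and the task names are the ones from the example.

```python
from datetime import datetime
from statistics import mean, median

# The date is arbitrary; only the times of day matter for the example.
now = datetime(2016, 1, 1, 11, 10)
started_at = {
    "task1": datetime(2016, 1, 1, 11, 0),
    "task2": datetime(2016, 1, 1, 11, 1),
    "task3": datetime(2016, 1, 1, 11, 5),
}

# Lifetime of each task in minutes at 11:10am.
lifetimes = {
    name: (now - start).total_seconds() / 60
    for name, start in started_at.items()
}

mean_lifetime = mean(lifetimes.values())      # (10 + 9 + 5) / 3
median_lifetime = median(lifetimes.values())  # middle value of 5, 9, 10

print(mean_lifetime)    # 8.0
print(median_lifetime)  # 9.0
```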
If you create an app definition with 100 instances and after 30min your tasks have an average lifetime of 3min, then either they took a long time to get started (that is a problem) or there were a lot of task failures and the tasks had to be restarted (that is also a problem).
Thus, you can use these metrics as general indicators of problems.
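One way to turn that reasoning into a quick check is to compare the average lifetime against the age of the app definition. The function name and the 0.5 threshold below are my own assumptions for illustration, so tune or replace them to fit your apps.

```python
def looks_unstable(avg_lifetime_min, app_age_min, min_ratio=0.5):
    """Flag an app whose tasks have lived for only a small part of the app's age.

    min_ratio is a hypothetical threshold, not a Marathon setting.
    """
    if app_age_min <= 0:
        return False  # app was just created, nothing to compare yet
    return (avg_lifetime_min / app_age_min) < min_ratio


# 100-instance example: average lifetime 3min, 30min after creation.
flapping = looks_unstable(avg_lifetime_min=3, app_age_min=30)

# Three-task example: average lifetime 8min, 10min after creation.
healthy = looks_unstable(avg_lifetime_min=8, app_age_min=10)

print(flapping, healthy)
```

A low ratio does not tell you which of the two problems you have (slow starts or repeated failures), only that one of them is worth looking into.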
As another alternative, you can feed the events that Marathon emits about your applications (https://mesosphere.github.io/marathon/docs/rest-api.html#event-stream) into a metrics collector such as Graphite to keep a history of failures.
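As a rough sketch of that idea, the snippet below counts failed tasks per app from event-stream lines and formats the counts for Graphite's plaintext protocol ("<path> <value> <timestamp>"). It works on lines you have already read from the stream, so it makes no network calls. The set of task states counted as failures, the metric prefix and the sample events are my assumptions; check the event payloads your Marathon version actually sends.

```python
import json
from collections import Counter

# Assumption: these task states are the ones worth counting as failures.
FAILURE_STATES = {"TASK_FAILED", "TASK_LOST"}


def count_failures(sse_lines):
    """Count failed tasks per app id from server-sent-event lines."""
    failures = Counter()
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):].strip())
        if (event.get("eventType") == "status_update_event"
                and event.get("taskStatus") in FAILURE_STATES):
            failures[event["appId"]] += 1
    return failures


def to_graphite_lines(failures, timestamp, prefix="marathon.task_failures"):
    """Format the counts as Graphite plaintext lines (the prefix is hypothetical)."""
    lines = []
    for app_id, count in sorted(failures.items()):
        path = app_id.strip("/").replace("/", ".")
        lines.append(f"{prefix}.{path} {count} {timestamp}")
    return lines


# Made-up sample events, trimmed to the fields used above.
sample = [
    'event: status_update_event',
    'data: {"eventType": "status_update_event", "taskStatus": "TASK_FAILED", "appId": "/internal-reporting-tool"}',
    '',
    'event: status_update_event',
    'data: {"eventType": "status_update_event", "taskStatus": "TASK_RUNNING", "appId": "/internal-reporting-tool"}',
    '',
    'event: status_update_event',
    'data: {"eventType": "status_update_event", "taskStatus": "TASK_FAILED", "appId": "/internal-reporting-tool"}',
]

failures = count_failures(sample)
graphite_lines = to_graphite_lines(failures, timestamp=1460000000)
print(graphite_lines)
```

With the counts in Graphite you can graph or alert on apps whose failure count keeps climbing, instead of watching the UI.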