Running a real scale test is currently very expensive, so we do not run one very often.
We have a mesos-simulation scale test that stubs out Mesos entirely. This lets us simulate interaction with a very large cluster without paying for the resources needed to actually launch all of those processes.
I don't yet know exactly what the mesos-simulation scale test currently does; I will update this ticket as I learn more.
Things I would like to see:
- Simulations involving lots of apps vs lots of instances
- GC stats as we scale up
- Response rate
- Time to deploy rate
- Failure rate
- Zookeeper rate
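To make the "lots of apps vs lots of instances" item concrete, here is a minimal sketch of how the simulation scenarios could be enumerated. It holds the total instance count fixed and varies how many apps it is spread across, so any difference in the measurements comes from the shape of the load rather than its size. The `Scenario` class, the `scenarios_for_total` helper, and the numbers are all hypothetical; none of this is taken from the existing mesos-simulation test.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    """One simulated load shape (hypothetical, for illustration only)."""
    apps: int
    instances_per_app: int

    @property
    def total_instances(self) -> int:
        return self.apps * self.instances_per_app


def scenarios_for_total(total: int, app_counts: List[int]) -> List[Scenario]:
    """Spread a fixed total instance count across different numbers of apps.

    App counts that do not divide the total evenly are skipped so that
    every scenario carries exactly the same total load.
    """
    return [
        Scenario(apps=count, instances_per_app=total // count)
        for count in app_counts
        if count > 0 and total % count == 0
    ]


# Assumed example values: 10,000 instances, from one huge app to 10,000 tiny ones.
matrix = scenarios_for_total(10_000, [1, 10, 100, 1_000, 10_000])
for scenario in matrix:
    print(scenario.apps, "apps x", scenario.instances_per_app, "instances")
```

Running each metric in the list above against every row of such a matrix would show whether Marathon degrades with the number of apps, the number of instances per app, or both.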
It would be great to record the Kamon metrics, too. A large captured dataset that we can analyze and visualize later would be extremely helpful.
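As a rough sketch of what "capture a large dataset" could look like, the snippet below appends metric samples to a CSV stream in a long format (one row per metric per sample), which most analysis and plotting tools can load directly. It does not use Kamon's API; it assumes the samples have already been collected into a plain dictionary. The column names, the scenario label, and the metric names are all hypothetical.

```python
import csv
import time
from typing import Dict, Optional, TextIO

# Assumed column layout: one row per (timestamp, scenario, metric).
FIELDS = ["timestamp", "scenario", "metric", "value"]


def open_writer(out: TextIO) -> csv.DictWriter:
    """Create a CSV writer on the given stream and emit the header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    return writer


def record(
    writer: csv.DictWriter,
    scenario: str,
    samples: Dict[str, float],
    timestamp: Optional[float] = None,
) -> None:
    """Write one row per metric in `samples`, sorted by metric name."""
    ts = time.time() if timestamp is None else timestamp
    for metric, value in sorted(samples.items()):
        writer.writerow(
            {"timestamp": ts, "scenario": scenario, "metric": metric, "value": value}
        )
```

A scale run would call `open_writer` once on an open file and then `record` at a fixed interval, for example `record(writer, "apps100x10", {"failure_rate": 0.01, "gc_pause_ms": 12.5})`. The long format keeps the file schema stable when new metrics are added later.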
Automating this makes it more likely that we run it regularly. Getting this right greatly reduces the risk that Marathon becomes the bottleneck in the overall system.