This is a ticket based on a Slack conversation in #orchestration.
Judith wanted to capture the thought for future discussion:
I noticed my Marathon tasks dying with GC failures before the kernel OOM killer got to them.
My original thoughts come from embedded-systems experience with flow control and memory pressure:
Mission-critical software should pre-allocate everything up front and control all the relevant pre-allocated queue depths. Flow control should be done with back pressure once those queues fill up.
I ran into this with Spring Boot and a horrible "production-ready" thread pool queue default in its Jetty setup. It just keeps queuing connections until it OOMs, rather than rejecting new clients.
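The bounded-queue behaviour described above can be sketched with a plain JDK ThreadPoolExecutor (this is an illustration of the back-pressure principle, not the actual Jetty or Spring Boot configuration; the pool sizes and queue depth here are made-up values). With a bounded queue and an AbortPolicy, a submission that cannot be queued fails fast with RejectedExecutionException instead of growing the heap until the OOM killer or a GC death spiral intervenes:

```java
import java.util.concurrent.*;

public class BoundedPoolSketch {
    public static void main(String[] args) throws Exception {
        // Bounded queue: at most 2 waiting tasks. Capacity is allocated once, up front.
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(2);

        // Single worker, AbortPolicy: when the queue is full, reject the caller
        // (back pressure) instead of queuing without limit.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, queue,
                new ThreadPoolExecutor.AbortPolicy());

        CountDownLatch gate = new CountDownLatch(1);
        pool.execute(() -> {                 // occupies the only worker
            try { gate.await(); } catch (InterruptedException ignored) {}
        });
        pool.execute(() -> {});              // queued (1 of 2)
        pool.execute(() -> {});              // queued (2 of 2)

        int rejected = 0;
        try {
            pool.execute(() -> {});          // queue full -> rejected immediately
        } catch (RejectedExecutionException e) {
            rejected = 1;
        }
        System.out.println("rejected=" + rejected);

        gate.countDown();
        pool.shutdown();
    }
}
```

The fourth submission is denied at the door, which is exactly the "error out on an incoming request" behaviour argued for below: the failure surfaces at a well-defined boundary instead of as an OOM in a random code block.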
It’s almost like anything Java/Scala/(favorite dynamic language here) is broken by default.
Marathon should have 100% predictable memory characteristics, because you could set an app limit and then prohibit additional deployments when resources run out. Erroring out on an incoming user request is much better than OOMing in a random code block and losing your mind.
The real enhancement here would be knobs on Marathon to control what is kept in RAM and to restrict allocations to initialization only.