We recently upgraded from Marathon 1.1 to Marathon 1.4.3, and for the second time an app has gotten "stuck" in a deployment that never finishes, even though it has 1/1 healthy tasks (it looks like it is good to go).
I'm pretty sure the deployment is a scale operation, and it always sits in state 1/2.
My only course of action is to "roll back" the deployment, but I hate doing that because the rollback scales the app to 0 and then back to 1, which means downtime.
I have lots and lots of logs, but unfortunately I didn't snag the deployment id. The logs I have are the ones with the appid in the name, spanning 2 days. Nothing in them looks bad to me, but something that should be there might be missing without my noticing. If you want, I can scrub them and post them somewhere. Even better, I would like to be able to dig further myself, but I don't really know how to "debug" a deployment like this (especially when it looks like nothing is wrong).
I couldn't find a previous issue reporting this. I think it is likely something new in versions after 1.1 (we ran 1.1 for quite a while), and I think it is pretty likely to happen again. If you tell me exactly what debug information you need, I will be sure to grab it when it does. (I can't leave the app in this broken state for more than a day, though; the show must go on.)
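
So that I don't miss the deployment id again, here is a rough sketch of what I could run to snapshot the stuck deployment before rolling back. It assumes Marathon's REST API is reachable at `http://localhost:8080` (a placeholder, not our real address) and that `GET /v2/deployments` returns a list of objects with `id`, `version`, `affectedApps`, `currentStep`, `totalSteps`, and `currentActions`. The helper names are made up for illustration, so tell me if you would rather have something else captured.

```python
import json
import sys
import urllib.request

# Assumption: placeholder address for the Marathon master.
MARATHON_URL = "http://localhost:8080"


def deployments_for_app(deployments, app_id):
    """Return the deployments whose affectedApps list contains app_id."""
    return [d for d in deployments if app_id in d.get("affectedApps", [])]


def summarize(deployment):
    """Reduce one deployment object to the fields worth pasting into an issue."""
    return {
        "id": deployment.get("id"),
        "version": deployment.get("version"),
        "step": f'{deployment.get("currentStep")}/{deployment.get("totalSteps")}',
        "actions": [a.get("action") for a in deployment.get("currentActions", [])],
    }


def fetch_deployments(base_url=MARATHON_URL):
    """Fetch the list of running deployments from /v2/deployments."""
    with urllib.request.urlopen(f"{base_url}/v2/deployments") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Usage (hypothetical file name): python snapshot_deployment.py /my/app
    app_id = sys.argv[1]
    for deployment in deployments_for_app(fetch_deployments(), app_id):
        print(json.dumps(summarize(deployment)))
```

Running this with the app id as the only argument should print one JSON line per deployment touching that app, including the deployment id I failed to record this time.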