On the original ticket Tom asked rhetorically:
> If a container ran for 4 days but the controller has since been unreachable for 3 straight days, is it really sensible to wait one more day?
Most of the time in this situation, your decision boils down to:
- Continue to wait, with all the costs that entails.
- Give up, and the user will probably need/want to rerun the four days of compute after the cluster is healthy again, with all the costs that entails.
Which one is "best" depends on whether the controller will come back faster than the compute took to run, which requires predicting the future, so it's unknowable.
This situation also sucks because I don't think we'll get a single answer from our users. People running the workflows would probably prefer we always choose to wait. Their financial or IT folks might feel differently—especially if their cloud resources are limited and other services could use them better.
This would be more expensive for us to build, but thinking about this sideways, what if we had the option to store the results somewhere cheaper? This situation sucks the most when the compute needed an x64.gargantuan instance type, and now that node is just twiddling its thumbs waiting for the controller to come back. What if crunch-run could hand the results to crunch-dispatch, or a-c-r, or some shared storage?
More concretely, what if crunch-run wrote the logs, the output collection, and a JSON file representing the container updates it wants to make, all to Keep, then sent the manifest(s) to crunch-dispatch or a-c-r, and stopped there? The more permanent/cheaper service could hold onto those in a queue as long as necessary until the controller came back. The results would be good for as long as the Keep signatures last, which the admin can already configure.
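To make the shape of that handoff a little more concrete, here's a rough sketch in Go of what that JSON record might hold. None of these field or type names are real Arvados API; they're just placeholders for "the container updates crunch-run would otherwise have sent straight to the controller," plus the signed manifests and an expiry so the queue knows how long the record stays usable.

```go
// Hypothetical sketch only: a handoff record crunch-run could write to Keep
// for crunch-dispatch / a-c-r to queue and replay once the controller is
// reachable again. Field names are illustrative, not the real container API.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ContainerHandoff captures the updates crunch-run wants applied to the
// container record, along with the signed Keep manifests for its results.
type ContainerHandoff struct {
	ContainerUUID   string    `json:"container_uuid"`
	State           string    `json:"state"`            // e.g. "Complete"
	ExitCode        int       `json:"exit_code"`
	LogManifest     string    `json:"log_manifest"`     // signed Keep manifest text
	OutputManifest  string    `json:"output_manifest"`  // signed Keep manifest text
	FinishedAt      time.Time `json:"finished_at"`
	SignatureExpiry time.Time `json:"signature_expiry"` // when the Keep signatures lapse
}

func main() {
	h := ContainerHandoff{
		ContainerUUID:   "zzzzz-dz642-xxxxxxxxxxxxxxx",
		State:           "Complete",
		ExitCode:        0,
		LogManifest:     ". d41d8cd98f00b204e9800998ecf8427e+0 0:0:stderr.txt\n",
		OutputManifest:  ". d41d8cd98f00b204e9800998ecf8427e+0 0:0:output.txt\n",
		FinishedAt:      time.Now().UTC(),
		SignatureExpiry: time.Now().UTC().Add(14 * 24 * time.Hour),
	}
	// The dispatcher would hold this blob in its queue and replay the updates
	// against the controller whenever it comes back, as long as the
	// signatures in the manifests are still valid.
	b, _ := json.MarshalIndent(h, "", "  ")
	fmt.Println(string(b))
}
```

The point of the sketch is just that everything the controller needs is already serializable, so the expensive node can shut down as soon as this record and the collections land somewhere durable.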