crunch-run retry timeout should increase for long-running containers
For example, if a container finishes successfully after running for 48 hours, and crunch-run encounters transient errors while updating the container state to Complete via controller, it should surely retry for longer than the default 8 minutes before giving up.
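One way to picture "retry for longer" is a timeout that scales with how long the container ran. This is a minimal sketch in Python (for brevity only), not crunch-run's actual retry logic. Only the 8-minute default comes from this ticket; `RUNTIME_DIVISOR`, `MAX_RETRY_TIMEOUT`, and `retry_timeout` are hypothetical names and values.

```python
from datetime import timedelta

# Only the 8-minute default comes from this ticket; the divisor and the
# ceiling below are made-up values for illustration.
DEFAULT_RETRY_TIMEOUT = timedelta(minutes=8)
RUNTIME_DIVISOR = 10                      # assumed: retry for up to 1/10 of the run time
MAX_RETRY_TIMEOUT = timedelta(hours=12)   # assumed upper bound


def retry_timeout(runtime: timedelta) -> timedelta:
    """Return how long to keep retrying the final state update."""
    scaled = runtime / RUNTIME_DIVISOR
    # Never go below the default, never above the ceiling.
    return min(max(DEFAULT_RETRY_TIMEOUT, scaled), MAX_RETRY_TIMEOUT)
```

With these assumed values, a container that ran for 48 hours would retry for 4 hours 48 minutes, while a 5-minute container would keep the 8-minute default.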
Updated by Brett Smith 6 months ago
On the original ticket Tom asked rhetorically:
If a container ran for 4 days but then controller has since been unreachable for 3 straight days, is it really sensible to wait one more day?
Most of the time in this situation, your decision boils down to:
- Continue to wait, with all the costs that entails.
- Give up, and the user will probably need/want to rerun the four days of compute after the cluster is back to health, with all the costs that entails.
Which one is "best" depends on whether the controller will come back faster than it took to run the compute, and that requires predicting the future, so it is unknowable.
This situation also sucks because I don't think we'll get a single answer from our users. People running the workflows would probably prefer we always choose to wait. Their financial or IT folks might feel differently—especially if their cloud resources are limited and other services could use them better.
This would be more expensive for us to build, but thinking about this sideways, what if we had the option to store the results somewhere cheaper? This situation sucks the most when the compute needed an x64.gargantuan instance type, and now that node is just twiddling its thumbs waiting for the controller to come back. What if crunch-run could give the results to crunch-dispatch, or a-c-r, or some shared storage?
More concretely, what if crunch-run wrote three things to Keep (the logs, the output collection, and a JSON file representing the container updates it wants to make), sent the manifest(s) to crunch-dispatch or a-c-r, and stopped there? Then the more permanent, cheaper service could hold onto those in a queue for as long as necessary until controller came back. The results would be good for as long as the Keep signatures last, which the admin can already configure.
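A rough sketch of the queue side of that idea, in Python for brevity. Everything here is hypothetical: `PendingUpdate`, its fields, and `drain` are illustration names, not existing Arvados structures, and `apply_update` stands in for whatever call would push the saved updates to controller.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PendingUpdate:
    """Hypothetical queue record; not an existing Arvados structure."""
    container_uuid: str
    updates_locator: str     # signed Keep locator of the JSON updates file
    signature_expiry: float  # Unix time when the Keep signature stops being valid


def to_json(rec: PendingUpdate) -> str:
    """Serialize a record so the queue can be persisted."""
    return json.dumps(asdict(rec), sort_keys=True)


def drain(queue, apply_update, now):
    """Try each queued record once.

    Returns (still_pending, expired). Records whose signatures have
    expired are set aside instead of being retried forever.
    """
    still_pending, expired = [], []
    for rec in queue:
        if rec.signature_expiry <= now:
            expired.append(rec)
            continue
        try:
            apply_update(rec)          # assumed to raise ConnectionError while controller is down
        except ConnectionError:
            still_pending.append(rec)  # keep it for the next attempt
    return still_pending, expired
```

The holding service would call `drain` periodically. What to do with the `expired` list (for example, cancel those containers) is a separate decision this sketch leaves open.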
To cover the situation where communication between a-c-r and crunch-run is still working but controller isn't reachable, and the exit code and output manifest are already decided, we could have crunch-run report these the next time a-c-r sends a probe, and have a-c-r save them to disk (just as we are hoping to do with unflushed logs). Then crunch-run could time out faster, because even after a long delay or restart, a-c-r would eventually set the final container state to Complete rather than Cancelled during its "crunch-run process exited without finalizing container state" cleanup.
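The save-to-disk step and the cleanup decision might look roughly like this. It is a sketch in Python for brevity, under stated assumptions: `save_report`, `cleanup_update`, the one-file-per-container layout, and the report field names are all hypothetical, not a-c-r's actual code or on-disk format.

```python
import json
import os
import tempfile


def save_report(state_dir, container_uuid, exit_code, output):
    """Persist the result crunch-run reported in a probe response."""
    report = {"exit_code": exit_code, "output": output}
    fd, tmp = tempfile.mkstemp(dir=state_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(report, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomic rename, so a restart never sees a half-written report.
    os.replace(tmp, os.path.join(state_dir, container_uuid + ".json"))


def cleanup_update(state_dir, container_uuid):
    """Pick the final update once crunch-run has exited without finalizing."""
    try:
        with open(os.path.join(state_dir, container_uuid + ".json")) as f:
            report = json.load(f)
    except FileNotFoundError:
        # Nothing was reported before crunch-run exited.
        return {"state": "Cancelled"}
    # A saved report means the container did finish; finalize it as such.
    return {"state": "Complete", **report}
```

Writing to a temporary file and renaming it is what lets the saved report survive an a-c-r restart without ever being read in a partial state.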
The main thing that bothers me about this is that it only covers the interval between container exit and crunch-run exit. We'd still have the problem of getting through downtime before the container exits, which I suspect would be a bigger slice of the problem, cost/time wise.