Bug #20378
Open
crunch-run maximum downtime tolerance
Description
crunch-run is designed to tolerate API server downtime without losing work.
However, the various things that can time out need to be aligned.
A specific problem we ran into is saveLogCollection() setting a trash_at time of now + 12 hours. When the API server went down for longer than 12 hours, the containers continued running but the log collection was trashed. When the API server came back, crunch-run couldn't update the trashed log collection.
There are probably other situations where the expectations of various components are not aligned in how long they are prepared to weather downtime. Components should:
1) Use a consistent value for downtime tolerance.
2) Attempt to shut down in an orderly fashion when that tolerance is exceeded.
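As a rough illustration of the two points above, here is a minimal Python sketch (not crunch-run's actual Go code) in which both the log collection's trash_at and the give-up decision derive from one shared value. The names MAX_DOWNTIME, TRASH_MARGIN, log_trash_at and should_give_up are hypothetical, and the 24-hour and 1-hour values are placeholders, not proposed settings.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical shared setting: the longest API outage a component tolerates.
# The value is a placeholder; the issue does not propose a number.
MAX_DOWNTIME = timedelta(hours=24)

# Hypothetical extra margin so the log collection outlives the give-up point.
TRASH_MARGIN = timedelta(hours=1)


def log_trash_at(now, max_downtime=MAX_DOWNTIME, margin=TRASH_MARGIN):
    """Return a trash_at time that is later than the tolerated outage."""
    return now + max_downtime + margin


def should_give_up(last_success, now, max_downtime=MAX_DOWNTIME):
    """True once the API has been unreachable longer than the tolerance."""
    return now - last_success > max_downtime


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("trash_at:", log_trash_at(now).isoformat())
    print("give up now?", should_give_up(now - timedelta(hours=30), now))
```

Because trash_at is computed from the same value that triggers shutdown, the log collection should not be trashed while the container is still waiting for the API server.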
Updated by Peter Amstutz over 1 year ago
- Category set to Crunch
- Subject changed from crunch- to crunch-run maximum downtime tolerance
Updated by Brett Smith over 1 year ago
I think downtime tolerance should err on the long side, and perhaps be a function of how long the job has run, maybe bounded by a minimum and a maximum. Having a compute node sitting around waiting for the API server to come back is expensive and annoying, but it's not nearly as annoying as losing a week's worth of compute because the API server was unreachable for a few hours at the end of a job.
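One possible reading of that suggestion, sketched in Python: the tolerance is a fraction of elapsed runtime, clamped between a minimum and a maximum. The function name downtime_tolerance, the 0.5 factor, and the 1-hour and 3-day bounds are all assumptions for illustration, not values anyone has proposed.

```python
from datetime import timedelta


def downtime_tolerance(
    runtime,
    factor=0.5,                  # assumed fraction of runtime to tolerate
    floor=timedelta(hours=1),    # assumed minimum tolerance
    cap=timedelta(days=3),       # assumed maximum tolerance
):
    """Scale the tolerated downtime with runtime, clamped to [floor, cap]."""
    return min(max(runtime * factor, floor), cap)


if __name__ == "__main__":
    for runtime in (timedelta(minutes=10), timedelta(hours=10), timedelta(weeks=1)):
        print(runtime, "->", downtime_tolerance(runtime))
```

With these placeholder numbers, a container that has run for a week would wait up to three days for the API server, while one that has run for ten minutes would wait only the one-hour minimum.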