Idea #7568
Checkpoint/restart for crunch jobs (closed)
Description
The low-level technology to do checkpoint/restart with containers already exists, in the form of CRIU (e.g. http://criu.org/Docker). There are a few advantages to supporting checkpoint/restart for crunch jobs:
- high priority jobs could preempt lower priority jobs, which would be checkpointed and suspended, then resumed later
- jobs could be migrated from one host to another (making it easier to scale cloud resource utilisation up/down when scheduling less than full hosts)
- long-running jobs can be periodically checkpointed to mitigate losses in the event of an unexpected failure (such as a server failing)
To be useful, crunch would have to arrange for the necessary volume resources to be present at restart, and the Crunch API would need to support the concept that not all started jobs are running all of the time (in particular when tracking resource utilisation).
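As a rough illustration of the API point above, the sketch below models a job that has been started but is not currently running, and counts only running jobs towards resource utilisation. Everything in it is hypothetical: the state names, the `Job` class, the `ALLOWED` transition table and the `cores_in_use` helper are assumptions made for this example and are not part of the existing Crunch API.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    """Hypothetical job states; SUSPENDED means checkpointed and holding no compute."""
    QUEUED = "queued"
    RUNNING = "running"
    SUSPENDED = "suspended"
    COMPLETE = "complete"


# Assumed transition table: a running job may be suspended, and a
# suspended job may only go back to running (i.e. be restarted).
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUSPENDED, JobState.COMPLETE},
    JobState.SUSPENDED: {JobState.RUNNING},
    JobState.COMPLETE: set(),
}


@dataclass
class Job:
    name: str
    cores: int
    state: JobState = JobState.QUEUED

    def transition(self, new_state: JobState) -> None:
        """Move to new_state, rejecting transitions not listed in ALLOWED."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state

    @property
    def started(self) -> bool:
        """A job counts as started once it has left the queue, even if suspended."""
        return self.state is not JobState.QUEUED


def cores_in_use(jobs) -> int:
    """Count cores only for jobs that are actually running right now."""
    return sum(job.cores for job in jobs if job.state is JobState.RUNNING)
```

In this sketch a preempted job stays "started" while suspended, but its cores are released for accounting purposes until it returns to the running state.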