Idea #7568

closed

Checkpoint/restart for crunch jobs

Added by Joshua Randall over 8 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
10/15/2015
Due date:
Story points:
-

Description

The low-level technology to do checkpoint/restart with containers already exists, in the form of CRIU (e.g. http://criu.org/Docker). There are a few advantages to supporting checkpoint/restart for crunch jobs:
- high priority jobs could preempt lower priority jobs, which would be suspended and checkpointed, then resumed later
- jobs could be migrated from one host to another (making it easier to scale cloud resource utilisation up/down when scheduling less than full hosts)
- long-running jobs can be periodically checkpointed to mitigate losses in the event of an unexpected failure (such as a server failing)
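
As a rough illustration of the mechanics, the sketch below builds the command lines for Docker's experimental CRIU-backed checkpoint feature (`docker checkpoint create` and `docker start --checkpoint`). It assumes a Docker daemon with experimental features enabled and CRIU installed. The helper names, container name, and checkpoint directory are hypothetical, and nothing here reflects how crunch actually launches containers.

```python
from typing import List, Optional


def checkpoint_cmd(container: str, name: str,
                   checkpoint_dir: Optional[str] = None,
                   leave_running: bool = False) -> List[str]:
    """Build argv for `docker checkpoint create` (experimental Docker feature)."""
    cmd = ["docker", "checkpoint", "create"]
    if checkpoint_dir:
        # Store the checkpoint somewhere another host could read it (assumption).
        cmd += ["--checkpoint-dir", checkpoint_dir]
    if leave_running:
        # Periodic checkpoint: snapshot without stopping the container.
        cmd.append("--leave-running")
    return cmd + [container, name]


def restore_cmd(container: str, name: str,
                checkpoint_dir: Optional[str] = None) -> List[str]:
    """Build argv for `docker start --checkpoint` to resume from a checkpoint."""
    cmd = ["docker", "start", "--checkpoint", name]
    if checkpoint_dir:
        cmd += ["--checkpoint-dir", checkpoint_dir]
    return cmd + [container]


if __name__ == "__main__":
    # Hypothetical names; pass the lists to subprocess.run() to execute them.
    print(checkpoint_cmd("crunch-job-1", "cp1", leave_running=True))
    print(restore_cmd("crunch-job-1", "cp1"))
```

The functions only build argument lists, so they can be inspected without a Docker daemon. Preemption would checkpoint without `--leave-running`, while periodic checkpointing of a long-running job would pass it.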

To be useful, crunch would have to arrange for the necessary volume resources to be present at restart, and the Crunch API would need to support the concept that not all started jobs need to be running all of the time (in particular when tracking resource utilisation).
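
One way to picture the API change is a job state that means "started but suspended", which resource accounting then ignores. The sketch below is only an illustration under that assumption: `JobState`, `Job`, `cores_in_use`, and `preempt_for` are hypothetical names, not part of the Crunch API, and the actual checkpointing step is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    CHECKPOINTED = "checkpointed"  # started, but suspended to disk
    COMPLETE = "complete"


@dataclass
class Job:
    uuid: str
    priority: int
    cores: int
    state: JobState = JobState.QUEUED


def cores_in_use(jobs: List[Job]) -> int:
    """Count only running jobs; checkpointed jobs hold no cores."""
    return sum(j.cores for j in jobs if j.state is JobState.RUNNING)


def preempt_for(new_job: Job, jobs: List[Job], total_cores: int) -> bool:
    """Suspend the lowest-priority running jobs until new_job fits.

    Returns True and marks new_job RUNNING if room could be made;
    otherwise returns False and changes nothing.
    """
    free = total_cores - cores_in_use(jobs)
    candidates = sorted(
        (j for j in jobs
         if j.state is JobState.RUNNING and j.priority < new_job.priority),
        key=lambda j: j.priority,
    )
    victims = []
    for cand in candidates:
        if free >= new_job.cores:
            break
        victims.append(cand)
        free += cand.cores
    if free < new_job.cores:
        return False
    for victim in victims:
        # A real implementation would checkpoint the container here.
        victim.state = JobState.CHECKPOINTED
    new_job.state = JobState.RUNNING
    return True
```

In this model a preempted job keeps its identity and history but drops out of utilisation totals until it is restored, which is the distinction the paragraph above asks the API to support.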

Actions #1

Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Closed
