Feature #11100

Updated by Peter Amstutz about 7 years ago

h1. Background 

Workflows produce many intermediate collections. For production workflows that are rarely re-run, the job reuse benefits are minimal; the intermediate collections are just clutter and take up storage space that the user would rather not pay for. Trashing them is also necessary to support a roll-in/roll-out use case, where a cluster only has enough storage for a few complete runs and input and output data are transferred from and to somewhere else.

h1. Requirements

The user should be able to specify a default behavior (retain or trash) for intermediate outputs, and to override that behavior for the output of specific steps.

 The final output is always retained.    Input should be unaffected. 

 Intermediate collections need to live as long as they are in use by downstream steps.    When intermediate collections are no longer needed by downstream steps, they should be trashed. 
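As a rough illustration of these requirements, the retain/trash decision for a single step output could be expressed as a small function. This is only a sketch: the names @output_policy@, @RETAIN@, @TRASH@ and the @overrides@ mapping are hypothetical and are not part of arvados-cwl-runner.

```python
from typing import Dict, Optional

# Hypothetical policy labels, used only for this illustration.
RETAIN = "retain"
TRASH = "trash"


def output_policy(step_name: str,
                  is_final_output: bool,
                  default: str = TRASH,
                  overrides: Optional[Dict[str, str]] = None) -> str:
    """Decide whether a step's output collection is retained or trashed."""
    # The final workflow output is always retained, whatever the default is.
    if is_final_output:
        return RETAIN
    # A per-step override takes precedence over the workflow-wide default.
    overrides = overrides or {}
    return overrides.get(step_name, default)
```

Input collections never pass through this function, so they are unaffected by either the default or the overrides.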

h1. Design

arvados-cwl-runner submits container requests; when the container completes, a collection is created and reported in output_uuid. arvados-cwl-runner can then set the trash_at field on that collection.
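A minimal sketch of how the trash_at value could be computed is shown here. The helper name @trash_at_update@ and the timestamp format are assumptions made for illustration; only the trash_at field itself comes from the description above.

```python
from datetime import datetime, timedelta, timezone


def trash_at_update(now: datetime, retention: timedelta) -> dict:
    """Build an update body that schedules a collection for trashing.

    `now` is expected to be timezone-aware; the result is expressed in UTC.
    The exact timestamp format accepted by the server is an assumption here.
    """
    trash_at = (now + retention).astimezone(timezone.utc)
    return {"trash_at": trash_at.strftime("%Y-%m-%dT%H:%M:%SZ")}
```

The resulting body would presumably be sent as a collection update for the collection identified by output_uuid.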

 * A simple approach is for arvados-cwl-runner to immediately set the trash_at time to now + 2 weeks (or some configurable time that is longer than the runtime of the workflow).    This ensures that the collection remains accessible to downstream steps (because it is not yet trashed) but still gets deleted eventually.    This is the easiest solution to implement, but has the drawback that intermediate outputs hang around for much longer than necessary.    There is a small race condition between finalizing the container request and marking the output collection as future trash; if cwl-runner is terminated abruptly it won't have a chance to mark it as future trash and it will linger. 

 * A second approach is to record all the output collections and trash them in a batch at the end.    This reduces the time that collections hang around.    However, if cwl-runner is terminated abruptly it won't have a chance to clean up.    This could be combined with the previous approach. 

 * A third approach is to track collection lifetimes inside the workflow engine.    Collections are trashed once there are no more running containers or pending downstream steps which reference the collection.    This solution minimizes the size of the working set but is more complex to implement, and has the same problem if cwl-runner is terminated abruptly.    Also, if there is significant time between trashing the collection and actually deleting blocks (e.g. 2 weeks) this effectively degenerates to the previous case. 

* A fourth approach is to move responsibility for cleaning up to the API server. Container requests have a "requested by container" field. When a parent container terminates, all container requests initiated by that container are terminated. This could be extended to include trashing the output collections of these container requests. This requires adding a new flag to container requests to indicate whether the output is temporary (or alternatively, Tom suggested overloading the semantics of "output_name" so that an empty output_name indicates temporary output and a provided output_name indicates retained output). This has the benefit that cleanup happens regardless of whether cwl-runner was able to terminate gracefully.

Automatic cleanup may interact badly with container request retries. A cwl-runner run might terminate because of node failure; a new container is then submitted automatically, and it relies on container reuse to pick up where the previous one left off. If the previous container's outputs were automatically cleaned up, the new container may be unable to resume the workflow.
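The third approach above, tracking collection lifetimes inside the workflow engine, could be sketched roughly as follows. The class @IntermediateTracker@ and its methods are hypothetical; the sketch only tracks which pending or running steps still reference each collection and does not call the Arvados API.

```python
from typing import Dict, List, Set


class IntermediateTracker:
    """Track which downstream steps still need each intermediate collection."""

    def __init__(self) -> None:
        # Maps a collection UUID to the steps that have not yet finished with it.
        self._consumers: Dict[str, Set[str]] = {}
        # Collections with no remaining consumers, in the order they were released.
        self.trashable: List[str] = []

    def register(self, collection_uuid: str, downstream_steps: Set[str]) -> None:
        """Record the pending or running steps that reference a new collection."""
        self._consumers[collection_uuid] = set(downstream_steps)
        self._release_if_unused(collection_uuid)

    def step_finished(self, step_name: str) -> List[str]:
        """Mark a step as finished and return collections that became trashable."""
        released = []
        for uuid in list(self._consumers):
            self._consumers[uuid].discard(step_name)
            if self._release_if_unused(uuid):
                released.append(uuid)
        return released

    def _release_if_unused(self, uuid: str) -> bool:
        # A collection is released only when no step references it any more.
        if self._consumers[uuid]:
            return False
        del self._consumers[uuid]
        self.trashable.append(uuid)
        return True
```

Because this state lives only in the runner's memory, the sketch shares the weakness noted above: if cwl-runner is terminated abruptly, the collections still being tracked are never trashed.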
