Idea #8645
Updated by Tom Clegg almost 9 years ago
h1. "A pipeline is a group"

Goal: In crunch v2, users can easily treat a pipeline instance and its related resources (or a subset of related resources) as a "bundle" that can be shared, copied, moved, downloaded, etc. as a unit.

The bundle can include:

* Copies of input collections
* Copies of docker images
* (Perhaps incomplete) clones of git repositories
* Copies of container requests
* Container logs
* Copies of container output collections

h2. Implementation overview

When running a pipeline, rather than create a "pipeline instance" object as in crunch1, Arvados creates a new group with @group_class="pipeline"@. Inputs are copied into the pipeline when (or even before?) the pipeline starts, and container outputs are copied into the pipeline as container requests are completed.

Pipelines get special treatment in Workbench. ("special" tbd?)

Could have API support to enforce pipelines as "read only" once they have finished running so they can't be changed.

h2. Benefits

Workbench can show (and control) "what you will share when you press Share".

* It is easy to distinguish objects that are "included" in the bundle (and therefore will be shared when the bundle is shared) from objects that are referenced by the pipeline (and might be readable by the current user) but aren't in the bundle.
* If you don't want to share some bits (e.g., non-free code, private data), simply delete them from the bundle. Optionally, make a full copy for yourself first.
* By default (if you don't delete any inputs from your bundle) you protect yourself from accidentally deleting or modifying one of your pipeline dependencies and making your pipeline impossible to reproduce. Examples:
** Even after deleting a commit from your git repo with a non-FF push, you should still be able to view that version of the source code if you used it in a pipeline. (But you should also have the option of deleting/unsharing code and data without deleting the metadata about the pipeline, if that's really what you want.)
** The user doesn't have the burden of remembering which input collections should be "frozen" in order to make pipelines reproducible. Currently, it's too easy to modify a dataset (e.g., rename a file) and then much later realize that you can no longer run a pipeline that used the old version as an input. With the proposed approach, the version of each input needed to re-run the pipeline is preserved until the user deletes it _from that pipeline._
* Everything is in one place, making provenance and sharing much easier.
* Limited sharing is easy to reason about: copy a pipeline and then delete the parts you don't want to share.
* A pipeline can include information about failed containers / container requests that were later re-attempted.

h2. Other side effects

* More groups in the system. This is one incentive (of many) to improve the permission system implementation to use a Postgres join instead of keeping a cache of all group UUIDs readable/writable by each user.
* More identical copies of collections. Search results will be more noisy, unless we de-dup/filter/sort results effectively.

h2. Implementation details

h3. git

Allow git repository objects to be owned by projects and pipelines. Support efficient cloning of the repository on the backend (similar to the github "fork this repo" button).

A pipeline bundle should include a snapshot of the parts of the original git repository that were used to run the job. The snapshot should be made efficiently, for example using @git clone /path/to/original/repo@ to make hardlinks rather than copying the git data objects.
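The hardlink-based snapshot could look roughly like the sketch below. It is illustrative only, not the Arvados implementation: the function name @snapshot_repo@ and its arguments are hypothetical, and it snapshots the whole repository rather than only the parts used by the job. It relies on the documented behaviour of @git clone@ for local paths, which hardlinks files under @.git/objects@ where possible and falls back to copying otherwise (for example across filesystems).

```python
import subprocess


def snapshot_repo(original: str, snapshot: str, commit: str) -> str:
    """Clone `original` into a bare repo at `snapshot` and confirm `commit` is in it.

    Hypothetical helper for illustration. Returns the full commit hash as
    resolved inside the snapshot.
    """
    # --local is already the default for plain local paths; it is spelled out
    # here to make the hardlinking intent explicit.
    subprocess.run(
        ["git", "clone", "--quiet", "--bare", "--local", original, snapshot],
        check=True,
    )
    # Resolve the commit inside the snapshot. check=True raises
    # CalledProcessError if the commit is not present.
    result = subprocess.run(
        ["git", "-C", snapshot, "rev-parse", "--verify", f"{commit}^{{commit}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

A caller would pass the path of the original repository, a destination inside the pipeline's storage area, and the commit the job ran with. Trimming the snapshot down to just that commit's history is left out of this sketch.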