Feature #21074: "workflow" records link to a collection with the actual workflow - Arvados

Updated by Peter Amstutz about 1 month ago

Idea: the "workflow" table is an odd duck.    It stores a single data string in the "definition" field, but doesn't support properties, versioning, trashing, etc.    We want these things for workflows but we don't want to duplicate all the logic.    It would be better if we could just store workflows in collections. 

 However, eliminating the "workflows" API endpoint would be disruptive, as Workbench and arvados-cwl-runner both rely on it.    (We can synchronize Workbench updates, but people frequently use older versions of arvados-cwl-runner with newer API servers.) 

 Starting from Arvados 2.6.0, @--create-workflow@ works by creating a collection (of @type: workflow@) with all the workflow files, and then puts only a minimal wrapper workflow into the @definition@ field of the workflow record.    The wrapper consists of a single-step workflow which runs the real workflow from Keep (using a @keep:@ reference). 

 Workbench needs the following: 

 * The entry point (currently, it writes @definition@ to a generic @workflow.json@ file and runs that) 
 * The schema for inputs / outputs (currently extracted from @definition@) 
 * Metadata such as git commit information (currently extracted from @definition@) 
 * The actual workflow definition (currently extracted from @definition@ by looking for a single step with a @keep:@ reference) 
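
 The last lookup above can be sketched roughly as follows. This is a minimal illustration, not Workbench's actual code: it assumes @definition@ is a JSON document, and the function name @find_keep_entry_point@, the sample wrapper layout, and the sample portable data hash are all hypothetical.

```python
import json
from typing import Optional


def find_keep_entry_point(definition: str) -> Optional[str]:
    """Return the keep: reference run by a single-step wrapper, or None.

    Assumes `definition` is JSON (hypothetical helper, for illustration only).
    """
    doc = json.loads(definition)
    # A packed CWL document may hold several processes under "$graph";
    # otherwise the document itself is the only process.
    processes = doc.get("$graph", [doc])
    for proc in processes:
        if proc.get("class") != "Workflow":
            continue
        steps = proc.get("steps", [])
        # CWL allows steps as a list or as a map keyed by step id.
        if isinstance(steps, dict):
            steps = list(steps.values())
        if len(steps) != 1:
            continue
        run = steps[0].get("run")
        if isinstance(run, str) and run.startswith("keep:"):
            return run
    return None


# Made-up wrapper; the real wrapper written by arvados-cwl-runner may differ.
wrapper = json.dumps({
    "cwlVersion": "v1.2",
    "$graph": [{
        "class": "Workflow",
        "id": "#main",
        "inputs": [],
        "outputs": [],
        "steps": [{
            "id": "#main/run-real-workflow",
            "run": "keep:0123456789abcdef0123456789abcdef+123/workflow.cwl",
            "in": [],
            "out": [],
        }],
    }],
})
print(find_keep_entry_point(wrapper))
```

 A definition that is not a single-step wrapper (for example, a legacy workflow stored inline) yields @None@, so a caller would still need a fallback path for older records.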

 It seems pretty straightforward that we should create container requests to run CWL directly from the @type: workflow@ collection, because that's now 90% how it works already. 

 In other words, we probably have enough in the Collection record already to identify and launch workflows without any additional support from a-c-r (but requiring a bit of extra elbow grease in Workbench). 

 I think there are two main questions to answer: 

 # Should Workbench be expected to interact deeply with the underlying CWL, or should we copy all the information we expect Workbench to need into properties?    (At least one TypeScript library for interacting with CWL does exist.) 
 # What do we do with the legacy workflows endpoint?    We have at least one user who launches workflows by workflow UUID and who would be interrupted if the workflows endpoint just went away. 

 Also: what to do with @template_uuid@? 

 h3. Old proposal 

 To migrate workflow records to collections, I propose the following: 

 # Workflow records are migrated over to collections.    The "name" and "description" fields are straightforward.    The contents of the "definition" field would be put in Keep as "workflow.yml".    The collection record would have metadata "type: cwl-workflow"  
 # The Workflow endpoint is migrated to controller 
 # On controller, GET/PUT/POST operations are translated to apply to only collections with "type: cwl-workflow".    The contents of "definition" would be read from / written to Keep 
 ## Another option would be to generate the definition on the fly from metadata 
 # When going through the workflows endpoint, collection UUIDs would be mapped to workflow UUIDs with the same cluster and random part, just with -7fd4e- substituting for -4zz18- 
 # Going forward, we can choose to either expose additional fields and capabilities through the workflows endpoint (properties, versioning), or phase out the workflows endpoint by updating client code that uses workflows to instead use collections of "type: cwl-workflow" 
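
 The UUID mapping in the list above could look roughly like this. It is a sketch only: it assumes the usual Arvados UUID layout of a 5-character cluster ID, a 5-character type infix, and a 15-character random part, and the function names are hypothetical.

```python
import re

COLLECTION_INFIX = "4zz18"  # type infix of collection UUIDs
WORKFLOW_INFIX = "7fd4e"    # type infix of workflow UUIDs

# Assumed layout: <cluster id>-<type infix>-<random part>
_UUID_RE = re.compile(r"([a-z0-9]{5})-([a-z0-9]{5})-([a-z0-9]{15})")


def _swap_infix(uuid: str, expected: str, replacement: str) -> str:
    """Replace the type infix, keeping the cluster ID and random part."""
    m = _UUID_RE.fullmatch(uuid)
    if m is None or m.group(2) != expected:
        raise ValueError(f"not a -{expected}- UUID: {uuid!r}")
    return f"{m.group(1)}-{replacement}-{m.group(3)}"


def collection_to_workflow_uuid(uuid: str) -> str:
    return _swap_infix(uuid, COLLECTION_INFIX, WORKFLOW_INFIX)


def workflow_to_collection_uuid(uuid: str) -> str:
    return _swap_infix(uuid, WORKFLOW_INFIX, COLLECTION_INFIX)


print(collection_to_workflow_uuid("zzzzz-4zz18-0123456789abcde"))
# prints zzzzz-7fd4e-0123456789abcde
```

 Because only the infix changes, the mapping is reversible, so a request for a workflow UUID can be resolved to its backing collection without a lookup table.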

 This is probably also an opportunity to extract other metadata from the CWL document and put it in collection properties so that Workbench has it on hand without having to parse the CWL document as it currently does.
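
 As a rough illustration of that idea, the sketch below summarizes a parsed CWL document into a flat properties dictionary. The property names @cwl_version@, @input_ids@, and @output_ids@ are invented for this example and are not an existing Arvados convention; the code also assumes the document contains at least one process.

```python
from typing import Any, Dict, List


def _param_ids(params: Any) -> List[str]:
    """Return parameter ids; CWL allows either a map or a list of records."""
    if isinstance(params, dict):
        return sorted(params)
    return [p["id"] for p in params]


def summarize_workflow(cwl_doc: Dict[str, Any]) -> Dict[str, Any]:
    """Build hypothetical collection properties from a parsed CWL document."""
    processes = cwl_doc.get("$graph", [cwl_doc])
    # Prefer the "#main" process of a packed document, else take the first one.
    main = next((p for p in processes if p.get("id") == "#main"), processes[0])
    return {
        "type": "cwl-workflow",
        "cwl_version": cwl_doc.get("cwlVersion"),
        "input_ids": _param_ids(main.get("inputs", [])),
        "output_ids": _param_ids(main.get("outputs", [])),
    }


example = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "inputs": {"reads": "File", "threads": "int"},
    "outputs": [{"id": "report", "type": "File"}],
    "steps": [],
}
print(summarize_workflow(example))
```

 Whether a summary like this is sufficient, or whether Workbench still needs the full input schema, depends on how the first open question above is answered.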
