Feature #21074
Updated by Peter Amstutz about 1 month ago
Idea: the "workflow" table is an odd duck. It stores a single data string in the "definition" field, but doesn't support properties, versioning, trashing, etc. We want these things for workflows but we don't want to duplicate all the logic. It would be better if we could just store workflows in collections.
However, eliminating the "workflows" API endpoint would be disruptive, as Workbench and arvados-cwl-runner both rely on it. (We can synchronize Workbench updates, but people frequently use older versions of arvados-cwl-runner with newer API servers.)
Starting from Arvados 2.6.0, @--create-workflow@ works by creating a collection (of @type: workflow@) with all the workflow files, and then puts only a minimal wrapper workflow into the @definition@ field of the workflow record. The wrapper consists of a single-step workflow which runs the real workflow from Keep (using a @keep:@ reference).
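
As a rough illustration of the wrapper described above, the sketch below builds a single-step CWL workflow whose only step runs the real workflow through a @keep:@ reference. The function name, the step name @main@, the file name and the portable data hash are all hypothetical, and the wrapper that arvados-cwl-runner actually generates may differ in detail.

```python
def make_wrapper(collection_pdh, entry_point, input_ids, output_ids):
    """Build a minimal single-step CWL wrapper (illustrative only)."""
    return {
        "cwlVersion": "v1.2",
        "class": "Workflow",
        # Types are left as "Any" here; a real wrapper would carry the
        # actual input/output schema of the wrapped workflow.
        "inputs": {i: "Any" for i in input_ids},
        "outputs": {
            o: {"type": "Any", "outputSource": "main/" + o} for o in output_ids
        },
        "steps": {
            # Hypothetical step name; the single step runs the real
            # workflow straight out of the collection.
            "main": {
                "run": "keep:%s/%s" % (collection_pdh, entry_point),
                "in": {i: i for i in input_ids},
                "out": list(output_ids),
            }
        },
    }


# Placeholder portable data hash, not a real collection.
wrapper = make_wrapper(
    "0123456789abcdef0123456789abcdef+123", "workflow.cwl", ["reads"], ["report"]
)
```
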
Workbench needs the following:
* The entry point (currently, it writes @definition@ to a generic @workflow.json@ file and runs that)
* The schema for inputs / outputs (currently extracted from @definition@)
* Metadata such as git commit information (currently extracted from @definition@)
* The actual workflow definition (currently extracted from @definition@ by looking for a single step with a @keep:@ reference)
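
The last item in the list above (finding the real workflow by looking for a single step with a @keep:@ reference) could look roughly like this. It is only a sketch: it assumes the @definition@ string is JSON (real records may hold YAML), and @find_keep_reference@ is a hypothetical helper, not existing Workbench code.

```python
import json


def find_keep_reference(definition):
    """Return the keep: reference of a single-step wrapper, or None."""
    doc = json.loads(definition)
    # A packed CWL document may hold several processes under "$graph".
    processes = doc["$graph"] if "$graph" in doc else [doc]
    for proc in processes:
        if proc.get("class") != "Workflow":
            continue
        steps = proc.get("steps", [])
        # CWL allows steps as either a list or a map keyed by step id.
        if isinstance(steps, dict):
            steps = list(steps.values())
        if len(steps) != 1:
            continue
        run = steps[0].get("run")
        if isinstance(run, str) and run.startswith("keep:"):
            return run
    return None
```
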
It seems pretty straightforward that we should create container requests to run CWL directly from the @type: workflow@ collection, because that's now 90% how it works already.
In other words, we probably have enough in the Collection record already to identify and launch workflows without any additional support from a-c-r (but requiring a bit of extra elbow grease in Workbench).
I think there are two main questions to answer:
# Should Workbench be expected to interact deeply with the underlying CWL, or should we copy all the information we expect Workbench to need into properties? (at least one TypeScript library for interacting with CWL does exist)
# What do we do with the legacy workflows endpoint? We have at least one user who launches workflows by workflow UUID and who would be interrupted if the workflow endpoint just went away.
Also: what do we do with @template_uuid@?
h3. Points that came up in engineering discussion 2025-03-12
* How much can we disrupt user processes around workflows? Does @--create-workflow@ / @--update-workflow@ with an old a-c-r version need to work indefinitely? What about launching workflows using @arvwf:@?
* Does a workflow created by a new a-c-r need to be runnable with @arvwf:@ by an old one?
* Should @template_uuid@ be updated automatically by migration?
* Is it better to make the workflows API virtual, or phase it out?
** Possible virtual API: workflow record has its columns scrubbed and it is just a pointer to a collection. Create/update modifies the underlying collection, "get" fetches collection record and returns fields as-is, while synthesizing a "definition" field based on the collection properties (CWL inputs/outputs).
** Alternately: maybe the "workflows" table can be maintained as-is while using collections for workflows is built out, and then users are encouraged to migrate away from using "workflows" table identifiers (by printing warnings and stuff) so it can be phased out over several versions?
h3. Another migration idea (follow up after eng discussion)
Add a @collection_uuid@ field to the workflow, which is the collection with the workflow definition and all the metadata.
If @collection_uuid@ is set, then @name@, @description@ and @definition@ on the workflow must be empty. Once set, @collection_uuid@ cannot be changed (and the other fields must remain empty).
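
The constraints above could be checked along these lines. This is a sketch only: records are plain dicts, @validate_workflow_update@ is a hypothetical name, and the real check would live in the API server's own model validation.

```python
def validate_workflow_update(old, new):
    """Return a list of error strings; an empty list means the write is allowed.

    `old` is the stored record (None when creating), `new` is the record
    as it would look after the write.
    """
    errors = []
    old_cu = (old or {}).get("collection_uuid")
    new_cu = new.get("collection_uuid")
    # Once set, collection_uuid cannot be changed.
    if old_cu and new_cu != old_cu:
        errors.append("collection_uuid cannot be changed once set")
    # Collection-backed workflows must keep these fields empty.
    if new_cu:
        for field in ("name", "description", "definition"):
            if new.get(field):
                errors.append(
                    "%s must be empty when collection_uuid is set" % field
                )
    return errors
```
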
Old versions of arvados-cwl-runner will see no change in how workflow records work.
New versions of arvados-cwl-runner will create the collection with the workflow files (which is what it does already) and then create the workflow with @collection_uuid@ set.
h4. If @collection_uuid@ is empty
The behavior of the workflow API is exactly the same.
h4. If @collection_uuid@ is not empty
The record is returned with @name@ and @description@ copied from the collection record. I'm not sure what to do about @definition@. It should be possible to synthesize it from metadata stored on the collection (inputs/outputs/requirements are all things that Workbench needs to have on hand to launch workflows anyway). It would be simpler to store the wrapper in the collection as an entry-point file and read it back out, but that means the API server or controller has to read a file out of Keep.
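
A minimal sketch of the "get" behavior for the two cases, using plain dicts and a hypothetical @present_workflow@ helper. It deliberately leaves @definition@ untouched, since how to synthesize it is still an open question here.

```python
def present_workflow(workflow, collection):
    """Return the workflow record as a client would see it (illustrative)."""
    merged = dict(workflow)
    if not workflow.get("collection_uuid"):
        # Legacy record: behavior is exactly the same as today.
        return merged
    # Collection-backed record: name and description come from the collection.
    merged["name"] = collection["name"]
    merged["description"] = collection.get("description")
    return merged
```
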
The group contents API adds support for @include=collection_uuid@.
Workflow query filters add support for joining on the collection in order to query on properties, e.g. @[["collection.properties.category", "=", "WGS"]]@
Any queries on @name@ or @description@ should be turned into joins on collection, e.g. @[["name", "=", "foo"]]@ becomes @[["collection.name", "=", "foo"]]@
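
The rewrite of @name@ / @description@ filters could be sketched as below. @rewrite_filters@ and @COLLECTION_BACKED_FIELDS@ are hypothetical names, and the sketch assumes every filter is a three-element @[attribute, operator, operand]@ list.

```python
# Fields that live on the linked collection rather than on the workflow.
COLLECTION_BACKED_FIELDS = {"name", "description"}


def rewrite_filters(filters):
    """Redirect name/description filters to the joined collection."""
    rewritten = []
    for attr, op, operand in filters:
        if attr in COLLECTION_BACKED_FIELDS:
            attr = "collection." + attr
        rewritten.append([attr, op, operand])
    return rewritten
```

For example, @rewrite_filters([["name", "=", "foo"]])@ yields @[["collection.name", "=", "foo"]]@, while filters that already name the collection pass through unchanged.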
@owner_uuid@ is required to match between the workflow and the collection? Changing @owner_uuid@ on either one changes it for both?
If the linked collection record is not accessible (e.g. it is trashed, deleted, or forbidden) the workflow is not visible either. Deleting the collection record should delete the linked workflow record.
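
The visibility rule above amounts to filtering workflows by whether their linked collection is readable. A sketch, assuming the set of readable collection UUIDs is already known and that legacy workflows (no @collection_uuid@) keep their current visibility; @visible_workflows@ is a hypothetical helper.

```python
def visible_workflows(workflows, readable_collection_uuids):
    """Keep only workflows whose linked collection is accessible."""
    visible = []
    for wf in workflows:
        cu = wf.get("collection_uuid")
        # Legacy workflows have no linked collection and are unaffected.
        if not cu or cu in readable_collection_uuids:
            visible.append(wf)
    return visible
```
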
If the workflow collection included @arv:depends@ (#22565) then copying/moving could helpfully copy/move dependencies along with it, but that's not directly in scope.
Finally, for @template_uuid@ container requests, we continue to link to the workflow record by uuid, but add a new property @arv:workflow_pdh@ so we know exactly which version of the code was run.
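
For illustration, the properties on such a container request might be assembled as follows. This assumes @template_uuid@ is carried in the container request's @properties@ alongside the new @arv:workflow_pdh@ entry; the helper name and the identifiers are placeholders.

```python
def workflow_launch_properties(workflow_uuid, collection_pdh):
    """Properties linking a container request to the workflow it ran."""
    return {
        # Link to the workflow record by uuid, as today.
        "template_uuid": workflow_uuid,
        # Proposed new property: pins the exact version of the code run.
        "arv:workflow_pdh": collection_pdh,
    }


props = workflow_launch_properties(
    "zzzzz-7fd4e-aaaaaaaaaaaaaaa", "0123456789abcdef0123456789abcdef+123"
)
```
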
h4. Things I like about this solution:
Old clients get the same behavior.
New clients can change their behavior incrementally.
We can phase in support in Workbench: all code can continue to query for workflow records, but can be incrementally migrated over to using group contents (with @include=collection_uuid@). Features such as extracting properties from the linked collection record, listing past versions, and so forth can then be implemented for workflow records that use @collection_uuid@.