Feature #21074

open

"workflow" records link to a collection with the actual workflow

Added by Peter Amstutz over 1 year ago. Updated 3 days ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: API
Target version:
Story points: -

Description

1. Add "collection_uuid" to workflows table
2. Update API revision
3. When collection_uuid is set, workflows controller rejects updates
4. When a collection with type: workflow is updated, search for workflows with the corresponding collection_uuid and synchronize name/description/definition/owner_uuid
5. The group contents API adds support for include=collection_uuid
6. The workflow query filter adds support for joining on the collection in order to query on properties, e.g. it should be possible to do [["collection.properties.category", "=", "WGS"]] so clients can query/filter workflow records by collection properties.
7. Deleting a collection should delete the linked workflow record

Need to define exactly how to assemble the definition. The idea is to put the input and output sections in properties on the collection, then assembling the wrapper is pretty straightforward (looks like we need hints/requirements as well). This is the most relevant code in arvados-cwl-runner:

    wrapper = {
        "class": "Workflow",
        "id": "#main",
        "inputs": newinputs,
        "outputs": [],
        "steps": [step]
    }

    for i in main["inputs"]:
        step["in"].append({
            "id": "#main/step/%s" % shortname(i["id"]),
            "source": "#main/%s" % shortname(i["id"])
        })

    for i in main["outputs"]:
        step["out"].append({"id": "#main/step/%s" % shortname(i["id"])})
        wrapper["outputs"].append({"outputSource": "#main/step/%s" % shortname(i["id"]),
                                   "type": i["type"],
                                   "id": "#main/%s" % shortname(i["id"])})

    wrapper["requirements"] = [{"class": "SubworkflowFeatureRequirement"}]

    if main.get("requirements"):
        wrapper["requirements"].extend(main["requirements"])
    if hints:
        wrapper["hints"] = hints

    # Schema definitions (this lets you define things like record
    # types) require a special handling.

    for i, r in enumerate(wrapper["requirements"]):
        if r["class"] == "SchemaDefRequirement":
            wrapper["requirements"][i] = fix_schemadef(r, main["id"], tool.doc_loader.expand_url, merged_map, jobmapper, col.portable_data_hash())

    doc = {"cwlVersion": "v1.2", "$graph": [wrapper]}

    if git_info:
        for g in git_info:
            doc[g] = git_info[g]

I'm thinking:

type: workflow
arv:cwl_inputs
arv:cwl_outputs
arv:cwl_requirements
arv:cwl_hints

Possibly slightly confusing because we also use "cwl_input" and "cwl_output" to store the input and output objects on container requests for workflow steps, but it isn't a direct collision ("inputs" is plural) and I lean towards preferring consistency with the CWL fields they correspond to.
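
A minimal sketch of how the proposed properties could feed the wrapper assembly, assuming arv:cwl_inputs and arv:cwl_outputs hold CWL parameter lists with ids such as "#main/fastq". The function name synthesize_definition, the simplified shortname helper, and the keep: path are hypothetical stand-ins, and SchemaDefRequirement fixups and git_info are left out:

```python
import json


def shortname(cwl_id):
    """Simplified stand-in for the real helper: last segment of a CWL id."""
    return cwl_id.rsplit("/", 1)[-1].rsplit("#", 1)[-1]


def synthesize_definition(properties, keep_ref):
    """Build a single-step wrapper document from collection properties.

    `properties` is assumed to carry the proposed arv:cwl_* keys;
    `keep_ref` is a hypothetical "keep:<pdh>/workflow.cwl" style reference.
    """
    step = {"id": "#main/step", "run": keep_ref, "in": [], "out": []}
    wrapper = {
        "class": "Workflow",
        "id": "#main",
        "inputs": [],
        "outputs": [],
        "steps": [step],
        "requirements": [{"class": "SubworkflowFeatureRequirement"}],
    }
    for i in properties.get("arv:cwl_inputs", []):
        name = shortname(i["id"])
        # Copy the input schema, re-rooted under the wrapper's "#main".
        wrapper["inputs"].append(dict(i, id="#main/%s" % name))
        step["in"].append({"id": "#main/step/%s" % name,
                           "source": "#main/%s" % name})
    for o in properties.get("arv:cwl_outputs", []):
        name = shortname(o["id"])
        step["out"].append({"id": "#main/step/%s" % name})
        wrapper["outputs"].append({"id": "#main/%s" % name,
                                   "type": o["type"],
                                   "outputSource": "#main/step/%s" % name})
    wrapper["requirements"].extend(properties.get("arv:cwl_requirements", []))
    hints = properties.get("arv:cwl_hints")
    if hints:
        wrapper["hints"] = hints
    return json.dumps({"cwlVersion": "v1.2", "$graph": [wrapper]}, indent=2)


# Illustrative property values only.
props = {
    "type": "workflow",
    "arv:cwl_inputs": [{"id": "#main/fastq", "type": "File"}],
    "arv:cwl_outputs": [{"id": "#main/bam", "type": "File"}],
}
definition = synthesize_definition(props, "keep:<pdh>/workflow.cwl")
```

Keeping the property names aligned with the CWL fields means the synthesis is mostly a copy with ids re-rooted, which is the consistency argument made above.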

Old discussion

Idea: the "workflow" table is an odd duck. It stores a single data string in the "definition" field, but doesn't support properties, versioning, trashing, etc. We want these things for workflows but we don't want to duplicate all the logic. It would be better if we could just store workflows in collections.

However, eliminating the "workflows" API endpoint would be disruptive, as Workbench and arvados-cwl-runner both rely on it. (We can synchronize Workbench updates, but people frequently use older versions of arvados-cwl-runner with newer API servers.)

Starting from Arvados 2.6.0, --create-workflow works by creating a collection (of type: workflow) with all the workflow files, and then puts only a minimal wrapper workflow into the definition field of the workflow record. The wrapper consists of a single-step workflow which runs the real workflow from Keep (using a keep: reference).

Workbench needs the following:

  • The entry point (currently, it writes definition to a generic workflow.json file and runs that)
  • The schema for inputs / outputs (currently extracted from definition)
  • Metadata such as git commit information (currently extracted from definition)
  • The actual workflow definition (currently extracted from definition by looking for a single step with a keep: reference)
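
The last item above (finding the real workflow behind the wrapper) might look roughly like this. It is a sketch that assumes the definition is JSON text in the packed "$graph" layout produced by the wrapper code quoted in the description; find_keep_reference is a hypothetical name, not an existing Workbench or arvados-cwl-runner function:

```python
import json


def find_keep_reference(definition):
    """Return the keep: reference run by the single wrapper step, or None.

    A definition that is not a single-step wrapper yields None.
    """
    doc = json.loads(definition)
    for process in doc.get("$graph", []):
        steps = process.get("steps", [])
        if process.get("class") != "Workflow" or len(steps) != 1:
            continue
        run = steps[0].get("run")
        if isinstance(run, str) and run.startswith("keep:"):
            return run
    return None
```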

It seems pretty straightforward that we should create container requests to run CWL directly from the type: workflow collection, because that's now 90% how it works already.

In other words, we probably have enough in the Collection record already to identify and launch workflows without any additional support from a-c-r (but requiring a bit of extra elbow grease in Workbench).

I think there are two main questions to answer:

  1. Should Workbench be expected to interact deeply with the underlying CWL, or should we copy all the information we expect Workbench to need into properties? (at least one TypeScript library for interacting with CWL does exist)
  2. What do we do with the legacy workflows endpoint? We have at least one user who launches workflows by workflow UUID and who would be disrupted if the workflows endpoint just went away.

Also: what to do with template_uuid?

Points that came up in engineering discussion 2025-03-12

  • How much can we disrupt user processes around workflows? Does create-workflow/update-workflow with an old a-c-r version need to work indefinitely? What about launching workflows using arvwf:?
  • Does a workflow created by a new a-c-r need to be runnable with arvwf: by an old one?
  • Should template_uuid be updated automatically by migration?
  • Is it better to make the workflows API virtual, or phase it out?
    • Possible virtual API: workflow record has its columns scrubbed and it is just a pointer to a collection. Create/update modifies the underlying collection, "get" fetches collection record and returns fields as-is, while synthesizing a "definition" field based on the collection properties (CWL inputs/outputs).
    • Alternately: maybe the "workflows" table can be maintained as-is while using collections for workflows is built out, and then users are encouraged to migrate away from using "workflows" table identifiers (by printing warnings and stuff) so it can be phased out over several versions?

Another migration idea (follow up to eng discussion)

Add a collection_uuid field to the workflow, which is the collection with the workflow definition and all the metadata.

If collection_uuid is set, then the workflow is linked to a collection. Once set, collection_uuid cannot be changed. Subsequently, the workflow name, description, and definition are synchronized with the collection.

Old versions of arvados-cwl-runner that do not set collection_uuid will see no change to how workflow records work.

New versions of arvados-cwl-runner will create the collection with the workflow files (which is what they do already) and then create the workflow with collection_uuid set. To update the workflow, they only need to update the collection associated with the workflow.

If collection_uuid is empty

The behavior of the workflow API is exactly the same.

If collection_uuid is not empty

The name and description fields are synchronized with the collection record; updating the collection record updates the workflow record. This means updating a collection needs to check for a linked workflow and update the workflow record in the same transaction.

The definition should be synthesized from metadata stored on the collection (inputs/outputs/requirements, which are all things that Workbench needs to have on hand to launch workflows already). When the collection is updated, the synchronization method constructs a new value for definition, which is set on the workflow record.

It's probably simpler if this only goes in one direction, i.e. if collection_uuid is non-empty then the workflow record can no longer be updated through the API directly, only by updating the backing collection, which indirectly updates the workflow record.
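
The one-directional rule can be illustrated with a small in-memory model. This is only a sketch: the class and function names are hypothetical and say nothing about how the API server is actually structured, and make_definition stands in for whatever synthesizes the definition from collection properties:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Collection:
    uuid: str
    owner_uuid: str
    name: str
    description: str = ""
    properties: dict = field(default_factory=dict)


@dataclass
class Workflow:
    uuid: str
    owner_uuid: str
    name: str = ""
    description: str = ""
    definition: str = ""
    collection_uuid: Optional[str] = None


class LinkedWorkflowError(Exception):
    """Raised when a linked workflow is modified directly."""


def update_workflow(workflow, **attrs):
    """Direct update path: allowed only for legacy (unlinked) workflows."""
    if workflow.collection_uuid:
        raise LinkedWorkflowError("update the backing collection instead")
    for key, value in attrs.items():
        setattr(workflow, key, value)


def sync_from_collection(collection, workflows, make_definition):
    """Copy name/description/definition onto workflows linked to the collection.

    On a real server this would run in the same transaction as the
    collection update.
    """
    for wf in workflows:
        if wf.collection_uuid == collection.uuid:
            wf.name = collection.name
            wf.description = collection.description
            wf.definition = make_definition(collection.properties)
```

Legacy records (empty collection_uuid) pass through update_workflow untouched, which matches the "no change for old clients" goal.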

The group contents API adds support for include=collection_uuid so that clients can fetch both the workflow record and associated collection record in the same API request.

The workflow query filter adds support for joining on the collection in order to query on properties, e.g. it should be possible to do [["collection.properties.category", "=", "WGS"]] so clients can query/filter workflow records by collection properties.
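
As a rough illustration of what a client would send for the two requests described above, the sketch below only builds query strings using the standard library. The helper names are hypothetical, and the "is_a" filter restricting group contents to workflows is an assumption about how a client might narrow the listing:

```python
import json
from urllib.parse import urlencode


def workflow_list_query(property_name, value):
    """Query string for listing workflows filtered on a linked collection property."""
    filters = [["collection.properties.%s" % property_name, "=", value]]
    return urlencode({"filters": json.dumps(filters)})


def group_contents_query(include="collection_uuid"):
    """Query string asking group contents to also return linked collection records."""
    filters = [["uuid", "is_a", "arvados#workflow"]]
    return urlencode({"filters": json.dumps(filters), "include": include})


# Illustrative values only.
print(workflow_list_query("category", "WGS"))
print(group_contents_query())
```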

owner_uuid is required to match between the workflow and the collection. Changing owner_uuid on the linked collection changes it for the workflow as well.

If the linked collection record is not accessible (e.g. it is trashed, deleted, or forbidden) the workflow should not be visible either. Deleting the collection record should delete the linked workflow record.
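
The ownership, visibility, and deletion rules above could be sketched as follows, using plain dicts for records. All function names are hypothetical, and readable_collections stands in for whatever permission check the server performs:

```python
def visible_workflows(workflows, readable_collections):
    """Hide linked workflows whose collection the caller cannot read.

    `readable_collections` is the set of collection uuids the caller can
    currently read (not trashed, deleted, or forbidden). Legacy workflows
    without collection_uuid are unaffected.
    """
    return [
        wf for wf in workflows
        if not wf.get("collection_uuid")
        or wf["collection_uuid"] in readable_collections
    ]


def on_collection_owner_change(collection, workflows):
    """Propagate a new owner_uuid from the collection to its linked workflows."""
    for wf in workflows:
        if wf.get("collection_uuid") == collection["uuid"]:
            wf["owner_uuid"] = collection["owner_uuid"]


def on_collection_delete(collection_uuid, workflows):
    """Return the workflow list without records linked to the deleted collection."""
    return [wf for wf in workflows
            if wf.get("collection_uuid") != collection_uuid]
```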

If the workflow collection included something like arv:depends (#22565) then copying/moving could helpfully copy/move dependencies along with it, but that's not directly in scope.

Finally, for template_uuid container requests, we continue to link to the workflow record by uuid, but add a new property arv:workflow_pdh so we know precisely which version of the code was run.
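
For illustration, the relevant part of such a container request might be built like this. The helper name is hypothetical, and placing template_uuid next to arv:workflow_pdh in properties is an assumption made for the sketch:

```python
def workflow_provenance_properties(workflow_uuid, collection_pdh):
    """Properties recording which workflow record and which exact version ran.

    Assumption: template_uuid is stored as a container request property.
    """
    return {
        "template_uuid": workflow_uuid,       # stable link to the workflow record
        "arv:workflow_pdh": collection_pdh,   # exact content that was run
    }
```

Because the workflow record follows the collection as it is updated, the UUID alone would not identify the code that ran; the portable data hash pins the version.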

Things I like about this solution:

Old clients get the same behavior.

New clients can change their behavior incrementally.

We can phase in support in Workbench: all code will continue to query for workflow records, but can be incrementally migrated over to using group contents (for include=collection_uuid). Extracting properties from the linked workflow record, listing past versions, and so forth can then be implemented for workflow records that use collection_uuid.


Subtasks 1 (1 open, 0 closed)

Task #22717: Review 21074-workflow-collection-link (In Progress, Tom Clegg, 04/07/2025)

Related issues 4 (3 open, 1 closed)

Related to Arvados Epics - Idea #19132: Registered workflow improvements (In Progress, 09/01/2023, 06/30/2025)
Related to Arvados - Idea #22565: General purpose arv:depends property to indicate the data/code in a collection or object depends on other collections or objects (In Progress)
Related to Arvados - Idea #21292: New workflow picker panel (New)
Related to Arvados Workbench 2 - Feature #19387: Support picking workflows uploaded as collections with type: workflow. (Rejected)
Actions #1

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz over 1 year ago

  • Related to Idea #19132: Registered workflow improvements added
Actions #3

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Future to Development 2024-01-17 sprint
Actions #4

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Actions #5

Updated by Peter Amstutz over 1 year ago

  • Category set to API
Actions #6

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #7

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #8

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #9

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Actions #10

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Actions #11

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Actions #12

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Actions #13

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Actions #14

Updated by Peter Amstutz about 1 year ago

  • Tracker changed from Idea to Feature
Actions #15

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Actions #16

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-04-24 sprint to Development 2024-05-08 sprint
Actions #17

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-05-08 sprint to Development 2024-06-05 sprint
Actions #18

Updated by Peter Amstutz 11 months ago

  • Release set to 70
Actions #19

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-06-05 sprint to 439
Actions #20

Updated by Peter Amstutz 11 months ago

  • Target version changed from 439 to Development 2024-07-03 sprint
Actions #21

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2024-07-03 sprint to Development 2024-07-24 sprint
Actions #22

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-07-24 sprint to Development 2024-08-28 sprint
Actions #23

Updated by Peter Amstutz 9 months ago

  • Release deleted (70)
Actions #24

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-08-28 sprint to Development 2024-09-11 sprint
Actions #25

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-09-11 sprint to Development 2024-09-25 sprint
Actions #26

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-09-25 sprint to Development 2024-10-09 sprint
Actions #27

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2024-10-09 sprint to Development 2024-10-23 sprint
Actions #28

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2024-10-23 sprint to Development 2024-11-06 sprint
Actions #29

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-11-06 sprint to Development 2024-11-20
Actions #30

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-11-20 to Development 2024-12-04
Actions #31

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-12-04 to Development 2025-01-08
Actions #32

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2025-01-08 to Future
Actions #33

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #34

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #35

Updated by Peter Amstutz 29 days ago

  • Related to Idea #22565: General purpose arv:depends property to indicate the data/code in a collection or object depends on other collections or objects added
Actions #36

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #37

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #38

Updated by Peter Amstutz 29 days ago

  • Subject changed from Migrate "workflow" table to be backed by collections but maintain API to "workflow" records link to a collection with the actual workflow
Actions #39

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #40

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #41

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #42

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #43

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #44

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #45

Updated by Peter Amstutz 29 days ago

  • Description updated (diff)
Actions #46

Updated by Peter Amstutz 27 days ago

  • Description updated (diff)
Actions #47

Updated by Peter Amstutz 23 days ago

  • Related to Idea #21292: New workflow picker panel added
Actions #48

Updated by Peter Amstutz 23 days ago

  • Target version changed from Future to Development 2025-04-16
  • Tracker changed from Feature to Idea
Actions #49

Updated by Peter Amstutz 15 days ago

  • Related to Feature #19387: Support picking workflows uploaded as collections with type: workflow. added
Actions #50

Updated by Peter Amstutz 15 days ago

  • Description updated (diff)
Actions #51

Updated by Peter Amstutz 15 days ago

  • Description updated (diff)
Actions #52

Updated by Peter Amstutz 9 days ago

  • Tracker changed from Idea to Feature
Actions #53

Updated by Peter Amstutz 9 days ago

  • Assigned To set to Peter Amstutz
Actions #54

Updated by Peter Amstutz 8 days ago

  • Subtask #22717 added
Actions #55

Updated by Peter Amstutz 6 days ago

  • Status changed from New to In Progress
Actions #59

Updated by Peter Amstutz 3 days ago

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • Adds collection_uuid to workflow record
    • Adds synchronization between collection records with type: workflow and workflow records linked by collection_uuid
    • Adds ability to query linked collection properties through workflows, e.g. [["collection.properties.category", "=", "WGS"]]
    • Adds ability to fetch collection records linked to workflow records using include=collection_uuid when using the group contents API
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • Arvados-cwl-runner client support will be added in ticket #22761
  • Code is tested and passing, both automated and manual; what manual testing was done is described.
    • Added a bunch of tests for all the new behaviors
  • New or changed UI/UX has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • updated workflows API page about linked workflows
    • updated group contents API page to mention that collection_uuid can be used with include
  • Behaves appropriately at the intended scale (describe intended scale).
    • Workflows are updated infrequently. Scale concerns around accessing workflow and collection records don't change. Collection records representing workflows get slightly heavier with the introduction of cwl_inputs and cwl_outputs properties but these are still significantly smaller than e.g. most mounts sections.
  • Considered backwards and forwards compatibility issues between client and server.
    • Yes, this entire design is a compromise intended to maintain backwards compatibility. Older versions of arvados-cwl-runner will be able to create and consume workflows as if nothing changed. Workbench can support both legacy workflows and workflows that are linked to collections without forcing a migration.
  • Follows our coding standards and GUI style guidelines.
    • yes.