Story #11942

[CWL] arvados-cwl-runner should support tagging output collection using properties

Added by Tom Morris over 1 year ago. Updated 4 months ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
07/05/2017
Due date:
% Done:

0%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Subtasks

Task #13846: Review 11942-tagging-supportIn ProgressFuad Muhic


Related issues

Related to Arvados - Feature #8641: add key/value support for tagsNew2016-03-04

Related to Arvados - Feature #14016: [API] Container request can provide existing collection UUID that will accept CR outputNew

History

#1 Updated by Tom Morris over 1 year ago

The current output collection tagging using Links, but when tag support is available in the Collection.properties field, arvados-cwl-runner should switch to using that instead.

#2 Updated by Tom Clegg over 1 year ago

Use a new command line option for "tag via properties" so old code continues to work as before until we figure out the tag migration strategy.

#3 Updated by Tom Clegg over 1 year ago

Example

arvados-cwl-runner --output-properties "foo:bar,baz:qux" ...

results in

"properties":{
  "foo":"bar",
  "baz":"qux" 
}

#4 Updated by Lucas Di Pentima over 1 year ago

  • Story points set to 1.0

#5 Updated by Tom Morris 9 months ago

  • Target version changed from Arvados Future Sprints to To Be Groomed

#6 Updated by Tom Morris 8 months ago

  • Target version changed from To Be Groomed to Arvados Future Sprints

#8 Updated by Tom Morris 5 months ago

  • Assigned To set to Fuad Muhic
  • Target version changed from Arvados Future Sprints to 2018-08-01 Sprint

#9 Updated by Peter Amstutz 5 months ago

Suggestion, as an alternative or in addition to "--output-properties" there should be a command line flag that takes an input YAML file and merges that into the output collection properties.

#10 Updated by Fuad Muhic 5 months ago

  • Status changed from New to In Progress

#11 Updated by Fuad Muhic 5 months ago

11942-tagging-support is ready for a review @ 7eadb7fa179c18e41066709e8645a1e6eaad655c
test-run: https://ci.curoverse.com/job/developer-run-tests/829/

- Added --output-properties and --output-properties-yaml to arvados-cwl-runner
- propertied defined by both flags are combined into into one dictionary (if conflict happens, properties defied by --output-properties are taken)

#12 Updated by Lucas Di Pentima 5 months ago

I've tried to test this by running a-c-r against both arvbox and 4xphq and although the argument is accepted by the command, the output collection properties don't include the requested tags.
Am I making some mistake trying it, the CWL file for testing is like this:

class: CommandLineTool
cwlVersion: v1.0
inputs: []
outputs: []
stdout: output.txt
baseCommand: [echo, "Hello world"]

The command against 4xphq:

$ arvados-cwl-runner --wait --disable-reuse --api=containers --output-properties "foo:bar,baz:qux" echo.cwl

The properties field on the output collection is like this:

[...]
  "properties": {
    "type": "output",
    "container_request": "4xphq-xvhdp-hhs2zenp91tc6vi" 
  },
[...]

#13 Updated by Peter Amstutz 5 months ago

  • Target version changed from 2018-08-01 Sprint to 2018-08-15 Sprint

#14 Updated by Peter Amstutz 4 months ago

OK, we made a mistake here.

There's several ways a-c-r can submit a container request and get a collection as output.

  1. Running a in --local mode. When the workflow ends, it calls make_output_collection and can set properties.
  2. Running in --submit --wait mode. When the workflow ends, it calls make_output_collection and can set properties, and sets the PDH of the output on the container record. The container request output creates a new collection with the same content, but does not copy the properties (because it is referenced by PDH, it could find multiple collection records with conflicting properties). If the client a-c-r is still waiting for a result, it could apply the properties to the container requests's output.
  3. Running in --submit --no-wait mode. Same as above, except there's no client process available to apply the properties.
  4. As a variation, when running a single CommandLineTool in --submit mode it only submits a single container request for the tool, there is no separate workflow runner container. If the submitter process blocks, it could set properties when after the container request is completed. If it does not block (--no-wait) again there is no process available to set properties on the final output collection.

As a result, to property implement this feature seems to require API support to make it possible to specify the properties ahead of time, such as #14016

#16 Updated by Tom Morris 4 months ago

  • Related to Feature #14016: [API] Container request can provide existing collection UUID that will accept CR output added

#17 Updated by Tom Morris 4 months ago

What, specifically, is the mistake that we made? We only implemented support for some of the code paths? Does that also imply that the current support for name and expiration dates is only implemented for some of the cases?

Is the fact that there are so many different code paths a good and necessary thing, or something that we need to fix?

Is part of the problem that we record PDH instead of UUID for the output collection (and it it's ambiguous)? Should we fix that?

If we can enumerate what the problems/gaps are, we can formulate a plan to address them.

#18 Updated by Peter Amstutz 4 months ago

Tom Morris wrote:

What, specifically, is the mistake that we made?

The mistake was putting the story on the sprint without realizing the assumptions had changed going from links/jobs API to properties/containers API.

We only implemented support for some of the code paths? Does that also imply that the current support for name and expiration dates is only implemented for some of the cases?

Name and expiration dates are already handled by the API server as a special case (they are fields of the container request which are applied to the output collection). So one possible solution is to make properties a similar special case.

Is the fact that there are so many different code paths a good and necessary thing, or something that we need to fix?

Good and necessary. The main takeaway should be that sometimes the client is still running after the container, and sometimes not, so we shouldn't rely on it to do a fixup right at the end.

Is part of the problem that we record PDH instead of UUID for the output collection (and it it's ambiguous)? Should we fix that?

It complicates things because it interferes with the solution of defining that properties on the output collection are copied the container request output. We still want to record the PDH for provenance, but we could record the uuid in addition to the PDH (then there would be something to copy from.)

#19 Updated by Tom Morris 4 months ago

Peter Amstutz wrote:

Name and expiration dates are already handled by the API server as a special case (they are fields of the container request which are applied to the output collection). So one possible solution is to make properties a similar special case.

Would it make sense to generalize this to "output_collection_fields." or some such, which is a hash/dict containing things which should be set on the output container and move name, expiration date, and properties there? This is similar to "store the entire collection record" idea which was mentioned in chat, but would be restricted to just the handful of supported fields. One advantage would be future extensibility and consistent implementation.

Whether it's a new generalized field or a new specific field just for properties, it seems to make sense to be consistent with the other two collection fields which are already implemented.

Designing an entirely new mechanism seems like overkill/over engineering.

#20 Updated by Peter Amstutz 4 months ago

A generalized field for setting output collection fields would be fine.

However, for about the same amount of work, I think #14016 (container request can specify desired output_uuid) make it possible to solve the immediate problem but could apply to a wider variety of circumstances.

I agree that a generalized "run some other user code after an event" feature isn't needed to solve this specific problem, although if one existed we wouldn't need to have this conversation.

#21 Updated by Tom Morris 4 months ago

  • Target version changed from 2018-08-15 Sprint to Arvados Future Sprints

Also available in: Atom PDF