Project

General

Profile

Actions

Feature #11100

closed

[CWL] Intermediary collection handling can be specified

Added by Tom Morris about 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
3.0

Description

Background

Workflows produce a lot of intermediate collections. For production workflows that are rarely re-run, the job reuse benefits are minimal, instead this is just clutter and takes up storage space that the user would rather not pay for. This is also necessary to support a roll-in/roll-out use case where a cluster only has sufficient storage to store a few complete runs and input and output data are transferred from/to somewhere else.

Requirements

Should be able to specify default behavior (retain or trash) but override behavior for output of specific steps.

The final output is always retained. Input should be unaffected.

Intermediate collections need to live as long as they are in use by downstream steps. When intermediate collections are no longer needed by downstream steps, they should be trashed.

Design

arvados-cwl-runner submits container requests; when the container completes a collection is created and reported in output_uuid. Arvados-cwl-runner can then set the trash_at field on the collection.

  • API server
    • Add a "output_ttl" field to container request. This value is in seconds. When the output collection is created for the container request, it should have trash_at and delete_at set now + output_ttl (assume that tokens are issued with expiry times less than trash_at). A value of <= 0 means don't set trash_at.
    • Add tests.
    • Update documentation
  • CWL runner
    • When "intermediate output TTL" is provided, container requests are submitted with output_ttl set
    • Default behavior is output_ttl is None or 0 (to be consistent with current behavior.)
    • When workflow completes successfully, everything marked as intermediate should be trashed immediately. Do not do this on workflow failure.
    • Provide command line option to indicate that things shouldn't be delete immediately
    • Custom Arvados CWL hint to specify treatment of individual step outputs
    • Update documentation

Subtasks 3 (0 open3 closed)

Task #11370: Update cwl runner after API feature is mergedResolvedPeter Amstutz05/22/2017Actions
Task #11389: Review 11100-cwl-set-output-ttlResolvedLucas Di Pentima05/22/2017Actions
Task #11388: Review 11100-cr-output-ttlResolvedPeter Amstutz04/04/2017Actions

Related issues

Related to Arvados - Idea #9277: [Crunch2] System-owned container outputs should be garbage-collectedResolvedPeter Amstutz02/16/2017Actions
Related to Arvados - Idea #9589: [Workbench] Update collection interface for collections with non-nil trash_atClosedRadhika Chippada07/13/2016Actions
Actions #1

Updated by Peter Amstutz about 7 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Morris about 7 years ago

  • Assigned To set to Peter Amstutz
Actions #3

Updated by Peter Amstutz about 7 years ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz about 7 years ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz about 7 years ago

  • Description updated (diff)
  • Add a "output_ttl" on container request which means output will have trash_at and delete_at set now + output_ttl (assume that tokens are issued with expiry times less than trash_at)
  • Intermediate outputs can be used/reused before they are trashed
  • CWL runner intermediate steps are submitted with output_ttl
  • CWL options to control default treatment of step outputs
Actions #6

Updated by Peter Amstutz about 7 years ago

  • Description updated (diff)
Actions #7

Updated by Tom Morris about 7 years ago

  • Assigned To deleted (Peter Amstutz)
  • Target version changed from Arvados Future Sprints to 2017-04-12 sprint
  • Story points set to 3.0
Actions #8

Updated by Peter Amstutz about 7 years ago

  • Description updated (diff)
Actions #9

Updated by Peter Amstutz about 7 years ago

  • Project changed from 35 to Arvados
Actions #10

Updated by Tom Clegg about 7 years ago

  • Assigned To set to Tom Clegg
Actions #11

Updated by Tom Clegg about 7 years ago

Currently it's an error to set delete_at to a time earlier than blob_signature_ttl. Some time passes between when the container request chooses a delete_at and when the collection validation checks whether delete_at is valid. This means the container request code has to add a few seconds of grace period to avoid failing during slow times. Perhaps it would be better to change the rule so, if a client sets delete_at to a value too soon to achieve safely, it automatically gets extended to the earliest possible time?

This behavior doesn't seem to be documented in either of the likely places
Actions #12

Updated by Tom Clegg about 7 years ago

Actions #13

Updated by Peter Amstutz about 7 years ago

Tom Clegg wrote:

Currently it's an error to set delete_at to a time earlier than blob_signature_ttl. Some time passes between when the container request chooses a delete_at and when the collection validation checks whether delete_at is valid. This means the container request code has to add a few seconds of grace period to avoid failing during slow times. Perhaps it would be better to change the rule so, if a client sets delete_at to a value too soon to achieve safely, it automatically gets extended to the earliest possible time?

So instead of a validation error, it would just adjust it to a valid value. Yes.

Actions #14

Updated by Tom Clegg about 7 years ago

  • Status changed from New to In Progress
Actions #15

Updated by Tom Clegg about 7 years ago

With "earliest possible delete_at":

11100-cr-output-ttl @ 65121f8db54a1ed15207d050e1f48c5fc26d646b

Actions #16

Updated by Peter Amstutz about 7 years ago

Documentation of the output_ttl should specify units.

Should the code to adjust delete_at to the earliest valid time be performed before_validation so that the actual validation doesn't modify the record? (Not sure if rails is opinionated about validations being pure). (default_trash_interval does something similar and is in before_validation).

Rest LGTM

Actions #17

Updated by Tom Clegg about 7 years ago

Peter Amstutz wrote:

Documentation of the output_ttl should specify units.

Indeed. Got it in Containers API but missed it in docs. Fixed.

Should the code to adjust delete_at to the earliest valid time be performed before_validation so that the actual validation doesn't modify the record? (Not sure if rails is opinionated about validations being pure). (default_trash_interval does something similar and is in before_validation).

Not sure about Rails either, but yes, that seems neater. Fixed.

Also added a bit to acknowledge that no tokens will expire before the previous (hence previously validated) value of delete_at.

11100-cr-output-ttl @ ff3bb22d4b2bff5666907a6eeb6cd68cd3cbe22b

Actions #18

Updated by Tom Clegg about 7 years ago

  • Assigned To changed from Tom Clegg to Peter Amstutz
Actions #19

Updated by Peter Amstutz about 7 years ago

  • Target version changed from 2017-04-12 sprint to 2017-04-26 sprint
Actions #20

Updated by Peter Amstutz almost 7 years ago

  • Target version changed from 2017-04-26 sprint to 2017-05-10 sprint
Actions #21

Updated by Peter Amstutz almost 7 years ago

  • Target version changed from 2017-05-10 sprint to 2017-05-24 sprint
Actions #22

Updated by Lucas Di Pentima almost 7 years ago

Reviewed 8645888f1 - branch 11100-cwl-set-output-ttl

Some minor comments:

  • File doc/user/cwl/cwl-extensions.html.textile.liquid:L86 - duplicated ‘before’. Also s/recommend/recommended/ ?
  • File sdk/cwl/arvados_cwl/arv-cwl-schema.yml:L145 - Same as previous: s/recommend/recommended/
  • If there’s no way to ensure that an intermediate output collection doesn’t get trashed while a step depending on it is running, maybe we should add a log warning when every intermediate output is created with a short TTL, for future debugging purposes. Didn’t check if this is done on the API serve side.

Apart from that, LGTM. Local tests ran ok.

Actions #23

Updated by Peter Amstutz almost 7 years ago

I decided to add --trash-intermediate/--no-trash-intermediate as distinct options from --intermediate-output-ttl.

Updated documentation.

Not really sure about logging "TTL too short". We could pick some arbitrary number (60 seconds? 3600 seconds?) but it is highly workflow dependent.

Now at bee4fa82c15b2188c2105fff3b52c305b38f04a6

Actions #24

Updated by Lucas Di Pentima almost 7 years ago

  • File sdk/cwl/arvados_cwl/arvcontainer.py:L178 - Should that check be also on the arvados-cwl-runner side?
  • Do you think it would be useful to add a test checking the new argument --trash-intermediate and its default behavior?
  • Instead of logging when output TTL is too short, how about logging a message noticing when creating an output collection with the ttl? (I believe this would be on the api server side, right?).
Actions #25

Updated by Lucas Di Pentima almost 7 years ago

Side note: I found a typo (Aravdos) on doc/user/cwl/cwl-runner.html.textile.liquid:129, maybe you can add that fix to the other doc updates.

Actions #26

Updated by Peter Amstutz almost 7 years ago

  • Target version changed from 2017-05-24 sprint to 2017-06-07 sprint
Actions #27

Updated by Peter Amstutz almost 7 years ago

Lucas Di Pentima wrote:

  • File sdk/cwl/arvados_cwl/arvcontainer.py:L178 - Should that check be also on the arvados-cwl-runner side?

Done.

  • Do you think it would be useful to add a test checking the new argument --trash-intermediate and its default behavior?

Yes, as it turned out I actually did forget to propagate it to the runner command line. Test added, and code fixed.

  • Instead of logging when output TTL is too short, how about logging a message noticing when creating an output collection with the ttl? (I believe this would be on the api server side, right?).

Added logging CWL about intermediate outputs.

Side note: I found a typo (Aravdos) on doc/user/cwl/cwl-runner.html.textile.liquid:129, maybe you can add that fix to the other doc updates.

Fixed.

Now @ 5223e68fa0d1f1fc52c7285fa6844d7932ba9acb

Actions #28

Updated by Lucas Di Pentima almost 7 years ago

Please check the tests, there's one failing on me, at least when running locally:

======================================================================
FAIL: test_done (tests.test_container.TestContainer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/home/lucas/arvados_local/sdk/cwl/tests/test_container.py", line 415, in test_done
    arvjob.collect_outputs.assert_called_with("keep:abc+123")
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 925, in assert_called_with
    raise AssertionError('Expected call: %s\nNot called' % (expected,))
AssertionError: Expected call: mock('keep:abc+123')
Not called

----------------------------------------------------------------------
Actions #29

Updated by Lucas Di Pentima almost 7 years ago

LGTM, thanks!

Actions #30

Updated by Peter Amstutz almost 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 75 to 100

Applied in changeset arvados|commit:0da65aef99133cb76661f56f2adac83a77ad8ac9.

Actions

Also available in: Atom PDF