Feature #11100

[CWL] Intermediary collection handling can be specified

Added by Tom Morris 10 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
03/30/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0

Description

Background

Workflows produce a lot of intermediate collections. For production workflows that are rarely re-run, the job reuse benefits are minimal, instead this is just clutter and takes up storage space that the user would rather not pay for. This is also necessary to support a roll-in/roll-out use case where a cluster only has sufficient storage to store a few complete runs and input and output data are transferred from/to somewhere else.

Requirements

Should be able to specify default behavior (retain or trash) but override behavior for output of specific steps.

The final output is always retained. Input should be unaffected.

Intermediate collections need to live as long as they are in use by downstream steps. When intermediate collections are no longer needed by downstream steps, they should be trashed.

Design

arvados-cwl-runner submits container requests; when the container completes a collection is created and reported in output_uuid. Arvados-cwl-runner can then set the trash_at field on the collection.

  • API server
    • Add a "output_ttl" field to container request. This value is in seconds. When the output collection is created for the container request, it should have trash_at and delete_at set now + output_ttl (assume that tokens are issued with expiry times less than trash_at). A value of <= 0 means don't set trash_at.
    • Add tests.
    • Update documentation
  • CWL runner
    • When "intermediate output TTL" is provided, container requests are submitted with output_ttl set
    • Default behavior is output_ttl is None or 0 (to be consistent with current behavior.)
    • When workflow completes successfully, everything marked as intermediate should be trashed immediately. Do not do this on workflow failure.
    • Provide command line option to indicate that things shouldn't be delete immediately
    • Custom Arvados CWL hint to specify treatment of individual step outputs
    • Update documentation

Subtasks

Task #11370: Update cwl runner after API feature is mergedResolvedPeter Amstutz

Task #11389: Review 11100-cwl-set-output-ttlResolvedLucas Di Pentima

Task #11388: Review 11100-cr-output-ttlResolvedPeter Amstutz


Related issues

Related to Arvados - Story #9277: [Crunch2] System-owned container outputs should be garbage-collectedResolved2017-02-16

Related to Arvados - Story #9589: [Workbench] Update collection interface for collections with non-nil trash_atNew2016-07-13

Associated revisions

Revision ff8d14ac
Added by Tom Clegg 9 months ago

Merge branch '7709-api-rails4' (partial)

refs #7709
refs #11100

Revision 77f5a84c
Added by Tom Clegg 8 months ago

Merge branch '11100-cr-output-ttl'

refs #11100

Revision 0da65aef
Added by Peter Amstutz 7 months ago

Merge branch '11100-cwl-set-output-ttl' closes #11100

Revision 7c03ff7b (diff)
Added by Peter Amstutz 7 months ago

Fix crunch script to set trash_intermediate and intermediate_output_ttl refs #11100

History

#1 Updated by Peter Amstutz 10 months ago

  • Description updated (diff)

#2 Updated by Tom Morris 10 months ago

  • Assigned To set to Peter Amstutz

#3 Updated by Peter Amstutz 9 months ago

  • Description updated (diff)

#4 Updated by Peter Amstutz 9 months ago

  • Description updated (diff)

#5 Updated by Peter Amstutz 9 months ago

  • Description updated (diff)
  • Add a "output_ttl" on container request which means output will have trash_at and delete_at set now + output_ttl (assume that tokens are issued with expiry times less than trash_at)
  • Intermediate outputs can be used/reused before they are trashed
  • CWL runner intermediate steps are submitted with output_ttl
  • CWL options to control default treatment of step outputs

#6 Updated by Peter Amstutz 9 months ago

  • Description updated (diff)

#7 Updated by Tom Morris 9 months ago

  • Assigned To deleted (Peter Amstutz)
  • Target version changed from Arvados Future Sprints to 2017-04-12 sprint
  • Story points set to 3.0

#8 Updated by Peter Amstutz 9 months ago

  • Description updated (diff)

#9 Updated by Peter Amstutz 9 months ago

  • Project changed from Arvados Private to Arvados

#10 Updated by Tom Clegg 9 months ago

  • Assigned To set to Tom Clegg

#11 Updated by Tom Clegg 9 months ago

Currently it's an error to set delete_at to a time earlier than blob_signature_ttl. Some time passes between when the container request chooses a delete_at and when the collection validation checks whether delete_at is valid. This means the container request code has to add a few seconds of grace period to avoid failing during slow times. Perhaps it would be better to change the rule so, if a client sets delete_at to a value too soon to achieve safely, it automatically gets extended to the earliest possible time?

This behavior doesn't seem to be documented in either of the likely places

#12 Updated by Tom Clegg 9 months ago

#13 Updated by Peter Amstutz 9 months ago

Tom Clegg wrote:

Currently it's an error to set delete_at to a time earlier than blob_signature_ttl. Some time passes between when the container request chooses a delete_at and when the collection validation checks whether delete_at is valid. This means the container request code has to add a few seconds of grace period to avoid failing during slow times. Perhaps it would be better to change the rule so, if a client sets delete_at to a value too soon to achieve safely, it automatically gets extended to the earliest possible time?

So instead of a validation error, it would just adjust it to a valid value. Yes.

#14 Updated by Tom Clegg 9 months ago

  • Status changed from New to In Progress

#15 Updated by Tom Clegg 9 months ago

With "earliest possible delete_at":

11100-cr-output-ttl @ 65121f8db54a1ed15207d050e1f48c5fc26d646b

#16 Updated by Peter Amstutz 9 months ago

Documentation of the output_ttl should specify units.

Should the code to adjust delete_at to the earliest valid time be performed before_validation so that the actual validation doesn't modify the record? (Not sure if rails is opinionated about validations being pure). (default_trash_interval does something similar and is in before_validation).

Rest LGTM

#17 Updated by Tom Clegg 9 months ago

Peter Amstutz wrote:

Documentation of the output_ttl should specify units.

Indeed. Got it in Containers API but missed it in docs. Fixed.

Should the code to adjust delete_at to the earliest valid time be performed before_validation so that the actual validation doesn't modify the record? (Not sure if rails is opinionated about validations being pure). (default_trash_interval does something similar and is in before_validation).

Not sure about Rails either, but yes, that seems neater. Fixed.

Also added a bit to acknowledge that no tokens will expire before the previous (hence previously validated) value of delete_at.

11100-cr-output-ttl @ ff3bb22d4b2bff5666907a6eeb6cd68cd3cbe22b

#18 Updated by Tom Clegg 8 months ago

  • Assigned To changed from Tom Clegg to Peter Amstutz

#19 Updated by Peter Amstutz 8 months ago

  • Target version changed from 2017-04-12 sprint to 2017-04-26 sprint

#20 Updated by Peter Amstutz 8 months ago

  • Target version changed from 2017-04-26 sprint to 2017-05-10 sprint

#21 Updated by Peter Amstutz 7 months ago

  • Target version changed from 2017-05-10 sprint to 2017-05-24 sprint

#22 Updated by Lucas Di Pentima 7 months ago

Reviewed 8645888f1 - branch 11100-cwl-set-output-ttl

Some minor comments:

  • File doc/user/cwl/cwl-extensions.html.textile.liquid:L86 - duplicated ‘before’. Also s/recommend/recommended/ ?
  • File sdk/cwl/arvados_cwl/arv-cwl-schema.yml:L145 - Same as previous: s/recommend/recommended/
  • If there’s no way to ensure that an intermediate output collection doesn’t get trashed while a step depending on it is running, maybe we should add a log warning when every intermediate output is created with a short TTL, for future debugging purposes. Didn’t check if this is done on the API serve side.

Apart from that, LGTM. Local tests ran ok.

#23 Updated by Peter Amstutz 7 months ago

I decided to add --trash-intermediate/--no-trash-intermediate as distinct options from --intermediate-output-ttl.

Updated documentation.

Not really sure about logging "TTL too short". We could pick some arbitrary number (60 seconds? 3600 seconds?) but it is highly workflow dependent.

Now at bee4fa82c15b2188c2105fff3b52c305b38f04a6

#24 Updated by Lucas Di Pentima 7 months ago

  • File sdk/cwl/arvados_cwl/arvcontainer.py:L178 - Should that check be also on the arvados-cwl-runner side?
  • Do you think it would be useful to add a test checking the new argument --trash-intermediate and its default behavior?
  • Instead of logging when output TTL is too short, how about logging a message noticing when creating an output collection with the ttl? (I believe this would be on the api server side, right?).

#25 Updated by Lucas Di Pentima 7 months ago

Side note: I found a typo (Aravdos) on doc/user/cwl/cwl-runner.html.textile.liquid:129, maybe you can add that fix to the other doc updates.

#26 Updated by Peter Amstutz 7 months ago

  • Target version changed from 2017-05-24 sprint to 2017-06-07 sprint

#27 Updated by Peter Amstutz 7 months ago

Lucas Di Pentima wrote:

  • File sdk/cwl/arvados_cwl/arvcontainer.py:L178 - Should that check be also on the arvados-cwl-runner side?

Done.

  • Do you think it would be useful to add a test checking the new argument --trash-intermediate and its default behavior?

Yes, as it turned out I actually did forget to propagate it to the runner command line. Test added, and code fixed.

  • Instead of logging when output TTL is too short, how about logging a message noticing when creating an output collection with the ttl? (I believe this would be on the api server side, right?).

Added logging CWL about intermediate outputs.

Side note: I found a typo (Aravdos) on doc/user/cwl/cwl-runner.html.textile.liquid:129, maybe you can add that fix to the other doc updates.

Fixed.

Now @ 5223e68fa0d1f1fc52c7285fa6844d7932ba9acb

#28 Updated by Lucas Di Pentima 7 months ago

Please check the tests, there's one failing on me, at least when running locally:

======================================================================
FAIL: test_done (tests.test_container.TestContainer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/home/lucas/arvados_local/sdk/cwl/tests/test_container.py", line 415, in test_done
    arvjob.collect_outputs.assert_called_with("keep:abc+123")
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 925, in assert_called_with
    raise AssertionError('Expected call: %s\nNot called' % (expected,))
AssertionError: Expected call: mock('keep:abc+123')
Not called

----------------------------------------------------------------------

#29 Updated by Lucas Di Pentima 7 months ago

LGTM, thanks!

#30 Updated by Peter Amstutz 7 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 75 to 100

Applied in changeset arvados|commit:0da65aef99133cb76661f56f2adac83a77ad8ac9.

Also available in: Atom PDF