Feature #11100
closed
[CWL] Intermediary collection handling can be specified
Added by Tom Morris almost 8 years ago.
Updated over 7 years ago.
Description
Background¶
Workflows produce a lot of intermediate collections. For production workflows that are rarely re-run, the job reuse benefits are minimal, instead this is just clutter and takes up storage space that the user would rather not pay for. This is also necessary to support a roll-in/roll-out use case where a cluster only has sufficient storage to store a few complete runs and input and output data are transferred from/to somewhere else.
Requirements¶
Should be able to specify default behavior (retain or trash) but override behavior for output of specific steps.
The final output is always retained. Input should be unaffected.
Intermediate collections need to live as long as they are in use by downstream steps. When intermediate collections are no longer needed by downstream steps, they should be trashed.
Design¶
arvados-cwl-runner submits container requests; when the container completes a collection is created and reported in output_uuid. Arvados-cwl-runner can then set the trash_at field on the collection.
- API server
- Add a "output_ttl" field to container request. This value is in seconds. When the output collection is created for the container request, it should have trash_at and delete_at set now + output_ttl (assume that tokens are issued with expiry times less than trash_at). A value of <= 0 means don't set trash_at.
- Add tests.
- Update documentation
- CWL runner
- When "intermediate output TTL" is provided, container requests are submitted with output_ttl set
- Default behavior is output_ttl is None or 0 (to be consistent with current behavior.)
- When workflow completes successfully, everything marked as intermediate should be trashed immediately. Do not do this on workflow failure.
- Provide command line option to indicate that things shouldn't be delete immediately
- Custom Arvados CWL hint to specify treatment of individual step outputs
- Update documentation
- Description updated (diff)
- Assigned To set to Peter Amstutz
- Description updated (diff)
- Description updated (diff)
- Description updated (diff)
- Add a "output_ttl" on container request which means output will have trash_at and delete_at set now + output_ttl (assume that tokens are issued with expiry times less than trash_at)
- Intermediate outputs can be used/reused before they are trashed
- CWL runner intermediate steps are submitted with output_ttl
- CWL options to control default treatment of step outputs
- Description updated (diff)
- Assigned To deleted (
Peter Amstutz)
- Target version changed from Arvados Future Sprints to 2017-04-12 sprint
- Story points set to 3.0
- Description updated (diff)
- Project changed from 35 to Arvados
- Assigned To set to Tom Clegg
Currently it's an error to set delete_at to a time earlier than blob_signature_ttl. Some time passes between when the container request chooses a delete_at and when the collection validation checks whether delete_at is valid. This means the container request code has to add a few seconds of grace period to avoid failing during slow times. Perhaps it would be better to change the rule so, if a client sets delete_at to a value too soon to achieve safely, it automatically gets extended to the earliest possible time?
This behavior doesn't seem to be documented in either of the likely places
Tom Clegg wrote:
Currently it's an error to set delete_at to a time earlier than blob_signature_ttl. Some time passes between when the container request chooses a delete_at and when the collection validation checks whether delete_at is valid. This means the container request code has to add a few seconds of grace period to avoid failing during slow times. Perhaps it would be better to change the rule so, if a client sets delete_at to a value too soon to achieve safely, it automatically gets extended to the earliest possible time?
So instead of a validation error, it would just adjust it to a valid value. Yes.
- Status changed from New to In Progress
Documentation of the output_ttl
should specify units.
Should the code to adjust delete_at to the earliest valid time be performed before_validation
so that the actual validation
doesn't modify the record? (Not sure if rails is opinionated about validations being pure). (default_trash_interval does something similar and is in before_validation).
Rest LGTM
Peter Amstutz wrote:
Documentation of the output_ttl
should specify units.
Indeed. Got it in Containers API but missed it in docs. Fixed.
Should the code to adjust delete_at to the earliest valid time be performed before_validation
so that the actual validation
doesn't modify the record? (Not sure if rails is opinionated about validations being pure). (default_trash_interval does something similar and is in before_validation).
Not sure about Rails either, but yes, that seems neater. Fixed.
Also added a bit to acknowledge that no tokens will expire before the previous (hence previously validated) value of delete_at.
11100-cr-output-ttl @ ff3bb22d4b2bff5666907a6eeb6cd68cd3cbe22b
- Assigned To changed from Tom Clegg to Peter Amstutz
- Target version changed from 2017-04-12 sprint to 2017-04-26 sprint
- Target version changed from 2017-04-26 sprint to 2017-05-10 sprint
- Target version changed from 2017-05-10 sprint to 2017-05-24 sprint
Reviewed 8645888f1 - branch 11100-cwl-set-output-ttl
Some minor comments:
- File
doc/user/cwl/cwl-extensions.html.textile.liquid:L86
- duplicated ‘before’. Also s/recommend/recommended/ ?
- File
sdk/cwl/arvados_cwl/arv-cwl-schema.yml:L145
- Same as previous: s/recommend/recommended/
- If there’s no way to ensure that an intermediate output collection doesn’t get trashed while a step depending on it is running, maybe we should add a log warning when every intermediate output is created with a short TTL, for future debugging purposes. Didn’t check if this is done on the API serve side.
Apart from that, LGTM. Local tests ran ok.
I decided to add --trash-intermediate/--no-trash-intermediate as distinct options from --intermediate-output-ttl.
Updated documentation.
Not really sure about logging "TTL too short". We could pick some arbitrary number (60 seconds? 3600 seconds?) but it is highly workflow dependent.
Now at bee4fa82c15b2188c2105fff3b52c305b38f04a6
- File
sdk/cwl/arvados_cwl/arvcontainer.py:L178
- Should that check be also on the arvados-cwl-runner
side?
- Do you think it would be useful to add a test checking the new argument
--trash-intermediate
and its default behavior?
- Instead of logging when output TTL is too short, how about logging a message noticing when creating an output collection with the ttl? (I believe this would be on the api server side, right?).
Side note: I found a typo (Aravdos) on doc/user/cwl/cwl-runner.html.textile.liquid:129
, maybe you can add that fix to the other doc updates.
- Target version changed from 2017-05-24 sprint to 2017-06-07 sprint
Lucas Di Pentima wrote:
- File
sdk/cwl/arvados_cwl/arvcontainer.py:L178
- Should that check be also on the arvados-cwl-runner
side?
Done.
- Do you think it would be useful to add a test checking the new argument
--trash-intermediate
and its default behavior?
Yes, as it turned out I actually did forget to propagate it to the runner command line. Test added, and code fixed.
- Instead of logging when output TTL is too short, how about logging a message noticing when creating an output collection with the ttl? (I believe this would be on the api server side, right?).
Added logging CWL about intermediate outputs.
Side note: I found a typo (Aravdos) on doc/user/cwl/cwl-runner.html.textile.liquid:129, maybe you can add that fix to the other doc updates.
Fixed.
Now @ 5223e68fa0d1f1fc52c7285fa6844d7932ba9acb
Please check the tests, there's one failing on me, at least when running locally:
======================================================================
FAIL: test_done (tests.test_container.TestContainer)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
return func(*args, **keywargs)
File "/home/lucas/arvados_local/sdk/cwl/tests/test_container.py", line 415, in test_done
arvjob.collect_outputs.assert_called_with("keep:abc+123")
File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 925, in assert_called_with
raise AssertionError('Expected call: %s\nNot called' % (expected,))
AssertionError: Expected call: mock('keep:abc+123')
Not called
----------------------------------------------------------------------
- Status changed from In Progress to Resolved
- % Done changed from 75 to 100
Applied in changeset arvados|commit:0da65aef99133cb76661f56f2adac83a77ad8ac9.
Also available in: Atom
PDF