Bug #19699

closed

HTTP download creates collections with too-long names, needs flag to run in runner process after submission

Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: CWL
Target version:
Story points: -
Release relationship: Auto

Description

Customer issues that came up:

  • Customer needs to run on a lot of samples which means launching a whole bunch of a-c-r processes. The feature that automatically transfers from HTTP to Keep happens before submitting the workflow. This will scale much better if the data transfer happens on the compute node when the workflow actually launches.
  • The collection is named "Downloaded from http://..." and if the URL is too long, it will exceed the 255 character limit on collection names. a-c-r needs to account for the limit (probably also including the timestamp that gets added by ensure_unique_name) and trim the name to a valid length so it won't get rejected.
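
The name-length fix described above could look roughly like the following. This is a minimal sketch, not the code that was merged: the function name `trimmed_collection_name`, the `SUFFIX_RESERVE` value, and the suffix format shown in the comment are all assumptions made for illustration. Only the 255 character limit and the "Downloaded from" prefix come from this issue.

```python
MAX_NAME_LEN = 255  # collection name limit stated in this issue

# Assumed headroom for the suffix that ensure_unique_name appends,
# e.g. something like " (2022-11-14T12:34:56.789Z)". The exact format
# and length are assumptions, so the reserve is deliberately generous.
SUFFIX_RESERVE = 32


def trimmed_collection_name(url, prefix="Downloaded from "):
    """Build a collection name that stays within the limit (hypothetical helper)."""
    budget = MAX_NAME_LEN - SUFFIX_RESERVE
    name = prefix + url
    if len(name) <= budget:
        return name
    # Cut the name and mark the truncation so it is visible to the user.
    return name[:budget - 3] + "..."


print(len(trimmed_collection_name("http://example.com/" + "x" * 500)))
```

Reserving space up front means the name is still valid after a uniqueness suffix is appended, at the cost of a slightly shorter visible URL.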

Subtasks 2 (0 open, 2 closed)

Task #19703: Review 19699-cwl-http-dl (Resolved, Peter Amstutz, 11/14/2022)
Task #19812: Review 19699-cwl-dl-docs (Resolved, Lucas Di Pentima, 11/14/2022)

Related issues

Related to Arvados - Bug #19688: Launch registered workflows faster (Resolved, Peter Amstutz, 11/14/2022)
Actions #2

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Peter Amstutz
Actions #4

Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress
Actions #6

Updated by Peter Amstutz over 1 year ago

Updated test package, adds --defer-download and --varying-url-params

Example command line:

arvados-cwl-runner --defer-download --varying-url-params=AWSAccessKeyId,Signature,Expires workflow.cwl params.yml

  • --defer-download will perform the download after the workflow is submitted (when the runner process on the compute node actually starts)
  • --varying-url-params tells it to ignore these URL query parameters from any HTTP URLs when checking to see if a URL has already been downloaded to Keep.
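
To illustrate the idea behind --varying-url-params, here is a minimal sketch of how a URL might be normalized before looking it up among previous downloads. The function name `cache_key_url` and the example URLs are hypothetical; this is not taken from the a-c-r source, only from the behavior described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_key_url(url, varying_params):
    """Return the URL with the listed query parameters removed (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in varying_params
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


varying = {"AWSAccessKeyId", "Signature", "Expires"}
first = "https://bucket.example.com/s.fastq?AWSAccessKeyId=A&Signature=x&Expires=1"
second = "https://bucket.example.com/s.fastq?AWSAccessKeyId=B&Signature=y&Expires=2"
print(cache_key_url(first, varying) == cache_key_url(second, varying))
```

Two signed URLs for the same object differ only in their credentials and expiry, so dropping those parameters lets both map to the same cached copy.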
Actions #8

Updated by Peter Amstutz over 1 year ago

Updated test package, add another option --prefer-cached-downloads

Example command line:

arvados-cwl-runner --defer-download --varying-url-params=AWSAccessKeyId,Signature,Expires --prefer-cached-downloads workflow.cwl params.yml

  • --defer-download will perform the download after the workflow is submitted (in the runner process on the compute node)
  • --varying-url-params tells it to ignore the listed URL query parameters from any HTTP URLs when checking to see if a URL has already been downloaded to Keep.
  • --prefer-cached-downloads says that if the URL is found in Keep, use it without any further checking. This means changes in the upstream resource won't be detected, but it also means it will not error out if the upstream resource becomes inaccessible.
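
The trade-off of --prefer-cached-downloads can be sketched as a small decision function. This is an illustration of the behavior described in the bullet above, under assumptions: `resolve_download`, its arguments, and the returned tuples are hypothetical names, and the real upstream check in a-c-r is represented only by a callable that may raise when the server is unreachable.

```python
def resolve_download(cached_copy, prefer_cached, is_upstream_unchanged):
    """Decide whether to reuse a copy already in Keep (hypothetical helper).

    cached_copy: identifier of a previous download, or None if there is none.
    prefer_cached: True when --prefer-cached-downloads was given.
    is_upstream_unchanged: callable that contacts the server; it may raise
        if the upstream resource is inaccessible.
    """
    if cached_copy is not None and prefer_cached:
        # No upstream contact at all: upstream changes go unnoticed,
        # but an unreachable server cannot cause a failure either.
        return ("use_cached", cached_copy)
    if cached_copy is not None and is_upstream_unchanged():
        return ("use_cached", cached_copy)
    return ("download", None)


def unreachable():
    raise RuntimeError("upstream unreachable")


print(resolve_download("cached-collection", True, unreachable))
```

With the flag set, the unreachable server is never contacted, so the cached copy is returned instead of raising.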
Actions #9

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Actions #10

Updated by Peter Amstutz over 1 year ago

  • Related to Bug #19688: Launch registered workflows faster added
Actions #11

Updated by Peter Amstutz over 1 year ago

Updated package, no changes to downloading behavior, includes bug fix on #19688

Actions #12

Updated by Peter Amstutz over 1 year ago

19699-cwl-http-dl @ 420b3b25875fd56814d1ff9027b9283ff4446571

  • See comment 8 for list of new options
  • This branch is based on #19688, so you should review that one first

developer-run-tests: #3364

Actions #13

Updated by Peter Amstutz over 1 year ago

19699-cwl-http-dl @ c31c6528cac695bc86d4244516e07ea316cac979

Rebased to get test fixes

developer-run-tests: #3367

Actions #14

Updated by Lucas Di Pentima over 1 year ago

Just a couple comments:

  • The a-c-r options page (user/cwl/cwl-run-options.html) needs to be updated with these new flags. I think this feature may also deserve a proper doc section, but maybe should not block this story.
  • IIRC, compute nodes don't have internet access by default. If this is the case, do you think it would be worth mentioning this potential issue when documenting --defer-download?
Actions #15

Updated by Peter Amstutz over 1 year ago

Lucas Di Pentima wrote in #note-14:

Just a couple comments:

  • The a-c-r options page (user/cwl/cwl-run-options.html) needs to be updated with these new flags. I think this feature may also deserve a proper doc section, but maybe should not block this story.

You're right, I forgot about docs. Let's keep the issue open and I'll follow up.

  • IIRC, compute nodes don't have internet access by default. If this is the case, do you think it would be worth mentioning this potential issue when documenting --defer-download?

arvados-cwl-runner always has network access to the API enabled. Compute nodes can be firewalled off from the general Internet but that's something you need to configure at the gateway level which isn't part of our standard configuration.

Actions #16

Updated by Lucas Di Pentima over 1 year ago

19699-cwl-http-dl LGTM

Actions #17

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-11-23 sprint to 2022-12-07 Sprint
Actions #18

Updated by Peter Amstutz over 1 year ago

19699-cwl-dl-docs @ 3cdc1e47bf435c364644ce8ef792cb42e95ac183

  • Update table of options
  • Add section about downloading from HTTP
Actions #19

Updated by Lucas Di Pentima over 1 year ago

  • At cwl-style.html.textile file:
    • Line 175: The first sentence is the same as the section's title.
    • Line 211: $(runtime.outdir) formatting is missing.
    • Lines 255, 256, 257: Ignoring formatting of flags with ==, I think it would look nicer if they're formatted in monospaced font like other variables, commands, etc.
    • Line 263: I think the example command would be better formatted inside a codeblock.
  • Even though the feature is fully described on the guide, I think we could clarify a bit more about its utility, for example: time savings, reduced traffic costs, enhanced automation, wdyt?

The rest LGTM.

Actions #20

Updated by Peter Amstutz over 1 year ago

Lucas Di Pentima wrote in #note-19:

  • At cwl-style.html.textile file:
    • Line 175: The first sentence is the same as the section's title.
    • Line 211: $(runtime.outdir) formatting is missing.
    • Lines 255, 256, 257: Ignoring formatting of flags with ==, I think it would look nicer if they're formatted in monospaced font like other variables, commands, etc.
    • Line 263: I think the example command would be better formatted inside a codeblock.
  • Even though the feature is fully described on the guide, I think we could clarify a bit more about its utility, for example: time savings, reduced traffic costs, enhanced automation, wdyt?

The rest LGTM.

Addressed above comments

19699-cwl-dl-docs @ da952d583d65e9c6c7ff24ae40c4e0d0a21efd22

Actions #21

Updated by Lucas Di Pentima over 1 year ago

This LGTM, thanks!

Actions #22

Updated by Peter Amstutz over 1 year ago

  • Status changed from In Progress to Resolved
Actions #23

Updated by Peter Amstutz over 1 year ago

  • Release set to 47
Actions #24

Updated by Brett Smith over 1 year ago

  • Release changed from 47 to 54