Feature #21378 (closed)

CWL workflow to transfer data from Keep to S3

Added by Brett Smith 10 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: CWL
Story points: -

Description

We have a workflow to transfer data from S3 to Keep. Now we want a workflow that goes in the other direction.

The existing workflow just uses aws s3 cp, so the core of this version can share that implementation with the arguments flipped. Other changes needed around that:

  • The existing workflow has a parameter parallel_downloads. Either the new one should have parallel_uploads, or they should use a new common name like parallel_transfers.
  • The new workflow needs to be documented.

Subtasks: 1 (0 open, 1 closed)

Task #21456: Review 21378-cleanup (Resolved, Brett Smith, 03/26/2024)

#2

Updated by Peter Amstutz 10 months ago

  • Target version set to Development 2024-02-14 sprint
#3

Updated by Peter Amstutz 10 months ago

  • Assigned To set to Peter Amstutz
#4

Updated by Peter Amstutz 10 months ago

  • Status changed from New to In Progress
#5

Updated by Peter Amstutz 10 months ago

This has been updated on GitHub: cfa85c1d1a31a7cb990fc2fda260ff970cbc0198

https://github.com/arvados/aws-cli-cwl

I've manually tested this a little bit with cwltool, but unrelated cloud weather is making it hard to test on Arvados. Please take a look at it and decide whether it is ready to share with users.

#6

Updated by Brett Smith 10 months ago

Reviewing from ddb738ce1a0095e7b234d55ab9dcf5ea5494e8ca through cfa85c1d1a31a7cb990fc2fda260ff970cbc0198

I do appreciate that both versions look much richer and more user-friendly with full documentation throughout. Thanks much for that.

README.md:

  • The example inputs for the downloader have not been updated with the new name of parallel_downloads.
  • The second paragraph of the uploader section has been copy-pasted verbatim from the downloader and needs a few s/download/upload/g;.
    • Same goes for the doc field in aws-s3-bulk-upload.cwl.
  • The example s3target value would be nicer if it were more obviously an example value, e.g., s3://zzzzz-4zz18-12345abcde67890/
  • It would be nice to mark up the input.yml filename as monospace just before both example listings.

aws-s3-bulk-upload.cwl:

  • Refers to a tools/aws-s3-scatter-upload.cwl that is not in Git.

tools/aws-s3-scatter-download.cwl:

  • The comment at the top refers to the workflow by its old name.
  • In the dockerFile:
    • Apparently this is "care about FHS" week. You should install the virtualenv in a subdirectory of /opt: “A package to be installed in /opt must locate its static files in a separate /opt/<package> or /opt/<provider> directory tree…”
    • You can remove the apt install of python3-pip and python3-setuptools; they're both no-ops, since all that matters is what's in the virtualenv, and venv will ensure that pip is in there.
    • You could consider installing awscli in the virtualenv if you want more flexibility to get a different version. That's really a question of preference, just flagging it.
    • Similarly you could consider pinning cwltool by saying pip install 'cwltool~=3.0' or whatever. Up to you how you want to balance flexibility against reliability.
    • I'm guessing these comments all apply to tools/aws-s3-scatter-upload.cwl too.

Thanks.

#7

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
#8

Updated by Brett Smith 9 months ago

Not directly related to the story, but: looking at some output from the original workflow, I noticed that aws s3 cp seems to report progress many times per second. This makes it difficult to get a sense of progress when reviewing logs in Arvados. If there's an option to control logging frequency, adding it to the workflow seems like it would be a nice ergonomic improvement. If it's time-based, maybe sync it to crunchstat, every 10sec? If it's progress-based, maybe sync it to Keep blocks, 64MiB?

If this gets spun out into a separate story, sure.

#9

Updated by Peter Amstutz 9 months ago

Brett Smith wrote in #note-8:

Not directly related to the story, but: looking at some output from the original workflow, I noticed that aws s3 cp seems to report progress many times per second. This makes it difficult to get a sense of progress when reviewing logs in Arvados. If there's an option to control logging frequency, adding it to the workflow seems like it would be a nice ergonomic improvement. If it's time-based, maybe sync it to crunchstat, every 10sec? If it's progress-based, maybe sync it to Keep blocks, 64MiB?

If this gets spun out into a separate story, sure.

https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/cp.html

If it is gumming up the logs, then it would probably make sense to add --no-progress

#10

Updated by Brett Smith 9 months ago

Peter Amstutz wrote in #note-9:

If it is gumming up the logs, then it would probably make sense to add --no-progress

I agree that seems like the best available option. In principle the network stats from crunchstat should be able to give you a better sense of how much work it's accomplished.

#11

Updated by Peter Amstutz 9 months ago

https://github.com/arvados/aws-cli-cwl/commit/d7c146281d8aeb1687b7df0e9ed34b39db0babb9

I stumbled on process a little bit; I should have used a branch for this (next time).

  • It scatters batches but no longer uses a sub-scatter on individual files. Instead it writes a shell script on the fly and executes that. This is overall simpler and has the important benefit of downloading to a Keep-backed directory, as was originally intended.
  • Requests 3 GB of RAM so that we get a c5.large instead of a t3
  • No longer installs cwltool because it is no longer needed, so the Dockerfile is simpler
  • Addressed the notes about the README
#12

Updated by Brett Smith 9 months ago

Peter Amstutz wrote in #note-11:

  • It scatters batches but no longer uses a sub-scatter on individual files. Instead it writes a shell script on the fly and executes that. This is overall simpler and has the important benefit of downloading to a Keep-backed directory, as was originally intended.

Overall I like this approach and I appreciate how much it simplifies. However:

  • The --endpoint argument should be quoted for safety.
  • Every time we say url.split('/').pop(), that will not do the right thing if the URL has query parameters or a fragment.

Unfortunately, safely metaprogramming shell is basically impossible. I wonder if it would be nicer to write a Python script that takes an action and a list of sources as arguments. Then it could do rich string parsing and call aws s3 cp without going through the shell at all. If that's not practical in the time available, we should at least try to address these issues above.
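
For concreteness, here is an untested sketch of the kind of script I mean; the name, flags, and layout are hypothetical, not anything that exists in the branch:

  #!/usr/bin/env python3
  """Hypothetical sketch: copy files to or from S3 without going through a shell."""
  import argparse
  import os
  import subprocess
  from urllib.parse import urlsplit

  def s3_basename(url):
      # Parse the URL properly so a query string or fragment never ends up
      # in the local filename (unlike url.split('/').pop()).
      return urlsplit(url).path.rstrip("/").split("/")[-1]

  def main():
      parser = argparse.ArgumentParser()
      parser.add_argument("action", choices=["download", "upload"])
      parser.add_argument("--endpoint", help="alternate S3 endpoint URL")
      parser.add_argument("--target", required=True,
                          help="destination directory (download) or s3:// prefix (upload)")
      parser.add_argument("sources", nargs="+")
      args = parser.parse_args()
      for src in args.sources:
          if args.action == "download":
              dest = os.path.join(args.target, s3_basename(src))
          else:
              dest = args.target.rstrip("/") + "/" + os.path.basename(src)
          cmd = ["aws", "s3", "cp", src, dest]
          if args.endpoint:
              cmd += ["--endpoint-url", args.endpoint]
          # Plain argv list, no shell involved, so there is nothing to quote or escape.
          subprocess.run(cmd, check=True)

  if __name__ == "__main__":
      main()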

  • Requests 3 GB of RAM so that we get a c5.large instead of a t3

This assumes a specific instance configuration on the cluster. I'm also frustrated that we're doing that thing we do of "This is what one user wants right now, so let's assume this is what every user wants for all time, and hardcode it."

Whenever our users writing CWL need dynamic instance selection, we tell them to make the resource an input parameter and evaluate it in the resource requirements. We should follow our own advice.

#13

Updated by Peter Amstutz 9 months ago

Brett Smith wrote in #note-12:

  • The --endpoint argument should be quoted for safety.

Sure. I added a quote function that removes any single quotes or backslashes, and used it on all the input parameters.

  • Every time we say url.split('/').pop(), that will not do the right thing if the URL has query parameters or a fragment.

I am fairly confident that s3:// URLs that this would be used with are unlikely to have query parameters or fragments.

Unfortunately, safely metaprogramming shell is basically impossible. I wonder if it would be nicer to write a Python script that takes an action and a list of sources as arguments. Then it could do rich string parsing and call aws s3 cp without going through the shell at all. If that's not practical in the time available, we should at least try to address these issues above.

The next step would probably be to stop using the aws cli and just write a Python script that uses the AWS Python SDK, and/or implement an "s3_to_keep" function in our SDK. But I'd like to close out this ticket; I believe it should be enough to unblock the customer.
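
Just to sketch what that could eventually look like (not part of this change; it assumes boto3 and a destination directory that would be Keep-backed when run under Arvados):

  import os
  import boto3

  def s3_prefix_to_dir(bucket, prefix, destdir, endpoint_url=None):
      # Rough sketch only: download everything under an S3 prefix into a local
      # directory using the AWS Python SDK instead of shelling out to the CLI.
      s3 = boto3.client("s3", endpoint_url=endpoint_url)
      paginator = s3.get_paginator("list_objects_v2")
      for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
          for obj in page.get("Contents", []):
              relpath = obj["Key"][len(prefix):].lstrip("/")
              target = os.path.join(destdir, relpath or os.path.basename(obj["Key"]))
              os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
              s3.download_file(bucket, obj["Key"], target)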

Whenever our users writing CWL need dynamic instance selection, we tell them to make the resource an input parameter and evaluate it in the resource requirements. We should follow our own advice.

I added a ramMin parameter.

https://github.com/arvados/aws-cli-cwl/commit/05caa76f428e05025c7798c3146571de702d4eb3

#14

Updated by Brett Smith 9 months ago

Peter Amstutz wrote in #note-13:

https://github.com/arvados/aws-cli-cwl/commit/05caa76f428e05025c7798c3146571de702d4eb3

This all looks fine to me, but testing the download workflow, I get this error:

  2024-02-16T22:47:18.031158796Z stderr stderr was: 'evalmachine.<anonymous>:22
  2024-02-16T22:47:18.031158796Z stderr var quote = function(s) { return "'"+s.replaceAll("'", "").replaceAll("\\", "")+"'"; }
  2024-02-16T22:47:18.031158796Z stderr                                        ^
  2024-02-16T22:47:18.031158796Z stderr
  2024-02-16T22:47:18.031158796Z stderr TypeError: s.replaceAll is not a function
  2024-02-16T22:47:18.031158796Z stderr     at quote (evalmachine.<anonymous>:22:40)
  2024-02-16T22:47:18.031158796Z stderr     at evalmachine.<anonymous>:28:50
  2024-02-16T22:47:18.031158796Z stderr     at Array.map (<anonymous>)
  2024-02-16T22:47:18.031158796Z stderr     at evalmachine.<anonymous>:27:30
  2024-02-16T22:47:18.031158796Z stderr     at evalmachine.<anonymous>:33:3
  2024-02-16T22:47:18.031158796Z stderr     at Script.runInContext (vm.js:130:18)
  2024-02-16T22:47:18.031158796Z stderr     at Script.runInNewContext (vm.js:135:17)
  2024-02-16T22:47:18.031158796Z stderr     at Object.runInNewContext (vm.js:302:38)
  2024-02-16T22:47:18.031158796Z stderr     at Socket.<anonymous> ([eval]:11:57)
  2024-02-16T22:47:18.031158796Z stderr     at Socket.emit (events.js:314:20)'
  2024-02-16T22:47:18.031158796Z stderr
  2024-02-16T22:47:18.052668966Z stderr WARNING [step download-batch] completed permanentFail
  2024-02-16T22:47:18.115670924Z stderr INFO [workflow aws-s3-bulk-download.cwl (05caa76)] completed permanentFail
  2024-02-16T22:47:18.115687692Z stderr WARNING [step aws-s3-bulk-download.cwl (05caa76)] completed permanentFail
  2024-02-16T22:47:18.184831722Z stderr INFO [workflow workflow.json#main (05caa76)] completed permanentFail
  2024-02-16T22:47:18.184862556Z stderr ERROR Overall process status is permanentFail
  2024-02-16T22:47:18.315531485Z stderr INFO Final output collection 5e8f759f6a06c774c11c86f8c83c9910+59 "Output from workflow aws-s3-bulk-download.cwl (05caa76) (0h8mwjthta23k9v)" (pirca-4zz18-0h8mwjthta23k9v)
  2024-02-16T22:47:18.562804568Z stderr WARNING Final process status is permanentFail
 (X-Request-Id: req-0fzxnddzpa9i5xia6nk4)
2024-02-16 17:47:21 arvados.cwl-runner[1708320] ERROR: Overall process status is permanentFail

I get this whether I use --local or --submit. I see that String.replaceAll is a function in my browser debugger, so 🤷

#15

Updated by Peter Amstutz 9 months ago

Brett Smith wrote in #note-14:

Peter Amstutz wrote in #note-13:

https://github.com/arvados/aws-cli-cwl/commit/05caa76f428e05025c7798c3146571de702d4eb3

This all looks fine to me, but testing the download workflow, I get this error:

[...]

I get this whether I use --local or --submit. I see that String.replaceAll is a function in my browser debugger, so 🤷

I'm glad you caught that. It turns out replaceAll is a relatively new addition to JavaScript, but there's another relatively succinct way to do it:

https://stackoverflow.com/questions/13340131/string-prototype-replaceall-not-working

Updated:

https://github.com/arvados/aws-cli-cwl/commit/2ba73528c63e111c12063181fa530daf205243c5

#16

Updated by Brett Smith 9 months ago

Peter Amstutz wrote in #note-15:

https://github.com/arvados/aws-cli-cwl/commit/2ba73528c63e111c12063181fa530daf205243c5

I successfully tested this for both upload and download. I feel good delivering this version to the user, thank you.

I'm fine with all the code functionality-wise. For future development and users I think some things could be touched up:

  • Both workflows give this warning:
    WARNING Workflow checker warning:
    aws-s3-bulk-upload.cwl:54:5: Source 'ramMin' of type ["null", "int"] may be incompatible
    aws-s3-bulk-upload.cwl:72:7:   with sink 'ramMin' of type "int"
    If we could resolve this, that would be nice.
  • The name of the quote function is misleading because it doesn't actually do that; it completely removes specific characters. Something like stripMeta would be better.
  • In the README it would be nice if one of the example upload input files showed the syntax for uploading a file that's already in Keep. I realize that's sort of a general "how to use CWL" thing but it's a nice affordance and confirms it supports both.

Thanks.

#17

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
#18

Updated by Peter Amstutz 9 months ago

  • Target version set to Development 2024-03-13 sprint
  • Tracker changed from Idea to Feature
#19

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
#20

Updated by Peter Amstutz 8 months ago

Brett Smith wrote in #note-16:

Peter Amstutz wrote in #note-15:

https://github.com/arvados/aws-cli-cwl/commit/2ba73528c63e111c12063181fa530daf205243c5

I successfully tested this for both upload and download. I feel good delivering this version to the user, thank you.

I'm fine with all the code functionality-wise. For future development and users I think some things could be touched up:

  • Both workflows give this warning: [...] If we could resolve this, that would be nice.
  • The name of the quote function is misleading because it doesn't actually do that; it completely removes specific characters. Something like stripMeta would be better.
  • In the README it would be nice if one of the example upload input files showed the syntax for uploading a file that's already in Keep. I realize that's sort of a general "how to use CWL" thing but it's a nice affordance and confirms it supports both.

Thanks.

I actually made a branch this time:

21378-cleanup @ commit:0807c5b8932f6ef84f756ef590cf7fed74b4e980

https://github.com/arvados/aws-cli-cwl/commit/0807c5b8932f6ef84f756ef590cf7fed74b4e980

  • Add example of keep: URL
  • make ramMin optional to suppress a spurious checker warning
  • rename quote() function to sanitize() (although it still also adds quotes)
  • don't generate empty batches when len(urls) < count
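
For the last point, the intent is roughly the following (sketched in Python just for readability; this is not the actual workflow code):

  def make_batches(urls, count):
      # Split urls into at most `count` batches without ever producing an empty one.
      if not urls:
          return []
      count = min(count, len(urls))      # fewer URLs than requested batches
      size = -(-len(urls) // count)      # ceiling division
      return [urls[i:i + size] for i in range(0, len(urls), size)]

  # e.g. make_batches(["a", "b"], 5) -> [["a"], ["b"]], not five batches with three empty
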
#21

Updated by Peter Amstutz 8 months ago

#22

Updated by Peter Amstutz 8 months ago

  • Status changed from In Progress to Resolved