Feature #21378
closed
CWL workflow to transfer data from Keep to S3
Added by Brett Smith 11 months ago. Updated 8 months ago.
Description
We have a workflow to transfer data from S3 to Keep. Now we want a workflow that goes in the other direction.
The existing workflow just uses `aws s3 cp`, so the core of this version can share the implementation, just with arguments flipped. Other changes around that:
- The existing workflow has a parameter `parallel_downloads`. Either the new one should have `parallel_uploads`, or they should use a new common name like `parallel_transfers`.
- The new workflow needs to be documented.
Updated by Peter Amstutz 10 months ago
- Target version set to Development 2024-02-14 sprint
Updated by Peter Amstutz 10 months ago
This has been updated on GitHub: cfa85c1d1a31a7cb990fc2fda260ff970cbc0198
https://github.com/arvados/aws-cli-cwl
I've manually tested this a little bit with `cwltool`, but unrelated cloud weather is making it hard to test on Arvados. Please take a look at it and decide if it is ready to share with users.
Updated by Brett Smith 10 months ago
Reviewing from ddb738ce1a0095e7b234d55ab9dcf5ea5494e8ca through cfa85c1d1a31a7cb990fc2fda260ff970cbc0198
I do appreciate that both versions look much richer and more user-friendly with full documentation throughout. Thanks much for that.
`README.md`:
- The example inputs for the downloader have not been updated with the new name of `parallel_downloads`.
- The second paragraph of the uploader section has been copy-pasted verbatim from the downloader and needs a few `s/download/upload/g;`.
- Same goes for the `doc` field in `aws-s3-bulk-upload.cwl`.
- The example `s3target` value would be nicer if it were more obviously an example value, e.g., `s3://zzzzz-4zz18-12345abcde67890/`.
- It would be nice to mark up the `input.yml` filename as monospace just before both example listings.
`aws-s3-bulk-upload.cwl`:
- Refers to a `tools/aws-s3-scatter-upload.cwl` that is not in Git.

`tools/aws-s3-scatter-download.cwl`:
- The comment at the top refers to the workflow by its old name.
- In the `dockerFile`:
  - Apparently this is "care about FHS" week. You should install the virtualenv in a subdirectory of `/opt`: "A package to be installed in `/opt` must locate its static files in a separate `/opt/<package>` or `/opt/<provider>` directory tree…"
  - You can remove the apt install of `python3-pip` and `python3-setuptools`; they're both no-ops, since all that matters is what's in the virtualenv, and `venv` will ensure that `pip` is in there.
  - You could consider installing awscli in the virtualenv if you want more flexibility to get a different version. That's really a question of preference, just flagging it.
  - Similarly, you could consider pinning cwltool by saying `pip install 'cwltool~=3.0'` or whatever. Up to you how you want to balance flexibility against reliability.
- I'm guessing these comments all apply to `tools/aws-s3-scatter-upload.cwl` too.
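To make the FHS and pinning notes above concrete, here is a minimal sketch of what such a `dockerFile` could look like. The base image, package names, and version pins are illustrative assumptions, not taken from the repository:

```dockerfile
# Illustrative sketch only: base image and version pins are assumptions.
FROM debian:bookworm-slim
# python3-pip/python3-setuptools are unnecessary: venv bootstraps pip itself.
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3-venv \
    && rm -rf /var/lib/apt/lists/*
# FHS: add-on packages under /opt get their own /opt/<package> tree.
RUN python3 -m venv /opt/aws-cli-cwl \
    && /opt/aws-cli-cwl/bin/pip install 'awscli~=1.32' 'cwltool~=3.0'
ENV PATH=/opt/aws-cli-cwl/bin:$PATH
```

Installing awscli inside the venv, as suggested, also makes the apt layer smaller and the pin explicit.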
Thanks.
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Updated by Brett Smith 9 months ago
Not directly related to the story, but: looking at some output from the original workflow, I noticed that `aws s3 cp` seems to report progress many times per second. This makes it difficult to get a sense of progress when reviewing logs in Arvados. If there's an option to control logging frequency, adding it to the workflow seems like it would be a nice ergonomic improvement. If it's time-based, maybe sync it to crunchstat, every 10sec? If it's progress-based, maybe sync it to Keep blocks, 64MiB?
If this gets spun out into a separate story, sure.
Updated by Peter Amstutz 9 months ago
Brett Smith wrote in #note-8:
Not directly related to the story, but: looking at some output from the original workflow, I noticed that `aws s3 cp` seems to report progress many times per second. This makes it difficult to get a sense of progress when reviewing logs in Arvados. If there's an option to control logging frequency, adding it to the workflow seems like it would be a nice ergonomic improvement. If it's time-based, maybe sync it to crunchstat, every 10sec? If it's progress-based, maybe sync it to Keep blocks, 64MiB?
If this gets spun out into a separate story, sure.
https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/cp.html
If it is gumming up the logs, then it would probably make sense to add `--no-progress`.
Updated by Brett Smith 9 months ago
Peter Amstutz wrote in #note-9:
If it is gumming up the logs, then it would probably make sense to add `--no-progress`.
I agree that seems like the best available option. In principle, the network stats from crunchstat should be able to give you a better sense of how much work it's accomplished.
Updated by Peter Amstutz 9 months ago
https://github.com/arvados/aws-cli-cwl/commit/d7c146281d8aeb1687b7df0e9ed34b39db0babb9
I stumbled on process a little bit; I should have used a branch for this (next time).
- It scatters batches but no longer uses a sub-scatter on individual files. Instead it writes a shell script on the fly and executes that. This is overall simpler and has the important benefit that downloading to a Keep-backed directory works as originally intended.
- Requests 3 GB of RAM so that we get a c5.large instead of a t3
- No longer installs cwltool because it is no longer needed, so the Dockerfile is simpler
- Addressed the notes about the README
Updated by Brett Smith 9 months ago
Peter Amstutz wrote in #note-11:
- It scatters batches but no longer uses a sub-scatter on individual files. Instead it writes a shell script on the fly and executes that. This is overall simpler and has the important benefit that downloading to a Keep-backed directory works as originally intended.
Overall I like this approach and I appreciate how much it simplifies. However:
- The `--endpoint` argument should be quoted for safety.
- Every time we say `url.split('/').pop()`, that will not do the right thing if the URL has query parameters or a fragment.
Unfortunately, safely metaprogramming shell is basically impossible. I wonder if it would be nicer to write a Python script that takes an action and a list of sources as arguments. Then it could do rich string parsing and call `aws s3 cp` without going through the shell at all. If that's not practical in the time available, we should at least try to address the issues above.
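A hypothetical sketch of that suggestion: a small driver that takes an action and a list of source URLs, parses each URL properly, and invokes `aws s3 cp` with an argument list rather than through a shell, so nothing ever needs quoting. All names here are illustrative, not from the repository (the real CLI option for an alternate endpoint is `--endpoint-url`):

```python
import posixpath
import subprocess
import sys
from urllib.parse import urlsplit

def basename_of(url):
    # urlsplit separates the path from any query string or fragment first,
    # unlike a naive url.split('/').pop().
    return posixpath.basename(urlsplit(url).path)

def transfer(action, urls, endpoint=None):
    for url in urls:
        # For "download", copy the object into the working directory under
        # its own basename; an "upload" branch would flip the pair.
        cmd = ["aws", "s3", "cp", url, basename_of(url)]
        if endpoint:
            cmd += ["--endpoint-url", endpoint]
        # Argument-list invocation: no shell, so no metacharacter risk.
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    transfer(sys.argv[1], sys.argv[2:])
```

Because arguments are passed as a list, a malicious object key like `; rm -rf /` is just an ordinary string to `aws`, never interpreted by a shell.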
- Requests 3 GB of RAM so that we get a c5.large instead of a t3
This assumes a specific instance configuration on the cluster. I'm also frustrated that we're doing that thing we do of "This is what one user wants right now, so let's assume this is what every user wants for all time, and hardcode it."
Whenever our users writing CWL need dynamic instance selection, we tell them to make the resource an input parameter and evaluate it in the resource requirements. We should follow our own advice.
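Following that advice, the hardcoded request can become an input evaluated in the tool's `ResourceRequirement`. A minimal CWL sketch, with everything besides the `ramMin` pattern being illustrative:

```yaml
# Sketch: expose the RAM request as an input with a modest default,
# and evaluate it in the resource requirement instead of hardcoding 3 GB.
cwlVersion: v1.2
class: CommandLineTool
inputs:
  ramMin:
    type: int
    default: 3072
    doc: RAM to request for the transfer step, in MiB
requirements:
  ResourceRequirement:
    ramMin: $(inputs.ramMin)
baseCommand: [aws, s3, cp]
```

Users who need a bigger instance can then set `ramMin` in their input file without editing the workflow.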
Updated by Peter Amstutz 9 months ago
Brett Smith wrote in #note-12:
- The `--endpoint` argument should be quoted for safety.
Sure. I added a `quote` function that removes any single quotes or backslashes, and used it on all the input parameters.
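The helper is presumably along these lines, reconstructed from the error trace quoted later in this thread (a sketch, not the exact commit):

```javascript
// Wrap a value in single quotes after stripping quote and backslash
// characters, so the result is safe to splice into a generated shell script.
function quote(s) {
  return "'" + s.replaceAll("'", "").replaceAll("\\", "") + "'";
}
```

Stripping rather than escaping is lossy for pathological names, but it does guarantee the generated script cannot break out of the quoted string.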
- Every time we say `url.split('/').pop()`, that will not do the right thing if the URL has query parameters or a fragment.
I am fairly confident that s3:// URLs that this would be used with are unlikely to have query parameters or fragments.
Unfortunately, safely metaprogramming shell is basically impossible. I wonder if it would be nicer to write a Python script that takes an action and a list of sources as arguments. Then it could do rich string parsing and call `aws s3 cp` without going through the shell at all. If that's not practical in the time available, we should at least try to address these issues above.
The next step would probably be to stop using the AWS CLI and just write a Python script that uses the AWS Python SDK, and/or add an "s3_to_keep" function to our SDK. But I'd like to close out this ticket; I believe it should be enough to unblock the customer.
Whenever our users writing CWL need dynamic instance selection, we tell them to make the resource an input parameter and evaluate it in the resource requirements. We should follow our own advice.
I added a `ramMin` parameter.
https://github.com/arvados/aws-cli-cwl/commit/05caa76f428e05025c7798c3146571de702d4eb3
Updated by Brett Smith 9 months ago
Peter Amstutz wrote in #note-13:
https://github.com/arvados/aws-cli-cwl/commit/05caa76f428e05025c7798c3146571de702d4eb3
This all works for me, but testing the download workflow, I get this error:
2024-02-16T22:47:18.031158796Z stderr stderr was: 'evalmachine.<anonymous>:22
2024-02-16T22:47:18.031158796Z stderr var quote = function(s) { return "'"+s.replaceAll("'", "").replaceAll("\\", "")+"'"; }
2024-02-16T22:47:18.031158796Z stderr ^
2024-02-16T22:47:18.031158796Z stderr
2024-02-16T22:47:18.031158796Z stderr TypeError: s.replaceAll is not a function
2024-02-16T22:47:18.031158796Z stderr at quote (evalmachine.<anonymous>:22:40)
2024-02-16T22:47:18.031158796Z stderr at evalmachine.<anonymous>:28:50
2024-02-16T22:47:18.031158796Z stderr at Array.map (<anonymous>)
2024-02-16T22:47:18.031158796Z stderr at evalmachine.<anonymous>:27:30
2024-02-16T22:47:18.031158796Z stderr at evalmachine.<anonymous>:33:3
2024-02-16T22:47:18.031158796Z stderr at Script.runInContext (vm.js:130:18)
2024-02-16T22:47:18.031158796Z stderr at Script.runInNewContext (vm.js:135:17)
2024-02-16T22:47:18.031158796Z stderr at Object.runInNewContext (vm.js:302:38)
2024-02-16T22:47:18.031158796Z stderr at Socket.<anonymous> ([eval]:11:57)
2024-02-16T22:47:18.031158796Z stderr at Socket.emit (events.js:314:20)'
2024-02-16T22:47:18.031158796Z stderr
2024-02-16T22:47:18.052668966Z stderr WARNING [step download-batch] completed permanentFail
2024-02-16T22:47:18.115670924Z stderr INFO [workflow aws-s3-bulk-download.cwl (05caa76)] completed permanentFail
2024-02-16T22:47:18.115687692Z stderr WARNING [step aws-s3-bulk-download.cwl (05caa76)] completed permanentFail
2024-02-16T22:47:18.184831722Z stderr INFO [workflow workflow.json#main (05caa76)] completed permanentFail
2024-02-16T22:47:18.184862556Z stderr ERROR Overall process status is permanentFail
2024-02-16T22:47:18.315531485Z stderr INFO Final output collection 5e8f759f6a06c774c11c86f8c83c9910+59 "Output from workflow aws-s3-bulk-download.cwl (05caa76) (0h8mwjthta23k9v)" (pirca-4zz18-0h8mwjthta23k9v)
2024-02-16T22:47:18.562804568Z stderr WARNING Final process status is permanentFail (X-Request-Id: req-0fzxnddzpa9i5xia6nk4)
2024-02-16 17:47:21 arvados.cwl-runner[1708320] ERROR: Overall process status is permanentFail
I get this whether I use `--local` or `--submit`. I see that `String.replaceAll` is a function in my browser debugger, so 🤷
Updated by Peter Amstutz 9 months ago
Brett Smith wrote in #note-14:
Peter Amstutz wrote in #note-13:
https://github.com/arvados/aws-cli-cwl/commit/05caa76f428e05025c7798c3146571de702d4eb3
This all works for me, but testing the download workflow, I get this error:
[...]
I get this whether I use `--local` or `--submit`. I see that `String.replaceAll` is a function in my browser debugger, so 🤷
I'm glad you caught that. It turns out `replaceAll` is a newer addition to JavaScript, but there's another relatively succinct way to do it:
https://stackoverflow.com/questions/13340131/string-prototype-replaceall-not-working
Updated:
https://github.com/arvados/aws-cli-cwl/commit/2ba73528c63e111c12063181fa530daf205243c5
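The portable equivalent, since `String.replaceAll` isn't available in the older Node vm the CWL runner uses here, is a global-flag regex `replace` (the approach the linked Stack Overflow answer describes). A sketch of what the fixed helper might look like:

```javascript
// Same behavior as the replaceAll version, but works on older JS engines:
// a regex with the /g flag replaces every occurrence, not just the first.
function quote(s) {
  return "'" + s.replace(/'/g, "").replace(/\\/g, "") + "'";
}
```

`s.split("'").join("")` would be another replaceAll-free alternative with identical results.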
Updated by Brett Smith 9 months ago
Peter Amstutz wrote in #note-15:
https://github.com/arvados/aws-cli-cwl/commit/2ba73528c63e111c12063181fa530daf205243c5
I successfully tested this for both upload and download. I feel good delivering this version to the user, thank you.
I'm fine with all the code functionality-wise. For future development and users I think some things could be touched up:
- Both workflows give this warning:
  WARNING Workflow checker warning:
  aws-s3-bulk-upload.cwl:54:5: Source 'ramMin' of type ["null", "int"] may be incompatible
  aws-s3-bulk-upload.cwl:72:7: with sink 'ramMin' of type "int"
  If we could resolve this, that would be nice.
- The name of the `quote` function is misleading because it doesn't actually do that; it completely removes specific characters. Something like `stripMeta` would be better.
- In the `README` it would be nice if one of the example upload input files showed the syntax for uploading a file that's already in Keep. I realize that's sort of a general "how to use CWL" thing, but it's a nice affordance and confirms it supports both.
Thanks.
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Updated by Peter Amstutz 9 months ago
- Target version set to Development 2024-03-13 sprint
- Tracker changed from Idea to Feature
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Updated by Peter Amstutz 8 months ago
Brett Smith wrote in #note-16:
Peter Amstutz wrote in #note-15:
https://github.com/arvados/aws-cli-cwl/commit/2ba73528c63e111c12063181fa530daf205243c5
I successfully tested this for both upload and download. I feel good delivering this version to the user, thank you.
I'm fine with all the code functionality-wise. For future development and users I think some things could be touched up:
- Both workflows give this warning: [...] If we could resolve this, that would be nice.
- The name of the `quote` function is misleading because it doesn't actually do that; it completely removes specific characters. Something like `stripMeta` would be better.
- In the `README` it would be nice if one of the example upload input files showed the syntax for uploading a file that's already in Keep. I realize that's sort of a general "how to use CWL" thing, but it's a nice affordance and confirms it supports both.
Thanks.
I actually made a branch this time:
21378-cleanup @ commit:0807c5b8932f6ef84f756ef590cf7fed74b4e980
https://github.com/arvados/aws-cli-cwl/commit/0807c5b8932f6ef84f756ef590cf7fed74b4e980
- Add an example of a keep: URL
- Make `ramMin` optional to suppress a spurious checker warning
- Rename the `quote()` function to `sanitize()` (although it still also adds quotes)
- Don't generate empty batches when `len(urls) < count`
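The last point can be sketched as a hypothetical batching helper (not the actual commit): split the URL list into at most `count` batches, never emitting an empty one when there are fewer URLs than requested batches:

```python
def make_batches(urls, count):
    # Never create more batches than there are URLs, so no batch is empty.
    count = min(count, len(urls))
    if count == 0:
        return []
    # Spread the remainder over the first `extra` batches to keep sizes even.
    size, extra = divmod(len(urls), count)
    batches, start = [], 0
    for i in range(count):
        end = start + size + (1 if i < extra else 0)
        batches.append(urls[start:end])
        start = end
    return batches
```

With 5 URLs and `count=10`, this yields 5 single-item batches instead of 5 real batches plus 5 empty ones.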
Updated by Peter Amstutz 8 months ago
PR was approved and merged: https://github.com/arvados/aws-cli-cwl/pull/1
Updated by Peter Amstutz 8 months ago
- Status changed from In Progress to Resolved