Feature #20650
a-c-r natively supports S3 inputs just like HTTP/S
Added by Brett Smith almost 2 years ago. Updated about 1 hour ago.
Description
In short, provide a native S3 version of http_to_keep. Users should be able to write S3 sources in their inputs, which a-c-r then writes to Keep before executing the workflow.
This should support the existing --defer-downloads option, so the user can decide whether to copy the data when they create the container request, or let it be handled inside the container.
Updated by Peter Amstutz about 1 year ago
- Target version changed from Future to Development 2024-03-27 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2024-03-27 sprint to Future
Updated by Peter Amstutz 5 months ago
- Target version changed from Future to Development 2024-12-04
Updated by Peter Amstutz 5 months ago
- Target version changed from Development 2024-12-04 to Development 2024-11-20
Updated by Peter Amstutz 5 months ago
- Target version changed from Development 2024-11-20 to Development 2024-12-04
Updated by Peter Amstutz 4 months ago
- Target version changed from Development 2024-12-04 to Development 2025-01-08
Updated by Peter Amstutz 3 months ago
- Target version changed from Development 2025-01-08 to Development 2025-01-29
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2025-01-29 to Development 2025-02-12
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2025-02-12 to Development 2025-02-26
Updated by Peter Amstutz about 2 months ago
- Target version changed from Development 2025-02-26 to Development 2025-03-19
Updated by Peter Amstutz about 2 months ago
20650-s3-downloader @ 8e08552a406e1d1ff133bd95a92db130b47381aa
Work in progress, just wanted to make sure it was recorded here.
Updated by Peter Amstutz about 1 month ago
- Target version changed from Development 2025-03-19 to Development 2025-04-02
Updated by Peter Amstutz 10 days ago
20650-s3-downloader @ d86b76612c6ae20699b7e31f324e2558ed6a4399
Updated by Peter Amstutz 10 days ago
- Related to Idea #22680: Arvados manages persistent credentials to external resources and can provide them to a container added
Updated by Peter Amstutz 10 days ago
20650-s3-downloader @ ba346458f5ed3d318d44063c559dcbeed82a0ffb
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- implemented s3_to_keep with similar behavior to http_to_keep (refactored to share code)
- arvados-cwl-runner uses s3_to_keep to import from S3
- arvados-cwl-runner captures local credentials and passes them to the container as a secret file, unless --disable-aws-credential-capture is used
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- the main follow-up is #22680, so that users can use --disable-aws-credential-capture and instead the container gets an AWS role from crunch-run
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- Added tests for s3_to_keep
- Added test for capturing local credentials and passing them as a secret
- New or changed UI/UX and has gotten feedback from stakeholders.
- n/a
- Documentation has been updated.
- updated "CWL run options" and "Guidelines for writing workflows" pages
- Behaves appropriately at the intended scale (describe intended scale).
- no new changes in scale, but similarly to the http download feature and --defer-downloads, this spreads the data transfer out to potentially many compute nodes instead of a single client node needing to do all the data transfer, which can greatly improve performance for large batch jobs
- Considered backwards and forwards compatibility issues between client and server.
- no changes, the metadata stored for downloads has not changed
- Follows our coding standards and GUI style guidelines.
- yes
Users can now provide s3:// URLs in the CWL input document and they are automatically downloaded to Keep, the same way http_to_keep already handles http(s) URLs.
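For example, a job order entry might look like the following (the input name and bucket/key are made up; CWL job orders can be written as JSON, shown here as the equivalent Python literal):

    # Hypothetical CWL job order entry; input name and s3 location are made up.
    job_order = {
        "reference_genome": {
            "class": "File",
            "location": "s3://example-bucket/genomes/hg38.fa",
        }
    }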
Because S3 is HTTP underneath and uses HTTP headers to communicate various metadata, I was able to refactor http_to_keep and share code between them.
It gets AWS credentials from the local environment using whatever behavior is implemented in boto3.
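Conceptually, the import does something like this sketch (this is not the SDK's actual s3_to_keep code; the real implementation shares code with http_to_keep and additionally handles caching, metadata, and errors):

    # Conceptual sketch only, not the actual s3_to_keep implementation:
    # stream an S3 object into a new Keep collection and return a keep: ref.
    from urllib.parse import urlparse

    import arvados.collection
    import boto3

    def s3_to_keep_sketch(api, project_uuid, url):
        parsed = urlparse(url)                        # s3://bucket/key
        bucket, key = parsed.netloc, parsed.path.lstrip("/")
        filename = key.split("/")[-1]

        # boto3 picks up credentials from the local environment.
        obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)

        coll = arvados.collection.Collection(api_client=api)
        with coll.open(filename, "wb") as out:
            for chunk in obj["Body"].iter_chunks(chunk_size=64 * 1024 * 1024):
                out.write(chunk)
        coll.save_new(name="Downloaded from %s" % url, owner_uuid=project_uuid)
        return "keep:%s/%s" % (coll.portable_data_hash(), filename)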
It will do the file transfer locally before submitting the container unless --defer-downloads is provided. In that case, it will pass the s3:// URL to the workflow runner container and capture the credentials from the local environment (whatever boto3 would have used).
I went back and forth on whether arvados-python-client should include boto3 as a standard dependency or an optional dependency. On the one hand, it's an extra thing to carry around, which is inelegant for users who are not using it. On the other hand, it is extra friction to have to request it explicitly for users who do want to use it. I decided to split the difference: boto3 is only imported on first use, when an s3:// URL is found, so in the case where there are no s3 URLs it is not imported at all.
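A minimal sketch of that lazy-import pattern (illustrative only, not the SDK's exact code):

    # Illustrative lazy import: boto3 loads only when an s3:// URL is seen.
    _boto3 = None

    def get_boto3():
        global _boto3
        if _boto3 is None:
            try:
                import boto3
            except ImportError as err:
                raise RuntimeError(
                    "s3:// URLs require the boto3 package to be installed") from err
            _boto3 = boto3
        return _boto3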
See https://workbench.tordo.arvadosapi.com/processes/tordo-xvhdp-j1qpgyuhap0t83b for example of this in action.
Updated by Peter Amstutz 3 days ago
- Target version changed from Development 2025-04-02 to Development 2025-04-16
Updated by Lucas Di Pentima 2 days ago
Sorry for the late review. Here're my comments/questions:
- Trying to understand all the combinations: does this allow using --defer-downloads and --disable-aws-credential-capture together? Looking at pathmapper.py line 123, it seems that this is not possible, as the credentials would still be passed on to the container.
- I think this would be a useful combination for cases where the compute nodes have access to the target S3 buckets via an instance profile and IAM role, for example.
- I think a new test would be needed to try the combination "deferred download without credentials" making sure we're not leaking anything.
- In case of multiple AWS profiles, it might cause confusion for the user as to why they're not able to download from the bucket if the correct credentials are not stored in the default profile.
- Maybe we need an additional a-c-r argument for this?
- Would it be useful (and easy enough) to add a validation step to check that the s3 URLs are indeed accessible with the given credentials, when using --defer-downloads?
- File doc/user/cwl/cwl-style.html.textile.liquid: Typo at line 288: 'Arvado' and missing 'in' at the end of the last sentence.
Updated by Peter Amstutz about 1 hour ago
Lucas Di Pentima wrote in #note-23:
Sorry for the late review. Here're my comments/questions:
- Trying to understand all the combinations: does this allow using --defer-downloads and --disable-aws-credential-capture together? Looking at pathmapper.py line 123, it seems that this is not possible, as the credentials would still be passed on to the container.
- I think this would be a useful combination for cases where the compute nodes have access to the target S3 buckets via an instance profile and IAM role, for example.
That is exactly how it is intended to work. The logic may be a bit hard to read, but what it is saying is to create a boto session when:
- defer_downloads is False (i.e. we are going to download files right now), or
- aws_credential_capture is True (i.e. we are going to pass them to the container)
So in the case that defer_downloads is True and aws_credential_capture is False, it will not create a boto session.
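In code, the decision is equivalent to this sketch (variable names assumed from the discussion above, not the actual pathmapper.py source):

    import boto3

    def maybe_make_boto_session(defer_downloads, aws_credential_capture):
        # Create a session when downloading now, or when local credentials
        # will be captured and passed to the container as a secret.
        if not defer_downloads or aws_credential_capture:
            return boto3.session.Session()
        # Deferred download without credential capture: never touch local
        # credentials, so nothing can leak into the container request.
        return None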
- I think a new test would be needed to try the combination "deferred download without credentials" making sure we're not leaking anything.
I added a test for this, and it confirmed the logic was working as intended.
- In case of multiple AWS profiles, it might cause confusion to the user on why they're not able to download from the bucket if the correct credentials are not stored in the default profile.
- Maybe we need an additional a-c-r argument for this?
I'm assuming users should be able to use the AWS_PROFILE environment variable to select a different profile, but you're right that it might be more accessible if there were a command line parameter to select a profile.
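For reference, boto3's standard credential resolution honors AWS_PROFILE, so something like this should already work (the profile name is made up):

    import os
    import boto3

    os.environ["AWS_PROFILE"] = "research"   # or: export AWS_PROFILE=research
    session = boto3.session.Session()        # honors AWS_PROFILE
    creds = session.get_credentials()
    if creds is not None:
        print("using access key", creds.access_key)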
- Would it be useful (and easy enough) to add a validation step to check that the s3 URLs are indeed accessible with the given credentials, when using --defer-downloads?
I don't think that is a good idea; doing a bunch of network requests to verify that resources exist still takes time (sometimes a lot of time), which goes against the goal of options like --defer-downloads, which is to defer the work and launch the workflow as fast as possible.
- File doc/user/cwl/cwl-style.html.textile.liquid: Typo at line 288: 'Arvado' and missing 'in' at the end of the last sentence.
20650-s3-downloader @ 5d8f237ce20d9100019cca4394b6f79ef98e171f