Feature #20650
a-c-r natively supports S3 inputs just like HTTP/S
Description
In short, provide a native S3 version of http_to_keep. Users should be able to write S3 sources in their inputs, which a-c-r then writes to Keep before executing the workflow.
This should support the existing --defer-downloads option, so the user can decide whether to copy the data when they create the container request, or let it be handled inside the container.
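For illustration, a CWL input document referencing an S3 source might look like the sketch below; the input name, bucket, and key are hypothetical:

```yaml
# Hypothetical CWL input document: "reads" and the bucket/key are placeholders.
reads:
  class: File
  location: s3://examplebucket/samples/reads_1.fastq.gz
```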
Updated by Peter Amstutz about 1 year ago
- Target version changed from Future to Development 2024-03-27 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2024-03-27 sprint to Future
Updated by Peter Amstutz 5 months ago
- Target version changed from Future to Development 2024-12-04
Updated by Peter Amstutz 5 months ago
- Target version changed from Development 2024-12-04 to Development 2024-11-20
Updated by Peter Amstutz 4 months ago
- Target version changed from Development 2024-11-20 to Development 2024-12-04
Updated by Peter Amstutz 4 months ago
- Target version changed from Development 2024-12-04 to Development 2025-01-08
Updated by Peter Amstutz 3 months ago
- Target version changed from Development 2025-01-08 to Development 2025-01-29
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2025-01-29 to Development 2025-02-12
Updated by Peter Amstutz 2 months ago
- Target version changed from Development 2025-02-12 to Development 2025-02-26
Updated by Peter Amstutz about 2 months ago
- Target version changed from Development 2025-02-26 to Development 2025-03-19
Updated by Peter Amstutz about 2 months ago
20650-s3-downloader @ 8e08552a406e1d1ff133bd95a92db130b47381aa
Work in progress, just wanted to make sure it was recorded here.
Updated by Peter Amstutz about 1 month ago
- Target version changed from Development 2025-03-19 to Development 2025-04-02
Updated by Peter Amstutz 9 days ago
20650-s3-downloader @ d86b76612c6ae20699b7e31f324e2558ed6a4399
Updated by Peter Amstutz 9 days ago
- Related to Idea #22680: Arvados manages persistent credentials to external resources and can provide them to a container added
Updated by Peter Amstutz 9 days ago
20650-s3-downloader @ ba346458f5ed3d318d44063c559dcbeed82a0ffb
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- implemented s3_to_keep with similar behavior to http_to_keep (refactored to share code)
- arvados-cwl-runner uses s3_to_keep to import from S3
- arvados-cwl-runner captures local credentials and passes them to the container as a secret file, unless --disable-aws-credential-capture is used
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- the main follow-up is #22680, so that users can use --disable-aws-credential-capture and instead the container gets an AWS role from crunch-run
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- Added tests for s3_to_keep
- Added test for capturing local credentials and passing them as a secret
- New or changed UX/UI and has gotten feedback from stakeholders.
- n/a
- Documentation has been updated.
- updated "CWL run options" and "Guidelines for writing workflows" pages
- Behaves appropriately at the intended scale (describe intended scale).
- no new changes in scale, but similarly to the http download feature and --defer-downloads, this spreads the data transfer out to potentially many compute nodes instead of a single client node doing all the data transfer, which can greatly improve performance for large batch jobs
- Considered backwards and forwards compatibility issues between client and server.
- no changes, the metadata stored for downloads has not changed
- Follows our coding standards and GUI style guidelines.
- yes
Users can now provide s3:// URLs in the CWL input document, and they are automatically downloaded to Keep, just as http_to_keep already does for HTTP URLs.
Because S3 is HTTP underneath and uses HTTP headers to communicate various metadata, I was able to refactor http_to_keep and share code between them.
It gets AWS credentials from the local environment using whatever behavior is implemented in boto3.
It will do the file transfer locally before submitting the container unless --defer-downloads is provided. In that case, it will pass the s3:// URL through to the workflow runner container and capture the credentials from the local environment (whatever boto3 would have used).
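A minimal sketch of that capture step, assuming boto3's default credential chain; capture_aws_credentials and the [default]-profile layout are illustrative, not the actual a-c-r implementation:

```python
# Sketch only: resolve whatever credentials boto3 would use locally and
# serialize them in AWS credentials-file format, suitable for passing to
# the container as a secret file. Names here are hypothetical.
import boto3

def capture_aws_credentials():
    creds = boto3.Session().get_credentials()
    if creds is None:
        raise RuntimeError("no AWS credentials found in the local environment")
    frozen = creds.get_frozen_credentials()
    lines = [
        "[default]",
        f"aws_access_key_id = {frozen.access_key}",
        f"aws_secret_access_key = {frozen.secret_key}",
    ]
    if frozen.token:
        # Temporary (STS) credentials also carry a session token.
        lines.append(f"aws_session_token = {frozen.token}")
    return "\n".join(lines) + "\n"
```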
I went back and forth on whether arvados-python-client should include boto3 as a standard dependency or an optional dependency. On the one hand, it's an extra thing to carry around, which is inelegant for users who are not using it. On the other hand, it is extra friction for users who do want it to have to request it explicitly. I decided to split the difference: boto3 is only imported on first use, when an s3 URL is found, so in the case where there are no s3 URLs, it is not imported at all.
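The deferred-import pattern can be sketched as follows; the helper names are illustrative, not the actual arvados-python-client internals:

```python
# Import boto3 only when an s3:// URL is actually encountered, so
# installations that never use S3 inputs pay no dependency cost.
_boto3 = None

def _get_boto3():
    global _boto3
    if _boto3 is None:
        try:
            import boto3
        except ImportError as e:
            raise ImportError(
                "s3:// inputs require boto3; install it with 'pip install boto3'"
            ) from e
        _boto3 = boto3
    return _boto3

def open_s3_object(url):
    # Parse s3://bucket/key and fetch the object using boto3's default
    # credential chain.
    from urllib.parse import urlparse
    parsed = urlparse(url)
    client = _get_boto3().client("s3")
    return client.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))["Body"]
```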
See https://workbench.tordo.arvadosapi.com/processes/tordo-xvhdp-j1qpgyuhap0t83b for example of this in action.
Updated by Peter Amstutz 1 day ago
- Target version changed from Development 2025-04-02 to Development 2025-04-16
Updated by Lucas Di Pentima about 20 hours ago
Sorry for the late review. Here are my comments/questions:
- Trying to understand all the combinations: does this allow using --defer-downloads and --disable-aws-credential-capture together? Looking at pathmapper.py line 123, it seems that this is not possible, as the credentials would still be passed on to the container.
  - I think this would be a useful combination for cases where the compute nodes have access to the target S3 buckets via an instance profile and IAM role, for example.
- I think a new test would be needed for the combination "deferred download without credentials", making sure we're not leaking anything.
- In the case of multiple AWS profiles, it might confuse users why they're not able to download from the bucket if the correct credentials are not stored in the default profile.
  - Maybe we need an additional a-c-r argument for this?
- Would it be useful (and easy enough) to add a validation step to check that the s3 URLs are indeed accessible with the given credentials when using --defer-downloads? (A rough sketch of such a check follows this list.)
- File doc/user/cwl/cwl-style.html.textile.liquid: typo at line 288 ('Arvado') and a missing 'in' at the end of the last sentence.
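One way such a pre-submit check could look, as a sketch; validate_s3_url is a hypothetical helper, not part of the current branch:

```python
# Illustrative pre-submit validation for deferred S3 downloads: a HEAD
# request per s3:// input confirms it is reachable with the credentials
# that will be captured, without transferring the object body.
from urllib.parse import urlparse

import boto3
import botocore.exceptions

def validate_s3_url(url):
    parsed = urlparse(url)
    client = boto3.client("s3")
    try:
        client.head_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
    except botocore.exceptions.ClientError as e:
        raise RuntimeError(
            f"cannot access {url} with the current AWS credentials: {e}"
        ) from e
```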