Project

General

Profile

Actions

Feature #20650

open

a-c-r natively supports S3 inputs just like HTTP/S

Added by Brett Smith almost 2 years ago. Updated about 1 hour ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
CWL
Target version:
Story points:
-

Description

In short, provide a native S3 version of http_to_keep. Users should be able to write S3 sources in their inputs, which a-c-r then writes to keep before executing the workflow.

This should support the existing --defer-downloads option, so the user can decide whether to copy the data when they create the container request, or let it be handled inside the container.


Subtasks 1 (1 open0 closed)

Task #22301: Review 20650-s3-downloaderIn ProgressLucas Di Pentima04/04/2025Actions

Related issues 1 (1 open0 closed)

Related to Arvados - Idea #22680: Arvados manages persistent credentials to external resources and can provide them to a containerNewActions
Actions #3

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to Development 2024-03-27 sprint
Actions #4

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-03-27 sprint to Future
Actions #5

Updated by Peter Amstutz 5 months ago

  • Target version changed from Future to Development 2024-12-04
Actions #6

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-12-04 to Development 2024-11-20
Actions #7

Updated by Peter Amstutz 5 months ago

  • Assigned To set to Peter Amstutz
Actions #8

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-11-20 to Development 2024-12-04
Actions #9

Updated by Peter Amstutz 5 months ago

  • Tracker changed from Idea to Feature
Actions #10

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-12-04 to Development 2025-01-08
Actions #11

Updated by Peter Amstutz 4 months ago

  • Status changed from New to In Progress
Actions #12

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2025-01-08 to Development 2025-01-29
Actions #13

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2025-01-29 to Development 2025-02-12
Actions #14

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2025-02-12 to Development 2025-02-26
Actions #15

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2025-02-26 to Development 2025-03-19
Actions #16

Updated by Peter Amstutz about 2 months ago

20650-s3-downloader @ 8e08552a406e1d1ff133bd95a92db130b47381aa

Work in progress, just wanted to make sure it was recorded here.

Actions #17

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2025-03-19 to Development 2025-04-02
Actions #20

Updated by Peter Amstutz 10 days ago

  • Related to Idea #22680: Arvados manages persistent credentials to external resources and can provide them to a container added
Actions #21

Updated by Peter Amstutz 10 days ago

20650-s3-downloader @ ba346458f5ed3d318d44063c559dcbeed82a0ffb

developer-run-tests: #4712

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • implemented s3_to_keep with similar behavior to http_to_keep (refactored to share code)
    • arvados-cwl-runner uses s3_to_keep to import from S3
    • arvados-cwl-runner captures local credentials and passes them to the container as a secret file, unless --disable-aws-credential-capture is used
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • the main follow-up is #22680 so that users can use --disable-aws-credential-capture and instead the container gets an AWS role from crunch-run
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • Added tests for s3_to_keep
    • Added test for capturing local credentials and passing them as a secret
  • New or changed UX/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • updated "CWL run options" and "Guidelines for writing workflows" pages
  • Behaves appropriately at the intended scale (describe intended scale).
    • no new changes in scale, but similarly to the http download feature and --defer-downloads this spreads the data transfer out to potentially many compute nodes instead of a single client node needed to do all the data transfer, which can greatly improve performance for large batch jobs
  • Considered backwards and forwards compatibility issues between client and server.
    • no changes, the metadata stored for downloads has not changed
  • Follows our coding standards and GUI style guidelines.
    • yes

Users can now provide s3:// URLs in the CWL input document and the are automatically downloaded to Keep, the way http_to_keep already does it.

Because S3 is HTTP underneath and uses HTTP headers to communicate various metadata, I was able to refactor http_to_keep and share code between them.

It gets AWS credentials from the local environment using whatever behavior is implemented in boto3.

It will do the file transfer locally before submitting the container unless --defer-downloads is provided. In that case, it will pass the s3:// to the workflow runner container and capture the credentials from local environment (whatever boto3 would have used).

I went back and forth on whether arvados-python-client should include boto3 as a standard dependency or an optional dependency. On the one hand, it's extra thing to carry around which is inelegant for users who are not using it. On the other hand, it is extra friction to have to request it explicitly for users who do want to use it. I decided to split the difference and make it so that it is only imported on first use when a s3 URL is found, so in the case where there are no s3 URLs, it is not imported at all.

See https://workbench.tordo.arvadosapi.com/processes/tordo-xvhdp-j1qpgyuhap0t83b for example of this in action.

Actions #22

Updated by Peter Amstutz 3 days ago

  • Target version changed from Development 2025-04-02 to Development 2025-04-16
Actions #23

Updated by Lucas Di Pentima 2 days ago

Sorry for the late review. Here're my comments/questions:

  • Trying to understand all the combinations, does this allows to use --defer-downloads and --disable-aws-credentials-capture? Looking at pathmapper.py line 123, it seems that this is not possible, as the credentials would still be passed on to the container.
    • I think this would be a useful combination for cases where the compute nodes have access to the target S3 buckets via an instance profile and IAM role, for example.
    • I think a new test would be needed to try the combination "deferred download without credentials" making sure we're not leaking anything.
  • In case of multiple AWS profiles, it might cause confusion to the user on why they're not able to download from the bucket if the correct credentials are not stored in the default profile.
    • Maybe we need an additional a-c-r argument for this?
  • Would it be useful (and easy enough) to add a validation step to check that the s3 URLs are indeed accessible with the given credentials, when using --defer-downloads?
  • File doc/user/cwl/cwl-style.html.textile.liquid: Typo at line 288: 'Arvado' and missing 'in' at the end of the last sentence.
Actions #24

Updated by Peter Amstutz about 1 hour ago

Lucas Di Pentima wrote in #note-23:

Sorry for the late review. Here're my comments/questions:

  • Trying to understand all the combinations, does this allows to use --defer-downloads and --disable-aws-credentials-capture? Looking at pathmapper.py line 123, it seems that this is not possible, as the credentials would still be passed on to the container.
    • I think this would be a useful combination for cases where the compute nodes have access to the target S3 buckets via an instance profile and IAM role, for example.

That is exactly how it is intended to work. The logic may be a bit hard to read, but what it is saying is to create a boto session when:

defer_downloads is False (i.e. we are going to download files right now)

or

aws_credential_capture is True (i.e. we are going to pass to the container)

So in the case that defer_downloads is True and aws_credential_capture is False, then it will not create a boto session.

  • I think a new test would be needed to try the combination "deferred download without credentials" making sure we're not leaking anything.

I added a test for this, and it confirmed the logic was working as intended.

  • In case of multiple AWS profiles, it might cause confusion to the user on why they're not able to download from the bucket if the correct credentials are not stored in the default profile.
    • Maybe we need an additional a-c-r argument for this?

I'm assuming users should be able to use the AWS_PROFILE environment variable to select a different profile, but you're right that it might be more accessible if there was add a command line parameter to select a profile.

  • Would it be useful (and easy enough) to add a validation step to check that the s3 URLs are indeed accessible with the given credentials, when using --defer-downloads?

I don't think that is a good idea, doing a bunch of network requests to verify that resources exist still takes time (sometimes a lot of time), which goes against the goal of options like --defer-downloads which is to defer the work and launch the workflow as fast as possible.

  • File doc/user/cwl/cwl-style.html.textile.liquid: Typo at line 288: 'Arvado' and missing 'in' at the end of the last sentence.

20650-s3-downloader @ 5d8f237ce20d9100019cca4394b6f79ef98e171f

developer-run-tests: #4724

Actions

Also available in: Atom PDF