Project

General

Profile

Actions

Feature #20650

open

a-c-r natively supports S3 inputs just like HTTP/S

Added by Brett Smith almost 2 years ago. Updated about 20 hours ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
CWL
Target version:
Story points:
-

Description

In short, provide a native S3 version of http_to_keep. Users should be able to write S3 sources in their inputs, which a-c-r then writes to keep before executing the workflow.

This should support the existing --defer-downloads option, so the user can decide whether to copy the data when they create the container request, or let it be handled inside the container.


Subtasks 1 (1 open0 closed)

Task #22301: Review 20650-s3-downloaderIn ProgressPeter Amstutz04/02/2025Actions

Related issues 1 (1 open0 closed)

Related to Arvados - Idea #22680: Arvados manages persistent credentials to external resources and can provide them to a containerNewActions
Actions #3

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to Development 2024-03-27 sprint
Actions #4

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-03-27 sprint to Future
Actions #5

Updated by Peter Amstutz 5 months ago

  • Target version changed from Future to Development 2024-12-04
Actions #6

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-12-04 to Development 2024-11-20
Actions #7

Updated by Peter Amstutz 5 months ago

  • Assigned To set to Peter Amstutz
Actions #8

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-11-20 to Development 2024-12-04
Actions #9

Updated by Peter Amstutz 4 months ago

  • Tracker changed from Idea to Feature
Actions #10

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-12-04 to Development 2025-01-08
Actions #11

Updated by Peter Amstutz 4 months ago

  • Status changed from New to In Progress
Actions #12

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2025-01-08 to Development 2025-01-29
Actions #13

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2025-01-29 to Development 2025-02-12
Actions #14

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2025-02-12 to Development 2025-02-26
Actions #15

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2025-02-26 to Development 2025-03-19
Actions #16

Updated by Peter Amstutz about 2 months ago

20650-s3-downloader @ 8e08552a406e1d1ff133bd95a92db130b47381aa

Work in progress, just wanted to make sure it was recorded here.

Actions #17

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2025-03-19 to Development 2025-04-02
Actions #20

Updated by Peter Amstutz 9 days ago

  • Related to Idea #22680: Arvados manages persistent credentials to external resources and can provide them to a container added
Actions #21

Updated by Peter Amstutz 9 days ago

20650-s3-downloader @ ba346458f5ed3d318d44063c559dcbeed82a0ffb

developer-run-tests: #4712

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • implemented s3_to_keep with similar behavior to http_to_keep (refactored to share code)
    • arvados-cwl-runner uses s3_to_keep to import from S3
    • arvados-cwl-runner captures local credentials and passes them to the container as a secret file, unless --disable-aws-credential-capture is used
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • the main follow-up is #22680 so that users can use --disable-aws-credential-capture and instead the container gets an AWS role from crunch-run
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • Added tests for s3_to_keep
    • Added test for capturing local credentials and passing them as a secret
  • New or changed UX/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • updated "CWL run options" and "Guidelines for writing workflows" pages
  • Behaves appropriately at the intended scale (describe intended scale).
    • no new changes in scale, but similarly to the http download feature and --defer-downloads this spreads the data transfer out to potentially many compute nodes instead of a single client node needed to do all the data transfer, which can greatly improve performance for large batch jobs
  • Considered backwards and forwards compatibility issues between client and server.
    • no changes, the metadata stored for downloads has not changed
  • Follows our coding standards and GUI style guidelines.
    • yes

Users can now provide s3:// URLs in the CWL input document and the are automatically downloaded to Keep, the way http_to_keep already does it.

Because S3 is HTTP underneath and uses HTTP headers to communicate various metadata, I was able to refactor http_to_keep and share code between them.

It gets AWS credentials from the local environment using whatever behavior is implemented in boto3.

It will do the file transfer locally before submitting the container unless --defer-downloads is provided. In that case, it will pass the s3:// to the workflow runner container and capture the credentials from local environment (whatever boto3 would have used).

I went back and forth on whether arvados-python-client should include boto3 as a standard dependency or an optional dependency. On the one hand, it's extra thing to carry around which is inelegant for users who are not using it. On the other hand, it is extra friction to have to request it explicitly for users who do want to use it. I decided to split the difference and make it so that it is only imported on first use when a s3 URL is found, so in the case where there are no s3 URLs, it is not imported at all.

See https://workbench.tordo.arvadosapi.com/processes/tordo-xvhdp-j1qpgyuhap0t83b for example of this in action.

Actions #22

Updated by Peter Amstutz 1 day ago

  • Target version changed from Development 2025-04-02 to Development 2025-04-16
Actions #23

Updated by Lucas Di Pentima about 20 hours ago

Sorry for the late review. Here're my comments/questions:

  • Trying to understand all the combinations, does this allows to use --defer-downloads and --disable-aws-credentials-capture? Looking at pathmapper.py line 123, it seems that this is not possible, as the credentials would still be passed on to the container.
    • I think this would be a useful combination for cases where the compute nodes have access to the target S3 buckets via an instance profile and IAM role, for example.
    • I think a new test would be needed to try the combination "deferred download without credentials" making sure we're not leaking anything.
  • In case of multiple AWS profiles, it might cause confusion to the user on why they're not able to download from the bucket if the correct credentials are not stored in the default profile.
    • Maybe we need an additional a-c-r argument for this?
  • Would it be useful (and easy enough) to add a validation step to check that the s3 URLs are indeed accessible with the given credentials, when using --defer-downloads?
  • File doc/user/cwl/cwl-style.html.textile.liquid: Typo at line 288: 'Arvado' and missing 'in' at the end of the last sentence.
Actions

Also available in: Atom PDF