Project

General

Profile

Actions

Feature #20937

closed

http-to-keep feature of arv-copy

Added by Peter Amstutz 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Story points:
-
Release relationship:
Auto

Description

arv-copy feature that will

  1. accept a HTTP or HTTPS URL
  2. check if the URL has already been downloaded, on either the 'src' or 'dst' cluster
  3. if the URL is found, copy from src to dst using collection copy
  4. if it is not found, download it from origin to dst using the http_to_keep downloader

Subtasks 1 (1 open0 closed)

Task #20942: Review 20937-arv-copy-httpIn ProgressPeter Amstutz09/08/2023Actions

Related issues

Related to Arvados - Feature #21046: Ability to specify authorization headers for tools using http_to_keep (a-c-r / arv-copy)NewActions
Actions #1

Updated by Peter Amstutz 8 months ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz 8 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 8 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 8 months ago

20937-arv-copy-http @ 57380e658e5deed7a1f7f735f88a14f1b616bfc0

developer-run-tests: #3822

  • Can now provide a http URL and arv-copy will transfer it to keep
  • If 'src' or 'dst' already has a cached copy of the http resource, it will copy the collection to the target project instead
  • Also addresses long-standing performance issue by implementing parallel upload/download of blocks, can now transfer collections at 150+ MiB/s
Actions #5

Updated by Peter Amstutz 8 months ago

  • Release set to 66
Actions #6

Updated by Peter Amstutz 8 months ago

  • Release changed from 66 to 67
Actions #7

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2023-09-13 sprint to Development 2023-09-27 sprint
Actions #8

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2023-09-27 sprint to Development 2023-10-11 sprint
Actions #9

Updated by Peter Amstutz 7 months ago

20937-arv-copy-http @ 73a20dc01eb18185bbccbbe3878b9fc56e4cbad8

developer-run-tests: #3852

  • All agreed upon points are implemented / addressed.
    • implemented behavior as described.
    • uses the http_to_keep SDK method
    • Added --varying-url-params and --prefer-cached-downloads to arv-copy which have the same behavior as arvados-cwl-runner
    • Additionally implemented parallel and simultaneous block upload/download to speed up collection copy
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • As I was writing this up, I thought it would probably be useful to be able to provide authorization headers, filed #21046
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • Unfortunately, arv-copy doesn't really have any tests. However, this test branch has been used to transfer hundreds of collections for about 20 TiB of data so it has seen substantial practical use already
  • Documentation has been updated.
    • added section to the arv-copy page about copying HTTP urls
  • Behaves appropriately at the intended scale (describe intended scale).
    • parallel upload/download enabled me to transfer collections between clusters at about 150 MiB/s
  • Considered backwards and forwards compatibility issues between client and server.
    • Uses the SDK implementation of http_to_keep so it follows existing convention for storing HTTP header metadata and will remain consistent with any future changes. The API of http_to_keep did change slightly (the function returns a 5-tuple instead of a 2-tuple, but the first two items are the same)
  • Follows our coding standards and GUI style guidelines.
    • yes

The motivation for this is that I had a sample sheet with hundreds of HTTP URLs. These files had already been downloaded to one cluster, and I needed them on a different another cluster. Because the http_to_keep method already is able to determine if a file has already been downloaded, the least additional coding required was to leverage the existing http_to_keep method to see if a collection already existed and if so, copy it, otherwise download it from the original URL.

In addition, because I had a giant pile of data to transfer, I took this opportunity to address a long-standing pet peeve, which was that arv-copy used an incredibly basic (and slow) block copying strategy. The new strategy uses multiple simultaneous upload and download threads which makes it possible to max out both the upload and download bandwidth of the node, significantly improving transfer rate over the original implementation.

Actions #10

Updated by Peter Amstutz 7 months ago

  • Related to Feature #21046: Ability to specify authorization headers for tools using http_to_keep (a-c-r / arv-copy) added
Actions #11

Updated by Brett Smith 7 months ago

Peter Amstutz wrote in #note-9:

20937-arv-copy-http @ 73a20dc01eb18185bbccbbe3878b9fc56e4cbad8

This feature crashes if the URL ends with a slash, which I feel like isn't uncommon:

% arv-copy --project-uuid=pirca-j7d0g-bbuwax2ne0jqkop 'https://doc.arvados.org/v2.7/'
Traceback (most recent call last):
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 190, in body_write
    self.headers_received()
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 186, in headers_received
    self.target = self.collection.open(self.name, "wb")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/collection.py", line 702, in open
    raise IOError(errno.EISDIR, "Is a directory", path)
IsADirectoryError: [Errno 21] Is a directory: ''
Traceback (most recent call last):
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 150, in download
    self.curl.perform()
pycurl.error: (23, 'Failure writing output to destination')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/arvados/arvenv/bin/arv-copy", line 7, in <module>
    main()
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/commands/arv_copy.py", line 161, in main
    result = copy_from_http(args.object_uuid, src_arv, dst_arv, args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/commands/arv_copy.py", line 903, in copy_from_http
    cached = arvados.http_to_keep.http_to_keep(dst, project_uuid, url,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 335, in http_to_keep
    req = curldownloader.download(url, headers)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 152, in download
    raise arvados.errors.HttpError(0, str(e))
arvados.errors.HttpError: (0, "(23, 'Failure writing output to destination')")

If you want to shuttle that off into a separate bug, okay.

In doc/user/topics/arv-copy.html.textile.liquid:

In both example command outputs, your personal shell prompt is accidentally included as part of the command. Please clean up.

I'm also not sure about the value of the second example. Documenting this usage of --src is cool, but the example output is functionally identical to the previous example, so it leaves me wondering if I'm missing something. I don't feel too strongly about this, just sharing my initial impression if it's interesting feedback.

In arv_copy.py:

Did you make bytes_written a list just so you can mutate it within the inner functions? You can leave it as an int and just add the line nonlocal bytes_written in the inner function definitions to get at it.

In get_thread, when you drain get_queue after an exception, you don't call task_done for each of those tasks. If you go down that branch, later get_queue.join() will hang forever.

Same comment for put_thread.

In http_to_keep.py:

This branch breaks the API of http_to_keep by changing the return type from a 2-tuple to a 5-tuple. I assume nobody actually cares so I'm just noting for posterity, yes I noticed.

Thanks.

Actions #12

Updated by Peter Amstutz 7 months ago

Brett Smith wrote in #note-11:

Peter Amstutz wrote in #note-9:

20937-arv-copy-http @ 73a20dc01eb18185bbccbbe3878b9fc56e4cbad8

This feature crashes if the URL ends with a slash, which I feel like isn't uncommon:

Fixed.

In doc/user/topics/arv-copy.html.textile.liquid:

In both example command outputs, your personal shell prompt is accidentally included as part of the command. Please clean up.

Fixed

I'm also not sure about the value of the second example. Documenting this usage of --src is cool, but the example output is functionally identical to the previous example, so it leaves me wondering if I'm missing something. I don't feel too strongly about this, just sharing my initial impression if it's interesting feedback.

I agree, I left the text but took out the example.

In arv_copy.py:

Did you make bytes_written a list just so you can mutate it within the inner functions? You can leave it as an int and just add the line nonlocal bytes_written in the inner function definitions to get at it.

You're right, that's what I was doing, I wasn't aware of the 'nonlocal' keyword (looks like it was introduced in Python 3, and the workaround I was using was necessary in 2.7).

In get_thread, when you drain get_queue after an exception, you don't call task_done for each of those tasks. If you go down that branch, later get_queue.join() will hang forever.

Same comment for put_thread.

Fixed.

In http_to_keep.py:

This branch breaks the API of http_to_keep by changing the return type from a 2-tuple to a 5-tuple. I assume nobody actually cares so I'm just noting for posterity, yes I noticed.

I did note that in the initial request for review, and yes I'm pretty sure nothing cares except arvados-cwl-runner which I fixed already.

20937-arv-copy-http @ 51ab4e57c3e395bee108270e0b3620bffb0c5856

developer-run-tests: #3856

Actions #13

Updated by Brett Smith 7 months ago

Peter Amstutz wrote in #note-12:

20937-arv-copy-http @ 51ab4e57c3e395bee108270e0b3620bffb0c5856

Just one API tip for the future, doesn't need to be done in the branch.

if args.verbose:
    logger.exception("%s", e)
else:
    logger.error("%s", e)

This can be simplified to:

logger.error("%s", e, exc_info=args.verbose)

Please merge, thanks.

Actions #14

Updated by Peter Amstutz 7 months ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF