Feature #20937 (Closed): http-to-keep feature of arv-copy
Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.
Description
arv-copy feature that will:
- accept an HTTP or HTTPS URL
- check whether the URL has already been downloaded, on either the 'src' or 'dst' cluster
- if the URL is found, copy it from src to dst using collection copy
- if it is not found, download it from the origin to dst using the http_to_keep downloader (the overall flow is sketched below)
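A minimal, self-contained sketch of that lookup order. This is a hedged illustration, not the actual arv-copy code: the per-cluster caches are plain dicts mapping URL to collection portable data hash, and the copy/download steps are reduced to a returned action tag, while the real implementation uses arvados.http_to_keep and arv-copy's collection copy machinery.

def plan_http_copy(url, src_cache, dst_cache):
    # Look for an existing download of this URL on either cluster first.
    for cache in (src_cache, dst_cache):
        pdh = cache.get(url)
        if pdh is not None:
            return ("copy-collection", pdh)   # found a prior download, copy it
    return ("download-from-origin", url)      # otherwise fetch from the origin

src_cache = {"https://example.com/a.fastq.gz": "d41d8cd98f00b204e9800998ecf8427e+0"}
assert plan_http_copy("https://example.com/a.fastq.gz", src_cache, {}) == \
    ("copy-collection", "d41d8cd98f00b204e9800998ecf8427e+0")
assert plan_http_copy("https://example.com/b.fastq.gz", src_cache, {}) == \
    ("download-from-origin", "https://example.com/b.fastq.gz")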
Updated by Peter Amstutz over 1 year ago
20937-arv-copy-http @ 57380e658e5deed7a1f7f735f88a14f1b616bfc0
- Can now provide an HTTP URL and arv-copy will transfer it to keep
- If 'src' or 'dst' already has a cached copy of the HTTP resource, it will copy the collection to the target project instead
- Also addresses a long-standing performance issue by implementing parallel upload/download of blocks; collections can now be transferred at 150+ MiB/s
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-09-13 sprint to Development 2023-09-27 sprint
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-09-27 sprint to Development 2023-10-11 sprint
Updated by Peter Amstutz over 1 year ago
20937-arv-copy-http @ 73a20dc01eb18185bbccbbe3878b9fc56e4cbad8
- All agreed upon points are implemented / addressed.
  - Implemented behavior as described.
  - Uses the http_to_keep SDK method.
  - Added --varying-url-params and --prefer-cached-downloads to arv-copy, which have the same behavior as in arvados-cwl-runner.
  - Additionally implemented parallel and simultaneous block upload/download to speed up collection copy.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
  - As I was writing this up, I thought it would probably be useful to be able to provide authorization headers; filed #21046.
- Code is tested and passing, both automated and manual, what manual testing was done is described.
  - Unfortunately, arv-copy doesn't really have any tests. However, this test branch has been used to transfer hundreds of collections for about 20 TiB of data, so it has seen substantial practical use already.
- Documentation has been updated.
  - Added a section to the arv-copy page about copying HTTP URLs.
- Behaves appropriately at the intended scale (describe intended scale).
  - Parallel upload/download enabled me to transfer collections between clusters at about 150 MiB/s.
- Considered backwards and forwards compatibility issues between client and server.
  - Uses the SDK implementation of http_to_keep, so it follows the existing convention for storing HTTP header metadata and will remain consistent with any future changes. The API of http_to_keep did change slightly: the function returns a 5-tuple instead of a 2-tuple, but the first two items are the same (see the compatibility sketch after this list).
- Follows our coding standards and GUI style guidelines.
  - Yes.
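As a small compatibility sketch for the 5-tuple change noted above: an old 2-tuple caller only needs to take the first two items. The stub below stands in for http_to_keep; the three trailing fields are placeholders for illustration, not the SDK's actual values.

def http_to_keep_stub():
    # Previously returned a 2-tuple; now a 5-tuple whose first two items are unchanged.
    return ("99999999999999999999999999999999+99", "file.dat", None, None, None)

result = http_to_keep_stub()
keep_ref, file_name = result[:2]   # old callers unpacked exactly two values
assert (keep_ref, file_name) == ("99999999999999999999999999999999+99", "file.dat")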
The motivation for this is that I had a sample sheet with hundreds of HTTP URLs. These files had already been downloaded to one cluster, and I needed them on a different cluster. Because the http_to_keep method is already able to determine whether a file has been downloaded, the least additional coding was to leverage the existing http_to_keep method to see whether a collection already existed and, if so, copy it; otherwise download it from the original URL.
In addition, because I had a giant pile of data to transfer, I took this opportunity to address a long-standing pet peeve: arv-copy used an incredibly basic (and slow) block copying strategy. The new strategy uses multiple simultaneous upload and download threads, which makes it possible to max out both the upload and download bandwidth of the node, significantly improving the transfer rate over the original implementation.
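Roughly, the new strategy is the producer/consumer layout sketched below. This is a hedged illustration with stubbed fetch_block/write_block helpers and illustrative names, not the actual arv_copy.py code (which also tracks bytes written and propagates errors).

import queue
import threading

# Stand-ins for the real Keep calls: pretend to download a block from the
# src cluster and upload it to the dst cluster.
def fetch_block(locator):
    return b"x" * 1024

def write_block(data):
    pass

def copy_blocks(locators, num_get_threads=4, num_put_threads=4):
    get_queue = queue.Queue()
    put_queue = queue.Queue()

    def get_worker():
        while True:
            loc = get_queue.get()
            if loc is None:
                get_queue.task_done()
                return
            put_queue.put(fetch_block(loc))
            get_queue.task_done()

    def put_worker():
        while True:
            data = put_queue.get()
            if data is None:
                put_queue.task_done()
                return
            write_block(data)
            put_queue.task_done()

    getters = [threading.Thread(target=get_worker, daemon=True) for _ in range(num_get_threads)]
    putters = [threading.Thread(target=put_worker, daemon=True) for _ in range(num_put_threads)]
    for t in getters + putters:
        t.start()

    for loc in locators:
        get_queue.put(loc)
    for _ in getters:
        get_queue.put(None)       # one sentinel per downloader thread
    get_queue.join()              # all blocks fetched and queued for upload

    for _ in putters:
        put_queue.put(None)       # one sentinel per uploader thread
    put_queue.join()              # all blocks written

copy_blocks(["block-%d" % i for i in range(10)])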
Updated by Peter Amstutz over 1 year ago
- Related to Feature #21046: Ability to specify authorization headers for tools using http_to_keep (a-c-r / arv-copy) added
Updated by Brett Smith over 1 year ago
Peter Amstutz wrote in #note-9:
20937-arv-copy-http @ 73a20dc01eb18185bbccbbe3878b9fc56e4cbad8
This feature crashes if the URL ends with a slash, which I feel like isn't uncommon:
% arv-copy --project-uuid=pirca-j7d0g-bbuwax2ne0jqkop 'https://doc.arvados.org/v2.7/'
Traceback (most recent call last):
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 190, in body_write
    self.headers_received()
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 186, in headers_received
    self.target = self.collection.open(self.name, "wb")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/collection.py", line 702, in open
    raise IOError(errno.EISDIR, "Is a directory", path)
IsADirectoryError: [Errno 21] Is a directory: ''

Traceback (most recent call last):
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 150, in download
    self.curl.perform()
pycurl.error: (23, 'Failure writing output to destination')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/arvados/arvenv/bin/arv-copy", line 7, in <module>
    main()
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/commands/arv_copy.py", line 161, in main
    result = copy_from_http(args.object_uuid, src_arv, dst_arv, args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/commands/arv_copy.py", line 903, in copy_from_http
    cached = arvados.http_to_keep.http_to_keep(dst, project_uuid, url,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 335, in http_to_keep
    req = curldownloader.download(url, headers)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados/arvenv/lib/python3.11/site-packages/arvados/http_to_keep.py", line 152, in download
    raise arvados.errors.HttpError(0, str(e))
arvados.errors.HttpError: (0, "(23, 'Failure writing output to destination')")
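For what it's worth, the immediate trigger appears to be that the trailing slash leaves the derived file name empty, so collection.open() is called with "". A minimal sketch of guarding against that is below; this is not the actual fix in http_to_keep.py, and the default name is an assumption.

import urllib.parse

def filename_from_url(url, default="index.html"):
    # Take the last non-empty path segment; fall back to a default name
    # when the path is empty or just "/".
    path = urllib.parse.urlparse(url).path
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return urllib.parse.unquote(name) or default

assert filename_from_url("https://doc.arvados.org/v2.7/") == "v2.7"
assert filename_from_url("https://example.com/") == "index.html"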
If you want to shuttle that off into a separate bug, okay.
In doc/user/topics/arv-copy.html.textile.liquid:
In both example command outputs, your personal shell prompt is accidentally included as part of the command. Please clean up.
I'm also not sure about the value of the second example. Documenting this usage of --src is cool, but the example output is functionally identical to the previous example, so it leaves me wondering if I'm missing something. I don't feel too strongly about this, just sharing my initial impression if it's interesting feedback.
In arv_copy.py:
Did you make bytes_written a list just so you can mutate it within the inner functions? You can leave it as an int and just add the line nonlocal bytes_written in the inner function definitions to get at it.
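For reference, a small sketch contrasting the two approaches, with illustrative names rather than the actual arv_copy.py code:

def copy_with_list():
    bytes_written = [0]          # mutable-container workaround
    def on_block(n):
        bytes_written[0] += n
    on_block(512)
    return bytes_written[0]

def copy_with_nonlocal():
    bytes_written = 0
    def on_block(n):
        nonlocal bytes_written   # rebind the enclosing variable directly
        bytes_written += n
    on_block(512)
    return bytes_written

assert copy_with_list() == copy_with_nonlocal() == 512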
In get_thread, when you drain get_queue after an exception, you don't call task_done for each of those tasks. If you go down that branch, a later get_queue.join() will hang forever.
Same comment for put_thread.
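A minimal sketch of the drain pattern being described, with illustrative names rather than the actual arv_copy.py code: every item removed during the drain still gets a task_done(), so a later join() can return.

import queue

def drain(q: queue.Queue):
    # Empty the queue after an error, matching each get() with task_done()
    # so that q.join() does not block forever.
    while True:
        try:
            q.get(block=False)
        except queue.Empty:
            break
        q.task_done()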
In http_to_keep.py:
This branch breaks the API of http_to_keep by changing the return type from a 2-tuple to a 5-tuple. I assume nobody actually cares, so I'm just noting for posterity: yes, I noticed.
Thanks.
Updated by Peter Amstutz over 1 year ago
Brett Smith wrote in #note-11:
Peter Amstutz wrote in #note-9:
20937-arv-copy-http @ 73a20dc01eb18185bbccbbe3878b9fc56e4cbad8
This feature crashes if the URL ends with a slash, which I feel like isn't uncommon:
Fixed.
In doc/user/topics/arv-copy.html.textile.liquid:
In both example command outputs, your personal shell prompt is accidentally included as part of the command. Please clean up.
Fixed
I'm also not sure about the value of the second example. Documenting this usage of --src is cool, but the example output is functionally identical to the previous example, so it leaves me wondering if I'm missing something. I don't feel too strongly about this, just sharing my initial impression if it's interesting feedback.
I agree, I left the text but took out the example.
In arv_copy.py:
Did you make bytes_written a list just so you can mutate it within the inner functions? You can leave it as an int and just add the line nonlocal bytes_written in the inner function definitions to get at it.
You're right, that's what I was doing. I wasn't aware of the 'nonlocal' keyword; it looks like it was introduced in Python 3, and the workaround I was using was necessary in 2.7.
In get_thread, when you drain get_queue after an exception, you don't call task_done for each of those tasks. If you go down that branch, a later get_queue.join() will hang forever.
Same comment for put_thread.
Fixed.
In http_to_keep.py:
This branch breaks the API of http_to_keep by changing the return type from a 2-tuple to a 5-tuple. I assume nobody actually cares, so I'm just noting for posterity: yes, I noticed.
I did note that in the initial request for review, and yes, I'm pretty sure nothing cares except arvados-cwl-runner, which I fixed already.
20937-arv-copy-http @ 51ab4e57c3e395bee108270e0b3620bffb0c5856
Updated by Brett Smith over 1 year ago
Peter Amstutz wrote in #note-12:
20937-arv-copy-http @ 51ab4e57c3e395bee108270e0b3620bffb0c5856
Just one API tip for the future, doesn't need to be done in the branch.
if args.verbose:
    logger.exception("%s", e)
else:
    logger.error("%s", e)
This can be simplified to:
logger.error("%s", e, exc_info=args.verbose)
Please merge, thanks.
Updated by Peter Amstutz over 1 year ago
- Status changed from In Progress to Resolved