Project

General

Profile

Actions

Bug #20257

closed

Timeout stuck downloads in arvados-cwl-runner

Added by Peter Amstutz about 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
CWL
Story points:
-
Release relationship:
Auto

Description

User reported that, using the http download feature of arvados-cwl-runner, there is no timeout for stuck downloads (where the connection hasn't been dropped but no progress seems to be made).

Possible solutions:

  • Calculate a timeout based on file size and minimum bandwidth and use that with the existing Requests code
    • the only problem is we probably have to use a HEAD call to get the content length, and sometimes (such as S3 signed URLs) you can't use HEAD
  • Switch the download code to use pyCURL and use pycurl.LOW_SPEED_TIME / pycurl.LOW_SPEED_LIMIT options to determine when to abandon poorly performing downloads

Sort of related, it seems to be doing integer division to calculate MiB/s, I only see 0.00, 1.00 or 2.00 and not fractional download rates that I would expect.


Subtasks 1 (0 open1 closed)

Task #20387: Review 20257-http-importResolvedPeter Amstutz04/19/2023Actions

Related issues

Related to Arvados - Idea #20917: Test PySDK with different versions of pycurlNewActions
Actions #1

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to Development 2023-04-26 sprint
Actions #4

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Peter Amstutz
  • Status changed from New to In Progress
Actions #5

Updated by Peter Amstutz about 1 year ago

20257-http-import @ 59ffa1048ea27cfc18dfb510868edd994c0af5b2

developer-run-tests: #3611

  • Move the "http download to keep" module from CWL runner into the Python SDK so that other software can use it
  • Rework it to use pyCurl so that we can can set SO_KEEPALIVE, TCP_KEEPIDLE, CONNECTTIMEOUT, LOW_SPEED_TIME and LOW_SPEED_LIMIT in order to fail dead connections instead of sitting forever
  • cwl-runner now sets the "activity" property of runtime_status to one of "initialization", "data transfer" or "execution" to provide some basic insight into what the workflow runner is doing.
Actions #7

Updated by Brett Smith about 1 year ago

Peter Amstutz wrote in #note-5:

20257-http-import @ 59ffa1048ea27cfc18dfb510868edd994c0af5b2

  • Move the "http download to keep" module from CWL runner into the Python SDK so that other software can use it

Most of my concerns are about all the API added by this change. I appreciate the move to make this more reusable, but a lot of this is stuff I'm not convinced we want to commit to supporting in the SDK forever.

  • What do you think about renaming the pycurl helper module to _pycurl so it's all internal?
  • Similarly, I think most of the functions and classes in http_import would be better marked internal, except the entry point http_to_keep.
  • I feel like http_import is a pretty generic name that doesn't help API readers understand what it does (imports to what/where?). http_to_collection or http_to_keep seem like improvements. Open to other ideas too.
  • The format of the string returned by http_to_keep is really CWL-specific and not really useful anywhere else in the Python SDK. It would be nicer to return a 2-tuple (collection_id, filename), and let a-c-r do its own string-building.
  • I don't see much value in naming CurlDownloader after the library it's based on. Just Downloader or similar seems fine.

Implementation stuff:

-    api.collections().update(uuid=c.manifest_locator(), body={"collection":{"properties": properties}}).execute()
+    api.collections().update(uuid=c.manifest_locator(), body={"collection":{"properties": properties}}) #.execute()
  • This line is now a noop. So either it should be fixed to do something, or else removed entirely.
  • self.collection = arvados.collection.Collection() - The API user has no control over how this collection is constructed. It's not even tied to the api argument that the user passes to http_to_keep, so it's possible to create a weird split brain situation if the api client doesn't correspond to the user's default client configuration. It would be nice to have a test for that case.
  • I'm fine with us standardizing on titlecase header names internally for the implementation, but I think it's more friendly for user-facing strings like log messages to follow canonical capitalization for those names. e.g., "Found Etags %s" → "Found ETags %s"
  • def head(self, url, headers={}) - To prevent the usual Python mutable argument default gotcha in future development, it would be nicer to say headers=None and deal with that case in the body. It's also a weird asymmetry that headers has a default value in def head but not def get.
  • The Response class could be made a lot richer and debuggable with no extra code by decorating it with dataclasses.dataclass.
  • Comment "For some reason if I don't set these [info levels] explicitly some logs won't show up." - Quoting pydoc logging.Logger.setLevel: "Note that the root logger is created with level WARNING."

Thanks.

Actions #8

Updated by Peter Amstutz about 1 year ago

Brett Smith wrote in #note-7:

Peter Amstutz wrote in #note-5:

20257-http-import @ 59ffa1048ea27cfc18dfb510868edd994c0af5b2

  • Move the "http download to keep" module from CWL runner into the Python SDK so that other software can use it

Most of my concerns are about all the API added by this change. I appreciate the move to make this more reusable, but a lot of this is stuff I'm not convinced we want to commit to supporting in the SDK forever.

  • What do you think about renaming the pycurl helper module to _pycurl so it's all internal?

Renamed to _pycurlhelper

  • Similarly, I think most of the functions and classes in http_import would be better marked internal, except the entry point http_to_keep.

Added underscores to everything except http_to_keep

  • I feel like http_import is a pretty generic name that doesn't help API readers understand what it does (imports to what/where?). http_to_collection or http_to_keep seem like improvements. Open to other ideas too.
  • The format of the string returned by http_to_keep is really CWL-specific and not really useful anywhere else in the Python SDK. It would be nicer to return a 2-tuple (collection_id, filename), and let a-c-r do its own string-building.

I agree, I changed it.

  • I don't see much value in naming CurlDownloader after the library it's based on. Just Downloader or similar seems fine.

Sure, it is _Downloader now.

Implementation stuff:

[...]

  • This line is now a noop. So either it should be fixed to do something, or else removed entirely.

I can't remember what I was doing now but that was something I commented out in testing, that was a mistake, good catch.

  • self.collection = arvados.collection.Collection() - The API user has no control over how this collection is constructed. It's not even tied to the api argument that the user passes to http_to_keep, so it's possible to create a weird split brain situation if the api client doesn't correspond to the user's default client configuration. It would be nice to have a test for that case.

That was a mistake, it now uses the api client that was passed in.

  • I'm fine with us standardizing on titlecase header names internally for the implementation, but I think it's more friendly for user-facing strings like log messages to follow canonical capitalization for those names. e.g., "Found Etags %s" → "Found ETags %s"

Using title case for headers wasn't so much by choice so much as trying to reduce the footprint of the changes because requests uses title case but our existing pycurl code converted everything to lower case. It's unclear if requests returns "ETag" or "Etag", if it was the latter, then my etag checking code has never worked. For a debug logging string it doesn't matter, but I tweaked the string a little bit.

  • def head(self, url, headers={}) - To prevent the usual Python mutable argument default gotcha in future development, it would be nicer to say headers=None and deal with that case in the body. It's also a weird asymmetry that headers has a default value in def head but not def get.

I just removed the headers argument entirely for this case because it's an internal method and additional headers were not used.

  • The Response class could be made a lot richer and debuggable with no extra code by decorating it with dataclasses.dataclass.

Sure, that is easy.

  • Comment "For some reason if I don't set these [info levels] explicitly some logs won't show up." - Quoting pydoc logging.Logger.setLevel: "Note that the root logger is created with level WARNING."

Thanks, I thought I was going a little crazy. It was behaving strangely because sometimes it would log what it was doing and sometimes it wouldn't (although in testing, when it got to the new code, it wasn't logging at all, which is how I figured it out). I feel like there might have been some load order issue with modules that set a log level in the initialization, and it was working sometimes. It is fixed properly by setting it explicitly. I updated comment.

This does explain why we were seeing seemingly stuck jobs with no status being printed, they were actually maybe doing something but just not logging anything.

20257-http-import @ 10aaf718b0795aa37a4e74063b0206c507ddc6fe

developer-run-tests: #3618

Actions #9

Updated by Brett Smith about 1 year ago

Peter Amstutz wrote in #note-8:

  • The Response class could be made a lot richer and debuggable with no extra code by decorating it with dataclasses.dataclass.

Sure, that is easy.

Sorry, I should've spelled this out more. It's the same amount of code but it does require more changes, it needs to look like this. Double-check my typing for headers, feel free to ping me if it needs to be different but you're not sure how to write it.

import typing

@dataclasses.dataclass
class _Response:
    status_code: int
    headers: typing.Mapping[str, str]

That's right there's no __init__, that's provided by the decorator.

  • Thank you for the added http_to_keep docstring, that is nice. There's a stray "x" at the end of it as a typo that would be nice to clean up.
  • Just FYI, instead of saying list(cr.keys())[0], you can say next(iter(cr.keys())) and avoid the overhead of instantiating a whole list.

With at least the _Response change this is good to merge, thanks.

Actions #11

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2023-04-26 sprint to Development 2023-05-10 sprint
Actions #12

Updated by Peter Amstutz 12 months ago

  • Release set to 63
Actions #13

Updated by Peter Amstutz 12 months ago

  • Status changed from In Progress to Resolved
Actions #14

Updated by Brett Smith 8 months ago

  • Related to Idea #20917: Test PySDK with different versions of pycurl added
Actions

Also available in: Atom PDF