Bug #20257
Timeout stuck downloads in arvados-cwl-runner (Closed)
Description
User reported that, using the http download feature of arvados-cwl-runner, there is no timeout for stuck downloads (where the connection hasn't been dropped but no progress seems to be made).
Possible solutions:
- Calculate a timeout based on file size and minimum bandwidth and use that with the existing Requests code
- the only problem is we probably have to use a HEAD call to get the content length, and sometimes (such as S3 signed URLs) you can't use HEAD
- Switch the download code to use pyCURL and use pycurl.LOW_SPEED_TIME / pycurl.LOW_SPEED_LIMIT options to determine when to abandon poorly performing downloads
Sort of related: it seems to be doing integer division when calculating MiB/s; I only see 0.00, 1.00, or 2.00, never the fractional download rates I would expect.
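A minimal sketch of the first option, assuming a plain requests-based downloader; the function name, constant, and thresholds below are illustrative, not the actual arvados-cwl-runner code:

import time
import requests

MIN_BANDWIDTH = 1024 * 1024  # assumed floor of 1 MiB/s; anything slower counts as stuck

def download_with_deadline(url, dest_path):
    # HEAD to learn the size up front; note this is the step that fails
    # for endpoints (such as some S3 signed URLs) that reject HEAD.
    head = requests.head(url, allow_redirects=True)
    size = int(head.headers.get("Content-Length", 0))
    # Overall deadline derived from the size and the minimum acceptable bandwidth.
    deadline = time.monotonic() + max(60, size / MIN_BANDWIDTH)
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=65536):
                if time.monotonic() > deadline:
                    raise TimeoutError("download exceeded computed deadline")
                f.write(chunk)

The per-read timeout catches a connection that goes completely silent, while the computed deadline catches one that trickles bytes too slowly to ever finish.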
Updated by Peter Amstutz over 1 year ago
- Target version changed from Future to Development 2023-04-26 sprint
Updated by Peter Amstutz over 1 year ago
- Assigned To set to Peter Amstutz
- Status changed from New to In Progress
Updated by Peter Amstutz over 1 year ago
20257-http-import @ 59ffa1048ea27cfc18dfb510868edd994c0af5b2
- Move the "http download to keep" module from CWL runner into the Python SDK so that other software can use it
- Rework it to use pyCurl so that we can set SO_KEEPALIVE, TCP_KEEPIDLE, CONNECTTIMEOUT, LOW_SPEED_TIME and LOW_SPEED_LIMIT in order to fail dead connections instead of sitting forever (a rough sketch of these options follows after this list)
- cwl-runner now sets the "activity" property of runtime_status to one of "initialization", "data transfer" or "execution" to provide some basic insight into what the workflow runner is doing.
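A rough sketch of the curl options involved, assuming a pycurl easy handle; the thresholds are illustrative and not necessarily the values used on the branch (CURLOPT_TCP_KEEPALIVE is what enables SO_KEEPALIVE on the underlying socket):

import pycurl

def configure_stall_detection(curl, connect_timeout=60, low_speed_time=60, low_speed_limit=65536):
    # Abort the transfer if it averages below low_speed_limit bytes/sec
    # for low_speed_time seconds, i.e. connected but making no real progress.
    curl.setopt(pycurl.LOW_SPEED_TIME, low_speed_time)
    curl.setopt(pycurl.LOW_SPEED_LIMIT, low_speed_limit)
    # Give up if the connection cannot even be established in time.
    curl.setopt(pycurl.CONNECTTIMEOUT, connect_timeout)
    # TCP keepalives so a silently dead peer is eventually detected.
    curl.setopt(pycurl.TCP_KEEPALIVE, 1)
    curl.setopt(pycurl.TCP_KEEPIDLE, 60)
    curl.setopt(pycurl.TCP_KEEPINTVL, 60)

curl = pycurl.Curl()
configure_stall_detection(curl)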
Updated by Peter Amstutz over 1 year ago
Updated by Brett Smith over 1 year ago
Peter Amstutz wrote in #note-5:
20257-http-import @ 59ffa1048ea27cfc18dfb510868edd994c0af5b2
- Move the "http download to keep" module from CWL runner into the Python SDK so that other software can use it
Most of my concerns are about all the API added by this change. I appreciate the move to make this more reusable, but a lot of this is stuff I'm not convinced we want to commit to supporting in the SDK forever.
- What do you think about renaming the pycurl helper module to _pycurl so it's all internal?
- Similarly, I think most of the functions and classes in http_import would be better marked internal, except the entry point http_to_keep.
- I feel like http_import is a pretty generic name that doesn't help API readers understand what it does (imports to what/where?). http_to_collection or http_to_keep seem like improvements. Open to other ideas too.
- The format of the string returned by http_to_keep is really CWL-specific and not really useful anywhere else in the Python SDK. It would be nicer to return a 2-tuple (collection_id, filename), and let a-c-r do its own string-building.
- I don't see much value in naming CurlDownloader after the library it's based on. Just Downloader or similar seems fine.
Implementation stuff:
- api.collections().update(uuid=c.manifest_locator(), body={"collection":{"properties": properties}}).execute()
+ api.collections().update(uuid=c.manifest_locator(), body={"collection":{"properties": properties}}) #.execute()
- This line is now a noop. So either it should be fixed to do something, or else removed entirely.
self.collection = arvados.collection.Collection()
- The API user has no control over how this collection is constructed. It's not even tied to the api argument that the user passes to http_to_keep, so it's possible to create a weird split brain situation if the api client doesn't correspond to the user's default client configuration. It would be nice to have a test for that case.
- I'm fine with us standardizing on titlecase header names internally for the implementation, but I think it's more friendly for user-facing strings like log messages to follow canonical capitalization for those names. e.g., "Found Etags %s" → "Found ETags %s"
def head(self, url, headers={})
- To prevent the usual Python mutable argument default gotcha in future development, it would be nicer to say headers=None and deal with that case in the body. It's also a weird asymmetry that headers has a default value in def head but not def get.
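For reference, the classic form of the gotcha and the usual fix; the names here are illustrative, not the branch's code:

def bad(url, headers={}):
    # The default dict is created once and shared by every call,
    # so mutations from one call leak into the next.
    headers.setdefault("User-Agent", "example")
    return headers

def good(url, headers=None):
    # A fresh dict per call avoids the shared-state surprise.
    headers = dict(headers or {})
    headers.setdefault("User-Agent", "example")
    return headers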
- The Response class could be made a lot richer and debuggable with no extra code by decorating it with dataclasses.dataclass.
- Comment "For some reason if I don't set these [info levels] explicitly some logs won't show up." - Quoting
pydoc logging.Logger.setLevel
: "Note that the root logger is created with level WARNING."
Thanks.
Updated by Peter Amstutz over 1 year ago
Brett Smith wrote in #note-7:
Peter Amstutz wrote in #note-5:
20257-http-import @ 59ffa1048ea27cfc18dfb510868edd994c0af5b2
- Move the "http download to keep" module from CWL runner into the Python SDK so that other software can use it
Most of my concerns are about all the API added by this change. I appreciate the move to make this more reusable, but a lot of this is stuff I'm not convinced we want to commit to supporting in the SDK forever.
- What do you think about renaming the pycurl helper module to _pycurl so it's all internal?
Renamed to _pycurlhelper
- Similarly, I think most of the functions and classes in http_import would be better marked internal, except the entry point http_to_keep.
Added underscores to everything except http_to_keep
- I feel like http_import is a pretty generic name that doesn't help API readers understand what it does (imports to what/where?). http_to_collection or http_to_keep seem like improvements. Open to other ideas too.
- The format of the string returned by http_to_keep is really CWL-specific and not really useful anywhere else in the Python SDK. It would be nicer to return a 2-tuple (collection_id, filename), and let a-c-r do its own string-building.
I agree, I changed it.
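For illustration, the caller side under the new shape might look like this; the argument order and the keep: reference format are assumptions for the sketch, not a quote of the a-c-r code:

# http_to_keep now returns a 2-tuple instead of a prebuilt string.
collection_pdh, filename = http_to_keep(api, project_uuid, url)
# arvados-cwl-runner builds its own CWL-facing reference from the pieces.
location = "keep:%s/%s" % (collection_pdh, filename)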
- I don't see much value in naming CurlDownloader after the library it's based on. Just Downloader or similar seems fine.
Sure, it is _Downloader now.
Implementation stuff:
[...]
- This line is now a noop. So either it should be fixed to do something, or else removed entirely.
I can't remember now what I was doing, but that was something I commented out in testing. That was a mistake; good catch.
self.collection = arvados.collection.Collection()
- The API user has no control over how this collection is constructed. It's not even tied to the api argument that the user passes to http_to_keep, so it's possible to create a weird split brain situation if the api client doesn't correspond to the user's default client configuration. It would be nice to have a test for that case.
That was a mistake, it now uses the api client that was passed in.
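A minimal sketch of the corrected construction, assuming the SDK's Collection constructor with its api_client parameter; the surrounding variable names are illustrative:

import arvados
import arvados.collection

api = arvados.api('v1')  # the client the caller passes in to http_to_keep
# Construct the collection against that same client, instead of letting
# Collection() fall back to the default configuration, which could point
# at a different cluster and cause the split brain described above.
collection = arvados.collection.Collection(api_client=api)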
- I'm fine with us standardizing on titlecase header names internally for the implementation, but I think it's more friendly for user-facing strings like log messages to follow canonical capitalization for those names. e.g., "Found Etags %s" → "Found ETags %s"
Using title case for headers wasn't so much a deliberate choice as an attempt to reduce the footprint of the changes: requests uses title case, but our existing pycurl code converted everything to lower case. It's unclear whether requests returns "ETag" or "Etag"; if it was the latter, then my etag-checking code has never worked. For a debug logging string it doesn't matter, but I tweaked the string a little bit.
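For what it's worth, requests normalizes this on the reading side: Response.headers is a case-insensitive mapping, so lookups work regardless of how the server capitalized the header. A small illustration:

from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({"ETag": '"abc123"'})
assert headers.get("etag") == headers.get("Etag") == headers.get("ETag")

The capitalization only starts to matter once the headers are copied into a plain, case-sensitive dict.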
def head(self, url, headers={})
- To prevent the usual Python mutable argument default gotcha in future development, it would be nicer to say headers=None and deal with that case in the body. It's also a weird asymmetry that headers has a default value in def head but not def get.
I just removed the headers argument entirely for this case because it's an internal method and additional headers were not used.
- The Response class could be made a lot richer and debuggable with no extra code by decorating it with dataclasses.dataclass.
Sure, that is easy.
- Comment "For some reason if I don't set these [info levels] explicitly some logs won't show up." - Quoting
pydoc logging.Logger.setLevel
: "Note that the root logger is created with level WARNING."
Thanks, I thought I was going a little crazy. It was behaving strangely: sometimes it would log what it was doing and sometimes it wouldn't (although in testing, once it got to the new code, it wasn't logging at all, which is how I figured it out). I suspect there was a load-order issue with modules that set a log level during initialization, which is why it only worked some of the time. It is fixed properly now by setting the level explicitly, and I updated the comment.
This does explain why we were seeing seemingly stuck jobs with no status being printed; they may actually have been doing something, just not logging anything.
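For context, this matches standard logging semantics: a module logger with no level of its own defers to the root logger, which defaults to WARNING, so INFO messages are dropped until some level is set explicitly. A small illustration (the logger name is made up):

import logging

logger = logging.getLogger("arvados.http_example")
logging.basicConfig()            # root logger gets a handler but stays at WARNING
logger.info("not shown")         # effective level is WARNING, so this is dropped
logger.setLevel(logging.INFO)    # setting the level explicitly, as the fix does
logger.info("now shown")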
20257-http-import @ 10aaf718b0795aa37a4e74063b0206c507ddc6fe
Updated by Brett Smith over 1 year ago
Peter Amstutz wrote in #note-8:
- The Response class could be made a lot richer and debuggable with no extra code by decorating it with dataclasses.dataclass.
Sure, that is easy.
Sorry, I should've spelled this out more. It's the same amount of code, but it does require more changes; it needs to look like this. Double-check my typing for headers, and feel free to ping me if it needs to be different but you're not sure how to write it.
import dataclasses
import typing

@dataclasses.dataclass
class _Response:
    status_code: int
    headers: typing.Mapping[str, str]
That's right, there's no __init__; that's provided by the decorator.
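A tiny usage example of the _Response sketch above, showing the generated constructor and repr:

r = _Response(status_code=200, headers={"Content-Length": "1234"})
print(r)  # _Response(status_code=200, headers={'Content-Length': '1234'})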
- Thank you for the added http_to_keep docstring, that is nice. There's a stray "x" at the end of it as a typo that would be nice to clean up.
- Just FYI, instead of saying list(cr.keys())[0], you can say next(iter(cr.keys())) and avoid the overhead of instantiating a whole list.
With at least the _Response change this is good to merge, thanks.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-04-26 sprint to Development 2023-05-10 sprint
Updated by Peter Amstutz over 1 year ago
- Status changed from In Progress to Resolved
Updated by Brett Smith over 1 year ago
- Related to Idea #20917: Test PySDK with different versions of pycurl added