Bug #3835

[SDKs] Python and CLI tools should give more helpful error messages after a Keep failure

Added by Tim Pierce almost 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
SDKs
Target version:
Start date:
01/13/2015
Due date:
% Done:

100%

Estimated time:
(Total: 2.00 h)
Story points:
0.5

Description

Errors returned from arvados.keep.KeepClient.put are short and cryptic, e.g.:

arvados.errors.KeepWriteError: Write fail for d41d8cd98f00b204e9800998ecf8427e: wanted 2 but wrote 0

The SDK should record the reason why a write failed (at least one of the failed writes) and report it in the KeepWriteError that is raised to the user.

Each writer thread should either succeed or throw an exception. If the write does not eventually succeed, the main thread should report the last exception encountered for each service it attempted to use.


Subtasks

Task #4972: Review 3835-pysdk-keep-exceptions-wipResolvedTom Clegg


Related issues

Related to Arvados - Bug #4738: [SDKs] Keep write error (wanted 2 but wrote 0)Closed12/08/2014

Has duplicate Arvados - Bug #4672: [SDKs] KeepClient.put raises an exception on a block larger than 64MBResolved11/25/2014

Has duplicate Arvados - Bug #4661: [SDKs] Python Keep client's retry/rescue should not make an OOM exception look like a Keep problemResolved

Has duplicate Arvados - Bug #4885: keep write error (wanted 2 but wrote 0) arvadosResolved12/30/2014

Has duplicate Arvados - Bug #4884: keep write error (wanted 2 but wrote 0) arvadosResolved12/30/2014

Associated revisions

Revision 952bfa87
Added by Brett Smith over 4 years ago

Merge branch '3835-pysdk-keep-exceptions-wip'

Closes #3835, #4972.

History

#1 Updated by Tom Clegg almost 5 years ago

  • Subject changed from Python SDK does not report reasons for a Keep failure to [SDKs] Python adn CLI tools should give more helpful error messages after a Keep failure
  • Story points set to 1.0

#2 Updated by Tom Clegg almost 5 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Tom Clegg almost 5 years ago

  • Subject changed from [SDKs] Python adn CLI tools should give more helpful error messages after a Keep failure to [SDKs] Python and CLI tools should give more helpful error messages after a Keep failure

#4 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)
  • Story points changed from 1.0 to 0.5

#5 Updated by Tom Clegg over 4 years ago

  • Priority changed from Normal to High

#6 Updated by Tom Clegg over 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-01-28 Sprint

#7 Updated by Brett Smith over 4 years ago

  • Assigned To set to Brett Smith

#8 Updated by Brett Smith over 4 years ago

When Python writes tracebacks, the exception line is formatted like: "error.__class__: str(__init__ arguments)". If __init__ only got one argument, it writes that as a string instead of the one-element arguments tuple. This can be customized in the subclass, of course.

I think it makes sense for the Keep code to pass in a dictionary that maps Keep service roots to specific exceptions when it raises a KeepReadError or KeepWriteError. That way, clients catching the exception can readily introspect it. Is it acceptable for tracebacks to just include that raw dictionary string, or should we make the presentation more human-friendly? If so, should errors be sorted by service root? Should functionally identical exceptions be collapsed (e.g., report "got MemoryError writing to keep0 and keep1")? Any other constraints on the formatting I'm not thinking of?

#9 Updated by Tim Pierce over 4 years ago

Recording the errors in a dictionary for structured analysis makes a lot of sense to me. I don't feel strongly about the specific formatting; I think that for the end user it would be nice to see a formatted list like:

keep0.qr1hi.arvadosapi.com: MemoryError
keep1.qr1hi.arvadosapi.com: NotFoundError
keep2.qr1hi.arvadosapi.com: NotFoundError
keep3.qr1hi.arvadosapi.com: IOError: disk error

Collapsing errors doesn't seem particularly worthwhile to me unless we get to a point where it's reporting errors from dozens of servers.

#10 Updated by Brett Smith over 4 years ago

  • Status changed from New to In Progress

#11 Updated by Tom Clegg over 4 years ago

The code at d8148b7 looks good.

I'm having trouble installing/testing our Python packages, but that's a separate issue (trying again now after merging 9c2e673).

There is one error condition that I suspect will (still) give a misleading error message with the current code.
  1. Have 10 keepstores.
  2. Try to read block.
  3. Block is stored on first two keepstores (i.e., optimal probe order) but those two keepstores are temporarily unreachable due to a network error.
  4. The other 8 keepstores do not have the block, so 404.
  5. 80% (>75%) of the keepstores said 404, so we show the user just 404. We hide the (probably extremely helpful) information that the two keepstores where we know the block is most likely to have been stored are temporarily unreachable due to a network error.

Perhaps changing the reporting threshold from 80% to 100% would be a good move? (Or (N-1)/N -- with replication=2, "one server down, remainder assure 404" is pretty close to a solid 404.) Either way, now that we are reporting more error details, it seems less important for that 80% rule to make a distinction between "probably 404" and "probably not 404". Perhaps it's more valuable to make a distinction between "definitely 404" and "not definitely-404" -- and on that note, perhaps the error should be DoesNotExistError (or GoneError) rather than NotFoundError, if the caller is expected to interpret it as a permanent condition.

"A 404 status code does not indicate whether this lack of representation is temporary or permanent; the 410 (Gone) status code is preferred over 404 if the origin server knows, presumably through some configurable means, that the condition is likely to be permanent."
-- http://tools.ietf.org/html/rfc7231#section-6.5.4

To be clear, this shouldn't hold up the merge: we show the details either way. I'm just checking "should we tack this onto this story while we're here, or make a separate one?"

#12 Updated by Brett Smith over 4 years ago

Tom Clegg wrote:

There is one error condition that I suspect will (still) give a misleading error message with the current code.
  1. Have 10 keepstores.
  2. Try to read block.
  3. Block is stored on first two keepstores (i.e., optimal probe order) but those two keepstores are temporarily unreachable due to a network error.
  4. The other 8 keepstores do not have the block, so 404.
  5. 80% (>75%) of the keepstores said 404, so we show the user just 404. We hide the (probably extremely helpful) information that the two keepstores where we know the block is most likely to have been stored are temporarily unreachable due to a network error.

I'm not sure what you mean when you say we "hide the (probably extremely helpful) information." Like you noted below, the exception has the individual service errors just like the KeepReadErrors we raise, and they'll show up in the default stringification. I think the "upgrade" from generic KeepReadError to specific NotFoundError just reflects a greater confidence that the block doesn't exist, and we're discussing what level of confidence is appropriate.

Perhaps changing the reporting threshold from 80% to 100% would be a good move? (Or (N-1)/N -- with replication=2, "one server down, remainder assure 404" is pretty close to a solid 404.)

The GET has no way of knowing what replication level was requested when the block was stored, and based on recent user discussions, I don't think we should assume a particular value. I know most of the client code currently forces replication level 2, but I think now's a bad time to write more code that assumes that.

Either way, now that we are reporting more error details, it seems less important for that 80% rule to make a distinction between "probably 404" and "probably not 404". Perhaps it's more valuable to make a distinction between "definitely 404" and "not definitely-404"

I'm fine with implementing the 100% threshold if that's where you'd like to go.

on that note, perhaps the error should be DoesNotExistError (or GoneError) rather than NotFoundError, if the caller is expected to interpret it as a permanent condition.

"A 404 status code does not indicate whether this lack of representation is temporary or permanent; the 410 (Gone) status code is preferred over 404 if the origin server knows, presumably through some configurable means, that the condition is likely to be permanent."
-- http://tools.ietf.org/html/rfc7231#section-6.5.4

I don't think the caller should interpret is as permanent, though? The block could appear at any later time. The name's evocation of 404 seems apt.

#13 Updated by Tom Clegg over 4 years ago

Brett Smith wrote:

I'm not sure what you mean when you say we "hide the (probably extremely helpful) information."

Sorry, "hide" was wrong there. It's not an issue of hiding it, it's just a matter of when the distinction between "didn't get it that time" and "can't get it this way" error classes is helpful and when it's misleading.

The GET has no way of knowing what replication level was requested when the block was stored, and based on recent user discussions, I don't think we should assume a particular value. I know most of the client code currently forces replication level 2, but I think now's a bad time to write more code that assumes that.

I agree, 100% seems better for this reason.

on that note, perhaps the error should be DoesNotExistError (or GoneError) rather than NotFoundError, if the caller is expected to interpret it as a permanent condition.

"A 404 status code does not indicate whether this lack of representation is temporary or permanent; the 410 (Gone) status code is preferred over 404 if the origin server knows, presumably through some configurable means, that the condition is likely to be permanent."
-- http://tools.ietf.org/html/rfc7231#section-6.5.4

I don't think the caller should interpret is as permanent, though? The block could appear at any later time. The name's evocation of 404 seems apt.

I think the distinction we are trying to convey (FWIW, which might not be much) is not "block might become available here" vs. "block will never be available here", but rather "block cannot be retrieved due to some condition of the delivery mechanism" ("try again later") vs. "block cannot be retrieved because it is not stored in this storage system" ("try something besides waiting")... which is as permanent as anyone should really expect from a writable storage system!

It seems perfectly reasonable to me to set the threshold at 100% and call it a day. I'd say the term NotFoundError will be correctly interpreted by anyone who's been paying attention to the internet: "wait-and-try-later is not likely to fix this."

#14 Updated by Brett Smith over 4 years ago

Tom Clegg wrote:

It seems perfectly reasonable to me to set the threshold at 100% and call it a day.

Done in bd680fb and ready for another look.

I'd say the term NotFoundError will be correctly interpreted by anyone who's been paying attention to the internet: "wait-and-try-later is not likely to fix this."

Right. Like the bit of RFC you quoted says, "not found" isn't necessarily a temporary error either. :) Maybe a future story could add a method to KeepRequestError that returns a boolean value suggesting whether or not the error is worth retrying—that might be a nice affordance for users. But it's probably worth getting a little experience with the code in the current branch before we try to figure out something like that.

Thanks.

#15 Updated by Tom Clegg over 4 years ago

Looks great, thanks.

#16 Updated by Brett Smith over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:952bfa87465a27f83dca7feca7d369fda4200eb5.

Also available in: Atom PDF