Bug #14804

[keepstore] Return 5xx (not 4xx) if block is not found due to transient backend device failure

Added by Tom Clegg 9 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Target version:
Start date: 02/22/2019
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Currently, when keepstore tries to read a block and one Azure-backed volume returns a 503 error while all other volumes return 404, keepstore returns 404 to its client. Because 404 is a non-retryable error, the client gives up.

The correct behavior is to return a 502 or 503 status in this situation.

Azure error message:

storage: service returned error: StatusCode=503, ErrorCode=ServerBusy, ErrorMessage=The server is busy.
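To illustrate why the status code matters to the client, here is a minimal sketch of a retry decision. The function name retryable and the rule "retry on any 5xx, give up on everything else" are simplifying assumptions for illustration only, not the actual Keep client policy.

```go
package main

import "fmt"

// retryable is a hypothetical helper: it reports whether a client might
// reasonably retry a block read after receiving the given HTTP status.
// Assumption: 5xx means "the server could not answer right now" and is
// worth retrying; anything else (including 404) is treated as final.
func retryable(status int) bool {
	return status >= 500 && status <= 599
}

func main() {
	for _, s := range []int{404, 502, 503} {
		fmt.Printf("status %d retryable=%v\n", s, retryable(s))
	}
}
```

Under this assumption, a 404 caused by a busy backend ends the read attempt, while a 502 or 503 leaves the client free to try again.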


Subtasks

Task #14836: Review 14804-keepstore-transient-backend-errors (Resolved, Lucas Di Pentima)


Related issues

Related to Arvados - Bug #15118: [keepstore] Return 5xx (not 4xx) if block is not found due to transient backend device failure (New)

Associated revisions

Revision 7e5f0e9c
Added by Lucas Di Pentima 8 months ago

Merge branch '14804-keepstore-transient-backend-errors'
Closes #14804

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#2 Updated by Tom Morris 8 months ago

  • Target version changed from Arvados Future Sprints to 2019-02-27 Sprint

#3 Updated by Lucas Di Pentima 8 months ago

  • Assigned To set to Lucas Di Pentima

#4 Updated by Lucas Di Pentima 8 months ago

  • Status changed from New to In Progress

#5 Updated by Lucas Di Pentima 8 months ago

Updates at 601764a10 - branch 14804-keepstore-transient-backend-errors
Test run: https://ci.curoverse.com/job/developer-run-tests/1082/

Previously, when a block was requested and keepstore got errors from all of its volumes, it returned 404 to the client no matter which errors the volumes had returned.
Now, when keepstore receives a VolumeBusyError (a transient error) from a volume backend, it returns a 503 status so that the client can retry instead of mistakenly believing that the block is not there.
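The change described above can be sketched roughly as follows. This is not the code from the branch: errVolumeBusy and errNotFound are hypothetical stand-ins for keepstore's VolumeBusyError and its not-found error, and statusForGet is an illustrative name for the step that picks the response status once every volume has failed.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical stand-ins for the errors a volume backend might return.
var (
	errNotFound   = errors.New("block not found")
	errVolumeBusy = errors.New("volume busy") // transient backend failure
)

// statusForGet chooses the HTTP status to send after all volumes failed
// to return the block. If any volume reported a transient "busy" error,
// the block may still exist, so the answer is 503 rather than 404.
func statusForGet(volErrs []error) int {
	for _, err := range volErrs {
		if errors.Is(err, errVolumeBusy) {
			return http.StatusServiceUnavailable
		}
	}
	return http.StatusNotFound
}

func main() {
	// One busy volume among several not-found answers.
	mixed := []error{errNotFound, errVolumeBusy, errNotFound}
	fmt.Println(statusForGet(mixed))

	// Every volume agrees the block is absent.
	allMissing := []error{errNotFound, errNotFound}
	fmt.Println(statusForGet(allMissing))
}
```

The design point is that 404 is only a trustworthy answer when no volume failed transiently; a single busy volume is enough to make the result uncertain.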

#6 Updated by Lucas Di Pentima 8 months ago

Re-running developer-run-tests-remainder at: https://ci.curoverse.com/job/developer-run-tests-remainder/1117/

#7 Updated by Eric Biagiotti 8 months ago

Small nitpick: I would update the comment for TestGetHandler to include your test scenario. Otherwise, LGTM.

#8 Updated by Lucas Di Pentima 8 months ago

  • Status changed from In Progress to Resolved

#9 Updated by Tom Morris 8 months ago

  • Release set to 15

#10 Updated by Tom Clegg 6 months ago

  • Related to Bug #15118: [keepstore] Return 5xx (not 4xx) if block is not found due to transient backend device failure added
