Bug #22455
Closed
error in CaptureOutput: Could not write sufficient replicas: [503] volume unavailable
Description
A user reported this error from CaptureOutput:
error in CaptureOutput: Could not write sufficient replicas: [503] volume unavailable
This error is produced by keepstore when it cannot write a block.
From discussion at standup 2025-01-09:
It looks like we have two keepstore clients. One is set to retry the API up to 10 times and keepclient up to 4 times, but the other uses only the Go SDK defaults, which are ? API retries and 2 keepclient retries.
The number of retries should probably be bumped up, following the logic from here:
https://doc.arvados.org/v3.0/admin/upgrading.html#v2_6_3
Also, it would be helpful if the error message reported how long was spent retrying and how many attempts were made.
Also, we should set LocalKeepLogsToContainerLog to "errors" (#22456) so that we can see the underlying error.
Updated by Peter Amstutz 3 months ago
- Related to Feature #22456: LocalKeepLogsToContainerLog is important for debugging and we should make it more visible added
Updated by Peter Amstutz 3 months ago
- Target version changed from Development 2025-01-29 to Development 2025-02-12
Updated by Peter Amstutz about 2 months ago
- Target version changed from Development 2025-02-12 to Development 2025-02-26
Updated by Tom Clegg about 2 months ago
CaptureOutput uses runner.ContainerKeepClient, which is initialized by calling runner.MkArvClient. That is an inline func in NewContainerRunner, and it was using the default 2 retries.
Increased that to 10.
Also increased Retries to 10 for the API and Keep clients used elsewhere in crunch-run (e.g., retrieving and updating the container record).
22455-crunch-run-retry @ 188c4051ca58eebbb01322905ff64aa138622bf6 -- developer-run-tests: #4663
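For readers skimming the change, the idea can be sketched roughly as follows. This is not the code from the branch: apiClient, keepClient, newClients, and crunchRunRetries are hypothetical stand-ins that model only a Retries setting, to show one retry count being applied to every client so that no code path falls back to a lower default.

```go
package main

import "fmt"

// apiClient and keepClient are stand-ins for the real SDK client types.
// Only a Retries field is modeled here.
type apiClient struct{ Retries int }
type keepClient struct{ Retries int }

// crunchRunRetries mirrors the value described in the comment above (10).
const crunchRunRetries = 10

// newClients is a hypothetical constructor that applies the same retry
// count to both clients instead of relying on per-client defaults.
func newClients(retries int) (*apiClient, *keepClient) {
	return &apiClient{Retries: retries}, &keepClient{Retries: retries}
}

func main() {
	api, kc := newClients(crunchRunRetries)
	fmt.Printf("API retries: %d, Keep retries: %d\n", api.Retries, kc.Retries)
}
```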
Updated by Peter Amstutz about 2 months ago
LGTM, so I went ahead and merged it.
Updated by Peter Amstutz about 2 months ago
- Status changed from In Progress to Resolved