Bug #22455

closed

error in CaptureOutput: Could not write sufficient replicas: [503] volume unavailable

Added by Peter Amstutz 3 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Target version: Development 2025-02-26
Story points: -
Release relationship: Auto

Description

A user reported this error from CaptureOutput:

error in CaptureOutput: Could not write sufficient replicas: [503] volume unavailable

This error is produced by keepstore when it cannot write a block.

From discussion at standup 2025-01-09:

It looks like we have two keepstore clients. One is set to retry the API up to 10 times and keepclient up to 4 times, but the other uses only the Go SDK defaults, which are ? API retries and 2 keepclient retries.

The number of retries should probably be bumped up, following the logic from here:

https://doc.arvados.org/v3.0/admin/upgrading.html#v2_6_3

Also, it would be helpful if the error message reported something about how long it spent retrying and how many attempts were made.

Also, we should set LocalKeepLogsToContainerLog to "errors" (#22456) so that we can see the underlying error.


Subtasks 1 (0 open, 1 closed)

Task #22503: Review 22455-crunch-run-retry | Resolved | Peter Amstutz | 02/17/2025

Related issues 1 (1 open, 0 closed)

Related to Arvados - Feature #22456: LocalKeepLogsToContainerLog is important for debugging and we should make it more visible | New | Peter Amstutz
Actions #1

Updated by Peter Amstutz 3 months ago

  • Position changed from -933963 to -933954
Actions #2

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
Actions #6

Updated by Peter Amstutz 3 months ago

  • Related to Feature #22456: LocalKeepLogsToContainerLog is important for debugging and we should make it more visible added
Actions #7

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
Actions #8

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2025-01-29 to Development 2025-02-12
Actions #9

Updated by Peter Amstutz 2 months ago

  • Assigned To set to Tom Clegg
Actions #10

Updated by Peter Amstutz 2 months ago

  • Subtask #22503 added
Actions #11

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2025-02-12 to Development 2025-02-26
Actions #12

Updated by Tom Clegg about 2 months ago

CaptureOutput uses runner.ContainerKeepClient, which is initialized by calling runner.MkArvClient, which is an inline func in NewContainerRunner, which was using the default 2 retries.

Increased that to 10.

Also increased Retries to 10 for the API and Keep clients used elsewhere in crunch-run (e.g., retrieving and updating the container record).

22455-crunch-run-retry @ 188c4051ca58eebbb01322905ff64aa138622bf6 -- developer-run-tests: #4663

Actions #13

Updated by Tom Clegg about 2 months ago

  • Status changed from New to In Progress
Actions #14

Updated by Peter Amstutz about 2 months ago

LGTM, so I went ahead and merged it.

Actions #15

Updated by Peter Amstutz about 2 months ago

  • Status changed from In Progress to Resolved
Actions #16

Updated by Peter Amstutz about 1 month ago

  • Release set to 75