Bug #20378

open

crunch-run maximum downtime tolerance

Added by Peter Amstutz 12 months ago. Updated 11 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Story points: -

Description

crunch-run is designed to tolerate API server downtime without losing work.

However, the various things that can time out need to be aligned.

A specific problem we ran into is that saveLogCollection() sets a trash_at time of now + 12 hours. When the API server went down for longer than 12 hours, the containers continued running but the log collection became trashed. When the API server came back, crunch-run could no longer update the log collection.

There are probably other situations where the expectations of various components are not aligned on how long they are prepared to weather downtime. Components should:

1) Use a consistent value for downtime tolerance.

2) Attempt to shut down in an orderly fashion when downtime is exceeded.
