Bug #20378

crunch-run maximum downtime tolerance

Added by Peter Amstutz about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Story points: -

Description

crunch-run is designed to tolerate API server downtime without losing work.

However, the various things that can time out need to be aligned.

A specific problem we ran into is saveLogCollection() setting a trash_at time of now + 12 hours. When the API server went down for longer than 12 hours, the containers continued running but the log collection was trashed. When the API server came back, crunch-run could not update the log collection.

There are probably other situations where the expectations of various components are not aligned on how long they are prepared to weather downtime. Components should:

1) Use a consistent value for downtime tolerance.

2) Attempt to shut down in an orderly fashion when downtime is exceeded.

Actions #1

Updated by Peter Amstutz about 1 year ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz about 1 year ago

  • Category set to Crunch
  • Subject changed from crunch- to crunch-run maximum downtime tolerance
Actions #3

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz about 1 year ago

  • Status changed from In Progress to New
Actions #5

Updated by Brett Smith about 1 year ago

I think downtime tolerance should err on the long side, and perhaps be a function of how long the job has run, maybe with a cap on minimum and maximum. Having a compute node sitting around for the API server to come back is expensive and annoying, but it's not nearly as annoying as losing a week's worth of compute because the API server was unreachable for a few hours at the end of a job.
