Bug #20378

Updated by Peter Amstutz about 1 year ago

crunch-run is designed to tolerate API server downtime without losing work. 

 However, the various timeouts involved need to be aligned with one another. 

 A specific problem we ran into is that saveLogCollection() sets a trash_at time of now + 12 hours. When the API server went down for longer than 12 hours, the containers continued running but the log collection was trashed. When the API server came back, the log collection could no longer be updated. 

 There are probably other situations where the expectations of various components are not aligned on how long they are prepared to weather downtime. Components should: 

 1) Use a consistent value for downtime tolerance. 

 2) Attempt to shut down in an orderly fashion when downtime is exceeded. 
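
 As a rough illustration of the two points above (not crunch-run's actual code, which is written in Go), the idea could be sketched as follows. Every name here (DOWNTIME_TOLERANCE, DowntimeTracker, log_trash_at, the margin parameter) is hypothetical and chosen only for the example. The 12 hour value mirrors the trash_at offset mentioned above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical shared setting: the one value every component would consult
# instead of each hard-coding its own timeout.
DOWNTIME_TOLERANCE = timedelta(hours=12)


class DowntimeTracker:
    """Tracks how long the API server has been continuously unreachable."""

    def __init__(self, tolerance: timedelta = DOWNTIME_TOLERANCE):
        self.tolerance = tolerance
        self.outage_started: Optional[datetime] = None

    def record_success(self) -> None:
        # Any successful API call ends the current outage.
        self.outage_started = None

    def record_failure(self, now: datetime) -> None:
        # Only the first failure of an outage sets the start time.
        if self.outage_started is None:
            self.outage_started = now

    def exceeded(self, now: datetime) -> bool:
        # True once the outage has lasted longer than the tolerance,
        # which is the signal to begin an orderly shutdown.
        if self.outage_started is None:
            return False
        return now - self.outage_started > self.tolerance


def log_trash_at(
    now: datetime,
    tolerance: timedelta = DOWNTIME_TOLERANCE,
    margin: timedelta = timedelta(hours=1),
) -> datetime:
    # Derive trash_at from the same tolerance, plus a margin (assumed here
    # to be at least the interval between log saves) so the collection
    # outlives the longest outage the process is willing to ride out.
    return now + tolerance + margin
```

 A caller would invoke record_failure() or record_success() after each API request and check exceeded() periodically. Deriving trash_at from the same constant means the log collection cannot be trashed while the process still considers the outage tolerable.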
