Feature #21460

Updated by Brett Smith 3 months ago

When a spot instance is reclaimed, we don't want to treat it as an instance failure and retry immediately.

 Instead, we want to wait a little bit (should be configurable, maybe just add capacityErrorTTL to the config file?) before trying to acquire that instance type again. 

 When a container has indicated that it was cancelled because its instance was reclaimed, the spot instance type it was running on should be marked as "at capacity".

 If an attempt to allocate a spot instance fails with a "can't get spot instance" error we should also set "at capacity" state. 

 notes: 

 "The documentation":https://doc.arvados.org/v2.7/api/methods/containers.html#runtime_status suggests checking for the existence of the @preemptionNotice@ key in the container's @runtime_status@. This is what the runner does when a preemption notice arrives:

 <pre> 
runner.updateRuntimeStatus(arvadosclient.Dict{
	"warning":          "preemption notice",
	"warningDetail":    text,
	"preemptionNotice": text,
})
 </pre> 
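
 On the reading side, the check suggested by the documentation could look roughly like this. It is only a sketch: it assumes @runtime_status@ has already been decoded into a Go map, and the @preemptionNotice@ helper function is a hypothetical name, not an existing Arvados function.

```go
package main

import "fmt"

// preemptionNotice returns the text stored under the "preemptionNotice" key
// of a container's runtime_status, and whether the key was present at all.
// Hypothetical helper for illustration.
func preemptionNotice(runtimeStatus map[string]interface{}) (string, bool) {
	v, ok := runtimeStatus["preemptionNotice"]
	if !ok {
		return "", false
	}
	// The value is expected to be a string; format anything else generically.
	if s, isString := v.(string); isString {
		return s, true
	}
	return fmt.Sprint(v), true
}

func main() {
	// Example status shaped like the update shown above; the text is made up.
	status := map[string]interface{}{
		"warning":          "preemption notice",
		"warningDetail":    "example notice text",
		"preemptionNotice": "example notice text",
	}
	if text, ok := preemptionNotice(status); ok {
		fmt.Println("instance was reclaimed:", text)
	}
}
```

 Only the presence of the key matters for deciding whether to mark the instance type "at capacity"; the text is useful for logging.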
