Bug #17293

Updated by Peter Amstutz about 3 years ago

A customer reported that a long-lived (multi-day) workflow failed and was automatically restarted. The failure turned out to be caused by 404 errors when trying to update the log collection.

 It was determined that the log collection had been put in the trash. The @trash_at@/@delete_at@ times are intended to be pushed into the future each time the log collection is updated, but some interaction appears to prevent that from working as intended.

 crunchrun.go: 

 <pre> 
 func (runner *ContainerRunner) saveLogCollection(final bool) (response arvados.Collection, err error) { 
 ... 
	 mt, err := runner.LogCollection.MarshalManifest(".") 
	 if err != nil { 
		 err = fmt.Errorf("error creating log manifest: %v", err) 
		 return 
	 } 
	 updates := arvadosclient.Dict{ 
		 "name":            "logs for " + runner.Container.UUID, 
		 "manifest_text": mt, 
	 } 
	 if final { 
		 updates["is_trashed"] = true 
	 } else { 
		 exp := time.Now().Add(crunchLogUpdatePeriod * 24) 
		 updates["trash_at"] = exp 
		 updates["delete_at"] = exp 
	 } 
 </pre> 

 There were reports of Keep write failures. A failed manifest write returns before the collection is updated, so the expiry time is not extended; after 24 consecutive failures the collection would hit its @trash_at@ time. It might be a good idea to tweak the behavior so that if there is an error writing the manifest and @runner.logUUID != ""@, it proceeds to update the collection anyway but does not include the manifest text in the update.

 Update: this seems to be the case: 

 > Indeed keep was unable to write for over 12 hours. It seems this was a consequence of how we start/stop keep on the compute nodes for Arvados jobs and an issue we had in Slurm where it was unresponsive for some time. Keep may have been prematurely stopped on the compute node causing the issue. It was then started later (too late), and that caused the issue. I'll be working on fixing the way we stop keep to be more robust to some Slurm errors. 

 We should adjust the behavior as described so that it doesn't get trashed by accident this way. 
