Bug #12891

Updated by Tom Clegg over 6 years ago

h3. Bug 1 

 In some cases, after a container request is cancelled, the container's @log@ and the container request's @log_uuid@ are both null. This makes it impossible for crunchstat-summary to analyze the portion of the job which did run. Example: su92l-xvhdp-4j98m0zgu9xst51 (https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-4j98m0zgu9xst51#Advanced)

 The intended/expected behavior is for the log to be saved in Keep, and accessible via container_request.log_uuid, just as it is when the container exits non-zero. 
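 The symptom in Bug 1 can be written as a small check. This is only a sketch: plain dicts stand in for the API records, and @cancelled_without_log@ is a hypothetical helper name, not part of any Arvados SDK. The field names @log@, @log_uuid@ and the @Cancelled@ state are the ones mentioned in this issue.

```python
def cancelled_without_log(container_request, container):
    """Return True when a cancelled container left no log behind.

    Both arguments are plain dicts standing in for API records
    (an assumption made for illustration only).
    """
    if container.get("state") != "Cancelled":
        return False
    # The bug: both fields are null, so there is nothing to analyze.
    return container.get("log") is None and container_request.get("log_uuid") is None


# Hypothetical records showing the buggy and the expected outcome.
buggy = cancelled_without_log({"log_uuid": None}, {"state": "Cancelled", "log": None})
expected = cancelled_without_log(
    {"log_uuid": "placeholder-collection-uuid"},
    {"state": "Cancelled", "log": "placeholder-portable-data-hash"},
)
print(buggy, expected)
```

 With the expected behavior, the second case is what a cancelled container should look like: the partial log is saved and reachable through @log_uuid@.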

 h3. Bug 2 

 crunchstat-summary should be able to do some analysis on containers that haven't saved a log to Keep (because they're still running, or because of a bug like this). It can do this with crunch1 jobs.

 h3. Bug 1 analysis

 Additional info from #12893: 

 crunch-run tries to save a log file after the container ends, regardless of final state, but sometimes this doesn't work. Examples:

 * Worked: 9tee4-dz642-r8lk4a9xcwdazs7
 * Didn't work: su92l-xvhdp-4j98m0zgu9xst51

 Some possible explanations:

 * crunch-dispatch-slurm cancels the slurm job as soon as it notices the container priority is zero. crunch-run catches SIGTERM and tries to write the buffered output and logs, but (according to sample logs) gives up 30-40 seconds later without actually writing them, even if writing the partial output/logs takes longer than that, because slurm kills crunch-run with SIGKILL when the slurm KillWait timer (30s here) runs out.
 * Even if crunch-run gets that far, it seems apiserver would refuse to update the output or log field of a container whose state is Cancelled. (It would be helpful to check server logs to determine whether crunch-run is in fact getting as far as calling collections#create and containers#update.)

 crunch-dispatch-slurm is using scancel to notify crunch-run that it should terminate the container and cancel gracefully. The KillWait behavior is not desirable. Instead of using plain "scancel", which gets impatient and kills crunch-run forcefully, we want "scancel --signal=TERM", which merely sends the given signal to crunch-run.
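 The difference between the two scancel invocations can be sketched as follows. This is an illustration only, not the crunch-dispatch-slurm code: @scancel_command@ is a hypothetical helper, and selecting the job with @--name@ assumes the slurm job is named after the container UUID. Whether further scancel options are needed for the signal to reach crunch-run is not covered here.

```python
import shlex


def scancel_command(container_uuid, signal_name=None):
    """Build an scancel argv for the slurm job running a container.

    Assumption: the job name equals the container UUID, so --name
    selects it. With signal_name=None this is plain "scancel", which
    cancels the job and is subject to the KillWait timer. With a
    signal name, scancel only delivers that signal.
    """
    argv = ["scancel", "--name=" + container_uuid]
    if signal_name is not None:
        argv.append("--signal=" + signal_name)
    return argv


# UUID taken from the "Worked" example in this issue.
plain = scancel_command("9tee4-dz642-r8lk4a9xcwdazs7")
graceful = scancel_command("9tee4-dz642-r8lk4a9xcwdazs7", "TERM")
print(shlex.join(plain))
print(shlex.join(graceful))
```

 Only the second form leaves crunch-run in control of how long it takes to flush output and logs after receiving SIGTERM.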
