Bug #20614

Updated by Peter Amstutz 9 months ago

If container_count > 1 then Workbench 2 renders a message like "Warning: Process retried 1 time due to failure." 


 The problem is that the log collection doesn't seem to have any record of the 1st attempt. 

 We need to figure out why it is not including the 1st failure in the log collection (and then maybe fix what's actually failing). 


 Actually, the first failure was recorded (although it is unclear why it failed). The new container was cancelled before it started, so it never created any logs to record, which results in a confusing log collection that only shows one container (the old one). 

 We need a way to communicate this situation better. 


 Need to figure out why these processes are actually getting killed. The API server was at high load; maybe this is causing things to time out, so that the containers are somehow considered abandoned? 


 The process wasn't killed; it was cancelled (on purpose). But several weird things happened: 

 # The log collection scale-4zz18-z33wfnfdga5wroy has incomplete log files at the root, but complete log files in the subdirectory "log for container scale-dz642-c9o553ezb91wbfl/" 
 # These are log files for the _same container_ as evidenced by them having the same timestamps 
 # So there was only one (real) container, which was running normally, until it was cancelled 
 # At that point something weird happened 
 # It picked up the partial collection as the logs from the "last" container 
 # It incremented the container count 
 # It created a new container scale-dz642-5x9d8kfjpcdo3xx which ends up at priority 0 and state Queued, and has no log 

 My suspicion is that there is a race happening in container.rb#handle_completed  

 * When the state goes to cancelled, it looks for container requests with priority > 0 
 * Container requests under the root container request have their own priority, but this is normally ignored unless priority = 0 (which cancels a subtree). 
 * update_priorities only updates containers, not container requests 
 * So it seems like there's a window where the container is cancelled, but the container requests are still 'Committed' (they haven't yet moved to 'Final'), and in that window the cancelled containers get treated as retries (because "cancelled" means "did not finish" for any reason). 
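
 The suspected window can be sketched in plain Ruby. This is only an illustration of the reasoning above, not the actual container.rb code: the Struct classes and the retryable_requests helper are hypothetical stand-ins, and the assumption is that the existing check looks only at the request's own state and priority.

```ruby
# Hypothetical stand-ins for illustration; these are not the real Arvados models.
Container = Struct.new(:uuid, :state, :priority)
ContainerRequest = Struct.new(:uuid, :state, :priority, :container_uuid, :container_count)

# Assumed shape of the current check: when a container is cancelled, any
# Committed request with priority > 0 that points at it is considered retryable.
def retryable_requests(container, requests)
  return [] unless container.state == "Cancelled"
  requests.select do |cr|
    cr.container_uuid == container.uuid && cr.state == "Committed" && cr.priority > 0
  end
end

container = Container.new("c1", "Running", 500)
child_request = ContainerRequest.new("cr1", "Committed", 500, "c1", 1)

# Cancelling on purpose: the container priority drops to 0 and the container
# is cancelled, but the child request has not yet moved to Final and still
# carries its own priority > 0.
container.priority = 0
container.state = "Cancelled"

retried = retryable_requests(container, [child_request])
retried.each { |cr| cr.container_count += 1 }

puts "retried: #{retried.map(&:uuid).inspect}, container_count: #{child_request.container_count}"
```

 In this sketch the intentionally cancelled container is picked up as a retry and container_count is incremented, which matches the symptoms listed above.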

 I think this means, when checking for retryable requests, if a request has a requesting_container, we need to also check that the requesting container has priority > 0. 

 Or maybe it is simpler: we should just check for Cancelled && priority > 0 on the container directly. If the container priority is 0, we can assume there are no live container requests for it?
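
 As a rough comparison of the two ideas, here is a minimal sketch in plain Ruby. The Struct classes and both predicate names are hypothetical and exist only for illustration (the request holds a direct reference to its requesting container here, which is a simplification), so this shows the intended logic rather than a patch for container.rb.

```ruby
# Hypothetical stand-ins for illustration; these are not the real Arvados models.
Container = Struct.new(:uuid, :state, :priority)
ContainerRequest = Struct.new(:uuid, :state, :priority, :requesting_container)

# Option 1: a request is retryable only if it is Committed with priority > 0
# and its requesting container (if it has one) also has priority > 0.
def retryable_by_requesting_container?(cr)
  return false unless cr.state == "Committed" && cr.priority > 0
  parent = cr.requesting_container
  parent.nil? || parent.priority > 0
end

# Option 2: look only at the cancelled container itself and retry only
# when its priority is still > 0.
def retryable_by_container_priority?(container)
  container.state == "Cancelled" && container.priority > 0
end

live_parent = Container.new("parent-live", "Running", 500)
cancelled_parent = Container.new("parent-cancelled", "Cancelled", 0)

cr_live = ContainerRequest.new("cr-live", "Committed", 500, live_parent)
cr_stale = ContainerRequest.new("cr-stale", "Committed", 500, cancelled_parent)
cr_root = ContainerRequest.new("cr-root", "Committed", 500, nil)

puts retryable_by_requesting_container?(cr_live)
puts retryable_by_requesting_container?(cr_stale)
puts retryable_by_container_priority?(Container.new("c-user-cancelled", "Cancelled", 0))
```

 Option 2 is the smaller change, but it depends on the assumption questioned above: that a container at priority 0 has no live container requests.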