Bug #14495

Updated by Ward Vandewege over 5 years ago

 
 Container request d79c1-xvhdp-28emqt3jby9s2a8 appeared to be stuck: its child container requests remained in the Queued state after many hours. 

 What was actually happening was that the child container d79c1-dz642-3apw65ik2snziqh was scheduled on a compute node that didn't have sufficient scratch space available to load the (large) docker image: 

 <pre> 
 2018-11-14T13:33:31.066367347Z Docker response: {"errorDetail":{"message":"Error processing tar file(exit status 1): write /195522a76483d96f4cc529d0b11e5e840596992eddfb89f4f86499a4af381832/layer.tar: no space left on device"},"error":"Error processing tar file(exit status 1): write /195522a76483d96f4cc529d0b11e5e840596992eddfb89f4f86499a4af381832/layer.tar: no space left on device"} 
 2018-11-14T13:33:31.066643166Z Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.d79c1-dz642-3apw65ik2snziqh.694256924/keep198730187] 
 </pre> 

 This kept happening, and because the container could never be started, it remained in the Queued state in Workbench, with the only hint being the lines above in the logs. 

 The workaround is easy: specify a large enough tmpdirMin in the workflow (see the sketch below). But it is far too hard for the user to figure that out on their own. 
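
 For reference, a minimal sketch of that workaround in a CWL tool definition. The 20480 MiB figure is only an illustrative value, assumed to be large enough to hold the unpacked docker image plus the container's temporary files: 

 <pre> 
 # Hypothetical excerpt from a CWL CommandLineTool; 20480 is an example value.
 requirements:
   ResourceRequirement:
     tmpdirMin: 20480    # minimum scratch space, in mebibytes (MiB)
 </pre> 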

 We should probably error out immediately when this happens, and we need to make it clear to the user what the actual problem is. 

 Or maybe we can take the size of the docker image into account before allocating a job to a compute node? That would be even better.
