Bug #19418

Updated by Tom Clegg over 1 year ago

Currently dispatch-lsf checks LSF's PEND_REASON for the string "There are no suitable hosts for the job" and assumes this means the job will never run because there are no hosts big enough to satisfy its requirements.

 However, a user has reported that their LSF cluster also returns that string when a job is waiting in the queue for hardware that does indeed exist. Their LSF admins say there is no way to detect that a "pending" job is impossible to run for this reason. 

 Proposed solution: 
 # remove the "no suitable hosts for the job" check 
 # instead, if InstanceTypes are defined in the arvados config, call dispatchcloud.ChooseInstanceType() before submitting a job to LSF; if it returns an error (ConstraintsNotSatisfiableError), cancel the container 
 # if ChooseInstanceType() is successful, ignore which type was chosen, submit to LSF and let LSF decide where to run it 
 # if no InstanceTypes are defined in the arvados config, just submit all containers and hope for the best 
 # in the LSF install docs, explain that InstanceTypes are only used to determine whether a container is too big to run at all, and recommend adding enough InstanceTypes entries that this strategy gives a good prediction of whether LSF will run a job. Often this will be just one entry for the machine with the highest RAM, one for the machine with the highest CPU count, and one for the machine with the most scratch space.
