Bug #19418

closed

[dispatch-lsf] Use InstanceTypes config to cancel containers with unsatisfiable requirements, not LSF "reason" string

Added by Tom Clegg over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release relationship: Auto

Description

Currently, dispatch-lsf checks LSF's PEND_REASON for the string "There are no suitable hosts for the job" and assumes this means the job will never run because no host is big enough to satisfy its requirements.

However, a user has reported that their LSF cluster also returns that string when a job is waiting in the queue for hardware that does indeed exist. Their LSF admins say there is no way to detect that a "pending" job is impossible to run for this reason.

Proposed solution:
  1. Remove the "no suitable hosts for the job" check.
  2. Instead, if InstanceTypes are defined in the Arvados config, call dispatchcloud.ChooseInstanceType() before submitting a job to LSF; if it returns an error (ConstraintsNotSatisfiableError), cancel the container.
  3. If ChooseInstanceType() succeeds, ignore which type was chosen, submit to LSF, and let LSF decide where to run it.
  4. If no InstanceTypes are defined in the Arvados config, just submit all containers and hope for the best.
  5. In the LSF install docs, explain that InstanceTypes are only used to determine whether a container is too big to run at all, and recommend adding enough InstanceTypes entries to make this work. This may be just one entry for the machine with the highest RAM, one for the machine with the highest CPU count, and one for the machine with the most scratch space.
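
The decision logic in steps 2 to 4 might be sketched as follows. This is a self-contained illustration, not the actual dispatch-lsf code: instanceType, constraints, chooseInstanceType, errConstraintsNotSatisfiable, and decide are hypothetical stand-ins, and the real dispatchcloud.ChooseInstanceType() and ConstraintsNotSatisfiableError have their own signatures and fields. The three example entries and their sizes are invented to mirror the "highest RAM / highest CPU / most scratch" suggestion in step 5.

```go
package main

import (
	"errors"
	"fmt"
)

// instanceType is a hypothetical, simplified stand-in for one
// InstanceTypes config entry (the real Arvados type has more fields).
type instanceType struct {
	Name    string
	VCPUs   int
	RAM     int64 // bytes
	Scratch int64 // bytes
}

// constraints is a hypothetical summary of a container's requirements.
type constraints struct {
	VCPUs   int
	RAM     int64 // bytes
	Scratch int64 // bytes
}

// errConstraintsNotSatisfiable stands in for ConstraintsNotSatisfiableError.
var errConstraintsNotSatisfiable = errors.New("constraints not satisfiable by any configured instance type")

// chooseInstanceType stands in for dispatchcloud.ChooseInstanceType().
// It returns the first entry that is big enough; which entry is returned
// does not matter here because the caller ignores it (step 3).
func chooseInstanceType(types []instanceType, c constraints) (instanceType, error) {
	for _, t := range types {
		if t.VCPUs >= c.VCPUs && t.RAM >= c.RAM && t.Scratch >= c.Scratch {
			return t, nil
		}
	}
	return instanceType{}, errConstraintsNotSatisfiable
}

type action string

const (
	actionSubmit action = "submit"
	actionCancel action = "cancel"
)

// decide applies steps 2 to 4 of the proposal to one container.
func decide(types []instanceType, c constraints) action {
	if len(types) == 0 {
		// Step 4: nothing configured, so submit and hope for the best.
		return actionSubmit
	}
	if _, err := chooseInstanceType(types, c); errors.Is(err, errConstraintsNotSatisfiable) {
		// Step 2: no configured type is big enough, so cancel.
		return actionCancel
	}
	// Step 3: some type fits; ignore which one and let LSF place the job.
	return actionSubmit
}

func main() {
	const GiB = int64(1) << 30
	// Invented example entries, one per "largest" dimension (step 5).
	types := []instanceType{
		{Name: "highram", VCPUs: 16, RAM: 512 * GiB, Scratch: 100 * GiB},
		{Name: "highcpu", VCPUs: 64, RAM: 128 * GiB, Scratch: 100 * GiB},
		{Name: "highscratch", VCPUs: 16, RAM: 128 * GiB, Scratch: 4096 * GiB},
	}
	fmt.Println(decide(types, constraints{VCPUs: 32, RAM: 64 * GiB, Scratch: 10 * GiB}))  // fits highcpu
	fmt.Println(decide(types, constraints{VCPUs: 128, RAM: 64 * GiB, Scratch: 10 * GiB})) // too many CPUs
	fmt.Println(decide(nil, constraints{VCPUs: 128, RAM: 64 * GiB, Scratch: 10 * GiB}))   // no InstanceTypes
}
```

With these invented entries, the first container is submitted, the second is cancelled, and the third is submitted because no InstanceTypes are configured. Note that a container needing both 64 CPUs and 512 GiB of RAM would be cancelled under this sketch, since each entry is checked as a whole; that is correct only if the entries describe real machines.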

Subtasks 1 (0 open, 1 closed)

Task #19444: Review 19418-lsf-unsatisfiable (Resolved, Stephen Smith, 10/04/2022)