Bug #19418

open

[dispatch-lsf] Use InstanceTypes config to cancel containers with unsatisfiable requirements, not LSF "reason" string

Added by Tom Clegg about 1 month ago. Updated about 12 hours ago.

Status: New
Priority: Normal
Assigned To:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time: (Total: 0.00 h)
Story points: -

Description

Currently dispatch-lsf checks LSF's PEND_REASON for the string "There are no suitable hosts for the job" and assumes this means the job will never run because no host is big enough to satisfy its requirements.

However, a user has reported that their LSF cluster also returns that string when a job is waiting in the queue for hardware that does indeed exist. Their LSF admins say there is no way to detect that a "pending" job is impossible to run for this reason.

Proposed solution:
  1. Remove the "no suitable hosts for the job" check.
  2. Instead, if InstanceTypes are defined in the Arvados config, call dispatchcloud.ChooseInstanceType() before submitting a job to LSF; if it returns an error (ConstraintsNotSatisfiableError), cancel the container.
  3. If ChooseInstanceType() succeeds, ignore which type was chosen, submit to LSF, and let LSF decide where to run it.
  4. If no InstanceTypes are defined in the Arvados config, just submit all containers and hope for the best.
  5. In the LSF install docs, explain that InstanceTypes are only used to determine whether a container is too big to run at all, and recommend adding enough InstanceTypes entries to make this work. This may be as few as three entries: one for the machine with the most RAM, one for the machine with the highest CPU count, and one for the machine with the most scratch space.
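
The decision logic in steps 2 to 4 could be sketched roughly as follows. This is a self-contained illustration only, not the dispatch-lsf implementation: InstanceType, Constraints, chooseInstanceType, ErrConstraintsNotSatisfiable and decide are hypothetical stand-ins for the Arvados config entries, the container's runtime constraints, dispatchcloud.ChooseInstanceType(), ConstraintsNotSatisfiableError and the dispatcher's submit/cancel step. The matching rule (a type fits if it meets or exceeds every requirement) is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// InstanceType is a hypothetical stand-in for one InstanceTypes config entry.
type InstanceType struct {
	Name    string
	VCPUs   int
	RAM     int64 // bytes
	Scratch int64 // bytes
}

// Constraints is a hypothetical stand-in for a container's resource requirements.
type Constraints struct {
	VCPUs   int
	RAM     int64 // bytes
	Scratch int64 // bytes
}

// ErrConstraintsNotSatisfiable plays the role of ConstraintsNotSatisfiableError
// in this sketch; it is not the real Arvados type.
var ErrConstraintsNotSatisfiable = errors.New("no configured instance type satisfies the constraints")

// chooseInstanceType returns the first configured type that meets or exceeds
// every requirement, or ErrConstraintsNotSatisfiable if none does.
func chooseInstanceType(types []InstanceType, c Constraints) (InstanceType, error) {
	for _, it := range types {
		if it.VCPUs >= c.VCPUs && it.RAM >= c.RAM && it.Scratch >= c.Scratch {
			return it, nil
		}
	}
	return InstanceType{}, ErrConstraintsNotSatisfiable
}

// Decision is what the dispatcher would do with a container.
type Decision string

const (
	Submit Decision = "submit"
	Cancel Decision = "cancel"
)

// decide applies steps 2 to 4 of the proposal.
func decide(types []InstanceType, c Constraints) Decision {
	if len(types) == 0 {
		// Step 4: nothing configured, so submit and hope for the best.
		return Submit
	}
	if _, err := chooseInstanceType(types, c); errors.Is(err, ErrConstraintsNotSatisfiable) {
		// Step 2: no host could ever run this container.
		return Cancel
	}
	// Step 3: the chosen type is ignored; LSF decides where to run the job.
	return Submit
}

func main() {
	const GiB = int64(1) << 30
	// Example values only; real entries would describe the cluster's largest hosts.
	types := []InstanceType{
		{Name: "bigmem", VCPUs: 8, RAM: 512 * GiB, Scratch: 100 * GiB},
		{Name: "manycore", VCPUs: 64, RAM: 128 * GiB, Scratch: 100 * GiB},
	}
	fmt.Println(decide(types, Constraints{VCPUs: 4, RAM: 16 * GiB, Scratch: 10 * GiB}))
	fmt.Println(decide(types, Constraints{VCPUs: 128, RAM: 1 * GiB}))
	fmt.Println(decide(nil, Constraints{VCPUs: 128, RAM: 1 * GiB}))
}
```

Note that under this matching rule a container asking for both 64 VCPUs and 512 GiB of RAM would be cancelled with the example entries above, because no single entry covers both. The configured entries should therefore reflect real hosts rather than per-resource maxima taken from different machines.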

Subtasks 1 (1 open, 0 closed)

Task #19444: Review (New, Stephen Smith)

Actions #1

Updated by Tom Clegg about 1 month ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 27 days ago

  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg 14 days ago

  • Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Actions #4

Updated by Carlos Fenoy 9 days ago

The proposed solution seems sensible to me.

Actions #5

Updated by Peter Amstutz about 12 hours ago

  • Target version changed from 2022-09-28 sprint to 2022-10-12 sprint