Bug #19418

closed

[dispatch-lsf] Use InstanceTypes config to cancel containers with unsatisfiable requirements, not LSF "reason" string

Added by Tom Clegg about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-
Release relationship:
Auto

Description

Currently dispatch-lsf checks LSF's PEND_REASON for the string "There are no suitable hosts for the job" and assumes this means the job will never run because no host is big enough to satisfy its requirements.

However, a user has reported that their LSF cluster also returns that string when a job is waiting in the queue for hardware that does indeed exist. Their LSF admins say there is no way to detect that a "pending" job is impossible to run for this reason.

Proposed solution:
  1. remove the "no suitable hosts for the job" check
  2. instead, if InstanceTypes are defined in the arvados config, call dispatchcloud.ChooseInstanceType() before submitting a job to LSF; if it returns an error (ConstraintsNotSatisfiableError), cancel the container
  3. if ChooseInstanceType() is successful, ignore which type was chosen, submit to LSF and let LSF decide where to run it
  4. if no InstanceTypes are defined in the arvados config, just submit all containers and hope for the best
  5. in the LSF install docs, explain that InstanceTypes are only used to determine whether a container is too big to run at all, and recommend adding enough InstanceTypes entries to make this work. This may be just one entry for the machine with the highest RAM, one for the machine with the highest CPU count, and one for the machine with the most scratch space.

Subtasks 1 (0 open, 1 closed)

Task #19444: Review 19418-lsf-unsatisfiable (Status: Resolved, Assignee: Stephen Smith, Date: 10/04/2022)
Actions #1

Updated by Tom Clegg about 2 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz about 2 years ago

  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg about 2 years ago

  • Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Actions #4

Updated by Carlos Fenoy about 2 years ago

The proposed solution seems sensible to me

Actions #5

Updated by Peter Amstutz about 2 years ago

  • Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Actions #6

Updated by Tom Clegg about 2 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Stephen Smith about 2 years ago

Lgtm!

Actions #8

Updated by Tom Clegg about 2 years ago

  • Status changed from In Progress to Resolved
Actions #9

Updated by Peter Amstutz almost 2 years ago

  • Release set to 47