Feature #18298

closed

Feedback when container can't be scheduled on LSF

Added by Peter Amstutz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: -
Target version: 2021-11-24 sprint
Story points: -
Release relationship: Auto

Description

If a container requests excessive resources that can never be fulfilled, the dispatcher should give the user visible feedback via the Arvados logs instead of leaving the container queued forever.


Files

test18298-9tee4.png (29.6 KB), Tom Clegg, 11/22/2021 04:57 PM

Subtasks 1 (0 open, 1 closed)

Task #18360: Review 18298-lsf-no-suitable-hosts (Resolved, Tom Clegg, 11/18/2021)
Actions #1

Updated by Peter Amstutz over 2 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz over 2 years ago

  • Category set to Crunch
  • Description updated (diff)
  • Subject changed from Feedback when container can't be scheduled to Feedback when container can't be scheduled on LSF
Actions #3

Updated by Peter Amstutz over 2 years ago

  • Status changed from In Progress to New
  • Category changed from Crunch to 0
Actions #4

Updated by Peter Amstutz over 2 years ago

  • Release set to 45
  • Category deleted (0)
Actions #5

Updated by Peter Amstutz over 2 years ago

  • Target version changed from 2021-11-10 sprint to 2021-11-24 sprint
Actions #6

Updated by Peter Amstutz over 2 years ago

  • Tracker changed from Bug to Feature
Actions #7

Updated by Peter Amstutz over 2 years ago

  • Assigned To set to Tom Clegg
Actions #8

Updated by Tom Clegg over 2 years ago

Hm.

Here is `bjobs -UF` ("unformatted") for a job that will run when another job finishes:

Job <22701>, Job Name <aaaaa-aaaaa-aaaaaaaaaab>, User <tom>, Project <default>, Status <PEND>, Queue <normal>, Command <sleep 120>
Fri Nov 12 15:54:26: Submitted from host <9tee4.arvadosapi.com>, CWD <$HOME>, Requested Resources <rusage[mem=8000.00] span[hosts=1]>;
 PENDING REASONS:
 Job requirements for reserving resource (mem) not satisfied: 2 hosts;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] rusage[mem=8000.00] span[hosts=1]
 Effective: -

Here is a job that will never run because there are no nodes with this much memory:

Job <22735>, Job Name <aaaaa-aaaaa-aaaaaaaaaag>, User <tom>, Project <default>, Status <PEND>, Queue <normal>, Command <sleep 120>
Fri Nov 12 15:57:44: Submitted from host <9tee4.arvadosapi.com>, CWD <$HOME>, Requested Resources <rusage[mem=8000000.00] span[hosts=1]>;
 PENDING REASONS:
 Job requirements for reserving resource (mem) not satisfied: 2 hosts;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] rusage[mem=8000000.00] span[hosts=1]
 Effective: -
Actions #9

Updated by Tom Clegg over 2 years ago

If we add `-R 'select[mem>8000000M]'` to the bsub arguments, `bjobs -UF` says "New job is waiting for scheduling" for a while, then:

Job <22930>, Job Name <aaaaa-aaaaa-aaaaaaaaaah>, User <tom>, Project <default>, Status <PEND>, Queue <normal>, Command <sleep 120>
Fri Nov 12 16:11:05: Submitted from host <9tee4.arvadosapi.com>, CWD <$HOME>, Requested Resources < select[mem>8000000.00] rusage[mem=8000000.00] span[hosts=1]>;
 PENDING REASONS:
 There are no suitable hosts for the job;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[(mem>8000000.00) && (type == local)] order[r15s:pg] rusage[mem=8000000.00] span[hosts=1]
 Effective: -
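
For illustration, a minimal sketch of a submission built this way (hypothetical code, not the actual crunch-dispatch-lsf implementation; the function name, memory units, and exact resource strings are assumptions, and whether the numbers are MB depends on the cluster's LSF_UNIT_FOR_LIMITS setting):

// Hypothetical sketch, not the real dispatcher code: submit a job with both
// a rusage[] reservation and a matching select[] requirement, so that an
// impossible request pends with "There are no suitable hosts for the job"
// rather than the vaguer reservation message shown above.
package dispatch

import (
	"fmt"
	"os/exec"
)

// submitSleep submits a placeholder job like the ones in the examples above.
func submitSleep(jobName string, memMB int) error {
	args := []string{
		"-J", jobName,
		// reserve the memory on a single host
		"-R", fmt.Sprintf("rusage[mem=%d] span[hosts=1]", memMB),
		// and additionally require a host that actually has that much memory
		"-R", fmt.Sprintf("select[mem>=%d]", memMB),
		"sleep", "120",
	}
	return exec.Command("bsub", args...).Run()
}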
Actions #10

Updated by Tom Clegg over 2 years ago

  • Status changed from New to In Progress
Actions #11

Updated by Tom Clegg over 2 years ago

18298-lsf-no-suitable-hosts @ f6e8d7c2cada1570bac3e98f0712ad8651b8d9fd -- developer-run-tests: #2810

If LSF reports the job status is PEND and the reason contains the magic string "There are no suitable hosts for the job", cancel the container and copy the reason text into runtime_status["errors"].
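
A minimal sketch of that logic (illustrative only; the type and function names below are made up, and the real implementation in the branch above differs in detail):

// Illustrative sketch of the scheduling-feedback check described above;
// not the actual crunch-dispatch-lsf code.
package dispatch

import "strings"

const noSuitableHosts = "There are no suitable hosts for the job"

// container is a stand-in for the dispatcher's view of an Arvados container.
type container struct {
	UUID          string
	State         string
	RuntimeStatus map[string]interface{}
}

// checkPendReason cancels a container whose LSF job can never be scheduled,
// copying the scheduler's reason text where the user can see it.
func checkPendReason(ctr *container, lsfState, pendReason string) {
	if lsfState == "PEND" && strings.Contains(pendReason, noSuitableHosts) {
		ctr.RuntimeStatus["errors"] = pendReason
		ctr.State = "Cancelled"
	}
}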

Actions #12

Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote:

18298-lsf-no-suitable-hosts @ f6e8d7c2cada1570bac3e98f0712ad8651b8d9fd -- developer-run-tests: #2810

If LSF reports the job status is PEND and the reason contains the magic string "There are no suitable hosts for the job", cancel the container and copy the reason text into runtime_status["errors"].

  • Getting JSON structured output out of bjobs is nice (see the sketch at the end of this comment).
  • It seems unfortunate that in the first case it tells you what resource can't be reserved (mem), but not in the second case.
  • Were you able to test this branch on 9tee4 already, or should we plan to do a quick manual test after it is merged and auto-deployed to 9tee4?
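
A sketch of what reading structured bjobs output could look like (assumptions: an LSF version whose bjobs accepts -json together with -o fields including pend_reason, and JSON record keys matching the uppercase field names; the actual dispatcher code may differ):

// Hypothetical sketch of polling LSF for job state and pending reason via
// bjobs' JSON output. Verify the supported -o fields and the JSON key names
// against your LSF version before relying on this.
package dispatch

import (
	"encoding/json"
	"os/exec"
)

type bjobsRecord struct {
	JobID      string `json:"JOBID"`
	Stat       string `json:"STAT"`
	JobName    string `json:"JOB_NAME"`
	PendReason string `json:"PEND_REASON"`
}

type bjobsOutput struct {
	Records []bjobsRecord `json:"RECORDS"`
}

// listJobs returns one record per job, including why a PEND job is pending.
func listJobs() ([]bjobsRecord, error) {
	out, err := exec.Command("bjobs", "-u", "all",
		"-o", "jobid stat job_name pend_reason", "-json").Output()
	if err != nil {
		return nil, err
	}
	var parsed bjobsOutput
	if err := json.Unmarshal(out, &parsed); err != nil {
		return nil, err
	}
	return parsed.Records, nil
}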
Actions #13

Updated by Tom Clegg over 2 years ago

Peter Amstutz wrote:

  • It seems unfortunate that in the first case it tells you what resource can't be reserved (mem), but not in the second case.

Yeah, it seems a bit backwards.

  • Were you able to test this branch on 9tee4 already, or should we plan to do a quick manual test after it is merged and auto-deployed to 9tee4?

I have not tested it with real LSF. Doing that after auto-deploy sounds worthwhile, yes.

Actions #15

Updated by Tom Clegg over 2 years ago

Couple of caveats:
  • Thanks to auto-retry after cancel, this has to happen 3 times before giving up on the container request
  • If the admin has configured an explicit BsubArgumentsList based on an old config.default.yml file, the new arguments won't get passed to bsub and they will get the old "queued forever" behavior. Should we reconsider the previous (pre #18290) approach of appending the configured args to the defaults rather than replacing the defaults? Or just add an upgrade note?
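
For context, a hypothetical excerpt of a cluster config carrying an out-of-date override (the key path and %-placeholders follow the Arvados config layout, but the argument list shown is illustrative, not the shipped default; see config.default.yml for the real one):

# Hypothetical excerpt, not the shipped defaults. An explicit override like
# this, copied from an older config.default.yml, keeps being used verbatim,
# so newer select[...] arguments never reach bsub and impossible jobs stay
# queued forever. Update the override, or delete it to pick up the current
# defaults.
Clusters:
  xxxxx:
    Containers:
      LSF:
        BsubArgumentsList:
          - "-J"
          - "%U"
          - "-R"
          - "rusage[mem=%MMB] span[hosts=1]"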
Actions #16

Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote:

Couple of caveats:
  • Thanks to auto-retry after cancel, this has to happen 3 times before giving up on the container request

This seems inelegant but not actually a problem?

  • If the admin has configured an explicit BsubArgumentsList based on an old config.default.yml file, the new arguments won't get passed to bsub and they will get the old "queued forever" behavior. Should we reconsider the previous (pre #18290) approach of appending the configured args to the defaults rather than replacing the defaults? Or just add an upgrade note?

Upgrade note is fine. I don't know if anyone is actually using LSF yet. We can communicate about the upgrade to customers.

Actions #17

Updated by Tom Clegg over 2 years ago

  • Status changed from In Progress to Resolved
Peter Amstutz wrote:

  • Thanks to auto-retry after cancel, this has to happen 3 times before giving up on the container request

This seems inelegant but not actually a problem?

Right.

Peter Amstutz wrote:

Upgrade note is fine. I don't know if anyone is actually using LSF yet. We can communicate about the upgrade to customers.

Added upgrade note & merged.
