Idea #9616

[Crunch2] SLURM dispatcher ignores Containers it can't dispatch

Added by Brett Smith over 7 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
-
Story points:
-

Description

Bug report

Given this container request:

{
 "href":"/container_requests/9tee4-xvhdp-agu2zocga6tyv7u",
 "kind":"arvados#containerRequest",
 "etag":"74g8x93wxb6f1mly0woyun710",
 "uuid":"9tee4-xvhdp-agu2zocga6tyv7u",
 "owner_uuid":"9tee4-tpzed-3qslznhod3vdpug",
 "created_at":"2016-06-30T19:48:57.796884000Z",
 "modified_by_client_uuid":"9tee4-ozdt8-yft7aixdt2dx5ni",
 "modified_by_user_uuid":"9tee4-tpzed-3qslznhod3vdpug",
 "modified_at":"2016-06-30T19:48:57.796401000Z",
 "command":[
  "true"
 ],
 "container_count_max":null,
 "container_image":"arvados/jobs:latest",
 "container_uuid":"9tee4-dz642-pi27n3loindq935",
 "cwd":".",
 "description":null,
 "environment":{},
 "expires_at":null,
 "filters":null,
 "mounts":{
  "/out":{
   "kind":"tmp",
   "capacity":1000
  }
 },
 "name":"Brett test 2016-06-30a",
 "output_path":"/out",
 "priority":1,
 "properties":{},
 "requesting_container_uuid":null,
 "runtime_constraints":{},
 "state":"Committed"
}

And this container:

{
 "href":"/containers/9tee4-dz642-pi27n3loindq935",
 "kind":"arvados#container",
 "etag":"7wpsprryz63nv2znu6qpqk9az",
 "uuid":"9tee4-dz642-pi27n3loindq935",
 "owner_uuid":"9tee4-tpzed-000000000000000",
 "created_at":"2016-06-30T19:48:57.766176000Z",
 "modified_by_client_uuid":"9tee4-ozdt8-wt0x6s6j9yhycfh",
 "modified_by_user_uuid":"9tee4-tpzed-000000000000000",
 "modified_at":"2016-07-18T13:38:05.043051000Z",
 "command":[
  "true" 
 ],
 "container_image":"arvados/jobs:latest",
 "cwd":".",
 "environment":{},
 "exit_code":null,
 "finished_at":null,
 "locked_by_uuid":"9tee4-gj3su-n39vrgwxelusj7n",
 "log":null,
 "mounts":{
  "/out":{
   "kind":"tmp",
   "capacity":1000
  }
 },
 "output":null,
 "output_path":"/out",
 "priority":1,
 "progress":null,
 "runtime_constraints":{},
 "started_at":null,
 "state":"Locked",
 "auth_uuid":"…" 
}

crunch-dispatch-slurm tries to queue the container with a negative --mem-per-cpu value, and the submission fails. From the logs:

2016-06-30_19:49:26.22088 2016/06/30 19:49:26 Monitoring container 9tee4-dz642-pi27n3loindq935 started
2016-06-30_19:49:36.03180 2016/06/30 19:49:36 About to submit queued container 9tee4-dz642-pi27n3loindq935
2016-06-30_19:49:36.09338 2016/06/30 19:49:36 Error submitting container 9tee4-dz642-pi27n3loindq935 to slurm: Container submission failed [sbatch --share --parsable --job-name=9tee4-dz642-pi27n3loindq935 --mem-per-cpu=-9223372036854775808 --cpus-per-task=0 --priority=1]: exit status 1 [115 98 97 116 99 104 58 32 117 110 114 101 99 111 103 110 105 122 101 100 32 111 112 116 105 111 110 32 39 45 45 112 97 114 115 97 98 108 101 39 10 84 114 121 32 34 115 98 97 116 99 104 32 45 45 104 101 108 112 34 32 102 111 114 32 109 111 114 101 32 105 110 102 111 114 109 97 116 105 111 110 10]
2016-06-30_19:49:36.13980 2016/06/30 19:49:36 Monitoring container 9tee4-dz642-pi27n3loindq935 finished
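
The bracketed list of numbers at the end of the "Error submitting container" line appears to be sbatch's stderr printed as raw byte values rather than as text. A minimal sketch for turning it back into a readable message follows; decodeByteList and stderrFromLog are names made up for this example and are not part of crunch-dispatch-slurm.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodeByteList turns a bracketed, space-separated list of decimal byte
// values (for example "[115 98 97]") back into the string those bytes spell.
func decodeByteList(s string) (string, error) {
	s = strings.TrimSpace(s)
	s = strings.TrimPrefix(s, "[")
	s = strings.TrimSuffix(s, "]")
	fields := strings.Fields(s)
	buf := make([]byte, 0, len(fields))
	for _, f := range fields {
		// bitSize 8 rejects anything outside 0..255.
		n, err := strconv.ParseUint(f, 10, 8)
		if err != nil {
			return "", fmt.Errorf("bad byte value %q: %w", f, err)
		}
		buf = append(buf, byte(n))
	}
	return string(buf), nil
}

// stderrFromLog is the bracketed list copied from the log line above.
const stderrFromLog = "[115 98 97 116 99 104 58 32 117 110 114 101 99 111 103 110 105 122 101 100 32 111 112 116 105 111 110 32 39 45 45 112 97 114 115 97 98 108 101 39 10 84 114 121 32 34 115 98 97 116 99 104 32 45 45 104 101 108 112 34 32 102 111 114 32 109 111 114 101 32 105 110 102 111 114 109 97 116 105 111 110 10]"

func main() {
	msg, err := decodeByteList(stderrFromLog)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Print(msg)
}
```

Decoded, the bytes read: sbatch: unrecognized option '--parsable', followed by: Try "sbatch --help" for more information. So in this particular log line, the message sbatch printed is about the --parsable flag rather than the --mem-per-cpu value.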

Background

In the Crunch 2 API, Containers are expected to have valid values set for the ram and vcpus runtime_constraints. The primary bug here is that the API server didn't enforce this; that is tracked in #9617 and #9618.

That said, crunch-dispatch-slurm would be a nicer client if it noticed the problem itself and logged it as such, rather than sending a bad request to SLURM.

Fix

Don't try to monitor or run Containers that are missing fields in serialized attributes like runtime_constraints that crunch-dispatch-slurm requires for operation. Right now this applies to the vcpus and ram fields of runtime_constraints. When we consider acting on such a Container, instead log a message "Ignoring container [UUID] because it is missing [field name] in [attribute name]", and do no more processing on it. Continue processing other containers as normal.
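
A minimal sketch of the check described above, not the dispatcher's actual code. The Container struct, the function names, and the second container's UUID are hypothetical and exist only for this example. Treating a zero or negative value the same as a missing field is an assumption that goes beyond the text.

```go
package main

import (
	"fmt"
	"log"
)

// Container is a hypothetical, trimmed-down view of a container record;
// only the fields this check needs are modelled.
type Container struct {
	UUID               string
	RuntimeConstraints map[string]int64
}

// requiredConstraints lists the runtime_constraints fields named in the Fix text.
var requiredConstraints = []string{"vcpus", "ram"}

// missingConstraint reports the first required field that is absent.
// Assumption: a zero or negative value is treated the same as a missing one.
func missingConstraint(c Container) (string, bool) {
	for _, name := range requiredConstraints {
		if v, ok := c.RuntimeConstraints[name]; !ok || v <= 0 {
			return name, true
		}
	}
	return "", false
}

// ignoreMessage builds the log line proposed in the Fix text.
func ignoreMessage(uuid, field, attribute string) string {
	return fmt.Sprintf("Ignoring container %s because it is missing %s in %s", uuid, field, attribute)
}

// dispatchable logs and skips containers that fail the check, and returns
// the remaining containers in their original order.
func dispatchable(queue []Container) []Container {
	var kept []Container
	for _, c := range queue {
		if field, missing := missingConstraint(c); missing {
			log.Print(ignoreMessage(c.UUID, field, "runtime_constraints"))
			continue
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	queue := []Container{
		// The container from this report: runtime_constraints is empty.
		{UUID: "9tee4-dz642-pi27n3loindq935", RuntimeConstraints: map[string]int64{}},
		// A made-up container that has both required fields set.
		{UUID: "example-container-with-constraints", RuntimeConstraints: map[string]int64{"vcpus": 1, "ram": 1000000000}},
	}
	for _, c := range dispatchable(queue) {
		fmt.Println("would submit", c.UUID)
	}
}
```

With this queue, the first container is logged as ignored because vcpus is missing, and only the second one reaches the submit step.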

Actions #1

Updated by Brett Smith over 7 years ago

The --cpus-per-task=0 also looks a little fishy. The sbatch man page doesn't say anything about what 0 means. Logically, it doesn't make a ton of sense.

Actions #2

Updated by Tom Clegg over 7 years ago

I think crunch-dispatch-slurm is (correctly) relying on the API to always provide values for runtime_constraints[vcpus] and runtime_constraints[ram], but the API is (incorrectly) leaving them empty if the container request doesn't provide values or ranges.

In that case, you can work around it by setting "runtime_constraints":{"vcpus":1,"ram":1000000000}.

Actions #3

Updated by Brett Smith over 7 years ago

  • Target version deleted (Arvados Future Sprints)
  • Release set to 11

Actions #4

Updated by Brett Smith over 7 years ago

  • Tracker changed from Bug to Idea
  • Subject changed from [Crunch2] SLURM dispatcher tries to run sbatch with negative --mem-per-cpu to [Crunch2] SLURM dispatcher ignores Containers it can't dispatch
  • Description updated (diff)

Actions #5

Updated by Tom Morris almost 6 years ago

  • Release deleted (11)

Actions #6

Updated by Peter Amstutz about 4 years ago

  • Status changed from New to Closed