Idea #9616

Updated by Brett Smith almost 8 years ago

h2. Bug report 

 Given this container request: 

 <pre><code class="json">    { 
    "href":"/container_requests/9tee4-xvhdp-agu2zocga6tyv7u", 
    "kind":"arvados#containerRequest", 
    "etag":"74g8x93wxb6f1mly0woyun710", 
    "uuid":"9tee4-xvhdp-agu2zocga6tyv7u", 
    "owner_uuid":"9tee4-tpzed-3qslznhod3vdpug", 
    "created_at":"2016-06-30T19:48:57.796884000Z", 
    "modified_by_client_uuid":"9tee4-ozdt8-yft7aixdt2dx5ni", 
    "modified_by_user_uuid":"9tee4-tpzed-3qslznhod3vdpug", 
    "modified_at":"2016-06-30T19:48:57.796401000Z", 
    "command":[ 
     "true" 
    ], 
    "container_count_max":null, 
    "container_image":"arvados/jobs:latest", 
    "container_uuid":"9tee4-dz642-pi27n3loindq935", 
    "cwd":".", 
    "description":null, 
    "environment":{}, 
    "expires_at":null, 
    "filters":null, 
    "mounts":{ 
     "/out":{ 
      "kind":"tmp", 
      "capacity":1000 
     } 
    }, 
    "name":"Brett test 2016-06-30a", 
    "output_path":"/out", 
    "priority":1, 
    "properties":{}, 
    "requesting_container_uuid":null, 
    "runtime_constraints":{}, 
    "state":"Committed" 
   } 
 </code></pre> 

 And this container: 

 <pre><code class="json">{ 
  "href":"/containers/9tee4-dz642-pi27n3loindq935", 
  "kind":"arvados#container", 
  "etag":"7wpsprryz63nv2znu6qpqk9az", 
  "uuid":"9tee4-dz642-pi27n3loindq935", 
  "owner_uuid":"9tee4-tpzed-000000000000000", 
  "created_at":"2016-06-30T19:48:57.766176000Z", 
  "modified_by_client_uuid":"9tee4-ozdt8-wt0x6s6j9yhycfh", 
  "modified_by_user_uuid":"9tee4-tpzed-000000000000000", 
  "modified_at":"2016-07-18T13:38:05.043051000Z", 
  "command":[ 
   "true" 
  ], 
  "container_image":"arvados/jobs:latest", 
  "cwd":".", 
  "environment":{}, 
  "exit_code":null, 
  "finished_at":null, 
  "locked_by_uuid":"9tee4-gj3su-n39vrgwxelusj7n", 
  "log":null, 
  "mounts":{ 
   "/out":{ 
    "kind":"tmp", 
    "capacity":1000 
   } 
  }, 
  "output":null, 
  "output_path":"/out", 
  "priority":1, 
  "progress":null, 
  "runtime_constraints":{}, 
  "started_at":null, 
  "state":"Locked", 
  "auth_uuid":"…" 
 } 
 </code></pre> 

crunch-dispatch-slurm tries to queue the container with a negative @--mem-per-cpu@ value and zero @--cpus-per-task@, and the submission fails. From the logs:

 <pre>2016-06-30_19:49:26.22088 2016/06/30 19:49:26 Monitoring container 9tee4-dz642-pi27n3loindq935 started 
 2016-06-30_19:49:36.03180 2016/06/30 19:49:36 About to submit queued container 9tee4-dz642-pi27n3loindq935 
 2016-06-30_19:49:36.09338 2016/06/30 19:49:36 Error submitting container 9tee4-dz642-pi27n3loindq935 to slurm: Container submission failed [sbatch --share --parsable --job-name=9tee4-dz642-pi27n3loindq935 --mem-per-cpu=-9223372036854775808 --cpus-per-task=0 --priority=1]: exit status 1 [115 98 97 116 99 104 58 32 117 110 114 101 99 111 103 110 105 122 101 100 32 111 112 116 105 111 110 32 39 45 45 112 97 114 115 97 98 108 101 39 10 84 114 121 32 34 115 98 97 116 99 104 32 45 45 104 101 108 112 34 32 102 111 114 32 109 111 114 101 32 105 110 102 111 114 109 97 116 105 111 110 10] 
 2016-06-30_19:49:36.13980 2016/06/30 19:49:36 Monitoring container 9tee4-dz642-pi27n3loindq935 finished 
 </pre> 
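
The bracketed run of numbers at the end of the error line looks like sbatch's output printed as a list of byte values rather than as text. As a reading aid only, here is a minimal sketch that turns such a list back into a string. The function name @decodeByteList@ is hypothetical and is not part of crunch-dispatch-slurm.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodeByteList converts a space-separated list of decimal byte values,
// as seen inside the brackets of the log line, back into a string.
// Hypothetical helper for illustration only.
func decodeByteList(list string) (string, error) {
	fields := strings.Fields(list)
	out := make([]byte, 0, len(fields))
	for _, f := range fields {
		// bitSize 8 rejects anything that does not fit in a single byte.
		n, err := strconv.ParseUint(f, 10, 8)
		if err != nil {
			return "", fmt.Errorf("bad byte value %q: %v", f, err)
		}
		out = append(out, byte(n))
	}
	return string(out), nil
}

func main() {
	// First few values from the log line above; paste the full list to
	// read the complete message.
	text, err := decodeByteList("115 98 97 116 99 104 58 32")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%q\n", text)
}
```

Pasting the whole bracketed list into @decodeByteList@ yields the text that sbatch printed when it rejected the command.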

h2. Background

In the Crunch 2 API, Containers are expected to have valid values set for the @ram@ and @vcpus@ runtime_constraints. The primary bug here is that the API server didn't enforce this; that is tracked in #9617 and #9618.

That said, crunch-dispatch-slurm would have been a better-behaved client if it had noticed the problem itself and logged it as such, rather than sending a bad request to SLURM.

h2. Fix

Don't try to monitor or run Containers that are missing fields that crunch-dispatch-slurm requires in serialized attributes like runtime_constraints. Right now this applies to the @vcpus@ and @ram@ fields of runtime_constraints. When we consider acting on such a Container, instead log the message "Ignoring container [UUID] because it is missing [field name] in [attribute name]" and do no more processing on it. Continue processing other containers as normal.
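
The check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual crunch-dispatch-slurm code: the @Container@ struct, @requiredConstraints@, @missingField@, @ignoreMessage@, and @runnable@ are all hypothetical names, and runtime_constraints is modelled as a simple map of integers.

```go
package main

import (
	"fmt"
	"log"
)

// Container is a hypothetical, trimmed-down stand-in for the API record.
type Container struct {
	UUID               string
	RuntimeConstraints map[string]int64
}

// requiredConstraints lists the runtime_constraints fields the dispatcher
// needs before it can build a submission (assumed: vcpus and ram).
var requiredConstraints = []string{"vcpus", "ram"}

// missingField returns the first required field that is absent from
// runtime_constraints, or "" when every required field is present.
func missingField(c Container) string {
	for _, name := range requiredConstraints {
		if _, ok := c.RuntimeConstraints[name]; !ok {
			return name
		}
	}
	return ""
}

// ignoreMessage formats the log line proposed in this ticket.
func ignoreMessage(uuid, field, attribute string) string {
	return fmt.Sprintf("Ignoring container %s because it is missing %s in %s",
		uuid, field, attribute)
}

// runnable logs and skips containers with missing fields, and returns the
// rest so that processing continues as normal for them.
func runnable(queue []Container) []Container {
	var ok []Container
	for _, c := range queue {
		if field := missingField(c); field != "" {
			log.Print(ignoreMessage(c.UUID, field, "runtime_constraints"))
			continue
		}
		ok = append(ok, c)
	}
	return ok
}

func main() {
	queue := []Container{
		// The container from this report: empty runtime_constraints.
		{UUID: "9tee4-dz642-pi27n3loindq935", RuntimeConstraints: map[string]int64{}},
		// A made-up container with both fields set.
		{UUID: "zzzzz-dz642-000000000000000", RuntimeConstraints: map[string]int64{"vcpus": 1, "ram": 1 << 30}},
	}
	for _, c := range runnable(queue) {
		fmt.Println("would submit", c.UUID)
	}
}
```

This sketch only tests for presence. Rejecting zero or negative values as well might be worthwhile, since @--cpus-per-task=0@ appears in the failed submission above, but that goes beyond what this ticket asks for.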
