Bug #15499

job reuse didn't work?

Added by Bryan Cosca about 2 months ago. Updated 12 days ago.

Status: Resolved
Priority: Normal
% Done: 0%
Release relationship: Auto
Description

e51c5-xvhdp-8xzb5204f72jbxt should have reused e51c5-xvhdp-cxr3405i56yxqjl

I see that in the command all the hashes are the same, the docker_image_locator is the same, and the runtime_constraints are the same. Not sure why this didn't get reused?


Related issues

Related to Arvados - Story #15502: [API] API call or script to show why jobs didn't reuse (New)

Associated revisions

Revision 93086f7d
Added by Peter Amstutz 13 days ago

Merge branch '15499-container-reuse-fix' refs #15499

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision 04c70223
Added by Peter Amstutz 13 days ago

Merge branch '15499-container-reuse-fix' refs #15499

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Lucas Di Pentima about 2 months ago

This is the diff between both CRs:

--- a.txt    2019-07-30 11:31:19.000000000 -0300
+++ b.txt    2019-07-30 11:31:37.000000000 -0300
@@ -1,8 +1,8 @@
 {
-  "uuid": "e51c5-xvhdp-3vq89dl2f6p8m3m",
-  "owner_uuid": "e51c5-j7d0g-nige6d55lix0cgk",
-  "created_at": "2019-07-30T13:48:42.472Z",
-  "modified_at": "2019-07-30T13:48:42.711Z",
+  "uuid": "e51c5-xvhdp-cxr3405i56yxqjl",
+  "owner_uuid": "e51c5-j7d0g-o16waa83mr5gop0",
+  "created_at": "2019-07-26T18:44:13.477Z",
+  "modified_at": "2019-07-26T18:44:13.736Z",
   "modified_by_client_uuid": null,
   "modified_by_user_uuid": "e51c5-tpzed-47sse2yu889u0r1",
   "name": "bwamem-splitsam",
@@ -10,8 +10,8 @@
   "properties": {
   },
   "state": "Final",
-  "requesting_container_uuid": "e51c5-dz642-vytrbkw8dn891r1",
-  "container_uuid": "e51c5-dz642-8ee4pme3cnqu0xo",
+  "requesting_container_uuid": "e51c5-dz642-86aw5vmzi9thcwl",
+  "container_uuid": "e51c5-dz642-d6qcn6uu9uktmq8",
   "container_count_max": 3,
   "mounts": {
     "/tmp": {
@@ -24,7 +24,7 @@
     },
     "/var/spool/cwl": {
       "kind": "tmp",
-      "capacity": 68275929088
+      "capacity": 54621372416
     },
     "/keep/9a0815d6efd15058102a5ea01ce5c593+5838": {
       "portable_data_hash": "9a0815d6efd15058102a5ea01ce5c593+5838",
@@ -74,8 +74,8 @@
   "use_existing": true,
   "scheduling_parameters": {
   },
-  "output_uuid": "e51c5-4zz18-b3e3hgqvn67pmg3",
-  "log_uuid": "e51c5-4zz18-8ultf9iez50dxvu",
+  "output_uuid": "e51c5-4zz18-77lnv1wrjw518bx",
+  "log_uuid": "e51c5-4zz18-z5f9svxhg7cy92p",
   "output_name": "Output for step bwamem-splitsam",
   "output_ttl": 0,
   "runtime_token": null

Looking at the code, secret mounts are a reuse factor that doesn't get logged or shown in API responses.

The new CR e51c5-xvhdp-8xzb5204f72jbxt created a new container (e51c5-dz642-8ee4pme3cnqu0xo) instead of reusing a previously run one, and looking at its first CR's requesting container (e51c5-dz642-z6ui0juoqmv51gw), its cwl.input.json suggests it used secret mounts. I wonder if that triggered a bug.

#2 Updated by Tom Clegg about 2 months ago

arvados_production=# select secret_mounts_md5 from containers where uuid in ('e51c5-dz642-8ee4pme3cnqu0xo','e51c5-dz642-d6qcn6uu9uktmq8');
        secret_mounts_md5         
----------------------------------
 99914b932bd37a50b983c5e7c90ae93b
 99914b932bd37a50b983c5e7c90ae93b
(2 rows)
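(For context: that value is the MD5 of the literal string "{}", i.e. an empty JSON object, so both containers had no secret mounts at all, which rules secret mounts out as the reuse blocker here. A quick check, assuming secret_mounts_md5 is the MD5 of the JSON serialization of the secret_mounts hash:)

```ruby
require 'digest'
require 'json'

# An empty secret_mounts hash serializes to "{}"; its MD5 matches the
# value stored for both containers in the query above.
empty_secret_mounts = {}
puts Digest::MD5.hexdigest(JSON.generate(empty_secret_mounts))
# => 99914b932bd37a50b983c5e7c90ae93b
```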

#3 Updated by Lucas Di Pentima about 2 months ago

From a chat with Tom Clegg, the issue seems to be the difference in capacity on the /var/spool/cwl tmp mounts. Indeed, CRs that used e51c5-dz642-d6qcn6uu9uktmq8 match on their tmp mounts, and CRs that used e51c5-dz642-8ee4pme3cnqu0xo also match among themselves.

I suppose these mounts weren't directly added by the user but were instead set by an intermediate process, so even though different mounts should normally invalidate container reuse, I don't think the fact that compute nodes happen to offer slightly different sized volumes should have such a side effect.
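One way to express that idea in code: ignore the capacity field on tmp mounts when comparing mount sets for reuse, since that value is derived from whatever compute node size the scheduler happened to pick. The mount data and the normalize_mounts helper below are illustrative only, not the actual API server code:

```ruby
# Sketch: compare two mount hashes for reuse purposes, dropping the
# "capacity" field from kind=="tmp" mounts before comparison.
def normalize_mounts(mounts)
  mounts.transform_values do |m|
    m["kind"] == "tmp" ? m.reject { |k, _| k == "capacity" } : m
  end
end

# The two /var/spool/cwl capacities from the diff above:
a = { "/var/spool/cwl" => { "kind" => "tmp", "capacity" => 68275929088 } }
b = { "/var/spool/cwl" => { "kind" => "tmp", "capacity" => 54621372416 } }
puts normalize_mounts(a) == normalize_mounts(b)  # => true
```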

#4 Updated by Peter Amstutz about 2 months ago

Either capacity shouldn't affect reuse decisions at all, or the rule should be that requesting a greater capacity will also reuse containers that ran with a smaller capacity -- although that wouldn't have worked here, since the capacity request went down.

If you actually want to reproduce a run, you should be resubmitting a request with the values from the container record, not the container request.

If changing disk space (and really, RAM and CPU cores as well) materially affects the outcome, the job probably was not deterministic to start with.

#5 Updated by Lucas Di Pentima about 2 months ago

Peter Amstutz wrote:

Either capacity shouldn't affect reuse decisions at all, or the rule should be that requesting a greater capacity will also reuse containers that ran with a smaller capacity -- although that wouldn't have worked here, since the capacity request went down.

I think in this case capacity actually went up, since e51c5-xvhdp-3vq89dl2f6p8m3m is the newer CR; the diff was just done backwards (sorry about that).

#6 Updated by Jiayong Li about 1 month ago

I'm running into this on su92l as well. The following three jobs have no differences in their scripts or docker image. I also checked the capacity of /var/spool/cwl in containers.json as mentioned above; they're the same. But the newer run didn't reuse the older run.

https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-ip6qxpomjd3z15m
https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-r8r7yq436vhakca
https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-yv1oj15k1byhwlp

#7 Updated by Peter Amstutz about 1 month ago

  • Related to Story #15502: [API] API call or script to show why jobs didn't reuse added

#8 Updated by Tom Morris about 1 month ago

  • Target version set to To Be Groomed

Peter Amstutz wrote:

Either capacity shouldn't affect reuse decisions at all, or the rule should be that requesting a greater capacity will also reuse containers that ran with a smaller capacity -- although that wouldn't have worked here, since the capacity request went down.

I don't see how we can assert that capacity should affect reuse decisions when we reserve the right to schedule jobs on any size machine that we think is appropriate.

If you actually want to reproduce a run, you should be resubmitting a request with the values from the container record, not the container request.

Is that what Workbench (and Workbench 2) do? In the OP's case, though, the submission was done using arvados-cwl-runner with the same CWL, the same Docker image, and the same inputs, yet the job didn't get reused. This makes it seem like some system component changed its behavior in a way that prevented job reuse for this case, which is very undesirable.

#9 Updated by Stephen McLaughlin 17 days ago

I would like to add my voice to this issue as well. I have been testing the deep-variant-only workflows and I'm not getting the container reuse I expect/hope for in the following:

https://workbench.e51c5.arvadosapi.com/container_requests/e51c5-xvhdp-4qleefleibg6cyo
https://workbench.e51c5.arvadosapi.com/container_requests/e51c5-xvhdp-aopvx2rb9ag1ew7
https://workbench.e51c5.arvadosapi.com/container_requests/e51c5-xvhdp-vmq3guqbdjbhzmz

Specifically, it restarts at the merge-bams-and-index step every time I rerun, but I don't think any of the changes I made should affect anything that would make that step repeat.

#10 Updated by Lucas Di Pentima 16 days ago

  • Target version changed from To Be Groomed to 2019-09-11 Sprint
  • Assigned To set to Lucas Di Pentima
  • Status changed from New to In Progress

#12 Updated by Lucas Di Pentima 16 days ago

Ok, so I managed to get arvbox working and tried a very simple CWL job (cwltool's 1st-tool example), and it's not reusing either. Here's the find_reusable log; I'll look into why it doesn't find any container with a readable log:

[req-8amasp5p39kfp25pqwz5] find_reusable: starting with 11 container records in database
[req-8amasp5p39kfp25pqwz5] find_reusable: have 3 candidates after filtering on command ["echo", "Hello world"]
[req-8amasp5p39kfp25pqwz5] find_reusable: have 3 candidates after filtering on cwd "/var/spool/cwl" 
[req-8amasp5p39kfp25pqwz5] find_reusable: have 3 candidates after filtering on environment {"TMPDIR"=>"/tmp", "HOME"=>"/var/spool/cwl"}
[req-8amasp5p39kfp25pqwz5] find_reusable: have 3 candidates after filtering on output_path "/var/spool/cwl" 
[req-8amasp5p39kfp25pqwz5] find_reusable: have 3 candidates after filtering on container_image "2a3e6fe07bb319c5812dc5c0a783e7f1+261" (resolved from "2a3e6fe07bb319c5812dc5c0a783e7f1+261")
[req-8amasp5p39kfp25pqwz5] find_reusable: have 3 candidates after filtering on mounts {"/tmp"=>{"capacity"=>1073741824, "kind"=>"tmp"}, "/var/spool/cwl"=>{"capacity"=>1073741824, "kind"=>"tmp"}}
[req-8amasp5p39kfp25pqwz5] find_reusable: have 3 candidates after filtering on secret_mounts_md5 "99914b932bd37a50b983c5e7c90ae93b" 
[req-8amasp5p39kfp25pqwz5] find_reusable: have 3 candidates after filtering on runtime_constraints {"keep_cache_ram"=>268435456, "ram"=>1073741824, "API"=>true, "vcpus"=>1}
[req-8amasp5p39kfp25pqwz5] find_reusable: checking for state=Complete with readable output and log...
[req-8amasp5p39kfp25pqwz5] find_reusable: have 3 candidates with state=Complete, exit_code=0
[req-8amasp5p39kfp25pqwz5] find_reusable: have 0 candidates with readable log
[req-8amasp5p39kfp25pqwz5] find_reusable: have 0 candidates with readable output
[req-8amasp5p39kfp25pqwz5] find_reusable: checking for state=Running...
[req-8amasp5p39kfp25pqwz5] find_reusable: have no containers in Running state
[req-8amasp5p39kfp25pqwz5] find_reusable: have no containers in Locked or Queued state
[req-8amasp5p39kfp25pqwz5] find_reusable: done, no reusable container found

#13 Updated by Peter Amstutz 16 days ago

I see. The bug is that the container request's log collection has a different PDH than the log collection produced by the container. So when it searches for an existing collection with the log, it can't find it, and therefore it doesn't reuse. This is the same change that broke crunchstat-summary.

We probably need a slightly different strategy: search for container requests that refer to the candidate container and make a copy of its log collection, instead of using the log PDH from the container record.

#14 Updated by Lucas Di Pentima 16 days ago

Made an update to make reuse tests fail at 3ee696c64 - branch 15499-container-reuse-fix

#15 Updated by Peter Amstutz 14 days ago

  • Release set to 26

#16 Updated by Tom Clegg 14 days ago

It looks to me like we don't need/want to change the way we make the new CR's log (which happens later, after choosing a container) -- we only need to relax the permission check on the existing container's log.

Are we looking for a filter like "a CR exists with this container UUID and a readable log"?
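Roughly: instead of requiring the candidate container's own log PDH to be readable, check whether any CR pointing at that container has a readable log collection. A pure-Ruby sketch with hypothetical in-memory records standing in for the real database queries and permission checks (all names and UUIDs below are placeholders):

```ruby
# Hypothetical stand-ins for database rows and the permission check.
container_requests = [
  { container_uuid: "zzzzz-dz642-000000000000001",
    log_uuid: "zzzzz-4zz18-aaaaaaaaaaaaaaa" },
]
readable_log_uuids = ["zzzzz-4zz18-aaaaaaaaaaaaaaa"]

# Reuse check: does some CR refer to this container and have a readable log?
def reusable_via_cr?(container_uuid, container_requests, readable_log_uuids)
  container_requests.any? do |cr|
    cr[:container_uuid] == container_uuid &&
      readable_log_uuids.include?(cr[:log_uuid])
  end
end

puts reusable_via_cr?("zzzzz-dz642-000000000000001",
                      container_requests, readable_log_uuids)  # => true
```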

#17 Updated by Tom Clegg 14 days ago

Once we get past the permission issue, and succeed in reusing the container, we will have to deal with the possibility that the container's original log collection doesn't exist when it's time to create a new log collection for the new CR. So yes, I suppose in that case we need to use a readable log collection from an existing CR (we can assume such a log exists, otherwise we wouldn't be reusing the container).

#18 Updated by Tom Clegg 14 days ago

Making an identical copy of the original container log collection (in addition to the multi-container log collection we make now) when finalizing a successful CR should be enough to make reuse work again.

The new log collection should have an entry in properties indicating the container UUID.

(Separately/later) putting the new collection's UUID in a new CR column would help the users keep things organized.
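The note-18 idea can be sketched as a finalize-time step. The function and field names below are illustrative (the real implementation lives in the API server's container request model), with make_collection standing in for the collection-creation call:

```ruby
# On finalizing a successful CR, save an identical copy of the
# container's original log collection, tagged with the container UUID
# so reuse logic and clients can find it later.
def finalize_log_copy(container, make_collection)
  return nil unless container[:state] == "Complete"
  make_collection.call({
    portable_data_hash: container[:log],  # same content as the original log
    properties: { "container_uuid" => container[:uuid] },
  })
end

created = []
make_collection = ->(attrs) { created << attrs; attrs }
finalize_log_copy(
  { state: "Complete",
    log: "0123456789abcdef0123456789abcdef+0",   # placeholder PDH
    uuid: "zzzzz-dz642-000000000000000" },       # placeholder UUID
  make_collection)
puts created.length  # => 1
```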

#19 Updated by Peter Amstutz 13 days ago

  • Assigned To changed from Lucas Di Pentima to Peter Amstutz

#21 Updated by Peter Amstutz 13 days ago

It would be really easy to just get rid of the merged collection. The additional output logs would be available by searching for (type = 'log', properties.container_request = '...')
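Under that proposal, a client would fetch the per-container logs with a collections filter along these lines. The exact attribute names follow the proposal above and may differ in the final implementation; the CR UUID is a placeholder:

```ruby
require 'json'

cr_uuid = "zzzzz-xvhdp-000000000000000"  # placeholder container request UUID
filters = [
  ["properties.type", "=", "log"],
  ["properties.container_request", "=", cr_uuid],
]
# This is the filter JSON a client would send to the collections API.
puts JSON.generate(filters)
```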

#22 Updated by Tom Clegg 13 days ago

Peter Amstutz wrote:

15499-container-reuse-fix @ d47c64becdb1f28b297b3de60c4a87b12bae70be

update_collections() runs every time the container log is updated (at least 1x every half hour). This looks like it would create another new log collection every time that happens, which seems a bit much -- or am I missing something?

We do need to do something to ensure containers that are still running can be reused, though.

#23 Updated by Peter Amstutz 13 days ago

Tom Clegg wrote:

Peter Amstutz wrote:

15499-container-reuse-fix @ d47c64becdb1f28b297b3de60c4a87b12bae70be

update_collections() runs every time the container log is updated (at least 1x every half hour). This looks like it would create another new log collection every time that happens, which seems a bit much -- or am I missing something?

We do need to do something to ensure containers that are still running can be reused, though.

Good catch. That means it needs to happen in finalize!

15499-container-reuse-fix @ 6b43b14a374f217ddf5870c2ed9cd90152d8e6e2

https://ci.curoverse.com/view/Developer/job/developer-run-tests/1513/

#24 Updated by Peter Amstutz 13 days ago

Submitted for consideration, this removes the merged logs entirely and just saves each container log in a separate collection. The client (e.g. Workbench) can query properties.container_request to programmatically determine which log collections are associated with a container request.

15499-reuse-fix-no-merged @ dbe464e8ccee9c7d0048bc59a871627591e2b141

#25 Updated by Lucas Di Pentima 13 days ago

Peter Amstutz wrote:

[...]

15499-container-reuse-fix @ 6b43b14a374f217ddf5870c2ed9cd90152d8e6e2

https://ci.curoverse.com/view/Developer/job/developer-run-tests/1513/

Manually tested on arvbox, it's working ok. LGTM.

#26 Updated by Tom Clegg 13 days ago

Peter Amstutz wrote:

15499-container-reuse-fix @ 6b43b14a374f217ddf5870c2ed9cd90152d8e6e2

This looks good for the case where the container succeeds.

If the container fails/cancels, it makes another new collection anyway, which seems unnecessary. Is this an intentional change, or should saving the extra collection be conditional on success as in note-18?

We do need to do something to ensure containers that are still running can be reused, though.

Looking at container.rb it seems this is already OK because we don't filter on readable logs when considering candidates with state==Running.

#27 Updated by Peter Amstutz 13 days ago

Tom Clegg wrote:

Peter Amstutz wrote:

15499-container-reuse-fix @ 6b43b14a374f217ddf5870c2ed9cd90152d8e6e2

This looks good for the case where the container succeeds.

If the container fails/cancels, it makes another new collection anyway, which seems unnecessary. Is this an intentional change, or should saving the extra collection be conditional on success as in note-18?

15499-container-reuse-fix @ 91bee39e7e5bcba0d5e418b532fc34949ffbefa8

only save the extra collection when container.state == Container::Complete

#28 Updated by Tom Clegg 13 days ago

LGTM thanks!

#29 Updated by Peter Amstutz 12 days ago

  • Target version changed from 2019-09-11 Sprint to 2019-09-25 Sprint

#30 Updated by Peter Amstutz 12 days ago

  • Target version changed from 2019-09-25 Sprint to 2019-09-11 Sprint
  • Status changed from In Progress to Resolved
