Feature #14198

[CWL] run steps on remote clusters

Added by Peter Amstutz about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
11/01/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0
Release:
Release relationship:
Auto

Description

When submitting container requests, a workflow step can provide a CWL hint (something like arv:TargetCluster) which sets the cluster_id and owner_uuid for container requests that inherit the hint. A remote cluster_id means the container request is submitted to the remote cluster (#14197)

The cluster_id field may contain an expression. It is evaluated in the context of the first (toplevel) appearance of the hint.

Perform validation that cluster_id appearing in hints are known by checking remote_hosts in discovery document.

The hint should be in the arvados namespace as an extension. It should be added to the documentation.

A-c-r gets a new command line flag to specify which cluster to submit the cwl-runner container.

Update code that gets container request status. If there are more than max_request_items (default 1000) container requests, break them up into pages. Test should confirm the filters in the "list these containers by UUID" API call (when getting progress updates) match the requirements for #13619


Subtasks

Task #14367: Review 14198-fed-testingResolvedLucas Di Pentima

Task #14437: a-c-r supportResolvedPeter Amstutz

Task #14438: automated testingResolvedPeter Amstutz

Task #14459: Review 14198-index-accepts-cluster-idResolvedPeter Amstutz

Task #14464: Review 14198-crunch-run-tokensResolvedPeter Amstutz

Task #14468: Review 14198-fed-collection-listResolvedPeter Amstutz

Task #14486: Review 14198-request-tokenResolvedPeter Amstutz

Task #14506: Review 14198-cwl-run-remoteResolvedLucas Di Pentima


Related issues

Related to Arvados - Bug #14458: [controller] collection federation panic send on closed channelResolved11/07/2018

Related to Arvados - Bug #14475: [Controller] POST .../collections/$uuid seems to have broken Workbench integration testsResolved

Associated revisions

Revision c47872d7
Added by Peter Amstutz about 1 year ago

Merge branch '14198-index-accepts-cluster-id' refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision c663e037 (diff)
Added by Peter Amstutz about 1 year ago

"arvbox status" includes cluster id refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision d9e48a19
Added by Peter Amstutz about 1 year ago

Merge branch '14198-crunch-run-tokens' refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision 034d8c6c
Added by Peter Amstutz about 1 year ago

Merge branch '14198-fed-collection-list' refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision d3502068
Added by Peter Amstutz about 1 year ago

Merge branch '14198-request-token' refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision 55b409d9
Added by Peter Amstutz about 1 year ago

Merge branch '14198-fed-testing' refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision 8352a686 (diff)
Added by Peter Amstutz about 1 year ago

Fix --submit-runner-cluster tests refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision 22cad1a4 (diff)
Added by Peter Amstutz about 1 year ago

Fix ClusterTarget.project_uuid to correctly set the project uuid

refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision d738b8c3 (diff)
Added by Peter Amstutz about 1 year ago

Finish fixing setting project_uuid from runtimeContext, fix tests.

refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

Revision eb59eb57
Added by Peter Amstutz about 1 year ago

Merge branch '14198-make-builder' refs #14198

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz about 1 year ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#3 Updated by Peter Amstutz about 1 year ago

  • Status changed from In Progress to New

#4 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
  • Status changed from New to In Progress

#5 Updated by Peter Amstutz about 1 year ago

  • Status changed from In Progress to New

#7 Updated by Ward Vandewege about 1 year ago

  • Target version changed from To Be Groomed to 2018-10-31 sprint

#8 Updated by Tom Clegg about 1 year ago

Test should confirm the filters in the "list these containers by UUID" API call (when getting progress updates) match the requirements for #13619

#9 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#10 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#11 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
  • Story points set to 3.0

#12 Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Peter Amstutz

#13 Updated by Peter Amstutz about 1 year ago

Workbench doesn't display steps that were run on remote clusters. Probably because the query it uses is "requested_by_container_uuid" and it won't discover container requests submitted to other clusters that way. Need to think about how we want that to work.

#14 Updated by Peter Amstutz about 1 year ago

This is odd. The API server seems to have accepted a create request for a manifest with a +R hint (but not signed with +A)

It returns a manifest with both +R and +A but the block can't be fetched. This is what #14259 is supposed to fix.

Maybe because the user is an admin (will try it with a non-admin).

#15 Updated by Peter Amstutz about 1 year ago

Confirmed that for non-admin users, the collection create request is rejected. However, the error message is really cryptic:

2018-10-25T15:37:54.913301561Z cwltool ERROR: Unhandled error:
2018-10-25T15:37:54.913301561Z   ">

So we'll want to fix that.

#16 Updated by Peter Amstutz about 1 year ago

Running the workflow runner on a cluster that isn't home (2gvh8) and having it submit jobs back to the home cluster (1b7fr):

2018-10-25T20:51:07.594188714Z arvados.cwl-runner ERROR: [container workflow.json#main] got error <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/container_requests?alt=json&cluster_id=1b7fr returned "#<ArgumentError: Invalid runtime token>">
2018-10-25T20:51:07.915895195Z arvados.cwl-runner INFO: Couldn't update runtime_status: <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/containers/2gvh8-dz642-ubcvbb55p5jd1z9?alt=json returned "Runtime status cannot be modified in state 'Running' ({}, {"error"=>"arvados.cwl-runner: [container workflow.json#main] got error <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/container_requests?alt=json&cluster_id=1b7fr returned \"#<ArgumentError: Invalid runtime token>\">"})">
2018-10-25T20:51:07.916505321Z arvados.cwl-runner ERROR: Overall process status is permanentFail
2018-10-25T20:51:08.129810132Z arvados.cwl-runner INFO: Couldn't update runtime_status: <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/containers/2gvh8-dz642-ubcvbb55p5jd1z9?alt=json returned "Runtime status cannot be modified in state 'Running' ({}, {"error"=>"arvados.cwl-runner: Overall process status is permanentFail"})">
2018-10-25T20:51:08.851368888Z arvados.cwl-runner INFO: Final output collection 282fbb8a925551343e98e762fa1ec1be+57 "Output of main (2018-10-25T20:51:08.733Z)" (2gvh8-4zz18-hhln3867yo7z02y)
2018-10-25T20:51:09.308662474Z arvados.cwl-runner INFO: Setting container output: 
2018-10-25T20:51:09.308662474Z cwltool WARNING: Final process status is permanentFail
2018-10-25T20:51:09.456511368Z arvados.cwl-runner INFO: Couldn't update runtime_status: "cwltool: Final process status is permanentFail"})">

#17 Updated by Peter Amstutz about 1 year ago

More detail on the cryptic ' ">' error message, most likely controller isn't handling it correctly:

arvados.cwl-runner INFO: Couldn't update runtime_status: "googleapiclient.http: Invalid JSON content from response: {\"errors\":[\"#\"],\"error_token\":\"1540502665+80f0ca10\"}"})">

#18 Updated by Peter Amstutz about 1 year ago

Wait a minute, here's the same error message from stderr.txt log. So the ' ">' error is actually lack of output sanitizing in in workbench (argh!)

2018-10-25T21:24:25.907555235Z arvados.cwl-runner INFO: Couldn't update runtime_status: <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/containers/2gvh8-dz642-wjej4r4gv4ytngp?alt=json returned "Runtime status cannot be modified in state 'Running' ({}, {"warning"=>"googleapiclient.http: Invalid JSON content from response: {\"errors\":[\"#<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError>\"],\"error_token\":\"1540502665+80f0ca10\"}"})">
2018-10-25T21:24:25.907583927Z cwltool ERROR: Unhandled error, try again with --debug for more information:
2018-10-25T21:24:25.907583927Z   <HttpError 403 when requesting https://172.17.0.3:8000/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError>">
2018-10-25T21:24:25.954565678Z arvados.cwl-runner INFO: Couldn't update runtime_status: <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/containers/2gvh8-dz642-wjej4r4gv4ytngp?alt=json returned "Runtime status cannot be modified in state 'Running' ({}, {"error"=>"cwltool: Unhandled error, try again with --debug for more information:", "errorDetail"=>"  <HttpError 403 when requesting https://172.17.0.3:8000/arvados/v1/collections?ensure_unique_name=true&alt=json returned \"#<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError>\">"})">

#19 Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2018-10-31 sprint to 2018-11-14 Sprint

#20 Updated by Peter Amstutz about 1 year ago

  • Status changed from New to In Progress

#21 Updated by Peter Amstutz about 1 year ago

  • Related to Bug #14458: [controller] collection federation panic send on closed channel added

#22 Updated by Lucas Di Pentima about 1 year ago

14198-index-accepts-cluster-id at 9e5234eb02e593791b3f776fe9f2e043229513ca LGTM.

#23 Updated by Peter Amstutz about 1 year ago

Hmm I wonder what's going on here. This is from test case 'runner-home-step-remote', error message is from crunch run, on a container submitted to a remote cluster, where the docker image must be fetched from the home cluster. Puzzling because this had been working in previous testing.

2018-11-08_21:15:04.42894 2lg42-dz642-szrcmkktcdl47xn 2018-11-08T21:15:04.428808123Z While writing stdin collection to docker container %qGET bf6b3413be678e8f910a64045c9f3216+1127+R21ixe-27a505ac28617b5400c8370735af2f3263d9562d@5bf71c58 failed: [http://172.17.0.6:25107/bf6b3413be678e8f910a64045c9f3216+1127+R21ixe-27a505ac28617b5400c8370735af2f3263d9562d@5bf71c58: HTTP 400 "obsolete token format" http://172.17.0.6:25108/bf6b3413be678e8f910a64045c9f3216+1127+R21ixe-27a505ac28617b5400c8370735af2f3263d9562d@5bf71c58: HTTP 400 "obsolete token format"]

#24 Updated by Peter Amstutz about 1 year ago

The error message is misleading. The error is piping file data to the stdin of the process inside the container, and doesn't have anything to do with loading the Docker image.

Ooof. It is using the wrong arvados client / keep client (the one with admin privs, not the container auth token).

#25 Updated by Lucas Di Pentima about 1 year ago

Reviewing 14198-crunch-run-tokens - b4f1614

What’s the difference between dispatcherClient vs DispatcherArvClient and containerClient vs ContainerArvClient on the ContainerRunner struct? I think some comments on all *Client members would be useful for future contributors.

#26 Updated by Peter Amstutz about 1 year ago

Lucas Di Pentima wrote:

Reviewing 14198-crunch-run-tokens - b4f1614

What’s the difference between dispatcherClient vs DispatcherArvClient and containerClient vs ContainerArvClient on the ContainerRunner struct? I think some comments on all *Client members would be useful for future contributors.

Yea, one is ArvadosClient and the other is arvados.Client -- I hate it. I can add comments. Related, I just added #14467 as a way out of this mess.

#27 Updated by Peter Amstutz about 1 year ago

14198-fed-collection-list

GenericFederatedRequestHandler will route any requests to the desired cluster if it has cluster_id in the parameters, so it is a cheap way for the client to talk to any cluster, even if some operations haven't been fused yet. However, the collection request handler didn't have that logic so I refactored it so it would share the generic request routing logic in addition to the specialized logic for collections.

#28 Updated by Lucas Di Pentima about 1 year ago

7c2aa951ed9e7d0010fd58d59dc1e98c9d5e2800 (14198-fed-collection-list branch) LGTM.

#29 Updated by Tom Clegg about 1 year ago

  • Related to Bug #14475: [Controller] POST .../collections/$uuid seems to have broken Workbench integration tests added

#30 Updated by Peter Amstutz about 1 year ago

14198-request-token @ 2b30b239c8c66c3a1a7be35c61a9fb707df1e7ee

When a container request was submitted by a remote user, we can't create a new token for that user, so set runtime_token to the provided auth token, making sure to strip out the container uuid in the 4th position in the case that the auth token is itself a container token.

#31 Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2018-11-14 Sprint to 2018-11-28 Sprint

#32 Updated by Lucas Di Pentima about 1 year ago

Reviewing 14198-request-token: Could a test be added for this case? Apart from that, it LGTM.

#33 Updated by Peter Amstutz about 1 year ago

Lucas Di Pentima wrote:

Reviewing 14198-request-token: Could a test be added for this case? Apart from that, it LGTM.

Added a test & merged, thanks

#34 Updated by Peter Amstutz about 1 year ago

14198-cwl-run-remote @ e4273f3dca8a41271dba5c1491d45f3a87fb5bb2

https://ci.curoverse.com/view/Developer/job/developer-run-tests/977/console

14198-fed-testing @ 4c9e5266635324ce7b44a2a9649e9ad002108d20

  • Based on previous branch
  • Uses a cwl workflow where each test case is a step
  • Tested with arvbox
  • See arvados/sdk/cwl/tests/federation/README

#36 Updated by Lucas Di Pentima about 1 year ago

Ok, so to avoid making the wait longer, I'll submit my questions now so that it can be tested on our test clusters instead of me keep trying to make the arvboxes work (and surely failing as I don't have enough RAM, as Peter commented)

  • Regarding the database credentials (on arvados-controller) that are now needed, I don't know if it's specific to this branch but it seemed that tried to use "root" as a default user when not configured, maybe it can fail with a specific error message.
  • a-c-r's parameters' help messages could be enhanced a little bit by using the "metavar" arg, for example on --submit-runner-cluster we can specify "CLUSTER_PREFIX" as the metavar. Something along those lines could be made with --submit-request-uuid
  • As we talked on the chat --submit and --always-submit-runner confused me because they seemed to me like almost the same thing. It doesn't seem "clean" to me having both but it's ok if there's no better option.
  • Are the --submit-runner-cluster param inputs being validated? It would be great if a-c-r could detect typos early.

#40 Updated by Peter Amstutz about 1 year ago

Lucas Di Pentima wrote:

Ok, so to avoid making the wait longer, I'll submit my questions now so that it can be tested on our test clusters instead of me keep trying to make the arvboxes work (and surely failing as I don't have enough RAM, as Peter commented)

  • Regarding the database credentials (on arvados-controller) that are now needed, I don't know if it's specific to this branch but it seemed that tried to use "root" as a default user when not configured, maybe it can fail with a specific error message.

Yes, that's controller related, so I'm going to rule that out of scope (but controller should have a better error message when the database login information is missing, I agree).

  • a-c-r's parameters' help messages could be enhanced a little bit by using the "metavar" arg, for example on --submit-runner-cluster we can specify "CLUSTER_PREFIX" as the metavar. Something along those lines could be made with --submit-request-uuid

Done.

  • As we talked on the chat --submit and --always-submit-runner confused me because they seemed to me like almost the same thing. It doesn't seem "clean" to me having both but it's ok if there's no better option.

I rewrote the message a bit. We could also consider making it a "hidden" option.

  • Are the --submit-runner-cluster param inputs being validated? It would be great if a-c-r could detect typos early.

Fixed.

14198-fed-testing @ bb570d30eb44e34c95d645fea42fabbede5e91f9

#41 Updated by Lucas Di Pentima about 1 year ago

LGTM, please merge. Thanks!

#42 Updated by Peter Amstutz about 1 year ago

  • Status changed from In Progress to Resolved

#43 Updated by Tom Morris about 1 year ago

  • Release set to 14

Also available in: Atom PDF