Feature #14198
closed[CWL] run steps on remote clusters
Description
When submitting container requests, a workflow step can provide a CWL hint (something like arv:TargetCluster) which sets the cluster_id and owner_uuid for container requests that inherit the hint. A remote cluster_id means the container request is submitted to the remote cluster (#14197)
The cluster_id field may contain an expression. It is evaluated in the context of the first (toplevel) appearance of the hint.
Perform validation that cluster_id appearing in hints are known by checking remote_hosts in discovery document.
The hint should be in the arvados namespace as an extension. It should be added to the documentation.
A-c-r gets a new command line flag to specify which cluster to submit the cwl-runner container.
Update code that gets container request status. If there are more than max_request_items (default 1000) container requests, break them up into pages. Test should confirm the filters in the "list these containers by UUID" API call (when getting progress updates) match the requirements for #13619
Updated by Peter Amstutz over 6 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 6 years ago
- Status changed from In Progress to New
Updated by Peter Amstutz over 6 years ago
- Description updated (diff)
- Status changed from New to In Progress
Updated by Peter Amstutz over 6 years ago
- Status changed from In Progress to New
Updated by Ward Vandewege over 6 years ago
- Target version changed from To Be Groomed to 2018-10-31 sprint
Updated by Tom Clegg over 6 years ago
Test should confirm the filters in the "list these containers by UUID" API call (when getting progress updates) match the requirements for #13619
Updated by Peter Amstutz over 6 years ago
- Description updated (diff)
- Story points set to 3.0
Updated by Peter Amstutz about 6 years ago
Workbench doesn't display steps that were run on remote clusters. Probably because the query it uses is "requested_by_container_uuid" and it won't discover container requests submitted to other clusters that way. Need to think about how we want that to work.
Updated by Peter Amstutz about 6 years ago
This is odd. The API server seems to have accepted a create request for a manifest with a +R hint (but not signed with +A)
It returns a manifest with both +R and +A but the block can't be fetched. This is what #14259 is supposed to fix.
Maybe because the user is an admin (will try it with a non-admin).
Updated by Peter Amstutz about 6 years ago
Confirmed that for non-admin users, the collection create request is rejected. However, the error message is really cryptic:
2018-10-25T15:37:54.913301561Z cwltool ERROR: Unhandled error: 2018-10-25T15:37:54.913301561Z ">
So we'll want to fix that.
Updated by Peter Amstutz about 6 years ago
Running the workflow runner on a cluster that isn't home (2gvh8) and having it submit jobs back to the home cluster (1b7fr):
2018-10-25T20:51:07.594188714Z arvados.cwl-runner ERROR: [container workflow.json#main] got error <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/container_requests?alt=json&cluster_id=1b7fr returned "#<ArgumentError: Invalid runtime token>"> 2018-10-25T20:51:07.915895195Z arvados.cwl-runner INFO: Couldn't update runtime_status: <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/containers/2gvh8-dz642-ubcvbb55p5jd1z9?alt=json returned "Runtime status cannot be modified in state 'Running' ({}, {"error"=>"arvados.cwl-runner: [container workflow.json#main] got error <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/container_requests?alt=json&cluster_id=1b7fr returned \"#<ArgumentError: Invalid runtime token>\">"})"> 2018-10-25T20:51:07.916505321Z arvados.cwl-runner ERROR: Overall process status is permanentFail 2018-10-25T20:51:08.129810132Z arvados.cwl-runner INFO: Couldn't update runtime_status: <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/containers/2gvh8-dz642-ubcvbb55p5jd1z9?alt=json returned "Runtime status cannot be modified in state 'Running' ({}, {"error"=>"arvados.cwl-runner: Overall process status is permanentFail"})"> 2018-10-25T20:51:08.851368888Z arvados.cwl-runner INFO: Final output collection 282fbb8a925551343e98e762fa1ec1be+57 "Output of main (2018-10-25T20:51:08.733Z)" (2gvh8-4zz18-hhln3867yo7z02y) 2018-10-25T20:51:09.308662474Z arvados.cwl-runner INFO: Setting container output: 2018-10-25T20:51:09.308662474Z cwltool WARNING: Final process status is permanentFail 2018-10-25T20:51:09.456511368Z arvados.cwl-runner INFO: Couldn't update runtime_status: "cwltool: Final process status is permanentFail"})">
Updated by Peter Amstutz about 6 years ago
More detail on the cryptic ' ">' error message, most likely controller isn't handling it correctly:
arvados.cwl-runner INFO: Couldn't update runtime_status: "googleapiclient.http: Invalid JSON content from response: {\"errors\":[\"#\"],\"error_token\":\"1540502665+80f0ca10\"}"})">
Updated by Peter Amstutz about 6 years ago
Wait a minute, here's the same error message from stderr.txt log. So the ' ">' error is actually lack of output sanitizing in in workbench (argh!)
2018-10-25T21:24:25.907555235Z arvados.cwl-runner INFO: Couldn't update runtime_status: <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/containers/2gvh8-dz642-wjej4r4gv4ytngp?alt=json returned "Runtime status cannot be modified in state 'Running' ({}, {"warning"=>"googleapiclient.http: Invalid JSON content from response: {\"errors\":[\"#<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError>\"],\"error_token\":\"1540502665+80f0ca10\"}"})"> 2018-10-25T21:24:25.907583927Z cwltool ERROR: Unhandled error, try again with --debug for more information: 2018-10-25T21:24:25.907583927Z <HttpError 403 when requesting https://172.17.0.3:8000/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError>"> 2018-10-25T21:24:25.954565678Z arvados.cwl-runner INFO: Couldn't update runtime_status: <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/containers/2gvh8-dz642-wjej4r4gv4ytngp?alt=json returned "Runtime status cannot be modified in state 'Running' ({}, {"error"=>"cwltool: Unhandled error, try again with --debug for more information:", "errorDetail"=>" <HttpError 403 when requesting https://172.17.0.3:8000/arvados/v1/collections?ensure_unique_name=true&alt=json returned \"#<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError>\">"})">
Updated by Peter Amstutz about 6 years ago
- Target version changed from 2018-10-31 sprint to 2018-11-14 Sprint
Updated by Peter Amstutz about 6 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz about 6 years ago
- Related to Bug #14458: [controller] collection federation panic send on closed channel added
Updated by Lucas Di Pentima about 6 years ago
14198-index-accepts-cluster-id
at 9e5234eb02e593791b3f776fe9f2e043229513ca LGTM.
Updated by Peter Amstutz about 6 years ago
Hmm I wonder what's going on here. This is from test case 'runner-home-step-remote', error message is from crunch run, on a container submitted to a remote cluster, where the docker image must be fetched from the home cluster. Puzzling because this had been working in previous testing.
2018-11-08_21:15:04.42894 2lg42-dz642-szrcmkktcdl47xn 2018-11-08T21:15:04.428808123Z While writing stdin collection to docker container %qGET bf6b3413be678e8f910a64045c9f3216+1127+R21ixe-27a505ac28617b5400c8370735af2f3263d9562d@5bf71c58 failed: [http://172.17.0.6:25107/bf6b3413be678e8f910a64045c9f3216+1127+R21ixe-27a505ac28617b5400c8370735af2f3263d9562d@5bf71c58: HTTP 400 "obsolete token format" http://172.17.0.6:25108/bf6b3413be678e8f910a64045c9f3216+1127+R21ixe-27a505ac28617b5400c8370735af2f3263d9562d@5bf71c58: HTTP 400 "obsolete token format"]
Updated by Peter Amstutz about 6 years ago
The error message is misleading. The error is piping file data to the stdin of the process inside the container, and doesn't have anything to do with loading the Docker image.
Ooof. It is using the wrong arvados client / keep client (the one with admin privs, not the container auth token).
Updated by Lucas Di Pentima about 6 years ago
Reviewing 14198-crunch-run-tokens
- b4f1614
What’s the difference between dispatcherClient
vs DispatcherArvClient
and containerClient
vs ContainerArvClient
on the ContainerRunner
struct? I think some comments on all *Client members would be useful for future contributors.
Updated by Peter Amstutz about 6 years ago
Lucas Di Pentima wrote:
Reviewing
14198-crunch-run-tokens
- b4f1614What’s the difference between
dispatcherClient
vsDispatcherArvClient
andcontainerClient
vsContainerArvClient
on theContainerRunner
struct? I think some comments on all *Client members would be useful for future contributors.
Yea, one is ArvadosClient and the other is arvados.Client -- I hate it. I can add comments. Related, I just added #14467 as a way out of this mess.
Updated by Peter Amstutz about 6 years ago
14198-fed-collection-list
GenericFederatedRequestHandler will route any requests to the desired cluster if it has cluster_id in the parameters, so it is a cheap way for the client to talk to any cluster, even if some operations haven't been fused yet. However, the collection request handler didn't have that logic so I refactored it so it would share the generic request routing logic in addition to the specialized logic for collections.
Updated by Lucas Di Pentima about 6 years ago
7c2aa951ed9e7d0010fd58d59dc1e98c9d5e2800 (14198-fed-collection-list
branch) LGTM.
Updated by Tom Clegg about 6 years ago
- Related to Bug #14475: [Controller] POST .../collections/$uuid seems to have broken Workbench integration tests added
Updated by Peter Amstutz about 6 years ago
14198-request-token @ 2b30b239c8c66c3a1a7be35c61a9fb707df1e7ee
When a container request was submitted by a remote user, we can't create a new token for that user, so set runtime_token to the provided auth token, making sure to strip out the container uuid in the 4th position in the case that the auth token is itself a container token.
Updated by Peter Amstutz about 6 years ago
- Target version changed from 2018-11-14 Sprint to 2018-11-28 Sprint
Updated by Lucas Di Pentima about 6 years ago
Reviewing 14198-request-token
: Could a test be added for this case? Apart from that, it LGTM.
Updated by Peter Amstutz about 6 years ago
Lucas Di Pentima wrote:
Reviewing
14198-request-token
: Could a test be added for this case? Apart from that, it LGTM.
Added a test & merged, thanks
Updated by Peter Amstutz about 6 years ago
14198-cwl-run-remote @ e4273f3dca8a41271dba5c1491d45f3a87fb5bb2
https://ci.curoverse.com/view/Developer/job/developer-run-tests/977/console
- Add ClusterTarget hint
- Use PDH for cluster_image (see 46d3a59fdedee791b3fc5b19ad64c7417235cf2c)
- Add --submit-runner-cluster and --always-submit-runner
- Update cwltool and schema salad
14198-fed-testing @ 4c9e5266635324ce7b44a2a9649e9ad002108d20
- Based on previous branch
- Uses a cwl workflow where each test case is a step
- Tested with arvbox
- See arvados/sdk/cwl/tests/federation/README
Updated by Peter Amstutz about 6 years ago
Rebased on latest master 14198-fed-testing @ 423936db5b43ba7b60a9088d5b0be80e1dee0599
https://ci.curoverse.com/view/Developer/job/developer-run-tests/978/
Updated by Lucas Di Pentima about 6 years ago
Ok, so to avoid making the wait longer, I'll submit my questions now so that it can be tested on our test clusters instead of me keep trying to make the arvboxes work (and surely failing as I don't have enough RAM, as Peter commented)
- Regarding the database credentials (on
arvados-controller
) that are now needed, I don't know if it's specific to this branch but it seemed that tried to use "root" as a default user when not configured, maybe it can fail with a specific error message. - a-c-r's parameters' help messages could be enhanced a little bit by using the "metavar" arg, for example on
--submit-runner-cluster
we can specify "CLUSTER_PREFIX" as the metavar. Something along those lines could be made with--submit-request-uuid
- As we talked on the chat
--submit
and--always-submit-runner
confused me because they seemed to me like almost the same thing. It doesn't seem "clean" to me having both but it's ok if there's no better option. - Are the
--submit-runner-cluster
param inputs being validated? It would be great if a-c-r could detect typos early.
Updated by Peter Amstutz about 6 years ago
Lucas Di Pentima wrote:
Ok, so to avoid making the wait longer, I'll submit my questions now so that it can be tested on our test clusters instead of me keep trying to make the arvboxes work (and surely failing as I don't have enough RAM, as Peter commented)
- Regarding the database credentials (on
arvados-controller
) that are now needed, I don't know if it's specific to this branch but it seemed that tried to use "root" as a default user when not configured, maybe it can fail with a specific error message.
Yes, that's controller related, so I'm going to rule that out of scope (but controller should have a better error message when the database login information is missing, I agree).
- a-c-r's parameters' help messages could be enhanced a little bit by using the "metavar" arg, for example on
--submit-runner-cluster
we can specify "CLUSTER_PREFIX" as the metavar. Something along those lines could be made with--submit-request-uuid
Done.
- As we talked on the chat
--submit
and--always-submit-runner
confused me because they seemed to me like almost the same thing. It doesn't seem "clean" to me having both but it's ok if there's no better option.
I rewrote the message a bit. We could also consider making it a "hidden" option.
- Are the
--submit-runner-cluster
param inputs being validated? It would be great if a-c-r could detect typos early.
Fixed.
14198-fed-testing @ bb570d30eb44e34c95d645fea42fabbede5e91f9
Updated by Peter Amstutz about 6 years ago
- Status changed from In Progress to Resolved