Feature #8128
closed[Crunch2] API support for crunch-dispatch
Description
See Container dispatch ("Arvados API support")
Some of the required APIs don't exist yet:- "Locked" state
- auth_uuid field: created automatically when container changes to state="Running"
- arvados.v1.containers.auth(uuid=container.uuid): retrieve the api_client_authentication record corresponding to auth_uuid
- api_client_authorizations.current: retrieve UUID of current token (similar to users.current). This avoids having to put the secret token anywhere other than the Authorization header.
- Filter containers list by API token:
containers.list?filters=[[locked_by_uuid,=,my_token_uuid],[state,in,[Running,Locked]]]
- Get list of queued containers:
containers.list?filters=[[state,=,Queued]]
- Get UUID of current token:
api_client_authorizations.list?filters=[[api_token,=,my_token]]
Related issues
Updated by Peter Amstutz over 8 years ago
// Update container status to Running err := arv.Update("containers", uuid, arvadosclient.Dict{ "container": arvadosclient.Dict{"state": "Locked"}}, nil) if err != nil { log.Printf("Error updating container %v to 'Locked' state: %v", uuid, err) }
- The comment says the 1st step is to "set container state to locked", but it locks the container record after starting the crunch-run -process. I believe the comment is correct and the code is wrong in this case.
- If it fails to lock the container, it logs an error but continues to run it anyway
- There's nothing here that fetches the container-specific authorization token after it is locked
- The comment says the 1st step is to "set container state to locked", but it locks the container record after submitting it to the slurm queue. I believe the comment is correct and the code is wrong in this case.
- Line 161 ends at column 237, which is a bit excessive even for my 200-character-wide Emacs window.
- There's nothing here that fetches the container-specific authorization token after it is locked
To be continued..
Updated by Peter Amstutz over 8 years ago
- Hard time following logic in validate_lock, could use some comments.
- Need row-level locking (with_lock) anytime container state field changes. Easy
way to do that with Rails?
Current design seems to be that crunch-run is responsible for fetching auth
token, but crunch-run isn't doing that yet. Crunch-run itself currently runs
with same token as dispatcher. This seems like a bad idea, since the dispatcher token is a long-lived admin token.
Suggest that crunch-run runs with auth token (requires changes to dispatchers, no change to crunch-run). This would imply that both locked_by and auth
token are permitted to update container record.
Dispatchers should honor priority list: "order by priority DESC, modified_at ASC"
Also filter priority > 0
Slurm dispatcher should pass priority on sbatch command line, --priority=<value>
When claiming a Locked container previously owned by me, should check squeue to make sure
the job exists, clean up record if not.
Updated by Peter Amstutz over 8 years ago
Perhaps we can tweak the containers controller so that all updates take a row level lock?
Updated by Tom Clegg over 8 years ago
Added row-level lock in container controller.
Added comments in validate_lock.
Fixed the Locked/Running state changes in dispatchers (you're right, I changed the Running transition but I should have added a Locked transition elsewhere).
Pretty sure the Running transition really should be happening inside crunch-run anyway, not dispatch, but I don't want to block the API changes behind that.
The rest of the comments are about improving dispatchers, not the API support or the changes I made to dispatchers here, right?
Updated by Peter Amstutz over 8 years ago
@ 88d5c65 services/crunch-dispatch-local tests fails
********** Running services/crunch-dispatch-local tests ********** 2016/05/11 16:06:08 authSettings: map[ARVADOS_API_HOST:0.0.0.0:37318 ARVADOS_API_HOST_INSECURE:true ARVADOS_API_TOKEN:4axaw8zxe0qm22wa6urpp5nskcne8z88cvbupv653y1njyi05h] 2016/05/11 16:06:10 About to run queued container zzzzz-dz642-queuedcontainer 2016/05/11 16:06:10 Starting container zzzzz-dz642-queuedcontainer zzzzz-dz642-queuedcontainer 2016/05/11 16:06:10 Finished container run for zzzzz-dz642-queuedcontainer 2016/05/11 16:06:10 After crunch-run process termination, the state is still 'Running' for zzzzz-dz642-queuedcontainer. Updating it to 'Complete' 2016/05/11 16:06:13 Caught signal: interrupt 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" ---------------------------------------------------------------------- FAIL: /var/lib/arvados/test/GOPATH/src/git.curoverse.com/arvados.git/services/crunch-dispatch-local/crunch-dispatch-local_test.go:96: MockArvadosServerSuite.Test_APIErrorUpdatingContainerState /var/lib/arvados/test/GOPATH/src/git.curoverse.com/arvados.git/services/crunch-dispatch-local/crunch-dispatch-local_test.go:103: testWithServerStub(c, apiStubResponses, "echo", "Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state") /var/lib/arvados/test/GOPATH/src/git.curoverse.com/arvados.git/services/crunch-dispatch-local/crunch-dispatch-local_test.go:155: c.Check(buf.String(), Matches, `(?ms).*`+expected+`.*`) ... value string = "" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + [50,000 more lines] ... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" + ... "2016/05/11 16:06:17 Caught signal: terminated\n" + ... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" ... regex string = "(?ms).*Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state.*" zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 2016/05/11 16:06:19 After crunch-run process termination, the state is still 'Running' for zzzzz-dz642-xxxxxxxxxxxxxx2. Updating it to 'Complete' 2016/05/11 16:06:19 After crunch-run process termination, the state is still 'Running' for zzzzz-dz642-xxxxxxxxxxxxxx2. Updating it to 'Complete' zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 zzzzz-dz642-xxxxxxxxxxxxxx2 OOPS: 4 passed, 1 FAILED --- FAIL: Test (15.27 seconds) FAIL
Updated by Tom Clegg over 8 years ago
Flaky test. Seems to be one of those SIGPIPE races, because we were trying to pipe a "#!/bin/sh\necho foo bar\n"
script into a mock sbatch "echo foo"
. Changed the mock sbatch to "sh"
and it hasn't failed for me since, fwiw.
→ cbc54fd
Updated by Tom Clegg over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 80 to 100
Applied in changeset arvados|commit:3ede205bf0e5af7a88e04009cd66ff111e0729c3.