Feature #8128

[Crunch2] API support for crunch-dispatch

Added by Tom Clegg about 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
04/28/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release:
Release relationship:
Auto

Description

See Container dispatch ("Arvados API support")

Some of the required APIs don't exist yet:
  • "Locked" state
  • auth_uuid field: created automatically when container changes to state="Running"
  • arvados.v1.containers.auth(uuid=container.uuid): retrieve the api_client_authentication record corresponding to auth_uuid
  • api_client_authorizations.current: retrieve UUID of current token (similar to users.current). This avoids having to put the secret token anywhere other than the Authorization header.
We do have:
  • Filter containers list by API token: containers.list?filters=[[locked_by_uuid,=,my_token_uuid],[state,in,[Running,Locked]]]
  • Get list of queued containers: containers.list?filters=[[state,=,Queued]]
  • Get UUID of current token: api_client_authorizations.list?filters=[[api_token,=,my_token]]

Subtasks

Task #9106: Add "Locked" stateResolvedTom Clegg

Task #9109: Add containers.container_auth APIResolvedTom Clegg

Task #9098: Review 8128-crunch2-auth-apiResolvedPeter Amstutz

Task #9107: Update container_auth_uuid as needed when state changesResolvedTom Clegg

Task #9108: Add api_client_authorizations.current APIResolvedTom Clegg


Related issues

Related to Arvados - Feature #6518: [Crunch] [Crunch2] Dispatch containers via slurmResolved07/08/2015

Related to Arvados - Feature #8028: [Crunch2] Dispatch containers locallyResolved01/19/2016

Related to Arvados - Bug #9900: [Crunch2] [API] Add ephemeral "run token" for running containersResolved08/31/2016

Blocked by Arvados - Bug #8079: [API] Add uuid property to ApiClientAuthorizationResolved02/12/2016

Precedes (1 day) Arvados - Story #9187: [Crunch2] Update dispatcher to use new API supportResolved05/02/2016

Associated revisions

Revision 3ede205b
Added by Tom Clegg almost 4 years ago

Merge branch '8128-crunch2-auth-api'

closes #8128

History

#1 Updated by Tom Clegg about 4 years ago

  • Description updated (diff)

#2 Updated by Tom Clegg about 4 years ago

  • Description updated (diff)

#3 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#4 Updated by Tom Clegg almost 4 years ago

  • Target version set to 2016-05-11 sprint

#5 Updated by Tom Clegg almost 4 years ago

  • Assigned To set to Tom Clegg

#6 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#7 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#8 Updated by Peter Amstutz almost 4 years ago

crunch-dispatch-local.go:
    // Update container status to Running
    err := arv.Update("containers", uuid,
        arvadosclient.Dict{
            "container": arvadosclient.Dict{"state": "Locked"}},
        nil)
    if err != nil {
        log.Printf("Error updating container %v to 'Locked' state: %v", uuid, err)
    }

  • The comment says the 1st step is to "set container state to locked", but it locks the container record after starting the crunch-run -process. I believe the comment is correct and the code is wrong in this case.
  • If it fails to lock the container, it logs an error but continues to run it anyway
  • There's nothing here that fetches the container-specific authorization token after it is locked
crunch-dispatch-slurm.go:
  • The comment says the 1st step is to "set container state to locked", but it locks the container record after submitting it to the slurm queue. I believe the comment is correct and the code is wrong in this case.
  • Line 161 ends at column 237, which is a bit excessive even for my 200-character-wide Emacs window.
  • There's nothing here that fetches the container-specific authorization token after it is locked

To be continued..

#9 Updated by Peter Amstutz almost 4 years ago

container.rb:
  • Hard time following logic in validate_lock, could use some comments.
  • Need row-level locking (with_lock) anytime container state field changes. Easy
    way to do that with Rails?

Current design seems to be that crunch-run is responsible for fetching auth
token, but crunch-run isn't doing that yet. Crunch-run itself currently runs
with same token as dispatcher. This seems like a bad idea, since the dispatcher token is a long-lived admin token.

Suggest that crunch-run runs with auth token (requires changes to dispatchers, no change to crunch-run). This would imply that both locked_by and auth
token are permitted to update container record.

Dispatchers should honor priority list: "order by priority DESC, modified_at ASC"
Also filter priority > 0

Slurm dispatcher should pass priority on sbatch command line, --priority=<value>

When claiming a Locked container previously owned by me, should check squeue to make sure
the job exists, clean up record if not.

#10 Updated by Peter Amstutz almost 4 years ago

Perhaps we can tweak the containers controller so that all updates take a row level lock?

#11 Updated by Tom Clegg almost 4 years ago

Added row-level lock in container controller.

Added comments in validate_lock.

Fixed the Locked/Running state changes in dispatchers (you're right, I changed the Running transition but I should have added a Locked transition elsewhere).

Pretty sure the Running transition really should be happening inside crunch-run anyway, not dispatch, but I don't want to block the API changes behind that.

The rest of the comments are about improving dispatchers, not the API support or the changes I made to dispatchers here, right?

#12 Updated by Peter Amstutz almost 4 years ago

@ 88d5c65 services/crunch-dispatch-local tests fails

       ********** Running services/crunch-dispatch-local tests **********

2016/05/11 16:06:08 authSettings: map[ARVADOS_API_HOST:0.0.0.0:37318 ARVADOS_API_HOST_INSECURE:true ARVADOS_API_TOKEN:4axaw8zxe0qm22wa6urpp5nskcne8z88cvbupv653y1njyi05h]
2016/05/11 16:06:10 About to run queued container zzzzz-dz642-queuedcontainer
2016/05/11 16:06:10 Starting container zzzzz-dz642-queuedcontainer
zzzzz-dz642-queuedcontainer
2016/05/11 16:06:10 Finished container run for zzzzz-dz642-queuedcontainer
2016/05/11 16:06:10 After crunch-run process termination, the state is still 'Running' for zzzzz-dz642-queuedcontainer. Updating it to 'Complete'
2016/05/11 16:06:13 Caught signal: interrupt
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716" 

----------------------------------------------------------------------
FAIL: /var/lib/arvados/test/GOPATH/src/git.curoverse.com/arvados.git/services/crunch-dispatch-local/crunch-dispatch-local_test.go:96: MockArvadosServerSuite.Test_APIErrorUpdatingContainerState

/var/lib/arvados/test/GOPATH/src/git.curoverse.com/arvados.git/services/crunch-dispatch-local/crunch-dispatch-local_test.go:103:
    testWithServerStub(c, apiStubResponses, "echo", "Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state")
/var/lib/arvados/test/GOPATH/src/git.curoverse.com/arvados.git/services/crunch-dispatch-local/crunch-dispatch-local_test.go:155:
    c.Check(buf.String(), Matches, `(?ms).*`+expected+`.*`)
... value string = "" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
[50,000 more lines]
...     "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
...     "2016/05/11 16:06:17 Caught signal: terminated\n" +
...     "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" 
... regex string = "(?ms).*Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state.*" 

zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
2016/05/11 16:06:19 After crunch-run process termination, the state is still 'Running' for zzzzz-dz642-xxxxxxxxxxxxxx2. Updating it to 'Complete'
2016/05/11 16:06:19 After crunch-run process termination, the state is still 'Running' for zzzzz-dz642-xxxxxxxxxxxxxx2. Updating it to 'Complete'
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
OOPS: 4 passed, 1 FAILED
--- FAIL: Test (15.27 seconds)
FAIL

#13 Updated by Tom Clegg almost 4 years ago

  • Status changed from New to In Progress

#14 Updated by Tom Clegg almost 4 years ago

Flaky test. Seems to be one of those SIGPIPE races, because we were trying to pipe a "#!/bin/sh\necho foo bar\n" script into a mock sbatch "echo foo". Changed the mock sbatch to "sh" and it hasn't failed for me since, fwiw.

cbc54fd

#15 Updated by Tom Clegg almost 4 years ago

More races (and errant poll intervals) fixed in 2e34fd2 and b840689

#16 Updated by Peter Amstutz almost 4 years ago

LGTM @ ea0d81d

#17 Updated by Tom Clegg almost 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 80 to 100

Applied in changeset arvados|commit:3ede205bf0e5af7a88e04009cd66ff111e0729c3.

Also available in: Atom PDF