Bug #13164
[API] dispatch sometimes tries to run cancelled containers at the expense of pending containers
Status: closed
Description
Sometimes the crunch-dispatch-slurm logs can be filled with messages such as:
Mar 1 19:54:11 arvados-master-ncucu crunch-dispatch-slurm[26949]: 2018/03/01 19:54:11 debug: error locking container ncucu-dz642-2gpgq1vw3dflmox: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
This seems to be after a cwl runner job has been cancelled or after slurm container jobs are cancelled manually using `scancel`.
When in this state, crunch-dispatch-slurm seems to ignore most or all of the pending containers (sometimes for hours). However, if it is restarted manually while in that condition (i.e. by running `systemctl restart crunch-dispatch-slurm`), the new process will immediately submit all of the pending containers (sometimes many hundreds, in our case) to slurm.
Related issues
Updated by Joshua Randall over 6 years ago
Note also that the message regarding "priority <=0" is a bit misleading.
From the spot checks I've done, the issue has been that the container in question has no container_request with priority > 0. At first I was confused because the priority of the containers I found was 1, but the message is actually referring to the container_request priority rather than the container priority.
Updated by Joshua Randall over 6 years ago
Whose responsibility should it be to do something about a container with priority > 0 but with no container_requests that have priority > 0?
What appears to happen is that a container in that state stays that way permanently - crunch-dispatch-slurm considers it again and again, and every time it does, it logs the "error locking container" message. In the cases I've checked, this is because the container_request has been cancelled (it has `"state": "Final"` and `"priority": 0`). Shouldn't the API server update the container priority to 0 once it notices that there are no extant requests for it?
# grep crunch-dispatch-slurm /var/log/syslog | grep ncucu-dz642-jxmn3owekbfmxgn
Mar 2 06:59:49 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 06:59:49 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 07:59:22 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 07:59:22 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 08:58:55 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 08:58:55 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 09:58:33 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 09:58:33 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 10:58:05 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 10:58:05 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 11:57:39 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 11:57:39 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 12:57:12 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 12:57:12 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 13:56:43 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 13:56:43 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 14:56:14 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 14:56:14 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 15:55:46 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 15:55:46 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk
Mar 2 16:55:17 arvados-master-ncucu crunch-dispatch-slurm[3562]: 2018/03/02 16:55:17 debug: error locking container ncucu-dz642-jxmn3owekbfmxgn: arvados API server error: Auth uuid cannot be assigned because priority <= 0 (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk

$ arv container_request list -f '[["container_uuid","=","ncucu-dz642-jxmn3owekbfmxgn"]]' | jq '.items[0] | {state: .state, priority: .priority}'
{
  "state": "Final",
  "priority": 0
}
Updated by Joshua Randall over 6 years ago
Manual fix using psql:
arvados_api_production=> update containers set priority = 0 where containers.uuid in (select c.uuid from containers as c join container_requests as cr on c.uuid = cr.container_uuid where c.priority > 0 and cr.priority <= 0);
UPDATE 302
Updated by Tom Morris over 6 years ago
- Target version changed from To Be Groomed to 2018-04-11 Sprint
Updated by Tom Morris over 6 years ago
- Target version changed from 2018-04-11 Sprint to Arvados Future Sprints
Updated by Tom Clegg over 6 years ago
This seems like a database consistency issue -- perhaps we need better locking when updating containers and container_requests.
Updated by Tom Morris over 6 years ago
- Target version changed from Arvados Future Sprints to 2018-05-23 Sprint
Updated by Peter Amstutz over 6 years ago
- Priority changed from Normal to Low
Possibly related https://ci.curoverse.com/job/run-tests-services-api/1948/console
Updated by Tom Clegg over 6 years ago
- Subject changed from crunch-dispatch-slurm sometimes seems to get obsessed with cancelled containers at the expense of pending containers to [API] dispatch sometimes tries to run cancelled containers at the expense of pending containers
- Assigned To set to Tom Clegg
- Priority changed from Low to Normal
Updated by Tom Clegg over 6 years ago
13164-container-locking @ 764334dee966e04161d411a9feb0074b99faa147
https://ci.curoverse.com/job/developer-run-tests/716/
I suspect we get to this state (container priority>0 even though no matching requests have priority>0) in a race like this:
- (thread 1) cancel request 1
- (thread 1) compute new container priority (request 2 is still active, so priority>0)
- (thread 2) cancel request 2
- (thread 2) compute new container priority (request 1 is still active, so priority>0)
- (thread 1) commit
- (thread 2) commit
- both requests have priority=0, but container has priority>0
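A rough illustration of that interleaving as two concurrent psql sessions (a sketch only: the request UUIDs are made up, and the priority recomputation is reduced to a bare aggregate query):

-- Session A (request 1's client):
BEGIN;
UPDATE container_requests SET priority = 0 WHERE uuid = 'zzzzz-xvhdp-request1aaaaaaa';
SELECT COALESCE(max(priority), 0) FROM container_requests
  WHERE container_uuid = 'zzzzz-dz642-queuedcontainer' AND state = 'Committed' AND priority > 0;
-- returns 1: request 2's cancellation hasn't been committed yet

-- Session B (request 2's client, running concurrently):
BEGIN;
UPDATE container_requests SET priority = 0 WHERE uuid = 'zzzzz-xvhdp-request2bbbbbbb';
SELECT COALESCE(max(priority), 0) FROM container_requests
  WHERE container_uuid = 'zzzzz-dz642-queuedcontainer' AND state = 'Committed' AND priority > 0;
-- also returns 1: request 1's cancellation hasn't been committed yet

-- Both sessions then write the "new" container priority and commit:
UPDATE containers SET priority = 1 WHERE uuid = 'zzzzz-dz642-queuedcontainer';
COMMIT;
-- End state: both requests have priority 0, but the container is left with priority 1.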
Updated by Lucas Di Pentima over 6 years ago
Reading the PostgreSQL documentation about locking, it says that there's no UNLOCK command; all locks are released at the end of the transaction. So my question is: are we sure that the several update_priority!() calls from map() on container_request.rb:292 are contained within a single transaction, or do we need to explicitly enclose that query inside a transaction block? The same for propagate_priority() on containers.rb.
Updated by Tom Clegg over 6 years ago
Lucas Di Pentima wrote:
...if we're sure that the several update_priority!() calls from map() on container_request.rb:292 are contained within a single transaction or do we need to explicitly enclose that query inside a transaction block? The same for propagate_priority() on containers.rb
We want the tables to stay locked until the entire set of updates is committed, so we don't want an inner transaction.
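For reference, the behaviour being relied on here is just that a table lock taken with LOCK is held until the enclosing transaction ends; a minimal psql sketch of the intended pattern:

BEGIN;
LOCK container_requests, containers IN EXCLUSIVE MODE;
-- ... all of the update_priority! / propagate_priority work happens here,
--     inside this same transaction ...
COMMIT;  -- there is no UNLOCK; the lock (and the updates) are released/visible only now

Wrapping part of that work in a nested Rails transaction block would reuse (or savepoint within) the same database transaction, so it wouldn't release the lock any earlier either way.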
Here's an excerpt from test.log while crunch-dispatch-slurm is running:
ApiClientAuthorization Load (0.3ms) SELECT "api_client_authorizations".* FROM "api_client_authorizations" WHERE (api_token='4axaw8zxe0qm22wa6urpp5nskcne8z88cvbupv653y1njyi05h' and (expires_at is null or expires_at > CURRENT_TIMESTAMP)) ORDER BY "api_client_authorizations"."id" ASC LIMIT 1 User Load (0.3ms) SELECT "users".* FROM "users" WHERE (redirect_to_user_uuid is null) AND "users"."id" IN (135138680) ApiClient Load (0.2ms) SELECT "api_clients".* FROM "api_clients" WHERE "api_clients"."id" IN (243589807) Container Load (0.7ms) SELECT "containers".* FROM "containers" WHERE (EXISTS (SELECT "container_requests".* FROM "container_requests" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (container_requests.owner_uuid = target_uuid)) ) AND (containers.uuid = container_requests.container_uuid))) AND (containers.uuid='zzzzz-dz642-queuedcontainer') ORDER BY "containers"."id" ASC LIMIT 1 OFFSET 0 (0.2ms) BEGIN Container Load (0.4ms) SELECT "containers".* FROM "containers" WHERE "containers"."id" = $1 LIMIT 1 FOR UPDATE [["id", 202999634]] (0.2ms) SELECT clock_timestamp() ApiClientAuthorization Load (0.3ms) SELECT "api_client_authorizations".* FROM "api_client_authorizations" WHERE "api_client_authorizations"."uuid" = $1 LIMIT 1 [["uuid", "zzzzz-gj3su-114qi6skofvqv4q"]] (0.2ms) SELECT clock_timestamp() User Load (0.3ms) SELECT "users".* FROM "users" WHERE (redirect_to_user_uuid is null) AND "users"."id" = $1 LIMIT 1 [["id", 186579971]] (0.2ms) SELECT clock_timestamp() SQL (0.3ms) UPDATE "api_client_authorizations" SET "expires_at" = $1, "updated_at" = $2 WHERE "api_client_authorizations"."id" = $3 [["expires_at", "2018-05-15 18:10:13.234542"], ["updated_at", "2018-05-15 18:10:13.237432"], ["id", 1054825101]] (0.2ms) SELECT clock_timestamp() (0.2ms) SELECT clock_timestamp() SQL (0.3ms) INSERT INTO "logs" ("properties", "event_type", "object_uuid", "object_owner_uuid", "summary", "owner_uuid", "event_at", "created_at", "updated_at", "modified_at", "modified_by_user_uuid", "modified_by_client_uuid", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13) RETURNING "id" [["properties", "{\"old_etag\":\"55dfahuvg9gnjevribw0hjwbr\",\"old_attributes\":{\"id\":1054825101,\"api_client_id\":0,\"user_id\":186579971,\"created_by_ip_address\":null,\"last_used_by_ip_address\":null,\"last_used_at\":null,\"expires_at\":null,\"created_at\":\"2018-05-15T18:10:11.681531000Z\",\"updated_at\":\"2018-05-15T18:10:11.681531000Z\",\"default_owner_uuid\":null,\"scopes\":[\"all\"],\"uuid\":\"zzzzz-gj3su-114qi6skofvqv4q\"},\"new_etag\":\"av8ir4fijkatvg6dbks87sa90\",\"new_attributes\":{\"id\":1054825101,\"api_client_id\":0,\"user_id\":186579971,\"created_by_ip_address\":null,\"last_used_by_ip_address\":null,\"last_used_at\":null,\"expires_at\":\"2018-05-15T18:10:13.234542000Z\",\"created_at\":\"2018-05-15T18:10:11.681531000Z\",\"updated_at\":\"2018-05-15T18:10:13.237432000Z\",\"default_owner_uuid\":null,\"scopes\":[\"all\"],\"uuid\":\"zzzzz-gj3su-114qi6skofvqv4q\"}}"], ["event_type", "update"], ["object_uuid", "zzzzz-gj3su-114qi6skofvqv4q"], ["object_owner_uuid", "zzzzz-tpzed-xurymjxw79nv3jz"], ["summary", "update of zzzzz-gj3su-114qi6skofvqv4q"], ["owner_uuid", "zzzzz-tpzed-d9tiejq69daie8f"], ["event_at", "2018-05-15 18:10:13.239312"], ["created_at", "2018-05-15 18:10:13.240100"], ["updated_at", "2018-05-15 18:10:13.240100"], ["modified_at", "2018-05-15 18:10:13.240100"], ["modified_by_user_uuid", "zzzzz-tpzed-d9tiejq69daie8f"], ["modified_by_client_uuid", 
"zzzzz-ozdt8-obw7foaks3qjyej"], ["uuid", "zzzzz-57u5n-vzdurpld3fvqzcg"]] (0.1ms) NOTIFY logs, '21' (0.2ms) SELECT clock_timestamp() SQL (0.3ms) UPDATE "containers" SET "state" = $1, "finished_at" = $2, "locked_by_uuid" = $3, "auth_uuid" = $4, "updated_at" = $5, "modified_at" = $6 WHERE "containers"."id" = $7 [["state", "Cancelled"], ["finished_at", "2018-05-15 18:10:13.231126"], ["locked_by_uuid", nil], ["auth_uuid", nil], ["updated_at", "2018-05-15 18:10:13.244388"], ["modified_at", "2018-05-15 18:10:13.244388"], ["id", 202999634]] (0.2ms) SELECT clock_timestamp() SQL (0.3ms) INSERT INTO "logs" ("properties", "event_type", "object_uuid", "object_owner_uuid", "summary", "event_at", "owner_uuid", "created_at", "updated_at", "modified_at", "modified_by_user_uuid", "modified_by_client_uuid", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13) RETURNING "id" [["properties", "{\"old_etag\":\"ecqaadnxky966yutczxw7taml\",\"old_attributes\":{\"id\":202999634,\"uuid\":\"zzzzz-dz642-queuedcontainer\",\"owner_uuid\":\"zzzzz-tpzed-000000000000000\",\"created_at\":\"2016-01-11T11:11:11.111111000Z\",\"modified_at\":\"2018-05-15T18:10:11.720293000Z\",\"modified_by_client_uuid\":\"zzzzz-ozdt8-obw7foaks3qjyej\",\"modified_by_user_uuid\":\"zzzzz-tpzed-d9tiejq69daie8f\",\"state\":\"Running\",\"started_at\":\"2018-05-15T18:10:11.715368000Z\",\"finished_at\":null,\"log\":null,\"environment\":{},\"cwd\":\"test\",\"command\":[\"echo\",\"hello\"],\"output_path\":\"test\",\"mounts\":{\"/tmp\":{\"kind\":\"tmp\",\"capacity\":24000000000},\"/var/spool/cwl\":{\"kind\":\"tmp\",\"capacity\":24000000000}},\"runtime_constraints\":{\"ram\":12000000000,\"vcpus\":4},\"output\":null,\"container_image\":\"test\",\"progress\":null,\"priority\":1,\"updated_at\":\"2018-05-15T18:10:11.720293000Z\",\"exit_code\":null,\"auth_uuid\":\"zzzzz-gj3su-114qi6skofvqv4q\",\"locked_by_uuid\":\"zzzzz-gj3su-027z32aux8dg2s1\",\"scheduling_parameters\":{},\"secret_mounts_md5\":\"99914b932bd37a50b983c5e7c90ae93b\"},\"new_etag\":\"1hniprpj4getougzfqqltqi71\",\"new_attributes\":{\"id\":202999634,\"uuid\":\"zzzzz-dz642-queuedcontainer\",\"owner_uuid\":\"zzzzz-tpzed-000000000000000\",\"created_at\":\"2016-01-11T11:11:11.111111000Z\",\"modified_at\":\"2018-05-15T18:10:13.244388000Z\",\"modified_by_client_uuid\":\"zzzzz-ozdt8-obw7foaks3qjyej\",\"modified_by_user_uuid\":\"zzzzz-tpzed-d9tiejq69daie8f\",\"state\":\"Cancelled\",\"started_at\":\"2018-05-15T18:10:11.715368000Z\",\"finished_at\":\"2018-05-15T18:10:13.231126000Z\",\"log\":null,\"environment\":{},\"cwd\":\"test\",\"command\":[\"echo\",\"hello\"],\"output_path\":\"test\",\"mounts\":{\"/tmp\":{\"capacity\":24000000000,\"kind\":\"tmp\"},\"/var/spool/cwl\":{\"capacity\":24000000000,\"kind\":\"tmp\"}},\"runtime_constraints\":{\"ram\":12000000000,\"vcpus\":4},\"output\":null,\"container_image\":\"test\",\"progress\":null,\"priority\":1,\"updated_at\":\"2018-05-15T18:10:13.244388000Z\",\"exit_code\":null,\"auth_uuid\":null,\"locked_by_uuid\":null,\"scheduling_parameters\":{},\"secret_mounts_md5\":\"99914b932bd37a50b983c5e7c90ae93b\"}}"], ["event_type", "update"], ["object_uuid", "zzzzz-dz642-queuedcontainer"], ["object_owner_uuid", "zzzzz-tpzed-000000000000000"], ["summary", "update of zzzzz-dz642-queuedcontainer"], ["event_at", "2018-05-15 18:10:13.244388"], ["owner_uuid", "zzzzz-tpzed-d9tiejq69daie8f"], ["created_at", "2018-05-15 18:10:13.247052"], ["updated_at", "2018-05-15 18:10:13.247052"], ["modified_at", "2018-05-15 18:10:13.247052"], ["modified_by_user_uuid", 
"zzzzz-tpzed-d9tiejq69daie8f"], ["modified_by_client_uuid", "zzzzz-ozdt8-obw7foaks3qjyej"], ["uuid", "zzzzz-57u5n-p7n3sz7s2om6dw2"]] (0.1ms) NOTIFY logs, '22' (0.2ms) LOCK container_requests, containers IN EXCLUSIVE MODE (0.4ms) SELECT COUNT(*) FROM "container_requests" WHERE (container_uuid = 'zzzzz-dz642-queuedcontainer' and priority > 0 and state = 'Committed' and container_count < container_count_max) ContainerRequest Load (0.3ms) SELECT "container_requests".* FROM "container_requests" WHERE "container_requests"."container_uuid" = $1 AND "container_requests"."state" = $2 [["container_uuid", "zzzzz-dz642-queuedcontainer"], ["state", "Committed"]] Container Load (0.3ms) SELECT "containers".* FROM "containers" WHERE "containers"."uuid" = $1 LIMIT 1 [["uuid", "zzzzz-dz642-queuedcontainer"]] (0.2ms) SELECT clock_timestamp() SQL (0.6ms) UPDATE "container_requests" SET "runtime_constraints" = $1, "command" = $2, "state" = $3, "updated_at" = $4, "modified_at" = $5, "modified_by_client_uuid" = $6 WHERE "container_requests"."id" = $7 [["runtime_constraints", "{\"vcpus\":1,\"ram\":123}"], ["command", "[\"echo\",\"hello\"]"], ["state", "Final"], ["updated_at", "2018-05-15 18:10:13.259424"], ["modified_at", "2018-05-15 18:10:13.259424"], ["modified_by_client_uuid", "zzzzz-ozdt8-obw7foaks3qjyej"], ["id", 202999634]] (0.2ms) SELECT clock_timestamp() SQL (0.4ms) INSERT INTO "logs" ("properties", "event_type", "object_uuid", "object_owner_uuid", "summary", "event_at", "owner_uuid", "created_at", "updated_at", "modified_at", "modified_by_client_uuid", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12) RETURNING "id" [["properties", "{\"old_etag\":\"2fzkr20o1i6ndw4cf04a65zl\",\"old_attributes\":{\"id\":202999634,\"uuid\":\"zzzzz-xvhdp-cr4queuedcontnr\",\"owner_uuid\":\"zzzzz-tpzed-xurymjxw79nv3jz\",\"created_at\":\"2018-05-15T18:08:09.000000000Z\",\"modified_at\":\"2018-05-15T18:09:09.000000000Z\",\"modified_by_client_uuid\":null,\"modified_by_user_uuid\":\"zzzzz-tpzed-xurymjxw79nv3jz\",\"name\":\"queued\",\"description\":null,\"properties\":{},\"state\":\"Committed\",\"requesting_container_uuid\":null,\"container_uuid\":\"zzzzz-dz642-queuedcontainer\",\"container_count_max\":null,\"mounts\":{},\"runtime_constraints\":{\"vcpus\":1,\"ram\":123},\"container_image\":\"test\",\"environment\":{},\"cwd\":\"test\",\"command\":[\"echo\",\"hello\"],\"output_path\":\"test\",\"priority\":1,\"expires_at\":null,\"filters\":null,\"updated_at\":\"2018-05-15T18:09:09.000000000Z\",\"container_count\":0,\"use_existing\":true,\"scheduling_parameters\":{},\"output_uuid\":null,\"log_uuid\":null,\"output_name\":null,\"output_ttl\":0},\"new_etag\":\"2m51ajia11fnpe0jc9pc0qdlp\",\"new_attributes\":{\"id\":202999634,\"uuid\":\"zzzzz-xvhdp-cr4queuedcontnr\",\"owner_uuid\":\"zzzzz-tpzed-xurymjxw79nv3jz\",\"created_at\":\"2018-05-15T18:08:09.000000000Z\",\"modified_at\":\"2018-05-15T18:10:13.259424000Z\",\"modified_by_client_uuid\":\"zzzzz-ozdt8-obw7foaks3qjyej\",\"modified_by_user_uuid\":\"zzzzz-tpzed-xurymjxw79nv3jz\",\"name\":\"queued\",\"description\":null,\"properties\":{},\"state\":\"Final\",\"requesting_container_uuid\":null,\"container_uuid\":\"zzzzz-dz642-queuedcontainer\",\"container_count_max\":null,\"mounts\":{},\"runtime_constraints\":{\"vcpus\":1,\"ram\":123},\"container_image\":\"test\",\"environment\":{},\"cwd\":\"test\",\"command\":[\"echo\",\"hello\"],\"output_path\":\"test\",\"priority\":1,\"expires_at\":null,\"filters\":null,\"updated_at\":\"2018-05-15T18:10:13.259424000Z\",\"container_count\":0,\"
use_existing\":true,\"scheduling_parameters\":{},\"output_uuid\":null,\"log_uuid\":null,\"output_name\":null,\"output_ttl\":0}}"], ["event_type", "update"], ["object_uuid", "zzzzz-xvhdp-cr4queuedcontnr"], ["object_owner_uuid", "zzzzz-tpzed-xurymjxw79nv3jz"], ["summary", "update of zzzzz-xvhdp-cr4queuedcontnr"], ["event_at", "2018-05-15 18:10:13.259424"], ["owner_uuid", "zzzzz-tpzed-000000000000000"], ["created_at", "2018-05-15 18:10:13.263102"], ["updated_at", "2018-05-15 18:10:13.263102"], ["modified_at", "2018-05-15 18:10:13.263102"], ["modified_by_client_uuid", "zzzzz-ozdt8-obw7foaks3qjyej"], ["uuid", "zzzzz-57u5n-ctmzypuckjo8jag"]] (0.2ms) NOTIFY logs, '23' (0.2ms) LOCK container_requests, containers IN EXCLUSIVE MODE Container Load (0.3ms) SELECT "containers".* FROM "containers" WHERE (uuid in ('zzzzz-dz642-queuedcontainer','zzzzz-dz642-queuedcontainer')) ContainerRequest Load (0.3ms) SELECT "container_requests".* FROM "container_requests" WHERE "container_requests"."requesting_container_uuid" = $1 AND "container_requests"."state" = $2 [["requesting_container_uuid", "zzzzz-dz642-queuedcontainer"], ["state", "Committed"]] (1.3ms) COMMIT {"method":"PUT","path":"/arvados/v1/containers/zzzzz-dz642-queuedcontainer","format":"html","controller":"Arvados::V1::ContainersController","action":"update","status":200,"duration":48.69,"view":0.17,"db":9.14,"request_id":"req-4pohkf3gpp3pliof975s","client_ipaddr":"127.0.0.1","client_auth":"zzzzz-gj3su-027z32aux8dg2s1","params":{"container":"{\"state\":\"Cancelled\"}"},"@timestamp":"2018-05-15T18:10:13.272779425Z","@version":"1","message":"[200] PUT /arvados/v1/containers/zzzzz-dz642-queuedcontainer (Arvados::V1::ContainersController#update)"}
Updated by Lucas Di Pentima over 6 years ago
This LGTM, thanks for the clarification.
Updated by Tom Clegg over 6 years ago
- Status changed from In Progress to Feedback
Updated by Joshua Randall over 6 years ago
These changes seem likely to be the cause of #13500 -- reverting arvados-api-server to the version prior to this revision resolved the issue. With these locking changes, our system was not able to schedule any container work whatsoever. Any attempt to run workflows using arvados-cwl-runner results in the database filling with waiting locks, and eventually it starts reporting deadlocks. No jobs were submitted to SLURM at all.
Updated by Joshua Randall over 6 years ago
Is the deadlock here between a-c-r and c-d-s?
Updated by Peter Amstutz over 6 years ago
Joshua Randall wrote:
Is the deadlock here between a-c-r and c-d-s?
a-c-r now has multithreaded container submit; having more container requests in flight probably makes it much more prone to triggering the underlying race condition / deadlock.
Updated by Peter Amstutz over 6 years ago
- Related to Bug #13500: crunch-dispatch-slurm PG::TRDeadlockDetected: ERROR: deadlock detected added
Updated by Peter Amstutz over 6 years ago
My suspicion is that the exclusive table locks are stepping on the existing row level locks in Container.handle_completed, Container.update_priority!, and (ironically) Container.lock.
Updated by Tom Clegg over 6 years ago
- (thread 1) lock a row in containers table
- (thread 2) lock table container_requests → success
- (thread 2) lock table containers → wait for thread 1
- (thread 1) look up row in container_requests → wait for thread 2
- deadlock
We need to lock container_requests before locking a row in containers, if there's a possibility the transaction will select from container_requests after that.
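Sketched in SQL (illustrative statements modelled on the test.log excerpt above, not the exact queries the code runs):

-- Deadlock-prone ordering (thread 1):
BEGIN;
UPDATE containers SET state = 'Cancelled' WHERE id = 202999634;        -- takes row/table locks on containers
-- ... thread 2 now runs LOCK container_requests, containers IN EXCLUSIVE MODE,
--     acquires container_requests, and blocks waiting for containers ...
UPDATE container_requests SET state = 'Final'
  WHERE container_uuid = 'zzzzz-dz642-queuedcontainer';                -- blocks on thread 2 -> deadlock
COMMIT;

-- Safe ordering: take the table locks before touching any rows:
BEGIN;
LOCK container_requests, containers IN EXCLUSIVE MODE;
UPDATE containers SET state = 'Cancelled' WHERE id = 202999634;
UPDATE container_requests SET state = 'Final'
  WHERE container_uuid = 'zzzzz-dz642-queuedcontainer';
COMMIT;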
Updated by Tom Clegg over 6 years ago
- Status changed from Feedback to In Progress
Updated by Tom Clegg over 6 years ago
13164-container-deadlock @ 3b3794c7e0e026a6338165b9e171925e384e6502
- lock both tables during containers#update (because priority might change, and the container might be finalized, both of which invoke CR lookups)
- lock both tables during containers#lock (because assigning auth_uuid invokes CR lookups)
- no extra locking in containers#unlock (because unlocking doesn't involve CR lookups)
Updated by Peter Amstutz over 6 years ago
Tom Clegg wrote:
13164-container-deadlock @ 3b3794c7e0e026a6338165b9e171925e384e6502
- lock both tables during containers#update (because priority might change, and the container might be finalized, both of which invoke CR lookups)
- lock both tables during containers#lock (because assigning auth_uuid invokes CR lookups)
- no extra locking in containers#unlock (because unlocking doesn't involve CR lookups)
Should probably remove the existing row level locks, because they are now redundant? I see two places in update_priority! and handle_completed. Should also have a note about how they are now protected by table-level locks.
Updated by Tom Clegg over 6 years ago
- Target version changed from 2018-05-23 Sprint to 2018-06-06 Sprint
Updated by Ward Vandewege over 6 years ago
- Status changed from In Progress to Feedback
Updated by Joshua Randall over 6 years ago
I've now updated our system to the newest arvados-api-server:
root@arvados-master-eglyx:/var/www/arvados-api/current# dpkg -l |grep arvados-api
ii  arvados-api-server  1.1.4.20180524153119-8  amd64  Arvados API server - Arvados is a free and open source platform for big data science.
With this, we are back to having lots of deadlock errors propagated back to API clients such as a-c-r:
2018-05-31 12:31:47 cwltool ERROR: Unhandled error, try again with --debug for more information: <HttpError 422 when requesting https://arvados-api-eglyx.hgi.sanger.ac.uk/arvados/v1/container_requests?alt=json returned "#<PG::TRDeadlockDetected: ERROR: deadlock detected
DETAIL: Process 6320 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 6321.
Process 6321 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 6283.
Process 6283 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 6320.
HINT: See server log for query details. >">
Updated by Joshua Randall over 6 years ago
Downgrading to arvados-api-server 1.1.4.20180509181910-8 again so we have a hope of getting container_requests to run.
Updated by Tom Clegg over 6 years ago
- Status changed from Feedback to In Progress
An optimization that could make a big difference in some of these cases: don't bother updating priority of a container if state=Running (or a later state) -- it won't make any difference anyway.
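One way to express that at the SQL level would be to add a state filter to whatever statement propagates request priorities down to containers; a hypothetical variant of the earlier workaround query with that filter:

update containers set priority = 0
 where priority > 0
   and state in ('Queued', 'Locked')   -- skip Running and later states, per the note above
   and uuid not in (select container_uuid from container_requests
                     where container_uuid is not null
                       and priority > 0
                       and state = 'Committed');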
Updated by Joshua Randall over 6 years ago
I've upgraded back to 1.1.4.20180524153119-8 and our arvados-cwl-runner processes are once again getting a lot of 422 errors.
Most of them seem to be on container_request creation, but there are some others.
For example, this is for a container update (to set the state to `Running`), which seems to be sent from a compute node:
#<PG::TRDeadlockDetected: ERROR: deadlock detected DETAIL: Process 16299 waits for ExclusiveLock on relation 16453 of database 16385; blocked by process 15768. Process 15768 waits for ExclusiveLock on relation 16440 of database 16385; blocked by process 16299. HINT: See server log for query details. > /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/postgresql/database_statements.rb:155:in `async_exec' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/postgresql/database_statements.rb:155:in `block in execute' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract_adapter.rb:484:in `block in log' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/notifications/instrumenter.rb:20:in `instrument' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract_adapter.rb:478:in `log' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/postgresql/database_statements.rb:154:in `execute' /mnt/arvados-api-db/arvados-api/current/app/controllers/arvados/v1/containers_controller.rb:28:in `block in update' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/database_statements.rb:213:in `block in transaction' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/transaction.rb:184:in `within_new_transaction' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/database_statements.rb:213:in `transaction' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/transactions.rb:220:in `transaction' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/transactions.rb:277:in `transaction' /mnt/arvados-api-db/arvados-api/current/app/controllers/arvados/v1/containers_controller.rb:27:in `update' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_controller/metal/implicit_render.rb:4:in `send_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/abstract_controller/base.rb:198:in `process_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_controller/metal/rendering.rb:10:in `process_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/abstract_controller/callbacks.rb:20:in `block in process_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:117:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:555:in `block (2 levels) in compile' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:505:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:498:in `block (2 levels) in around' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:313:in `block (2 levels) in halting' 
/mnt/arvados-api-db/arvados-api/current/app/controllers/application_controller.rb:385:in `set_current_request_id' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:432:in `block in make_lambda' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:312:in `block in halting' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:497:in `block in around' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:505:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:92:in `__run_callbacks__' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:778:in `_run_process_action_callbacks' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:81:in `run_callbacks' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/abstract_controller/callbacks.rb:19:in `process_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_controller/metal/rescue.rb:29:in `process_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_controller/metal/instrumentation.rb:32:in `block in process_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/notifications.rb:164:in `block in instrument' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/notifications/instrumenter.rb:20:in `instrument' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/notifications.rb:164:in `instrument' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_controller/metal/instrumentation.rb:30:in `process_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_controller/metal/params_wrapper.rb:250:in `process_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/railties/controller_runtime.rb:18:in `process_action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/abstract_controller/base.rb:137:in `process' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionview-4.2.10/lib/action_view/rendering.rb:30:in `process' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_controller/metal.rb:196:in `dispatch' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_controller/metal/rack_delegation.rb:13:in `dispatch' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_controller/metal.rb:237:in `block in action' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/routing/route_set.rb:74:in `dispatch' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/routing/route_set.rb:43:in `serve' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/journey/router.rb:43:in `block in serve' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/journey/router.rb:30:in `each' 
/var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/journey/router.rb:30:in `serve' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/routing/route_set.rb:817:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/omniauth-1.4.3/lib/omniauth/strategy.rb:186:in `call!' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/omniauth-1.4.3/lib/omniauth/strategy.rb:164:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/omniauth-1.4.3/lib/omniauth/builder.rb:63:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.6.10/lib/rack/etag.rb:24:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.6.10/lib/rack/conditionalget.rb:38:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.6.10/lib/rack/head.rb:13:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/params_parser.rb:27:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/flash.rb:260:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.6.10/lib/rack/session/abstract/id.rb:225:in `context' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.6.10/lib/rack/session/abstract/id.rb:220:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/cookies.rb:560:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/query_cache.rb:36:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.2.10/lib/active_record/connection_adapters/abstract/connection_pool.rb:653:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/callbacks.rb:29:in `block in call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:88:in `__run_callbacks__' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:778:in `_run_call_callbacks' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:81:in `run_callbacks' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/callbacks.rb:27:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/debug_exceptions.rb:17:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/show_exceptions.rb:30:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/lograge-0.10.0/lib/lograge/rails_ext/rack/logger.rb:15:in `call_app' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.2.10/lib/rails/rack/logger.rb:20:in `block in call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/tagged_logging.rb:68:in `block in tagged' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/tagged_logging.rb:26:in `tagged' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/tagged_logging.rb:68:in `tagged' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.2.10/lib/rails/rack/logger.rb:20:in `call' 
/var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/request_store-1.4.1/lib/request_store/middleware.rb:19:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/request_id.rb:21:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.6.10/lib/rack/methodoverride.rb:22:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.6.10/lib/rack/runtime.rb:18:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/cache/strategy/local_cache_middleware.rb:28:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.6.10/lib/rack/sendfile.rb:113:in `call' /mnt/arvados-api-db/arvados-api/current/app/middlewares/arvados_api_token.rb:63:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/remote_ip.rb:78:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.2.10/lib/rails/engine.rb:518:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.2.10/lib/rails/application.rb:165:in `call' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.2.10/lib/rails/railtie.rb:194:in `public_send' /var/www/arvados-api/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.2.10/lib/rails/railtie.rb:194:in `method_missing' /usr/lib/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:97:in `process_request' /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:157:in `accept_and_process_next_request' /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:110:in `main_loop' /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler.rb:415:in `block (3 levels) in start_threads' /usr/lib/ruby/vendor_ruby/phusion_passenger/utils.rb:113:in `block in create_thread_and_abort_on_exception' Error 1527782528+98750bce: 422 {"method":"PUT","path":"/arvados/v1/containers/eglyx-dz642-z8t8ztzfcpmc78o","format":"html","controller":"Arvados::V1::ContainersController","action":"update","status":422,"duration":1010.97,"view":0.4,"db":1004.98,"request_id":"req-be3rv5spsavqas7nxkjx","client_ipaddr":"172.27.16.250","client_auth":"eglyx-gj3su-xo6zhx218uqho7j","params":{"container":"{\"state\":\"Running\"}"},"@timestamp":"2018-05-31T16:02:08.115645347Z","@version":"1","message":"[422] PUT /arvados/v1/containers/eglyx-dz642-z8t8ztzfcpmc78o (Arvados::V1::ContainersController#update)"}
Updated by Joshua Randall over 6 years ago
It appears that update probably came from crunch-run:
https://github.com/curoverse/arvados/blob/master/services/crunch-run/crunchrun.go#L1271-L1272
Updated by Joshua Randall over 6 years ago
I've extracted some example requests from the API server's log during a 24s period when there were 96 HTTP 422 responses due to TRDeadlockDetected:
2018-05-31T16:33:14.260042457Z POST /arvados/v1/container_requests create 2018-05-31T16:33:14.380758178Z POST /arvados/v1/container_requests create 2018-05-31T16:33:19.228471368Z POST /arvados/v1/container_requests create 2018-05-31T16:33:25.918683968Z POST /arvados/v1/container_requests create 2018-05-31T16:33:26.013534586Z POST /arvados/v1/container_requests create 2018-05-31T16:33:26.023165819Z POST /arvados/v1/container_requests create 2018-05-31T16:33:26.494429790Z POST /arvados/v1/container_requests create 2018-05-31T16:33:29.764160890Z POST /arvados/v1/container_requests create 2018-05-31T16:33:29.857466106Z POST /arvados/v1/container_requests create 2018-05-31T16:33:39.986681830Z POST /arvados/v1/container_requests create 2018-05-31T16:33:44.379492460Z POST /arvados/v1/container_requests create 2018-05-31T16:33:53.811395309Z POST /arvados/v1/container_requests create 2018-05-31T16:33:53.837001477Z POST /arvados/v1/container_requests create 2018-05-31T16:34:01.357786379Z POST /arvados/v1/container_requests create 2018-05-31T16:34:01.569143198Z POST /arvados/v1/container_requests create 2018-05-31T16:34:01.589093287Z POST /arvados/v1/container_requests create 2018-05-31T16:34:01.597377434Z POST /arvados/v1/container_requests create 2018-05-31T16:34:01.672114451Z POST /arvados/v1/container_requests create 2018-05-31T16:34:02.112452335Z POST /arvados/v1/container_requests create 2018-05-31T16:34:02.511041340Z PUT /arvados/v1/containers/eglyx-dz642-5zvj4d0tf8wrjed update 2018-05-31T16:34:02.675445682Z POST /arvados/v1/container_requests create 2018-05-31T16:34:06.485674626Z POST /arvados/v1/container_requests create 2018-05-31T16:34:06.879895199Z PUT /arvados/v1/containers/eglyx-dz642-av0yenahb6wf2u9 update 2018-05-31T16:34:10.188776676Z POST /arvados/v1/container_requests create 2018-05-31T16:34:14.006035186Z POST /arvados/v1/container_requests create 2018-05-31T16:34:16.516127092Z POST /arvados/v1/container_requests create 2018-05-31T16:34:16.517689627Z POST /arvados/v1/container_requests create 2018-05-31T16:34:16.559788121Z POST /arvados/v1/container_requests create 2018-05-31T16:34:16.939764061Z POST /arvados/v1/containers/eglyx-dz642-e92sdvbsrzjq3fz/lock lock 2018-05-31T16:34:17.032186820Z POST /arvados/v1/container_requests create 2018-05-31T16:34:17.101631751Z POST /arvados/v1/container_requests create 2018-05-31T16:34:17.225529850Z POST /arvados/v1/container_requests create 2018-05-31T16:34:17.329000920Z POST /arvados/v1/container_requests create 2018-05-31T16:34:17.395110330Z POST /arvados/v1/container_requests create 2018-05-31T16:34:18.080849677Z PUT /arvados/v1/containers/eglyx-dz642-rqe87jbizyq4ygq update 2018-05-31T16:34:18.102241195Z POST /arvados/v1/container_requests create 2018-05-31T16:34:18.230772902Z POST /arvados/v1/container_requests create 2018-05-31T16:34:18.233212268Z POST /arvados/v1/container_requests create 2018-05-31T16:34:18.326708398Z POST /arvados/v1/container_requests create 2018-05-31T16:34:18.421412931Z PUT /arvados/v1/containers/eglyx-dz642-2uxh1sbc27beg9m update 2018-05-31T16:34:18.744867869Z PUT /arvados/v1/containers/eglyx-dz642-gcpz95ar6ds2fs6 update 2018-05-31T16:34:19.104888341Z POST /arvados/v1/container_requests create 2018-05-31T16:34:19.145671351Z POST /arvados/v1/container_requests create 2018-05-31T16:34:19.150510311Z POST /arvados/v1/container_requests create 2018-05-31T16:34:19.159019741Z POST /arvados/v1/container_requests create 2018-05-31T16:34:19.185260586Z POST /arvados/v1/container_requests create 2018-05-31T16:34:19.597076441Z 
PUT /arvados/v1/containers/eglyx-dz642-mbugca9159onqk0 update 2018-05-31T16:34:19.821330704Z PUT /arvados/v1/containers/eglyx-dz642-afhvel19ehbb2i7 update 2018-05-31T16:34:19.959645357Z POST /arvados/v1/containers/eglyx-dz642-e92sdvbsrzjq3fz/lock lock 2018-05-31T16:34:20.105113988Z POST /arvados/v1/container_requests create 2018-05-31T16:34:20.599093437Z POST /arvados/v1/container_requests create 2018-05-31T16:34:20.766082247Z POST /arvados/v1/container_requests create 2018-05-31T16:34:20.945262263Z POST /arvados/v1/container_requests create 2018-05-31T16:34:21.146963787Z PUT /arvados/v1/containers/eglyx-dz642-rqe87jbizyq4ygq update 2018-05-31T16:34:21.229772802Z PUT /arvados/v1/containers/eglyx-dz642-2uxh1sbc27beg9m update 2018-05-31T16:34:21.324271647Z POST /arvados/v1/container_requests create 2018-05-31T16:34:21.431301196Z PUT /arvados/v1/containers/eglyx-dz642-gcpz95ar6ds2fs6 update 2018-05-31T16:34:21.453103799Z PUT /arvados/v1/containers/eglyx-dz642-eawqda8qpq383op update 2018-05-31T16:34:21.936206582Z PUT /arvados/v1/containers/eglyx-dz642-mbugca9159onqk0 update 2018-05-31T16:34:21.990931132Z POST /arvados/v1/container_requests create 2018-05-31T16:34:22.224844343Z POST /arvados/v1/container_requests create 2018-05-31T16:34:22.489714373Z POST /arvados/v1/container_requests create 2018-05-31T16:34:22.656725375Z PUT /arvados/v1/containers/eglyx-dz642-yur8u0z55g2q860 update 2018-05-31T16:34:22.756557387Z PUT /arvados/v1/containers/eglyx-dz642-5fajehnmxzj138p update 2018-05-31T16:34:22.792780779Z PUT /arvados/v1/containers/eglyx-dz642-afhvel19ehbb2i7 update 2018-05-31T16:34:22.982379363Z POST /arvados/v1/containers/eglyx-dz642-e92sdvbsrzjq3fz/lock lock 2018-05-31T16:34:23.306001165Z POST /arvados/v1/container_requests create 2018-05-31T16:34:23.793704308Z POST /arvados/v1/container_requests create 2018-05-31T16:34:25.788260174Z PUT /arvados/v1/containers/eglyx-dz642-5fajehnmxzj138p update 2018-05-31T16:34:25.825857378Z PUT /arvados/v1/containers/eglyx-dz642-afhvel19ehbb2i7 update 2018-05-31T16:34:26.005828766Z POST /arvados/v1/containers/eglyx-dz642-e92sdvbsrzjq3fz/lock lock 2018-05-31T16:34:26.039441821Z PUT /arvados/v1/containers/eglyx-dz642-iq9a6qfdi9093iy update 2018-05-31T16:34:26.606532503Z POST /arvados/v1/container_requests create 2018-05-31T16:34:26.636961127Z POST /arvados/v1/container_requests create 2018-05-31T16:34:26.641369919Z POST /arvados/v1/container_requests create 2018-05-31T16:34:29.805022806Z POST /arvados/v1/container_requests create 2018-05-31T16:34:29.862335955Z POST /arvados/v1/container_requests create 2018-05-31T16:34:30.279329077Z PUT /arvados/v1/containers/eglyx-dz642-afhvel19ehbb2i7 update 2018-05-31T16:34:30.315241073Z POST /arvados/v1/container_requests create 2018-05-31T16:34:33.702315820Z POST /arvados/v1/container_requests create 2018-05-31T16:34:33.704909592Z POST /arvados/v1/container_requests create 2018-05-31T16:34:34.537123976Z POST /arvados/v1/container_requests create 2018-05-31T16:34:34.651974359Z PUT /arvados/v1/containers/eglyx-dz642-e92sdvbsrzjq3fz update 2018-05-31T16:34:34.774763233Z PUT /arvados/v1/containers/eglyx-dz642-wvr4ij3c4iim2es update 2018-05-31T16:34:34.815256088Z POST /arvados/v1/container_requests create 2018-05-31T16:34:35.615939670Z POST /arvados/v1/containers/eglyx-dz642-pjm5oj56zsd4yl3/lock lock 2018-05-31T16:34:35.708768189Z PUT /arvados/v1/containers/eglyx-dz642-dbftrk07cjov7gu update 2018-05-31T16:34:36.099553427Z PUT /arvados/v1/containers/eglyx-dz642-1fc3lk93b0p5cze update 2018-05-31T16:34:36.107335192Z POST 
/arvados/v1/container_requests create 2018-05-31T16:34:36.209065991Z PUT /arvados/v1/containers/eglyx-dz642-z73r0031xfpyr11 update 2018-05-31T16:34:36.450799636Z POST /arvados/v1/container_requests create 2018-05-31T16:34:36.763212494Z POST /arvados/v1/container_requests create 2018-05-31T16:34:36.999527184Z POST /arvados/v1/container_requests create 2018-05-31T16:34:37.100410859Z POST /arvados/v1/container_requests create 2018-05-31T16:34:37.142219710Z POST /arvados/v1/container_requests create
Updated by Joshua Randall over 6 years ago
That was supposed to be 84s, not 24s.
Updated by Tom Clegg over 6 years ago
containers_controller.rb:28 is "lock container_requests, containers" which can deadlock if a different thread acquires a conflicting lock on containers followed by a conflicting lock on container_requests.
Suspected example: When creating a new container_request with priority>0 and not reusing an existing container, we insert the new container (in the CR's before_validation hook!) before inserting the new container_request row.
Container.resolve() should lock both tables before creating a new container. I think it would also help to move the Container.resolve() call to a later point in the ContainerRequest lifecycle, so we don't hold that lock longer than necessary.
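A sketch of that ordering in SQL (illustrative only: the real inserts set many more columns, and the UUIDs here are made up):

BEGIN;
LOCK container_requests, containers IN EXCLUSIVE MODE;   -- acquire table locks before any row-level work
INSERT INTO containers (uuid, state, priority)
  VALUES ('zzzzz-dz642-newcontainer1ab', 'Queued', 1);
INSERT INTO container_requests (uuid, state, priority, container_uuid)
  VALUES ('zzzzz-xvhdp-newrequest1abcd', 'Committed', 1, 'zzzzz-dz642-newcontainer1ab');
COMMIT;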
Updated by Tom Morris over 6 years ago
- Using multiple exclusive table locks seems very heavyweight to me since it'll lock everything, even for readers
- models/container.rb:341 unlock method has a `reload(lock: 'FOR UPDATE')` which seems like it could conflict with the table locks.
- Using a raw SQL command `ActiveRecord::Base.connection.execute('LOCK container_requests, containers IN EXCLUSIVE MODE')` seems like it could potentially conflict with any locking that ActiveRecord is doing internally
Updated by Joshua Randall over 6 years ago
Three of us spent the whole day today looking at this problem, as it is preventing our system from processing data.
We initially tried implementing additional exclusive table locks:
ActiveRecord::Base.connection.execute('LOCK container_requests, containers IN EXCLUSIVE MODE')
Which we added in two additional places:
- container.rb line 109 in update_priority!
- container.rb line 158 just before the call to Container.create!
This did work, in that with those locks, there are no longer any HTTP 422 conflict errors.
However, our system remained effectively unusable. While submitting a-c-r workflows in parallel, they do work, although each one takes longer and longer (according to SQL debug logging, the LOCK ... EXCLUSIVE statement is entirely responsible for this increase). Eventually the locks took longer than 5 minutes and all container_request creation and container update requests began to fail with 504 errors.
In addition, since crunch-dispatch-slurm was also subject to these same delays, there were hardly any containers actually submitted to slurm: there were a handful at the beginning when the API server first started, but once things were underway crunch-dispatch-slurm would sit for several minutes after making a submission (I believe because it was trying to update the state of the container to Locked).
Updated by Joshua Randall over 6 years ago
To get us back up and running, we have removed all exclusive table locks, and are currently running with:
https://github.com/curoverse/arvados/pull/70
We appear to be not having any significant problems with this configuration. I am occasionally running the workaround query to fix up containers whose cancellation gets missed:
arvados_api_production=# update containers set priority = 0 where containers.uuid in (select c.uuid from containers as c join container_requests as cr on c.uuid = cr.container_uuid where c.priority > 0 and cr.priority <= 0);
UPDATE 52
arvados_api_production=# update containers set priority = 0 where containers.uuid in (select c.uuid from containers as c join container_requests as cr on c.uuid = cr.container_uuid where c.priority > 0 and cr.priority <= 0);
UPDATE 149
arvados_api_production=# update containers set priority = 0 where containers.uuid in (select c.uuid from containers as c join container_requests as cr on c.uuid = cr.container_uuid where c.priority > 0 and cr.priority <= 0);
UPDATE 107
Updated by Tom Clegg over 6 years ago
- remove table locks
- periodically (~5s) fix queued containers that have priority>0 but no outstanding requests
Updated by Joshua Randall over 6 years ago
As expected, without the table locks we see inconsistency in the container_request and container priority (but this is much much better than the table locks as we are still able to get work done, and employ the occasional SQL query workaround to clean things up).
Shouldn't it be possible to fix this with row-level locks on the containers table?
When a CR resolves a container, as soon as it gets the container uuid but before it updates its container_uuid with the uuid of the container, it should obtain a "FOR UPDATE" lock on the container and hold that lock while it updates its own container_request and subsequently calculates and updates the container's priority.
Any time a CR priority is changed, it should obtain a "FOR UPDATE" lock on the container row associated with its container_uuid before updating its own priority (and subsequently the container priority).
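A sketch of that row-level approach in SQL (illustrative; the UUIDs are placeholders borrowed from the test fixtures above, and the container_request bookkeeping is reduced to a single update):

BEGIN;
-- Pin the container row first; concurrent CR updates for the same container
-- serialize here instead of racing on the priority recomputation:
SELECT * FROM containers WHERE uuid = 'zzzzz-dz642-queuedcontainer' FOR UPDATE;
UPDATE container_requests SET priority = 0 WHERE uuid = 'zzzzz-xvhdp-cr4queuedcontnr';
UPDATE containers
   SET priority = (SELECT COALESCE(max(priority), 0)
                     FROM container_requests
                    WHERE container_uuid = 'zzzzz-dz642-queuedcontainer'
                      AND state = 'Committed')
 WHERE uuid = 'zzzzz-dz642-queuedcontainer';
COMMIT;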
Updated by Joshua Randall over 6 years ago
Incidentally, the workaround query we were using above is overzealous - we are now running one which is more careful about not accidentally cancelling containers that still have some associated container_requests with nonzero priority:
update containers set priority = 0 where uuid in (select ccr.uuid from (select c.uuid as uuid, max(cr.priority) as max_priority from containers as c left join container_requests as cr on c.uuid = cr.container_uuid where c.priority > 0 group by c.uuid) as ccr where ccr.max_priority <= 0);
Updated by Tom Clegg over 6 years ago
- "exclusive mode" doesn't block readers ("access exclusive mode" would do that)
- locks don't conflict with other locks in the same transaction
- the basic problem is the approach of updating all container priority values during each update transaction: it's not feasible (or necessary) to apply all of the consequences atomically during each update. We really should apply all of these cascading updates asynchronously, but for a short term workaround we can use the existing code, sans locking, with a background cleanup job to deal with the most troublesome inconsistencies, i.e., the "tries to run cancelled containers" bug originally reported here.
Updated by Lucas Di Pentima over 6 years ago
Updated by Joshua Randall over 6 years ago
7e6ac5a4967614cbe59ed5c0ec41c8be4d4cff4d looks good to me in terms of fixing the issue with containers with nonzero priority (which is the original subject of this issue).
However, in addition to the inconsistency of containers with priority > 0 that have no corresponding CRs with priority > 0, without any of the exclusive table locks we also occasionally observe the opposite (containers with zero priority that should have nonzero priority), and are running this script to fix them:
while true; do
  (
    echo "select container_requests.uuid, container_requests.priority
            from container_requests
            join containers on containers.uuid = container_requests.container_uuid
           where containers.priority = 0
             and container_requests.priority > 0
             and container_requests.state = 'Committed'
             and containers.state = 'Queued';" \
      | ssh arvados-api-db-eglyx sudo -u postgres psql --tuples-only --no-align --field-separator="," arvados_api_production \
      | while read cr_uuid_priority; do
          cr_uuid=$(echo "${cr_uuid_priority}" | cut -f1 -d,)
          priority=$(echo "${cr_uuid_priority}" | cut -f2 -d,)
          arv container_request update -u "${cr_uuid}" -c '{"priority":'$((${priority}+1))'}' | jq -r '.uuid'
        done | nl
  )
  echo "sleeping 15m"
  sleep 900
done
I.e. sometimes we find Queued containers with priority 0 which have extant Committed container requests with nonzero priority. The result is that the container_requests that submitted them (such as CWL runner containers) sit around forever waiting for a container that is never going to run. This workaround is a bit of a hack to fix them - it triggers a proper update to the container priority by incrementing the CR priority by 1. It would be better if we could just trigger the priority recalculation without changing the CR priority.
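For reference, the same fix-up could be done directly in SQL, assuming it's acceptable to set the container priority straight from the highest committed request priority; like the earlier psql workarounds, this writes to the database directly and bypasses the API server's callbacks:

update containers
   set priority = cr.max_priority
  from (select container_uuid, max(priority) as max_priority
          from container_requests
         where state = 'Committed' and priority > 0
         group by container_uuid) as cr
 where containers.uuid = cr.container_uuid
   and containers.state = 'Queued'
   and containers.priority = 0;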
Updated by Joshua Randall over 6 years ago
Tom Clegg wrote:
Some clarifications
- "exclusive mode" doesn't block readers ("access exclusive mode" would do that)
- locks don't conflict with other locks in the same transaction
- the basic problem is the approach of updating all container priority values during each update transaction: it's not feasible (or necessary) to apply all of the consequences atomically during each update. We really should apply all of these cascading updates asynchronously, but for a short term workaround we can use the existing code, sans locking, with a background cleanup job to deal with the most troublesome inconsistencies, i.e., the "tries to run cancelled containers" bug originally reported here.
Exclusive mode doesn't block readers, but in the previous code many of the reads were a part of a transaction that included a lock, and so were being blocked by that lock.
Also agree that locks that are compatible with each other appear to not conflict if they are in the same transaction - it also looks like it is safe to obtain a lock within a sub transaction if a compatible lock is held by the outer transaction.
I agree with you that ideally the code that updates container priority should not be run on every CR update. Rather than applying those updates asynchronously based on (potentially expensive) queries, it would probably be better to have a queue of some kind on which a CR change that may affect container priority (such as a change of CR priority, CR deletion, CR state change, etc) can be notified, and the asynchronous worker(s) can pick up work from that queue and do the priority recalculation.
Updated by Tom Clegg over 6 years ago
- Status changed from In Progress to Feedback
Added #13574 for longer term asynchronous solution.
Updated by Tom Clegg over 6 years ago
- Related to Idea #13574: [Controller] Update container priorities asynchronously added
Updated by Tom Morris over 6 years ago
- Target version changed from 2018-06-06 Sprint to 2018-06-20 Sprint
Updated by Peter Amstutz over 6 years ago
Got this on a branch based on 0809440ad583a2556c0f97fd000be9e9b7f94eb5
2018-06-07T14:53:40.458831348Z cwltool INFO: [step step1] start
2018-06-07T14:53:42.031923977Z arvados.cwl-runner ERROR: [container step1_2] got error <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/container_requests?alt=json returned "#<PG::TRDeadlockDetected: ERROR: deadlock detected
2018-06-07T14:53:42.031923977Z DETAIL: Process 20820 waits for RowExclusiveLock on relation 16497 of database 16444; blocked by process 10594.
2018-06-07T14:53:42.031923977Z Process 10594 waits for ExclusiveLock on relation 16509 of database 16444; blocked by process 20820.
2018-06-07T14:53:42.031923977Z HINT: See server log for query details.
2018-06-07T14:53:42.031923977Z >">
2018-06-07T14:53:42.041544047Z arvados.cwl-runner ERROR: [container step1] got error <HttpError 422 when requesting https://172.17.0.3:8000/arvados/v1/container_requests?alt=json returned "#<PG::TRDeadlockDetected: ERROR: deadlock detected
2018-06-07T14:53:42.041544047Z LINE 1: INSERT INTO "container_requests" ("name", "state", "containe...
2018-06-07T14:53:42.041544047Z ^
2018-06-07T14:53:42.041544047Z DETAIL: Process 24022 waits for RowExclusiveLock on relation 16497 of database 16444; blocked by process 10594.
2018-06-07T14:53:42.041544047Z Process 10594 waits for ExclusiveLock on relation 16509 of database 16444; blocked by process 24022.
2018-06-07T14:53:42.041544047Z HINT: See server log for query details.
2018-06-07T14:53:42.041544047Z >">
2018-06-07T14:53:42.042033766Z cwltool WARNING: [step step1] completed permanentFail
Updated by Tom Clegg over 6 years ago
- Status changed from Feedback to In Progress
Updated by Ward Vandewege over 6 years ago
- Related to Bug #13594: PG::TRDeadlockDetected when running cwl tests in parallel added
Updated by Tom Clegg over 6 years ago
- addresses situation in note-44 (priority=0 but should be >0)
- https://ci.curoverse.com/job/developer-run-tests/761/
Updated by Lucas Di Pentima over 6 years ago
Sorry for the delay on this, 187d6318298adca84193d8c78952e023f303bc2d LGTM.
Updated by Tom Clegg over 6 years ago
- Status changed from In Progress to Resolved