Bug #15942

[a-d-c] Doesn't fail when resources can't be satisfied?

Added by Peter Amstutz 7 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
12/19/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

https://workbench.ce8i5.arvadosapi.com/container_requests/ce8i5-xvhdp-zi71xvgt3cbjsmq#Status

This container request asks for 2 vCPUs.

The cluster config only has one node type, which has 1 vCPU.

arvados-dispatch-cloud probably should have rejected it, but it scheduled it instead. Unfortunately then we asked Docker for 2 CPUs and it didn't like that:

2019-12-18T19:53:51.826922218Z Creating Docker container
2019-12-18T19:53:51.832452845Z error in Run: While creating container: Error response from daemon: Range of CPUs is from 0.01 to 1.00, as there are only 1 CPUs available

Subtasks

Task #15943: Review 15942-cancel-unsatisfiableResolvedTom Clegg


Related issues

Related to Arvados - Bug #16036: lib/dispatchcloud/container IntegrationSuite.TestGetLockUnlockCancel is flakyDuplicate

Associated revisions

Revision 1080a6b8
Added by Tom Clegg 7 months ago

Merge branch '15942-cancel-unsatisfiable'

refs #15942

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Peter Amstutz 7 months ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz 7 months ago

  • Description updated (diff)

#3 Updated by Tom Clegg 7 months ago

ii  arvados-controller                 1.4.2.20191212205527-1                      amd64        Arvados cluster controller daemon
ii  arvados-dispatch-cloud             1.4.2.20191205154502-1                      amd64        Arvados cluster cloud dispatch

a-d-c logs show the expected error ("cancel container with no suitable instance type") followed by attempts to run the container anyway.

{"ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","PID":119132,"error":"constraints not satisfiable by any configured instance type","level":"warning","msg":"cancel container with no suitable instance type","time":"2019-12-18T19:53:30.666716050Z"}
{"ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","InstanceType":"a1v2","PID":119132,"Priority":562948376722770419,"State":"Locked","level":"info","msg":"adding container to queue","time":"2019-12-18T19:53:30.749278632Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:53:30.782655808Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:53:40.605524849Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:53:50.778935999Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:54:00.602361070Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:54:10.721237795Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:54:20.601629762Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:54:30.709696522Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:54:40.605795842Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:54:50.734759199Z"}
{"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:55:00.594075637Z"}
{"ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","PID":119132,"State":"Cancelled","level":"info","msg":"dropping container from queue","time":"2019-12-18T19:55:10.634223282Z"}

controller shows multiple cycles of lock (by dispatcher) + unlock (by worker node):

{"PID":41701,"RequestID":"req-k00k7pe4rkuhtv3y6o65","level":"info","msg":"request","remoteAddr":"127.0.0.1:45574","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"GET","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f","reqQuery":"include_trash=true\u0026select=%5B%22uuid%22%5D","time":"2019-12-18T19:53:21.922992435Z"}
{"PID":41701,"RequestID":"req-k00k7pe4rkuhtv3y6o65","level":"info","msg":"response","remoteAddr":"127.0.0.1:45574","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"GET","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f","reqQuery":"include_trash=true\u0026select=%5B%22uuid%22%5D","respBytes":1564,"respStatus":"OK","respStatusCode":200,"time":"2019-12-18T19:53:21.934605230Z","timeToStatus":0.011520,"timeTotal":0.011590,"timeWriteBody":0.000070}

{"PID":41701,"RequestID":"req-53z0j6pu0x7dvnuod779","level":"info","msg":"request","remoteAddr":"127.0.0.1:45664","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"POST","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f/lock","reqQuery":"","time":"2019-12-18T19:53:30.667409656Z"}
{"PID":41701,"RequestID":"req-53z0j6pu0x7dvnuod779","level":"info","msg":"response","remoteAddr":"127.0.0.1:45664","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"POST","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f/lock","reqQuery":"","respBytes":174,"respStatus":"OK","respStatusCode":200,"time":"2019-12-18T19:53:30.747416617Z","timeToStatus":0.079848,"timeTotal":0.079990,"timeWriteBody":0.000142}

{"PID":41701,"RequestID":"req-12e4zygc4ld1wik6t9j2","level":"info","msg":"request","remoteAddr":"127.0.0.1:45690","reqBytes":42,"reqForwardedFor":"52.177.181.122","reqHost":"ce8i5.arvadosapi.com","reqMethod":"PUT","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f","reqQuery":"","time":"2019-12-18T19:53:31.858865397Z"}
{"PID":41701,"RequestID":"req-12e4zygc4ld1wik6t9j2","level":"info","msg":"response","remoteAddr":"127.0.0.1:45690","reqBytes":42,"reqForwardedFor":"52.177.181.122","reqHost":"ce8i5.arvadosapi.com","reqMethod":"PUT","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f","reqQuery":"","respBytes":1564,"respStatus":"OK","respStatusCode":200,"time":"2019-12-18T19:53:31.889957654Z","timeToStatus":0.031041,"timeTotal":0.031081,"timeWriteBody":0.000040}

{"PID":41701,"RequestID":"req-1p3oilkv0upa8lf45oxq","level":"info","msg":"request","remoteAddr":"127.0.0.1:45796","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"POST","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f/lock","reqQuery":"","time":"2019-12-18T19:53:50.646289403Z"}
{"PID":41701,"RequestID":"req-1p3oilkv0upa8lf45oxq","level":"info","msg":"response","remoteAddr":"127.0.0.1:45796","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"POST","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f/lock","reqQuery":"","respBytes":174,"respStatus":"OK","respStatusCode":200,"time":"2019-12-18T19:53:50.741234388Z","timeToStatus":0.094782,"timeTotal":0.094930,"timeWriteBody":0.000148}

#4 Updated by Tom Clegg 7 months ago

(*lib/dispatcher/container.Queue)addEnt() detects that the container can't run and (instead of actually adding a queue entry) starts a goroutine to deal with it. That goroutine calls Lock(), which calls /arvados/v1/containers/$uuid/lock and updates the queue with the response using the generic "addEnt() or update existing entry" code. In this case there's no existing entry, so it calls addEnt() again.

This time, RuntimeConstraints is empty, because the lock and unlock APIs have an implicit "select" param that doesn't include runtime_constraints -- so the container gets dispatched as if it had no runtime constraints, i.e., on the lowest priced node type.

I think the easiest solution is for updateWithResp (called by Cancel and apiUpdate) to just discard the response, instead of adding an entry, if the relevant container is no longer in the queue. I don't see any cases where it makes sense to add a new entry when a just-locked/unlocked/cancelled container has disappeared from the queue and needs to be re-added.

#5 Updated by Tom Clegg 7 months ago

  • Assigned To set to Tom Clegg

#7 Updated by Tom Clegg 7 months ago

15942-cancel-unsatisfiable @ ef48a49d302ca25136bbb2ed96319a8b56b46250 with a test that confirms the fix.

#8 Updated by Lucas Di Pentima 7 months ago

Sorry for the delay, I was getting more familiar with the code. Although your explanation makes perfect sense and the fix LGTM, I tried running the test with the pre-fix queue.go code and it succeeded. Maybe there's an issue with the test itself?

#9 Updated by Tom Clegg 7 months ago

Indeed, if I un-fix the bug, the test only fails for me about 80% of the time. I've updated the test to listen to the queue's update channel instead, and (although I still don't think it's airtight) this caught the reintroduced bug 15/15 times.

15942-cancel-unsatisfiable @ d01a136c64bb5e342542b7de4da45e09e0961dc1

#10 Updated by Lucas Di Pentima 7 months ago

Test works great on my side too, please merge. Thanks!

#11 Updated by Tom Clegg 7 months ago

  • Status changed from In Progress to Resolved

#12 Updated by Peter Amstutz 6 months ago

  • Target version changed from 2020-01-15 Sprint to 2020-01-02 Sprint

#13 Updated by Tom Clegg 6 months ago

  • Related to Bug #16036: lib/dispatchcloud/container IntegrationSuite.TestGetLockUnlockCancel is flaky added

#14 Updated by Peter Amstutz 6 months ago

  • Release set to 22

Also available in: Atom PDF