Bug #15942
closed[a-d-c] Doesn't fail when resources can't be satisfied?
Description
https://workbench.ce8i5.arvadosapi.com/container_requests/ce8i5-xvhdp-zi71xvgt3cbjsmq#Status
This container request asks for 2 vCPUs.
The cluster config only has one node type, which has 1 vCPU.
arvados-dispatch-cloud probably should have rejected it, but it scheduled it instead. Unfortunately then we asked Docker for 2 CPUs and it didn't like that:
2019-12-18T19:53:51.826922218Z Creating Docker container 2019-12-18T19:53:51.832452845Z error in Run: While creating container: Error response from daemon: Range of CPUs is from 0.01 to 1.00, as there are only 1 CPUs available
Updated by Peter Amstutz about 5 years ago
- Status changed from New to In Progress
Updated by Tom Clegg about 5 years ago
ii arvados-controller 1.4.2.20191212205527-1 amd64 Arvados cluster controller daemon ii arvados-dispatch-cloud 1.4.2.20191205154502-1 amd64 Arvados cluster cloud dispatch
a-d-c logs show the expected error ("cancel container with no suitable instance type") followed by attempts to run the container anyway.
{"ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","PID":119132,"error":"constraints not satisfiable by any configured instance type","level":"warning","msg":"cancel container with no suitable instance type","time":"2019-12-18T19:53:30.666716050Z"} {"ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","InstanceType":"a1v2","PID":119132,"Priority":562948376722770419,"State":"Locked","level":"info","msg":"adding container to queue","time":"2019-12-18T19:53:30.749278632Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:53:30.782655808Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:53:40.605524849Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:53:50.778935999Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:54:00.602361070Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:54:10.721237795Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:54:20.601629762Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:54:30.709696522Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:54:40.605795842Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process started","time":"2019-12-18T19:54:50.734759199Z"} {"Address":"10.47.0.8","ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-3w5kue90h391hho","InstanceType":"a1v2","PID":119132,"level":"info","msg":"crunch-run process ended","time":"2019-12-18T19:55:00.594075637Z"} {"ContainerUUID":"ce8i5-dz642-reervfqxy9hla0f","PID":119132,"State":"Cancelled","level":"info","msg":"dropping container from queue","time":"2019-12-18T19:55:10.634223282Z"}
controller shows multiple cycles of lock (by dispatcher) + unlock (by worker node):
{"PID":41701,"RequestID":"req-k00k7pe4rkuhtv3y6o65","level":"info","msg":"request","remoteAddr":"127.0.0.1:45574","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"GET","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f","reqQuery":"include_trash=true\u0026select=%5B%22uuid%22%5D","time":"2019-12-18T19:53:21.922992435Z"} {"PID":41701,"RequestID":"req-k00k7pe4rkuhtv3y6o65","level":"info","msg":"response","remoteAddr":"127.0.0.1:45574","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"GET","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f","reqQuery":"include_trash=true\u0026select=%5B%22uuid%22%5D","respBytes":1564,"respStatus":"OK","respStatusCode":200,"time":"2019-12-18T19:53:21.934605230Z","timeToStatus":0.011520,"timeTotal":0.011590,"timeWriteBody":0.000070} {"PID":41701,"RequestID":"req-53z0j6pu0x7dvnuod779","level":"info","msg":"request","remoteAddr":"127.0.0.1:45664","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"POST","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f/lock","reqQuery":"","time":"2019-12-18T19:53:30.667409656Z"} {"PID":41701,"RequestID":"req-53z0j6pu0x7dvnuod779","level":"info","msg":"response","remoteAddr":"127.0.0.1:45664","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"POST","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f/lock","reqQuery":"","respBytes":174,"respStatus":"OK","respStatusCode":200,"time":"2019-12-18T19:53:30.747416617Z","timeToStatus":0.079848,"timeTotal":0.079990,"timeWriteBody":0.000142} {"PID":41701,"RequestID":"req-12e4zygc4ld1wik6t9j2","level":"info","msg":"request","remoteAddr":"127.0.0.1:45690","reqBytes":42,"reqForwardedFor":"52.177.181.122","reqHost":"ce8i5.arvadosapi.com","reqMethod":"PUT","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f","reqQuery":"","time":"2019-12-18T19:53:31.858865397Z"} {"PID":41701,"RequestID":"req-12e4zygc4ld1wik6t9j2","level":"info","msg":"response","remoteAddr":"127.0.0.1:45690","reqBytes":42,"reqForwardedFor":"52.177.181.122","reqHost":"ce8i5.arvadosapi.com","reqMethod":"PUT","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f","reqQuery":"","respBytes":1564,"respStatus":"OK","respStatusCode":200,"time":"2019-12-18T19:53:31.889957654Z","timeToStatus":0.031041,"timeTotal":0.031081,"timeWriteBody":0.000040} {"PID":41701,"RequestID":"req-1p3oilkv0upa8lf45oxq","level":"info","msg":"request","remoteAddr":"127.0.0.1:45796","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"POST","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f/lock","reqQuery":"","time":"2019-12-18T19:53:50.646289403Z"} {"PID":41701,"RequestID":"req-1p3oilkv0upa8lf45oxq","level":"info","msg":"response","remoteAddr":"127.0.0.1:45796","reqBytes":0,"reqForwardedFor":"52.147.168.174","reqHost":"ce8i5.arvadosapi.com","reqMethod":"POST","reqPath":"arvados/v1/containers/ce8i5-dz642-reervfqxy9hla0f/lock","reqQuery":"","respBytes":174,"respStatus":"OK","respStatusCode":200,"time":"2019-12-18T19:53:50.741234388Z","timeToStatus":0.094782,"timeTotal":0.094930,"timeWriteBody":0.000148}
Updated by Tom Clegg about 5 years ago
(*lib/dispatcher/container.Queue)addEnt() detects that the container can't run and (instead of actually adding a queue entry) starts a goroutine to deal with it. That goroutine calls Lock(), which calls /arvados/v1/containers/$uuid/lock and updates the queue with the response using the generic "addEnt() or update existing entry" code. In this case there's no existing entry, so it calls addEnt() again.
This time, RuntimeConstraints is empty, because the lock and unlock APIs have an implicit "select" param that doesn't include runtime_constraints -- so the container gets dispatched as if it had no runtime constraints, i.e., on the lowest priced node type.
I think the easiest solution is for updateWithResp (called by Cancel and apiUpdate) to just discard the response, instead of adding an entry, if the relevant container is no longer in the queue. I don't see any cases where it makes sense to add a new entry when a just-locked/unlocked/cancelled container has disappeared from the queue and needs to be re-added.
Updated by Tom Clegg about 5 years ago
15942-cancel-unsatisfiable @ 6162a7fc5b213446b808313770d0d7e491fa5723 -- developer-run-tests: #1696
Updated by Tom Clegg about 5 years ago
15942-cancel-unsatisfiable @ ef48a49d302ca25136bbb2ed96319a8b56b46250 with a test that confirms the fix.
Updated by Lucas Di Pentima about 5 years ago
Sorry for the delay, I was getting more familiar with the code. Although your explanation makes perfect sense and the fix LGTM, I tried running the test with the pre-fix queue.go code and it succeeded. Maybe there's an issue with the test itself?
Updated by Tom Clegg about 5 years ago
Indeed, if I un-fix the bug, the test only fails for me about 80% of the time. I've updated the test to listen to the queue's update channel instead, and (although I still don't think it's airtight) this caught the reintroduced bug 15/15 times.
15942-cancel-unsatisfiable @ d01a136c64bb5e342542b7de4da45e09e0961dc1
Updated by Lucas Di Pentima about 5 years ago
Test works great on my side too, please merge. Thanks!
Updated by Tom Clegg about 5 years ago
- Status changed from In Progress to Resolved
Updated by Peter Amstutz about 5 years ago
- Target version changed from 2020-01-15 Sprint to 2020-01-02 Sprint
Updated by Tom Clegg about 5 years ago
- Related to Bug #16036: lib/dispatchcloud/container IntegrationSuite.TestGetLockUnlockCancel is flaky added