Bug #15695
closed[a-d-c] Long delay before cloud dispatcher starts jobs on playground
Description
This workflow wants 100 parallel jobs running the same code over different date.
There are two separate runs shown in the Prometheus graph below:
https://prometheus.curoverse.com/consoles/qr1hi/index.html#pctc%7B%22duration%22%3A7200%2C%22endTime%22%3A1570223940%7D
The timeline is (all times UTC):
19:24 First run submitted with requirements for 100 x 4 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-ctw1t6m8z718emc
19:41 17 nodes with 4 cores each started
19:57 Workflow canceled
19:57 75 nodes idle
19:58 Second run submitted with edited run time requirements for 100 x 2 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-2ry6g3l031wlygu
20:00 71 nodes idle from 1st cancelled workflow
20:03 1 node busy, 0 nodes idle
20:26 1st node child container started
20:31 2nd node for child container start
20:36 39 nodes booting for child containers
20:40 another 22 nodes start booting
20:46 final 7 nodes start booting
20:55 All 100 containers finally running
Files