[a-d-c] Long delay before cloud dispatcher starts jobs on playground
This workflow wants 100 parallel jobs, each running the same code over a different date.
There are two separate runs shown in the Prometheus graph below:
The timeline is (all times UTC):
19:24 First run submitted with requirements for 100 x 4 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-ctw1t6m8z718emc
19:41 17 nodes with 4 cores each started
19:57 Workflow canceled
19:57 75 nodes idle
19:58 Second run submitted with edited runtime requirements for 100 x 2 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-2ry6g3l031wlygu
20:00 71 nodes idle from 1st cancelled workflow
20:03 1 node busy, 0 nodes idle
20:26 1st node for a child container started
20:31 2nd node for a child container started
20:36 39 nodes booting for child containers
20:40 another 22 nodes start booting
20:46 final 7 nodes start booting
20:55 All 100 containers finally running
The 26 queued containers are a secondary artifact left over from the first, cancelled run.
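To confirm that, one way to check next time would be to list the child container requests of the cancelled run that never reached the Final state. A minimal sketch using the standard Arvados Python SDK (the parent UUID is the first run's container request from the timeline above):

    import arvados

    api = arvados.api('v1')
    # Container request of the first (cancelled) run, from the URL above.
    parent_cr = api.container_requests().get(
        uuid='qr1hi-xvhdp-ctw1t6m8z718emc').execute()
    # Child container requests that were still pending when the run was cancelled.
    children = api.container_requests().list(
        filters=[['requesting_container_uuid', '=', parent_cr['container_uuid']],
                 ['state', '!=', 'Final']],
        limit=1000).execute()
    for cr in children['items']:
        print(cr['uuid'], cr['state'], cr['container_uuid'])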
Not sure it adds any additional info, but a run submitted over the weekend showed a normal startup profile with 120 nodes up and running within 15 minutes:
https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-o40xgjibykcsqmk (it failed later for an unrelated reason)
During the dead zone (20:00 to 20:36) containers are locked but no nodes are booting. It looks like arvados-dispatch-cloud isn't able to start new nodes. Any errors would have been logged, but the logs from that time have already been deleted.
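Since the logs are gone, it would help to snapshot the dispatcher's own view of its queue and instances while a dead zone like this is in progress. A rough sketch, assuming arvados-dispatch-cloud's management API is reachable at the address below (placeholder) and MGMT_TOKEN holds the cluster's ManagementToken:

    import json
    import urllib.request

    DISPATCH_URL = 'http://localhost:9006'  # placeholder; use the real management address
    MGMT_TOKEN = 'xxxxxxxx'                 # placeholder; the cluster's ManagementToken

    # Dump the dispatcher's container queue and instance list.
    for path in ('/arvados/v1/dispatch/containers',
                 '/arvados/v1/dispatch/instances'):
        req = urllib.request.Request(
            DISPATCH_URL + path,
            headers={'Authorization': 'Bearer ' + MGMT_TOKEN})
        with urllib.request.urlopen(req) as resp:
            print(path)
            print(json.dumps(json.load(resp), indent=2))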
The config has MaxCloudOpsPerSecond=1. That seems low, but not low enough to cause this kind of starvation on its own.
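A rough sanity check on that, assuming the limit applies to the instance Create calls issued by the dispatcher: even at 1 op/sec, ~100 node creations should be delayed by minutes, nowhere near the ~36-minute dead zone above.

    # Back-of-envelope check on MaxCloudOpsPerSecond=1 (assumes the limit
    # applies to instance Create calls).
    ops_per_second = 1
    create_calls = 100                               # ~100 nodes requested
    throttle_delay = create_calls / ops_per_second   # ~100 s of rate limiting
    dead_zone_minutes = 36                           # 20:00 -> 20:36
    print(f"rate-limit delay ~{throttle_delay:.0f}s vs dead zone {dead_zone_minutes * 60}s")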