Bug #15695
closed[a-d-c] Long delay before cloud dispatcher starts jobs on playground
Description
This workflow wants 100 parallel jobs running the same code over different date.
There are two separate runs shown in the Prometheus graph below:
https://prometheus.curoverse.com/consoles/qr1hi/index.html#pctc%7B%22duration%22%3A7200%2C%22endTime%22%3A1570223940%7D
The timeline is (all times UTC):
19:24 First run submitted with requirements for 100 x 4 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-ctw1t6m8z718emc
19:41 17 nodes with 4 cores each started
19:57 Workflow canceled
19:57 75 nodes idle
19:58 Second run submitted with edited run time requirements for 100 x 2 core nodes - https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-2ry6g3l031wlygu
20:00 71 nodes idle from 1st cancelled workflow
20:03 1 node busy, 0 nodes idle
20:26 1st node child container started
20:31 2nd node for child container start
20:36 39 nodes booting for child containers
20:40 another 22 nodes start booting
20:46 final 7 nodes start booting
20:55 All 100 containers finally running
Files
Updated by Tom Clegg about 5 years ago
- File 15695-nodes.png 15695-nodes.png added
Updated by Tom Morris about 5 years ago
The 26 queued containers are some secondary artifact left over from the first cancelled run.
Not sure it adds any additional info, but a run submitted over the weekend showed a normal startup profile with 120 nodes up and running within 15 minutes:
https://prometheus.curoverse.com/consoles/qr1hi/index.html#pctc%7B%22duration%22%3A21600%2C%22endTime%22%3A1570445340%7D
https://workbench.qr1hi.arvadosapi.com/container_requests/qr1hi-xvhdp-o40xgjibykcsqmk (it failed later for an unrelated reason)
Updated by Tom Clegg about 5 years ago
During the dead zone (20:00 to 20:36) containers are locked but no nodes are booting. It seems like arvados-dispatch-cloud isn't able to start new nodes. Errors would have been logged, but logs from that time are already deleted.
Config has MaxCloudOpsPerSecond=1. That seems low, but not low enough to cause starvation.
Updated by Ward Vandewege about 5 years ago
- Subject changed from Long delay before cloud dispatcher starts jobs on playground to [a-d-c] Long delay before cloud dispatcher starts jobs on playground
Updated by Tom Morris about 5 years ago
- Assigned To set to Tom Morris
- Target version set to 2019-10-23 Sprint
I'll attempt to reproduce this next week when folks are available to look at the logs before they get deleted.
Updated by Tom Morris about 5 years ago
- Target version changed from 2019-10-23 Sprint to 2019-11-06 Sprint
Updated by Tom Morris about 5 years ago
- Target version changed from 2019-11-06 Sprint to 2019-11-20 Sprint
Updated by Tom Morris about 5 years ago
- Target version changed from 2019-11-20 Sprint to 2019-12-04 Sprint
Updated by Tom Morris almost 5 years ago
- Target version changed from 2019-12-04 Sprint to 2020-01-02 Sprint
Updated by Tom Clegg almost 5 years ago
- Assigned To deleted (
Tom Morris) - Status changed from New to Closed
Problem still unknown, but hasn't happened (or hasn't been noticed) since first reported.