Bug #20894

Updated by Peter Amstutz 9 months ago

I just found an interesting cloud dispatch bug. We start with maxConcurrency at 16 and SupervisorFraction at 0.30. If each workflow starts only one subprocess, we use only 0.60 of maxConcurrency (0.30 for the supervisors plus 0.30 for their children). The problem is that maxSupervisors is derived from maxConcurrency. So even though there is a big backlog of supervisor processes waiting to run, they don't get scheduled. And because we're only using 60% of capacity, the dispatcher doesn't try to raise maxConcurrency either.

I think the answer is that the default value of SupervisorFraction has to be 50% (so that one child per supervisor fills every slot),

and/or the default value of InitialQuotaEstimate should be 0 (which sets it to match MaxInstances).
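To make the feedback loop concrete, here is a toy model in Python. It is a sketch under stated assumptions, not the actual dispatcher code. The function `simulate_limit` is hypothetical. It assumes three things: the supervisor cap is floor(limit × SupervisorFraction), the limit is raised only when every slot is occupied, and the growth step (doubling, capped at MaxInstances) is invented purely for illustration. With integer rounding, 16 × 0.30 gives 4 supervisors and 8 of 16 slots in use (50% rather than the unrounded 60%), which produces the same stall.

```python
import math


def simulate_limit(initial_limit: int, max_instances: int, supervisor_fraction: float,
                   children_per_workflow: int, backlog: int, ticks: int = 10) -> int:
    """Return the concurrency limit after `ticks` rounds of a toy dispatcher model.

    Hypothetical helper: every rule below is an illustrative assumption,
    not the real dispatcher logic.
    """
    # An initial estimate of 0 means "match MaxInstances".
    limit = initial_limit if initial_limit > 0 else max_instances
    for _ in range(ticks):
        # Assumption: the supervisor cap is floor(limit * SupervisorFraction).
        max_supervisors = math.floor(limit * supervisor_fraction)
        supervisors = min(backlog, max_supervisors)
        # Each running supervisor also occupies one slot per child.
        running = min(limit, supervisors * (1 + children_per_workflow))
        # Assumption: the limit is raised only when every slot is in use.
        # The growth step (double, capped at max_instances) is hypothetical.
        if running >= limit and limit < max_instances:
            limit = min(max_instances, limit * 2)
    return limit


# Reported case: 16 slots, 30% supervisors, one child each; the limit never grows.
print(simulate_limit(16, 400, 0.30, 1, backlog=1000))  # 16
# SupervisorFraction = 0.5: one child per supervisor fills every slot.
print(simulate_limit(16, 400, 0.50, 1, backlog=1000))  # 400
# InitialQuotaEstimate = 0: start at MaxInstances instead.
print(simulate_limit(0, 400, 0.30, 1, backlog=1000))   # 400
```

Under the same toy rules, SupervisorFraction = 0.45 with one child per workflow also stalls (7 supervisors × 2 = 14 of 16 slots). A phase with two parallel children (7 × 3 = 21 ≥ 16) lets the limit grow.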

I didn't see this in the scale cluster test, but I had already adjusted SupervisorFraction to 0.45 and InitialQuotaEstimate to 400. Also, the synthetic workflow does have a phase where it submits two parallel jobs, which would push the queue up to the maximum.

So I wonder whether it would have been stuck at 400 if it didn't have a parallel-job phase.