https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422021-09-03T20:06:21ZArvadosArvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=967692021-09-03T20:06:21ZJiayong Lijli@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=967702021-09-03T20:07:23ZJiayong Lijli@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/96770/diff?detail_id=93433">diff</a>)</li></ul> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=967852021-09-03T20:52:09ZTom Cleggtom@curii.com
<ul></ul><p>example of affected container: <a href="https://arvadosapi.com/2xpu4-dz642-3duf09jaxd3b9gv">2xpu4-dz642-3duf09jaxd3b9gv</a></p>
<pre>
{"ContainerUUID":"<a href="https://arvadosapi.com/2xpu4-dz642-3duf09jaxd3b9gv">2xpu4-dz642-3duf09jaxd3b9gv</a>","InstanceType":"m58xlarge.spot","PID":23276,"Priority":562948322809997302,"State":"Queued","level":"info","msg":"adding container to queue","time":"2021-09-02T20:20:53.543265064Z"}
{"ContainerUUID":"<a href="https://arvadosapi.com/2xpu4-dz642-3duf09jaxd3b9gv">2xpu4-dz642-3duf09jaxd3b9gv</a>","InstanceType":"m58xlarge.spot","PID":23276,"level":"info","msg":"creating new instance","time":"2021-09-02T20:21:51.709046394Z"}
{"InstanceType":"m58xlarge.spot","PID":23276,"error":"InsufficientInstanceCapacity: We currently do not have sufficient m5.8xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get m5.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.\n\tstatus code: 500, request id: 1a987cc4-80fc-4105-85bb-2d25fdf8dbbd","level":"error","msg":"create failed","time":"2021-09-02T20:21:52.628528300Z"}
{"ContainerUUID":"<a href="https://arvadosapi.com/2xpu4-dz642-3duf09jaxd3b9gv">2xpu4-dz642-3duf09jaxd3b9gv</a>","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Queued\u003e (req-a9h7sg2h1e1u11ity3xs)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:53.392472387Z"}
{"ContainerUUID":"<a href="https://arvadosapi.com/2xpu4-dz642-3duf09jaxd3b9gv">2xpu4-dz642-3duf09jaxd3b9gv</a>","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Queued\u003e (req-14br8w8ut0nf9nek9xex)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:53.974055346Z"}
{"ContainerUUID":"<a href="https://arvadosapi.com/2xpu4-dz642-3duf09jaxd3b9gv">2xpu4-dz642-3duf09jaxd3b9gv</a>","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Cancelled\u003e (req-4f9hcjkzs9x61409czrt)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:56.305846772Z"}
{"ContainerUUID":"<a href="https://arvadosapi.com/2xpu4-dz642-3duf09jaxd3b9gv">2xpu4-dz642-3duf09jaxd3b9gv</a>","PID":23276,"State":"Cancelled","level":"info","msg":"container finished -- dropping from queue","time":"2021-09-02T20:21:56.305932024Z"}
{"ContainerUUID":"<a href="https://arvadosapi.com/2xpu4-dz642-3duf09jaxd3b9gv">2xpu4-dz642-3duf09jaxd3b9gv</a>","PID":23276,"State":"Cancelled","level":"info","msg":"dropping container from queue","time":"2021-09-02T20:21:56.305946078Z"}
</pre> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=967862021-09-03T20:52:18ZTom Cleggtom@curii.com
<ul><li><strong>Target version</strong> set to <i>2021-09-15 sprint</i></li></ul> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=968052021-09-06T18:27:39ZTom Cleggtom@curii.com
<ul></ul><a name="Flapping-lockunlock-calls"></a>
<h4 >Flapping lock/unlock calls<a href="#Flapping-lockunlock-calls" class="wiki-anchor">¶</a></h4>
<p>The scheduler assumed, incorrectly, that sorting containers by priority would give consistent results from one iteration to the next. This is not true when multiple containers have identical priority, which is how we do child containers. As a result, in a situation where the scheduler is waiting for new instances to come up and knows that there are more containers than new instances, the scheduler keeps changing its mind about which containers are "first" and should be locked, and which should be unlocked because they won't fit on the new instances.</p>
<p>Fix by tracking FirstSeenAt timestamp on each queue entry, and using that to sort identical-priority containers.</p>
<a name="Invalid-state-transition-errors"></a>
<h4 >"Invalid state transition" errors<a href="#Invalid-state-transition-errors" class="wiki-anchor">¶</a></h4>
<p>In runQueue, when ctr.State is Locked and sch.pool.AtQuota() is true, we call sch.queue.Unlock(), but we also include the current container in the overquota slice, which means we call sch.queue.Unlock() on it again in the "unlock any containers ..." block below. Fix by removing the first call to sch.queue.Unlock().</p>
<p>18102-max-dispatch-attempts @ <a class="changeset" title="18102: Improve logging. Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@curii.com>" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/6ba70e34ec9bc9941e8c04d6654106e6d1bbae82">6ba70e34ec9bc9941e8c04d6654106e6d1bbae82</a> -- <a class="external" href="https://ci.arvados.org/view/Developer/job/developer-run-tests/2678/"<a href="https://ci.arvados.org/view/Developer/job/developer-run-tests/2678/">developer-run-tests: #2678 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=2678" alt="" /></a></a></p>
<a name="extra-thoughts"></a>
<h4 >extra thoughts<a href="#extra-thoughts" class="wiki-anchor">¶</a></h4>
<p>Rate limiting:</p>
<p>We're getting some rate-limiting errors (RequestLimitExceeded) from AWS. Not clear whether it's caused by having additional create requests in progress when the first one hits InsufficientInstanceCapacity, or just by creating several dozen instances at once.</p>
<p>Either way, these configs should probably not be zero/unlimited. Perhaps we should update the defaults to something reasonable.</p>
<pre>
MaxCloudOpsPerSecond: 0
MaxConcurrentInstanceCreateOps: 0
</pre>
<p>MaxConcurrentInstanceCreateOps was invented to accommodate Azure but it seems useful for AWS too.</p>
<p>Error handling:</p>
<p>If we get an error from the cloud provider when trying to create an instance, but the error isn't recognized as a quota or rate-limiting error, we retry very quickly, which would be unhelpful for an error like "invalid instance type". I don't think this is happening here (InsufficientInstanceCapacity is recognized as a quota error) but we might want to consider other options, like a per-instance-type quiet period for unrecognized errors (which would probably be a better way to respond to InsufficientInstanceCapacity as well).</p> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=968072021-09-06T19:32:09ZTom Cleggtom@curii.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-3 priority-4 priority-default closed parent" href="/issues/11561">Bug #11561</a>: [API] Limit number of lock/unlock cycles for a given container</i> added</li></ul> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=968242021-09-07T15:15:05ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Just one small question:</p>
<ul>
<li>At file <code>lib/dispatchcloud/scheduler/sync.go</code>
<ul>
<li>Wouldn't the changes at lines 87-93 avoid almost completely the logging of unexpected state? Maybe we could log the message just once per container?</li>
</ul></li>
</ul>
<p>The rest LGTM, including your extra thoughts, although maybe we want to discuss them with the rest of the team?</p> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=968332021-09-07T18:10:02ZTom Cleggtom@curii.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-3 priority-4 priority-default closed parent" href="/issues/18113">Feature #18113</a>: [a-d-c] non-zero defaults for MaxCloudOpsPerSecond and MaxConcurrentInstanceCreateOps</i> added</li></ul> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=968372021-09-07T18:11:20ZTom Cleggtom@curii.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-1 priority-4 priority-default" href="/issues/18114">Bug #18114</a>: [a-d-c] slow down retries when CreateInstance returns non-quota/non-throttle errors</i> added</li></ul> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=969032021-09-09T18:19:25ZTom Cleggtom@curii.com
<ul><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>Applied in changeset arvados-private:commit:arvados|89bdd6bf07147ad707125fc3882caebb5402ab09.</p> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=969202021-09-10T15:34:07ZTom Cleggtom@curii.com
<ul><li><strong>Release</strong> set to <i>43</i></li></ul> Arvados - Bug #18102: max dispatch attempts errorhttps://dev.arvados.org/issues/18102?journal_id=979012021-10-22T18:47:53ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Release</strong> changed from <i>43</i> to <i>42</i></li></ul>