Arvados - Bug #6702: [Node Manager] Retries forever when a node creation request times out, even though the node was created
Brett Smith (brett.smith@curii.com), 2015-07-28T17:17:05Z, https://dev.arvados.org/issues/6702?journal_id=27830
<ul></ul><p>We just saw this again, except the original error was different: Google responded, "The zone 'projects/projname/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later."</p>
Peter Amstutz (peter.amstutz@curii.com), 2016-01-19T19:22:44Z, https://dev.arvados.org/issues/6702?journal_id=34441
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/34441/diff?detail_id=33800">diff</a>)</li></ul>
Peter Amstutz (peter.amstutz@curii.com), 2016-01-19T19:23:01Z, https://dev.arvados.org/issues/6702?journal_id=34442
<ul><li><strong>Story points</strong> set to <i>1.0</i></li></ul>
Peter Amstutz (peter.amstutz@curii.com), 2016-01-20T20:44:18Z, https://dev.arvados.org/issues/6702?journal_id=34558
<ul><li><strong>Target version</strong> changed from <i>Arvados Future Sprints</i> to <i>2016-02-03 Sprint</i></li></ul>
Peter Amstutz (peter.amstutz@curii.com), 2016-01-20T20:47:29Z, https://dev.arvados.org/issues/6702?journal_id=34563
<ul><li><strong>Assigned To</strong> set to <i>Peter Amstutz</i></li></ul>
Tom Clegg (tom@curii.com), 2016-02-02T16:11:34Z, https://dev.arvados.org/issues/6702?journal_id=34854
<ul></ul><p>Is there some reason gce.ComputeNodeDriver.create_node() needs a copy of the code from BaseComputeNodeDriver.create_node(), instead of calling super()?</p>
<p>Wouldn't this same logic work with other cloud drivers? It seems like the bug can happen just as easily with other clouds, so it should be in BaseComputeNodeDriver, unless there's some reason not to...</p>
<p>Otherwise LGTM.</p>
Peter Amstutz (peter.amstutz@curii.com), 2016-02-02T16:20:38Z, https://dev.arvados.org/issues/6702?journal_id=34856
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>Is there some reason gce.ComputeNodeDriver.create_node() needs a copy of the code from BaseComputeNodeDriver.create_node(), instead of calling super()?</p>
</blockquote>
<p>Because it needs kwargs['name'], which would incur either more refactoring or calling <code>self.arvados_create_kwargs</code> twice (once in the GCE <code>create_node()</code> and again in <code>super.create_node()</code>) and possibly getting different results.</p>
<blockquote>
<p>Wouldn't this same logic work with other cloud drivers? It seems like the bug can happen just as easily with other clouds, so it should be in BaseComputeNodeDriver, unless there's some reason not to...</p>
</blockquote>
<p>When we discussed the ticket earlier, the instructions were to only make the change for GCE. The same logic is definitely valid for Azure but I'm a bit fuzzy on whether it's also valid for AWS (I don't know if the name on AWS is a strongly unique identifier).</p>
<blockquote>
<p>Otherwise LGTM.</p>
</blockquote>
Peter Amstutz (peter.amstutz@curii.com), 2016-02-02T16:28:07Z, https://dev.arvados.org/issues/6702?journal_id=34859
<ul></ul><p>To follow up, I just checked the libcloud driver for EC2. The "name" field on the Node object set by libcloud is either the "Name" tag or the instance id. However, the "Name" tag isn't set until after node creation succeeds, and we don't know the instance id because we never got a response, so we can't use the same logic as Azure and GCE.</p>
Peter Amstutz (peter.amstutz@curii.com), 2016-02-02T16:35:06Z, https://dev.arvados.org/issues/6702?journal_id=34861
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Resolved</i></li></ul><p>Applied in changeset arvados|commit:6570eec0115d7973cce4df10857631cfe6bd11c5.</p>
Tom Clegg (tom@curii.com), 2016-02-02T16:38:58Z, https://dev.arvados.org/issues/6702?journal_id=34862
<ul></ul><p>Surely better to refactor than to have copy-and-pasted code. But it's a moot point if we move this to BaseComputeNodeDriver, which it sounds like we should.</p>
<p>Seems like we just need a method ("node_lookup"?) that returns True if the named node exists, and raises NotImplementedError in the EC2 driver, since names aren't a reliable identifier there. Aside from that, the existing code looks generic already.</p>
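<p>A minimal sketch of that idea, not the actual Node Manager code. It assumes a base class that builds the create kwargs once, and on any creation error asks the subclass whether a node with the requested name already exists. The names <code>node_lookup</code>, <code>FlakyCloud</code>, <code>NameKeyedDriver</code> and <code>EC2LikeDriver</code> are hypothetical, and here the lookup returns the node (or None) rather than True so the caller can use it directly. The only call assumed on the wrapped driver besides <code>create_node()</code> is <code>list_nodes()</code>.</p>

```python
class Node:
    def __init__(self, name):
        self.name = name


class FlakyCloud:
    """Hypothetical stand-in for a cloud driver: the node is created,
    but the response never arrives, so the caller sees a timeout."""

    def __init__(self):
        self.nodes = []

    def create_node(self, name, **kwargs):
        self.nodes.append(Node(name))
        raise TimeoutError("create_node response timed out")

    def list_nodes(self):
        return list(self.nodes)


class BaseComputeNodeDriver:
    def __init__(self, real_driver):
        self.real = real_driver

    def arvados_create_kwargs(self, size, arvados_node):
        raise NotImplementedError

    def node_lookup(self, name):
        # Hypothetical hook: return the node with this name, or None.
        # Drivers without reliable names leave this unimplemented.
        raise NotImplementedError

    def _lookup_or_none(self, name):
        try:
            return self.node_lookup(name)
        except NotImplementedError:
            return None

    def create_node(self, size, arvados_node):
        # Build kwargs exactly once so the name we look up is the
        # same name we asked the cloud to create.
        kwargs = self.arvados_create_kwargs(size, arvados_node)
        try:
            return self.real.create_node(**kwargs)
        except Exception:
            existing = self._lookup_or_none(kwargs['name'])
            if existing is None:
                raise  # nothing was created, or we cannot tell
            return existing


class NameKeyedDriver(BaseComputeNodeDriver):
    """GCE/Azure-like: the requested name identifies the node."""

    def arvados_create_kwargs(self, size, arvados_node):
        return {'name': 'compute-' + arvados_node['uuid'], 'size': size}

    def node_lookup(self, name):
        for node in self.real.list_nodes():
            if node.name == name:
                return node
        return None


class EC2LikeDriver(BaseComputeNodeDriver):
    """EC2-like: no node_lookup, so the original error propagates."""

    def arvados_create_kwargs(self, size, arvados_node):
        return {'name': 'compute-' + arvados_node['uuid'], 'size': size}
```

<p>With this shape, a driver opts in to the recovery path only by implementing the lookup; a driver that does not implement it keeps the current behavior, where the creation error is raised to the caller.</p>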