Bug #6702

closed

[Node Manager] Retries forever when a node creation request times out, even though the node was created

Added by Brett Smith almost 9 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Story points: 1.0

Description

Node Manager decides to bring up a node. First this happens:

2015-07-22_14:48:15.48284 2015-07-22 14:48:15 arvnodeman.nodeup[13142] WARNING: Client error: The read operation timed out - waiting 1 seconds

Node Manager sees the failure and decides to retry the request. But on subsequent tries, this is always the response:

2015-07-22_14:48:18.92984 2015-07-22 14:48:18 arvnodeman.nodeup[13142] WARNING: Client error: u"The resource 'projects/curoverse-production/zones/us-central1-a/instances/compute-yp0s2tcidxw77kp-su92l' already exists" - waiting 2 seconds

The server handled the first request fine; we just didn't get the response back. We need to recognize when this happens and continue the node setup process, rather than retrying infinitely.

There might be a few ways to do this:

  • If the exception makes the problem easily identifiable, just catch it and move on.
  • At least some of the clouds let you send along a request ID with the request to ensure idempotency. Adding this to our requests might make the response nicer. I'm not sure—this would need testing.
  • If all else fails, you could periodically check for the existence of the desired node, at least when it has a predictable name.
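
As a rough illustration of the first option, the sketch below treats an "already exists" error on a retry as evidence that an earlier attempt went through. `CloudError`, `create_with_retry`, and the substring match on the error message are all assumptions made for this example, not Node Manager's or any cloud driver's actual API; a real driver may expose a structured error code that would be safer to check.

```python
import time


class CloudError(Exception):
    """Hypothetical stand-in for a cloud driver's error type."""


def create_with_retry(create, max_tries=5, delay=0.0):
    """Call create() until it succeeds or reports that the node already exists.

    Returns "created" on a normal success, and "already_exists" when a retry
    reveals that an earlier request (for example, one that timed out)
    actually succeeded on the server.
    """
    for attempt in range(max_tries):
        try:
            create()
            return "created"
        except CloudError as error:
            # Assumption: the duplicate case is identifiable from the message text.
            # Only trust it on a retry; on the first attempt it would mean a
            # genuine name collision with some other node.
            if attempt > 0 and "already exists" in str(error):
                return "already_exists"
            time.sleep(delay)
            delay *= 2  # back off, as the log messages above suggest
    raise CloudError("gave up after %d tries" % max_tries)
```

Checking `attempt > 0` is a design choice for the sketch: it avoids adopting a node that existed before this request was ever sent.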

Steps to fix:

If a cloud error is raised by create_node() on GCE, test if the cloud node exists. If so, return the cloud node record and proceed. If not, raise the original error.
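
The steps above could be sketched roughly as follows. This is a minimal illustration, not the code from the actual fix: `CloudError` and `create_or_find_node` are hypothetical names, and `driver` is assumed to be any object with `create_node()` and `list_nodes()` methods whose nodes have a `name` attribute, loosely modeled on a libcloud-style driver.

```python
class CloudError(Exception):
    """Hypothetical stand-in for the cloud driver's error type."""


def create_or_find_node(driver, name, **kwargs):
    """Create a node; if creation raises, fall back to looking it up by name."""
    try:
        return driver.create_node(name=name, **kwargs)
    except CloudError:
        # The request may have succeeded server-side even though we saw an error.
        for node in driver.list_nodes():
            if node.name == name:
                return node
        # No such node exists, so the failure was real: raise the original error.
        raise
```

This relies on the node having a predictable name, so the lookup can tell the node we asked for apart from any others.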


Subtasks 3 (0 open, 3 closed)

Task #8267: Add check if node exists - Resolved - Peter Amstutz - 07/22/2015
Task #8268: Add test? - Resolved - Peter Amstutz - 07/22/2015
Task #8263: Review 6702-gce-node-create-fix - Resolved - Tom Clegg - 07/22/2015