Bug #5292

[Node Manager] Failed to recognize busy node on qr1hi

Added by Brett Smith almost 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Node Manager
Target version: -
Start date: 02/23/2015
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

This morning on qr1hi, the single minimum compute node was up (qr1hi-7ekkf-09hjulgcrpxp1iw), and running a job (qr1hi-8i9sb-mcla1dzm2zrpl0t).

At 10:15 EST, qr1hi-8i9sb-palag4xt4jjln0v was added to the queue. This was reflected in Node Manager's internal server wishlist, but Node Manager did not start a node to accommodate it.

Several more jobs and nodes came up afterward. I haven't tracked down the cause yet, so I can't say this is it, but in general Node Manager acted as if one of the up nodes was idle when it was in fact busy. It created more nodes as more jobs were added to the queue, but it was always one node behind.
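To make the symptom concrete, here is a minimal sketch of how miscounting one busy node as idle would leave the node count one short. It is not Node Manager's actual code: the function `nodes_to_boot` and its parameters are hypothetical, and it assumes the number of nodes to boot is simply the wishlist size minus the nodes believed to be idle.

```python
def nodes_to_boot(wishlist_size, up_nodes, busy_nodes):
    """Return how many new nodes are needed to cover the wishlist.

    Hypothetical helper for illustration only; the real Node Manager
    logic may differ.
    """
    # Nodes believed to be free to take queued work.
    idle_nodes = up_nodes - busy_nodes
    # Never request a negative number of nodes.
    return max(0, wishlist_size - idle_nodes)


# Accurate view: one node is up and busy, one job is waiting.
correct = nodes_to_boot(wishlist_size=1, up_nodes=1, busy_nodes=1)

# Faulty view: the busy node is mistaken for idle, so no node is booted.
faulty = nodes_to_boot(wishlist_size=1, up_nodes=1, busy_nodes=0)

print(correct, faulty)
```

Under these assumptions the faulty view always requests one node fewer than the accurate view, which matches the "always behind by one" behavior seen on qr1hi.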

The original compute node record's crunch_worker_state appears correct now, so there's not anything blatantly wrong there.


Related issues

Is duplicate of Arvados - Bug #4751: [Node Manager] Can erroneously pair cloud nodes with stale Arvados node records (Resolved, 03/02/2015)

History

#1 Updated by Brett Smith almost 7 years ago

Saw something similar with qr1hi-8i9sb-en76y3f8yrtjitc. When it entered the queue, Node Manager booted several nodes simultaneously… but one fewer than the number needed to actually run the job. After those nodes pinged Arvados, it booted the last node to get the job over the edge. It doesn't look like there were any other jobs in the queue or running while all this happened.

#2 Updated by Brett Smith almost 7 years ago

  • Status changed from New to Closed
  • Target version deleted (Bug Triage)

Caused by #4751.
