Bug #5292 (closed)
[Node Manager] Failed to recognize busy node on qr1hi
Description
This morning on qr1hi, the single minimum compute node was up (qr1hi-7ekkf-09hjulgcrpxp1iw), and running a job (qr1hi-8i9sb-mcla1dzm2zrpl0t).
At 10:15 EST, qr1hi-8i9sb-palag4xt4jjln0v was added to the queue. This was reflected in Node Manager's internal server wishlist, but Node Manager did not start a node to accommodate it.
Several more jobs and nodes came up after that. I haven't tracked down the cause yet, so this is only an observation, but in general Node Manager acted as if one of the up nodes was idle when it was in fact busy. It created more nodes as more jobs were added to the queue, but it was always one node behind.
The original compute node record's crunch_worker_state appears correct now, so there's not anything blatantly wrong there.
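To make the "always one node behind" symptom concrete, here is a minimal sketch of the arithmetic. It is a hypothetical model, not Node Manager's actual code: the function `nodes_to_boot`, its parameters, and the `"idle"`/`"busy"` state strings are all assumptions made for illustration. It assumes the number of nodes to boot is the wishlist size minus the nodes believed to be idle or already booting.

```python
def nodes_to_boot(wishlist_size, up_node_states, booting=0):
    """Hypothetical model of the boot decision (not Node Manager's real logic).

    wishlist_size:   number of nodes the queued jobs need
    up_node_states:  list of states ("idle" or "busy") as the manager sees them
    booting:         nodes already requested but not yet up
    """
    # Nodes believed to be idle are assumed to absorb queued work.
    believed_idle = sum(1 for state in up_node_states if state == "idle")
    return max(0, wishlist_size - believed_idle - booting)


# One job queued, and the single up node is correctly seen as busy:
# one new node is needed.
correct_view = nodes_to_boot(1, ["busy"])

# Same situation, but the busy node is mistaken for idle:
# no node is booted, and the queued job waits.
mistaken_view = nodes_to_boot(1, ["idle"])

print(correct_view, mistaken_view)
```

Under this model, a single busy node miscounted as idle would leave the node count short by exactly one no matter how many jobs are queued, which is consistent with the behavior described above.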
Updated by Brett Smith almost 10 years ago
Saw something similar with qr1hi-8i9sb-en76y3f8yrtjitc. When it entered the queue, Node Manager booted several nodes simultaneously… but one fewer than the number needed to actually run the job. After those nodes pinged Arvados, it booted the last node to get the job over the edge. It doesn't look like any other jobs were queued or running while all this happened.
Updated by Brett Smith almost 10 years ago
- Status changed from New to Closed
- Target version deleted (Bug Triage)
Caused by #4751.