[Crunch] Figure out why this job was marked Failed unexpectedly
They seemed to have this error in common: error: Unable to allocate resources: Requested nodes are busy. Ward said that there were two crunch dispatchers running and shutting one down seemed to fix it.
When the jobs end, they usually hit a 403 Permission error and cannot write their output to Keep.
Touch the "crunch_refresh_trigger" file when the state changes. This notifies
all crunch-job instances to check the cancelled and state flags, so if a
running job changes state unexpectedly, it will be treated as a cancellation. refs #4314
#7 Updated by Peter Amstutz about 6 years ago
There are two different errors here.
- qr1hi-8i9sb-vt7mb676a4htd6k dies because of "Job state unexpectedly changed to Failed" which certainly could be due to #4310
#8 Updated by Tom Clegg about 6 years ago
The timing here makes the #4310 explanation less than 100% convincing. What would make crunch-dispatch take any interest in a job that has had
state=='Running' for 37 minutes? (Process suspended???)
2014-10-24_19:59:31 qr1hi-8i9sb-vt7mb676a4htd6k 14114 start
...
2014-10-24_20:37:11 qr1hi-8i9sb-vt7mb676a4htd6k 14114 Job state unexpectedly changed to Failed
#10 Updated by Peter Amstutz about 6 years ago
Some additional sleuthing shows that qr1hi-8i9sb-vt7mb676a4htd6k changed from "Running" to "Failed" at 2014-10-24T19:59:32Z, which suggests that it was the result of a race between crunch-dispatchers. However, crunch-job didn't notice the change until the task had completed 35 minutes later.
#13 Updated by Peter Amstutz almost 6 years ago
Finally figured this one out. crunch-job only checks the job state if the file listed in "CRUNCH_REFRESH_TRIGGER" has been touched recently. It does this for cancellations, but not for other state changes, so even though the job was marked "failed" almost immediately due to the crunch-dispatcher race, it didn't notice until it completed on its own. 7e4a195 fixes that (unexpected state changes will be treated as cancellations).
("crunch_refresh_trigger" is an unfortunate backchannel method of communicating from API server to crunch-job. In the future, when we need crunch-dispatch to run on a separate instance from the API server, this will need to use websockets.)
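For readers unfamiliar with the trigger-file mechanism, here is a minimal sketch of the idea in Python. It is not the actual crunch-job or API server code: `touch_trigger`, `JobWatcher`, and the `fetch_state` callback are hypothetical names, and the modification-time comparison is an assumption standing in for "touched recently". The writer bumps the file's modification time on every state change; the watcher re-reads the job state only when the file has changed, and treats any state other than the expected one as a cancellation.

```python
import os


def touch_trigger(path):
    """Writer side (hypothetical): bump the trigger file's mtime on any state change."""
    with open(path, "a"):
        pass  # create the file if it does not exist yet
    os.utime(path, None)  # set access/modification time to "now"


class JobWatcher:
    """Reader side (hypothetical): re-check job state only when the trigger changed."""

    def __init__(self, trigger_path, fetch_state, expected_state="Running"):
        self.trigger_path = trigger_path
        self.fetch_state = fetch_state  # callable returning the job's current state
        self.expected_state = expected_state
        self.last_seen_mtime = 0.0
        self.cancelled = False

    def poll(self):
        """Return True once the job should stop."""
        try:
            mtime = os.stat(self.trigger_path).st_mtime
        except FileNotFoundError:
            return self.cancelled  # no trigger file yet, nothing to re-check
        if mtime <= self.last_seen_mtime:
            return self.cancelled  # trigger untouched since the last check
        self.last_seen_mtime = mtime
        # Any unexpected state (not only an explicit cancel) stops the job.
        if self.fetch_state() != self.expected_state:
            self.cancelled = True
        return self.cancelled
```

In this sketch a state change that is not accompanied by a touch of the trigger file goes unnoticed until the next touch, which mirrors the failure mode described above.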
#14 Updated by Radhika Chippada almost 6 years ago
Discussed the one-liner update with Peter for background info, and the update looks good to me.
And, all api server tests passed.
My only comment was that the update sits very close to the comment "# TODO: Remove the following case block when old ...". We agreed to create a separate ticket to clean up the old code.