Bug #7225
Updated by Brett Smith over 9 years ago
qr1hi-8i9sb-h5yt6xmpk8u6dps is a pretty typical BWA aligner job. The aligner apparently ran fine, but then run-command got stuck uploading the data. These are the last interesting lines that appear in the log:
<pre>2015-09-04_15:16:52 qr1hi-8i9sb-h5yt6xmpk8u6dps 4106 0 stderr run-command: /keep/39c6f22d40001074f4200a72559ae7eb+5745/bwa completed with exit code 0 (success)
2015-09-04_15:16:52 qr1hi-8i9sb-h5yt6xmpk8u6dps 4106 0 stderr run-command: the following output files will be saved to keep:
2015-09-04_15:16:52 qr1hi-8i9sb-h5yt6xmpk8u6dps 4106 0 stderr run-command: 1455988972 ./[filename].sai
2015-09-04_15:16:52 qr1hi-8i9sb-h5yt6xmpk8u6dps 4106 0 stderr run-command: start writing output to keep
</pre>
After that, run-command was never heard from again. When I checked on the compute node, the run-command process was still alive, but not doing anything. strace reported it was stuck in a futex call.
The last two lines of run-command update the task with its success and output information, then exit. The API server logs show that it received and handled the task update with no problem, shortly after the last log lines above. That implies run-command got stuck somewhere between sending the request and exiting.
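If this recurs, one way to narrow down where a Python process is stuck between its final request and exit is to have it dump every thread's stack when it is still alive after a timeout. The following is a minimal sketch using the standard-library faulthandler module (Python 3.3 and later; whether it is available in the Python that run-command runs under is an assumption). The helper names arm_exit_watchdog and disarm_exit_watchdog are hypothetical and are not part of run-command.

```python
import faulthandler
import sys


def arm_exit_watchdog(timeout_seconds, log_file=None):
    """Schedule a dump of all thread tracebacks after timeout_seconds.

    Hypothetical helper: call it just before the final steps of a script.
    If the process is still alive when the timer fires, faulthandler writes
    each thread's stack to log_file (default: stderr), which shows the
    Python frame each thread is blocked in, such as a lock acquire.
    """
    if log_file is None:
        log_file = sys.stderr
    # exit=False only reports; exit=True would also terminate the process.
    faulthandler.dump_traceback_later(timeout_seconds, exit=False, file=log_file)


def disarm_exit_watchdog():
    """Cancel the pending dump, for example when shutdown completed in time."""
    faulthandler.cancel_dump_traceback_later()
```

Calling arm_exit_watchdog(300) before the final task update would leave a stack dump in the job's stderr if the process were still running five minutes later. The 300-second value is only an illustration.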
If the fix for this requires users to make specific API calls, Tom should sign off on those requirements as architect, and the requirements should be clearly documented.
The branch that fixes this is expected to include a test for the unsigned locator race condition.