Bug #10808

Updated by Ward Vandewege over 4 years ago

Job c97qk-8i9sb-bj9c3ojdng85osz appears to be unkillable via the cancel button on workbench.

There are several pipeline instances waiting on it:

<pre>
2017-01-04_18:27:11.93074 2017-01-04 18:27:11 +0000 -- pipeline_instance c97qk-d1hrv-n6pik83zizjk5hn
2017-01-04_18:27:11.93074 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:13.03719
2017-01-04_18:27:13.03721 2017-01-04 18:27:12 +0000 -- pipeline_instance c97qk-d1hrv-0thxn81rmpaedyo
2017-01-04_18:27:13.03721 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:14.53057
2017-01-04_18:27:14.53060 2017-01-04 18:27:14 +0000 -- pipeline_instance c97qk-d1hrv-frf2e4vls4gq22v
2017-01-04_18:27:14.53062 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:15.74509
2017-01-04_18:27:15.74511 2017-01-04 18:27:15 +0000 -- pipeline_instance c97qk-d1hrv-5dzt55sa9wlq495
2017-01-04_18:27:15.74512 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:16.69833
2017-01-04_18:27:16.69834 2017-01-04 18:27:16 +0000 -- pipeline_instance c97qk-d1hrv-1uwxdzktqgl8hr6
2017-01-04_18:27:16.69835 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:20.34010
</pre>

It is not actually running:

<pre>
c97qk:/etc/service# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 7 drain* compute[3-9]
compute* up infinite 249 down* compute[0-2,10-255]
crypto up infinite 7 drain* compute[3-9]
crypto up infinite 249 down* compute[0-2,10-255]
c97qk:/etc/service# squeue_long
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
</pre>

I ran the stale jobs script which cleaned up two stale jobs but not c97qk-8i9sb-bj9c3ojdng85osz:

<pre>
c97qk:/var/www/arvados-api/current/script# RAILS_ENV=production bundle exec ./fail-jobs.rb --before reboot
Called 'load' without the :safe option -- defaulting to safe mode.
You can avoid this warning in the future by setting the SafeYAML::OPTIONS[:default_mode] option (to :safe or :unsafe).
dispatch: c97qk-8i9sb-ip0ts72w6z9nty9: cleaned up stale job: started before 2016-11-18 02:31:23 +0000
dispatch: c97qk-8i9sb-7wzv76a2p7jbi9d: cleaned up stale job: started before 2016-11-18 02:31:23 +0000
</pre>

Back