Bug #10808
closed[Crunch] stuck job and pipeline instances on c97qk
Description
Job c97qk-8i9sb-bj9c3ojdng85osz appears to be unkillable via the cancel button on workbench.
There are several pipeline instances waiting on it:
2017-01-04_18:27:11.93074 2017-01-04 18:27:11 +0000 -- pipeline_instance c97qk-d1hrv-n6pik83zizjk5hn 2017-01-04_18:27:11.93074 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0} 2017-01-04_18:27:13.03719 2017-01-04_18:27:13.03721 2017-01-04 18:27:12 +0000 -- pipeline_instance c97qk-d1hrv-0thxn81rmpaedyo 2017-01-04_18:27:13.03721 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0} 2017-01-04_18:27:14.53057 2017-01-04_18:27:14.53060 2017-01-04 18:27:14 +0000 -- pipeline_instance c97qk-d1hrv-frf2e4vls4gq22v 2017-01-04_18:27:14.53062 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0} 2017-01-04_18:27:15.74509 2017-01-04_18:27:15.74511 2017-01-04 18:27:15 +0000 -- pipeline_instance c97qk-d1hrv-5dzt55sa9wlq495 2017-01-04_18:27:15.74512 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0} 2017-01-04_18:27:16.69833 2017-01-04_18:27:16.69834 2017-01-04 18:27:16 +0000 -- pipeline_instance c97qk-d1hrv-1uwxdzktqgl8hr6 2017-01-04_18:27:16.69835 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0} 2017-01-04_18:27:20.34010
It is not actually running:
c97qk:/etc/service# sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST compute* up infinite 7 drain* compute[3-9] compute* up infinite 249 down* compute[0-2,10-255] crypto up infinite 7 drain* compute[3-9] crypto up infinite 249 down* compute[0-2,10-255] c97qk:/etc/service# squeue_long JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
I ran the stale jobs script which cleaned up two stale jobs but not c97qk-8i9sb-bj9c3ojdng85osz:
c97qk:/var/www/arvados-api/current/script# RAILS_ENV=production bundle exec ./fail-jobs.rb --before reboot Called 'load' without the :safe option -- defaulting to safe mode. You can avoid this warning in the future by setting the SafeYAML::OPTIONS[:default_mode] option (to :safe or :unsafe). dispatch: c97qk-8i9sb-ip0ts72w6z9nty9: cleaned up stale job: started before 2016-11-18 02:31:23 +0000 dispatch: c97qk-8i9sb-7wzv76a2p7jbi9d: cleaned up stale job: started before 2016-11-18 02:31:23 +0000
The workbench logs show this when I press the cancel button:
#<ArvadosApiClient::AccessForbiddenException: #<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError> [API: 403]> /var/www/arvados-workbench/current/app/models/arvados_api_client.rb:171:in `api' /var/www/arvados-workbench/current/app/models/job.rb:30:in `cancel' /var/www/arvados-workbench/current/app/controllers/jobs_controller.rb:50:in `cancel' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/implicit_render.rb:4:in `send_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/abstract_controller/base.rb:189:in `process_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/rendering.rb:10:in `process_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/abstract_controller/callbacks.rb:20:in `block in process_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:113:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:552:in `block (2 levels) in compile' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:502:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:495:in `block (2 levels) in around' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:286:in `block (2 levels) in halting_and_conditional' /var/www/arvados-workbench/current/app/controllers/application_controller.rb:649:in `require_thread_api_token' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:429:in `block in make_lambda' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:285:in `block in halting_and_conditional' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:494:in `block in around' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:502:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:495:in `block (2 levels) in around' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:306:in `block (2 levels) in halting' /var/www/arvados-workbench/current/app/controllers/application_controller.rb:631:in `set_thread_api_token' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:429:in `block in make_lambda' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:305:in `block in halting' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:494:in `block in around' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:502:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:495:in `block (2 levels) in around' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:306:in `block (2 levels) in halting' /var/www/arvados-workbench/current/app/controllers/application_controller.rb:561:in `thread_clear' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:429:in `block in make_lambda' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:305:in `block in halting' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:494:in `block in around' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:502:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:86:in `run_callbacks' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/abstract_controller/callbacks.rb:19:in `process_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/rescue.rb:29:in `process_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/instrumentation.rb:32:in `block in process_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/notifications.rb:159:in `block in instrument' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/notifications/instrumenter.rb:20:in `instrument' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/notifications.rb:159:in `instrument' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/instrumentation.rb:30:in `process_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/params_wrapper.rb:250:in `process_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.1.12/lib/active_record/railties/controller_runtime.rb:18:in `process_action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/abstract_controller/base.rb:136:in `process' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionview-4.1.12/lib/action_view/rendering.rb:30:in `process' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal.rb:196:in `dispatch' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/rack_delegation.rb:13:in `dispatch' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal.rb:232:in `block in action' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/routing/route_set.rb:82:in `dispatch' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/routing/route_set.rb:50:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/journey/router.rb:73:in `block in call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/journey/router.rb:59:in `each' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/journey/router.rb:59:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/routing/route_set.rb:692:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/etag.rb:23:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/conditionalget.rb:35:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/head.rb:11:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/params_parser.rb:27:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/flash.rb:254:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/session/abstract/id.rb:225:in `context' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/session/abstract/id.rb:220:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/cookies.rb:562:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.1.12/lib/active_record/query_cache.rb:36:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.1.12/lib/active_record/connection_adapters/abstract/connection_pool.rb:621:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/callbacks.rb:29:in `block in call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:82:in `run_callbacks' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/callbacks.rb:27:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/remote_ip.rb:76:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/debug_exceptions.rb:17:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/show_exceptions.rb:30:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/lograge-0.3.6/lib/lograge/rails_ext/rack/logger.rb:15:in `call_app' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/rack/logger.rb:20:in `block in call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/tagged_logging.rb:68:in `block in tagged' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/tagged_logging.rb:26:in `tagged' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/tagged_logging.rb:68:in `tagged' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/rack/logger.rb:20:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/request_id.rb:21:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/methodoverride.rb:21:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/runtime.rb:17:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/cache/strategy/local_cache_middleware.rb:26:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/sendfile.rb:112:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/ssl.rb:24:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/engine.rb:514:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/application.rb:144:in `call' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/railtie.rb:194:in `public_send' /var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/railtie.rb:194:in `method_missing' /usr/lib/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:94:in `process_request' /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:157:in `accept_and_process_next_request' /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:110:in `main_loop' /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler.rb:415:in `block (3 levels) in start_threads' /usr/lib/ruby/vendor_ruby/phusion_passenger/utils.rb:111:in `block in create_thread_and_abort_on_exception'
Updated by Tom Clegg almost 8 years ago
10808-file-cache-ownership @ d7b27f798a0298f5508842c5f7f03b8fccafa3ab
Updated by Ward Vandewege almost 8 years ago
Tom Clegg wrote:
10808-file-cache-ownership @ d7b27f798a0298f5508842c5f7f03b8fccafa3ab
Background: Tom found
#<Errno::EACCES: Permission denied @ unlink_internal - /var/www/arvados-api/current/tmp/cache/94F/070/SweepTrashedCollections>
in the API server logs. SweepTrashedCollections is new, it runs whenever Collection.where() is executed.
Clearly, something running as root created the cached copy on c97qk by executing Collection.where(). First we suspected the rake call during package installation, but our postinst script is already very careful to fix the cache dir ownership.
Then we looked at crunch-dispatch, which does still run as root (bleh).
Tom came up with the patch in this branch to avoid getting into the state where cache dir ownership is screwed up.
I have applied this patch on c97qk and confirmed that the if block is executed for crunch-dispatch (which currently runs as root, :/), but does not trigger for the rails processes running under Passenger, as the www-data user (on debian-based systems).
I removed the offending cache directory with ownership root and restarted crunch-dispatch-jobs and crunch-dispatch-pipelines.
After a few minutes, a new SweepCacheCollections file appeared but this time owned by www-data, which means it must have been written by the API server not crunch-dispatch:
c97qk:/etc/service# v /var/www/arvados-api/current/tmp/cache/94F total 4 drwxr-sr-x 3 www-data www-data 16 Jan 5 01:56 ./ drwxrwsr-x 23 www-data www-data 4096 Jan 5 01:56 ../ drwxr-sr-x 2 www-data www-data 36 Jan 5 01:56 070/ c97qk:/etc/service# v /var/www/arvados-api/current/tmp/cache/94F/070/ total 4 drwxr-sr-x 2 www-data www-data 36 Jan 5 01:56 ./ drwxr-sr-x 3 www-data www-data 16 Jan 5 01:56 ../ -rw-r--r-- 1 www-data www-data 111 Jan 5 01:56 SweepTrashedCollections
In other words, this looks good to me and I would like to see the patch merged. Thanks!
Updated by Ward Vandewege almost 8 years ago
- Description updated (diff)
- Assigned To set to Ward Vandewege
- Target version set to 2017-01-18 sprint
Updated by Tom Clegg almost 8 years ago
- Even after the cache dir is fixed, clicking the "Cancel" button doesn't cancel the job.
- No error message or other user feedback.
- Chrome debugger shows
Request URL:https://workbench.c97qk.arvadosapi.com/jobs/c97qk-8i9sb-bj9c3ojdng85osz/cancel Request Method:POST Status Code:422 Unprocessable Entity HTTP/1.1 422 Unprocessable Entity Server: nginx/1.8.0 Date: Thu, 05 Jan 2017 15:19:48 GMT Content-Type: application/json; charset=utf-8 Transfer-Encoding: chunked Connection: keep-alive Status: 422 Unprocessable Entity Cache-Control: no-cache, no-store, max-age=0, must-revalidate Strict-Transport-Security: max-age=31536000 Pragma: no-cache X-XSS-Protection: 1; mode=block X-Request-Id: dff5a12c-afa9-424e-a2e6-62ccbae1b1ef X-Runtime: 0.302502 X-Frame-Options: SAMEORIGIN X-Content-Type-Options: nosniff Expires: Fri, 01 Jan 1990 00:00:00 GMT X-Powered-By: Phusion Passenger 5.0.16 {"success":false,"errors":["#\u003cArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError\u003e"]}
User c97qk-tpzed-7o2gfjd1p10dq9r tried to change protected job attributes on locked Job c97qk-8i9sb-bj9c3ojdng85osz #<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError> /data-sdc/var-www/arvados-api/current/app/models/arvados_model.rb:370:in `ensure_permission_to_save' ... Error 1483629588+6489a86c: 403 {"method":"POST","path":"/arvados/v1/jobs/c97qk-8i9sb-bj9c3ojdng85osz/cancel","format":"*/*","controller":"arvados/v1/jobs","action":"cancel","status":403,"duration":11.66,"view":0.14,"db":3.06,..."@timestamp":"2017-01-05T15:19:48Z"...
Updated by Tom Clegg almost 8 years ago
10808-admin-cancel-job @ 4a00ceae73e8d76affe6b646832c525355e7897c https://ci.curoverse.com/job/developer-run-tests/127/
Updated by Radhika Chippada almost 8 years ago
Just a nit. It might be easier to follow the intent of the code block if the comment “If we don't own the cache dir …” is placed above the default_cache_path declaration in application.rb
LGTM
Updated by Ward Vandewege almost 8 years ago
- Status changed from In Progress to Resolved