Project

General

Profile

Actions

Bug #10808

closed

[Crunch] stuck job and pipeline instances on c97qk

Added by Ward Vandewege almost 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-

Description

Job c97qk-8i9sb-bj9c3ojdng85osz appears to be unkillable via the cancel button on workbench.

There are several pipeline instances waiting on it:

2017-01-04_18:27:11.93074 2017-01-04 18:27:11 +0000 -- pipeline_instance c97qk-d1hrv-n6pik83zizjk5hn
2017-01-04_18:27:11.93074 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:13.03719 
2017-01-04_18:27:13.03721 2017-01-04 18:27:12 +0000 -- pipeline_instance c97qk-d1hrv-0thxn81rmpaedyo
2017-01-04_18:27:13.03721 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:14.53057 
2017-01-04_18:27:14.53060 2017-01-04 18:27:14 +0000 -- pipeline_instance c97qk-d1hrv-frf2e4vls4gq22v
2017-01-04_18:27:14.53062 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:15.74509 
2017-01-04_18:27:15.74511 2017-01-04 18:27:15 +0000 -- pipeline_instance c97qk-d1hrv-5dzt55sa9wlq495
2017-01-04_18:27:15.74512 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:16.69833 
2017-01-04_18:27:16.69834 2017-01-04 18:27:16 +0000 -- pipeline_instance c97qk-d1hrv-1uwxdzktqgl8hr6
2017-01-04_18:27:16.69835 cwl-runner c97qk-8i9sb-bj9c3ojdng85osz {:running=>1, :done=>0, :failed=>0, :todo=>0}
2017-01-04_18:27:20.34010 

It is not actually running:

c97qk:/etc/service# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      7 drain* compute[3-9]
compute*     up   infinite    249  down* compute[0-2,10-255]
crypto       up   infinite      7 drain* compute[3-9]
crypto       up   infinite    249  down* compute[0-2,10-255]
c97qk:/etc/service# squeue_long 
  JOBID PARTITION NAME     USER ST       TIME  NODES NODELIST(REASON)

I ran the stale jobs script which cleaned up two stale jobs but not c97qk-8i9sb-bj9c3ojdng85osz:

c97qk:/var/www/arvados-api/current/script# RAILS_ENV=production bundle exec ./fail-jobs.rb --before reboot
Called 'load' without the :safe option -- defaulting to safe mode.
You can avoid this warning in the future by setting the SafeYAML::OPTIONS[:default_mode] option (to :safe or :unsafe).
dispatch: c97qk-8i9sb-ip0ts72w6z9nty9: cleaned up stale job: started before 2016-11-18 02:31:23 +0000
dispatch: c97qk-8i9sb-7wzv76a2p7jbi9d: cleaned up stale job: started before 2016-11-18 02:31:23 +0000

The workbench logs show this when I press the cancel button:

#<ArvadosApiClient::AccessForbiddenException: #<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError> [API: 403]>
/var/www/arvados-workbench/current/app/models/arvados_api_client.rb:171:in `api'
/var/www/arvados-workbench/current/app/models/job.rb:30:in `cancel'
/var/www/arvados-workbench/current/app/controllers/jobs_controller.rb:50:in `cancel'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/implicit_render.rb:4:in `send_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/abstract_controller/base.rb:189:in `process_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/rendering.rb:10:in `process_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/abstract_controller/callbacks.rb:20:in `block in process_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:113:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:552:in `block (2 levels) in compile'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:502:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:495:in `block (2 levels) in around'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:286:in `block (2 levels) in halting_and_conditional'
/var/www/arvados-workbench/current/app/controllers/application_controller.rb:649:in `require_thread_api_token'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:429:in `block in make_lambda'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:285:in `block in halting_and_conditional'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:494:in `block in around'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:502:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:495:in `block (2 levels) in around'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:306:in `block (2 levels) in halting'
/var/www/arvados-workbench/current/app/controllers/application_controller.rb:631:in `set_thread_api_token'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:429:in `block in make_lambda'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:305:in `block in halting'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:494:in `block in around'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:502:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:495:in `block (2 levels) in around'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:306:in `block (2 levels) in halting'
/var/www/arvados-workbench/current/app/controllers/application_controller.rb:561:in `thread_clear'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:429:in `block in make_lambda'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:305:in `block in halting'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:494:in `block in around'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:502:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:86:in `run_callbacks'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/abstract_controller/callbacks.rb:19:in `process_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/rescue.rb:29:in `process_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/instrumentation.rb:32:in `block in process_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/notifications.rb:159:in `block in instrument'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/notifications.rb:159:in `instrument'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/instrumentation.rb:30:in `process_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/params_wrapper.rb:250:in `process_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.1.12/lib/active_record/railties/controller_runtime.rb:18:in `process_action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/abstract_controller/base.rb:136:in `process'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionview-4.1.12/lib/action_view/rendering.rb:30:in `process'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal.rb:196:in `dispatch'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal/rack_delegation.rb:13:in `dispatch'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_controller/metal.rb:232:in `block in action'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/routing/route_set.rb:82:in `dispatch'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/routing/route_set.rb:50:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/journey/router.rb:73:in `block in call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/journey/router.rb:59:in `each'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/journey/router.rb:59:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/routing/route_set.rb:692:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/etag.rb:23:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/conditionalget.rb:35:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/head.rb:11:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/params_parser.rb:27:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/flash.rb:254:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/session/abstract/id.rb:225:in `context'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/session/abstract/id.rb:220:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/cookies.rb:562:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.1.12/lib/active_record/query_cache.rb:36:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activerecord-4.1.12/lib/active_record/connection_adapters/abstract/connection_pool.rb:621:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/callbacks.rb:29:in `block in call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/callbacks.rb:82:in `run_callbacks'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/callbacks.rb:27:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/remote_ip.rb:76:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/debug_exceptions.rb:17:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/show_exceptions.rb:30:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/lograge-0.3.6/lib/lograge/rails_ext/rack/logger.rb:15:in `call_app'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/rack/logger.rb:20:in `block in call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/tagged_logging.rb:68:in `block in tagged'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/tagged_logging.rb:26:in `tagged'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/tagged_logging.rb:68:in `tagged'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/rack/logger.rb:20:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/request_id.rb:21:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/methodoverride.rb:21:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/runtime.rb:17:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/activesupport-4.1.12/lib/active_support/cache/strategy/local_cache_middleware.rb:26:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/rack-1.5.5/lib/rack/sendfile.rb:112:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/actionpack-4.1.12/lib/action_dispatch/middleware/ssl.rb:24:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/engine.rb:514:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/application.rb:144:in `call'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/railtie.rb:194:in `public_send'
/var/www/arvados-workbench/shared/vendor_bundle/ruby/2.3.0/gems/railties-4.1.12/lib/rails/railtie.rb:194:in `method_missing'
/usr/lib/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:94:in `process_request'
/usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:157:in `accept_and_process_next_request'
/usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:110:in `main_loop'
/usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler.rb:415:in `block (3 levels) in start_threads'
/usr/lib/ruby/vendor_ruby/phusion_passenger/utils.rb:111:in `block in create_thread_and_abort_on_exception'

Subtasks 1 (0 open1 closed)

Task #10828: ReviewResolvedRadhika Chippada01/05/2017Actions
Actions #1

Updated by Ward Vandewege almost 8 years ago

  • Description updated (diff)
Actions #2

Updated by Ward Vandewege almost 8 years ago

  • Description updated (diff)
Actions #3

Updated by Ward Vandewege almost 8 years ago

  • Description updated (diff)
Actions #4

Updated by Ward Vandewege almost 8 years ago

  • Description updated (diff)
Actions #5

Updated by Tom Clegg almost 8 years ago

10808-file-cache-ownership @ d7b27f798a0298f5508842c5f7f03b8fccafa3ab

Actions #6

Updated by Tom Clegg almost 8 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Ward Vandewege almost 8 years ago

Tom Clegg wrote:

10808-file-cache-ownership @ d7b27f798a0298f5508842c5f7f03b8fccafa3ab

Background: Tom found

#<Errno::EACCES: Permission denied @ unlink_internal - /var/www/arvados-api/current/tmp/cache/94F/070/SweepTrashedCollections>

in the API server logs. SweepTrashedCollections is new, it runs whenever Collection.where() is executed.

Clearly, something running as root created the cached copy on c97qk by executing Collection.where(). First we suspected the rake call during package installation, but our postinst script is already very careful to fix the cache dir ownership.

Then we looked at crunch-dispatch, which does still run as root (bleh).

Tom came up with the patch in this branch to avoid getting into the state where cache dir ownership is screwed up.

I have applied this patch on c97qk and confirmed that the if block is executed for crunch-dispatch (which currently runs as root, :/), but does not trigger for the rails processes running under Passenger, as the www-data user (on debian-based systems).

I removed the offending cache directory with ownership root and restarted crunch-dispatch-jobs and crunch-dispatch-pipelines.

After a few minutes, a new SweepCacheCollections file appeared but this time owned by www-data, which means it must have been written by the API server not crunch-dispatch:

c97qk:/etc/service# v /var/www/arvados-api/current/tmp/cache/94F
total 4
drwxr-sr-x  3 www-data www-data   16 Jan  5 01:56 ./
drwxrwsr-x 23 www-data www-data 4096 Jan  5 01:56 ../
drwxr-sr-x  2 www-data www-data   36 Jan  5 01:56 070/

c97qk:/etc/service# v /var/www/arvados-api/current/tmp/cache/94F/070/
total 4
drwxr-sr-x 2 www-data www-data  36 Jan  5 01:56 ./
drwxr-sr-x 3 www-data www-data  16 Jan  5 01:56 ../
-rw-r--r-- 1 www-data www-data 111 Jan  5 01:56 SweepTrashedCollections

In other words, this looks good to me and I would like to see the patch merged. Thanks!

Actions #8

Updated by Ward Vandewege almost 8 years ago

  • Description updated (diff)
  • Assigned To set to Ward Vandewege
  • Target version set to 2017-01-18 sprint
Actions #9

Updated by Tom Clegg almost 8 years ago

More detail on the "cancel button doesn't work" problem:
  • Even after the cache dir is fixed, clicking the "Cancel" button doesn't cancel the job.
  • No error message or other user feedback.
  • Chrome debugger shows
    Request URL:https://workbench.c97qk.arvadosapi.com/jobs/c97qk-8i9sb-bj9c3ojdng85osz/cancel
    Request Method:POST
    Status Code:422 Unprocessable Entity
    
    HTTP/1.1 422 Unprocessable Entity
    Server: nginx/1.8.0
    Date: Thu, 05 Jan 2017 15:19:48 GMT
    Content-Type: application/json; charset=utf-8
    Transfer-Encoding: chunked
    Connection: keep-alive
    Status: 422 Unprocessable Entity
    Cache-Control: no-cache, no-store, max-age=0, must-revalidate
    Strict-Transport-Security: max-age=31536000
    Pragma: no-cache
    X-XSS-Protection: 1; mode=block
    X-Request-Id: dff5a12c-afa9-424e-a2e6-62ccbae1b1ef
    X-Runtime: 0.302502
    X-Frame-Options: SAMEORIGIN
    X-Content-Type-Options: nosniff
    Expires: Fri, 01 Jan 1990 00:00:00 GMT
    X-Powered-By: Phusion Passenger 5.0.16
    
    {"success":false,"errors":["#\u003cArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError\u003e"]}
    
API log:
  • User c97qk-tpzed-7o2gfjd1p10dq9r tried to change protected job attributes on locked Job c97qk-8i9sb-bj9c3ojdng85osz
    #<ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError>
    /data-sdc/var-www/arvados-api/current/app/models/arvados_model.rb:370:in `ensure_permission_to_save'
    ...
    Error 1483629588+6489a86c: 403
    {"method":"POST","path":"/arvados/v1/jobs/c97qk-8i9sb-bj9c3ojdng85osz/cancel","format":"*/*","controller":"arvados/v1/jobs","action":"cancel","status":403,"duration":11.66,"view":0.14,"db":3.06,..."@timestamp":"2017-01-05T15:19:48Z"...
    
Actions #11

Updated by Radhika Chippada almost 8 years ago

Just a nit. It might be easier to follow the intent of the code block if the comment “If we don't own the cache dir …” is placed above the default_cache_path declaration in application.rb

LGTM

Actions #12

Updated by Ward Vandewege almost 8 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF