Project

General

Profile

Actions

Bug #15727

open

[FUSE] Unhandled exception while getting discovery doc

Added by Tom Clegg about 5 years ago. Updated 10 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Story points:
-
Release:
Release relationship:
Auto

Description

User reported a container that failed due to cloud/cluster weather and didn't retry automatically despite container_count_max=3. Retrying the whole workflow eventually worked.

It looks like when arv-mount failed to get the discovery doc, it returned an error to the caller instead of retrying or exiting. If arv-mount had exited, crunch-run would have set final state to Cancelled, and controller/RailsAPI would have retried the CR with a new container.

[...]
2019-10-14T18:02:07.795676693Z 2019-10-14 18:02:07 arvados.arvados_fuse[88986] ERROR: Unhandled exception during FUSE operation
2019-10-14T18:02:07.795676693Z Traceback (most recent call last):
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/__init__.py", line 325, in catch_exceptions_wrapper
2019-10-14T18:02:07.795676693Z     return orig_func(self, *args, **kwargs)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/__init__.py", line 653, in read
2019-10-14T18:02:07.795676693Z     r = handle.obj.readfrom(off, size, self.num_retries)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/fusefile.py", line 63, in readfrom
2019-10-14T18:02:07.795676693Z     return self.arvfile.readfrom(off, size, num_retries, exact=True)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/arvfile.py", line 1107, in readfrom
2019-10-14T18:02:07.795676693Z     block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact))
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/arvfile.py", line 785, in get_block_contents
2019-10-14T18:02:07.795676693Z     return self._keep.get(locator, num_retries=num_retries)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/retry.py", line 162, in num_retries_setter
2019-10-14T18:02:07.795676693Z     return orig_func(self, *args, **kwargs)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/keep.py", line 995, in get
2019-10-14T18:02:07.795676693Z     return self._get_or_head(loc_s, method="GET", **kwargs)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/keep.py", line 1041, in _get_or_head
2019-10-14T18:02:07.795676693Z     (hasattr(self, 'api_client') and self.api_client.request_id) or
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/safeapi.py", line 47, in __getattr__
2019-10-14T18:02:07.795676693Z     return getattr(self.localapi(), name)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/safeapi.py", line 40, in localapi
2019-10-14T18:02:07.795676693Z     **self.api_params)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/api.py", line 277, in api_from_config
2019-10-14T18:02:07.795676693Z     return api(version=version, host=host, token=token, insecure=insecure, **kwargs)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/api.py", line 236, in api
2019-10-14T18:02:07.795676693Z     svc = apiclient_discovery.build('arvados', version, cache_discovery=False, **kwargs)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
2019-10-14T18:02:07.795676693Z     return wrapped(*args, **kwargs)
2019-10-14T18:02:07.795676693Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/googleapiclient/discovery.py", line 230, in build
2019-10-14T18:02:07.795676693Z     raise e
2019-10-14T18:02:07.795676693Z ApiError: <HttpError 503 when requesting https://su92l.arvadosapi.com/discovery/v1/apis/arvados/v1/rest returned "Service Unavailable">
2019-10-14T18:02:07.810523494Z 2019-10-14 18:02:07 arvados.arvados_fuse[88986] ERROR: Unhandled exception during FUSE operation
2019-10-14T18:02:07.810523494Z Traceback (most recent call last):
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/__init__.py", line 325, in catch_exceptions_wrapper
2019-10-14T18:02:07.810523494Z     return orig_func(self, *args, **kwargs)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/__init__.py", line 653, in read
2019-10-14T18:02:07.810523494Z     r = handle.obj.readfrom(off, size, self.num_retries)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/fusefile.py", line 63, in readfrom
2019-10-14T18:02:07.810523494Z     return self.arvfile.readfrom(off, size, num_retries, exact=True)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/arvfile.py", line 1107, in readfrom
2019-10-14T18:02:07.810523494Z     block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact))
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/arvfile.py", line 785, in get_block_contents
2019-10-14T18:02:07.810523494Z     return self._keep.get(locator, num_retries=num_retries)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/retry.py", line 162, in num_retries_setter
2019-10-14T18:02:07.810523494Z     return orig_func(self, *args, **kwargs)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/keep.py", line 995, in get
2019-10-14T18:02:07.810523494Z     return self._get_or_head(loc_s, method="GET", **kwargs)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/keep.py", line 1041, in _get_or_head
2019-10-14T18:02:07.810523494Z     (hasattr(self, 'api_client') and self.api_client.request_id) or
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/safeapi.py", line 47, in __getattr__
2019-10-14T18:02:07.810523494Z     return getattr(self.localapi(), name)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/safeapi.py", line 40, in localapi
2019-10-14T18:02:07.810523494Z     **self.api_params)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/api.py", line 277, in api_from_config
2019-10-14T18:02:07.810523494Z     return api(version=version, host=host, token=token, insecure=insecure, **kwargs)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/api.py", line 236, in api
2019-10-14T18:02:07.810523494Z     svc = apiclient_discovery.build('arvados', version, cache_discovery=False, **kwargs)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
2019-10-14T18:02:07.810523494Z     return wrapped(*args, **kwargs)
2019-10-14T18:02:07.810523494Z   File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/googleapiclient/discovery.py", line 230, in build
2019-10-14T18:02:07.810523494Z     raise e
2019-10-14T18:02:07.810523494Z ApiError: <HttpError 503 when requesting https://su92l.arvadosapi.com/discovery/v1/apis/arvados/v1/rest returned "Service Unavailable">
2019-10-14T18:02:12.977186838Z crunchstat: keepcalls 0 put 6447 get -- interval 10.0000 seconds 0 put 209 get
2019-10-14T18:02:12.977186838Z crunchstat: net:keep0 0 tx 574815690 rx -- interval 10.0000 seconds 0 tx 0 rx
2019-10-14T18:02:12.977186838Z crunchstat: keepcache 6434 hit 13 miss -- interval 10.0000 seconds 207 hit 2 miss
2019-10-14T18:02:12.977186838Z crunchstat: blkio:0:0 0 write 506596483 read -- interval 10.0000 seconds 0 write 26890187 read
2019-10-14T18:02:12.977186838Z crunchstat: fuseops 0 write 3882 read -- interval 10.0000 seconds 0 write 209 read
2019-10-14T18:02:12.980666886Z crunchstat: fuseop:flush 65 count 0.001025 time -- interval 10.0000 seconds 10 count 0.000209 time
2019-10-14T18:02:12.981525297Z crunchstat: fuseop:on_event 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.982445810Z crunchstat: fuseop:read 3882 count 7.318443 time -- interval 10.0000 seconds 209 count 0.085762 time
2019-10-14T18:02:12.983162920Z crunchstat: fuseop:write 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.983848129Z crunchstat: fuseop:setattr 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.984538038Z crunchstat: fuseop:getattr 3546 count 0.059077 time -- interval 10.0000 seconds 5 count 0.000450 time
2019-10-14T18:02:12.985219048Z crunchstat: fuseop:opendir 1 count 0.000239 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.986006958Z crunchstat: fuseop:readdir 18 count 0.000036 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.986743668Z crunchstat: fuseop:lookup 1787 count 108.342589 time -- interval 10.0000 seconds 3 count 0.136670 time
2019-10-14T18:02:12.987632980Z crunchstat: fuseop:rmdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.988547393Z crunchstat: fuseop:rename 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.989305303Z crunchstat: fuseop:unlink 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.990099514Z crunchstat: fuseop:open 22 count 1.293092 time -- interval 10.0000 seconds 3 count 0.199632 time
2019-10-14T18:02:12.990967726Z crunchstat: fuseop:statfs 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.991658135Z crunchstat: fuseop:create 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.992313744Z crunchstat: fuseop:forget 1 count 0.000022 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.993018353Z crunchstat: fuseop:mkdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:12.993698263Z crunchstat: fuseop:release 23 count 1.450172 time -- interval 10.0000 seconds 5 count 0.356249 time
2019-10-14T18:02:12.994381472Z crunchstat: fuseop:destroy 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.001858235Z crunchstat: keepcalls 0 put 6447 get -- interval 10.0000 seconds 0 put 0 get
2019-10-14T18:02:23.001858235Z crunchstat: net:keep0 0 tx 574815690 rx -- interval 10.0000 seconds 0 tx 0 rx
2019-10-14T18:02:23.001890735Z crunchstat: keepcache 6434 hit 13 miss -- interval 10.0000 seconds 0 hit 0 miss
2019-10-14T18:02:23.001890735Z crunchstat: blkio:0:0 0 write 506596483 read -- interval 10.0000 seconds 0 write 0 read
2019-10-14T18:02:23.001890735Z crunchstat: fuseops 0 write 3882 read -- interval 10.0000 seconds 0 write 0 read
2019-10-14T18:02:23.002372742Z crunchstat: fuseop:flush 65 count 0.001025 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.003028251Z crunchstat: fuseop:on_event 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.003729260Z crunchstat: fuseop:read 3882 count 7.318443 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.004412170Z crunchstat: fuseop:write 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.005057378Z crunchstat: fuseop:setattr 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.005727488Z crunchstat: fuseop:getattr 3546 count 0.059077 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.006406197Z crunchstat: fuseop:opendir 1 count 0.000239 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.007109406Z crunchstat: fuseop:readdir 18 count 0.000036 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.007759015Z crunchstat: fuseop:lookup 1787 count 108.342589 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.008395424Z crunchstat: fuseop:rmdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.009083133Z crunchstat: fuseop:rename 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.009804943Z crunchstat: fuseop:unlink 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.010478052Z crunchstat: fuseop:open 22 count 1.293092 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.011964272Z crunchstat: fuseop:statfs 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.012081474Z crunchstat: fuseop:create 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.012984186Z crunchstat: fuseop:forget 1 count 0.000022 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.013635295Z crunchstat: fuseop:mkdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.014313704Z crunchstat: fuseop:release 23 count 1.450172 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:23.015063614Z crunchstat: fuseop:destroy 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.025596716Z crunchstat: keepcalls 0 put 6447 get -- interval 10.0000 seconds 0 put 0 get
2019-10-14T18:02:33.025596716Z crunchstat: net:keep0 0 tx 574815690 rx -- interval 10.0000 seconds 0 tx 0 rx
2019-10-14T18:02:33.025596716Z crunchstat: keepcache 6434 hit 13 miss -- interval 10.0000 seconds 0 hit 0 miss
2019-10-14T18:02:33.025596716Z crunchstat: blkio:0:0 0 write 506596483 read -- interval 10.0000 seconds 0 write 0 read
2019-10-14T18:02:33.025596716Z crunchstat: fuseops 0 write 3882 read -- interval 10.0000 seconds 0 write 0 read
2019-10-14T18:02:33.026155624Z crunchstat: fuseop:flush 65 count 0.001025 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.026997635Z crunchstat: fuseop:on_event 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.027713745Z crunchstat: fuseop:read 3882 count 7.318443 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.028401154Z crunchstat: fuseop:write 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.029090664Z crunchstat: fuseop:setattr 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.029804273Z crunchstat: fuseop:getattr 3546 count 0.059077 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.030505583Z crunchstat: fuseop:opendir 1 count 0.000239 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.031183092Z crunchstat: fuseop:readdir 18 count 0.000036 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.031856601Z crunchstat: fuseop:lookup 1787 count 108.342589 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.032530110Z crunchstat: fuseop:rmdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.033200619Z crunchstat: fuseop:rename 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.033894429Z crunchstat: fuseop:unlink 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.034556438Z crunchstat: fuseop:open 22 count 1.293092 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.035234947Z crunchstat: fuseop:statfs 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.035911456Z crunchstat: fuseop:create 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.036583365Z crunchstat: fuseop:forget 1 count 0.000022 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.037253074Z crunchstat: fuseop:mkdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.037940084Z crunchstat: fuseop:release 23 count 1.450172 time -- interval 10.0000 seconds 0 count 0.000000 time
2019-10-14T18:02:33.038604993Z crunchstat: fuseop:destroy 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time

Related issues 1 (0 open1 closed)

Related to Arvados - Feature #8363: crunch-run scans arv-mount logging for errorsResolvedTom Clegg09/21/2021Actions
Actions #2

Updated by Peter Amstutz about 5 years ago

  • Related to Feature #8363: crunch-run scans arv-mount logging for errors added
Actions #3

Updated by Peter Amstutz about 5 years ago

I think this is how we want to handle this:

  1. Ensure that arv-mount exits non-zero if there were any IO errors (#8363)
  2. Check the exit value of arv-mount and ensure that crunch-run cancels the container if arv-mount indicated a failure
Actions #4

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #5

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #6

Updated by Brett Smith over 1 year ago

Doesn't exactly solve the original issue but note arv-mount will now retry getting the discovery document after #12684.

Actions #7

Updated by Peter Amstutz 10 months ago

  • Target version set to Future
Actions

Also available in: Atom PDF