Actions
Bug #15727
open[FUSE] Unhandled exception while getting discovery doc
Story points:
-
Release:
Release relationship:
Auto
Description
User reported a container that failed due to cloud/cluster weather and didn't retry automatically despite container_count_max=3. Retrying the whole workflow eventually worked.
It looks like when arv-mount failed to get the discovery doc, it returned an error to the caller instead of retrying or exiting. If arv-mount had exited, crunch-run would have set final state to Cancelled, and controller/RailsAPI would have retried the CR with a new container.
[...] 2019-10-14T18:02:07.795676693Z 2019-10-14 18:02:07 arvados.arvados_fuse[88986] ERROR: Unhandled exception during FUSE operation 2019-10-14T18:02:07.795676693Z Traceback (most recent call last): 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/__init__.py", line 325, in catch_exceptions_wrapper 2019-10-14T18:02:07.795676693Z return orig_func(self, *args, **kwargs) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/__init__.py", line 653, in read 2019-10-14T18:02:07.795676693Z r = handle.obj.readfrom(off, size, self.num_retries) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/fusefile.py", line 63, in readfrom 2019-10-14T18:02:07.795676693Z return self.arvfile.readfrom(off, size, num_retries, exact=True) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/arvfile.py", line 1107, in readfrom 2019-10-14T18:02:07.795676693Z block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact)) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/arvfile.py", line 785, in get_block_contents 2019-10-14T18:02:07.795676693Z return self._keep.get(locator, num_retries=num_retries) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/retry.py", line 162, in num_retries_setter 2019-10-14T18:02:07.795676693Z return orig_func(self, *args, **kwargs) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/keep.py", line 995, in get 2019-10-14T18:02:07.795676693Z return self._get_or_head(loc_s, method="GET", **kwargs) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/keep.py", line 1041, in _get_or_head 2019-10-14T18:02:07.795676693Z (hasattr(self, 'api_client') and self.api_client.request_id) or 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/safeapi.py", line 47, in __getattr__ 2019-10-14T18:02:07.795676693Z return getattr(self.localapi(), name) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/safeapi.py", line 40, in localapi 2019-10-14T18:02:07.795676693Z **self.api_params) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/api.py", line 277, in api_from_config 2019-10-14T18:02:07.795676693Z return api(version=version, host=host, token=token, insecure=insecure, **kwargs) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/api.py", line 236, in api 2019-10-14T18:02:07.795676693Z svc = apiclient_discovery.build('arvados', version, cache_discovery=False, **kwargs) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper 2019-10-14T18:02:07.795676693Z return wrapped(*args, **kwargs) 2019-10-14T18:02:07.795676693Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/googleapiclient/discovery.py", line 230, in build 2019-10-14T18:02:07.795676693Z raise e 2019-10-14T18:02:07.795676693Z ApiError: <HttpError 503 when requesting https://su92l.arvadosapi.com/discovery/v1/apis/arvados/v1/rest returned "Service Unavailable"> 2019-10-14T18:02:07.810523494Z 2019-10-14 18:02:07 arvados.arvados_fuse[88986] ERROR: Unhandled exception during FUSE operation 2019-10-14T18:02:07.810523494Z Traceback (most recent call last): 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/__init__.py", line 325, in catch_exceptions_wrapper 2019-10-14T18:02:07.810523494Z return orig_func(self, *args, **kwargs) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/__init__.py", line 653, in read 2019-10-14T18:02:07.810523494Z r = handle.obj.readfrom(off, size, self.num_retries) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados_fuse/fusefile.py", line 63, in readfrom 2019-10-14T18:02:07.810523494Z return self.arvfile.readfrom(off, size, num_retries, exact=True) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/arvfile.py", line 1107, in readfrom 2019-10-14T18:02:07.810523494Z block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact)) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/arvfile.py", line 785, in get_block_contents 2019-10-14T18:02:07.810523494Z return self._keep.get(locator, num_retries=num_retries) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/retry.py", line 162, in num_retries_setter 2019-10-14T18:02:07.810523494Z return orig_func(self, *args, **kwargs) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/keep.py", line 995, in get 2019-10-14T18:02:07.810523494Z return self._get_or_head(loc_s, method="GET", **kwargs) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/keep.py", line 1041, in _get_or_head 2019-10-14T18:02:07.810523494Z (hasattr(self, 'api_client') and self.api_client.request_id) or 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/safeapi.py", line 47, in __getattr__ 2019-10-14T18:02:07.810523494Z return getattr(self.localapi(), name) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/safeapi.py", line 40, in localapi 2019-10-14T18:02:07.810523494Z **self.api_params) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/api.py", line 277, in api_from_config 2019-10-14T18:02:07.810523494Z return api(version=version, host=host, token=token, insecure=insecure, **kwargs) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/arvados/api.py", line 236, in api 2019-10-14T18:02:07.810523494Z svc = apiclient_discovery.build('arvados', version, cache_discovery=False, **kwargs) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper 2019-10-14T18:02:07.810523494Z return wrapped(*args, **kwargs) 2019-10-14T18:02:07.810523494Z File "/usr/share/python2.7/dist/python-arvados-fuse/lib/python2.7/site-packages/googleapiclient/discovery.py", line 230, in build 2019-10-14T18:02:07.810523494Z raise e 2019-10-14T18:02:07.810523494Z ApiError: <HttpError 503 when requesting https://su92l.arvadosapi.com/discovery/v1/apis/arvados/v1/rest returned "Service Unavailable"> 2019-10-14T18:02:12.977186838Z crunchstat: keepcalls 0 put 6447 get -- interval 10.0000 seconds 0 put 209 get 2019-10-14T18:02:12.977186838Z crunchstat: net:keep0 0 tx 574815690 rx -- interval 10.0000 seconds 0 tx 0 rx 2019-10-14T18:02:12.977186838Z crunchstat: keepcache 6434 hit 13 miss -- interval 10.0000 seconds 207 hit 2 miss 2019-10-14T18:02:12.977186838Z crunchstat: blkio:0:0 0 write 506596483 read -- interval 10.0000 seconds 0 write 26890187 read 2019-10-14T18:02:12.977186838Z crunchstat: fuseops 0 write 3882 read -- interval 10.0000 seconds 0 write 209 read 2019-10-14T18:02:12.980666886Z crunchstat: fuseop:flush 65 count 0.001025 time -- interval 10.0000 seconds 10 count 0.000209 time 2019-10-14T18:02:12.981525297Z crunchstat: fuseop:on_event 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.982445810Z crunchstat: fuseop:read 3882 count 7.318443 time -- interval 10.0000 seconds 209 count 0.085762 time 2019-10-14T18:02:12.983162920Z crunchstat: fuseop:write 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.983848129Z crunchstat: fuseop:setattr 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.984538038Z crunchstat: fuseop:getattr 3546 count 0.059077 time -- interval 10.0000 seconds 5 count 0.000450 time 2019-10-14T18:02:12.985219048Z crunchstat: fuseop:opendir 1 count 0.000239 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.986006958Z crunchstat: fuseop:readdir 18 count 0.000036 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.986743668Z crunchstat: fuseop:lookup 1787 count 108.342589 time -- interval 10.0000 seconds 3 count 0.136670 time 2019-10-14T18:02:12.987632980Z crunchstat: fuseop:rmdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.988547393Z crunchstat: fuseop:rename 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.989305303Z crunchstat: fuseop:unlink 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.990099514Z crunchstat: fuseop:open 22 count 1.293092 time -- interval 10.0000 seconds 3 count 0.199632 time 2019-10-14T18:02:12.990967726Z crunchstat: fuseop:statfs 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.991658135Z crunchstat: fuseop:create 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.992313744Z crunchstat: fuseop:forget 1 count 0.000022 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.993018353Z crunchstat: fuseop:mkdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:12.993698263Z crunchstat: fuseop:release 23 count 1.450172 time -- interval 10.0000 seconds 5 count 0.356249 time 2019-10-14T18:02:12.994381472Z crunchstat: fuseop:destroy 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.001858235Z crunchstat: keepcalls 0 put 6447 get -- interval 10.0000 seconds 0 put 0 get 2019-10-14T18:02:23.001858235Z crunchstat: net:keep0 0 tx 574815690 rx -- interval 10.0000 seconds 0 tx 0 rx 2019-10-14T18:02:23.001890735Z crunchstat: keepcache 6434 hit 13 miss -- interval 10.0000 seconds 0 hit 0 miss 2019-10-14T18:02:23.001890735Z crunchstat: blkio:0:0 0 write 506596483 read -- interval 10.0000 seconds 0 write 0 read 2019-10-14T18:02:23.001890735Z crunchstat: fuseops 0 write 3882 read -- interval 10.0000 seconds 0 write 0 read 2019-10-14T18:02:23.002372742Z crunchstat: fuseop:flush 65 count 0.001025 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.003028251Z crunchstat: fuseop:on_event 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.003729260Z crunchstat: fuseop:read 3882 count 7.318443 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.004412170Z crunchstat: fuseop:write 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.005057378Z crunchstat: fuseop:setattr 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.005727488Z crunchstat: fuseop:getattr 3546 count 0.059077 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.006406197Z crunchstat: fuseop:opendir 1 count 0.000239 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.007109406Z crunchstat: fuseop:readdir 18 count 0.000036 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.007759015Z crunchstat: fuseop:lookup 1787 count 108.342589 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.008395424Z crunchstat: fuseop:rmdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.009083133Z crunchstat: fuseop:rename 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.009804943Z crunchstat: fuseop:unlink 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.010478052Z crunchstat: fuseop:open 22 count 1.293092 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.011964272Z crunchstat: fuseop:statfs 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.012081474Z crunchstat: fuseop:create 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.012984186Z crunchstat: fuseop:forget 1 count 0.000022 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.013635295Z crunchstat: fuseop:mkdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.014313704Z crunchstat: fuseop:release 23 count 1.450172 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:23.015063614Z crunchstat: fuseop:destroy 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.025596716Z crunchstat: keepcalls 0 put 6447 get -- interval 10.0000 seconds 0 put 0 get 2019-10-14T18:02:33.025596716Z crunchstat: net:keep0 0 tx 574815690 rx -- interval 10.0000 seconds 0 tx 0 rx 2019-10-14T18:02:33.025596716Z crunchstat: keepcache 6434 hit 13 miss -- interval 10.0000 seconds 0 hit 0 miss 2019-10-14T18:02:33.025596716Z crunchstat: blkio:0:0 0 write 506596483 read -- interval 10.0000 seconds 0 write 0 read 2019-10-14T18:02:33.025596716Z crunchstat: fuseops 0 write 3882 read -- interval 10.0000 seconds 0 write 0 read 2019-10-14T18:02:33.026155624Z crunchstat: fuseop:flush 65 count 0.001025 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.026997635Z crunchstat: fuseop:on_event 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.027713745Z crunchstat: fuseop:read 3882 count 7.318443 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.028401154Z crunchstat: fuseop:write 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.029090664Z crunchstat: fuseop:setattr 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.029804273Z crunchstat: fuseop:getattr 3546 count 0.059077 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.030505583Z crunchstat: fuseop:opendir 1 count 0.000239 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.031183092Z crunchstat: fuseop:readdir 18 count 0.000036 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.031856601Z crunchstat: fuseop:lookup 1787 count 108.342589 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.032530110Z crunchstat: fuseop:rmdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.033200619Z crunchstat: fuseop:rename 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.033894429Z crunchstat: fuseop:unlink 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.034556438Z crunchstat: fuseop:open 22 count 1.293092 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.035234947Z crunchstat: fuseop:statfs 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.035911456Z crunchstat: fuseop:create 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.036583365Z crunchstat: fuseop:forget 1 count 0.000022 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.037253074Z crunchstat: fuseop:mkdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.037940084Z crunchstat: fuseop:release 23 count 1.450172 time -- interval 10.0000 seconds 0 count 0.000000 time 2019-10-14T18:02:33.038604993Z crunchstat: fuseop:destroy 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
Updated by Peter Amstutz about 5 years ago
- Related to Feature #8363: crunch-run scans arv-mount logging for errors added
Updated by Peter Amstutz about 5 years ago
I think this is how we want to handle this:
- Ensure that arv-mount exits non-zero if there were any IO errors (#8363)
- Check the exit value of arv-mount and ensure that crunch-run cancels the container if arv-mount indicated a failure
Updated by Peter Amstutz over 3 years ago
- Target version deleted (
Arvados Future Sprints)
Updated by Brett Smith over 1 year ago
Doesn't exactly solve the original issue but note arv-mount will now retry getting the discovery document after #12684.
Actions