Bug #8373
closed[Crunch] Doesn't detect keep errors raised by FUSE
Description
When a task fails with a keep error, it is supposed to tempfail and get restarted. However the code in crunch-job does this:
elsif ($line =~ /arvados\.errors\.Keep/) { $jobstep[$job]->{tempfail} = 1; }
This won't match an error emitted by FUSE:
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr 2016-02-04 12:55:10 arvados.arvados_fuse[6520] ERROR: Unhandled exception during FUSE operation 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr Traceback (most recent call last): 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 277, in catch_exceptions_wrapper 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr return orig_func(self, *args, **kwargs) 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 521, in read 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr r = handle.obj.readfrom(off, size, self.num_retries) 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/fusefile.py", line 55, in readfrom 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr return self.arvfile.readfrom(off, size, num_retries, exact=True) 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 828, in readfrom 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact)) 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 614, in get_block_contents 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr return self._keep.get(locator, num_retries=num_retries) 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr return orig_func(self, *args, **kwargs) 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 980, in get 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr "failed to read {}".format(loc_s), service_errors, label="service") 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr KeepReadError: failed to read 32527d1c5562cad5afa6e119f99c0cdf+67108864+A3f7385e4546387908af0374653f90c59e6dd908c@56c56ff4: service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 0 (7, 'Failed to connect to keep2.wx7k5.arvadosapi.com port 25107: Connection refused'); service http://keep3.wx7k5.arvadosapi.com:25107/ responded with 0 (7, 'Failed to connect to keep3.wx7k5.arvadosapi.com port 25107: Connection refused'); service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep6.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep5.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep8.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep7.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015 2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr
Updated by Tom Clegg almost 9 years ago
Perhaps something like this, to expose the exception type:
- except:
- _logger.exception("Unhandled exception during FUSE operation: ")
+ except Exception as e:
+ _logger.exception("Unhandled exception during FUSE operation: {}".format(type(e)))
raise llfuse.FUSEError(errno.EIO)
(in services/fuse/arvados_fuse/__init__.py → catch_exceptions)
Updated by Brett Smith almost 9 years ago
Tom Clegg wrote:
Perhaps something like this, to expose the exception type:
Unfortunately, that's not sufficient. You can see from the log in the description that the type is already listed, as part of the traceback written by logging.exception(). However, the source of the error raises the exception directly into its namespace, so it's written as just KeepReadError rather than arvados.errors.KeepReadError.
The same issue could arise in user Crunch scripts. Suggest instead changing the check in crunch-job to $line =~ /\bKeep(Read|Write|Request)Error:/
.
Updated by Brett Smith almost 9 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith almost 9 years ago
- Target version changed from Arvados Future Sprints to 2016-03-16 sprint
Updated by Brett Smith almost 9 years ago
- Category set to Crunch
- Assigned To set to Brett Smith
- Target version changed from 2016-03-16 sprint to 2016-03-02 sprint
- Story points set to 0.5
Updated by Brett Smith almost 9 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:7a57d59c01f746599400bdf83823283321753c3c.