Project

General

Profile

Actions

Bug #8373

closed

[Crunch] Doesn't detect keep errors raised by FUSE

Added by Peter Amstutz about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
0.5

Description

When a task fails with a keep error, it is supposed to tempfail and get restarted. However the code in crunch-job does this:

    elsif ($line =~ /arvados\.errors\.Keep/) {
      $jobstep[$job]->{tempfail} = 1;
    }

This won't match an error emitted by FUSE:

2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr 2016-02-04 12:55:10 arvados.arvados_fuse[6520] ERROR: Unhandled exception during FUSE operation
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr Traceback (most recent call last):
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 277, in catch_exceptions_wrapper
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     return orig_func(self, *args, **kwargs)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 521, in read
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     r = handle.obj.readfrom(off, size, self.num_retries)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/fusefile.py", line 55, in readfrom
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     return self.arvfile.readfrom(off, size, num_retries, exact=True)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 828, in readfrom
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact))
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 614, in get_block_contents
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     return self._keep.get(locator, num_retries=num_retries)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     return orig_func(self, *args, **kwargs)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 980, in get
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     "failed to read {}".format(loc_s), service_errors, label="service")
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr KeepReadError: failed to read 32527d1c5562cad5afa6e119f99c0cdf+67108864+A3f7385e4546387908af0374653f90c59e6dd908c@56c56ff4: service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 0 (7, 'Failed to connect to keep2.wx7k5.arvadosapi.com port 25107: Connection refused'); service http://keep3.wx7k5.arvadosapi.com:25107/ responded with 0 (7, 'Failed to connect to keep3.wx7k5.arvadosapi.com port 25107: Connection refused'); service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep6.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep5.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep8.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep7.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr 
Actions #1

Updated by Peter Amstutz about 8 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg about 8 years ago

Perhaps something like this, to expose the exception type:

-        except:
-            _logger.exception("Unhandled exception during FUSE operation: ")
+        except Exception as e:
+            _logger.exception("Unhandled exception during FUSE operation: {}".format(type(e)))
             raise llfuse.FUSEError(errno.EIO)

(in services/fuse/arvados_fuse/__init__.py → catch_exceptions)

Actions #3

Updated by Brett Smith about 8 years ago

Tom Clegg wrote:

Perhaps something like this, to expose the exception type:

Unfortunately, that's not sufficient. You can see from the log in the description that the type is already listed, as part of the traceback written by logging.exception(). However, the source of the error raises the exception directly into its namespace, so it's written as just KeepReadError rather than arvados.errors.KeepReadError.

The same issue could arise in user Crunch scripts. Suggest instead changing the check in crunch-job to $line =~ /\bKeep(Read|Write|Request)Error:/.

Actions #4

Updated by Brett Smith about 8 years ago

  • Target version set to Arvados Future Sprints
Actions #5

Updated by Brett Smith about 8 years ago

  • Target version changed from Arvados Future Sprints to 2016-03-16 sprint
Actions #6

Updated by Brett Smith about 8 years ago

  • Category set to Crunch
  • Assigned To set to Brett Smith
  • Target version changed from 2016-03-16 sprint to 2016-03-02 sprint
  • Story points set to 0.5
Actions #7

Updated by Brett Smith about 8 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:7a57d59c01f746599400bdf83823283321753c3c.

Actions

Also available in: Atom PDF