Bug #8373

[Crunch] Doesn't detect keep errors raised by FUSE

Added by Peter Amstutz almost 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
Crunch
Target version:
Start date:
Due date:
% Done:

100%

Estimated time:
Story points:
0.5

Description

When a task fails with a keep error, it is supposed to tempfail and get restarted. However the code in crunch-job does this:

    elsif ($line =~ /arvados\.errors\.Keep/) {
      $jobstep[$job]->{tempfail} = 1;
    }

This won't match an error emitted by FUSE:

2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr 2016-02-04 12:55:10 arvados.arvados_fuse[6520] ERROR: Unhandled exception during FUSE operation
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr Traceback (most recent call last):
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 277, in catch_exceptions_wrapper
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     return orig_func(self, *args, **kwargs)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 521, in read
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     r = handle.obj.readfrom(off, size, self.num_retries)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/fusefile.py", line 55, in readfrom
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     return self.arvfile.readfrom(off, size, num_retries, exact=True)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 828, in readfrom
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     block = self.parent._my_block_manager().get_block_contents(lr.locator, num_retries=num_retries, cache_only=(bool(data) and not exact))
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 614, in get_block_contents
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     return self._keep.get(locator, num_retries=num_retries)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     return orig_func(self, *args, **kwargs)
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 980, in get
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr     "failed to read {}".format(loc_s), service_errors, label="service")
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr KeepReadError: failed to read 32527d1c5562cad5afa6e119f99c0cdf+67108864+A3f7385e4546387908af0374653f90c59e6dd908c@56c56ff4: service http://keep2.wx7k5.arvadosapi.com:25107/ responded with 0 (7, 'Failed to connect to keep2.wx7k5.arvadosapi.com port 25107: Connection refused'); service http://keep3.wx7k5.arvadosapi.com:25107/ responded with 0 (7, 'Failed to connect to keep3.wx7k5.arvadosapi.com port 25107: Connection refused'); service http://keep1.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep4.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep6.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep5.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep8.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr ; service http://keep7.wx7k5.arvadosapi.com:25107/ responded with 404 HTTP/1.1 404 Not Found\015
2016-02-04_12:55:10 wx7k5-8i9sb-37u6l3065qxxi2l 54409 0 stderr 

Associated revisions

Revision 7a57d59c (diff)
Added by Brett Smith almost 6 years ago

8373: Expand crunch-job regexp to detect more Keep temporary failures.

Closes #8373.

History

#1 Updated by Peter Amstutz almost 6 years ago

  • Description updated (diff)

#2 Updated by Tom Clegg almost 6 years ago

Perhaps something like this, to expose the exception type:

-        except:
-            _logger.exception("Unhandled exception during FUSE operation: ")
+        except Exception as e:
+            _logger.exception("Unhandled exception during FUSE operation: {}".format(type(e)))
             raise llfuse.FUSEError(errno.EIO)

(in services/fuse/arvados_fuse/__init__.py → catch_exceptions)

#3 Updated by Brett Smith almost 6 years ago

Tom Clegg wrote:

Perhaps something like this, to expose the exception type:

Unfortunately, that's not sufficient. You can see from the log in the description that the type is already listed, as part of the traceback written by logging.exception(). However, the source of the error raises the exception directly into its namespace, so it's written as just KeepReadError rather than arvados.errors.KeepReadError.

The same issue could arise in user Crunch scripts. Suggest instead changing the check in crunch-job to $line =~ /\bKeep(Read|Write|Request)Error:/.

#4 Updated by Brett Smith almost 6 years ago

  • Target version set to Arvados Future Sprints

#5 Updated by Brett Smith almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2016-03-16 sprint

#6 Updated by Brett Smith almost 6 years ago

  • Category set to Crunch
  • Assigned To set to Brett Smith
  • Target version changed from 2016-03-16 sprint to 2016-03-02 sprint
  • Story points set to 0.5

#7 Updated by Brett Smith almost 6 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:7a57d59c01f746599400bdf83823283321753c3c.

Also available in: Atom PDF